resilience4j retry: source code analysis and retry metrics collection

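Preface

Requirements

To guard against transient network jitter, failed calls need to be retried, and once the retries reach a threshold an alert must be sent so that the problem gets a timely response.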

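Technology selection

Library	Sync / async	Declarative (annotation) support	Monitoring support
resilience4j-retry	Sync (async is also available through CompletionStage)	Yes	Yes
Guava Retry	Sync	No	No; monitoring statistics can be implemented yourself through listeners
Spring Retry	Sync	Yes	No; monitoring statistics can be implemented yourself through listeners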

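Based on the comparison above, resilience4j-retry was chosen, mainly for the following two reasons:

It provides monitoring data out of the box, which can be fed straight into Prometheus.
Besides retry, resilience4j offers the same capabilities as Hystrix, including circuit breaker, bulkhead, rate limiter, and cache. It also ships a Spring Boot integration dependency, which greatly reduces the integration cost. (A later migration from Hystrix to resilience4j can be considered.)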

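Questions

How is resilience4j-retry integrated into a project, and how is it used?
How can the retry interval be customized?
How does resilience4j-retry work internally?
How are the monitoring metrics counted, and how does Prometheus collect them?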

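Analysis

How to use resilience4j-retry

1. Add the resilience4j-spring-boot2 dependency with Maven (the @Retry annotation also relies on spring-boot-starter-aop being available at runtime)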

<dependency>
		<groupId>io.github.resilience4j</groupId>
		<artifactId>resilience4j-spring-boot2</artifactId>
		<version>1.7.1</version>
</dependency>

2. Configure the retry instance

# The instance name (sendConfirmEmail) matches the name attribute of the @Retry annotation
resilience4j.retry.instances.sendConfirmEmail.max-attempts=3

3. Add the @Retry annotation to the method that should be retried

@Retry(name = "sendConfirmEmail", fallbackMethod = "sendConfirmEmailFallback")
public void sendConfirmEmail(SsoSendConfirmEmailDTO ssoSendConfirmEmail) {
    // Method body omitted
    throw new ServiceException("send confirm email error");
}

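4. Define the fallbackMethod
4.1 Keep in mind that the fallbackMethod must be placed in the same class and must have the same method signature, plus one extra parameter for the target exception
4.2 If there are several fallbackMethod methods, the one with the closest matching exception type is invoked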
```
public void sendConfirmEmailFallback(SsoSendConfirmEmailDTO ssoSendConfirmEmail, ServiceException e) {
    // Send the email notification
}
```
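
The "closest match" rule in 4.2 can be pictured as a walk up the class hierarchy of the thrown exception. The following is only a plain-Java sketch of that idea, not the library's FallbackMethod implementation; the class ClosestFallbackSketch and its methods are hypothetical names introduced for illustration.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper: models "pick the fallback whose exception parameter is closest
// to the thrown exception". It does not use or reproduce resilience4j code.
final class ClosestFallbackSketch {

    /** Number of superclass steps from thrown up to candidate, or -1 if candidate is not a supertype. */
    static int distance(Class<?> thrown, Class<?> candidate) {
        int steps = 0;
        for (Class<?> c = thrown; c != null; c = c.getSuperclass()) {
            if (c == candidate) {
                return steps;
            }
            steps++;
        }
        return -1;
    }

    /** Returns the candidate exception type with the smallest distance, if any candidate matches. */
    static Optional<Class<? extends Throwable>> closest(
            Class<? extends Throwable> thrown,
            List<Class<? extends Throwable>> candidates) {
        Class<? extends Throwable> best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Class<? extends Throwable> candidate : candidates) {
            int d = distance(thrown, candidate);
            if (d >= 0 && d < bestDistance) {
                best = candidate;
                bestDistance = d;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Under this model, with fallbacks declared for Exception and RuntimeException, a thrown IllegalStateException would be routed to the RuntimeException variant because it is one step away rather than two.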

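Customizing the retry interval

1. By default, retries use a fixed interval. To make the interval grow with each attempt, for example 1s -> 2s -> 3s, a custom interval function is needed.

2. Implement the IntervalBiFunction interface in a custom interval class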

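import java.time.Duration;

import io.github.resilience4j.core.IntervalBiFunction;
import io.vavr.control.Either;

public class SendEmailIntervalBiFunction implements IntervalBiFunction<Integer> {

    private final Duration waitDuration = Duration.ofSeconds(1);

    // Returns the wait in milliseconds; it grows linearly with the attempt number
    @Override
    public Long apply(Integer numOfAttempts, Either<Throwable, Integer> either) {
        return numOfAttempts * waitDuration.toMillis();
    }
}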

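3. Point the configuration at the custom interval class
3.1 The custom interval class is loaded via Class.forName, so the value is its fully qualified class name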

resilience4j.retry.instances.sendConfirmEmail.interval-bi-function=com.xxx.xxx.retry.SendEmailIntervalBiFunction
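
To check the schedule this produces without starting Spring, the same arithmetic can be replayed in plain Java. This is a sketch under one assumption taken from throwOrSleepAfterException (shown later in this article): the attempt counter is incremented before the wait is computed, and the last failed attempt throws instead of sleeping. LinearBackoffSketch is a hypothetical name.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone model of the linear interval; it does not depend on resilience4j.
final class LinearBackoffSketch {

    private static final Duration BASE = Duration.ofSeconds(1);

    /** Same formula as SendEmailIntervalBiFunction#apply: attempt number times the base wait. */
    static long waitMillis(int numOfAttempts) {
        return numOfAttempts * BASE.toMillis();
    }

    /** Waits (in ms) between attempts for a call that fails on every attempt. */
    static List<Long> schedule(int maxAttempts) {
        List<Long> waits = new ArrayList<>();
        // The attempt that reaches maxAttempts rethrows, so only maxAttempts - 1 waits happen.
        for (int attempt = 1; attempt < maxAttempts; attempt++) {
            waits.add(waitMillis(attempt));
        }
        return waits;
    }

    public static void main(String[] args) {
        System.out.println(schedule(3));
        System.out.println(schedule(4));
    }
}
```

One consequence worth noting: with max-attempts=3 only the 1s and 2s waits occur; under this model the 3s wait would appear only when max-attempts is at least 4.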

resilience4j-retry source code analysis

1. Create a test method for debugging

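import javax.annotation.Resource;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringRunner;

@RunWith(SpringRunner.class)
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
public class RetryTest {

    @Resource
    private UserApiService userApiService;

    @Test
    public void testRetryThreeTimes() throws InterruptedException {
        // The argument is irrelevant here: sendConfirmEmail always throws, which triggers the retries
        SsoSendConfirmEmailDTO ssoSendConfirmEmailDTO = null;
        userApiService.sendConfirmEmail(ssoSendConfirmEmailDTO);
    }
}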

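2. The retry aspect, RetryAspect, intercepts classes or methods marked with the @Retry annotation
2.1 Create the Retry implementation, RetryImpl, from the name of the @Retry annotation
2.2 Create the FallbackMethod from the fallbackMethod of the @Retry annotation (the matching method is looked up by reflection from the method, its parameters, and the exception)
2.3 Perform the retry (the work is ultimately done by the retry implementation: RetryImpl#executeCheckedSupplier)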

@Around(value = "matchAnnotatedClassOrMethod(retryAnnotation)", argNames = "proceedingJoinPoint, retryAnnotation")
public Object retryAroundAdvice(ProceedingJoinPoint proceedingJoinPoint,
    @Nullable Retry retryAnnotation) throws Throwable {
    // Abridged excerpt: the lines that resolve method, methodName, backend, etc. are omitted
    // Create the Retry implementation (RetryImpl) by name ---> Retry retry = retryRegistry.retry(backend)
    io.github.resilience4j.retry.Retry retry = getOrCreateRetry(methodName, backend);

    // Create the FallbackMethod from the fallbackMethod of @Retry --> FallbackMethod#create
    FallbackMethod fallbackMethod = FallbackMethod
        .create(fallbackMethodValue, method, proceedingJoinPoint.getArgs(),
            proceedingJoinPoint.getTarget());

    // Retry handling: RetryAspect#proceed --> eventually triggers RetryImpl#executeCheckedSupplier
    return fallbackDecorators.decorate(fallbackMethod,
        () -> proceed(proceedingJoinPoint, methodName, retry, returnType)).apply();
}
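
The nesting in the final statement matters: the fallback decorator wraps the retrying call, so the fallback runs only after the retries are exhausted. A minimal plain-Java sketch of that ordering follows; withRetry and withFallback are hypothetical helpers that model the behaviour only (no resilience4j classes, and the waits between attempts are left out).

```java
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical model of "fallback around retry"; not the RetryAspect implementation.
final class FallbackAroundRetrySketch {

    /** Inner layer: repeats the call until it succeeds or maxAttempts attempts have failed. */
    static <T> Supplier<T> withRetry(Supplier<T> call, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        return () -> {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return call.get();
                } catch (RuntimeException e) {
                    last = e; // remember the failure and try again
                }
            }
            throw last; // all attempts failed
        };
    }

    /** Outer layer: invokes the fallback only when the wrapped call finally throws. */
    static <T> Supplier<T> withFallback(Supplier<T> call, Function<RuntimeException, T> fallback) {
        return () -> {
            try {
                return call.get();
            } catch (RuntimeException e) {
                return fallback.apply(e);
            }
        };
    }
}
```

Composed as withFallback(withRetry(call, 3), fallback), a call that always fails is attempted three times and the fallback is invoked exactly once, which is the behaviour sendConfirmEmailFallback relies on.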
    

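Retry handling

1. Core method: Retry#decorateCheckedSupplier (a do...while(true) loop)
1.1 Obtain the retry context: RetryImpl$ContextImpl
1.2 Invoke the business method annotated with @Retry
1.3 Handle the result (or, if an exception is thrown, handle the exception)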

static <T> CheckedFunction0<T> decorateCheckedSupplier(Retry retry,
                                                       CheckedFunction0<T> supplier) {
    return () -> {
        // Obtain the retry context: RetryImpl$ContextImpl
        Retry.Context<T> context = retry.context();
        do {
            try {
                // Invoke the business method annotated with @Retry
                T result = supplier.apply();
                // onResult returns true when the result itself calls for another attempt
                final boolean validationOfResult = context.onResult(result);
                if (!validationOfResult) {
                    context.onComplete();
                    return result;
                }
            } catch (Exception exception) {
                // onError either sleeps (retry) or rethrows, which ends the loop
                context.onError(exception);
            }
        } while (true);
    };
}
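
Stripped of the context object, the loop above boils down to "try, and on failure either wait or give up". The following stand-alone sketch shows that shape in plain Java; it is a simplified model with a fixed wait and hypothetical names (RetryLoopSketch, callWithRetry), not the resilience4j code.

```java
import java.util.function.Supplier;

// Hypothetical minimal retry loop that mirrors the do...while(true) structure.
final class RetryLoopSketch {

    /** Calls the supplier until it succeeds or maxAttempts attempts have failed. */
    static <T> T callWithRetry(Supplier<T> supplier, int maxAttempts, long waitMillis) {
        int failedAttempts = 0;
        do {
            try {
                // Success: leave the loop immediately with the result
                return supplier.get();
            } catch (RuntimeException e) {
                failedAttempts++;
                if (failedAttempts >= maxAttempts) {
                    // Attempts exhausted: rethrowing is the only other way out of the loop
                    throw e;
                }
                sleep(waitMillis);
            }
        } while (true);
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and stop retrying
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }
}
```

The loop has exactly two exits, a successful return or a rethrown exception, which is why while (true) is safe in both the sketch and the original.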

2. Retry handling after an exception: RetryImpl$ContextImpl#onError
2.1 If the exception is retryable, the retry is handled in RetryImpl$ContextImpl#throwOrSleepAfterException

private void throwOrSleepAfterException() throws Exception {
    int currentNumOfAttempts = numOfAttempts.incrementAndGet();
    Exception throwable = lastException.get();
    // Once the number of attempts reaches maxAttempts, rethrow the last exception
    if (currentNumOfAttempts >= maxAttempts) {
        failedAfterRetryCounter.increment();
        publishRetryEvent(
            () -> new RetryOnErrorEvent(getName(), currentNumOfAttempts, throwable));
        throw throwable;
    } else {
        // Still within the allowed attempts: sleep for the interval before the next attempt
        waitIntervalAfterFailure(currentNumOfAttempts, Either.left(throwable));
    }
}

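Retry metrics collection

1. Purpose of the data: the four counts (succeeded after retry, failed after retry, succeeded without retry, failed without retry) are analyzed to judge the stability of the service

2. During retry handling, the statistics are stored in fields of RetryImpl
2.1 RetryImpl$ContextImpl#onComplete updates succeededAfterRetryCounter, failedAfterRetryCounter, and succeededWithoutRetryCounter
2.2 RetryImpl$ContextImpl#onError updates failedWithoutRetryCounter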

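// Number of calls that succeeded after at least one retry
private final LongAdder succeededAfterRetryCounter;
// Number of calls that still failed after retrying (failed even after reaching the threshold)
private final LongAdder failedAfterRetryCounter;
// Number of calls that succeeded without any retry
private final LongAdder succeededWithoutRetryCounter;
// Number of calls that failed without any retry (the exception was not retryable)
private final LongAdder failedWithoutRetryCounter;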
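
How a finished call maps onto these four counters can be modelled in a few lines. The sketch below is a simplified stand-alone model based on the comments above, not the RetryImpl logic itself; RetryCountersSketch, record, and failureRatio are hypothetical names, and the ratio is just one possible way to turn the counts into a stability signal.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical model of the four outcome counters; it does not reproduce RetryImpl.
final class RetryCountersSketch {

    final LongAdder succeededAfterRetry = new LongAdder();
    final LongAdder failedAfterRetry = new LongAdder();
    final LongAdder succeededWithoutRetry = new LongAdder();
    final LongAdder failedWithoutRetry = new LongAdder();

    /**
     * Records the outcome of one logical call.
     *
     * @param failedAttempts   attempts that failed before the call ended
     * @param succeeded        whether the final attempt succeeded
     * @param retryableFailure whether the failure was retryable (ignored on success)
     */
    void record(int failedAttempts, boolean succeeded, boolean retryableFailure) {
        if (succeeded) {
            if (failedAttempts > 0) {
                succeededAfterRetry.increment();
            } else {
                succeededWithoutRetry.increment();
            }
        } else if (retryableFailure) {
            failedAfterRetry.increment();   // retried until the attempts ran out
        } else {
            failedWithoutRetry.increment(); // non-retryable exception, failed immediately
        }
    }

    /** Share of recorded calls that ended in failure; 0.0 when nothing was recorded. */
    double failureRatio() {
        long failed = failedAfterRetry.sum() + failedWithoutRetry.sum();
        long total = failed + succeededAfterRetry.sum() + succeededWithoutRetry.sum();
        return total == 0 ? 0.0 : (double) failed / total;
    }
}
```

LongAdder is a reasonable fit for this kind of counter because many threads increment it while it is read only occasionally, for example on each scrape.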

3. Collecting the retry data with Prometheus
3.1 Add the Prometheus collection dependencies to expose the scrape endpoint

<dependency>
	<groupId>io.micrometer</groupId>
	<artifactId>micrometer-registry-prometheus</artifactId>
	<version>1.7.1</version>
</dependency>
<dependency>
	<groupId>io.micrometer</groupId>
	<artifactId>micrometer-core</artifactId>
	<version>1.7.1</version>
</dependency>

3.2 Configure the actuator to expose the Prometheus scrape endpoint
3.2.1 Prometheus scrape endpoint: PrometheusScrapeEndpoint#scrape
3.2.2 A request to /actuator/prometheus triggers the collection: AbstractRetryMetrics#registerMetrics

management.server.port=9099
management.endpoint.health.show-details=always
management.endpoints.web.exposure.include=health,prometheus

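Conclusion

1. As I understand it, retrying can only fix network failures; retrying a business exception will not resolve it
2. If the call is triggered by a page interaction, retrying this way lengthens the response time (which is unacceptable)
2.1 Add the @Async annotation to run the retried method asynchronously so the page does not wait (but how should the case be handled where the application crashes before the method has run?)
Discussion is welcome; please share any better solutions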

Original article: https://blog.csdn.net/weixin_40803011/article/details/125435527