As shown in the figure: the overall call flow of the service behind the 504 alerts
2022/07/22 14:20:00
It happened on one bright afternoon. The department announced that a big customer was coming in for a POC, the XX test. A competitor had promised them 1,000 QPS, so we had to match it. As the technical owner of XX (too burned out to flex these days), I stated very clearly that the second-level sub-service of the <bottleneck service> could sustain at most 300 QPS, and that capacity was shared: daily traffic already consumed about 100 QPS, leaving only 200 QPS of headroom. After a round of discussion we scaled out with more machines; based on the load-test report we ended up at a total capacity of 500 QPS, and after negotiating with the customer we agreed to cap their traffic at 300 QPS.
2022/07/22 15:20:00
The QPS rate-limiting ticket was approved and customer traffic came in. Overall QPS climbed and a bandwidth alert fired; we immediately asked the ops team to expand bandwidth from 500M -> 1000M, after which overall RT stabilized. We monitored for 30 minutes and reported the QPS/RT numbers in the group chat. Was it really over, though?
2022/07/22 17:05:00
The gateway fired a flood of 504 alerts that lasted about 7 seconds before recovering. Time to dig into the monitoring.
First, the code:
```java
for (AbstractModelTask task : tasks) {
    futures.add(MODEL_TASK_EXECUTOR_POOL.submit(task));
}

// A single time budget shared by the whole batch of futures
long timeout = modelLbUtil.getModelExecuteTimeout();
long now = System.currentTimeMillis();

for (Future future : futures) {
    if (future != null) {
        // Whatever is left of the overall budget
        long remain = timeout - (System.currentTimeMillis() - now);
        try {
            if (remain < 0) {
                LOG.warn("futureGet remain out p={} s={}", partnerCode, seqId);
                break;
            }
            future.get(remain, TimeUnit.MILLISECONDS);
        } catch (XXXX e) {
            //XXXXXXXXXXXXXX
        } catch (Exception e) {
            LOG.warn("future get error, msg={} p={} s={}", e.getMessage(), partnerCode, seqId);
        } finally {
            // Always cancel; interrupts the worker if the task is still running
            future.cancel(true);
        }
    }
}
```
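Two properties of this loop are worth noting before reading the logs: `timeout` is one budget shared by the whole batch, so each future only gets what its predecessors left over, and once `remain` goes negative all remaining futures are skipped via the `futureGet remain out` branch; also, `future.cancel(true)` sits in `finally`, so every future gets cancelled (and potentially interrupted) regardless of how `get` ended.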
Then, cross-referencing the logs:
Analysis showed that at 2022/07/22 17:08:06 the future had already failed with an error, yet it was not until 2022/07/22 17:08:12 that execution reached the `futureGet remain out` timeout branch. What was the thread doing during those 6 seconds?
Between those two log lines the thread performs three operations: `future.get()`, the exception handling, and `future.cancel(true)`. Let's analyze each, starting with a timing sketch (below) to narrow down where the time could go.
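A minimal, hypothetical instrumentation sketch (the pool, tasks, and timeout here are stand-ins; only the loop shape matches the production code) that times `get` and `cancel` separately, so one log line would immediately show which operation swallowed the missing seconds:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class FutureLoopTiming {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            // Stand-in tasks; the production tasks are AbstractModelTask instances
            futures.add(pool.submit(() -> {
                try { Thread.sleep(200); } catch (InterruptedException ignored) { }
            }));
        }

        long timeout = 1_000;                    // total budget, like getModelExecuteTimeout()
        long start = System.currentTimeMillis();

        for (Future<?> future : futures) {
            long remain = timeout - (System.currentTimeMillis() - start);
            long tGet = System.nanoTime();
            try {
                if (remain < 0) break;           // same early exit as the production loop
                future.get(remain, TimeUnit.MILLISECONDS);
            } catch (Exception e) {
                System.out.println("get failed: " + e);
            } finally {
                long tCancel = System.nanoTime();
                future.cancel(true);
                long tEnd = System.nanoTime();
                // Per-operation timing: this is the data we wish we had in production
                System.out.printf("get=%dus cancel=%dus%n",
                        (tCancel - tGet) / 1_000, (tEnd - tCancel) / 1_000);
            }
        }
        pool.shutdown();
    }
}
```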
Let's keep digging into the `cancel` operation:
```java
// ========================= Interface (java.util.concurrent.Future)
/**
 * Attempts to cancel execution of this task.  This attempt will
 * fail if the task has already completed, has already been cancelled,
 * or could not be cancelled for some other reason. If successful,
 * and this task has not started when {@code cancel} is called,
 * this task should never run.  If the task has already started,
 * then the {@code mayInterruptIfRunning} parameter determines
 * whether the thread executing this task should be interrupted in
 * an attempt to stop the task.
 *
 * <p>After this method returns, subsequent calls to {@link #isDone} will
 * always return {@code true}.  Subsequent calls to {@link #isCancelled}
 * will always return {@code true} if this method returned {@code true}.
 *
 * @param mayInterruptIfRunning {@code true} if the thread executing this
 * task should be interrupted; otherwise, in-progress tasks are allowed
 * to complete
 * @return {@code false} if the task could not be cancelled,
 * typically because it has already completed normally;
 * {@code true} otherwise
 */
boolean cancel(boolean mayInterruptIfRunning);

// ========================= Abstract implementation

protected void cancel() {
    canceled = true;
}

// ========================= Concrete implementation

@Override
protected void cancel() {
    super.cancel();
    if (completed)
        setLocalResult(getEmptyResult());
}

@Override
protected final Node<P_OUT> getEmptyResult() {
    return Nodes.emptyNode(op.getOutputShape());
}

@Override
protected void setLocalResult(R localResult) {
    if (isRoot()) {
        if (localResult != null)
            sharedResult.compareAndSet(null, localResult);
    }
    else
        super.setLocalResult(localResult);
}
```
The official API docs say `cancel` does not block. Stepping inside the implementation, the only operation with any cost at all is `compareAndSet()`, a CAS, and a CAS cannot possibly stall for 6 seconds. So what on earth was going on?!
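To double-check the non-blocking claim, here is a tiny stand-alone sanity test (my own sketch, not part of the original investigation): cancel a task that is mid-execution and time the call. On a healthy JVM, `FutureTask.cancel(true)` is essentially a state CAS plus `Thread.interrupt()`, so it should return in microseconds:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CancelLatency {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<?> f = pool.submit(() -> {
            // A long-running task so cancel definitely hits the "already started" path
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
        });
        Thread.sleep(100);                       // make sure the task has started
        long t0 = System.nanoTime();
        boolean ok = f.cancel(true);             // interrupt the running task
        long micros = (System.nanoTime() - t0) / 1_000;
        System.out.println("cancel=" + ok + " took " + micros + "us");
        pool.shutdown();
    }
}
```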
=========================================================
Interim conclusion: all we could say at this point was that the thread had blocked for no apparent reason.
Looking back over my career: I had hit 504s before. That time the root cause was Tomcat initializing too few connections; once QPS ramped up, connections were created over and over, burning resources and blocking some threads. Could the same thing be happening here?
Hunting for clues:
We found a suspicious sign: right at the time of the 504s the thread pool showed a very pronounced fluctuation, even though it was configured with coreThreadNum = 300 and maxThreadNum = 600.
Why would the pool fluctuate like that? Could it be related to threads being destroyed and re-created? A scaled-down sketch of that suspicion follows.
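To make the suspicion concrete, a self-contained, scaled-down illustration (my own sketch: core/max are 3/6 instead of the production 300/600, and the queue size and sleeps are arbitrary) of how a `ThreadPoolExecutor` grows past its core size under a burst and then tears the extra threads down after `keepAliveTime`, which is exactly the destroy-and-recreate churn being hypothesized:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSizeWobble {
    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                3, 6,                            // scaled-down core / max
                1, TimeUnit.SECONDS,             // idle non-core threads die after 1s
                new ArrayBlockingQueue<>(1));    // tiny queue so a burst spills into non-core threads
        for (int i = 0; i < 7; i++) {            // burst: 3 core + 1 queued + 3 non-core
            pool.submit(() -> {
                try { Thread.sleep(500); } catch (InterruptedException ignored) { }
            });
        }
        for (int i = 0; i < 6; i++) {            // watch getPoolSize() rise to 6, then fall back to 3
            System.out.println("poolSize=" + pool.getPoolSize());
            Thread.sleep(500);
        }
        pool.shutdown();
    }
}
```

The jstack script below was written to test this same hypothesis against the production pool.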
This is where we should have run `jstack <pid> > stackLog.stack`, but the moment had passed, and it would not come again.
So we wrote a jstack script (shown below) to record how the number of Model_Task_Executor_Pool worker threads changed over time, to check whether the worker threads actually changed whenever the monitoring showed a fluctuation.
The result, however, was not encouraging: there was no correlation at all.
```bash
#!/bin/bash

if [ $# -eq 0 ]; then
    echo "please enter java pid"
    exit 1
fi

pid=$1

# Change this to the jstack path of your own JDK installation
jstack_cmd="/usr/bin/jstack"

#line=`top -H -o %CPU -b -n 1 -p $pid | sed '1,/^$/d' | grep -v $pid | awk 'NR==2'`

#line=`top -H -b -n 1 -p $pid | sed '1,/^$/d' | sed '1d;/^$/d' | grep -v $pid | sort -nrk9 | head -1`
#echo "$line" | awk '{print "tid: "$1," cpu: %"$9}'
#tid_0x=`printf "%0x" $(echo "$line" | awk '{print $1}')`
#$jstack_cmd $pid | grep $tid_0x -A20 | sed -n '1,/^$/p'

for i in {1..2000}; do
    $jstack_cmd $pid > stack_thread.log

    echo "Sampled every 5s: Model_Task_Executor_Pool_ worker thread counts =========>"

    # Count Model_Task_Executor_Pool threads grouped by state (column 8 of the jstack header line)
    line=`grep 'Model_Task_Executor_Pool' stack_thread.log | awk '{print $8}' | uniq -c`
    echo "$line"

    # Total number of Model_Task_Executor_Pool threads
    line2=`grep 'Model_Task_Executor_Pool' stack_thread.log | awk '{print $4}' | uniq -c`
    echo " total: $line2"
    echo "=================>"

    sleep 5s
done
```
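If the script is saved as, say, `stack_watch.sh` (the name is mine, not from the original), it can be left running with `./stack_watch.sh <java-pid>` across an alert window, and the sampled thread counts lined up against the monitoring curve afterwards.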