202208-源码解析springbatch的job是如何运行的?
注,本文中的demo代码节选于图书《Spring Batch批处理框架》的配套源代码,并做并适配springboot升级版本,完全开源。
SpringBatch的背景和用法,就不再赘述了,默认本文受众都使用过batch框架。
本文仅讨论普通的ChunkStep,分片/异步处理等功能暂不讨论。
1. 表结构
Spring系列的框架代码,大多又臭又长,让人头晕。先列出整体流程,再去看源码。顺带也可以了解存储表结构。
- 每一个jobname,加运行参数的MD5值,被定义为一个job_instance,存储在batch_job_instance表中;
- job_instance每次运行时,会创建一个新的job_execution,存储在batch_job_execution / batch_job_execution_context 表中;
- 扩展:任务重启时,如何续作? 答,判定为任务续作,创建新的job_execution时,会使用旧job_execution的运行态ExecutionContext(通俗讲,火车出故障只换了车头,车厢货物不变。)
- job_execution会根据job排程中的step顺序,逐个执行,逐个转化为step_execution,并存储在batch_step_execution / batch_step_execution_context表中
- 每个step在执行时,会维护step运行状态,当出现异常或者整个step清单执行完成,会更新job_execution的状态
- 在每个step执行前后、job_execution前后,都会通知Listener做回调。
框架使用的表
batch_job_instance batch_job_execution batch_job_execution_context batch_job_execution_params batch_step_execution batch_step_execution_context batch_job_seq batch_step_execution_seq batch_job_execution_seq
2. API入口
先看看怎么调用启动Job的API,看起来非常简单,传入job信息和参数即可
@Autowired @Qualifier("billJob") private Job job; @Test public void billJob() throws Exception { JobParameters jobParameters = new JobParametersBuilder() .addLong("currentTimeMillis", System.currentTimeMillis()) .addString("batchNo","2022080402") .toJobParameters(); JobExecution result = jobLauncher.run(job, jobParameters); System.out.println(result.toString()); Thread.sleep(6000); }
<batch:job id="billJob"> <batch:step id="billStep"> <batch:tasklet transaction-manager="transactionManager"> <batch:chunk reader="csvItemReader" writer="csvItemWriter" processor="creditBillProcessor" commit-interval="3"> batch:chunk> batch:tasklet> batch:step> batch:job>
org.springframework.batch.core.launch.support.SimpleJobLauncher#run
// 简化部分代码(参数检查、log日志) @Override public JobExecution run(final Job job, final JobParameters jobParameters){ final JobExecution jobExecution; JobExecution lastExecution = jobRepository.getLastJobExecution(job.getName(), jobParameters); // 上次执行存在,说明本次请求是重启job,先做检查 if (lastExecution != null) { if (!job.isRestartable()) { throw new JobRestartException("JobInstance already exists and is not restartable"); } /* 检查stepExecutions的状态 * validate here if it has stepExecutions that are UNKNOWN, STARTING, STARTED and STOPPING * retrieve the previous execution and check */ for (StepExecution execution : lastExecution.getStepExecutions()) { BatchStatus status = execution.getStatus(); if (status.isRunning() || status == BatchStatus.STOPPING) { throw new JobExecutionAlreadyRunningException("A job execution for this job is already running: " + lastExecution); } else if (status == BatchStatus.UNKNOWN) { throw new JobRestartException( "Cannot restart step [" + execution.getStepName() + "] from UNKNOWN status. "); } } } // Check jobParameters job.getJobParametersValidator().validate(jobParameters); // 创建JobExecution 同一个job+参数,只能有一个Execution执行器 jobExecution = jobRepository.createJobExecution(job.getName(), jobParameters); try { // SyncTaskExecutor 看似是异步,实际是同步执行(可扩展) taskExecutor.execute(new Runnable() { @Override public void run() { try { // 关键入口,请看[org.springframework.batch.core.job.AbstractJob#execute] job.execute(jobExecution); if (logger.isInfoEnabled()) { Duration jobExecutionDuration = BatchMetrics.calculateDuration(jobExecution.getStartTime(), jobExecution.getEndTime()); } } catch (Throwable t) { rethrow(t); } } private void rethrow(Throwable t) { // 省略各类抛异常 throw new IllegalStateException(t); } }); } catch (TaskRejectedException e) { // 更新job_execution的运行状态 jobExecution.upgradeStatus(BatchStatus.FAILED); if (jobExecution.getExitStatus().equals(ExitStatus.UNKNOWN)) { jobExecution.setExitStatus(ExitStatus.FAILED.addExitDescription(e)); } jobRepository.update(jobExecution); } return jobExecution; }
3. 深入代码流程
简单看看API入口,子类划分较多,继续往后看
总体代码流程
- org.springframework.batch.core.launch.support.SimpleJobLauncher#run 入口api,构建jobExecution
- org.springframework.batch.core.job.AbstractJob#execute 对jobExecution进行执行、listener的前置处理
- FlowJob#doExecute -> SimpleFlow#start 按顺序逐个处理Step、构建stepExecution
- JobFlowExecutor#executeStep -> SimpleStepHandler#handleStep -> AbstractStep#execute 执行stepExecution
- TaskletStep#doExecute 通过RepeatTemplate,调用TransactionTemplate方法,在事务中执行
- 内部类TaskletStep.ChunkTransactionCallback#doInTransaction
- 反复调起ChunkOrientedTasklet#execute 去执行read-process-writer方法,
- 通过自定义的Reader得到inputs,例如本文实现的是flatReader读取csv文件
- 遍历inputs,将item逐个传入,调用processor处理
- 调用writer,将outputs一次性写入
- 不同reader的实现内容不同,通过缓存读取的行数等信息,可做到分片、按数量处理chunk
JobExecution的处理过程
org.springframework.batch.core.job.AbstractJob#execute
/** 运行给定的job,处理全部listener和DB存储的调用 * Run the specified job, handling all listener and repository calls, and * delegating the actual processing to {@link #doExecute(JobExecution)}. * * @see Job#execute(JobExecution) * @throws StartLimitExceededException * if start limit of one of the steps was exceeded */ @Ovrride public final void execute(JobExecution execution) { // 同步控制器,防并发执行 JobSynchronizationManager.register(execution); // 计时器,记录耗时 LongTaskTimer longTaskTimer = BatchMetrics.createLongTaskTimer("job.active", "Active jobs", Tag.of("name", execution.getJobInstance().getJobName())); LongTaskTimer.Sample longTaskTimerSample = longTaskTimer.start(); Timer.Sample timerSample = BatchMetrics.createTimerSample(); try { // 参数再次进行校验 jobParametersValidator.validate(execution.getJobParameters()); if (execution.getStatus() != BatchStatus.STOPPING) { // 更新db中任务状态 execution.setStartTime(new Date()); updateStatus(execution, BatchStatus.STARTED); // 回调所有listener的beforeJob方法 listener.beforeJob(execution); try { doExecute(execution); } catch (RepeatException e) { throw e.getCause(); // 搞不懂这里包一个RepeatException 有啥用 } } else { // 任务状态时BatchStatus.STOPPING,说明任务已经停止,直接改成STOPPED // The job was already stopped before we even got this far. Deal // with it in the same way as any other interruption. execution.setStatus(BatchStatus.STOPPED); execution.setExitStatus(ExitStatus.COMPLETED); } } catch (JobInterruptedException e) { // 任务被打断 STOPPED execution.setExitStatus(getDefaultExitStatusForFailure(e, execution)); execution.setStatus(BatchStatus.max(BatchStatus.STOPPED, e.getStatus())); execution.addFailureException(e); } catch (Throwable t) { // 其他原因失败 FAILED logger.error("Encountered fatal error executing job", t); execution.setExitStatus(getDefaultExitStatusForFailure(t, execution)); execution.setStatus(BatchStatus.FAILED); execution.addFailureException(t); } finally { try { if (execution.getStatus().isLessThanOrEqualTo(BatchStatus.STOPPED) && execution.getStepExecutions().isEmpty()) { ExitStatus exitStatus = execution.getExitStatus(); ExitStatus newExitStatus = ExitStatus.NOOP.addExitDescription("All steps already completed or no steps configured for this job."); execution.setExitStatus(exitStatus.and(newExitStatus)); } // 计时器 计算总耗时 timerSample.stop(BatchMetrics.createTimer("job", "Job duration", Tag.of("name", execution.getJobInstance().getJobName()), Tag.of("status", execution.getExitStatus().getExitCode()) )); longTaskTimerSample.stop(); execution.setEndTime(new Date()); try { // 回调所有listener的afterJob方法 调用失败也不影响任务完成 listener.afterJob(execution); } catch (Exception e) { logger.error("Exception encountered in afterJob callback", e); } // 写入db jobRepository.update(execution); } finally { // 释放控制 JobSynchronizationManager.release(); } } }
3.2何时调用Reader?
在SimpleChunkProvider#provide中会分次调用reader,并将结果包装为Chunk返回。
其中有几个细节,此处不再赘述。
- 如何控制一次读取几个item?
- 如何控制最后一行读完就不读了?
- 如果需要跳过文件头的前N行,怎么处理?
- 在StepContribution中记录读取数量
org.springframework.batch.core.step.item.SimpleChunkProcessor#process @Nullable @Override public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception { @SuppressWarnings("unchecked") Chunk inputs = (Chunk) chunkContext.getAttribute(INPUTS_KEY); if (inputs == null) { inputs = chunkProvider.provide(contribution); if (buffering) { chunkContext.setAttribute(INPUTS_KEY, inputs); } } chunkProcessor.process(contribution, inputs); chunkProvider.postProcess(contribution, inputs); // Allow a message coming back from the processor to say that we // are not done yet if (inputs.isBusy()) { logger.debug("Inputs still busy"); return RepeatStatus.CONTINUABLE; } chunkContext.removeAttribute(INPUTS_KEY); chunkContext.setComplete(); if (logger.isDebugEnabled()) { logger.debug("Inputs not busy, ended: " + inputs.isEnd()); } return RepeatStatus.continueIf(!inputs.isEnd()); }
3.3何时调用Processor/Writer?
在RepeatTemplate和外围事务模板的包装下,通过SimpleChunkProcessor进行处理:
- 查出若干条数的items,打包为Chunk
- 遍历items,逐个item调用processor
- 通知StepListener,环绕处理调用before/after方法
// 忽略无关代码... @Override public final void process(StepContribution contribution, Chunk inputs) throws Exception { // 输入为空,直接返回If there is no input we don't have to do anything more if (isComplete(inputs)) { return; } // Make the transformation, calling remove() on the inputs iterator if // any items are filtered. Might throw exception and cause rollback. Chunk outputs = transform(contribution, inputs); // Adjust the filter count based on available data contribution.incrementFilterCount(getFilterCount(inputs, outputs)); // Adjust the outputs if necessary for housekeeping purposes, and then // write them out... write(contribution, inputs, getAdjustedOutputs(inputs, outputs)); } // 遍历items,逐个item调用processor protected Chunk transform(StepContribution contribution, Chunk inputs) throws Exception { Chunk outputs = new Chunk<>(); for (Chunk.ChunkIterator iterator = inputs.iterator(); iterator.hasNext();) { final I item = iterator.next(); O output; String status = BatchMetrics.STATUS_SUCCESS; try { output = doProcess(item); } catch (Exception e) { /* * For a simple chunk processor (no fault tolerance) we are done here, so prevent any more processing of these inputs. */ inputs.clear(); status = BatchMetrics.STATUS_FAILURE; throw e; } if (output != null) { outputs.add(output); } else { iterator.remove(); } } return outputs; }
4. 每个step是如何与事务处理挂钩?
在TaskletStep#doExecute中会使用TransactionTemplate,包装事务操作
标准的事务操作,通过函数式编程风格,从action的CallBack调用实际处理方法
- 通过transactionManager获取事务
- 执行操作
- 无异常,则提交事务
- 若异常,则回滚
// org.springframework.batch.core.step.tasklet.TaskletStep#doExecute result = new TransactionTemplate(transactionManager, transactionAttribute) .execute(new ChunkTransactionCallback(chunkContext, semaphore)); // 事务启用过程 // org.springframework.transaction.support.TransactionTemplate#execute @Override @Nullable public T execute(TransactionCallback action) throws TransactionException { Assert.state(this.transactionManager != null, "No PlatformTransactionManager set"); if (this.transactionManager instanceof CallbackPreferringPlatformTransactionManager) { return ((CallbackPreferringPlatformTransactionManager) this.transactionManager).execute(this, action); } else { TransactionStatus status = this.transactionManager.getTransaction(this); T result; try { result = action.doInTransaction(status); } catch (RuntimeException | Error ex) { // Transactional code threw application exception -> rollback rollbackOnException(status, ex); throw ex; } catch (Throwable ex) { // Transactional code threw unexpected exception -> rollback rollbackOnException(status, ex); throw new UndeclaredThrowableException(ex, "TransactionCallback threw undeclared checked exception"); } this.transactionManager.commit(status); return result; } }
5. 怎么控制每个chunk几条记录提交一次事务? 控制每个事务窗口处理的item数量
在配置任务时,有个step级别的参数,[commit-interval],用于每个事务窗口提交的控制被处理的item数量。
RepeatTemplate#executeInternal 在处理单条item后,会查看已处理完的item数量,与配置的chunk数量做比较,如果满足chunk数,则不再继续,准备提交事务。
StepBean在初始化时,会新建SimpleCompletionPolicy(chunkSize会优先使用配置值,默认是5)
在每个chunk处理开始时,都会调用SimpleCompletionPolicy#start新建RepeatContextSupport#count用于计数。
源码(简化) org.springframework.batch.repeat.support.RepeatTemplate#executeInternal
/** * Internal convenience method to loop over interceptors and batch * callbacks. * @param callback the callback to process each element of the loop. */ private RepeatStatus executeInternal(final RepeatCallback callback) { // Reset the termination policy if there is one... // 此处会调用completionPolicy.start方法,更新chunk的计数器 RepeatContext context = start(); // Make sure if we are already marked complete before we start then no processing takes place. // 通过running字段来判断是否继续处理next boolean running = !isMarkedComplete(context); // 省略listeners处理.... // Return value, default is to allow continued processing. RepeatStatus result = RepeatStatus.CONTINUABLE; RepeatInternalState state = createInternalState(context); try { while (running) { /* * Run the before interceptors here, not in the task executor so * that they all happen in the same thread - it's easier for * tracking batch status, amongst other things. */ // 省略listeners处理.... if (running) { try { // callback是实际处理方法,类似函数式编程 result = getNextResult(context, callback, state); executeAfterInterceptors(context, result); } catch (Throwable throwable) { doHandle(throwable, context, deferred); } // 检查当前chunk是否处理完,决策出是否继续处理下一条item // N.B. the order may be important here: if (isComplete(context, result) || isMarkedComplete(context) || !deferred.isEmpty() { running = false; } } } result = result.and(waitForResults(state)); // 省略throwables处理.... // Explicitly drop any references to internal state... state = null; } finally { // 省略代码... } return result; }
总结
JSR-352标准定义了Java批处理的基本模型,包含批处理的元数据像 JobExecutions,JobInstances,StepExecutions 等等。通过此类模型,提供了许多基础组件与扩展点:
- 完善的基础组件
- Spring Batch 有很多的这类组件 例如 ItemReaders,ItemWriters,PartitionHandlers 等等对应各类数据和环境。
- 丰富的配置
- JSR-352 定义了基于XML的任务设置模型。Spring Batch 提供了基于Java (类型安全的)的配置方式
- 可伸缩性
- 伸缩性选项-Local Partitioning 已经包含在JSR -352 里面了。但是还应该有更多的选择 ,例如Spring Batch 提供的 Multi-threaded Step,Remote Partitioning ,Parallel Step,Remote Chunking 等等选项
- 扩展点
- 良好的listener模式,提供step/job运行前后的锚点,以供开发人员个性化处理批处理流程。
2013年, JSR-352标准包含在 JavaEE7中发布,到2022年已近10年,Spring也在探索新的批处理模式, 如Spring Attic /Spring Cloud Data Flow。 https://docs.spring.io/spring-batch/docs/current/reference/html/jsr-352.html
扩展
1. Job/Step运行时的上下文,是如何保存?如何控制?
整个Job在运行时,会将运行信息保存在JobContext中。 类似的,Step运行时也有StepContext。可以在Context中保存一些参数,在任务或者步骤中传递使用。
查看JobContext/StepContext源码,发现仅用了普通变量保存Execution,这个类肯定有线程安全问题。 生产环境中常常出现多个任务并处处理的情况。
SpringBatch用了几种方式来包装并发安全:
- 每个job初始化时,通过JobExecution新建了JobContext,每个任务线程都用自己的对象。
- 使用JobSynchronizationManager,内含一个ConcurrentHashMap,KEY是JobExecution,VALUE是JobContext
- 在任务解释时,会移除当前JobExecution对应的k-v
此处能看到,如果在JobExecution存储大量的业务数据,会导致无法GC回收,导致OOM。所以在上下文中,只应保存精简的数据。
2. step执行时,如果出现异常,如何保护运行状态?
在源码中,使用了各类同步控制和加锁、oldVersion版本拷贝,整体比较复杂(org.springframework.batch.core.step.tasklet.TaskletStep.ChunkTransactionCallback#doInTransaction)
- oldVersion版本拷贝:上一次运行出现异常时,本次执行时沿用上次的断点内容
// 节选部分代码 oldVersion = new StepExecution(stepExecution.getStepName(), stepExecution.getJobExecution()); copy(stepExecution, oldVersion); private void copy(final StepExecution source, final StepExecution target) { target.setVersion(source.getVersion()); target.setWriteCount(source.getWriteCount()); target.setFilterCount(source.getFilterCount()); target.setCommitCount(source.getCommitCount()); target.setExecutionContext(new ExecutionContext(source.getExecutionContext())); }
- 信号量控制,在每个chunk运行完成后,需先获取锁,再更新stepExecution前
- Shared semaphore per step execution, so other step executions can run in parallel without needing the lockSemaphore (org.springframework.batch.core.step.tasklet.TaskletStep#doExecute)
// 省略无关代码 try { try { // 执行w-p-r模型方法 result = tasklet.execute(contribution, chunkContext); if (result == null) { result = RepeatStatus.FINISHED; } } catch (Exception e) { // 省略... } } finally { // If the step operations are asynchronous then we need to synchronize changes to the step execution (at a // minimum). Take the lock *before* changing the step execution. try { // 获取锁 semaphore.acquire(); locked = true; } catch (InterruptedException e) { logger.error("Thread interrupted while locking for repository update"); stepExecution.setStatus(BatchStatus.STOPPED); stepExecution.setTerminateOnly(); Thread.currentThread().interrupt(); } stepExecution.apply(contribution); } stepExecutionUpdated = true; stream.update(stepExecution.getExecutionContext()); try { // 更新上下文、DB中的状态 // Going to attempt a commit. If it fails this flag will stay false and we can use that later. getJobRepository().updateExecutionContext(stepExecution); stepExecution.incrementCommitCount(); getJobRepository().update(stepExecution); } catch (Exception e) { // If we get to here there was a problem saving the step execution and we have to fail. String msg = "JobRepository failure forcing rollback"; logger.error(msg, e); throw new FatalStepExecutionException(msg, e); }