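Folks, I recently worked on a Seata issue: a distributed transaction whose rollback had been failing for a long time suddenly rolled back successfully.
Two things have to happen for this problem to appear:
1. afterImage does not match the current data, so the rollback fails and Seata keeps retrying;
2. afterImage later matches the current data again, so a retried rollback succeeds and an ABA problem arises.
To avoid the ABA problem, after discussing it with the maintainers in the Seata community, we decided that when afterImage does not match the current data during rollback, Seata should stop retrying the rollback. That way no rollback happens even after the data has been manually calibrated later. This raises another problem, though: after manual calibration the distributed transaction stays in the database and cannot be deleted. To address it, Seata should provide a RESTful API that lets developers delete the corresponding distributed transaction data once the data has been calibrated.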
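The two conditions above can be reproduced with a toy model. This is only a sketch under simplifying assumptions: the class AbaRollbackDemo, its fields, and the stock values are invented for illustration, a single string stands in for a table row, and the beforeImage comparison that Seata also performs is left out.

```java
import java.util.Objects;

/** Toy model of one row guarded by an undo log; hypothetical, not Seata code. */
class AbaRollbackDemo {
    static String current;                   // current value of the row in the database
    static final String BEFORE = "stock=10"; // beforeImage: value before the branch transaction
    static final String AFTER = "stock=9";   // afterImage: value written by the branch transaction

    /** Attempts the undo; returns true if the row was restored to BEFORE. */
    static boolean tryRollback() {
        if (!Objects.equals(current, AFTER)) {
            return false; // dirty data: with the old behavior another retry is scheduled
        }
        current = BEFORE;
        return true;
    }

    public static void main(String[] args) {
        current = AFTER;                   // the branch transaction commits locally
        current = "stock=8";               // a write outside the global transaction
        System.out.println(tryRollback()); // false: afterImage != current
        current = "stock=9";               // later the value happens to match afterImage again
        System.out.println(tryRollback()); // true: the retry succeeds
        System.out.println(current);       // stock=10: the intermediate writes are silently discarded
    }
}
```

The second call succeeds only because the value happened to return to the afterImage, which is exactly the ABA situation that endless retrying makes possible.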
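In the Seata source code, when the check finds that afterImage does not match the current data, a SQLException is thrown. The calling code catches it and wraps it in a BranchTransactionException whose code property is BranchRollbackFailed_Retriable, and that is the root cause of Seata retrying the rollback forever: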
    Result<Boolean> afterEqualsCurrentResult = DataCompareUtils.isRecordsEquals(afterRecords, currentRecords);
    if (!afterEqualsCurrentResult.getResult()) {
        // afterImage differs from the current data, so check whether the current data equals beforeImage
        Result<Boolean> beforeEqualsCurrentResult = DataCompareUtils.isRecordsEquals(beforeRecords, currentRecords);
        // If the current data equals beforeImage, no rollback is needed: it is effectively already rolled back
        if (beforeEqualsCurrentResult.getResult()) {
            if (LOGGER.isInfoEnabled()) {
                LOGGER.info("Stop rollback because there is no data change " +
                    "between the before data snapshot and the current data snapshot.");
            }
            // no need continue undo.
            return false;
        } else {
            // Otherwise throw a SQLException to report that the undo log has been dirty-written
            if (LOGGER.isInfoEnabled()) {
                if (StringUtils.isNotBlank(afterEqualsCurrentResult.getErrMsg())) {
                    LOGGER.info(afterEqualsCurrentResult.getErrMsg(), afterEqualsCurrentResult.getErrMsgParams());
                }
            }
            if (LOGGER.isDebugEnabled()) {
                LOGGER.debug("check dirty data failed, old and new data are not equal, " +
                    "tableName:[" + sqlUndoLog.getTableName() + "]," +
                    "oldRows:[" + JSON.toJSONString(afterRecords.getRows()) + "]," +
                    "newRows:[" + JSON.toJSONString(currentRecords.getRows()) + "].");
            }
            throw new SQLException("Has dirty records when undo.");
        }
    }
In the calling code we can find this snippet:
    catch (Throwable e) {
        if (conn != null) {
            try {
                conn.rollback();
            } catch (SQLException rollbackEx) {
                LOGGER.warn("Failed to close JDBC resource while undo ... ", rollbackEx);
            }
        }
        // Wrap the exception; every failure is marked as retriable here
        throw new BranchTransactionException(BranchRollbackFailed_Retriable, String
            .format("Branch session rollback failed and try again later xid = %s branchId = %s %s", xid,
                branchId, e.getMessage()), e);
    }
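From this source analysis, the SQLException thrown after the data check is wrapped in a BranchTransactionException whose code is BranchRollbackFailed_Retriable, which makes Seata retry the rollback over and over.
We need to change this SQLException into a more specific exception, for example SQLUndoDirtyException, that clearly states that the undo log has been dirty-written. The calling code also needs to treat SQLUndoDirtyException specially, for example by wrapping it as new BranchTransactionException(BranchRollbackFailed_Unretriable), a non-retriable status.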
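Before looking at the actual change, the idea can be tried in isolation. The sketch below is self-contained and uses hypothetical stand-ins (RollbackCode, DirtyUndoException, RollbackClassifier) rather than Seata's own classes; it only shows how a dedicated exception subtype lets the catch site tell a dirty undo log apart from other failures.

```java
import java.sql.SQLException;

// Hypothetical stand-ins for illustration; these are not Seata's actual types.
enum RollbackCode { RETRIABLE, UNRETRIABLE }

/** Marks the "undo log is dirty" case so that it can be recognized by type. */
class DirtyUndoException extends SQLException {
    DirtyUndoException(String reason) {
        super(reason);
    }
}

class RollbackClassifier {
    /** A dirty undo log must not be retried (ABA risk); any other failure may be retried. */
    static RollbackCode classify(Throwable e) {
        return (e instanceof DirtyUndoException) ? RollbackCode.UNRETRIABLE : RollbackCode.RETRIABLE;
    }

    public static void main(String[] args) {
        System.out.println(classify(new DirtyUndoException("Has dirty records when undo."))); // UNRETRIABLE
        System.out.println(classify(new SQLException("connection reset")));                   // RETRIABLE
    }
}
```

Because the marker type still extends SQLException, existing code that catches SQLException keeps working unchanged.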
First, create the custom exception SQLUndoDirtyException:
    import java.io.Serializable;
    import java.sql.SQLException;

    /**
     * Thrown when the undo log has been dirty-written and the undo cannot be applied.
     *
     * @author zouwei
     */
    // SQLException is already Serializable; the explicit interface is redundant but harmless.
    class SQLUndoDirtyException extends SQLException implements Serializable {

        private static final long serialVersionUID = -5168905669539637570L;

        SQLUndoDirtyException(String reason) {
            super(reason);
        }
    }
Change SQLException to SQLUndoDirtyException:
    Result<Boolean> afterEqualsCurrentResult = DataCompareUtils.isRecordsEquals(afterRecords, currentRecords);
    if (!afterEqualsCurrentResult.getResult()) {
        // afterImage differs from the current data, so check whether the current data equals beforeImage
        Result<Boolean> beforeEqualsCurrentResult = DataCompareUtils.isRecordsEquals(beforeRecords, currentRecords);
        // If the current data equals beforeImage, no rollback is needed: it is effectively already rolled back
        if (beforeEqualsCurrentResult.getResult()) {
            if (LOGGER.isInfoEnabled()) {
                LOGGER.info("Stop rollback because there is no data change " +
                    "between the before data snapshot and the current data snapshot.");
            }
            // no need continue undo.
            return false;
        } else {
            // Otherwise report that the undo log has been dirty-written
            if (LOGGER.isInfoEnabled()) {
                if (StringUtils.isNotBlank(afterEqualsCurrentResult.getErrMsg())) {
                    LOGGER.info(afterEqualsCurrentResult.getErrMsg(), afterEqualsCurrentResult.getErrMsgParams());
                }
            }
            if (LOGGER.isDebugEnabled()) {
                LOGGER.debug("check dirty data failed, old and new data are not equal, " +
                    "tableName:[" + sqlUndoLog.getTableName() + "]," +
                    "oldRows:[" + JSON.toJSONString(afterRecords.getRows()) + "]," +
                    "newRows:[" + JSON.toJSONString(currentRecords.getRows()) + "].");
            }
            // Throw the specific SQLUndoDirtyException instead of a plain SQLException
            throw new SQLUndoDirtyException("Has dirty records when undo.");
        }
    }
With that, the calling code can handle this case specifically:
    catch (Throwable e) {
        if (conn != null) {
            try {
                conn.rollback();
            } catch (SQLException rollbackEx) {
                LOGGER.warn("Failed to close JDBC resource while undo ... ", rollbackEx);
            }
        }
        // If the caught exception is a SQLUndoDirtyException, wrap it as BranchRollbackFailed_Unretriable
        if (e instanceof SQLUndoDirtyException) {
            throw new BranchTransactionException(BranchRollbackFailed_Unretriable, String.format(
                "Branch session rollback failed because of dirty undo log, please delete the relevant undolog after manually calibrating the data. xid = %s branchId = %s",
                xid, branchId), e);
        }
        // Any other failure stays retriable
        throw new BranchTransactionException(BranchRollbackFailed_Retriable,
            String.format("Branch session rollback failed and try again later xid = %s branchId = %s %s", xid,
                branchId, e.getMessage()),
            e);
    }
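The calling code catches SQLUndoDirtyException specifically and wraps it in a BranchTransactionException with the BranchRollbackFailed_Unretriable status, so the distributed transaction no longer retries the rollback forever. The next step is for developers to step in, calibrate the data by hand, and delete the corresponding undo log. Once all of that is done, the Seata TC side also needs to provide a RESTful API that exposes a manually triggered rollback, so that the calibrated distributed transaction can finish normally.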
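To see why the status code matters to whoever drives the retries, here is a minimal model of a retry loop. It is only an assumption about the general shape of such a loop: RetryLoopDemo, Outcome, and the maxAttempts cap are invented for illustration and do not reflect how the Seata TC actually schedules retries.

```java
import java.util.function.Supplier;

// Hypothetical model of a coordinator-side retry loop; not Seata's real scheduler.
class RetryLoopDemo {
    enum Outcome { ROLLED_BACK, FAILED_RETRIABLE, FAILED_UNRETRIABLE }

    /** Calls the branch until it stops asking for a retry; returns the number of attempts made. */
    static int rollbackWithRetry(Supplier<Outcome> branch, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            Outcome outcome = branch.get();
            if (outcome != Outcome.FAILED_RETRIABLE) {
                break; // either success, or a failure that needs manual intervention
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        System.out.println(rollbackWithRetry(() -> Outcome.FAILED_RETRIABLE, 5));   // 5
        System.out.println(rollbackWithRetry(() -> Outcome.FAILED_UNRETRIABLE, 5)); // 1
    }
}
```

With a retriable failure the loop runs until the cap is hit, while an unretriable failure ends it after the first attempt and leaves the transaction waiting for a human.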
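Starting from a problem reported by Seata users, we traced the cause through the source code:
1. Some business logic was not covered by the @GlobalTransactional annotation, so the undo log was dirty-written;
2. afterImage did not match the current data, so the rollback could not proceed; a SQLException was thrown and eventually wrapped as a BranchRollbackFailed_Retriable exception, which made Seata keep retrying the rollback;
3. At some point the current data matched afterImage again, Seata's rollback succeeded, and an ABA problem resulted.
From version 1.6 on, this PR fixes Seata retrying the rollback endlessly and so avoids the ABA problem. Further features are still needed to help developers roll their data back.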