GreatSQL社区

uid：10

注册时间：2022-7-25 10:25 上次发表时间：2026-5-20 14:27

好友数 0 | 博客数 382 | 回帖数 47 | 主题数 10

参数配置不当导致 GreatSQL 异步复制 IO 线程中断

GreatSQL社区已有 216 次阅读2026-4-15 14:00 |系统分类:运维实战

参数配置不当导致 GreatSQL 异步复制 IO 线程中断

1.问题发生

某个客户的一套 GreatSQL 数据库，异步复制的 slave 节点的 IO 线程无法启动，报错信息如下：

2026-02-16T12:36:40.086620+08:00 12452 [ERROR] [MY-010557] [Repl] Error reading packet from server for channel '':Cannot replicate because the master purged required binary logs. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the master's binary log expiration period. The GTID set sent by the slave is '8451fc4d-dlae-1lec-b273-525400fdb1cb:1-424857633, 8451fc4d-d1ae-11ec-b2bf-5254002b8928:1-19789717:19789719-20046740, 8653ccd7-d1ae-11ec-a24c-525400791c4c:1-5', and the missing transaction is '8451fc4d-d1ae-11ec-b2bf-5254002b8928:19789718' (server errno=1236)
2026-02-16T12:36:40.086664+08:00 12452 [ERROR] [MY-013114] [Repl] Slave I/0 for channel: '' Got fatal error 1236 from master when reading data from binary log: 'Cannot replicate because the master purged required binary logs. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the master's binary log expiration period. The GTID set sent by the slave is '8451fc4d-dlae-1lec-b273-525400fdb1cb:1-424857633, 8451fc4d-d1ae-11ec-b2bf-5254002b8928:1-19789717:19789719-20046740, 8653ccd7-d1ae-11ec-a24c-525400791c4c:1-5', and the missing transaction is '8451fc4d-d1ae-11ec-b2bf-5254002b8928:19789718'', Error code: MY-013114

此日志信息显示：由于主库的 binlog 已清除，IO 线程中断。用户现场 DBA 进行了 clone 恢复。

但问题原因究竟是什么呢？为什么中间丢失了一个事务 8451fc4d-d1ae-11ec-b2bf-5254002b8928:19789718 呢？

2. 问题分析

2.1 error.log 解读

仔细阅读 error.log 中的报错信息，可以解读出如下信息：

slave 在启动 IO 线程时向 master 节点发送的 GTID 事务集是：

8451fc4d-dlae-1lec-b273-525400fdb1cb:1-424857633, 
8451fc4d-d1ae-11ec-b2bf-5254002b8928:1-19789717:19789719-20046740, 
8653ccd7-d1ae-11ec-a24c-525400791c4c:1-5'

slave 已经执行的 GTID 事务集不连续，中间缺少了事务 8451fc4d-d1ae-11ec-b2bf-5254002b8928:19789718。
slave 在缺少该事务 8451fc4d-d1ae-11ec-b2bf-5254002b8928:19789718 后，执行了 257021 个事务。
IO 线程启动失败的原因是：master 节点中包含事务 19789718 的 binlog 已被清理。

2.2 初步判断

检查用户的 binlog_expire_logs_seconds 是 604800（7 天），并不短。

根据 GTID 有空洞的现象，判断与参数 slave_skip_errors 数相关，检查用户的参数配置 slave_skip_errors 是 ALL。

参数 slave_skip_errors 默认（OFF）情况下，当 slave 上发生错误时，复制进程会停止，让 DBA 有机会解决数据中的不一致问题。当配置 slave_skip_errors 配置错误类型时，当 slave 上某条语句返回的错误属于该变量值中列出的任一错误类型，复制 SQL 线程仍会继续执行。当 slave_skip_errors=ALL 时，slave 上发生任何错误，SQL 线程都会继续执行。

3. 实验验证

3.1 实验设计

搭建异步复制的环境，slave 端设置并行复制，参数 slave_skip_errors=all 。
在 slave 端制造事务阻塞，让 slave SQL 线程，执行失败。
观察 slave 端事务的执行情况，检查是否会出现 GTID 空洞。
重启 slave，检查是否出现与用户相同的报错信息。

3.2 实验 1：slave_skip_errors=ALL 模拟问题

搭建异步复制的环境，slave 端设置并行复制，参数 slave_skip_errors=ALL

检查 slave 端的参数

GreatSQL> SELECT @@global.slave_parallel_workers,@@global.innodb_lock_wait_timeout,@@global.slave_skip_errors\G
*************************** 1. row ***************************
  @@global.slave_parallel_workers: 64
@@global.innodb_lock_wait_timeout: 10
       @@global.slave_skip_errors: ALL
1 row in set, 2 warnings (0.01 sec)

在 slave 端，开始事务执行 update 语句(test1 的 id 是主键)，不提交；

BEGIN;
UPDATE test1 SET c='dddddddddddd' WHERE id=2;

在 master 端执行修改相同行的 update 语句，该语句在 slave 端应用时，会被阻塞。

UPDATE test1 SET c='eeeeeeeee' WHERE id=2;

在 master 端使用 sysbench 执行其他事务。
检查 master 端和 slave 端的 gtid_execued

-- master端
GreatSQL> SELECT @@gtid_executed;
+----------------------------------------------------------------------------------------+
| @@gtid_executed                                                                        |
+----------------------------------------------------------------------------------------+
| 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6,
c12440a7-12c1-11f1-95ad-00163e379f11:1-84055 |
+----------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

GreatSQL> SELECT * FROM test1 WHERE id=2;
+----+-------+-----------+-------------------------------------------------------------+
| id | k     | c         | pad                                                         |
+----+-------+-----------+-------------------------------------------------------------+
|  2 | 50248 | eeeeeeeee | 28733802923-10548894641-11867531929-71265603657-36546888392 |
+----+-------+-----------+-------------------------------------------------------------+
1 row in set (0.00 sec)

-- slave端
GreatSQL> SELECT @@gtid_executed;
+--------------------------------------------------------------------------------------------------------------------------------------------+
| @@gtid_executed                                                                                                                            |
+--------------------------------------------------------------------------------------------------------------------------------------------+
| 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6,
c12440a7-12c1-11f1-95ad-00163e379f11:1-43804:43806-84055,
f8afd36c-12c1-11f1-a6e8-00163e892deb:1 |
+--------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

GreatSQL> SELECT * FROM test1 WHERE id=2;
+----+-------+--------------+-------------------------------------------------------------+
| id | k     | c            | pad                                                         |
+----+-------+--------------+-------------------------------------------------------------+
|  2 | 50248 | dddddddddddd | 28733802923-10548894641-11867531929-71265603657-36546888392 |
+----+-------+--------------+-------------------------------------------------------------+
1 row in set (0.00 sec)

可见事务 c12440a7-12c1-11f1-95ad-00163e379f11:40805 没有在 slave 端执行，slave 端的 GTID 出现可空洞。

检查 slave 端的 error.log

在 error.log 中有该事务的如下信息

2026-02-27T09:26:49.530843+08:00 29 [Note] [MY-010584] [Repl] Slave SQL for channel '': Worker 18 failed executing transaction 'c12440a7-12c1-11f1-95ad-00163e379f11:43805' at master log binlog.000003, end_log_pos 266711465; Could not execute Update_rows event on table testdb.test1; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log binlog.000003, end_log_pos 266711465, Error_code: MY-001205

可知由于锁等待超时，事务 c12440a7-12c1-11f1-95ad-00163e379f11:43805 执行失败。

检查 slave 的状态

GreatSQL> SHOW slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: 172.17.138.128
                  Master_User: greatdb
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: binlog.000003
          Read_Master_Log_Pos: 337004154
               Relay_Log_File: relaylog.000004
                Relay_Log_Pos: 337003634
        Relay_Master_Log_File: binlog.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 337004154
              Relay_Log_Space: 337003837
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1006503306
                  Master_UUID: c12440a7-12c1-11f1-95ad-00163e379f11
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: c12440a7-12c1-11f1-95ad-00163e379f11:1-84055
            Executed_Gtid_Set: 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6,
c12440a7-12c1-11f1-95ad-00163e379f11:1-43804:43806-84055,
f8afd36c-12c1-11f1-a6e8-00163e892deb:1
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
       Master_public_key_path: 
        Get_master_public_key: 0
            Network_Namespace: 
1 row in set, 1 warning (0.00 sec)

可见 io 和 sql 线程都正常，数据不一致的隐患未及时发现。

master 端 purge binlog 后，重启 slave，slave 正常。

GreatSQL> SHOW slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: 172.17.138.128
                  Master_User: greatdb
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: binlog.000004
          Read_Master_Log_Pos: 237
               Relay_Log_File: relaylog.000007
                Relay_Log_Pos: 407
        Relay_Master_Log_File: binlog.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 237
              Relay_Log_Space: 860
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1006503306
                  Master_UUID: c12440a7-12c1-11f1-95ad-00163e379f11
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: c12440a7-12c1-11f1-95ad-00163e379f11:1-84055
            Executed_Gtid_Set: 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6,
c12440a7-12c1-11f1-95ad-00163e379f11:1-43804:43806-84055,
f8afd36c-12c1-11f1-a6e8-00163e892deb:1
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
       Master_public_key_path: 
        Get_master_public_key: 0
            Network_Namespace: 
1 row in set, 1 warning (0.00 sec)

slave 端 reset slave 后，再 start slave，io 线程无法启动，报错如下：

2026-02-27T11:04:19.165355+08:00 31406 [ERROR] [MY-010557] [Repl] Error reading packet from server for channel '': Cannot replicate because the master purged required binary logs. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the master's binary log expiration period. The GTID set sent by the slave is '0b039ab6-38ac-350f-b814-0a612d2a2859:1-6, c12440a7-12c1-11f1-95ad-00163e379f11:1-43804:43806-109033, f8afd36c-12c1-11f1-a6e8-00163e892deb:1', and the missing transactions are 'c12440a7-12c1-11f1-95ad-00163e379f11:43805' (server_errno=1236)
2026-02-27T11:04:19.165486+08:00 31406 [ERROR] [MY-013114] [Repl] Slave I/O for channel '': Got fatal error 1236 from master when reading data from binary log: 'Cannot replicate because the master purged required binary logs. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the master's binary log expiration period. The GTID set sent by the slave is '0b039ab6-38ac-350f-b814-0a612d2a2859:1-6, c12440a7-12c1-11f1-95ad-00163e379f11:1-43804:43806-109033, f8afd36c-12c1-11f1-a6e8-00163e892deb:1', and the missing transactions are 'c12440a7-12c1-11f1-95ad-00163e379f11:43805'', Error_code: MY-013114

报错信息和用户现场的报错信息相同。

3.3 实验 2：slave_skip_errors=OFF 重复实验

搭建异步复制的环境，slave 端设置并行复制，参数 slave_skip_errors=OFF

检查 slave 端的参数

GreatSQL> SELECT @@global.slave_parallel_workers,@@global.innodb_lock_wait_timeout,@@global.slave_skip_errors\G
*************************** 1. row ***************************
  @@global.slave_parallel_workers: 64
@@global.innodb_lock_wait_timeout: 10
       @@global.slave_skip_errors: OFF
1 row in set, 2 warnings (0.01 sec)

在 slave 端，开始事务执行 update 语句，不提交；

BEGIN;
UPDATE test1 SET c='from slave222' WHERE id=2;

在 master 端执行修改相同行的 update 语句，该语句在 slave 端应用时，会被阻塞。

UPDATE test1 SET c='from master1111' WHERE id=2;

在 master 端使用 sysbench 执行其他事务。
检查 slave 的状态，发现 sql 线程已停止。

GreatSQL> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: 172.17.138.128
                  Master_User: greatdb
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: binlog.000004
          Read_Master_Log_Pos: 61931905
               Relay_Log_File: relaylog.000004
                Relay_Log_Pos: 411
        Relay_Master_Log_File: binlog.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1205
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'c12440a7-12c1-11f1-95ad-00163e379f11:109034' at master log binlog.000004, end_log_pos 43620845. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 43620411
              Relay_Log_Space: 18312445
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1205
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'c12440a7-12c1-11f1-95ad-00163e379f11:109034' at master log binlog.000004, end_log_pos 43620845. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1006503306
                  Master_UUID: c12440a7-12c1-11f1-95ad-00163e379f11
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 260227 11:18:58
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: c12440a7-12c1-11f1-95ad-00163e379f11:109034-119519
            Executed_Gtid_Set: 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6,
c12440a7-12c1-11f1-95ad-00163e379f11:1-109033,
f8afd36c-12c1-11f1-a6e8-00163e892deb:1
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
       Master_public_key_path: 
        Get_master_public_key: 0
            Network_Namespace: 
1 row in set, 1 warning (0.00 sec)


GreatSQL> SELECT * from performance_schema.replication_applier_status_by_worker limit 1\G
*************************** 1. row ***************************
                                           CHANNEL_NAME: 
                                              WORKER_ID: 1
                                              THREAD_ID: NULL
                                          SERVICE_STATE: OFF
                                      LAST_ERROR_NUMBER: 1205
                                     LAST_ERROR_MESSAGE: Worker 1 failed executing transaction 'c12440a7-12c1-11f1-95ad-00163e379f11:109034' at master log binlog.000004, end_log_pos 43620845; Lock wait timeout exceeded; try restarting transaction
                                   LAST_ERROR_TIMESTAMP: 2026-02-27 11:18:58.506075
                               LAST_APPLIED_TRANSACTION: 
     LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
    LAST_APPLIED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
         LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP: 0000-00-00 00:00:00.000000
           LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP: 0000-00-00 00:00:00.000000
                                   APPLYING_TRANSACTION: c12440a7-12c1-11f1-95ad-00163e379f11:109034
         APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 2026-02-27 11:16:28.454244
        APPLYING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 2026-02-27 11:16:28.454244
             APPLYING_TRANSACTION_START_APPLY_TIMESTAMP: 2026-02-27 11:16:28.463083
                 LAST_APPLIED_TRANSACTION_RETRIES_COUNT: 0
   LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER: 0
  LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE: 
LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000
                     APPLYING_TRANSACTION_RETRIES_COUNT: 10
       APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER: 1205
      APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE: Lock wait timeout exceeded; try restarting transaction
    APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP: 2026-02-27 11:18:43.502374
1 row in set (0.00 sec)

可见：由于锁等待超时，事务 c12440a7-12c1-11f1-95ad-00163e379f11:109034 无法执行，从而导致 slave 的 sql 线程异常退出，可以有机会修复异常数据。

slave 端的事务提交或回滚后，阻塞解除。启动 slave，事务开始正常应用。

GreatSQL> START slave;
Query OK, 0 rows affected, 1 warning (0.48 sec)

GreatSQL> SHOW slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: 172.17.138.128
                  Master_User: greatdb
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: binlog.000004
          Read_Master_Log_Pos: 61931905
               Relay_Log_File: relaylog.000004
                Relay_Log_Pos: 875819
        Relay_Master_Log_File: binlog.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
......

-- gtid已追齐
GreatSQL> SELECT @@gtid_executed\G
*************************** 1. row ***************************
@@gtid_executed: 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6,
c12440a7-12c1-11f1-95ad-00163e379f11:1-119519,
f8afd36c-12c1-11f1-a6e8-00163e892deb:1
1 row in set (0.00 sec)

-- 数据已同步
GreatSQL> SELECT * FROM test1 WHERE id=2\G
*************************** 1. row ***************************
 id: 2
  k: 50248
  c: from master1111
pad: 28733802923-10548894641-11867531929-71265603657-36546888392
1 row in set (0.00 sec)

总结

在生产数据库上要保持 slave_skip_errors 的默认值 OFF，避免主从数据不一致的风险。
对于常见的错误，也需要仔细观察，从蛛丝马迹中找出问题的根本原因。

收藏 0 邀请举报

GreatSQL社区

参数配置不当导致 GreatSQL 异步复制 IO 线程中断

参数配置不当导致 GreatSQL 异步复制 IO 线程中断

1.问题发生

2. 问题分析

2.1 error.log 解读

2.2 初步判断

3. 实验验证

3.1 实验设计

3.2 实验 1：slave_skip_errors=ALL 模拟问题

3.3 实验 2：slave_skip_errors=OFF 重复实验

总结

全部作者的其他最新博客

评论 (0 个评论)