M_ect posted on 2024-3-2 17:52:13

Help: MGR cluster member nodes restart sporadically. Is this a bug, and how can it be fixed?

Environment:

10.192.100.7 [(none)]>select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+----------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST    | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+----------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 0662e063-3979-11ed-8f56-fa163ed3cf9f | 10.192.100.7   |        3306 | ONLINE       | PRIMARY     | 8.0.25         |
| group_replication_applier | 5763fc4b-3971-11ed-b618-fa163e7f3a9d | 10.192.100.5   |        3306 | ONLINE       | SECONDARY   | 8.0.25         |
| group_replication_applier | 78e28a24-3a25-11ed-a2a5-fa163eeffbb7 | 10.192.100.104 |        3306 | ONLINE       | SECONDARY   | 8.0.25         |
| group_replication_applier | 9249758b-3a26-11ed-b971-fa163ea219af | 10.192.100.105 |        3306 | ONLINE       | SECONDARY   | 8.0.25         |
| group_replication_applier | b538e8d4-3a23-11ed-a85e-fa163e7241c3 | 10.192.100.100 |        3306 | ONLINE       | SECONDARY   | 8.0.25         |
| group_replication_applier | eb7041cd-3975-11ed-9edf-fa163ece9e4f | 10.192.100.6   |        3306 | ONLINE       | SECONDARY   | 8.0.25         |
+---------------------------+--------------------------------------+----------------+-------------+--------------+-------------+----------------+


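Problem:
Member nodes of the MGR cluster restart sporadically, one after another, at random times. When the primary node restarts, a failover is triggered. The cluster ran stably until 2024-02-29; since then, a node has crashed abnormally every day.

Error log: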


2024-03-02T16:58:55.643219+08:00 5528456 Aborted connection 5528456 to db: 'cosdb' user: 'm_qynacos' host: '10.172.116.19' (The client was disconnected by the server because of inactivity.).
2024-03-02T16:59:54.391143+08:00 5528497 The wait_timeout period was exceeded, the idle time since last command was too long.
2024-03-02T16:59:54.391324+08:00 5528497 Aborted connection 5528497 to db: 'cosdb' user: 'm_qynacos' host: '10.172.116.17' (The client was disconnected by the server because of inactivity.).
2024-03-02T17:08:00.902807+08:00 5528827 The wait_timeout period was exceeded, the idle time since last command was too long.
2024-03-02T17:08:00.902961+08:00 5528827 Aborted connection 5528827 to db: 'wmh_nacos' user: 'm_dpqwmh' host: '10.192.134.94' (The client was disconnected by the server because of inactivity.).
2024-03-02T17:12:34.697494+08:00 0 Plugin group_replication reported: ' Failure reading from fd=306 n=18446744073709551615'
2024-03-02T17:12:34.697703+08:00 0 Plugin group_replication reported: ' set CON_NULL for fd:306 in close_connection'
2024-03-02T17:12:34.697813+08:00 0 Plugin group_replication reported: ' set CON_NULL for fd:305 in close_connection'
2024-03-02T17:12:34.741832+08:00 0 Plugin group_replication reported: ' dial for server:10.192.100.100, port:33061'
2024-03-02T17:12:34.841833+08:00 0 Plugin group_replication reported: ' dial for server:10.192.100.100, port:33061'
2024-03-02T17:12:34.951165+08:00 0 Plugin group_replication reported: ' dial for server:10.192.100.100, port:33061'
2024-03-02T17:12:35.072039+08:00 0 Plugin group_replication reported: ' dial for server:10.192.100.100, port:33061'
2024-03-02T17:12:35.204713+08:00 0 Plugin group_replication reported: ' dial for server:10.192.100.100, port:33061'
2024-03-02T17:12:35.351111+08:00 0 Plugin group_replication reported: ' dial for server:10.192.100.100, port:33061'
2024-03-02T17:12:35.511846+08:00 0 Plugin group_replication reported: ' dial for server:10.192.100.100, port:33061'
2024-03-02T17:12:35.688495+08:00 0 Plugin group_replication reported: ' dial for server:10.192.100.100, port:33061'
2024-03-02T17:12:35.883306+08:00 0 Plugin group_replication reported: ' dial for server:10.192.100.100, port:33061'
09:12:35 UTC - mysqld got signal 11 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.

Build ID: 27866eda5d4c17a122ebc5640c8848dc0f598d05
Server Version: 8.0.25-17 GreatSQL (GPL), Release 17, Revision 4733775f703

Thread pointer: 0x7f5fb3727000
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f5fa38c7570 thread_stack 0x80000
2024-03-02T17:12:35.984086+08:00 0 Plugin group_replication reported: ' call send_my_view in detector'
2024-03-02T17:12:35.984145+08:00 0 Plugin group_replication reported: ' notify is set true in check_local_node_set'
2024-03-02T17:12:35.984169+08:00 0 Plugin group_replication reported: ' call deliver_view_msg in detector'
2024-03-02T17:12:35.984240+08:00 0 Plugin group_replication reported: ' xcom_receive_local_view is called'
2024-03-02T17:12:35.984308+08:00 0 Plugin group_replication reported: 'on_suspicions is activated'
2024-03-02T17:12:35.984385+08:00 0 Plugin group_replication reported: 'Member with address 10.192.100.100:3306 has become unreachable.'
2024-03-02T17:12:35.984472+08:00 0 Plugin group_replication reported: 'on_suspicions is called over'
2024-03-02T17:12:35.984500+08:00 0 Plugin group_replication reported: ' xcom_receive_local_view return true'
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x2e)
/usr/sbin/mysqld(handle_fatal_signal+0x3a3)
/lib64/libpthread.so.0(+0x13240)
/usr/sbin/mysqld(Security_context::user() const+0)
/usr/sbin/mysqld(Fill_process_list::operator()(THD*)+0x30)
/usr/sbin/mysqld(Global_THD_manager::do_for_all_thd_copy(Do_THD_Impl*)+0x12b)
/usr/sbin/mysqld()
/usr/sbin/mysqld(do_fill_information_schema_table(THD*, TABLE_LIST*, Item*)+0x76)
2024-03-02T17:12:36.002873+08:00 0 Plugin group_replication reported: ' ::xcom_receive_global_view() is called'
/usr/sbin/mysqld(MaterializeInformationSchemaTableIterator::Init()+0x71)
/usr/sbin/mysqld(TemptableAggregateIterator::Init()+0x61)
/usr/sbin/mysqld(filesort(THD*, Filesort*, RowIterator*, unsigned long, unsigned long long, Filesort_info*, Sort_result*, unsigned long long*)+0x34f)
/usr/sbin/mysqld(SortingIterator::DoSort()+0x5f)
/usr/sbin/mysqld(SortingIterator::Init()+0x21)
/usr/sbin/mysqld(Query_expression::ExecuteIteratorQuery(THD*)+0x28a)
/usr/sbin/mysqld(Query_expression::execute(THD*)+0x2f)
/usr/sbin/mysqld(Sql_cmd_dml::execute(THD*)+0x4e6)
/usr/sbin/mysqld(mysql_execute_command(THD*, bool)+0x9f3)
/usr/sbin/mysqld(dispatch_sql_command(THD*, Parser_state*, bool)+0x4c5)
/usr/sbin/mysqld(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0xf87)
/usr/sbin/mysqld(do_command(THD*)+0x1f0)
/usr/sbin/mysqld()
/usr/sbin/mysqld()
/lib64/libpthread.so.0(+0x8f2b)
/lib64/libc.so.6(clone+0x3f)

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f5fb378cbe8): select user as username,db as dbname, substring_index(host, ':',1) as hostname, count(*) as counts from information_schema.processlist where db is not null group by username,dbname,hostname order by 4 desc,1 desc,2,3
Connection ID (thread ID): 5531044
Status: NOT_KILLED

Please help us make Percona Server better by reporting any
bugs at https://bugs.percona.com/

You may download the Percona Server operations manual by visiting
http://www.percona.com/software/percona-server/. You may find information
in the manual which will help you identify the cause of the crash.
2024-03-02T17:12:38.602050+08:00 0 Basedir set to /usr/.
2024-03-02T17:12:38.602068+08:00 0 /usr/sbin/mysqld (mysqld 8.0.25-17) starting as process 1380909
2024-03-02T17:12:38.604612+08:00 0 No argument was provided to --log-bin, and --log-bin-index was not used; so replication may break when this MySQL server acts as a master and has his hostname changed!! Please use '--log-bin=sx-zq-qyzx-mysql-1-bin' to avoid this problem.
2024-03-02T17:12:38.609146+08:00 0 Using Linux native AIO
2024-03-02T17:12:38.609981+08:00 0 Plugin 'FEDERATED' is disabled.
2024-03-02T17:12:38.612633+08:00 1 InnoDB initialization has started.
2024-03-02T17:12:38.612724+08:00 1 Atomic write enabled
2024-03-02T17:12:38.612811+08:00 1 PUNCH HOLE support available
2024-03-02T17:12:38.612874+08:00 1 Uses event mutexes
2024-03-02T17:12:38.612925+08:00 1 GCC builtin __atomic_thread_fence() is used for memory barrier
2024-03-02T17:12:38.612997+08:00 1 Compressed tables use zlib 1.2.11
2024-03-02T17:12:38.616821+08:00 1 Number of pools: 1
2024-03-02T17:12:38.617050+08:00 1 Using CPU crc32 instructions
2024-03-02T17:12:38.617633+08:00 1 Directories to scan './'
2024-03-02T17:12:38.617762+08:00 1 Scanning './'
2024-03-02T17:12:38.796306+08:00 1 Completed space ID check of 187 files.
2024-03-02T17:12:38.800618+08:00 1 Initializing buffer pool, total size = 38.000000G, instances = 16, chunk size =128.000000M
2024-03-02T17:12:40.382854+08:00 1 Completed initialization of buffer pool
2024-03-02T17:12:40.567447+08:00 0 If the mysqld execution user is authorized, page cleaner and LRU manager thread priority can be changed. See the man page of setpriority().
2024-03-02T17:12:40.619047+08:00 1 Using './#ib_16384_0.dblwr' for doublewrite
2024-03-02T17:12:40.626874+08:00 1 Using './#ib_16384_1.dblwr' for doublewrite
2024-03-02T17:12:40.690126+08:00 1 Double write buffer files: 2
2024-03-02T17:12:40.690286+08:00 1 Double write buffer pages per instance: 4
2024-03-02T17:12:40.690407+08:00 1 Using './#ib_16384_0.dblwr' for doublewrite
2024-03-02T17:12:40.690540+08:00 1 Using './#ib_16384_1.dblwr' for doublewrite
2024-03-02T17:12:40.692288+08:00 1 The log sequence number 271787107308 in the system tablespace does not match the log sequence number 428839438600 in the ib_logfiles!
2024-03-02T17:12:40.692426+08:00 1 Database was not shutdown normally!
2024-03-02T17:12:40.692559+08:00 1 Starting crash recovery.
2024-03-02T17:12:40.736837+08:00 1 Starting to parse redo log at lsn = 428839438351, whereas checkpoint_lsn = 428839438600 and start_lsn = 428839438336
2024-03-02T17:12:40.737029+08:00 1 Doing recovery: scanned up to log sequence number 428839439301
2024-03-02T17:12:40.845062+08:00 1 Log background threads are being started...
2024-03-02T17:12:40.845554+08:00 1 Applying a batch of 6 redo log records ...
2024-03-02T17:12:40.899997+08:00 1 100%
2024-03-02T17:12:41.400390+08:00 1 Apply batch completed!
2024-03-02T17:12:41.402139+08:00 1 Using undo tablespace './undo_001'.
2024-03-02T17:12:41.403899+08:00 1 Using undo tablespace './undo_002'.
2024-03-02T17:12:41.405787+08:00 1 Opened 2 existing undo tablespaces.
2024-03-02T17:12:41.405962+08:00 1 GTID recovery trx_no: 1142733369
2024-03-02T17:12:42.287325+08:00 1 Removed temporary tablespace data file: "ibtmp1"
2024-03-02T17:12:42.287539+08:00 1 Creating shared tablespace for temporary tables
2024-03-02T17:12:42.287750+08:00 1 Setting file './ibtmp1' size to 512 MB. Physically writing the file full; Please wait ...
2024-03-02T17:12:42.771859+08:00 1 Setting log file ./ibtmp1 size to 512 MB. Progress : 10%
2024-03-02T17:12:43.263933+08:00 1 Setting log file ./ibtmp1 size to 512 MB. Progress : 20%
2024-03-02T17:12:43.761537+08:00 1 Setting log file ./ibtmp1 size to 512 MB. Progress : 30%
2024-03-02T17:12:44.252156+08:00 1 Setting log file ./ibtmp1 size to 512 MB. Progress : 40%
2024-03-02T17:12:44.729861+08:00 1 Setting log file ./ibtmp1 size to 512 MB. Progress : 50%
2024-03-02T17:12:45.210180+08:00 1 Setting log file ./ibtmp1 size to 512 MB. Progress : 60%
2024-03-02T17:12:45.713742+08:00 1 Setting log file ./ibtmp1 size to 512 MB. Progress : 70%
2024-03-02T17:12:46.224233+08:00 1 Setting log file ./ibtmp1 size to 512 MB. Progress : 80%
2024-03-02T17:12:46.709271+08:00 1 Setting log file ./ibtmp1 size to 512 MB. Progress : 90%
2024-03-02T17:12:47.193564+08:00 1 Setting log file ./ibtmp1 size to 512 MB. Progress : 100%
2024-03-02T17:12:47.196697+08:00 1 File './ibtmp1' size is now 512 MB.
2024-03-02T17:12:47.197084+08:00 1 Scanning temp tablespace dir:'./#innodb_temp/'
2024-03-02T17:12:47.245160+08:00 1 Created 128 and tracked 128 new rollback segment(s) in the temporary tablespace. 128 are now active.
2024-03-02T17:12:47.245736+08:00 0 Page cleaner took 6679ms to flush 0 pages
2024-03-02T17:12:47.246861+08:00 1 Percona XtraDB (http://www.percona.com) 8.0.25-15 started; log sequence number 428839439301
2024-03-02T17:12:47.248688+08:00 1 InnoDB initialization has ended.
2024-03-02T17:12:47.272434+08:00 1 Data dictionary restarting version '80023'.
2024-03-02T17:12:47.555092+08:00 1 Reading DD tablespace files
2024-03-02T17:12:47.862076+08:00 1 Scanned 189 tablespaces. Validated 189.
2024-03-02T17:12:47.904645+08:00 1 Using data dictionary with version '80023'.
2024-03-02T17:12:47.928040+08:00 0 Plugin mysqlx reported: 'IPv6 is available'
2024-03-02T17:12:47.937945+08:00 0 Plugin mysqlx reported: 'X Plugin ready for connections. bind-address: '::' port: 33060'
2024-03-02T17:12:47.938412+08:00 0 Plugin mysqlx reported: 'X Plugin ready for connections. socket: '/var/lib/mysql/mysqlx.sock''
2024-03-02T17:12:47.938836+08:00 0 X Plugin ready for connections. Bind-address: '::' port: 33060, socket: /var/lib/mysql/mysqlx.sock
2024-03-02T17:12:47.956031+08:00 0 Plugin group_replication reported: 'Plugin 'group_replication' is starting.'
2024-03-02T17:12:47.956512+08:00 0 Plugin group_replication reported: 'Current debug options are: 'GCS_DEBUG_NONE'.'
2024-03-02T17:12:47.956940+08:00 0 The syntax 'group_replication_ip_whitelist' is deprecated and will be removed in a future release. Please use group_replication_ip_allowlist instead.
2024-03-02T17:12:48.016190+08:00 0 Thread priority attribute setting in Resource Group SQL shall be ignored due to unsupported platform or insufficient privilege.
2024-03-02T17:12:48.029414+08:00 0 Recovering after a crash using sx-zq-qyzx-mysql-1-bin
2024-03-02T17:12:48.216555+08:00 0 Starting XA crash recovery...
2024-03-02T17:12:48.228025+08:00 0 XA crash recovery finished.
2024-03-02T17:12:48.234317+08:00 0 DDL log recovery : begin
2024-03-02T17:12:48.236091+08:00 0 DDL log recovery : end
2024-03-02T17:12:48.236810+08:00 0 Loading buffer pool(s) from /data/mysql/ib_buffer_pool
2024-03-02T17:12:48.252400+08:00 0 Waiting for purge to start
2024-03-02T17:12:48.341684+08:00 0 Buffer pool(s) load completed at 240302 17:12:48
2024-03-02T17:12:48.502083+08:00 0 Found ca.pem, server-cert.pem and server-key.pem in data directory. Trying to enable SSL support using them.
2024-03-02T17:12:48.502453+08:00 0 Skipping generation of SSL certificates as certificate files are present in data directory.
2024-03-02T17:12:48.503553+08:00 0 CA certificate ca.pem is self signed.
2024-03-02T17:12:48.503868+08:00 0 Channel mysql_main configured to support TLS. Encrypted connections are now supported for this channel.
2024-03-02T17:12:48.504185+08:00 0 Skipping generation of RSA key pair through --sha256_password_auto_generate_rsa_keys as key files are present in data directory.

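yejr posted on 2024-3-4 09:11:49

This looks like it may be a bug triggered by the InnoDB parallel query feature. Please check the setting of the option below; if it is ON, change it to OFF and observe for a while.
force_parallel_execute
In addition, we recommend upgrading to the latest GreatSQL release, 8.0.32-25.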
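
For reference, a minimal sketch of checking and switching off the option named in this reply. It assumes force_parallel_execute can be changed at runtime with SET GLOBAL on this build; if the server rejects the statement, set it in my.cnf and restart instead. Sessions that are already connected may keep their own session value until they reconnect, and the change would need to be made on every member of the group.

```sql
-- Check the current value on this member (run on each node).
SHOW GLOBAL VARIABLES LIKE 'force_parallel_execute';

-- Assumption: the variable is dynamic and has global scope on this build.
-- Turn parallel query off for new connections.
SET GLOBAL force_parallel_execute = OFF;

-- Confirm the change took effect.
SHOW GLOBAL VARIABLES LIKE 'force_parallel_execute';
```

SET GLOBAL does not survive a restart, so the same setting would also need to go into the configuration file if it turns out to help.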

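M_ect posted on 2024-3-8 10:58:47

yejr posted on 2024-3-4 09:11
This looks like it may be a bug triggered by the InnoDB parallel query feature. Please check the setting of the option below; if it is ON, change it to OFF and observe for a ...

Thanks!

yejr posted on 2024-3-8 13:54:01

M_ect posted on 2024-3-8 10:58
Thanks!

After disabling PQ (parallel query) or upgrading, please report back if anything else comes up.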

fander posted on 2024-3-14 14:47:21

The yellow is blinding my eyes.