Testing GreatSQL MGR: secondary node lag grows too high, then the node crashes

yejr · 2023-11-23 11:20:08

You are looking at the wrong metric.
For MGR monitoring metrics, see https://greatsql.cn/docs/8032/us ... r%E7%9B%91%E6%8E%A7

凯文-杜兰特 · 2023-11-23 14:32:11

yejr wrote on 2023-11-23 11:20:
You are looking at the wrong metric.
For MGR monitoring metrics, see https://greatsql.cn/docs/8032/user-manual/6-oper-guide/3-monitoring-a ...

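OK, thanks. One question, though: look at the fourth column in my screenshot above. On the secondary node, the "size of the queue of transactions waiting to be certified" is already above 450,000. The page you linked says a value this large should trigger an alert:

"SELECT MEMBER_ID AS id, COUNT_TRANSACTIONS_IN_QUEUE AS trx_tobe_certified, ..." and "If the relaylog_tobe_applied value on a node is particularly large, pay attention to it and check whether the workload on that node is too heavy or whether the server configuration is problematic."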

yejr · 2023-11-23 15:11:26

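凯文-杜兰特 wrote on 2023-11-23 14:32:
OK, thanks. One question, though: look at the fourth column in my screenshot above. On the secondary node, the "size of the queue of transactions waiting to be certified" is already above 450,000 ...

Please read the metric in the documentation carefully and check whether it is the same one you are querying :)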
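
For readers following along: the query quoted above aliases COUNT_TRANSACTIONS_IN_QUEUE as trx_tobe_certified, while the warning text talks about relaylog_tobe_applied. The quoted query is truncated, so the sketch below assumes relaylog_tobe_applied aliases COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE; both are columns of performance_schema.replication_group_member_stats. The function name, the threshold, and the sample row are hypothetical and only illustrate how to report the two backlogs separately instead of reading one for the other.

```python
# Sketch only: tells apart the two backlog counters discussed in this thread.
# Column names come from performance_schema.replication_group_member_stats.
# The mapping for relaylog_tobe_applied is an ASSUMPTION, because the query
# quoted in the thread is truncated.
ALIASES = {
    "trx_tobe_certified": "COUNT_TRANSACTIONS_IN_QUEUE",
    "relaylog_tobe_applied": "COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE",
}

# Hypothetical alert threshold, chosen for illustration only.
DEFAULT_THRESHOLD = 10_000


def backlogged_queues(row, threshold=DEFAULT_THRESHOLD):
    """Return the aliases whose counter in `row` exceeds `threshold`."""
    return [
        alias
        for alias, column in ALIASES.items()
        if int(row.get(column, 0)) > threshold
    ]


if __name__ == "__main__":
    # Illustrative row: 450,000 mirrors the certification queue size reported
    # in the thread; the applier-queue value is made up.
    sample = {
        "MEMBER_ID": "secondary",
        "COUNT_TRANSACTIONS_IN_QUEUE": 450_000,
        "COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE": 0,
    }
    print(backlogged_queues(sample))
```

Feeding it one row per member from the monitoring query makes it explicit which of the two queues is growing on which node.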

凯文-杜兰特 · 2023-11-23 15:34:34

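yejr wrote on 2023-11-23 15:11:
Please read the metric in the documentation carefully and check whether it is the same one you are querying :)

Did you look at the SQL? Are you on the development team? Please take a careful look at the second screenshot in my post; I circled it.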

yejr · 2023-11-23 16:26:30

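凯文-杜兰特 wrote on 2023-11-23 15:34:
Did you look at the SQL? Are you on the development team? Please take a careful look at the second screenshot in my post; I circled it. ...

Please provide a few more details:
1. What are the configurations of the primary and secondary nodes?
2. What are the benchmark parameters?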

凯文-杜兰特 · 2023-11-23 18:03:38

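Last edited by 凯文-杜兰特 on 2023-11-24 09:16

yejr wrote on 2023-11-23 16:26:
Please provide a few more details:
1. What are the configurations of the primary and secondary nodes?
2. What are the benchmark parameters? ...

The configuration file is as follows: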

[mysqld]

port=6144

#mgr settings
loose-plugin_load_add='mysql_clone.so'
loose-plugin_load_add='group_replication.so'
loose-group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaa1"
# MGR local node IP:PORT; replace with your own
loose-group_replication_local_address="xxxx1:19244"
# IP:PORT of all MGR cluster nodes; replace with your own
loose-group_replication_group_seeds='xxxx1:19244,xxxx2:19244'
loose-group_replication_start_on_boot = ON
loose-group_replication_bootstrap_group = OFF
loose-group_replication_exit_state_action = READ_ONLY
loose-group_replication_flow_control_mode = "DISABLED"
loose-group_replication_single_primary_mode = ON
loose-group_replication_communication_max_message_size = 10M
loose-group_replication_transaction_size_limit = 3G
loose-group_replication_arbitrator = 0
loose-group_replication_single_primary_fast_mode = 1
loose-group_replication_request_time_threshold = 20000
report_host = "xxxx1"

gtid_mode=ON
enforce_gtid_consistency=ON

core-file
default_authentication_plugin=mysql_native_password

binlog_checksum=NONE
binlog_rows_query_log_events=ON
binlog_stmt_cache_size=32768

#innodb
innodb_adaptive_hash_index=1
innodb_data_file_path=ibdata1:100M;ibdata2:200M:autoextend
innodb_buffer_pool_instances=8
innodb_log_files_in_group=4
innodb_log_file_size=2G
innodb_log_buffer_size=200M
innodb_flush_log_at_trx_commit=1
innodb_max_dirty_pages_pct=60
innodb_io_capacity_max=10000
innodb_io_capacity=6000
innodb_thread_concurrency=64
innodb_read_io_threads=8
innodb_write_io_threads=8
innodb_open_files=615350
innodb_file_per_table=1
innodb_flush_method=O_DIRECT
innodb_change_buffering=none
innodb_adaptive_flushing=1
#innodb_adaptive_flushing_method=keep_average #percona
#innodb_adaptive_hash_index_partitions=1    #percona
#innodb_fast_checksum=1                      #percona
#innodb_lazy_drop_table=0                   #percona
innodb_old_blocks_time=1000
innodb_stats_on_metadata=0
innodb_use_native_aio=1
innodb_lock_wait_timeout=50
innodb_rollback_on_timeout=0
innodb_purge_threads=1
innodb_strict_mode=1
#transaction-isolation=READ-COMMITTED
innodb_disable_sort_file_cache=ON
innodb_lru_scan_depth=2048
innodb_flush_neighbors=0
innodb_sync_array_size=16
innodb_print_all_deadlocks
innodb_checksum_algorithm=CRC32
innodb_max_dirty_pages_pct_lwm=10
innodb_buffer_pool_size=160G
#innodb_adaptive_hash_index=ON
#myisam
concurrent_insert=2
delayed_insert_timeout=300

#replication
slave_type_conversions="ALL_NON_LOSSY"
slave_net_timeout=4
skip-slave-start=OFF
sync_master_info=10000
sync_relay_log_info=1
master_info_repository=TABLE
relay_log_info_repository=TABLE
relay_log_recovery=0
slave_exec_mode=STRICT
slave_parallel_type=LOGICAL_CLOCK
slave-parallel-workers=32

#binlog
server_id=198761
binlog_cache_size=32K
max_binlog_cache_size=2147483648
max_binlog_size=500M
max_relay_log_size=500M
relay_log_purge=OFF
binlog-format=ROW
sync_binlog=1
sync_relay_log=1
log-slave-updates=1
expire_logs_days=0
rpl_stop_slave_timeout=300
slave_checkpoint_group=1024
slave_checkpoint_period=300
slave_pending_jobs_size_max=1073741824
slave_rows_search_algorithms='TABLE_SCAN,INDEX_SCAN'
slave_sql_verify_checksum=OFF
master_verify_checksum=OFF

# parallel replay
binlog_transaction_dependency_tracking = WRITESET
transaction_write_set_extraction = XXHASH64

default-storage-engine=INNODB
character-set-server=utf8mb4
lower_case_table_names=1
skip-external-locking
open_files_limit=615350
safe-user-create
local-infile=1
sql_mode='NO_ENGINE_SUBSTITUTION'
performance_schema=1
log_slow_admin_statements=1
long_query_time=1
slow_query_log=0
general_log=0

table_definition_cache=32768
eq_range_index_dive_limit=200
table_open_cache_instances=16
table_open_cache=32768

thread_stack=1024k
binlog_cache_size=32K
net_buffer_length=16384
thread_cache_size=256
read_rnd_buffer_size=2M
sort_buffer_size=2M
join_buffer_size=2M
read_buffer_size=2M

# skip-name-resolve
max_connections=36000
max_allowed_packet=1073741824
connect_timeout=8
net_read_timeout=30
net_write_timeout=60
back_log=1024

log_queries_not_using_indexes=0
log_timestamps=SYSTEM
innodb_read_ahead_threshold=0

innodb_doublewrite=1

The secondary's configuration differs only in these three settings:

loose-group_replication_local_address
report_host
server_id

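My benchmark scenario: TPCC with 200 warehouses and 200 terminals, on fairly powerful servers. I just ran it again, and the secondary node's replay progress does not even reach 1/100. Is the bottleneck in the wait for certification?

Attachment: 回放监控.png (replay monitoring screenshot)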

yejr · 2023-11-26 14:09:35

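凯文-杜兰特 wrote on 2023-11-23 18:03:
The configuration file is as follows:

The secondary's configuration differs only in these three settings:

1. First, change innodb_thread_concurrency from 64 to 0.
2. What is the hardware configuration of the two nodes?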

凯文-杜兰特 · 2023-11-29 16:11:06

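yejr wrote on 2023-11-26 14:09:
1. First, change innodb_thread_concurrency from 64 to 0.
2. What is the hardware configuration of the two nodes? ...

OK, thanks. Each node has 80 CPU cores, 300 GB of memory, and NVMe SSD storage; the two nodes have identical hardware.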

yejr · 2023-11-29 16:38:22

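凯文-杜兰特 wrote on 2023-11-29 16:11:
OK, thanks. Each node has 80 CPU cores, 300 GB of memory, and NVMe SSD storage; the two nodes have identical hardware.

It looks quite likely that innodb_thread_concurrency is the limiting factor. Change it to 0 first and see.

If there is still a bottleneck, please provide output from vmstat, top, perf top, and similar tools.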
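
As a small aid for applying the suggestion above on both nodes, here is a minimal sketch that reads a my.cnf-style text and reports whether innodb_thread_concurrency is capped (in MySQL, 0 is the default and means no limit). The function names and the sample text are hypothetical. The parser settings are chosen to tolerate what the posted file contains: bare flags such as core-file, and a key that appears twice (binlog_cache_size).

```python
import configparser

# Trimmed, illustrative excerpt in the style of the configuration posted above.
SAMPLE = """
[mysqld]
core-file
innodb_thread_concurrency=64
binlog_cache_size=32K
binlog_cache_size=32K
"""


def mysqld_option(cnf_text, name):
    """Return the last value given for `name` in [mysqld], or None if absent."""
    parser = configparser.ConfigParser(
        allow_no_value=True,  # bare flags such as core-file carry no value
        strict=False,         # repeated keys are accepted; the last one wins
        interpolation=None,   # treat '%' literally
    )
    # Treat '-' and '_' alike in option names, as in slave-parallel-workers.
    parser.optionxform = lambda key: key.lower().replace("-", "_")
    parser.read_string(cnf_text)
    return parser.get("mysqld", name, fallback=None)


def concurrency_is_capped(cnf_text):
    """True if innodb_thread_concurrency is explicitly set to a non-zero value."""
    value = mysqld_option(cnf_text, "innodb_thread_concurrency")
    return value is not None and int(value) != 0


if __name__ == "__main__":
    print(mysqld_option(SAMPLE, "innodb_thread_concurrency"))
    print(concurrency_is_capped(SAMPLE))
```

This only inspects the file; the value actually in effect on a running server should still be confirmed with SHOW VARIABLES.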

[Resolved] Testing GreatSQL MGR: secondary node lag grows too high, then the node crashes
