|
First, let's look at the install.conf configuration file used to add the witness node:
[expand]
expand_type="1" # The node type of the standby/witness node to be added to the cluster. 0:standby 1:witness
primary_ip="192.168.200.145" # The IP address of the cluster primary node that the new standby/witness node will join.
expand_ip="192.168.200.9" # The IP address of the standby/witness node to be added to the cluster.
node_id="3" # The node_id of the standby/witness node to be added. It must not duplicate any node_id already in the cluster.
# for example: node_id="3"
## For detailed instructions, see the [install] section
install_dir="/KingbaseES/V9/cluster" # the last directory level must not end with '/'
ssh_port="22" # the port of ssh, default is 22
scmd_port="8890"
Note that expand_type must be set to "1" (witness), and that the last level of install_dir must not end with a slash. The rest of the configuration for adding a witness node is simple, so I won't go through it line by line. To help you succeed on the first deployment attempt, let me first list the pitfalls I ran into during installation, so you can check them off one by one before starting:
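The trailing-slash rule is easy to trip over. Below is a small illustrative sketch of how such a check could be done in shell; `check_dir` is my own hypothetical helper, not part of the installer:

```shell
# Illustrative helper (not part of cluster_install.sh): reject a
# directory path whose last level ends with '/'.
check_dir() {
  case "$1" in
    */) echo "BAD: install_dir must not end with '/'" ;;
    *)  echo "OK" ;;
  esac
}

check_dir "/KingbaseES/V9/cluster"    # → OK
check_dir "/KingbaseES/V9/cluster/"   # → BAD: install_dir must not end with '/'
```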
1. The install directory /KingbaseES/V9/cluster must be empty. If it is not, clean it out, for example with rm -rf /KingbaseES/V9/cluster/kingbase. Remember: clear the contents of the install directory, do not delete the install directory itself.
2. The data directory must also be empty, with no files in it; if there are any, clean them out too. If your data directory lives under the install directory, step 1 has already cleared it. If it was created separately outside the install directory, clear its contents as well. Again: clear the contents of the data directory, do not delete the data directory itself.
3. During the witness-node deployment you may run into:
[RUNNING] check the sys_securecmdd is running or not...
[ERROR] the sys_securecmdd on "10.10.100.236:8890" is running, please stop it first.
[ERROR] the sys_securecmdd on "10.10.100.235:8890" is running, please stop it first.
This one is easy: check the service status with systemctl status securecmdd, then stop it as root with systemctl stop securecmdd. As a good habit, run systemctl status securecmdd again afterwards to confirm the service has actually stopped.
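To make the "clear the contents, not the directory" rule from pitfalls 1 and 2 concrete, here is a throwaway demonstration that uses a temp directory as a stand-in for the real install directory (so it is safe to run anywhere):

```shell
demo=$(mktemp -d)                 # stand-in for your real install_dir
mkdir -p "$demo/kingbase/bin"
touch "$demo/kingbase/bin/kingbase"

# Wrong: rm -rf "$demo" would remove the install directory itself.
# Right: remove only its contents and keep the directory:
rm -rf "$demo"/*                  # note: a plain glob does not match dotfiles

ls -A "$demo"                     # prints nothing: empty but still present
```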
With the groundwork above done, let's walk through the witness-node deployment itself.
[kingbase@lucifer01:/install]$ ./cluster_install.sh expand
[CONFIG_CHECK] will deploy the cluster of
[RUNNING] success connect to the target "192.168.200.9" ..... OK
[RUNNING] success connect to "192.168.200.9" from current node by 'ssh' ..... OK
[RUNNING] success connect to the target "192.168.200.145" ..... OK
[RUNNING] success connect to "192.168.200.145" from current node by 'ssh' ..... OK
[RUNNING] Primary node ip is 192.168.200.145 ...
[RUNNING] Primary node ip is 192.168.200.145 ... OK
[CONFIG_CHECK] set install_with_root=1
[RUNNING] success connect to the target "192.168.200.9" ..... OK
[RUNNING] success connect to "192.168.200.9" from current node by 'ssh' ..... OK
[RUNNING] success connect to the target "192.168.200.145" ..... OK
[RUNNING] success connect to "192.168.200.145" from current node by 'ssh' ..... OK
[INSTALL] load config from cluster.....
[INFO] db_user=system
[INFO] db_port=54321
[INFO] use_scmd=1
[INFO] data_directory=/data
[INFO] db_mode=oracle
[INFO] db_auth=scram-sha-256
[INFO] db_case_sentive=off
[INFO] scmd_port=8890
[INFO] recovery=standby
[INFO] use_check_disk=off
[INFO] trusted_servers=192.168.200.1
[INFO] reconnect_attempts=10
[INFO] reconnect_interval=6
[INFO] auto_cluster_recovery_level=1
[INFO] synchronous=quorum
[INSTALL] load config from cluster.....OK
[CONFIG_CHECK] file format is correct ... OK
[CONFIG_CHECK] encoding: UTF8 OK
[CONFIG_CHECK] locale: zh_CN.UTF-8 OK
[CONFIG_CHECK] check database connection ...
[CONFIG_CHECK] check database connection ... OK
[CONFIG_CHECK] expand_ip[192.168.200.9] is not used in the cluster ...
[CONFIG_CHECK] expand_ip[192.168.200.9] is not used in the cluster ...ok
[CONFIG_CHECK] The localhost is expand_ip:[192.168.200.9] ...
[CONFIG_CHECK] The localhost is expand_ip:[192.168.200.9] ...ok
[CONFIG_CHECK] check node_id is in cluster ...
[CONFIG_CHECK] check node_id is in cluster ...OK
[CONFIG_CHECK] current expand_type is witness, check witness node exist in cluster...
[CONFIG_CHECK] current expand_type is witness, witness node not exist in cluster...OK
[RUNNING] check the db is running or not...
[RUNNING] the db is not running on "192.168.200.9:54321" ..... OK
[RUNNING] the install dir is not exist on "192.168.200.9" ..... OK
[RUNNING] check the sys_securecmdd is running or not...
[RUNNING] the sys_securecmdd is not running on "192.168.200.9:8890" ..... OK
[INFO] use_ssl=0
2025-03-09 09:45:12 [INFO] start to check system parameters on 192.168.200.9 ...
2025-03-09 09:45:12 [WARNING] [GSSAPIAuthentication] yes (should be: no) on 192.168.200.9
2025-03-09 09:45:13 [INFO] [UseDNS] is null on 192.168.200.9
2025-03-09 09:45:13 [INFO] [UsePAM] yes on 192.168.200.9
2025-03-09 09:45:13 [INFO] [ulimit.open files] 655360 on 192.168.200.9
2025-03-09 09:45:13 [INFO] [ulimit.open proc] 655360 on 192.168.200.9
2025-03-09 09:45:14 [INFO] [ulimit.core size] unlimited on 192.168.200.9
2025-03-09 09:45:14 [INFO] [ulimit.mem lock] 50000000 on 192.168.200.9
2025-03-09 09:45:15 [INFO] [kernel.sem] 5010 641280 5010 256 on 192.168.200.9
2025-03-09 09:45:15 [INFO] [RemoveIPC] no on 192.168.200.9
2025-03-09 09:45:16 [INFO] [DefaultTasksAccounting] is null on 192.168.200.9
2025-03-09 09:45:16 [INFO] file "/etc/udev/rules.d/kingbase.rules" exists on 192.168.200.9
2025-03-09 09:45:17 [INFO] [crontab] chmod /usr/bin/crontab ...
2025-03-09 09:45:17 [INFO] [crontab] chmod /usr/bin/crontab ... Done
2025-03-09 09:45:17 [INFO] [crontab access] OK
2025-03-09 09:45:18 [INFO] [cron.deny] kingbase not exists in cron.deny
2025-03-09 09:45:18 [INFO] [crontab auth] crontab is accessible by kingbase now on 192.168.200.9
2025-03-09 09:45:19 [INFO] [SELINUX] disabled on 192.168.200.9
2025-03-09 09:45:20 [INFO] [firewall] down on 192.168.200.9
2025-03-09 09:45:20 [INFO] [The memory] OK on 192.168.200.9
2025-03-09 09:45:20 [INFO] [The hard disk] OK on 192.168.200.9
2025-03-09 09:45:21 [INFO] [ping] chmod /bin/ping ...
2025-03-09 09:45:21 [INFO] [ping] chmod /bin/ping ... Done
2025-03-09 09:45:21 [INFO] [ping access] OK
2025-03-09 09:45:21 [INFO] [/bin/cp --version] on 192.168.200.9 OK
2025-03-09 09:45:21 [INFO] [Virtual IP] Not configured on 192.168.200.9
[INSTALL] create the install dir "/KingbaseES/V9/cluster/kingbase" on 192.168.200.9 ...
[INSTALL] success to create the install dir "/KingbaseES/V9/cluster/kingbase" on "192.168.200.9" ..... OK
[INSTALL] try to copy the zip package "/install/db.zip" to /KingbaseES/V9/cluster/kingbase of "192.168.200.9" .....
[INSTALL] success to scp the zip package "/install/db.zip" /KingbaseES/V9/cluster/kingbase of to "192.168.200.9" ..... OK
[INSTALL] decompress the "/KingbaseES/V9/cluster/kingbase" to "/KingbaseES/V9/cluster/kingbase" on 192.168.200.9
[INSTALL] success to decompress the "/KingbaseES/V9/cluster/kingbase/db.zip" to "/KingbaseES/V9/cluster/kingbase" on "192.168.200.9"..... OK
[INSTALL] check license_file "default"
[INSTALL] success to access license_file on 192.168.200.9: /KingbaseES/V9/cluster/kingbase/bin/license.dat
[RUNNING] config sys_securecmdd and start it ...
[RUNNING] config the sys_securecmdd port to 8890 ...
[RUNNING] success to config the sys_securecmdd port on 192.168.200.9 ... OK
successfully initialized the sys_securecmdd, please use "/KingbaseES/V9/cluster/kingbase/bin/sys_HAscmdd.sh start" to start the sys_securecmdd
[RUNNING] success to config sys_securecmdd on 192.168.200.9 ... OK
[RUNNING] success to start sys_securecmdd on 192.168.200.9 ... OK
[INSTALL] success to access file: /KingbaseES/V9/cluster/kingbase/etc/all_nodes_tools.conf
[INSTALL] success to scp the /KingbaseES/V9/cluster/kingbase/etc/repmgr.conf from 192.168.200.145 to "192.168.200.9"..... ok
[INSTALL] success to scp the ~/.encpwd from 192.168.200.145 to "192.168.200.9"..... ok
[INSTALL] success to scp /KingbaseES/V9/cluster/kingbase/etc/all_nodes_tools.conf from "192.168.200.145" to "192.168.200.9" ...ok
[INSTALL] success to chmod 600 the ~/.encpwd on 192.168.200.9..... ok
[INFO] parameter_name=node_id
[INFO] parameter_values='3'
[INFO] [parameter_name] para_exist=1
[INFO] sed -i "/[#]*node_id[ ]*=/cnode_id='3'" /KingbaseES/V9/cluster/kingbase/etc/repmgr.conf
[INFO] parameter_name=node_name
[INFO] parameter_values='node3'
[INFO] [parameter_name] para_exist=1
[INFO] sed -i "/[#]*node_name[ ]*=/cnode_name='node3'" /KingbaseES/V9/cluster/kingbase/etc/repmgr.conf
[INFO] parameter_name=conninfo
[INFO] parameter_values='host
[INFO] [parameter_name] para_exist=1
[INFO] sed -i "/[#]*conninfo[ ]*=/cconninfo='host=192.168.200.9 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000'" /KingbaseES/V9/cluster/kingbase/etc/repmgr.conf
[INFO] parameter_name=ping_path
[INFO] parameter_values='/bin'
[INFO] [parameter_name] para_exist=1
[INFO] sed -i "/[#]*ping_path[ ]*=/cping_path='/bin'" /KingbaseES/V9/cluster/kingbase/etc/repmgr.conf
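The [INFO] lines above show how the installer rewrites repmgr.conf: GNU sed's c (change) command replaces the entire matching line, whether or not it was commented out. A minimal reproduction of that pattern on a throwaway file (not the real config):

```shell
conf=$(mktemp)
printf '#node_id=0\n#node_name=noname\n' > "$conf"

# Same pattern as in the installer log: match the (possibly commented)
# parameter line and replace the whole line with the new assignment.
sed -i "/[#]*node_id[ ]*=/cnode_id='3'" "$conf"
sed -i "/[#]*node_name[ ]*=/cnode_name='node3'" "$conf"

cat "$conf"
# node_id='3'
# node_name='node3'
```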
The database cluster will be initialized with locales
COLLATE: zh_CN.UTF-8
CTYPE: zh_CN.UTF-8
MESSAGES: C
MONETARY: zh_CN.UTF-8
NUMERIC: zh_CN.UTF-8
TIME: zh_CN.UTF-8
The files belonging to this database system will be owned by user "kingbase".
This user must also own the server process.
The default text search configuration will be set to "simple".
The comparision of strings is case-sensitive.
Data page checksums are enabled.
fixing permissions on existing directory /data ... ok
creating subdirectories ... initdb: could not find suitable text search configuration for locale "zh_CN.UTF-8"
ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... PRC
creating configuration files ... ok
Begin setup encrypt device
initializing the encrypt device ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
create security database ... ok
load security database ... ok
syncing data to disk ... ok
Success. You can now start the database server using:
/KingbaseES/V9/cluster/kingbase/bin/sys_ctl -D /data -l logfile start
[INSTALL] end to init the database on "192.168.200.9" ... OK
[INSTALL] wirte the kingbase.conf on "192.168.200.9" ...
[INSTALL] wirte the kingbase.conf on "192.168.200.9" ... OK
[INSTALL] wirte the es_rep.conf on "192.168.200.9" ...
[INSTALL] wirte the es_rep.conf on "192.168.200.9" ... OK
[INSTALL] wirte the sys_hba.conf on "192.168.200.9" ...
[INSTALL] wirte the sys_hba.conf on "192.168.200.9" ... OK
[INSTALL] start up the database on "192.168.200.9" ...
[INSTALL] /KingbaseES/V9/cluster/kingbase/bin/sys_ctl -w -t 60 -l /KingbaseES/V9/cluster/kingbase/logfile -D /data start
waiting for server to start.... done
server started
[INSTALL] start up the database on "192.168.200.9" ... OK
[INSTALL] create the database "esrep" and user "esrep" for repmgr ...
CREATE DATABASE
CREATE ROLE
[INSTALL] create the database "esrep" and user "esrep" for repmgr ... OK
[INSTALL] register the witness on "192.168.200.9" ...
[INFO] connecting to witness node "node3" (ID: 3)
[INFO] connecting to primary node
[NOTICE] attempting to install extension "repmgr"
[NOTICE] "repmgr" extension successfully installed
[INFO] witness registration complete
[NOTICE] witness node "node3" (ID: 3) successfully registered
[INSTALL] register the witness on "192.168.200.9" ... OK
2025-03-09 09:46:37 begin to start DB on "[localhost]".
2025-03-09 09:46:39 DB on "[localhost]" already started, connect to check it.
2025-03-09 09:46:41 DB on "[localhost]" start success.
2025-03-09 09:46:41 Ready to start local kbha daemon and repmgrd daemon ...
2025-03-09 09:46:41 begin to start repmgrd on "[localhost]".
[2025-03-09 09:46:42] [NOTICE] using provided configuration file "/KingbaseES/V9/cluster/kingbase/bin/../etc/repmgr.conf"
[2025-03-09 09:46:42] [INFO] creating directory "/KingbaseES/V9/cluster/kingbase/log"...
[2025-03-09 09:46:42] [NOTICE] redirecting logging output to "/KingbaseES/V9/cluster/kingbase/log/hamgr.log"
2025-03-09 09:46:45 repmgrd on "[localhost]" start success.
[2025-03-09 09:46:52] [NOTICE] redirecting logging output to "/KingbaseES/V9/cluster/kingbase/log/kbha.log"
2025-03-09 09:46:55 Done.
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 5 | | host=192.168.200.145 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
2 | node2 | standby | running | node1 | default | 100 | 5 | 0 bytes | host=192.168.200.15 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
3 | node3 | witness | * running | node1 | default | 0 | n/a | | host=192.168.200.9 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
Reading the deployment output above carefully, we can see that SSH connections were established only between the primary node (192.168.200.145) and the witness node (192.168.200.9). Throughout the whole process the standby node (192.168.200.15) never connected to the witness node; it did not participate at all and remained a mere bystander.
We can check the cluster node information from any node:
[kingbase@lucifer01:/home/kingbase]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 5 | | host=192.168.200.145 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
2 | node2 | standby | running | node1 | default | 100 | 5 | 0 bytes | host=192.168.200.15 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
3 | node3 | witness | * running | node1 | default | 0 | n/a | | host=192.168.200.9 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
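If you want to consume the repmgr cluster show table in a script (for monitoring, say), here is a hedged awk sketch; the here-doc below is a trimmed stand-in for the real command output, not live cluster data:

```shell
# Extract "name: role" pairs from a repmgr cluster show table.
# NR>2 skips the header row and the dashed separator line.
roles=$(awk -F'|' 'NR>2 { gsub(/ /,"",$2); gsub(/ /,"",$3); print $2": "$3 }' <<'EOF'
 ID | Name  | Role    | Status
----+-------+---------+-----------
 1  | node1 | primary | * running
 2  | node2 | standby | running
 3  | node3 | witness | * running
EOF
)
echo "$roles"
# node1: primary
# node2: standby
# node3: witness
```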
In short, when you hit other problems during deployment, analyze them carefully and tackle each one as it comes, while also consulting the technical articles of those who have gone before. During my own deployment I repeatedly read the articles by 三哥 and virvle on the Modb (墨天轮) community; many thanks to both of them for sharing their expertise.