GreatSQL Community

Analysis of a Database OOM Failure Caused by a Hidden HugePages Configuration

GreatSQL Community · 124 views · 2025-12-26 14:36 | Category: Operations in Practice

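I. The Incident

Early on a Sunday morning we received an urgent SMS alert: a database instance had restarted abnormally. The first step was to log in to the database server and check the error log: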

2025-12-21T06:54:57.259156+08:00 77 [Note] [MY-010914] [Server] Aborted connection 77 to db: 'unconnected' user: 'root' host: '172.17.139.203' (Got an error reading communication packets).
2025-12-21T06:55:33.224314Z mysqld_safe Number of processes running now: 0
2025-12-21T06:55:33.248143Z mysqld_safe mysqld restarted
2025-12-21T06:55:34.053462+08:00 0 [Warning] [MY-011069] [Server] The syntax '--replica-parallel-type' is deprecated and will be removed in a future release.
2025-12-21T06:55:34.053569+08:00 0 [Warning] [MY-011068] [Server] The syntax '--ssl=off' is deprecated and will be removed in a future release. Please use --tls-version='' instead.

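The log shows mysqld_safe finding no running mysqld process and restarting it, with no shutdown or crash message beforehand, so the initial suspicion was an OOM kill. The system log /var/log/messages confirmed that OOM records were present.

[root@gdb-adm ~]#  grep -inr oom /var/log/messages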
5:Dec 21 06:55:33 gdb kernel: [419827.630493] crontab-1 invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
11:Dec 21 06:55:33 gdb kernel: [419827.630530]  oom_kill_process+0x24f/0x270
12:Dec 21 06:55:33 gdb kernel: [419827.630532]  ? oom_badness+0x25/0x140
68:Dec 21 06:55:33 gdb kernel: [419827.630752] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
148:Dec 21 06:55:33 gdb kernel: [419827.631062] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-2036.slice/session-6188.scope,task=mysqld,pid=2567710,uid=2032

II. Problem Analysis

1. Memory Settings Check

The server has 376G of physical memory and innodb_buffer_pool_size is set to 200G, about 53% of the total, which is as expected.

free -h
              total        used        free      shared  buff/cache   available
Mem:          376Gi       267Gi        26Gi       5.0Mi        82Gi        53Gi

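2. jemalloc Check

When GreatSQL or open-source MySQL hits an OOM, a likely cause is the default glibc memory allocator, which does not always return freed memory to the operating system and can therefore look like a memory leak. Run lsof -p PID | grep jem to see which allocator is in use: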

[root@gdb ~]# lsof -p 25424 | grep jem
mysqld 25424 mysql  mem       REG                8,2    2136088   2355262 /data/svr/greatsql/lib/mysql/libjemalloc.so.1

The output shows that jemalloc is loaded as configured, so this cause can largely be ruled out.

3. Detailed OOM Log Analysis

1) Full OOM log

Dec 21 06:55:33 gdb kernel: [419827.630493] crontab-1 invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Dec 21 06:55:33 gdb kernel: [419827.630499] CPU: 14 PID: 9458 Comm: crontab-1 Kdump: loaded Not tainted 4.19.90-2107.6.0.0227.28.oe1.bclinux.x86_64 #1
Dec 21 06:55:33 gdb kernel: [419827.630500] Hardware name: FiberHome FitServer/FiberHome Boards, BIOS 3.4.V7 02/01/2023
Dec 21 06:55:33 gdb kernel: [419827.630507] Call Trace:
Dec 21 06:55:33 gdb kernel: [419827.630519]  dump_stack+0x66/0x8b
Dec 21 06:55:33 gdb kernel: [419827.630527]  dump_header+0x4a/0x1fc
Dec 21 06:55:33 gdb kernel: [419827.630530]  oom_kill_process+0x24f/0x270
Dec 21 06:55:33 gdb kernel: [419827.630532]  ? oom_badness+0x25/0x140
Dec 21 06:55:33 gdb kernel: [419827.630533]  out_of_memory+0x11f/0x540
Dec 21 06:55:33 gdb kernel: [419827.630536]  __alloc_pages_slowpath+0x9f5/0xde0
Dec 21 06:55:33 gdb kernel: [419827.630543]  __alloc_pages_nodemask+0x2a8/0x2d0
Dec 21 06:55:33 gdb kernel: [419827.630549]  filemap_fault+0x35e/0x8a0
Dec 21 06:55:33 gdb kernel: [419827.630555]  ? alloc_set_pte+0x244/0x450
Dec 21 06:55:33 gdb kernel: [419827.630558]  ? filemap_map_pages+0x28f/0x480
Dec 21 06:55:33 gdb kernel: [419827.630584]  ext4_filemap_fault+0x2c/0x40 [ext4]
Dec 21 06:55:33 gdb kernel: [419827.630588]  __do_fault+0x33/0x110
Dec 21 06:55:33 gdb kernel: [419827.630592]  do_fault+0x12e/0x490
Dec 21 06:55:33 gdb kernel: [419827.630595]  ? __handle_mm_fault+0x2a/0x690
Dec 21 06:55:33 gdb kernel: [419827.630597]  __handle_mm_fault+0x613/0x690
Dec 21 06:55:33 gdb kernel: [419827.630601]  handle_mm_fault+0xc4/0x200
Dec 21 06:55:33 gdb kernel: [419827.630604]  __do_page_fault+0x2ba/0x4d0
Dec 21 06:55:33 gdb kernel: [419827.630609]  ? __audit_syscall_exit+0x238/0x2c0
Dec 21 06:55:33 gdb kernel: [419827.630611]  do_page_fault+0x31/0x130
Dec 21 06:55:33 gdb kernel: [419827.630616]  ? page_fault+0x8/0x30
Dec 21 06:55:33 gdb kernel: [419827.630620]  page_fault+0x1e/0x30
Dec 21 06:55:33 gdb kernel: [419827.630623] Mem-Info:
Dec 21 06:55:33 gdb kernel: [419827.630635] active_anon:50985791 inactive_anon:354 isolated_anon:0#012 active_file:677 inactive_file:0 isolated_file:0#012 unevictable:0 dirty:105 writeback:123 unstable:0#012 slab_reclaimable:20583 slab_unreclaimable:49628#012 mapped:319 shmem:1323 pagetables:106803 bounce:0#012 free:5313776 free_pcp:5715 free_cma:0
Dec 21 06:55:33 gdb kernel: [419827.630638] Node 0 active_anon:100766572kB inactive_anon:556kB active_file:1384kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:76kB dirty:32kB writeback:0kB shmem:2276kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Dec 21 06:55:33 gdb kernel: [419827.630645] Node 1 active_anon:103176592kB inactive_anon:860kB active_file:1324kB inactive_file:80kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1200kB dirty:388kB writeback:492kB shmem:3016kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Dec 21 06:55:33 gdb kernel: [419827.630650] Node 0 DMA free:15892kB min:824kB low:1028kB high:1232kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15976kB managed:15892kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Dec 21 06:55:33 gdb kernel: [419827.630654] lowmem_reserve[]: 0 1347 191666 191666 191666
Dec 21 06:55:33 gdb kernel: [419827.630661] Node 0 DMA32 free:833940kB min:72972kB low:91212kB high:109452kB active_anon:559420kB inactive_anon:8kB active_file:68kB inactive_file:0kB unevictable:0kB writepending:32kB present:1733384kB managed:1405672kB mlocked:0kB kernel_stack:52kB pagetables:1084kB bounce:0kB free_pcp:400kB local_pcp:0kB free_cma:0kB
Dec 21 06:55:33 gdb kernel: [419827.630666] lowmem_reserve[]: 0 0 190319 190319 190319
Dec 21 06:55:33 gdb kernel: [419827.630672] Node 0 Normal free:10117540kB min:10117912kB low:12647388kB high:15176864kB active_anon:100207152kB inactive_anon:548kB active_file:808kB inactive_file:0kB unevictable:0kB writepending:0kB present:198180864kB managed:194894048kB mlocked:0kB kernel_stack:13504kB pagetables:215840kB bounce:0kB free_pcp:536kB local_pcp:0kB free_cma:0kB
Dec 21 06:55:33 gdb kernel: [419827.630679] lowmem_reserve[]: 0 0 0 0 0
Dec 21 06:55:33 gdb kernel: [419827.630683] Node 1 Normal free:10287732kB min:10288284kB low:12860352kB high:15432420kB active_anon:103176592kB inactive_anon:860kB active_file:1324kB inactive_file:80kB unevictable:0kB writepending:880kB present:201326592kB managed:198175752kB mlocked:0kB kernel_stack:11836kB pagetables:210288kB bounce:0kB free_pcp:21924kB local_pcp:332kB free_cma:0kB
Dec 21 06:55:33 gdb kernel: [419827.630686] lowmem_reserve[]: 0 0 0 0 0
Dec 21 06:55:33 gdb kernel: [419827.630688] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15892kB
Dec 21 06:55:33 gdb kernel: [419827.630694] Node 0 DMA32: 240*4kB (UME) 178*8kB (UME) 140*16kB (UME) 66*32kB (UME) 70*64kB (UME) 53*128kB (UME) 38*256kB (UME) 18*512kB (UE) 3*1024kB (U) 2*2048kB (UE) 193*4096kB (M) = 834640kB
Dec 21 06:55:33 gdb kernel: [419827.630702] Node 0 Normal: 3557*4kB (UE) 1963*8kB (UME) 651*16kB (UME) 1139*32kB (UME) 855*64kB (UME) 572*128kB (UME) 308*256kB (UE) 129*512kB (UME) 50*1024kB (UME) 27*2048kB (UME) 2359*4096kB (UME) = 10118588kB
Dec 21 06:55:33 gdb kernel: [419827.630712] Node 1 Normal: 3636*4kB (UME) 1848*8kB (UME) 2744*16kB (UME) 2139*32kB (UME) 1580*64kB (UME) 1073*128kB (UME) 613*256kB (UME) 280*512kB (UE) 130*1024kB (UE) 81*2048kB (UE) 2273*4096kB (UME) = 10289648kB
Dec 21 06:55:33 gdb kernel: [419827.630731] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Dec 21 06:55:33 gdb kernel: [419827.630737] Node 0 hugepages_total=40960 hugepages_free=40960 hugepages_surp=0 hugepages_size=2048kB
Dec 21 06:55:33 gdb kernel: [419827.630738] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Dec 21 06:55:33 gdb kernel: [419827.630741] Node 1 hugepages_total=40960 hugepages_free=40960 hugepages_surp=0 hugepages_size=2048kB
Dec 21 06:55:33 gdb kernel: [419827.630742] 3360 total pagecache pages
Dec 21 06:55:33 gdb kernel: [419827.630744] 0 pages in swap cache
Dec 21 06:55:33 gdb kernel: [419827.630746] Swap cache stats: add 0, delete 0, find 0/0
Dec 21 06:55:33 gdb kernel: [419827.630746] Free swap  = 0kB
Dec 21 06:55:33 gdb kernel: [419827.630747] Total swap = 0kB
Dec 21 06:55:33 gdb kernel: [419827.630748] 100314204 pages RAM
Dec 21 06:55:33 gdb kernel: [419827.630749] 0 pages HighMem/MovableOnly
Dec 21 06:55:33 gdb kernel: [419827.630749] 1691363 pages reserved
Dec 21 06:55:33 gdb kernel: [419827.630750] 0 pages hwpoisoned
Dec 21 06:55:33 gdb kernel: [419827.630750] Tasks state (memory values in pages):
Dec 21 06:55:33 gdb kernel: [419827.630752] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Dec 21 06:55:33 gdb kernel: [419827.630790] [    926]     0   926    72470      811   507904        0          -250 systemd-journal
Dec 21 06:55:33 gdb kernel: [419827.630794] [    960]     0   960     8269     1075    77824        0         -1000 systemd-udevd
Dec 21 06:55:33 gdb kernel: [419827.630798] [   1623]     0  1623      729       28    32768        0             0 mdadm
Dec 21 06:55:33 gdb kernel: [419827.630800] [   1672]     0  1672    23007      217    49152        0         -1000 auditd
Dec 21 06:55:33 gdb kernel: [419827.630803] [   1674]     0  1674     1568       90    36864        0             0 sedispatch
Dec 21 06:55:33 gdb kernel: [419827.630806] [   1712]     0  1712    78709      787    98304        0             0 ModemManager
Dec 21 06:55:33 gdb kernel: [419827.630808] [   1714]     0  1714      571       16    32768        0             0 acpid
Dec 21 06:55:33 gdb kernel: [419827.630811] [   1719]    81  1719     2891      845    49152        0          -900 dbus-daemon
Dec 21 06:55:33 gdb kernel: [419827.630813] [   1727]   992  1727      599       38    32768        0             0 lsmd
Dec 21 06:55:33 gdb kernel: [419827.630815] [   1730]     0  1730      619       33    32768        0             0 mcelog
Dec 21 06:55:33 gdb kernel: [419827.630817] [   1735]   999  1735   743772     1030   229376        0             0 polkitd
Dec 21 06:55:33 gdb kernel: [419827.630820] [   1736]     0  1736    77985      204    90112        0             0 rngd
Dec 21 06:55:33 gdb kernel: [419827.630827] [   1739]     0  1739     2711      421    49152        0             0 smartd
Dec 21 06:55:33 gdb kernel: [419827.630829] [   1741]     0  1741    20070      151    40960        0          -500 irqbalance
Dec 21 06:55:33 gdb kernel: [419827.630831] [   1743]     0  1743     4492      227    61440        0             0 systemd-machine
Dec 21 06:55:33 gdb kernel: [419827.630837] [   1753]     0  1753   114058      472   110592        0             0 abrtd
Dec 21 06:55:33 gdb kernel: [419827.630842] [   1794]     0  1794     4780      468    65536        0             0 systemd-logind
Dec 21 06:55:33 gdb kernel: [419827.630844] [   1830]     0  1830   263593      479   929792        0             0 abrt-dump-journ
Dec 21 06:55:33 gdb kernel: [419827.630846] [   1831]     0  1831   261511      460   925696        0             0 abrt-dump-journ
Dec 21 06:55:33 gdb kernel: [419827.630850] [   2802]     0  2802   199635      606   299008        0             0 esfdaemon
Dec 21 06:55:33 gdb kernel: [419827.630852] [   2803]     0  2803    72799    12101   200704        0             0 bare-agent
Dec 21 06:55:33 gdb kernel: [419827.630855] [   2805]     0  2805    59117      340    86016        0             0 cupsd
Dec 21 06:55:33 gdb kernel: [419827.630856] [   2810]     0  2810   251667      734  1376256        0             0 rsyslogd
Dec 21 06:55:33 gdb kernel: [419827.630863] [   2814]     0  2814     3350      227    53248        0         -1000 sshd
Dec 21 06:55:33 gdb kernel: [419827.630865] [   2815]     0  2815   117707     3324   143360        0             0 tuned
Dec 21 06:55:33 gdb kernel: [419827.630869] [   2828]     0  2828    65710      188    73728        0             0 gssproxy
Dec 21 06:55:33 gdb kernel: [419827.630872] [   2848]     0  2848    53496       92    45056        0             0 init.ohasd
Dec 21 06:55:33 gdb kernel: [419827.630874] [   2890]     0  2890      906       48    32768        0             0 atd
Dec 21 06:55:33 gdb kernel: [419827.630875] [   2896]     0  2896    53748      118    49152        0             0 crond
Dec 21 06:55:33 gdb kernel: [419827.630878] [   3692]     0  3692     3539      148    49152        0             0 xinetd
Dec 21 06:55:33 gdb kernel: [419827.630880] [   3978]     0  3978    10985      242    61440        0             0 master
Dec 21 06:55:33 gdb kernel: [419827.630884] [   4004]    89  4004    11331      527    69632        0             0 qmgr
Dec 21 06:55:33 gdb kernel: [419827.630888] [   4093]     0  4093    43766      216   221184        0             0 sddog
Dec 21 06:55:33 gdb kernel: [419827.630890] [   4112]     0  4112   285705      537   577536        0             0 sdmonitor
Dec 21 06:55:33 gdb kernel: [419827.630891] [   4233]     0  4233   134053      596   466944        0             0 sdcc
Dec 21 06:55:33 gdb kernel: [419827.630895] [   4259]     0  4259   168947     8371   667648        0             0 sdec
Dec 21 06:55:33 gdb kernel: [419827.630897] [   4284]     0  4284   286675     1588   778240        0             0 sdexam
Dec 21 06:55:33 gdb kernel: [419827.630899] [   4310]     0  4310   492216    50216  1331200        0             0 sdsvrd
Dec 21 06:55:33 gdb kernel: [419827.630906] [   4330]     0  4330    29248      278   278528        0             0 udcenter
Dec 21 06:55:33 gdb kernel: [419827.630908] [   8353]     0  8353     2184      321    45056        0             0 dhclient
Dec 21 06:55:33 gdb kernel: [419827.630910] [   9243]  1086  9243     5274      639    73728        0             0 systemd
Dec 21 06:55:33 gdb kernel: [419827.630915] [   9245]  1086  9245     6383     1015    73728        0             0 (sd-pam)
Dec 21 06:55:33 gdb kernel: [419827.630918] [   9348]  1086  9348   470112    50291   761856        0             0 java
Dec 21 06:55:33 gdb kernel: [419827.630920] [   9426]     0  9426     2184      323    45056        0             0 dhclient
Dec 21 06:55:33 gdb kernel: [419827.630922] [   9852]     0  9852    53214       26    36864        0             0 agetty
Dec 21 06:55:33 gdb kernel: [419827.630926] [  11463]  1002 11463     5276      639    73728        0             0 systemd
Dec 21 06:55:33 gdb kernel: [419827.630936] [  11465]  1002 11465     6383     1016    73728        0             0 (sd-pam)
Dec 21 06:55:33 gdb kernel: [419827.630942] [  11611]  1002 11611 14284908     1404   602112        0             0 agent60
Dec 21 06:55:33 gdb kernel: [419827.630945] [ 137615]     0 137615   136163     3215   147456        0             0 lvmdbusd
Dec 21 06:55:33 gdb kernel: [419827.630950] [ 796407]  2036 796407     5301      649    73728        0             0 systemd
Dec 21 06:55:33 gdb kernel: [419827.630952] [ 796409]  2036 796409    43812     1109    94208        0             0 (sd-pam)
Dec 21 06:55:33 gdb kernel: [419827.630954] [ 817343]  2032 817343    53508      130    53248        0             0 mysqld_safe
Dec 21 06:55:33 gdb kernel: [419827.630956] [2270020]  2032 2270020  2778466     1788  1466368        0             0 dbinit
Dec 21 06:55:33 gdb kernel: [419827.630958] [2567710]  2032 2567710 77307141 50817311 424357888        0             0 mysqld
Dec 21 06:55:33 gdb kernel: [419827.630960] [3453494]   998 3453494     1173       50    36864        0             0 chronyd
Dec 21 06:55:33 gdb kernel: [419827.630963] [3621338]    89 3621338    11065      249    65536        0             0 pickup
Dec 21 06:55:33 gdb kernel: [419827.630981] [3662845]     0 3662845     5297      648    73728        0             0 systemd
Dec 21 06:55:33 gdb kernel: [419827.630983] [3662881]     0 3662881    44244     1356    98304        0             0 (sd-pam)
Dec 21 06:55:33 gdb kernel: [419827.630985] [3662906]    89 3662906    11068      242    65536        0             0 trivial-rewrite
Dec 21 06:55:33 gdb kernel: [419827.630987] [3663080]     0 3663080    10991      235    65536        0             0 local
Dec 21 06:55:33 gdb kernel: [419827.630988] [3663097]    89 3663097    11131      254    65536        0             0 smtp
Dec 21 06:55:33 gdb kernel: [419827.630990] [3663098]     0 3663098    10991      235    65536        0             0 local
Dec 21 06:55:33 gdb kernel: [419827.630992] [3663108]    89 3663108    11073      242    65536        0             0 bounce
Dec 21 06:55:33 gdb kernel: [419827.630994] [3663141]     0 3663141    10991      235    65536        0             0 local
Dec 21 06:55:33 gdb kernel: [419827.630997] [3663177]    89 3663177    11066      242    69632        0             0 flush
Dec 21 06:55:33 gdb kernel: [419827.631003] [3663193]    89 3663193    11066      242    69632        0             0 flush
Dec 21 06:55:33 gdb kernel: [419827.631005] [3663201]    89 3663201    11066      242    69632        0             0 flush
Dec 21 06:55:33 gdb kernel: [419827.631007] [3663207]     0 3663207    53463       54    45056        0             0 sh
Dec 21 06:55:33 gdb kernel: [419827.631011] [3663208]     0 3663208   884643     7048   589824        0             0 promtail
Dec 21 06:55:33 gdb kernel: [419827.631019] [3663317]    89 3663317    11131      254    65536        0             0 smtp
Dec 21 06:55:33 gdb kernel: [419827.631023] [3663318]    89 3663318    11131      254    65536        0             0 smtp
Dec 21 06:55:33 gdb kernel: [419827.631025] [3663319]    89 3663319    11131      254    65536        0             0 smtp
Dec 21 06:55:33 gdb kernel: [419827.631026] [3663320]    89 3663320    11131      254    65536        0             0 smtp
Dec 21 06:55:33 gdb kernel: [419827.631028] [3663321]    89 3663321    11064      242    65536        0             0 error
Dec 21 06:55:33 gdb kernel: [419827.631030] [3663322]    89 3663322    11064      242    65536        0             0 error
Dec 21 06:55:33 gdb kernel: [419827.631032] [3663388]     0 3663388    53093       15    40960        0             0 sleep
Dec 21 06:55:33 gdb kernel: [419827.631048] [3663946]     0 3663946     4458       86    61440        0             0 systemd-cgroups
Dec 21 06:55:33 gdb kernel: [419827.631060] [3663947]     0 3663947     4071       84    57344        0             0 systemd-cgroups
Dec 21 06:55:33 gdb kernel: [419827.631062] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-2036.slice/session-6188.scope,task=mysqld,pid=2567710,uid=2032
Dec 21 06:55:33 gdb kernel: [419827.631071] Out of memory: Kill process 2567710 (mysqld) score 516 or sacrifice child
Dec 21 06:55:33 gdb kernel: [419827.632542] Killed process 2567710 (mysqld) total-vm:309228564kB, anon-rss:203269244kB, file-rss:0kB, shmem-rss:0kB

2) What happened

Dec 21 06:55:33 gdb kernel: [419827.630493] crontab-1 invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Dec 21 06:55:33 gdb kernel: [419827.632542] Killed process 2567710 (mysqld) total-vm:309228564kB, anon-rss:203269244kB, file-rss:0kB, shmem-rss:0kB

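The key facts are that the crontab-1 process requested new memory and thereby invoked the oom-killer, and that the process killed was mysqld, which held 203269244kB (about 193G) of anonymous memory.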
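The victim can also be pulled out of the log mechanically. The following is a minimal sketch, assuming the "Killed process PID (name) ... anon-rss:NkB" line format shown above (other kernel versions may word it differently); the function name oom_victims is made up for illustration.

```shell
# oom_victims: print "pid name anon_rss_kB" for each process the OOM killer reaped.
# Reads the files given as arguments, or stdin when none are given.
# Assumes the "Killed process PID (name) ... anon-rss:NkB" format seen in this log.
oom_victims() {
    sed -n 's/.*Killed process \([0-9]*\) (\([^)]*\)).*anon-rss:\([0-9]*\)kB.*/\1 \2 \3/p' "$@"
}
```

For example, oom_victims /var/log/messages lists every OOM victim recorded in the system log.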

3) NUMA usage analysis

Dec 21 06:55:33 gdb kernel: [419827.630672] Node 0 Normal free:10117540kB min:10117912kB low:12647388kB high:15176864kB active_anon:100207152kB inactive_anon:548kB active_file:808kB inactive_file:0kB unevictable:0kB writepending:0kB present:198180864kB managed:194894048kB mlocked:0kB kernel_stack:13504kB pagetables:215840kB bounce:0kB free_pcp:536kB local_pcp:0kB free_cma:0kB
Dec 21 06:55:33 gdb kernel: [419827.630679] lowmem_reserve[]: 0 0 0 0 0
Dec 21 06:55:33 gdb kernel: [419827.630683] Node 1 Normal free:10287732kB min:10288284kB low:12860352kB high:15432420kB active_anon:103176592kB inactive_anon:860kB active_file:1324kB inactive_file:80kB unevictable:0kB writepending:880kB present:201326592kB managed:198175752kB mlocked:0kB kernel_stack:11836kB pagetables:210288kB bounce:0kB free_pcp:21924kB local_pcp:332kB free_cma:0kB

The log shows that free memory in the Normal zone of both NUMA nodes had dropped below the zone's min watermark.
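The watermark comparison can be automated. This is a rough sketch, assuming the per-zone "Node N ZONE free:XkB min:YkB" layout shown in the log above and that each zone record sits on one line; zones_below_min is a hypothetical helper name.

```shell
# zones_below_min: list memory zones whose free memory is below the min watermark.
# Reads OOM report lines from the given files, or from stdin when none are given.
zones_below_min() {
    awk '
    / free:[0-9]+kB min:[0-9]+kB/ {
        node = zone = free = min = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "Node" && node == "") { node = $(i + 1); zone = $(i + 2) }
            else if ($i ~ /^free:/) { free = $i; gsub(/[^0-9]/, "", free) }
            else if ($i ~ /^min:/)  { min = $i;  gsub(/[^0-9]/, "", min) }
        }
        # Compare numerically; print only zones that are under the watermark.
        if (free + 0 < min + 0)
            printf "Node %s %s free=%skB min=%skB\n", node, zone, free, min
    }' "$@"
}
```

Fed the zone lines from this incident, it reports the two Normal zones and skips DMA and DMA32, whose free memory is still above min.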

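4) Memory usage summary

According to the OOM log, memory was distributed roughly as follows (note that the rss column in the system log is in pages, 4k each by default):

Item | Memory
mysqld | 193G
Other processes | 641M
Free on the NUMA nodes | 19.5G

These figures add up to far less than the 376G of system memory: roughly 163G is unaccounted for.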
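The per-process figures can be reproduced from the task table in the OOM report. This is a minimal sketch, assuming the column layout shown in the log (rss is the fifth field from the end) and a 4 kB page size, overridable through the hypothetical PAGE_KB variable; oom_rss_summary is an illustrative name.

```shell
# oom_rss_summary: sum the rss column of the OOM task table, split into
# mysqld and everything else. rss is in pages; PAGE_KB (default 4) converts to kB.
oom_rss_summary() {
    awk -v page_kb="${PAGE_KB:-4}" '
    # Task rows end with: uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
    /\] +[0-9]+ +[0-9]+ +[0-9]+ +[0-9]+ +[0-9]+ +[0-9]+ +-?[0-9]+ +[^ ]+$/ {
        rss_kb = $(NF - 4) * page_kb
        if ($NF == "mysqld") mysqld += rss_kb; else other += rss_kb
    }
    END {
        printf "mysqld %.1f GiB\n", mysqld / 1048576
        printf "other %.1f MiB\n", other / 1024
    }' "$@"
}
```

For the mysqld row above, 50817311 pages x 4 kB = 203269244 kB, the same value the kernel reports as anon-rss.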

5) HugePages analysis

Reading further through the system log:

Dec 21 06:55:33 gdb kernel: [419827.630731] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Dec 21 06:55:33 gdb kernel: [419827.630737] Node 0 hugepages_total=40960 hugepages_free=40960 hugepages_surp=0 hugepages_size=2048kB
Dec 21 06:55:33 gdb kernel: [419827.630738] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Dec 21 06:55:33 gdb kernel: [419827.630741] Node 1 hugepages_total=40960 hugepages_free=40960 hugepages_surp=0 hugepages_size=2048kB

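These lines parse as:

NUMA node | Page size | Total pages | Free pages
node0 | 2M | 40960 | 40960
node0 | 1G | 0 | 0
node1 | 2M | 40960 | 40960
node1 | 1G | 0 | 0

HugePages therefore hold 2M x 40960 x 2 = 160G of memory, none of it in use, which closely matches the amount missing from the memory summary.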
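The same arithmetic can be run directly against the log. A small sketch, assuming the hugepages_total= / hugepages_free= / hugepages_size= key names printed in the OOM report above; hugepage_reservation is a made-up function name.

```shell
# hugepage_reservation: total up reserved and unused HugePages memory
# from the "Node N hugepages_total=... hugepages_size=...kB" lines of an OOM report.
hugepage_reservation() {
    awk '
    /hugepages_total=/ {
        total = free = size = 0
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "hugepages_total") total = kv[2]
            else if (kv[1] == "hugepages_free") free = kv[2]
            else if (kv[1] == "hugepages_size") size = kv[2] + 0   # "2048kB" -> 2048
        }
        reserved_kb += total * size
        unused_kb += free * size
    }
    END {
        printf "reserved=%.1fGiB unused=%.1fGiB\n", reserved_kb / 1048576, unused_kb / 1048576
    }' "$@"
}
```

On the four lines above it prints reserved=160.0GiB unused=160.0GiB, which shows the whole reservation was idle.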

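4. HugePages Configuration Check

1) Check the transparent huge pages configuration

Run cat /sys/kernel/mm/transparent_hugepage/enabled to confirm that transparent huge pages are disabled: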

[root@gdb ~]#  cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

2) Check the traditional HugePages configuration

sysctl -p | grep vm shows no related setting:

[root@gdb ~]#  sysctl -p | grep vm
vm.zone_reclaim_mode=0
vm.swappiness=1
vm.min_free_kbytes=20480000
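Note that sysctl -p only reloads and echoes what is written in /etc/sysctl.conf; it does not show the value the kernel is currently running with. A complementary check reads the runtime state from /proc/meminfo. The sketch below assumes the standard HugePages_Total and Hugepagesize fields and covers only the default huge page size; hugepages_reserved_kb is an illustrative name.

```shell
# hugepages_reserved_kb: memory (in kB) currently reserved for HugePages of the
# default size, read from a meminfo-style file (defaults to /proc/meminfo).
hugepages_reserved_kb() {
    awk '/^HugePages_Total:/ { pages = $2 }
         /^Hugepagesize:/    { size_kb = $2 }
         END { printf "%d\n", pages * size_kb }' "${1:-/proc/meminfo}"
}
```

A non-zero result on a host whose sysctl.conf contains no vm.nr_hugepages line is the mismatch this incident turned on.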

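3) HugePages feature comparison

Dimension | Traditional HugePages | Transparent Huge Pages
How to check | vm.nr_hugepages in /etc/sysctl.conf | /sys/kernel/mm/transparent_hugepage/enabled
Management | Static pre-allocation. At boot, or as soon as it is configured, the kernel carves the specified number of huge pages out of physical memory. That memory is "locked": it is dedicated to huge pages and cannot be used for anything else (such as a process's regular small pages). | Dynamic allocation. At run time the kernel follows memory access patterns (for example, 512 contiguous 4K pages being accessed frequently) and automatically merges small pages into a huge page, or splits it back into small pages when it is no longer needed. It is an "on demand" process.
Configuration | 1. Temporary: sysctl -w vm.nr_hugepages=N 2. Permanent: add vm.nr_hugepages=N to /etc/sysctl.conf, then reboot or run sysctl -p. | 1. Temporary: echo VALUE > /sys/kernel/mm/transparent_hugepage/enabled, where VALUE is always, madvise, or never 2. Permanent: use a kernel boot parameter: edit /etc/default/grub, add transparent_hugepage=always to the GRUB_CMDLINE_LINUX variable, then regenerate the GRUB configuration with grub2-mkconfig -o /boot/grub2/grub.cfg.
Memory use | Dedicated and exclusive. Once allocated, the pages occupy physical memory even when unused, which can waste memory. | Shared pool. Uses the regular page pool and converts only when needed, so memory utilization is higher.
Performance | Stable and predictable. When an application (such as Oracle DB or Redis) explicitly requests huge pages through mmap() or shmget(), it is guaranteed to get them, with no page-fault or merge overhead, so performance is at its best and most stable. | Risk of fluctuation. It improves performance in most cases (fewer TLB misses), but under memory pressure or fragmentation the kernel's merge/split work (the khugepaged process) can cause unpredictable latency spikes, which hurts latency-sensitive applications.

Given the symptoms and the characteristics of HugePages, the likely explanation was that traditional HugePages had been configured, locking 160G of memory away from other processes. Yet the configuration file contained no such setting, which was puzzling.

4) Deep search

Searching more broadly with grep -R "nr_hugepages" /etc revealed the problem: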

[root@gdb ~]#  grep -R "nr_hugepages" /etc
/etc/sysctl.conf.bak-2025-07-13:vm.nr_hugepages=81920

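The configuration file was backed up and modified on July 13. Before that change it did contain a traditional HugePages setting, and the configured value (81920 pages in total) matches the value recorded in the system log (40960 per node on two nodes).

5) Configuration change test

Testing showed that even after the traditional HugePages setting is removed from the configuration file, the huge pages remain allocated: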

[root@qdb ~]# cat /etc/sysctl.conf | grep h
kernel.shmall = 41943040
kernel.shmmax = 171798691840
kernel.shmmni=4096
#vm.hugetlb_shm_group=54321
#vm.nr_hugepages = 40960
[root@qdb ~]# sysctl -p | grep h
kernel.shmall = 41943040
kernel.shmmax = 171798691840
kernel.shmmni=4096
[root@qdb ~]# cat /proc/sys/vm/nr_hugepages
40960

After changing the configuration, the memory must be released manually unless the operating system is rebooted:

[root@gdb ~]# echo 0 > /proc/sys/vm/nr_hugepages
[root@gdb ~]# cat /proc/sys/vm/nr_hugepages
0

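III. Root Cause Summary and Improvements

1) Root cause

A large amount of memory was reserved as HugePages that the database never used, leaving too little regular memory and triggering the OOM.

2) Abnormal default HugePages configuration

By default an operating system has no nr_hugepages setting, so the initial analysis did not consider traditional HugePages. Only after comparing the numbers did the abnormal memory held by traditional HugePages become apparent. Follow-up checks confirmed that the server had been repurposed and still carried leftover Oracle-related configuration, which is why this hidden problem went unnoticed. It is one more small pitfall on the road to migrating onto domestic platforms.

3) Follow-up improvements

Add a step to the initialization procedure for existing (reused) servers that checks and resets traditional HugePages: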

sed -i '/huge/d' /etc/sysctl.conf
sysctl -p | grep huge
echo 0 > /proc/sys/vm/nr_hugepages
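Before deleting anything, it can help to see where a reservation might come from. The following read-only sketch lists active (uncommented) vm.nr_hugepages lines under whatever files or directories it is given; find_hugepage_settings is a hypothetical helper, and the choice of paths in the usage note is an assumption about a typical Linux layout rather than something taken from this server.

```shell
# find_hugepage_settings: print file:line:setting for every uncommented
# vm.nr_hugepages entry found under the given files or directories.
# Read-only; it changes nothing on the system.
find_hugepage_settings() {
    grep -RHnE '^[[:space:]]*vm\.nr_hugepages' "$@" 2>/dev/null
}
```

A typical call would be find_hugepage_settings /etc/sysctl.conf /etc/sysctl.d /run/sysctl.d /usr/lib/sysctl.d. Since HugePages can also be reserved at boot through the hugepages= kernel parameter, grep hugepages /proc/cmdline is worth running as well, followed by cat /proc/sys/vm/nr_hugepages to confirm the runtime value is 0.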

