Kdump is a kernel feature that captures crash dumps when the system or kernel crashes. To enable kdump, we have to reserve a portion of physical RAM, which is used to run the kdump kernel in the event of a kernel panic or crash.
When a kernel crash or panic occurs, the running kernel uses kexec to boot the kdump kernel that was preloaded into the reserved memory. The kdump kernel then copies the contents of memory to a vmcore file, either on a local disk or on a remote system, and finally reboots the box.
By analyzing the crash dumps we can find the root cause of the system failure. If you have OS support, you can share the crash dumps with the vendor for analysis.
In this article we will demonstrate how to enable kdump on RHEL 7 and CentOS 7.
Step:1 Install ‘kexec-tools’ and update kernel using yum command
Use the yum command below to install the ‘kexec-tools’ package if it is not already installed, and to update the kernel.
[root@node01 ~]# yum install kexec-tools kernel -y
Step:2 Update the GRUB2 file to Reserve Memory for Kdump kernel
Edit the GRUB2 file (/etc/default/grub) and add the parameter ‘crashkernel=<Reserved_size_of_RAM>’ to the line beginning with ‘GRUB_CMDLINE_LINUX’.
GRUB_CMDLINE_LINUX="rd.lvm.lv=centos/swap vconsole.font=latarcyrheb-sun16 rd.lvm.lv=centos/root crashkernel=128M vconsole.keymap=us rhgb quiet"
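Execute the command below to regenerate the grub2 configuration. On UEFI systems the file is /boot/efi/EFI/centos/grub.cfg (or /boot/efi/EFI/redhat/grub.cfg on RHEL) instead.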
[root@node01 ~]# grub2-mkconfig -o /boot/grub2/grub.cfg
Now reboot the box using the command below:
[root@node01 ~]# shutdown -r now
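After the reboot it is worth confirming that the running kernel really received the crashkernel parameter. The snippet below is a minimal sketch of such a check; crashkernel_value is a hypothetical helper name used only for illustration, not a command shipped with kexec-tools.

```shell
#!/bin/bash
# Print the value of the crashkernel= parameter found in a kernel
# command line string, or return non-zero if the parameter is absent.
# crashkernel_value is a hypothetical helper, not part of kexec-tools.
crashkernel_value() {
    local words word
    # Split the command line on whitespace without glob expansion.
    read -r -a words <<< "$1"
    for word in "${words[@]}"; do
        case "$word" in
            crashkernel=*)
                printf '%s\n' "${word#crashkernel=}"
                return 0
                ;;
        esac
    done
    return 1
}

# Typical use after the reboot:
#   crashkernel_value "$(cat /proc/cmdline)" || echo "crashkernel is not set"
```

If the function prints nothing, the grub configuration was most likely not regenerated or the box was not rebooted yet.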
Step:3 Update the dump location & default action in the file (/etc/kdump.conf)
To store the crash dump (vmcore file) on a local file system, edit the file ‘/etc/kdump.conf’ and specify the location as per your setup. In my case I am using a separate local file system (/var/crash). It is recommended that the file system be at least as large as the system’s RAM, or have free space equal to the size of RAM. Kdump can compress the dump data through the ‘core_collector’ directive (core_collector makedumpfile -c), where -c enables compression.
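The sizing recommendation above can be checked with a few lines of shell. This is only a sketch: the three function names are hypothetical, and it compares total RAM from /proc/meminfo with the space that df reports as available for the dump path.

```shell
#!/bin/bash
# Succeed when the free space (in KiB) is at least the RAM size (in KiB).
# All three helper names are hypothetical and used only for illustration.
dump_space_ok() {
    local mem_kb=$1 free_kb=$2
    [ "$free_kb" -ge "$mem_kb" ]
}

# Read total RAM in KiB; the file argument defaults to /proc/meminfo.
mem_total_kb() {
    awk '/^MemTotal:/ {print $2}' "${1:-/proc/meminfo}"
}

# Read the available space, in KiB, of the file system holding a path.
free_space_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Typical use:
#   dump_space_ok "$(mem_total_kb)" "$(free_space_kb /var/crash)" \
#       && echo "enough room for a dump" \
#       || echo "free space is smaller than RAM"
```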
If kdump fails to store the dump file at the specified location, the action named in the ‘default’ directive is performed. In my case the default action is reboot.
Update the three directives below in the kdump.conf file.
[root@node01 ~]# vi /etc/kdump.conf
path /var/crash
core_collector makedumpfile -c
default reboot
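To double-check what was actually written to the file, the directives can be read back while skipping comment lines. The helper below is a hedged sketch; kdump_conf_get is a hypothetical name, and printing the last occurrence of a directive is a choice made for this example rather than documented kdump behaviour.

```shell
#!/bin/bash
# Print the value of a directive from a kdump.conf-style file.
# Comment lines are skipped; if a directive appears more than once,
# this sketch prints the last occurrence. Returns non-zero if not found.
# kdump_conf_get is a hypothetical helper, not a kexec-tools command.
kdump_conf_get() {
    local key=$1 file=${2:-/etc/kdump.conf}
    awk -v key="$key" '
        /^[[:space:]]*#/ { next }
        $1 == key {
            $1 = ""
            sub(/^[[:space:]]+/, "")
            value = $0
            found = 1
        }
        END { if (found) print value; else exit 1 }
    ' "$file"
}

# Typical use:
#   for d in path core_collector default; do
#       printf '%s = %s\n' "$d" "$(kdump_conf_get "$d")"
#   done
```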
Step:4 Start and enable kdump service
[root@node01 ~]# systemctl start kdump.service
[root@node01 ~]# systemctl enable kdump.service
Step:4a Kdump utility
[root@node01 ~]# kdumpctl
Usage: /usr/bin/kdumpctl {start|stop|status|restart|propagate|showmem}
[root@node01 ~]#
To check the reserved memory for kdump
[root@node01 ~]# kdumpctl showmem
Reserved 161MB memory for crash kernel
[root@node01 ~]#
To check the status of kdump
[root@node01 ~]# kdumpctl status
Kdump is operational
[root@node01 ~]#
To restart kdump
[root@node02 ~]# kdumpctl restart
kexec: unloaded kdump kernel
Stopping kdump: [OK]
kexec: loaded kdump kernel
Starting kdump: [OK]
[root@node02 ~]#
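Besides kdumpctl, the kernel itself reports whether a crash kernel is loaded through /sys/kernel/kexec_crash_loaded, which contains 1 when it is. The sketch below wraps that check; crash_kernel_loaded is a hypothetical function name, and the optional argument exists only so the check can be tried against a test file.

```shell
#!/bin/bash
# Succeed when the kernel reports that a crash (kdump) kernel is loaded.
# The kernel exposes this as 1 or 0 in /sys/kernel/kexec_crash_loaded.
# crash_kernel_loaded is a hypothetical helper name.
crash_kernel_loaded() {
    local flag_file=${1:-/sys/kernel/kexec_crash_loaded}
    [ "$(cat "$flag_file" 2>/dev/null)" = "1" ]
}

# Typical use:
#   crash_kernel_loaded && echo "kdump kernel loaded" \
#       || echo "kdump kernel NOT loaded"
```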
Step:5 Now Test Kdump by manually crashing the system
Before crashing your system, verify that the kdump service is running using either of the commands below.
[root@cloud crash]# systemctl is-active kdump.service
[root@cloud crash]# service kdump status
To test our kdump configuration we will manually crash the system with the commands below. The first enables the SysRq interface and the second triggers the crash.
[root@node01 ~]# echo 1 > /proc/sys/kernel/sysrq ; echo c > /proc/sysrq-trigger
This will create a crash dump file (vmcore) under the ‘/var/crash’ file system.
[root@node01 ~]# ls -lhR /var/crash/
/var/crash/:
total 0
drwxr-xr-x. 2 root root 44 Nov 22 20:09 127.0.0.1-2019-11-22-20:09:27
/var/crash/127.0.0.1-2019-11-22-20:09:27:
total 37M
-rw-------. 1 root root 37M Nov 22 20:09 vmcore
-rw-r--r--. 1 root root 102K Nov 22 20:09 vmcore-dmesg.txt
[root@node01 ~]#
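When several dumps have accumulated, the most recent one can be located by modification time. This is a minimal sketch that relies on GNU find (as shipped with RHEL and CentOS); latest_vmcore is a hypothetical helper name.

```shell
#!/bin/bash
# Print the path of the most recently modified vmcore below a directory.
# Relies on the -printf option of GNU find.
# latest_vmcore is a hypothetical helper; /var/crash is the path
# configured earlier in this article.
latest_vmcore() {
    local dir=${1:-/var/crash}
    find "$dir" -type f -name vmcore -printf '%T@ %p\n' 2>/dev/null \
        | sort -n | tail -n 1 | cut -d' ' -f2-
}

# Typical use:
#   ls -lh "$(latest_vmcore)"
```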
Step:6 Use ‘crash’ command to analyze and debug crash dumps
Crash is the utility used to debug and analyze the crash dump (vmcore file).
To use crash, make sure two packages are installed: ‘crash’ and ‘kernel-debuginfo’.
[root@node01 ~]# yum install crash
To install the ‘kernel-debuginfo’ package, first enable the debug repo. Edit the repo file /etc/yum.repos.d/CentOS-Debuginfo.repo
and change ‘enabled=0’ to ‘enabled=1’.
[root@node01 ~]# vi /etc/yum.repos.d/CentOS-Debuginfo.repo
[root@node01 ~]# yum install kernel-debuginfo
Once kernel-debuginfo is installed, run the crash command below. It gives us a crash prompt where we can run commands to find process information, the list of open files and more from the time the system crashed. The vmlinux file must belong to the same kernel release that produced the vmcore.
[root@node01 ~]# crash /var/crash/127.0.0.1-2019-11-22-20\:09\:27/vmcore /usr/lib/debug/lib/modules/3.10.0-1062.4.3.el7.x86_64/vmlinux
crash>
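The vmlinux path follows a fixed layout based on the kernel release, so it can be built rather than typed. The sketch below assumes the layout seen in the crash command above; debug_vmlinux_path is a hypothetical helper name, and defaulting to `uname -r` is only correct when the box has rebooted into the same kernel that crashed.

```shell
#!/bin/bash
# Build the path where kernel-debuginfo places the uncompressed vmlinux
# for a given kernel release (same layout as the path used above).
# With no argument the running kernel release is used, which is an
# assumption: it must be the same kernel that produced the vmcore.
# debug_vmlinux_path is a hypothetical helper name.
debug_vmlinux_path() {
    local release=${1:-$(uname -r)}
    printf '/usr/lib/debug/lib/modules/%s/vmlinux\n' "$release"
}

# Typical use:
#   vmlinux=$(debug_vmlinux_path)
#   [ -f "$vmlinux" ] || echo "kernel-debuginfo for this kernel is missing"
```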
Type the ‘ps’ command to list the processes that were running when the system crashed.
crash> ps
To view the files that were open when the system crashed, type the ‘files’ command at the crash prompt.
Type the ‘sys’ command to display the system information at the time of the crash.
crash> sys
KERNEL: /usr/lib/debug/lib/modules/3.10.0-1062.4.3.el7.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2019-11-22-20:09:27/vmcore [PARTIAL DUMP]
CPUS: 1
DATE: Fri Nov 22 20:09:20 2019
UPTIME: 00:03:28
LOAD AVERAGE: 0.28, 0.36, 0.17
TASKS: 114
NODENAME: node01
RELEASE: 3.10.0-1062.4.3.el7.x86_64
VERSION: #1 SMP Wed Nov 13 23:58:53 UTC 2019
MACHINE: x86_64 (4000 Mhz)
MEMORY: 2 GB
PANIC: "SysRq : Trigger a crash"
crash>
crash> kmem -i
                 PAGES      TOTAL  PERCENTAGE
   TOTAL MEM   469416     1.8 GB  ----
        FREE   385416     1.5 GB  82% of TOTAL MEM
        USED    84000   328.1 MB  17% of TOTAL MEM
      SHARED     9919    38.7 MB  2% of TOTAL MEM
     BUFFERS      527     2.1 MB  0% of TOTAL MEM
      CACHED    37747   147.4 MB  8% of TOTAL MEM
        SLAB    13367    52.2 MB  2% of TOTAL MEM
  TOTAL HUGE        0          0  ----
   HUGE FREE        0          0  0% of TOTAL HUGE
  TOTAL SWAP   524287       2 GB  ----
   SWAP USED        0          0  0% of TOTAL SWAP
   SWAP FREE   524287       2 GB  100% of TOTAL SWAP
COMMIT LIMIT   758995     2.9 GB  ----
   COMMITTED    71635   279.8 MB  9% of TOTAL LIMIT
crash>
Type the ‘bt’ command to display the backtrace of the task that was running when the system crashed (read it upside-down, from bottom to top).
crash> bt
PID: 12582 TASK: ffff8890fb6b0000 CPU: 0 COMMAND: "bash"
#0 [ffff8890faf9bac8] machine_kexec at ffffffff81265b24
#1 [ffff8890faf9bb28] __crash_kexec at ffffffff81321ab2
#2 [ffff8890faf9bbf8] crash_kexec at ffffffff81321ba0
#3 [ffff8890faf9bc10] oops_end at ffffffff81984798
#4 [ffff8890faf9bc38] no_context at ffffffff81275bb4
#5 [ffff8890faf9bc88] __bad_area_nosemaphore at ffffffff81275e82
#6 [ffff8890faf9bcd8] bad_area at ffffffff81973104
#7 [ffff8890faf9bd00] __do_page_fault at ffffffff819878b7
#8 [ffff8890faf9bd70] do_page_fault at ffffffff81987975
#9 [ffff8890faf9bda0] page_fault at ffffffff81983778
[exception RIP: sysrq_handle_crash+22]
RIP: ffffffff8166f266 RSP: ffff8890faf9be58 RFLAGS: 00010246
RAX: ffffffff8166f250 RBX: ffffffff81ee54a0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff8890ff613898 RDI: 0000000000000063
RBP: ffff8890faf9be58 R8: ffffffff822018bc R9: 0000000000000082
R10: 0000000000000595 R11: 0000000000000594 R12: 0000000000000063
R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff8890faf9be60] __handle_sysrq at ffffffff8166fa8d
#11 [ffff8890faf9be90] write_sysrq_trigger at ffffffff8166fef8
#12 [ffff8890faf9bea8] proc_reg_write at ffffffff814c2220
#13 [ffff8890faf9bec8] vfs_write at ffffffff81449f60
#14 [ffff8890faf9bf08] sys_write at ffffffff8144ad7f
#15 [ffff8890faf9bf50] system_call_fastpath at ffffffff8198cede
RIP: 00007f5430755fd0 RSP: 00007ffffb09f318 RFLAGS: 00000246
RAX: 0000000000000001 RBX: 0000000000000002 RCX: ffffffff8198ce21
RDX: 0000000000000002 RSI: 00007f543108f000 RDI: 0000000000000001
RBP: 00007f543108f000 R8: 000000000000000a R9: 00007f5431093740
R10: 00007f5431093740 R11: 0000000000000246 R12: 00007f5430a2e400
R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
crash>
Type the ‘help’ command to list the commands available at the crash prompt; ‘help <command>’ shows the help for a single command.
crash> help
*              extend         log            rd             task
alias          files          mach           repeat         timer
ascii          foreach        mod            runq           tree
bpf            fuser          mount          search         union
bt             gdb            net            set            vm
btop           help           p              sig            vtop
dev            ipcs           ps             struct         waitq
dis            irq            pte            swap           whatis
eval           kmem           ptob           sym            wr
exit           list           ptov           sys            q
crash version: 7.2.3-10.el7 gdb version: 7.6
For help on any command above, enter "help <command>".
For help on input options, enter "help input".
For help on output options, enter "help output".
crash>
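The commands shown in this step can also be collected into a file and run without an interactive session, since crash accepts an input file through its -i option. The sketch below only prepares such a file; write_crash_commands is a hypothetical helper name, and the paths in the usage comment are placeholders to be replaced with your own dump directory and kernel release.

```shell
#!/bin/bash
# Write a list of crash commands, one per line, to a file that can be
# passed to "crash -i". A trailing "exit" is appended so that crash
# quits after the last command instead of waiting at the prompt.
# write_crash_commands is a hypothetical helper name.
write_crash_commands() {
    local out=$1
    shift
    printf '%s\n' "$@" exit > "$out"
}

# Typical use (placeholders in angle brackets must be filled in):
#   write_crash_commands /tmp/crash.cmds sys "kmem -i" bt ps
#   crash -i /tmp/crash.cmds /var/crash/<dump-dir>/vmcore \
#       /usr/lib/debug/lib/modules/<release>/vmlinux > /tmp/crash-report.txt
```

The resulting text report is convenient to attach to a vendor support case alongside vmcore-dmesg.txt.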
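To store the crash dump on a remote server over SSH:
1. Update /etc/kdump.conf with the configuration below. The -F option makes makedumpfile write the dump in flattened format, which is required when the dump is sent over SSH.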
path /tmp
ssh root@192.168.2.100
sshkey /root/.ssh/kdump_id_rsa
core_collector makedumpfile -c -F
default reboot
2. Run the command below to generate the SSH key and copy it to the remote server.
[root@node02 ~]# kdumpctl propagate
WARNING: '/root/.ssh/kdump_id_rsa' doesn't exist, using default value '/root/.ssh/kdump_id_rsa'
Generating new ssh keys... done.
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/kdump_id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@192.168.2.100's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'root@192.168.2.100'"
and check to make sure that only the key(s) you wanted were added.
/root/.ssh/kdump_id_rsa has been added to ~root/.ssh/authorized_keys on 192.168.2.100
[root@node02 ~]#
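Before relying on the remote target, it helps to confirm that the new key allows a password-less login, because the kdump kernel cannot answer a password prompt. The function below is a sketch under that assumption; kdump_ssh_ok is a hypothetical helper name, and the defaults mirror the sshkey directive shown above.

```shell
#!/bin/bash
# Try a non-interactive SSH login with the kdump key and run "true"
# on the remote side. BatchMode=yes makes ssh fail instead of asking
# for a password, so success means key-based login works.
# kdump_ssh_ok is a hypothetical helper name.
kdump_ssh_ok() {
    local target=$1 key=${2:-/root/.ssh/kdump_id_rsa}
    ssh -i "$key" -o BatchMode=yes -o ConnectTimeout=5 "$target" true
}

# Typical use:
#   kdump_ssh_ok root@192.168.2.100 && echo "key login works" \
#       || echo "key login failed"
```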
3. Restart the kdump service using the command below.
[root@node02 ~]# kdumpctl restart
kexec: unloaded kdump kernel
Stopping kdump: [OK]
kexec: loaded kdump kernel
Starting kdump: [OK]
[root@node02 ~]#
4. Check that the kdump initramfs shown below has been created under /boot (for example with ls -l).
-rw------- 1 root root 16620629 May 1 18:20 /boot/initramfs-3.10.0-957.el7.x86_64kdump.img
[root@node02 ~]#
5. Crash the server and check that the crash dump is created in the /tmp folder on the 192.168.2.100 server.
That concludes the article. Please don’t hesitate to share it if you enjoyed it.
https://www.slideshare.net/PaulVNovarese/linux-crash-dump-capture-and-analysis