How to enable and analyze Kdump on RHEL 7 and CentOS 7

Kdump is a kernel feature that captures crash dumps when the system or kernel crashes. To enable kdump, we have to reserve a portion of physical RAM, which is used to run the kdump kernel in the event of a kernel panic or crash.

When a kernel crash or kernel panic occurs, the running kernel uses ‘kexec’ to boot the kdump kernel, which was preloaded into the reserved memory. The kdump kernel then copies the memory contents of the crashed kernel to a vmcore file, either on a local disk or on a remote server, and finally reboots the box.

By analyzing the crash dumps we can find the reason or root cause of the system failure. If you have OS support, you can share the crash dumps with the vendor for analysis.

In this article we will demonstrate how to enable kdump on RHEL 7 and CentOS 7.

Step:1 Install ‘kexec-tools’ and update kernel using yum command
Use the yum command below to install the ‘kexec-tools’ package, in case it is not installed, and to update the kernel.

[root@node01 ~]# yum install kexec-tools kernel -y

Step:2 Update the GRUB2 file to Reserve Memory for Kdump kernel
Edit the GRUB2 file (/etc/default/grub) and add the parameter ‘crashkernel=<Reserved_size_of_RAM>’ to the line beginning with ‘GRUB_CMDLINE_LINUX’.

GRUB_CMDLINE_LINUX="rd.lvm.lv=centos/swap vconsole.font=latarcyrheb-sun16 rd.lvm.lv=centos/root crashkernel=128M vconsole.keymap=us rhgb quiet"

Run the command below to regenerate the GRUB2 configuration.

[root@node01 ~]# grub2-mkconfig -o /boot/grub2/grub.cfg

Now reboot the box using the command below:

[root@node01 ~]# shutdown -r now
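
After the reboot, it can be worth confirming that the running kernel actually picked up the new parameter. The snippet below is a minimal sketch, not part of the original procedure: the helper name crashkernel_value is made up for illustration, and it simply extracts the crashkernel= value from /proc/cmdline (or from another file passed as an argument).

```shell
# Print the value of the crashkernel= parameter from a kernel command line.
# Reads /proc/cmdline by default; another file can be passed as $1.
# (crashkernel_value is a hypothetical helper name.)
crashkernel_value() {
    cmdline_file="${1:-/proc/cmdline}"
    [ -r "$cmdline_file" ] || return 0
    # Put each parameter on its own line, then keep the crashkernel value.
    tr ' ' '\n' < "$cmdline_file" | sed -n 's/^crashkernel=//p' | head -n 1
}

value="$(crashkernel_value)"
if [ -n "$value" ]; then
    echo "crashkernel is set to: $value"
else
    echo "crashkernel is not set on the running kernel"
fi
```

If the parameter is reported as missing, the GRUB2 configuration was most likely not regenerated or the box was not rebooted.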

Step:3 Update the dump location & default action in the file (/etc/kdump.conf)
To store the crash dump (vmcore file) on a local file system, edit the file ‘/etc/kdump.conf’ and specify the location as per your setup. In my case I am using a separate local file system (/var/crash). It is recommended that the file system have free space at least equal to the size of your system’s RAM. Kdump can compress the dump data through the ‘core_collector’ option (core_collector makedumpfile -c), where -c enables compression.

If kdump fails to store the dump file at the specified location, the action named in the ‘default’ directive is performed. In my case the default action is reboot.
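
The sizing recommendation above can be checked with a few lines of shell. This is only a sketch under assumptions: the helper name has_room_for_dump is invented here, the dump directory is assumed to be /var/crash as in this article, and the comparison ignores compression, so it is a conservative check.

```shell
# Succeed when the available space ($2) is at least the RAM size ($1), both in KiB.
# (has_room_for_dump is a hypothetical helper name.)
has_room_for_dump() {
    [ "$2" -ge "$1" ]
}

dump_dir=/var/crash
if [ -d "$dump_dir" ] && [ -r /proc/meminfo ]; then
    # Total RAM in KiB as reported by the kernel.
    ram_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    # Space available on the file system holding the dump directory, in KiB.
    avail_kb=$(df -Pk "$dump_dir" | awk 'NR==2 {print $4}')
    if has_room_for_dump "$ram_kb" "$avail_kb"; then
        echo "$dump_dir has room for an uncompressed dump"
    else
        echo "$dump_dir has less free space than RAM: $avail_kb KiB < $ram_kb KiB"
    fi
fi
```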

Update the following three directives in the kdump.conf file.

[root@node01 ~]#  vi /etc/kdump.conf

path /var/crash
core_collector makedumpfile -c
default reboot
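
Because /etc/kdump.conf ships with many commented-out examples, it helps to list only the directives that are actually active after editing. The following is a small sketch; the function name active_kdump_directives is an assumption made for this example.

```shell
# Print the active directives of a kdump configuration file,
# skipping blank lines and comment lines. Defaults to /etc/kdump.conf.
# (active_kdump_directives is a hypothetical helper name.)
active_kdump_directives() {
    grep -Ev '^[[:space:]]*(#|$)' "${1:-/etc/kdump.conf}"
}

if [ -r /etc/kdump.conf ]; then
    active_kdump_directives
fi
```

With the configuration from this step, the output should contain just the path, core_collector and default lines.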

Step:4 Start and enable kdump service
[root@node01 ~]# systemctl start kdump.service
[root@node01 ~]# systemctl enable kdump.service

Step:4a Kdump utility
[root@node01 ~]# kdumpctl
Usage: /usr/bin/kdumpctl {start|stop|status|restart|propagate|showmem}
[root@node01 ~]#

To check the reserved memory for kdump
[root@node01 ~]# kdumpctl showmem
Reserved 161MB memory for crash kernel
[root@node01 ~]#
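
The same information is exposed by the kernel in /sys/kernel/kexec_crash_size, in bytes. The sketch below converts that value to mebibytes; the helper name bytes_to_mib is invented for illustration, and the snippet is only meant as a cross-check of the showmem output.

```shell
# Convert a byte count to whole mebibytes.
# (bytes_to_mib is a hypothetical helper name.)
bytes_to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

size_file=/sys/kernel/kexec_crash_size
if [ -r "$size_file" ]; then
    echo "Reserved $(bytes_to_mib "$(cat "$size_file")") MiB for the crash kernel"
fi
```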

To check the status of kdump
[root@node01 ~]# kdumpctl status
Kdump is operational
[root@node01 ~]#

To restart kdump
[root@node02 ~]# kdumpctl restart
kexec: unloaded kdump kernel
Stopping kdump: [OK]
kexec: loaded kdump kernel
Starting kdump: [OK]

[root@node02 ~]#

Step:5 Now Test Kdump by manually crashing the system
Before crashing your system, verify that the kdump service is running, using either of the commands below.

[root@cloud crash]# systemctl is-active kdump.service
[root@cloud crash]# service kdump status
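
Independently of the service status, the kernel reports whether a crash kernel is currently loaded through /sys/kernel/kexec_crash_loaded (1 means loaded). A minimal pre-flight sketch, with the helper name crash_kernel_loaded chosen just for this example:

```shell
# Succeed when the given sysfs file reports a loaded crash kernel ("1").
# Defaults to /sys/kernel/kexec_crash_loaded.
# (crash_kernel_loaded is a hypothetical helper name.)
crash_kernel_loaded() {
    [ "$(cat "${1:-/sys/kernel/kexec_crash_loaded}" 2>/dev/null)" = "1" ]
}

if crash_kernel_loaded; then
    echo "crash kernel is loaded; a panic should produce a vmcore"
else
    echo "crash kernel is NOT loaded; do not expect a vmcore"
fi
```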

To test our kdump configuration, we will manually crash the system with the commands below.

[root@node01 ~]# echo 1 > /proc/sys/kernel/sysrq ; echo c > /proc/sysrq-trigger

This will create a crash dump file (vmcore) under the ‘/var/crash’ file system.

[root@node01 ~]# ls -lhR /var/crash/
/var/crash/:
total 0
drwxr-xr-x. 2 root root 44 Nov 22 20:09 127.0.0.1-2019-11-22-20:09:27

/var/crash/127.0.0.1-2019-11-22-20:09:27:
total 37M
-rw-------. 1 root root  37M Nov 22 20:09 vmcore
-rw-r--r--. 1 root root 102K Nov 22 20:09 vmcore-dmesg.txt

[root@node01 ~]# 
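
When several crashes have been captured, each one lands in its own timestamped directory. The sketch below picks the most recently modified vmcore; it assumes the directory layout shown above, uses the -nt file test as supported by bash and dash, and the function name latest_vmcore is made up for this example.

```shell
# Print the path of the most recently modified vmcore found one level
# below the dump directory (default /var/crash). Returns non-zero if none.
# (latest_vmcore is a hypothetical helper name.)
latest_vmcore() {
    newest=""
    for f in "${1:-/var/crash}"/*/vmcore; do
        [ -f "$f" ] || continue
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

latest_vmcore || echo "no vmcore found"
```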

Step:6 Use ‘crash’ command to analyze and debug crash dumps
Crash is the utility or command to debug and analyze the crash dump or vmcore file.

To use crash, make sure two packages are installed: ‘crash’ and ‘kernel-debuginfo’.

[root@node01 ~]# yum install crash

To install the ‘kernel-debuginfo’ package, first enable the debug repo. Edit the repo file /etc/yum.repos.d/CentOS-Debuginfo.repo.

Change ‘enabled=0’ to ‘enabled=1’.

[root@node01 ~]# vi /etc/yum.repos.d/CentOS-Debuginfo.repo
[root@node01 ~]# yum install kernel-debuginfo

Once kernel-debuginfo is installed, run the crash command below. It gives us a crash prompt where we can run commands to find process info, the list of open files and so on at the time the system crashed.

[root@node01 ~]# crash /var/crash/127.0.0.1-2019-11-22-20\:09\:27/vmcore /usr/lib/debug/lib/modules/3.10.0-1062.4.3.el7.x86_64/vmlinux

crash>
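
The vmlinux passed to crash has to match the kernel release that crashed. As a sketch, the helper below builds the debuginfo path for a given release, falling back to the running kernel; the function name vmlinux_for_release is an assumption for illustration, and the path layout is the one used in the command above.

```shell
# Print the debuginfo vmlinux path for a kernel release.
# Uses the running kernel's release when no argument is given.
# (vmlinux_for_release is a hypothetical helper name.)
vmlinux_for_release() {
    printf '/usr/lib/debug/lib/modules/%s/vmlinux\n' "${1:-$(uname -r)}"
}

vmlinux="$(vmlinux_for_release)"
if [ -f "$vmlinux" ]; then
    echo "debuginfo found: $vmlinux"
else
    echo "missing $vmlinux; install kernel-debuginfo for $(uname -r)"
fi
```

If the box was updated between the crash and the analysis, pass the crashed kernel's release explicitly instead of relying on the running one.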

Type the ‘ps’ command to list the processes that were running when the system crashed.

crash> ps

To view the files that were open when the system crashed, type the ‘files’ command at the crash prompt.

Type the ‘sys’ command to show the system info at the time of the crash.

crash> sys
      KERNEL: /usr/lib/debug/lib/modules/3.10.0-1062.4.3.el7.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2019-11-22-20:09:27/vmcore  [PARTIAL DUMP]
        CPUS: 1
        DATE: Fri Nov 22 20:09:20 2019
      UPTIME: 00:03:28
LOAD AVERAGE: 0.28, 0.36, 0.17
       TASKS: 114
    NODENAME: node01
     RELEASE: 3.10.0-1062.4.3.el7.x86_64
     VERSION: #1 SMP Wed Nov 13 23:58:53 UTC 2019
     MACHINE: x86_64  (4000 Mhz)
      MEMORY: 2 GB
       PANIC: "SysRq : Trigger a crash"

crash>

Type the ‘kmem -i’ command to show the memory usage at the time of the crash.

crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM   469416       1.8 GB         ----
         FREE   385416       1.5 GB   82% of TOTAL MEM
         USED    84000     328.1 MB   17% of TOTAL MEM
       SHARED     9919      38.7 MB    2% of TOTAL MEM
      BUFFERS      527       2.1 MB    0% of TOTAL MEM
       CACHED    37747     147.4 MB    8% of TOTAL MEM
         SLAB    13367      52.2 MB    2% of TOTAL MEM

   TOTAL HUGE        0            0         ----
    HUGE FREE        0            0    0% of TOTAL HUGE

   TOTAL SWAP   524287         2 GB         ----
    SWAP USED        0            0    0% of TOTAL SWAP
    SWAP FREE   524287         2 GB  100% of TOTAL SWAP

 COMMIT LIMIT   758995       2.9 GB         ----
    COMMITTED    71635     279.8 MB    9% of TOTAL LIMIT

crash>

Type the ‘bt’ command to display the backtrace at the time of the crash (read it upside-down, from bottom to top).

crash> bt
PID: 12582  TASK: ffff8890fb6b0000  CPU: 0   COMMAND: "bash"
 #0 [ffff8890faf9bac8] machine_kexec at ffffffff81265b24
 #1 [ffff8890faf9bb28] __crash_kexec at ffffffff81321ab2
 #2 [ffff8890faf9bbf8] crash_kexec at ffffffff81321ba0
 #3 [ffff8890faf9bc10] oops_end at ffffffff81984798
 #4 [ffff8890faf9bc38] no_context at ffffffff81275bb4
 #5 [ffff8890faf9bc88] __bad_area_nosemaphore at ffffffff81275e82
 #6 [ffff8890faf9bcd8] bad_area at ffffffff81973104
 #7 [ffff8890faf9bd00] __do_page_fault at ffffffff819878b7
 #8 [ffff8890faf9bd70] do_page_fault at ffffffff81987975
 #9 [ffff8890faf9bda0] page_fault at ffffffff81983778
    [exception RIP: sysrq_handle_crash+22]
    RIP: ffffffff8166f266  RSP: ffff8890faf9be58  RFLAGS: 00010246
    RAX: ffffffff8166f250  RBX: ffffffff81ee54a0  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffff8890ff613898  RDI: 0000000000000063
    RBP: ffff8890faf9be58   R8: ffffffff822018bc   R9: 0000000000000082
    R10: 0000000000000595  R11: 0000000000000594  R12: 0000000000000063
    R13: 0000000000000000  R14: 0000000000000004  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff8890faf9be60] __handle_sysrq at ffffffff8166fa8d
#11 [ffff8890faf9be90] write_sysrq_trigger at ffffffff8166fef8
#12 [ffff8890faf9bea8] proc_reg_write at ffffffff814c2220
#13 [ffff8890faf9bec8] vfs_write at ffffffff81449f60
#14 [ffff8890faf9bf08] sys_write at ffffffff8144ad7f
#15 [ffff8890faf9bf50] system_call_fastpath at ffffffff8198cede
    RIP: 00007f5430755fd0  RSP: 00007ffffb09f318  RFLAGS: 00000246
    RAX: 0000000000000001  RBX: 0000000000000002  RCX: ffffffff8198ce21
    RDX: 0000000000000002  RSI: 00007f543108f000  RDI: 0000000000000001
    RBP: 00007f543108f000   R8: 000000000000000a   R9: 00007f5431093740
    R10: 00007f5431093740  R11: 0000000000000246  R12: 00007f5430a2e400
    R13: 0000000000000002  R14: 0000000000000001  R15: 0000000000000000
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

crash>

Type the ‘help’ command to list the commands available at the crash prompt.

crash> help

*              extend         log            rd             task
alias          files          mach           repeat         timer
ascii          foreach        mod            runq           tree
bpf            fuser          mount          search         union
bt             gdb            net            set            vm
btop           help           p              sig            vtop
dev            ipcs           ps             struct         waitq
dis            irq            pte            swap           whatis
eval           kmem           ptob           sym            wr
exit           list           ptov           sys            q

crash version: 7.2.3-10.el7   gdb version: 7.6
For help on any command above, enter "help <command>".
For help on input options, enter "help input".
For help on output options, enter "help output".

crash>
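
The commands shown above can also be collected into a file and run in one go, which is handy when a report has to be sent to a vendor. This is a sketch under assumptions: it relies on the -i (input file) and -s (silent) options described in the crash man page, the helper name write_crash_commands is invented here, and the paths in the comment are placeholders to fill in for your own dump.

```shell
# Write the crash commands used in this article to a command file, one per line.
# (write_crash_commands is a hypothetical helper name.)
write_crash_commands() {
    printf '%s\n' sys 'kmem -i' bt ps files exit > "$1"
}

cmd_file="$(mktemp)"
write_crash_commands "$cmd_file"
cat "$cmd_file"

# On a box with crash and the matching kernel-debuginfo installed, a report
# could then be produced with something like:
#   crash -s -i "$cmd_file" <path-to-vmlinux> <path-to-vmcore> > crash-report.txt
```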

To store the crash dump on a remote server:

1. Update /etc/kdump.conf with the configuration below.
[root@node02 ~]# vi /etc/kdump.conf
path /tmp
ssh root@192.168.2.100
sshkey /root/.ssh/kdump_id_rsa
core_collector makedumpfile -c -F
default reboot

2. Run the command below to generate the SSH key and copy it to the remote server.
[root@node02 ~]# kdumpctl propagate
WARNING: '/root/.ssh/kdump_id_rsa' doesn't exist, using default value '/root/.ssh/kdump_id_rsa'
Generating new ssh keys... done.
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/kdump_id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@192.168.2.100's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'root@192.168.2.100'"
and check to make sure that only the key(s) you wanted were added.

/root/.ssh/kdump_id_rsa has been added to ~root/.ssh/authorized_keys on 192.168.2.100
[root@node02 ~]# 

3. Restart the kdump service using the command below.

[root@node02 ~]# kdumpctl restart
kexec: unloaded kdump kernel
Stopping kdump: [OK]
kexec: loaded kdump kernel
Starting kdump: [OK]
[root@node02 ~]#

4. Check that the kdump initramfs below has been created.

[root@node02 ~]# ls -lrt /boot/initramfs-3.10.0-957.el7.x86_64kdump.img
-rw------- 1 root root 16620629 May  1 18:20 /boot/initramfs-3.10.0-957.el7.x86_64kdump.img
[root@node02 ~]#

5. Crash the server and check that the crash dump is created under the /tmp folder on the 192.168.2.100 server.
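
Note that the remote configuration in step 1 passes -F to makedumpfile, which an ssh target needs because the dump is streamed over the connection. The sketch below checks a kdump.conf for that combination; the function name ssh_dump_config_ok is made up for this example, and the pattern only recognizes -F written as a separate flag.

```shell
# Succeed when the configuration is consistent: either no ssh target is set,
# or the core_collector line passes -F to makedumpfile.
# (ssh_dump_config_ok is a hypothetical helper name.)
ssh_dump_config_ok() {
    conf="${1:-/etc/kdump.conf}"
    if grep -Eq '^[[:space:]]*ssh[[:space:]]' "$conf"; then
        grep -Eq '^[[:space:]]*core_collector[[:space:]].*makedumpfile.*[[:space:]]-F([[:space:]]|$)' "$conf"
    fi
}

if [ -r /etc/kdump.conf ]; then
    if ssh_dump_config_ok; then
        echo "kdump.conf looks consistent for the configured target"
    else
        echo "ssh target configured but core_collector lacks -F"
    fi
fi
```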


That concludes the article. Please don’t hesitate to share it if you enjoyed it.



https://www.slideshare.net/PaulVNovarese/linux-crash-dump-capture-and-analysis
