Liangyu's Blog: drop by drop, water gathers into a river (the path of a Linux ops engineer for browser and mobile games)

kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

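Today I ran into a failure: a machine hung. The terminal showed no error; I simply could not connect to it any more. After a reboot I looked at the messages log: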


Jul  9 03:42:44 246 kernel: INFO: task zabbix_agentd:18783 blocked for more than 120 seconds.

Jul  9 03:42:44 246 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Jul  9 03:42:44 246 kernel: zabbix_agentd D 0000000000000013     0 18783   6426 0x00000084

Jul  9 03:42:44 246 kernel: ffff88086e635ce8 0000000000000086 ffff88086e635d98 ffffffffffffffe9

Jul  9 03:42:44 246 kernel: ffff88086e635c88 ffffffffa0111129 ffff881074aa7800 00000101fffffff5

Jul  9 03:42:44 246 kernel: ffff88086ee2d058 ffff88086e635fd8 000000000000fb88 ffff88086ee2d058

Jul  9 03:42:44 246 kernel: Call Trace:

Jul  9 03:42:44 246 kernel: [<ffffffffa0111129>] ? ext4_check_acl+0x29/0x90 [ext4]

Jul  9 03:42:44 246 kernel: [<ffffffffa00d7bf0>] ? ext4_file_open+0x0/0x130 [ext4]

Jul  9 03:42:44 246 kernel: [<ffffffff8150e555>] schedule_timeout+0x215/0x2e0

Jul  9 03:42:44 246 kernel: [<ffffffff8117e434>] ? nameidata_to_filp+0x54/0x70

Jul  9 03:42:44 246 kernel: [<ffffffff812771e9>] ? cpumask_next_and+0x29/0x50

Jul  9 03:42:44 246 kernel: [<ffffffff8150e1d3>] wait_for_common+0x123/0x180

Jul  9 03:42:44 246 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20

Jul  9 03:42:44 246 kernel: [<ffffffff8150e2ed>] wait_for_completion+0x1d/0x20

Jul  9 03:42:44 246 kernel: [<ffffffff8106513c>] sched_exec+0xdc/0xe0

Jul  9 03:42:44 246 kernel: [<ffffffff81189fc0>] do_execve+0xe0/0x2c0

Jul  9 03:42:44 246 kernel: [<ffffffff810095ea>] sys_execve+0x4a/0x80

Jul  9 03:42:44 246 kernel: [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0
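To see quickly which tasks the kernel flagged, the "blocked for more than N seconds" lines can be pulled out of the log. The following is a minimal sketch, assuming the log lives at /var/log/messages (the usual location on RHEL/CentOS style systems; adjust for yours). The function name hung_tasks is made up for this example.

```shell
#!/bin/sh
# hung_tasks [LOGFILE]
# Print one "name:pid seconds" line for every hung-task report in LOGFILE.
# The default path /var/log/messages is an assumption about the distribution.
hung_tasks() {
    _log=${1:-/var/log/messages}
    # Match lines such as:
    #   kernel: INFO: task zabbix_agentd:18783 blocked for more than 120 seconds.
    sed -n 's/.*INFO: task \(.*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2/p' "$_log"
}
```

Running `hung_tasks | sort | uniq -c` then shows how often each task was reported, which helps tell a single stall from a recurring one.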



The error output above also suggests a simple workaround, which is to disable the 120-second hung-task timeout: echo 0 > /proc/sys/kernel/hung_task_timeout_secs


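Workaround:

Follow the hint in the warning and disable the message. Note that this only silences the warning; it does not remove whatever is making the task block.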

echo 0 > /proc/sys/kernel/hung_task_timeout_secs
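Since disabling the message hides the symptom, it can help to check which processes are actually stuck in uninterruptible sleep (the "D" state shown after the task name in the log above). A small sketch, where the helper name d_state_only is hypothetical and the input is expected to be `ps` output with the state in the second column:

```shell
#!/bin/sh
# d_state_only: read ps-style output on stdin, keep the header line and
# every row whose second column (STAT) starts with "D".
d_state_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Example usage (not run automatically):
#   ps -eo pid,stat,wchan,comm | d_state_only
```

The current detector threshold can be read with `cat /proc/sys/kernel/hung_task_timeout_secs`; a value of 0 means the check is disabled. A value written with echo is lost on reboot.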


The explanation I found is as follows:
This is a known bug. By default Linux uses up to 40% of the available memory for file system caching.
After this mark has been reached, the file system flushes all outstanding data to disk, causing all following IOs to go synchronous.
For flushing this data to disk there is a time limit of 120 seconds by default.
In the case here the IO subsystem is not fast enough to flush the data within 120 seconds.
This happens especially on systems with a lot of memory.

The problem is solved in later kernels and there is no "fix" from Oracle.
I fixed this by lowering the mark for flushing the cache from 40% to 10% by setting "vm.dirty_ratio=10" in /etc/sysctl.conf.
This setting does not influence overall database performance, since you hopefully use Direct IO and bypass the file system cache completely.
In other words: Linux sets aside up to 40% of available memory for file system cache. When that data is flushed, the IO becomes synchronous and cannot finish within the timeout (120 s), so the ratio is lowered from 40% to 10% to avoid the timeout.
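The vm.dirty_ratio change described above could be applied roughly like this. It is a sketch only: the helper name set_sysctl_conf is invented for this example, and it rewrites the config file in place, so try it on a copy first.

```shell
#!/bin/sh
# set_sysctl_conf KEY VALUE [FILE]
# Remove any existing "KEY = ..." line from FILE (default /etc/sysctl.conf)
# and append "KEY = VALUE". Comment lines and other keys are kept.
set_sysctl_conf() {
    _key=$1
    _value=$2
    _conf=${3:-/etc/sysctl.conf}
    [ -f "$_conf" ] || : > "$_conf"
    _tmp=$(mktemp) || return 1
    # Compare the text left of "=" (whitespace stripped) with the key exactly.
    awk -F= -v k="$_key" '{ name = $1; gsub(/[ \t]/, "", name); if (name != k) print }' \
        "$_conf" > "$_tmp"
    printf '%s = %s\n' "$_key" "$_value" >> "$_tmp"
    # Write back through the original file so its permissions are preserved.
    cat "$_tmp" > "$_conf" && rm -f "$_tmp"
}

# Example usage as root (not run automatically):
#   set_sysctl_conf vm.dirty_ratio 10
#   sysctl -p                      # reload /etc/sysctl.conf
#   sysctl vm.dirty_ratio          # verify the running value
```

To change the running value without touching the file, `sysctl -w vm.dirty_ratio=10` does the same thing until the next reboot.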


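Whether zabbix_agentd is the actual cause still needs to be checked on the Zabbix side. I have run many machines for many years and this is the first time I have seen this. To be on the safe side, I reinstalled...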
