mt 内存监控:mcecheck.py
raid监控: check-raid
mcelog 是 x86 的 Linux 系统上用来检查硬件错误,特别是内存和CPU错误的工具。
安装方式
yum install mcelog
运行
mcelog
查看日志方式
/var/log/mcelog
MCE 0
HARDWARE ERROR. This is NOT a software problem!
Please contact your hardware vendor
CPU 1 BANK 8 TSC 1193fd60c6699 [at 2000 Mhz 1 days 18:56:49 uptime (unreliable)]
MISC 8f44960800095840 ADDR 4a9f3b1c0
MCG status:
MCi status:
Error overflow
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
Memory read ECC error
Memory corrected error count (CORE_ERR_CNT): 18
Memory transaction Tracker ID (RTId): 40
Memory DIMM ID of error: 1
Memory channel ID of error: 0
Memory ECC syndrome: f449608
STATUS cc0004800001009f MCGSTATUS 0
作为一个企业服务器管理员,面对服务器莫名宕机或者主动重启,历经折磨后判断为内存问题引起,可当看到内存多达几十条时,难道要单条测试?要真这样,估计领导也要废了你吧。有没方便有效的方法去速度定位那个DIMM槽内存或者在日常监测内存正常与否呢?下面介绍下linux系统下的监控方法--MCElog。
What are Machine Check Exceptions (or MCE)?
A machine check exception is an error dedected by your system's processor. There are 2 major types of MCE errors, a notice or warning error, and a fatal execption. The warning will be logged by a "Machine Check Event logged" notice in your system logs, and can be later viewed via some Linux utilities. A fatal MCE will cause the machine to stop responding and the details of the MCE will be printed out to the system's console.
What causes MCE errors?
There most common reason for MCE events to occur are:
1.Memory errors or Error Correction Code (ECC) problems
2.Inadequate cooling / processor over-heating
3.System bus errors
4.Cache errors in the processor or hardware
##一般来说当有错误提示时,需要优先注意内存问题,但由于现在内存控制器是集成在cpu里,所以有个别情况是由CPU问题引起的##
Installmcelog-1.0_pre3_p20120918.tar.gz
Mcelog安装
#tar -zxvf mcelog-1.0_pre3_p20120918.tar.gz 解压出来
#cd andikleen-mcelog-0f5d023 进入解压出来的文件夹
#make
#make install 编译和安装
Mcelog相关文件
/dev/mcelog 设备文件
/var/log/mcelog messages日志文件
/etc/mcelog/mcelog.conf配置文件
/var/run/mcelog.pid
默认故障日志只记录在/var/log/mcelog,并不记录到系统日志中。
如果需要在系统日志中也体现,需修改/etc/mcelog/mcelog.conf文件,将前面#去掉,并保存。

Mcelog相关设置
1.mcelog的随系统启动,查看boot下的config文件,可以看到mce模块随机启动

2.配置mcelog后台运行
#mcelog --daemon
3.查看mcelog日志文件

由于各厂家服务器内存槽位设计可能不同,这边关于错误中的cpu0 bank5内存槽位定位不做讨论。