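Enable SSH on the ESXi host, then run the following commands to troubleshoot the problem.
Check the vSAN physical disk state
Check the "IsPDL" (permanent device loss) parameter. If it equals 1, the disk has been lost.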
vdq -qH
Example:
DiskResults:
  DiskResult[0]:
    Name: naa.5000039c181a6de9
    VSANUUID: 527b32db-a6c2-d457-5132-e4c2a2241368
    State: In-use for VSAN
    Reason: None
    StoragePoolState: Ineligible for use by Storage Pool
    StoragePoolReason: Disk in use by disk group
    IsSSD?: 0
    IsCapacityFlash?: 0
    IsPDL?: 0    // if this equals 1, the disk has been lost
    Size(MB): 2289272
    FormatType: 512e
    IsVsanDirectDisk?: 0
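On a host with many disks, scanning the output by eye is error-prone. The following is a minimal sketch of a filter for it, assuming the command prints one "Field: value" pair per line. The function name pdl_disks is hypothetical and not part of ESXi.

```shell
# pdl_disks: hypothetical helper, not part of ESXi.
# Reads "vdq -qH" output on stdin and prints the name of every disk
# whose IsPDL? field equals 1.
pdl_disks() {
  awk '
    $1 == "Name:"              { name = $2 }    # remember the disk being described
    $1 == "IsPDL?:" && $2 == 1 { print name }   # report it when the PDL flag is set
  '
}
# Possible use on a host: vdq -qH | pdl_disks
```

No output means that no disk reported IsPDL? as 1.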
Check whether any disk is missing from the disk group.
vdq -iH
Example:
Mappings:
  DiskMapping[0]:
    SSD: naa.5002538b225cc2f0
    MD: naa.5000039c181a6de9
    MD: naa.5000039c181a707d
    MD: naa.5000039c181a7001
    MD: naa.5000039c181a7005
    MD: naa.5000039c181a6e29
    MD: naa.5000039c181a7011
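To spot a disk group that has lost a member, it can help to count the capacity disks (MD lines) under each cache disk (SSD line). The sketch below does this, assuming one "SSD:" or "MD:" entry per line. The function name diskgroup_md_count is hypothetical.

```shell
# diskgroup_md_count: hypothetical helper, not part of ESXi.
# Reads "vdq -iH" output on stdin and prints "<cache disk> <number of MD disks>"
# for each disk mapping.
diskgroup_md_count() {
  awk '
    $1 == "SSD:" { if (ssd != "") print ssd, n; ssd = $2; n = 0 }  # a new group starts
    $1 == "MD:"  { n++ }                                           # one more capacity disk
    END          { if (ssd != "") print ssd, n }                   # flush the last group
  '
}
# Possible use on a host: vdq -iH | diskgroup_md_count
```

Compare the count with the number of capacity disks the group is expected to contain.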
Check the "In CMMDS" parameter. If it is false, communication with the disk has been lost.
esxcli vsan storage list
Example:
naa.5000039c181a6de9
  Device: naa.5000039c181a6de9
  Display Name: naa.5000039c181a6de9
  Is SSD: false
  VSAN UUID: 527b32db-a6c2-d457-5132-e4c2a2241368
  VSAN Disk Group UUID: 52874f04-d659-0f52-8ac2-35aa05702568
  VSAN Disk Group Name: naa.5002538b225cc2f0
  Used by this host: true
  In CMMDS: true    // if false, communication with the disk has been lost
  On-disk format version: 17
  Deduplication: false
  Compression: false
  Checksum: 13704721513334665797
  Checksum OK: true
  Is Capacity Tier: true
  Encryption Metadata Checksum OK: true
  Encryption: false
  DiskKeyLoaded: false
  Is Mounted: true
  Creation Time: Sat Dec 10 17:16:48 2022
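The same check can be applied to every disk at once. This is a rough sketch, assuming each disk block contains a "Device:" line followed later by an "In CMMDS:" line. The function name not_in_cmmds is hypothetical.

```shell
# not_in_cmmds: hypothetical helper, not part of ESXi.
# Reads "esxcli vsan storage list" output on stdin and prints every device
# whose "In CMMDS" field is false.
not_in_cmmds() {
  awk '
    $1 == "Device:" { dev = $2 }                                   # current disk block
    $1 == "In" && $2 == "CMMDS:" && $3 == "false" { print dev }    # lost communication
  '
}
# Possible use on a host: esxcli vsan storage list | not_in_cmmds
```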
Use the smart get command to check for read/write errors.
List the NAA identifiers of all disks:
esxcli storage core device list | grep "naa" | awk '{print $1}' | grep "naa"
Example:
naa.5000039c181a6de9
naa.5000039c181a707d
naa.5002538b225cc2f0
naa.5000039c181a7001
naa.5000039c181a7005
naa.5000039c181a6e29
naa.5000039c181a7011
View the S.M.A.R.T. information for a disk:
esxcli storage core device smart get -d naa.5000039c181a6de9
Example:
Parameter          Value  Threshold  Worst  Raw
-----------------  -----  ---------  -----  ---
Health Status      OK     N/A        N/A    N/A
Write Error Count  0      N/A        N/A    N/A
Read Error Count   557    N/A        N/A    N/A
Power Cycle Count  31     N/A        N/A    N/A
Drive Temperature  27     N/A        N/A    N/A
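To highlight only the error counters that are non-zero, the table can be filtered. The sketch below assumes the column layout shown in the example, with the value directly after the three-word parameter name. The function name smart_errors is hypothetical.

```shell
# smart_errors: hypothetical helper, not part of ESXi.
# Reads "esxcli storage core device smart get" output on stdin and prints
# the read/write error counters whose value is a number greater than zero.
smart_errors() {
  awk '
    ($1 == "Read" || $1 == "Write") && $2 == "Error" && $3 == "Count" {
      # $4 is the Value column; skip non-numeric values such as N/A
      if ($4 ~ /^[0-9]+$/ && $4 + 0 > 0) print $1, $2, $3 ": " $4
    }
  '
}
# Possible use on a host:
#   esxcli storage core device smart get -d naa.5000039c181a6de9 | smart_errors
```

For the example table above this prints only the Read Error Count line.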
Check the available disk groups.
esxcli vsan storage list | grep "VSAN Disk Group UUID:" | sort | uniq -c
Example:
7 VSAN Disk Group UUID: 52874f04-d659-0f52-8ac2-35aa05702568
Check for resync operations that are in progress or stalled.
while true; do
  echo " ****************************************** "
  echo "" > /tmp/resyncStats.txt
  # For every DOM object, sum bytesToSync and list the objects that still have data to sync.
  cmmds-tool find -t DOM_OBJECT -f json | grep uuid | awk -F \" '{print $4}' | while read i; do
    pendingResync=$(cmmds-tool find -t DOM_OBJECT -f json -u $i | grep -o "\"bytesToSync\": [0-9]*," | awk -F " |," '{sum+=$2} END{print sum / 1024 / 1024 / 1024;}')
    if [ ${#pendingResync} -ne 1 ]; then echo "$i: $pendingResync GiB"; fi
  done | tee -a /tmp/resyncStats.txt
  # Print the total and append a timestamped summary to the history file.
  total=$(cat /tmp/resyncStats.txt | awk '{sum+=$2} END{print sum}')
  echo "Total: $total GiB" | tee -a /tmp/resyncStats.txt
  total=$(cat /tmp/resyncStats.txt | grep Total)
  totalObj=$(cat /tmp/resyncStats.txt | grep -vE " 0 GiB|Total" | wc -l)
  echo "`date +%Y-%m-%dT%H:%M:%SZ` $total ($totalObj objects)" >> /tmp/totalHistory.txt
  echo `date `
  sleep 60
done
Example:
Total: 0 GiB
Mon Mar 11 02:14:59 UTC 2024
Press Ctrl+C to stop the command.
Check the state of the components.
cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -o 'state\\\":\ [0-9]*' | sort | uniq -c
Healthy: state 7
Inaccessible: state 13
Absent or degraded: state 15
Example:
71 state\": 7
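The numeric codes are easy to misread, so the counts can be annotated with the meanings listed above. The following is a sketch only: the function name label_states is hypothetical, and any code other than 7, 13, or 15 is simply reported as "other".

```shell
# label_states: hypothetical helper, not part of ESXi.
# Reads the "uniq -c" output of the component state command on stdin
# (count first, state code last) and appends the meaning of each code.
label_states() {
  awk '{
    state = $NF
    if (state == 7)       label = "healthy"
    else if (state == 13) label = "inaccessible"
    else if (state == 15) label = "absent or degraded"
    else                  label = "other"
    print $1, "components:", "state", state, "(" label ")"
  }'
}
# Possible use on a host: append "| label_states" to the command above.
```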
Locate the failed disk:
List the NAA identifiers of all disks:
esxcli storage core device list | grep "naa" | awk '{print $1}' | grep "naa"
Example:
naa.5000039c181a6de9
naa.5000039c181a707d
naa.5002538b225cc2f0
naa.5000039c181a7001
naa.5000039c181a7005
naa.5000039c181a6e29
naa.5000039c181a7011
Use the NAA identifier to look up the physical location of the disk:
esxcli storage core device physical get -d naa.5000039c181a6de9
Example:
Physical Location: enclosure 0 slot 6
Find disks that have already been lost:
Use the following script
echo "=============Physical disks placement=============="
echo ""
esxcli storage core device list | grep "naa" | awk '{print $1}' | grep "naa" | while read in; do
echo "$in"
esxcli storage core device physical get -d "$in"
sleep 1
echo "===================================================="
done
Any disk that does not appear in the output is the failed disk. You can also check the disks in the server's iDRAC.
Example:
=============Physical disks placement==============
naa.5000039c181a6de9
Physical Location: enclosure 0 slot 6
====================================================
naa.5000039c181a707d
Physical Location: enclosure 0 slot 2
====================================================
naa.5002538b225cc2f0
Physical Location: enclosure 0 slot 0
====================================================
naa.5000039c181a7001
Physical Location: enclosure 0 slot 1
====================================================
naa.5000039c181a7005
Physical Location: enclosure 0 slot 3
====================================================
naa.5000039c181a6e29
Physical Location: enclosure 0 slot 5
====================================================
naa.5000039c181a7011
Physical Location: enclosure 0 slot 4
====================================================
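Finding the disk that is absent from the output requires a list of the disks that should be present. The sketch below assumes such an inventory was saved earlier, one NAA identifier per line, and compares it with the identifiers the host reports now. The function name missing_disks and both file names are hypothetical.

```shell
# missing_disks: hypothetical helper, not part of ESXi.
# $1: file with the expected NAA identifiers, one per line (assumed to be saved
#     while the host was healthy)
# $2: file with the NAA identifiers the host reports now
# Prints every expected identifier that is absent from the current list.
missing_disks() {
  grep -vxFf "$2" "$1"
}
# Possible use on a host (file names are examples only):
#   esxcli storage core device list | grep "naa" | awk '{print $1}' | grep "naa" > /tmp/current.txt
#   missing_disks /tmp/expected.txt /tmp/current.txt
```

The -x and -F options make grep compare whole lines as fixed strings, so a partial match of one identifier inside another does not hide a missing disk.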
Related logs
/var/log/vmkernel.log
Records problems with reads and writes to vSAN disks, vSAN host heartbeats, PDL, SCSI sense codes, I/O requests (read/write), and cluster membership information.
Example:
2024-03-09T18:50:51.413Z Wa(180) vmkwarning: cpu6:2098013)WARNING: ScsiDeviceIO: 1774: Device naa.5000039c181a7005 performance has deteriorated. I/O latency increased from average value of 11487 microseconds to 7116618 microseconds.
2024-03-09T18:51:06.727Z Wa(180) vmkwarning: cpu61:2098012)WARNING: HPP: HppThrottleLogForDevice:1133: Cmd 0x28 (0x45dbf966c400, 0) to dev "naa.5000039c181a7005" on path "vmhba3:C0:T4:L0" Failed:
2024-03-09T18:51:06.727Z Wa(180) vmkwarning: cpu61:2098012)WARNING: HPP: HppThrottleLogForDevice:1141: Error status H:0x5 D:0x0 P:0x0 . hppAction = 3
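Latency warnings like the first line above can be condensed into one short record per event. This sketch assumes the message wording shown in the example ("Device <naa> performance has deteriorated ... to <n> microseconds"). The function name latency_warnings is hypothetical.

```shell
# latency_warnings: hypothetical helper, not part of ESXi.
# Reads vmkernel.log lines on stdin and prints, for each "performance has
# deteriorated" warning: timestamp, device, and the new latency in microseconds.
latency_warnings() {
  awk '/performance has deteriorated/ {
    dev = ""; lat = ""
    for (i = 1; i <= NF; i++) {
      if ($i == "Device") dev = $(i + 1)                              # word after "Device"
      if ($i == "to" && $(i + 2) ~ /^microseconds/) lat = $(i + 1)    # "to <n> microseconds"
    }
    if (dev != "" && lat != "") print $1, dev, lat
  }'
}
# Possible use on a host: latency_warnings < /var/log/vmkernel.log
```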
/var/log/vobd.log
Reports disk health, permanent device loss (PDL), disk latency, and when the host enters and exits maintenance mode.
Example:
2024-03-09T18:08:10.611Z In(14) vobd[2097697]: [vSANCorrelator] 20883483894278us: [vob.vsan.lsom.devicerepair] vSAN device 5234107b-5200-c452-6c05-99f3bb102a7f is being repaired due to I/O failures, and will be out of service until the repair is complete. If the device is part of a dedup disk group, the entire disk group will be out of service until the repair is complete.
2024-03-09T18:08:10.611Z In(14) vobd[2097697]: [vSANCorrelator] 20883202034798us: [esx.problem.vob.vsan.lsom.devicerepair] Device 5234107b-5200-c452-6c05-99f3bb102a7f is in offline state and is getting repaired.
2024-03-09T18:08:10.621Z In(14) vobd[2097697]: [vSANCorrelator] 20883483904364us: [vob.vsan.pdl.offline] vSAN device 5234107b-5200-c452-6c05-99f3bb102a7f has gone offline.
2024-03-09T18:08:10.621Z In(14) vobd[2097697]: [vSANCorrelator] 20883202044628us: [esx.problem.vob.vsan.pdl.offline] vSAN device 5234107b-5200-c452-6c05-99f3bb102a7f has gone offline.
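Because each offline event is logged twice (once as vob.vsan.pdl.offline and once as esx.problem.vob.vsan.pdl.offline), a de-duplicated list of affected vSAN device UUIDs is often easier to read. This is a minimal sketch that assumes the UUID format shown in the example. The function name offline_devices is hypothetical.

```shell
# offline_devices: hypothetical helper, not part of ESXi.
# Reads vobd.log lines on stdin and prints each vSAN device UUID that was
# reported offline, once.
offline_devices() {
  grep 'vob.vsan.pdl.offline' |
    grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' |
    sort -u
}
# Possible use on a host: offline_devices < /var/log/vobd.log
```

The UUID printed here is the vSAN UUID, which can be matched against the "VSAN UUID" field of esxcli vsan storage list to find the NAA identifier.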
/var/log/vsandevicemonitord.log
Helps you determine whether a disk was marked unhealthy because of excessive log congestion or I/O latency.
Example:
2024-03-09T18:08:38Z In(14) vsandevicemonitord[2100160]: Unmount succeeded on VSAN device naa.5000039c181a7005.
2024-03-09T18:08:38Z In(14) vsandevicemonitord[2100160]: Device naa.5000039c181a7005 was already unmounted.
2024-03-09T18:28:49Z In(14) vsandevicemonitord[2100160]: stderr Errors:
2024-03-09T18:28:49Z In(14)[+] vsandevicemonitord[2100160]: Unable to mount: Disk with vSAN uuid 5234107b-5200-c452-6c05-99f3bb102a7f failed to appear in CMMDS
2024-03-09T18:28:49Z In(14)[+] vsandevicemonitord[2100160]: , stdout from command vsan storage diskgroup mount -d naa.5000039c181a7005.
2024-03-09T18:28:49Z In(14) vsandevicemonitord[2100160]: Mounting failed on VSAN device naa.5000039c181a7005.
2024-03-09T18:28:49Z In(14) vsandevicemonitord[2100160]: Repair attempt 1 for device 5234107b-5200-c452-6c05-99f3bb102a7f
2024-03-09T18:38:50Z In(14) vsandevicemonitord[2100160]: Sample latency intervals for naa.5002538b225cc2f0 are [0, 2, 5, 7, 9, 10].
2024-03-09T18:38:50Z In(14) vsandevicemonitord[2100160]: Resetting repair attempt for device 5234107b-5200-c452-6c05-99f3bb102a7f
2024-03-09T18:38:52Z In(14) vsandevicemonitord[2100160]: Unmount succeeded on VSAN device naa.5000039c181a7005.
2024-03-09T18:38:52Z In(14) vsandevicemonitord[2100160]: Device naa.5000039c181a7005 was already unmounted.
2024-03-09T18:40:10Z In(14) vsandevicemonitord[2100160]: Mount succeeded on VSAN device naa.5000039c181a7005.
2024-03-09T18:40:10Z In(14) vsandevicemonitord[2100160]: Repair successful for device 5234107b-5200-c452-6c05-99f3bb102a7f
2024-03-09T18:50:13Z In(14) vsandevicemonitord[2100160]: Unmount succeeded on VSAN device naa.5000039c181a7005.
2024-03-09T18:50:13Z In(14) vsandevicemonitord[2100160]: Device naa.5000039c181a7005 was already unmounted.
2024-03-09T18:50:27Z In(14) vsandevicemonitord[2100160]: Mount succeeded on VSAN device naa.5000039c181a7005.
2024-03-09T18:50:27Z In(14) vsandevicemonitord[2100160]: Repair successful for device 5234107b-5200-c452-6c05-99f3bb102a7f
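A disk that is repeatedly unmounted and remounted, as in the example above, stands out when the mount results are tallied per device. The sketch below counts the "Mount succeeded" and "Mounting failed" messages, assuming the wording shown in the example. The function name repair_summary is hypothetical.

```shell
# repair_summary: hypothetical helper, not part of ESXi.
# Reads vsandevicemonitord.log lines on stdin and prints, per device, how many
# mount attempts succeeded and how many failed.
repair_summary() {
  awk '
    /Mount succeeded on VSAN device/ { sub(/\.$/, "", $NF); ok[$NF]++;   seen[$NF] = 1 }
    /Mounting failed on VSAN device/ { sub(/\.$/, "", $NF); fail[$NF]++; seen[$NF] = 1 }
    END {
      # the order of devices in the output is not defined
      for (d in seen) print d, "mount_ok=" (ok[d] + 0), "mount_failed=" (fail[d] + 0)
    }
  '
}
# Possible use on a host: repair_summary < /var/log/vsandevicemonitord.log
```

The patterns are case-sensitive, so "Unmount succeeded" lines are not counted as mounts.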
Source:
https://www.dell.com/support/kbdoc/en-us/000209262/vsan-physical-disk-troubleshooting-guide?lang=zh