Channel: VMware Communities : Discussion List - All Communities

ESXi 5.1: Host stops unexpectedly

Hi everyone,

 

I recently built a system to run ESXi and host a bunch of virtual servers.

 

The ESXi host seems to run very well, but about once every 24 hours it appears to lose access to the locally attached disks. When the problem occurs, I can still ping the host and SSH into it, but the moment I do anything that requires disk I/O, my SSH session hangs.

 

I have spent a lot of time trying to make sure it's not a heat or power issue. I have a multimeter with a temperature probe connected to the cooling fins of the RAID controller's main processor, and the RAID chip's surface never gets above 39 degrees Celsius. Everything else in the host is nice and cool, with surface temperatures never above 35 degrees Celsius, measured using an infrared thermometer while the system is running.

 

The "hang" seems to happen regardless of the load I put on the host.

 

A bit of system info:

 

CPU: Intel i7-3930K

64 GB RAM

Adaptec 6405 RAID controller

4x Western Digital 1 TB drives in a RAID 5 array with 2 volumes: 1 for booting ESXi, 1 to host my VMs.

750-watt power supply

 

Yesterday, I booted the system from a Memtest86 CD and let it run for a few hours to see if I had flaky RAM. The tests ran just fine, and given the symptom (the host suddenly can't access the RAID controller), I doubt it's a RAM issue.

 

I grabbed vmkernel.log (file is attached) and I see sporadic messages along the lines of:

 

2013-03-25T11:48:56.666Z cpu8:34375)ScsiDeviceIO: 2316: Cmd(0x4124007addc0) 0x85, CmdSN 0xe0 from world 5150 to dev "mpx.vmhba3:C0:T1:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2013-03-25T11:48:56.666Z cpu8:34375)ScsiDeviceIO: 2316: Cmd(0x4124007addc0) 0x4d, CmdSN 0xe1 from world 5150 to dev "mpx.vmhba3:C0:T1:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2013-03-25T11:48:56.666Z cpu0:5150)WARNING: ScsiDeviceIO: 6678: IEC page to device "mpx.vmhba3:C0:T0:L0" has bad pagecode: 0x30
2013-03-25T11:48:56.671Z cpu8:32976)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x4124007addc0, 5150) to dev "mpx.vmhba3:C0:T0:L0" on path "vmhba3:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

 

vmhba3 is my Adaptec RAID controller.

 

I have no idea what this means, or if it's even important, but I thought I'd throw it out there since it seems somehow related to the RAID controller.
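
In case it helps anyone reading the log: the status fields in those lines can at least be decoded mechanically. Below is a minimal sketch in Python, assuming the standard SCSI code tables (only a few entries are included here) and the usual ESXi convention that H, D, and P are the host, device, and plugin status. The function name `decode_scsi_failure` and the lookup tables are illustrative, not part of any VMware tool.

```python
import re

# Small excerpts of the standard SCSI tables; other codes are reported as unknown.
OPCODES = {0x85: "ATA PASS-THROUGH (16)", 0x4D: "LOG SENSE"}
DEVICE_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY"}
SENSE_KEYS = {
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
}
ASC_ASCQ = {(0x20, 0x00): "INVALID COMMAND OPERATION CODE"}

HEX = r"0x([0-9a-fA-F]+)"
# Matches both 'Cmd(0x...) 0x85' and 'Cmd 0x85 (...)' as seen in vmkernel.log.
OPCODE_RE = re.compile(r"Cmd(?:\(0x[0-9a-fA-F]+\))? " + HEX)
STATUS_RE = re.compile(rf"H:{HEX} D:{HEX} P:{HEX}")
SENSE_RE = re.compile(rf"sense data: {HEX} {HEX} {HEX}")


def _name(table, code):
    """Look up a code, falling back to a readable 'unknown' marker."""
    return table.get(code, f"unknown (0x{code:x})")


def decode_scsi_failure(line):
    """Decode one failed-command log line; return None if it has no status fields."""
    op = OPCODE_RE.search(line)
    status = STATUS_RE.search(line)
    if not op or not status:
        return None
    host, device, plugin = (int(x, 16) for x in status.groups())
    result = {
        "opcode": _name(OPCODES, int(op.group(1), 16)),
        "host_status": host,
        "device_status": _name(DEVICE_STATUS, device),
        "plugin_status": plugin,
    }
    sense = SENSE_RE.search(line)
    if sense:
        key, asc, ascq = (int(x, 16) for x in sense.groups())
        result["sense_key"] = _name(SENSE_KEYS, key)
        result["additional_sense"] = ASC_ASCQ.get(
            (asc, ascq), f"unknown (0x{asc:x}/0x{ascq:x})"
        )
    return result


if __name__ == "__main__":
    sample = (
        '2013-03-25T11:48:56.666Z cpu8:34375)ScsiDeviceIO: 2316: '
        'Cmd(0x4124007addc0) 0x85, CmdSN 0xe0 from world 5150 to dev '
        '"mpx.vmhba3:C0:T1:L0" failed H:0x0 D:0x2 P:0x0 '
        'Valid sense data: 0x5 0x20 0x0.'
    )
    print(decode_scsi_failure(sample))
```

Run against the first log line above, this reports the opcode as ATA PASS-THROUGH (16), the device status as CHECK CONDITION, and the sense data as ILLEGAL REQUEST / INVALID COMMAND OPERATION CODE, which reads as the device rejecting a command it does not support. Whether that has anything to do with the hang is a separate question that the decoding alone cannot answer.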

 

My ESXi is using what I believe to be the most recent Adaptec driver.

 

I have run ESXi for a few years and it's always been rock solid, so I'm afraid I don't know much about how to troubleshoot these types of issues. Any suggestions or advice would be greatly appreciated.

