The Situation:
1. We have two SuperMicro servers running ESXi 5.1 that back up to each other each night using vmkfstools. Each server hosts a Windows 2003 Server VM running Windows Services for UNIX Version 3.5, with a system virtual hard drive and a data virtual hard drive. The data drive is an NTFS compressed volume that is shared over NFS. All VMs are thin-provisioned. This combination has worked fine since 2008. Using this method I can store backups in 1/5th of their thin size.
2. Recently, I did a hardware refresh making both servers the same model as the most powerful server we had before, except I added a second 6-core processor, doubled the memory to 12 GB, changed to a RAID-10 with 2 TB of drive space, and moved from ESXi 4.0 to 5.1. I expanded the NFS volumes / vmdk on both servers to 500 GB to allow for more backups.
The Problem:
1. When Server2 backs up to the NFS volume on Server1, the clones go fine. However, when Server1 backs up to Server2, the largest VM (55 GB thin, 80 GB declared) fails at about the 90% point on each attempt, even if I clear space. It shouldn't need more space, as the drive only has about 51 GB of data on a 500 GB NFS volume, and the ESXi VMware Client shows the same amount of free space no matter which server I check it from.
Destination disk format: VMFS thin-provisioned
Cloning disk '/vmfs/volumes/datastore1/my_vm/my_vm.vmdk'...
Clone: 90% done.Failed to clone disk: There is not enough space on the file system for the selected operation (13).
All of the other VMs after it back up fine. It only fails on the largest VM, but I can back that one up fine to a local directory.
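As a sanity check on the numbers quoted above, here is a minimal sketch of the arithmetic. It assumes all sizes are in GB as given in the post; the function name `clone_headroom` is made up for illustration. Even in the worst case, where the target allocates the full declared size instead of the thin size, the clone should fit with room to spare.

```python
def clone_headroom(volume_gb, used_gb, clone_gb):
    """Return the free space (GB) that would remain after writing a clone."""
    free_gb = volume_gb - used_gb
    return free_gb - clone_gb

# Figures from the post: 500 GB NFS volume holding about 51 GB of data.
thin_case = clone_headroom(500, 51, 55)      # clone stays thin: 394 GB left
declared_case = clone_headroom(500, 51, 80)  # clone fully allocated: 369 GB left

print(thin_case, declared_case)
```

Both results are far above zero, so the "not enough space" error does not match what the volume should be able to hold.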
Observations:
- I made a new secondary vhd at 500 GB. It worked for a while, but when I went to add more I ran into the same problem.
- I made a new secondary vhd at 1024 GB and the same thing happened.
- I ran chkdsk /r on it and it showed bad clusters. Then I ran it on the machine that works fine, and it showed the same thing. I can run chkdsk /r 50 times on each machine and it makes no difference. The only place it shows bad clusters is in the thin .vmdk backup files.
- After running chkdsk /r, I can back up again, even though the size used on disk doesn't change. However, one of the features of chkdsk /r is to recalculate free space.
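One way to test the free-space theory would be to record what the operating system reports for the volume before and after a chkdsk /r run and compare the two. The following is only a sketch, assuming a Python 3.3 or later interpreter is available on the machine that owns the volume; `free_space_report` is a hypothetical helper name, and the path passed to it is a placeholder for the data drive.

```python
import shutil

def free_space_report(path):
    """Return total, used and free space for the volume containing path, in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total_gib": usage.total / gib,
        "used_gib": usage.used / gib,
        "free_gib": usage.free / gib,
    }

# "." is a placeholder; point this at the NFS-shared data drive instead.
print(free_space_report("."))
```

If the free figure jumps after chkdsk /r while the used figure stays the same, that would support the idea that the free-space accounting, not the actual usage, is what blocks the clone.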
Thoughts:
- It's odd that one machine has the problem and not the other, and that I haven't had the problem for the past 4 1/2 years. The only notable difference is that one of them has a 55 GB thin .vmdk, which is somewhat larger than the others. Other than that, the two Windows machines are the same; in fact, when they were set up, one was a copy of the other. Even the hardware on the machine that has the problem is unchanged other than larger drives, more memory, and a second processor. The big change was from ESXi 4.0 to 5.1.
- It seems that copying thin .vmdks to the Windows 2003 NFS volume confuses Windows. I can see why it might, since a thin disk declares a larger size than it actually uses. However, I didn't have this problem before. When I run chkdsk /r, the last thing it does is recompute free space. That is probably why I can back up again.
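To make the "declares more than it uses" point concrete, here is a rough sketch that compares a file's logical size with the space actually allocated for it. This is an illustration under assumptions, not a diagnosis: the helper names are made up, the Windows branch relies on the Win32 call GetCompressedFileSizeW through ctypes, and the other branch relies on `st_blocks` being counted in 512-byte units, which holds on Linux.

```python
import ctypes
import os
import sys

def logical_bytes(path):
    """Size the file claims to be (what a directory listing shows)."""
    return os.stat(path).st_size

def allocated_bytes(path):
    """Space actually allocated on disk for the file."""
    if sys.platform == "win32":
        # GetCompressedFileSizeW reports the on-disk size of compressed or sparse files.
        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.GetCompressedFileSizeW.argtypes = [
            ctypes.c_wchar_p,
            ctypes.POINTER(ctypes.c_ulong),
        ]
        kernel32.GetCompressedFileSizeW.restype = ctypes.c_ulong
        high = ctypes.c_ulong(0)
        low = kernel32.GetCompressedFileSizeW(path, ctypes.byref(high))
        if low == 0xFFFFFFFF and ctypes.get_last_error() != 0:
            raise ctypes.WinError(ctypes.get_last_error())
        return (high.value << 32) | low
    # Assumption: st_blocks is in 512-byte units (true on Linux).
    return os.stat(path).st_blocks * 512

def size_gap(path):
    """Return (logical, allocated) so the two can be compared side by side."""
    return logical_bytes(path), allocated_bytes(path)
```

Running `size_gap` on one of the backed-up .vmdk files would show how far apart the two figures are on the compressed volume. A large gap is expected for thin or compressed files, and any space check that looks at the logical figure instead of the allocated one could report a shortage that isn't real.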
Question: Does anyone know if there is a definitive answer somewhere that explains the cause of this phenomenon and if there is a work-around?