2010-03-17

A running VM that did not exist

This was a weird one that hit me today.

I had a performance issue on a server.

esxtop was the first thing I looked at, and I got this:

ID    GID NAME NWLD   %USED    %RUN    %SYS   %WAIT    %RDY   
69     69 VSE  5      83.98    85.74   0.00   380.16   22.14 

So I looked up which machine it was:

[root@dmz1 root]# vmware-cmd -l | grep VSE
[root@dmz1 root]#

And the result I got was nada!!

So next I tried:

[root@dmz1 root]# vm-support -x | grep VSE
vmid=1428       VSE

[root@dmz1 /]# ps -efww | grep VSE
root      4426     1  0 Feb23 ?        00:00:06 /usr/lib/vmware/bin/vmkload_app ……   …… a/VSE/VSE.vmx
root      4476  3756  0 12:27 pts/2    00:00:00 grep VSE

So there was a running VM – or so it seemed.

I ran the same steps on the other host in the cluster:

ID    GID NAME NWLD   %USED    %RUN    %SYS   %WAIT    %RDY   
59     59 VSE  5      11.74    11.96   0.00  487.23    6.94

[root@dmz2 root]# vm-support -x | grep VSE
vmid=1410       VSE

[root@dmz2 root]# vmware-cmd -l | grep VSE
[root@dmz2 root]#

OK, so what was going on here? Looking at the details of the machine, I saw that the name of the VM had no correlation to the actual folder it was in:

image

Looking for the machine again, this time by its folder name:

[root@dmz2 root]# vmware-cmd -l | grep CSG1
/vmfs/volumes/…………a/CSG1/CSG1.vmx

[root@dmz1 /]# vmware-cmd -l | grep CSG1
[root@dmz1 /]#

OK. So I had now found the machine named VSE running on dmz2, but I still had a process running on dmz1 that was taking up CPU:

[root@dmz1 /]# ps -efww | grep VSE
root      4426     1  0 Feb23 ?        00:00:06 /usr/lib/vmware/bin/vmkload_app ……   …… a/VSE/VSE.vmx
root      4476  3756  0 12:27 pts/2    00:00:00 grep VSE


I looked into the folder itself:

[root@dmz1 /]# ls -la /vmfs/volumes/……a/VSE/
total 23413952
drwxr-xr-x    1 root     root          980 Mar 17 11:39 .
drwxr-xr-t    1 root     root         2380 Mar 15 11:21 ..
-rw-------    1 root     root         2510 Mar 17 11:35 vmdumper.png
-rw-------    1 root     root     23573652480 Feb 23 23:53 VSE_1-flat.vmdk
-rw-------    1 root     root     268435456 Feb 23 23:53 VSE-6785c36f.vswp
-rw-------    1 root     root     131604480 Feb 23 23:53 VSE-flat.vmdk
-rwxr-xr--    1 root     root         1960 Feb 24 02:04 VSE.vmx
[root@dmz1 /]#

As you can see, all the VM's files were old, and this looked like a phantom machine.
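Rather than eyeballing the dates, the same check can be done by asking find for anything modified recently. This is only a sketch: recent_files is a helper name I made up, the one-day default is an arbitrary choice, and it assumes a find that supports -mtime.

```shell
# Sketch: list regular files under a VM folder that were modified
# within the last N days. recent_files is a hypothetical helper name,
# not a VMware tool.
recent_files() {
    dir=$1
    days=${2:-1}   # assumption: default window is one day
    # -mtime -N selects files modified less than N*24 hours ago
    find "$dir" -type f -mtime -"$days"
}

# Example, with VM_DIR set to the VM's folder:
# recent_files "$VM_DIR" 1
```

For the listing above, a one-day window would report only vmdumper.png, since everything else dates from February. A VM that was really running would normally be touching its own files.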

Time to kill the process on dmz1.

I had the world ID (the vmid reported by vm-support -x) from before: 1428

[root@dmz1 /]# less /proc/vmware/vm/1428/cpu/status

The master world ID for this process appears in the output after "vm."
(the four digits in vm.XXXX; in my case it was 1427).
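Picking the number out by eye works, but it could also be scripted. The following is a sketch only: I have not reproduced the status file here, so the "vm." followed by digits pattern is taken purely from the description above, and the helper name master_world_id is made up for illustration.

```shell
# Sketch: print the digits of the first "vm.<digits>" token found on stdin.
# Assumption: the master world ID shows up in that form, as described above.
# master_world_id is a hypothetical helper name.
master_world_id() {
    grep -o 'vm\.[0-9][0-9]*' | head -n 1 | cut -d. -f2
}

# Example, using the world ID from before:
# master_world_id < /proc/vmware/vm/1428/cpu/status
```

If the status file contains more than one vm.<digits> token, this picks the first one, so check the result against the file before killing anything.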

Then kill the process:

[root@dmz1 /]# /usr/lib/vmware/bin/vmkload_app -k 9 1427
Warning: Mar 17 12:37:04.706: Sending signal '9' to world 1427.

The process was gone and no longer burning a full processor on nothing:

[root@dmz1 /]# ps -efww  | grep VSE
root      4785  3756  0 12:37 pts/2    00:00:00 grep VSE

Just to be on the safe side, I took a vm-support snapshot of the VMID before killing anything; maybe I can find out something about the problem later on.

How the phantom happened, I am still not sure. What worries me more is how this can be detected in the future, so that I do not have to wait for a problem to arise to find these things out.
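One idea, based on the checks above: compare the .vmx paths of the running vmkload_app processes with what vmware-cmd -l reports as registered, and flag anything that is running but not registered. A rough sketch follows. The function name find_phantoms is made up. It assumes the .vmx path is the last field of the ps line (as in my output above) and contains no spaces. It also assumes both commands print the path in the same form.

```shell
# Sketch: print .vmx paths that have a running vmkload_app process
# but do not appear in the registered-VM list.
# find_phantoms is a hypothetical helper name, not a VMware tool.
#   $1 = file holding the output of: ps -efww
#   $2 = file holding the output of: vmware-cmd -l
find_phantoms() {
    ps_file=$1
    registered_file=$2
    # Keep only vmkload_app lines whose last field ends in .vmx,
    # then drop every path that matches a registered path exactly.
    awk '/vmkload_app/ && $NF ~ /\.vmx$/ { print $NF }' "$ps_file" |
        sort -u |
        grep -v -x -F -f "$registered_file"
}

# Example run on a host:
# ps -efww > /tmp/ps.out
# vmware-cmd -l > /tmp/registered.out
# find_phantoms /tmp/ps.out /tmp/registered.out
```

The exit status is that of grep, so it is 0 only when at least one unregistered path is printed. That would make it easy to hang an alert off it from cron.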

I would be interested in hearing your comments or suggestions as to how to address the above question.