2009-07-31

Happy Sysadmin Day

Very quick post for this morning

Show your appreciation for SysAdmin - http://www.sysadminday.com

 

Lots more here

2009-07-28

Install ESX on your Laptop - I had a Crazy Idea

And so this started today with a Twitter post. Whether you know it or not, I am a big enthusiast of trying to install ESX on all kinds of hardware - especially whiteboxes that are not on the HCL. I have tested it on a number of HP, Dell and IBM desktops. The great thing about this is that it mostly works. It is completely unsupported, but a lot of fun to do. But lugging a desktop around to present demos is not always the most convenient thing in the world - to put it mildly.

What-if (or as you would say it in PowerShell, -WhatIf) you could create a system that has a full demonstration environment of multiple VM's - and all of this on your LAPTOP!

So I looked around on the Web and found a few mentions of people who have done this: what kind of problems they ran into and what was possible or not. The consensus is to install a base OS, Workstation on top of that, ESX as a VM, and then VM's onto that ESX VM. The consensus about this as well was "IT IS AS SLOW AS A TORTOISE!". ESX on bare metal should be much, much faster.

So my adventure started with this:

  1. Lenovo T400 Laptop 267812G - Intel Core2 Duo T9400 2.53GHz with 2GB RAM
  2. ESX4i build 171294

The laptop:

DSC00145

Started out in the BIOS, enabled Intel VT

DSC00147

SATA was set as AHCI

DSC00148

And off we go

DSC00149 DSC00150 DSC00151 DSC00152

Install Screen

DSC00153

Recognizes the Disk

DSC00154

And 4 minutes later

DSC00156 

All hardware detected out of the box - Network card included

Next was to connect to the laptop with the VI client.

2009-07-28_1611  
2009-07-28_1614_002 2009-07-28_1616 2009-07-28_1614 2009-07-28_1614_001

Now all I have to do is find out why I cannot power on a machine. Every time I started a VM the laptop froze - completely! Hard reboot and the machine came back up OK but the VM was no longer registered.

Have to look into that further

Hope you enjoyed the ride.

2009-07-27

VI client Install in Disconnected Environments

Yes disconnected environments do exist! I mean completely and totally disconnected.
NO INTERNET!!!

Well I had one of those today. My customer has a network which is completely and physically disconnected from the corporate LAN and therefore also not connected to the internet. This is because of the nature of the information on this secluded network: there is no option for anything to go in or out over the wire.

All fine and dandy! Installed a new ESX4i machine there today. I then wanted to install the new VI client on the user's PC. Pretty straightforward - or so you would think..

Opened up the web browser and pointed it to https://ESX-HOST/client/VMware-viclient.exe and ran the exe file.

Next -> Next -> Next -> skipped the host Update utility, Waited, waited, waited
and then ………… BOINK!!!!

Installation failed ……..    returned error code 1603. And of course no VI client.

Hmmm. Maybe something was wrong with the .net Framework on the machine - checked it and all seemed to be kosher.

Tried the installation again, and guess what? Same story! Tried it on another machine - you guessed right - Same story!

I love a challenge and solving puzzles - so this was one for me :)

I unpacked the VMware-viclient.exe and received this

image

So you would think that the package has all the goodies it needs in order to install. Nope..

Looking into the netfx.log, which was located in the %TEMP% directory, I noticed that during the installation of the .Net Framework the installer was looking for some file on the internet and in a local path. Internet of course would not work here - remember? Disconnected network! - and the local path did not have the file either.

Looked again at the folder sizes - 2.4MB seems a bit small don't you think? Microsoft offers Microsoft .NET Framework 3.0 Service Pack 1 available here - but again 2.4MB in size. So I gathered I need the redistributable package (which should include all that I need) - again Microsoft .NET Framework 3.0 Redistributable Package here (this time a 50MB file). Got both files, moved them to my USB key, and tried the installation of .Net Framework - and guess what?

Exactly the same story!!! Failed installation - looking in the logs, I could see that it still wanted something from the internet.

Got fed up with this and downloaded the redistributable package of Microsoft .NET Framework 3.5 Service Pack 1 from here (this time a full package of 230MB), and moved it also onto my USB key.

Fired up the installer - click click - next next, and went to get myself a cup of water (sorry, don't like coffee)

5 minutes (or so) later .Net was installed.

Started the VMware-viclient.exe.

Next -> Next -> Next -> skipped the host Update utility, waited, and then ………… it went onto the next stage of installing the Visual J# 2.0 and then waited, waited, waited and it completed the installation!!(which makes me very happy :) )

Lessons learned from this episode:

  1. Microsoft should not always presume that everyone has access to the Internet.
  2. Re-distributable packages should contain everything inside - but that is not always true.
  3. VI client works flawlessly with .Net Framework 3.5 SP1
  4. Log files can be boring - but when something does not work the way it should - they will be your aid in finding out why.
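
To make lesson 4 a little less boring, here is a minimal Python sketch for pulling every URL out of an installer log, so you can see at a glance what the setup tried to fetch from the internet. It assumes nothing about the real layout of netfx.log (which is not reproduced here); it simply searches whatever text you hand it, and the function name `urls_in_log` is made up for this illustration.

```python
import re

# Matches http/https URLs up to the next whitespace, quote or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def urls_in_log(log_text):
    """Return the unique URLs mentioned in the log text, in order of first appearance."""
    seen = []
    for url in URL_PATTERN.findall(log_text):
        if url not in seen:
            seen.append(url)
    return seen
```

To use it, read the log file from the %TEMP% directory into a string and pass it to `urls_in_log`. Any URL that comes back is something the installer expected to download, and therefore something to bring along on the USB key.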

Hope you enjoyed the ride!

2009-07-24

How Heavy is your ESX Load?

Well ok.. This could be taken the wrong way (and all of you with the dirty minds should be ashamed of yourselves - ha ha). In one of my previous posts - How Much Ram per Host - a.k.a Lego - I gave a hypothetical scenario of 40 1 vCPU VM's on a single host as opposed to 80 VM's on one host. There was one thing I neglected to mention, and because of an issue with a client this week, I feel it is important to point out.

CPU Contention. For those of you who do not know what the issue is about, a brief explanation. If you have too many VM's competing for the CPU resources they need to work, then your VM's will stop behaving and start to crawl.

So here was the story - a client called me with an issue, all his VM's had started to crawl - EVERYTHING was running slowly!

Troubleshooting walkthrough:

  1. Log into the VI Client - and check the resource utilization of the Host - CPU, RAM, Network, Disk.
    Ok I did that - absolutely nothing!
    CPU - 40%
    RAM - 50%
    NIC - 5-10mb/s utilization
    Disk - This was NFS no disk statistics - so I looked at the VMNIC of the VMKernel - and also nothing!

    On to the next step..
  2. top on the ESX host
    ssh'd into the ESX host and looked at the resources with top. I do this first before even going into the ESX statistics. I looked to see if the iowait was high, if there were any processes eating up too many resources, and what the state of the RAM on the host was.

    14:21:34  up 2 days, 20:57,  1 user,  load average: 1.06, 0.92, 0.75
    286 processes: 284 sleeping, 2 running, 0 zombie, 0 stopped
    CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
               total    0.9%    0.0%    0.0%   0.0%     0.0%    45.8%   94.1%
    Mem:   268548k av,  256560k used,   11988k free,       0k shrd,   21432k buff
                        189028k actv,   29240k in_d,    3232k in_c
    Swap: 1638620k av,   251022k used, 1541988k free                   74068k cached

    If you notice on the last line

    Swap: 1638620k av,   251022k used, 1541988k free                   74068k cached

    Why was it swapping? That is not normal. A quick check in the VI Client showed how much RAM was allocated:

    image

    So there was only 272MB (the default) allocated. Someone had done the proper work of creating the SWAP of 1600MB (double the max. of 800MB) - well done! - but had not restarted the host! So effectively the host was still set for 272MB. Now of course the load on the machine was high enough to cause the host to run out of RAM, and anything that was done on the host was working slowly.

    Vmotioned the VM's off and restarted the host, which came back with the full amount this time

    image

    Swap: 1638588k av,       0k used, 1638588k free       155924k cached

    Ahh much better - no more swapping. Vmotioned the machines back, and at a certain stage all VM's started to crawl again.

    Looked into top again

    CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
               total    0.3%    0.0%    0.0%   0.0%     0.9%     38.9%   59.6%

    Whoa! that is also extremely high!

  3. esxtop
    shift + V to show only VM's, shift + R to sort by ready time

    ID    GID NAME      NWLD   %USED    %RUN    %SYS   %WAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD
    21     21 MEM_ABC_STB_     5   49.84   50.08    0.04  393.77   54.77   0.00    0.34    0.00   51.17

    There were something like 10 VM's with %RDY times of over 10% constantly.
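
If you would rather not eyeball the %RDY column, the check can be scripted. Below is a minimal Python sketch that parses text laid out like the esxtop screen above and lists everything over the 10% mark. It assumes whitespace-separated columns and VM names without spaces, and the function names are made up for this illustration. The sample data is the single row shown above.

```python
# Sample text copied from the esxtop output shown in this post.
ESXTOP_SAMPLE = """\
ID    GID NAME      NWLD   %USED    %RUN    %SYS   %WAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD
21     21 MEM_ABC_STB_     5   49.84   50.08    0.04  393.77   54.77   0.00    0.34    0.00   51.17
"""

RDY_THRESHOLD = 10.0  # percent; the rule of thumb used in this post


def parse_esxtop(text):
    """Turn whitespace-separated esxtop rows into a list of dicts keyed by column name."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def high_ready_vms(text, threshold=RDY_THRESHOLD):
    """Return (name, %RDY) pairs for every row whose ready time is above the threshold."""
    return [
        (row["NAME"], float(row["%RDY"]))
        for row in parse_esxtop(text)
        if float(row["%RDY"]) > threshold
    ]


if __name__ == "__main__":
    for name, rdy in high_ready_vms(ESXTOP_SAMPLE):
        print(f"{name}: %RDY {rdy:.2f}")
```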

Here you have a perfect case of CPU contention. The host is a Dual Quad x5320 (2 sockets x 4 cores = 8 cores). The machine was running 44 VM's.

image

Ratio of vm's per core is high - but achievable. I then looked to see how many vCPU's there were on the host; approximately 10 vm's had 2 or more vCPU's.

image
This brought the ratio of vCPU's per core to 6.75 vCPU's per core. And this is what was killing the host.

Even though the ratio of vm:core was 5.5:1 the vCPU:core ratio was much higher and therefore causing the contention throughout the server.
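
The arithmetic behind those two ratios, as a small Python sketch. It assumes exactly 10 of the 44 VM's had exactly 2 vCPU's each (the post only says "approximately 10" with "2 or more"), which happens to reproduce the 6.75 figure.

```python
PHYSICAL_CORES = 2 * 4   # dual-socket, quad-core host
TOTAL_VMS = 44
MULTI_VCPU_VMS = 10      # assumption: exactly 10 VMs, each with exactly 2 vCPUs


def ratios(total_vms, multi_vcpu_vms, cores, vcpus_per_multi=2):
    """Return (VMs per core, vCPUs per core) for a host."""
    single_vcpu_vms = total_vms - multi_vcpu_vms
    total_vcpus = single_vcpu_vms + multi_vcpu_vms * vcpus_per_multi
    return total_vms / cores, total_vcpus / cores


if __name__ == "__main__":
    vm_ratio, vcpu_ratio = ratios(TOTAL_VMS, MULTI_VCPU_VMS, PHYSICAL_CORES)
    print(f"VMs per core:   {vm_ratio}")    # 44 / 8
    print(f"vCPUs per core: {vcpu_ratio}")  # (34 + 20) / 8
```

Ten extra vCPU's do not sound like much, but on 8 cores they push the scheduler from 5.5 to 6.75 virtual CPUs per core.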

Of course the client did not understand why any of these VM's should be configured with anything less than 2 vCPU's - "because that is what you get with any desktop computer.."

It took an incident like this for the client to understand that there is no reason to configure the machine with more than 1 vCPU unless it really needs (and knows how) to use it.

We brought all the machines back down to 1 vCPU and

 ID    GID NAME             NWLD   %USED    %RUN    %SYS   %WAIT    %RDY
128    128 RPM_Tester_Arie     5    4.70    4.70    0.03  487.45    7.35
118    118 RHEL4_6             5    5.33    5.37    0.00  488.44    5.67
108    108 STBinteg3           5    1.68    1.70    0.00  495.00    2.78
112    112 STBinteg2           5   11.20   11.25    0.00  485.50    2.74
 21     21 MEM_ABC_STB_        5    6.90    6.92    0.02  490.19    2.35

And all was back to normal!

Lessons learned from this episode:

  1. First thing you do when installing the host - set the SWAP to 1600MB and the Service Console RAM to the maximum of 800MB
  2. Reboot the Host after that!!!
  3. Remember to always check your resources, CPU/RAM/NIC/DISK usage are not the only bottlenecks which can cause performance issues.
  4. 80 vCPU's might not be actually possible - it will depend on the workload that is running on the host - but hey this was a hypothetical scenario anyway.
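
The swap check from step 2 is easy to script as well. A minimal Python sketch, assuming the `Swap:` line looks like the top output quoted in this post (the function name is made up for this illustration):

```python
import re

# Matches lines such as:
#   Swap: 1638620k av,   251022k used, 1541988k free   74068k cached
SWAP_LINE = re.compile(r"Swap:\s*(\d+)k av,\s*(\d+)k used,\s*(\d+)k free")


def swap_used_kb(top_output):
    """Return the service console swap in use (in KB), or None if no Swap line is found."""
    match = SWAP_LINE.search(top_output)
    return int(match.group(2)) if match else None


if __name__ == "__main__":
    before = "Swap: 1638620k av,   251022k used, 1541988k free                   74068k cached"
    after = "Swap: 1638588k av,       0k used, 1638588k free       155924k cached"
    for label, line in (("before reboot", before), ("after reboot", after)):
        used = swap_used_kb(line)
        print(f"{label}: {used}k swap in use", "- investigate!" if used else "- ok")
```

Anything other than 0k used on the service console is a hint that the RAM setting never took effect or is simply too low.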

Invaluable resources for troubleshooting performance:

Checking for resource starvation of the ESX Server service console

Ready Time

CPU Performance Analysis and Monitoring

Hope you enjoyed the ride..

2009-07-16

Promotional Drawing for Free VMworld Pass

No I am not offering one - VMware is.

Hot off the press from Twitter - starts now until Midnight - July 24, 2009.

The idea is for you to register for the conference and therefore become eligible to win the pass for free.

If you read the fine print though on the Terms and Conditions you will find that there is a shorter route.

Good luck!!

VMware Workstation & Unity

One of the little-known features, and at least in my opinion one of the coolest gems, in the last few versions of VMware Workstation is Unity Mode.

Access applications within virtual machines as if they were part of the host operating system desktop with “Unity” view

Two perfect use cases:

  1. In most big corporate environments - Microsoft Exchange and Microsoft Outlook are the de-facto tools used for email. If you are like some of those who refuse to use a Microsoft OS - then you have to resort to using Evolution, which does not always work well.
    Unity to the rescue. I can run whatever OS I would like and run a Windows virtual machine for running Outlook, and all I need is to activate Unity to have a window (almost like any other) in my Linux OS.
  2. Some organizations (that do use Exchange) did not move to LCS or OCS. Why? A number of reasons. Too much money for the CAL's. No need. It works. So for one, we are still using Windows Messenger 5.1 for our internal messenger software. Starting from Windows Vista - every time you open a chat window and the screen scrolls, the application crashes. This has been documented numerous times on the web, with no solution.
    Unity to the rescue. I can now run Windows Vista / 7 / Server 2008 and have a VM with my Messenger client open without having to run another whole desktop for this purpose.

A short demo about Unity

And yeah yeah, I know. Time to change the Messenger client, I hear ya!

2009-07-15

VCP4 Beta Exam - Why I will not take the Exam.

I was invited (amongst a good number of others that were in the Beta) to sit the Beta Exam. I have decided not to take the opportunity. Only two days left by the way.

Why you should?

  • Beta Participants receive a good discount on the exam
  • The privilege of becoming one of the 1st few to achieve the VCP4 Certification
  • The privilege of contributing to the testing process for the rest of those that will take the exam in the future

Why you should not?

  • You have a lot of questions to answer in a very short time (270 in 4.5 hours = 1 question/minute)
  • Not all of these questions will be in the GA exam
  • You will not receive your results after the exam, it can actually take something like 6-8 weeks

Personally, none of the cons mentioned above were the reason for my decision. I will not be taking it because the only VUE testing center in Israel where I could schedule the exam was available on only one date, is a three-hour drive away from where I live/work, and the slot was at 08.30 in the morning. So I will pass. Pity, but when the exam becomes available, I will definitely book a more suitable slot.

Thank you anyway, VMware, for giving me the opportunity.

Presentations from an Israel VMUG

I was waiting for these to come in. They have arrived, and I think that you all could benefit from these presentations.

vSphere, What's New? - Technical Overview - Ofir Zamir (Team Leader SEs, VMware Israel)

and

vSphere Upgrade and Best Practices - Ben Hagai (VMUG Leader) and Yaniv Weinberg (Senior Consultant at VMware)

Good presentations from all three of them. Enjoy!

2009-07-09

How much RAM for an ESX server - a.k.a. Lego Blocks

I started to read the sample chapters that Scott Lowe released from his upcoming book, and one of the parts was about the subject of scaling up vs. scaling out.

A slight bit more of an explanation as to what I mean by this. Should I buy bigger, more monstrous servers, or a greater number of smaller servers?

Let us take a sample case study. We have an environment that has sized the following:

hardware

On this hardware an organization has sized their server's capacity as:

load

The estimate of 40 Virtual Machines per host is pretty conservative, but for argument's sake let's say that is the requirement that came from the client. The projected amount of VM's - up to 200.

Which hardware should be used to host these virtual machines? I am not talking about whether it should be a Blade or a Rack mount, and also not which vendor - IBM, HP, Dell or other. I am talking more about what should go into the hardware for each server, and in particular for this post, what would be the best amount of RAM per server.

From my experience with the environment that I currently manage, the bottleneck we hit first is always RAM. Our servers are performing at 60%-70% utilization of RAM, but only 30%-40% CPU utilization per server. And from what I have been hearing from the virtualization community - the feeling is generally the same. I wanted to compare what would be the optimal configuration for a server. Each server was a 2U IBM x3650 with 2 72GB hard disks (for the ESX OS), 2 power supplies, and 2 Intel PRO/1000T Dual NIC adapters. Shared storage is the same for both servers, so that is not something that I take into the equation here. The only difference between them was the amount of RAM in the servers. All the prices and part numbers are up to date from IBM, done with a tool called the IBM Standalone Solutions Configuration Tool (SSCT). The tool is updated once or twice a month and is extremely useful in configuring and pricing my servers.

64gb

128gb

Now the first thing that hit me was the sheer difference in server price. I mean I added 100% more RAM to the server, but the price of the server went up by almost 300%. That is because the 8GB chips are so expensive. Now I took building blocks of 40 VM's. To each block I added an ESX Ent. Plus License - an additional cost of $7,000. I assumed that vCenter was already in place, so this was not a factor in my calculations. The table below compares the two servers in blocks of 40 VM's.

comparison

Now you can always claim that a server with 80 VM's uses a lot more CPU than a server with only 40. But if you were paying attention at the beginning of the post, the load on a server with 40 VM's was going to be 30-40%, and therefore doubling it would bring the load up to 60-80%, which is well within acceptable limits. As you can see from the table above - the 128GB server came out cheaper on every level that could be compared to the 64GB server. Now of course I am not mentioning the savings in the reduction of the physical hardware, rack space, electricity, cooling - we all know the benefits.
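
The building-block arithmetic can be sketched in a few lines of Python. This only covers the host count and the projected CPU load from the numbers in the text (200 VM's, 40 or 80 per host, 30-40% CPU at 40 VM's). It deliberately leaves out prices, since those live in the comparison table, and it ignores HA headroom. The function names are made up for this illustration.

```python
import math


def hosts_needed(total_vms, vms_per_host):
    """Number of hosts required, rounding up to whole servers."""
    return math.ceil(total_vms / vms_per_host)


def projected_cpu_load(load_range, factor):
    """Scale a (low, high) CPU utilization range linearly by the VM density factor."""
    low, high = load_range
    return (low * factor, high * factor)


if __name__ == "__main__":
    total_vms = 200
    print("64GB hosts (40 VMs each): ", hosts_needed(total_vms, 40))
    print("128GB hosts (80 VMs each):", hosts_needed(total_vms, 80))
    print("CPU load at 80 VMs (%):   ", projected_cpu_load((30, 40), 2))
```

The linear scaling of CPU load is itself an assumption. Real workloads rarely double that neatly, so treat the result as a planning estimate.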

So what did I learn from this exercise?

  • Even though the price might seem scary - and a lot to pay for one server - in some cases it does pay off, if you do your calculations and planning.
  • 8GB Ram is expensive!
  • Every situation is different - so you have to do your planning per your requirements.
  • I liked the idea of building blocks
  • You have other considerations such as HA that will determine how many VM's you can cram onto one host - see Duncan's post on this subject

Now of course you can project this kind of calculation on any kind of configuration be it difference in RAM, HBA's, NIC's etc.

Your thoughts and comments are always welcome.

2009-07-01

vMotion issues (78%)

In the current series of posts I am writing on running a vSphere lab on ESX (parts 1, 2 and 3), I wanted to set up NFS shared storage between my 2 ESX hosts to test vMotion.

I ran into an interesting issue which I could hardly find any mention of on the web.

We all know that there are countless posts about vMotion failing at 10% or failing at 90%, but not anything about 78%. Well I hate to be picky, but this one was baffling me a bit. I found only one mention of this on the communities, but nothing else.

A bit more detail. I had connected two ESX hosts to an NFS share from Openfiler. There was no problem at all. Both hosts saw the storage. Created machines without any issues on both hosts. Only vMotion would fail – with a very ambiguous error.

vmotion_fail

Every single time at 78%. At first I thought it was because Promiscuous mode was not enabled on the NIC and on the vSwitch, so I changed that to enabled

Promiscuous

Did not help.

I tried to get information out of the vmware.log file of the VM, but the only thing I could see was this:

Jun 30 13:34:04.615: vmx| Running VMware ESX in a virtual machine or with some other virtualization products is not supported and may result in unpredictable behavior.  Do you want to continue?---------------------------------------

So maybe that was the issue? I asked hany_michael and the_crooked_toe if they had any issues with vMotion like this, but they did not, even though they were running a similar environment to mine. The line above appeared because I was running ESX as a VM. I would get it as well when powering on a VM, but the power-on would succeed.

esx_in_vm

I tried to go through the logs of the VM’s and was not getting more information from them either, besides that it could not find the file on the new host.

Turned on verbose logging on the vCenter

verbose

Did not get much either.

[2009-06-30 13:56:47.886 03756 error 'App'] [MIGRATE] (1246359385573990) error while tracking VMotion progress (RuntimeFault)

Since this was NFS I started to dive into the vmkernel logs of the ESX hosts at /var/log/vmkernel and found this:

ESX4-1

Jun 30 12:24:40 esx4-2 vmkernel: 0:12:18:19.207 cpu1:4396)WARNING: Swap: vm 4396: 2457: Failed to open swap file '/volumes/c31eba3f-9dca625f/win2k3/win2k3-4aed76bf.vswp': Not found

Jun 30 13:34:05 esx4-2 vmkernel: 0:13:27:43.732 cpu1:4433)WARNING: VMotion: 3414: 1246358033497547 D: Failed to reopen swap on destination: Not found

 

ESX4-2

Jun 30 13:14:55 esx4-1 vmkernel: 0:14:45:00.526 cpu1:4462)WARNING: Swap: vm 4462: 2457: Failed to open swap file '/volumes/c861a58d-45816333/win2k3_b/win2k3_b-65841149.vswp': Not found

Jun 30 13:14:55 esx4-1 vmkernel: 0:14:45:00.526 cpu1:4462)WARNING: VMotion: 3414: 1246356880465431 D: Failed to reopen swap on destination: Not found

Jun 30 13:14:55 esx4-1 vmkernel: 0:14:45:00.526 cpu1:4462)WARNING: Migrate: 295: 1246356880465431 D: Failed: Not found (0xbad0003)@0x41800da0e0d5

Now why would it not find the swap file? I mean both of the hosts are connected to the same storage.

Or were they??

Look at the log again

ESX4-1 - Failed to open swap file '/volumes/c861a58d-45816333/win2k3/win2k3-4aed76bf.vswp'

ESX4-2 - Failed to open swap file '/volumes/c31eba3f-9dca625f/win2k3/win2k3-4aed76bf.vswp'

See the difference? But how could that be? I remembered that I had run into this issue once before. Let me explain what was happening here. During vMotion the memory state of the VM is transferred from one ESX to the other. In the vmx file of the VM there is a configuration setting as to where the swap is located

sched.swap.derivedName = "/vmfs/volumes/c31eba3f-9dca625f/win2k3/win2k3-4aed76bf.vswp"

When the receiving host is ready to finalize the transfer, it has to take this file to read the swap memory of the VM. This is the only hard-coded path in a VM configuration file, and since the hosts were not seeing the same path, the machine would not migrate.

How did this happen?

When creating the datastores I did one from the GUI,

Add_Storage_3a

and one from the command line.

add_nfs2a

Subtle difference of a / but that is what made all the difference.

I removed the volume from one ESX server, created it again, and now the output from both hosts matches:

[root@esx4-1 ~]# ls -la /vmfs/volumes/
total 1028
drwxr-xr-x 1 root root  512 Jul  1 16:27 .
drwxrwxrwt 1 root root  512 Jun 30 14:16 ..
drwxr-xr-t 1 root root 1120 Jun 21 12:16 4a3dfa4c-17137398-672b-000c299e8aed
drwxrwsrwx 1   96   96 4096 Jul  1 00:16 c31eba3f-9dca625f
lrwxr-xr-x 1 root root   35 Jul  1 16:27 Local-ESX4-1 -> 4a3dfa4c-17137398-672b-000c299e8aed
lrwxr-xr-x 1 root root   17 Jul  1 16:27 nfs_fs1 -> c31eba3f-9dca625f


[root@esx4-2 win2k3]# ls -la /vmfs/volumes/
total 1028
drwxr-xr-x 1 root root  512 Jul  1 16:27 .
drwxrwxrwt 1 root root  512 Jun 30 00:07 ..
drwxr-xr-t 1 root root 1120 Jun 21 14:05 4a3e1408-450c30b1-ab94-000c293f26d7
drwxrwsrwx 1   96   96 4096 Jul  1 00:16 c31eba3f-9dca625f
lrwxr-xr-x 1 root root   35 Jul  1 16:27 Local-ESX4-2 -> 4a3e1408-450c30b1-ab94-000c293f26d7
lrwxr-xr-x 1 root root   17 Jul  1 16:27 nfs_fs1 -> c31eba3f-9dca625f

Two lessons I learned from this:

  1. Sometimes you cannot find an answer for everything on the internet – especially if you are using a new product (ESX4) and no-one has had these problems before
  2. Automate! Automate! Automate! When doing things with scripts, then you are less prone to errors like the one I made above.
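
In the spirit of lesson 2, here is a minimal Python sketch of a pre-vMotion sanity check: given the datastore label to volume ID mapping of each host (what `ls -la /vmfs/volumes/` shows as symlinks), it reports any label that points to a different volume ID on different hosts. How you collect the mapping is up to you and is not shown. The function name is made up, and which host held which ID before the fix is my reading of the logs above.

```python
def mismatched_datastores(mounts_by_host):
    """mounts_by_host: {host: {datastore_label: volume_id}}.

    Return {label: {host: volume_id}} for every label whose volume ID
    is not identical on all hosts that mount it.
    """
    by_label = {}
    for host, mounts in mounts_by_host.items():
        for label, volume_id in mounts.items():
            by_label.setdefault(label, {})[host] = volume_id
    return {
        label: hosts
        for label, hosts in by_label.items()
        if len(set(hosts.values())) > 1
    }


if __name__ == "__main__":
    # Volume IDs taken from the logs and listings in this post.
    before_fix = {
        "esx4-1": {"nfs_fs1": "c31eba3f-9dca625f"},
        "esx4-2": {"nfs_fs1": "c861a58d-45816333"},
    }
    after_fix = {
        "esx4-1": {"nfs_fs1": "c31eba3f-9dca625f"},
        "esx4-2": {"nfs_fs1": "c31eba3f-9dca625f"},
    }
    print("before:", mismatched_datastores(before_fix))
    print("after: ", mismatched_datastores(after_fix))
```

An empty result means every shared datastore resolves to the same path on every host, which is what the hard-coded swap path in the vmx file needs.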

Hope you enjoyed the ride!