Tuesday, 31 October 2017

Part 1 - vSAN ESXi Host Upgrade Precheck Tasks

This post is part of the vSAN ESXi host upgrade process.


Follow the steps below to run the prechecks before you start upgrading ESXi hosts used for vSAN. For detailed information, please refer to the VMware documentation.



Log in to the vCenter web client.
Select the vSAN cluster and go to the Monitor tab.
Select the vSAN tab and click Health.
Click Retest and Retest with online health to test the health of the vSAN cluster.

Once the health check completes, see whether the cluster reports any warnings or errors in the Health plugin or under the Issues tab.

Log in to Ruby vSphere Console (RVC). Leave this session open; we will come back to it later.

Check for possible inaccessible objects and VMs in the cluster

Under the Health plugin, click Data and select vSAN object health.
In the object health overview, all objects should be healthy. If any category other than healthy shows a non-zero count, fix the issue before proceeding with the upgrade.

Check overall status of all objects

Use the vsan.check_state command to get the status of all objects.
/localhost/datacenter01/computers/vsan01> vsan.check_state .
2017-10-17 15:39:18 +0000: Step 1: Check for inaccessible vSAN objects
Detected 0 objects to be inaccessible

2017-10-17 15:39:18 +0000: Step 2: Check for invalid/inaccessible VMs

2017-10-17 15:39:18 +0000: Step 3: Check for VMs for which VC/hostd/vmx are out of sync
Did not find VMs for which VC/hostd/vmx are out of sync
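
If you save the vsan.check_state output to a text file, a small script can confirm the result for you. Below is a minimal Python sketch, assuming the output looks exactly like the sample above. The function names are my own and not part of RVC, and step 2 (invalid/inaccessible VMs) is not checked because the sample does not show what a failure looks like there.

```python
import re

# Sample text copied from the vsan.check_state output shown above.
SAMPLE = """\
2017-10-17 15:39:18 +0000: Step 1: Check for inaccessible vSAN objects
Detected 0 objects to be inaccessible

2017-10-17 15:39:18 +0000: Step 2: Check for invalid/inaccessible VMs

2017-10-17 15:39:18 +0000: Step 3: Check for VMs for which VC/hostd/vmx are out of sync
Did not find VMs for which VC/hostd/vmx are out of sync
"""


def inaccessible_objects(output):
    """Return the inaccessible-object count reported in step 1."""
    match = re.search(r"Detected (\d+) objects to be inaccessible", output)
    if match is None:
        raise ValueError("no inaccessible-object line found")
    return int(match.group(1))


def check_state_is_clean(output):
    """True when step 1 reports 0 objects and step 3 reports no out-of-sync VMs."""
    in_sync = "Did not find VMs for which VC/hostd/vmx are out of sync" in output
    return inaccessible_objects(output) == 0 and in_sync


print(check_state_is_clean(SAMPLE))
```

Anything other than a clean result should be investigated before you continue with the upgrade.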

Virtual machine compliance status


Make sure all VMs are compliant with their VM storage policy.

Go to the vSphere web client, select the vSAN cluster, and open the VMs view. Make sure all VMs are compliant with the VM storage policy.


How to test VM storage policy compliance?

From the web client, select the VM and go to the Monitor tab.
Click the green button to check the VM's compliance with its VM storage policy.
If you see any errors, or if any VM is not compliant with its policy, fix the issues before proceeding.

Check utilization of vSAN datastore

Make sure at least 30% free space is available on the vSAN datastore. If an ESXi host fails during the upgrade process, you need enough capacity to rebuild its components.

What if one Host fails?

From the RVC console, see what capacity remains if one host fails.
/localhost/datacenter01/computers/vsan01> vsan.whatif_host_failures .
Simulating 1 host failures:
+-----------------+------------------------------+-----------------------------------+
| Resource        | Usage right now              | Usage after failure/re-protection |
+-----------------+------------------------------+-----------------------------------+
| HDD capacity    |  47% used (24442.63 GB free) |  56% used (16820.20 GB free)      |
| Components      |   1% used (53259 available)  |   2% used (44259 available)       |
| RC reservations |   0% used (0.01 GB free)     |   0% used (0.01 GB free)          |
+-----------------+------------------------------+-----------------------------------+
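
To turn this table into a simple pass/fail check, you can read the two percentages from the HDD capacity row. Below is a minimal Python sketch, assuming the row format shown above. Applying the 30% free space rule to the "after failure" value is my own assumption, and the function names are hypothetical.

```python
import re
from typing import Tuple

# HDD capacity row copied from the vsan.whatif_host_failures output above.
SAMPLE_ROW = "| HDD capacity    |  47% used (24442.63 GB free) |  56% used (16820.20 GB free)      |"


def hdd_usage(row: str) -> Tuple[int, int]:
    """Return (used % right now, used % after failure/re-protection)."""
    percents = re.findall(r"(\d+)% used", row)
    if len(percents) != 2:
        raise ValueError("expected two 'N% used' values in the row")
    return int(percents[0]), int(percents[1])


def enough_free_after_failure(row: str, min_free_percent: int = 30) -> bool:
    """True when free capacity after one host failure stays at or above the minimum."""
    _, used_after = hdd_usage(row)
    return 100 - used_after >= min_free_percent


print(hdd_usage(SAMPLE_ROW))
print(enough_free_after_failure(SAMPLE_ROW))
```

In the sample, 56% used after a failure leaves 44% free, so the check passes.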


From the Health plugin, check Limits after 1 additional host failure.

Proactive Disk Rebalance

From the vSAN Health plugin in the web client, see if any disks need to be rebalanced.
If a disk data rebalance is needed, run the proactive disk rebalance before starting the ESXi upgrade.

Run the vsan.proactive_rebalance_info command to see whether a proactive disk rebalance is needed. If so, run the vsan.proactive_rebalance command to start it. A proactive rebalance can take up to 24 hours to complete.

/localhost/datacenter01/computers/vsan01> vsan.proactive_rebalance_info .
2017-10-17 15:50:17 +0000: Retrieving proactive rebalance information from host dd-bld01-vsan01-04.d2d.mydomain.local ...
………………output truncated…….
Proactive rebalance is not running!
Max usage difference triggering rebalancing: 30.00%
Average disk usage: 47.00%
Maximum disk usage: 75.00% (62.00% above minimum disk usage)
Imbalance index: 34.00%
Disks to be rebalanced:
+----------------------+---------------------------------------+----------------------------+--------------+
| DisplayName          | Host                                  | Disk usage above threshold | Data to move |
+----------------------+---------------------------------------+----------------------------+--------------+
| naa.500a075112cbe790 | dd-bld01-vsan01-02.d2d.mydomain.local | 8.00%                      | 33.8774 GB   |
| naa.500a075112cbe785 | dd-bld01-vsan01-02.d2d.mydomain.local | 9.00%                      | 42.3468 GB   |
| naa.500a075112cbe779 | dd-bld01-vsan01-02.d2d.mydomain.local | 32.00%                     | 237.1421 GB  |
| naa.500a075112cbe309 | dd-bld01-vsan01-02.d2d.mydomain.local | 10.00%                     | 50.8162 GB   |
| naa.500a075112cbe2f5 | dd-bld01-vsan01-02.d2d.mydomain.local | 32.00%                     | 237.1421 GB  |
+----------------------+---------------------------------------+----------------------------+--------------+
………………output truncated…….
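
To get a feel for how much data the rebalance will move in total, you can add up the "Data to move" column. Below is a minimal Python sketch, assuming the four-column table format shown above with sizes in GB. The function name is my own.

```python
# Table rows copied from the vsan.proactive_rebalance_info output above.
SAMPLE = """\
| DisplayName          | Host                                  | Disk usage above threshold | Data to move |
| naa.500a075112cbe790 | dd-bld01-vsan01-02.d2d.mydomain.local | 8.00%                      | 33.8774 GB   |
| naa.500a075112cbe785 | dd-bld01-vsan01-02.d2d.mydomain.local | 9.00%                      | 42.3468 GB   |
| naa.500a075112cbe779 | dd-bld01-vsan01-02.d2d.mydomain.local | 32.00%                     | 237.1421 GB  |
| naa.500a075112cbe309 | dd-bld01-vsan01-02.d2d.mydomain.local | 10.00%                     | 50.8162 GB   |
| naa.500a075112cbe2f5 | dd-bld01-vsan01-02.d2d.mydomain.local | 32.00%                     | 237.1421 GB  |
"""


def data_to_move_gb(table):
    """Sum the 'Data to move' column (in GB) over all disk rows."""
    total = 0.0
    for line in table.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Header and border rows are skipped: only disk rows end in " GB".
        if len(cells) == 4 and cells[3].endswith(" GB"):
            total += float(cells[3][:-3])
    return total


print(round(data_to_move_gb(SAMPLE), 4))
```

For the sample above this adds up to roughly 601 GB.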

Check Physical Disk Utilization

From the RVC console, run the vsan.disks_stats . command to get usage and health status for all disks.
If any disk is more than 80% used, run the proactive disk rebalance.
Example output:
/localhost/datacenter01/computers/vsan01> vsan.disks_stats .
2017-10-26 15:52:01 +0000: Fetching vSAN disk info from dd-bld01-vsan01-02.d2d.mydomain.local (may take a moment) ...
………………output truncated…….
2017-10-26 15:52:06 +0000: Done fetching vSAN disk infos
+----------------------+---------------------------------------+-------+------+-----------+---------+----------+------------+----------+----------+------------+---------+----------+---------+
|                      |                                       |       | Num  | Capacity  |         |          | Physical   | Physical | Physical | Logical    | Logical | Logical  | Status  |
| DisplayName          | Host                                  | isSSD | Comp | Total     | Used    | Reserved | Capacity   | Used     | Reserved | Capacity   | Used    | Reserved | Health  |
+----------------------+---------------------------------------+-------+------+-----------+---------+----------+------------+----------+----------+------------+---------+----------+---------+
| naa.500a075112c9eaca | dd-bld01-vsan01-01.d2d.mydomain.local | SSD   | 0    | 558.91 GB | 0.00 %  | 0.00 %   | N/A        | N/A      | N/A      | N/A        | N/A     | N/A      | OK (v3) |
| naa.500a075112cbe775 | dd-bld01-vsan01-01.d2d.mydomain.local | MD    | 7    | 846.94 GB | 56.34 % | 24.21 %  | 3387.74 GB | 56.34 %  | 7.03 %   | 8942.50 GB | 8.98 %  | 2.29 %   | OK (v3) |
| naa.500a075112cbe7b8 | dd-bld01-vsan01-01.d2d.mydomain.local | MD    | 6    | 846.94 GB | 56.34 % | 1.31 %   | 3387.74 GB | 56.34 %  | 7.03 %   | 8942.50 GB | 3.61 %  | 0.12 %   | OK (v3) |
| naa.500a075112cbe77e | dd-bld01-vsan01-01.d2d.mydomain.local | MD    | 8    | 846.94 GB | 56.34 % | 1.31 %   | 3387.74 GB | 56.34 %  | 7.03 %   | 8942.50 GB | 11.51 % | 0.12 %   | OK (v3) |
| naa.500a075112cbe77d | dd-bld01-vsan01-01.d2d.mydomain.local | MD    | 7    | 846.94 GB | 56.34 % | 1.31 %   | 3387.74 GB | 56.34 %  | 7.03 %   | 8942.50 GB | 11.48 % | 0.12 %   | OK (v3) |
+----------------------+---------------------------------------+-------+------+-----------+---------+----------+------------+----------+----------+------------+---------+----------+---------+
………………output truncated…….
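
With many disks it is easy to miss one that crosses the 80% mark. Below is a minimal Python sketch that flags such disks, assuming the column layout shown above, where the sixth column is the capacity "Used" percentage. The function name and the shortened sample are my own.

```python
# Two data rows copied from the vsan.disks_stats output above.
SAMPLE = """\
| naa.500a075112c9eaca | dd-bld01-vsan01-01.d2d.mydomain.local | SSD   | 0    | 558.91 GB | 0.00 %  | 0.00 %   | N/A        | N/A      | N/A      | N/A        | N/A     | N/A      | OK (v3) |
| naa.500a075112cbe775 | dd-bld01-vsan01-01.d2d.mydomain.local | MD    | 7    | 846.94 GB | 56.34 % | 24.21 %  | 3387.74 GB | 56.34 %  | 7.03 %   | 8942.50 GB | 8.98 %  | 2.29 %   | OK (v3) |
"""


def disks_over(table, threshold=80.0):
    """Return (disk name, used %) for every disk above the threshold."""
    flagged = []
    for line in table.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip border and header rows, which have no percentage in column 6.
        if len(cells) < 6 or not cells[5].endswith("%"):
            continue
        used = float(cells[5].rstrip("% "))
        if used > threshold:
            flagged.append((cells[0], used))
    return flagged


print(disks_over(SAMPLE))
```

An empty list means no disk is above the threshold, which is the case for the sample rows.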

VM Backups


As a best practice, make sure you have backup copies of your VMs and data. In real life it may not be possible to back up all VMs, but you should at least have backups of critical VMs.

Hardware Compatibility 


Check whether the server hardware supports the new ESXi and vSAN versions. All server components should be on the ESXi and vSAN HCL. Make sure the required firmware and drivers are installed.
Check and verify the current driver and firmware for the I/O controller, the disk firmware (cache and capacity), and the network card drivers and firmware.

List ESXi SCSI Devices

Log in to the ESXi host using SSH and run the esxcfg-scsidevs -a command to list all SCSI devices.
[root@dd-bld01-vsan01-01:~] esxcfg-scsidevs -a
vmhba0  fnic              link-n/a  fc.20000025ef98a005:20000025ef9aa005    (0000:0a:00.0) Cisco Systems Inc Cisco VIC FCoE HBA Driver
vmhba1  ahci              link-n/a  sata.vmhba1                             (0000:00:1f.2) Intel Corporation Wellsburg AHCI Controller
vmhba2  fnic              link-n/a  fc.20000025ef98a005:20000025ef9bb005    (0000:0b:00.0) Cisco Systems Inc Cisco VIC FCoE HBA Driver
vmhba3  megaraid_sas      link-n/a  unknown.vmhba3                          (0000:03:00.0) LSI / Symbios Logic MegaRAID SAS Invader Controller
vmhba32 usb-storage       link-n/a  usb.vmhba32                             () USB
vmhba33 ahci              link-n/a  sata.vmhba33                            (0000:00:1f.2) Intel Corporation Wellsburg AHCI Controller
vmhba34 ahci              link-n/a  sata.vmhba34                            (0000:00:1f.2) Intel Corporation Wellsburg AHCI Controller
vmhba35 ahci              link-n/a  sata.vmhba35                            (0000:00:1f.2) Intel Corporation Wellsburg AHCI Controller
vmhba36 ahci              link-n/a  sata.vmhba36                            (0000:00:1f.2) Intel Corporation Wellsburg AHCI Controller
vmhba37 ahci              link-n/a  sata.vmhba37                            (0000:00:1f.2) Intel Corporation Wellsburg AHCI Controller

Get RAID controller driver version


[root@dd-bld01-vsan01-01:~] vmkload_mod -s megaraid_sas | grep Version
Version: Version 6.606.06.00.1vmw, Build: 1331820, Interface: 9.2 Built on: Nov 26 2014


Get the Device Vendor ID

In the example below, vmhba3 is my RAID controller device.
Get the Vendor ID (VID), Device ID (DID), Sub-Vendor ID (SVID), and Sub-Device ID (SDID) using the vmkchdev command.
[root@dd-bld01-vsan01-01:~] vmkchdev -l | grep vmhba3
0000:03:00.0 1000:005d 1137:00db vmkernel vmhba3
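
The second field of this line is VID:DID and the third is SVID:SDID. Below is a minimal Python sketch that splits the line into the four IDs needed for the compatibility lookup, assuming the five-field format shown above. The function name is my own.

```python
# Line copied from the vmkchdev output above.
SAMPLE = "0000:03:00.0 1000:005d 1137:00db vmkernel vmhba3"


def parse_vmkchdev(line):
    """Split one 'vmkchdev -l' line into PCI address, the four IDs and the alias."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("unexpected vmkchdev line: %r" % line)
    vid, did = fields[1].split(":")
    svid, sdid = fields[2].split(":")
    return {
        "address": fields[0],
        "VID": vid,
        "DID": did,
        "SVID": svid,
        "SDID": sdid,
        "alias": fields[-1],
    }


print(parse_vmkchdev(SAMPLE))
```

For the sample line this gives VID 1000, DID 005d, SVID 1137 and SDID 00db.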

Check VMware vSAN Compatibility

With the above information in hand, visit the VMware hardware compatibility page to see whether your device is compatible with vSAN and ESXi.
The easiest way to perform the hardware compatibility check is the captain-vsan.com site.
Visit https://hcl.captain-vsan.com/, enter the VID, DID, SVID, and SDID, and click Go. This redirects you to VMware's vSAN compatibility information URL for the device.
In this example, the Cisco 12G SAS RAID controller is supported with ESXi 6.5 U1 and other versions.


As per the VMware Compatibility Guide, the Cisco 12G SAS RAID controller is supported in vSAN All-Flash and Hybrid configurations. The guide also lists the supported device driver and firmware versions.

Disk Firmware Check

Verify the current firmware of the drives in use for vSAN to see whether they also need an upgrade.

[root@dd-bld01-vsan01-01:~] esxcli storage core device list | egrep 'Display Name:|Size:|Model:|Revision:'
  Display Name: Local ATA Disk (naa.500a075112cbe774)
  Has Settable Display Name: true
  Size: 915715
  Model: MICRON_M510DC_MT
  Revision: 0013
  Queue Full Sample Size: 0
……output truncated…………….
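
Note that the egrep pattern also matches "Has Settable Display Name" and "Queue Full Sample Size", so the output contains lines you do not need. Below is a minimal Python sketch that keeps only the four wanted fields per disk, assuming the output format shown above. The function name is my own.

```python
# Sample copied from the esxcli output above (first disk only).
SAMPLE = """\
  Display Name: Local ATA Disk (naa.500a075112cbe774)
  Has Settable Display Name: true
  Size: 915715
  Model: MICRON_M510DC_MT
  Revision: 0013
  Queue Full Sample Size: 0
"""

WANTED = ("Display Name", "Size", "Model", "Revision")


def parse_devices(output):
    """Return one dict per disk with display name, size, model and firmware revision."""
    devices = []
    for line in output.splitlines():
        key, sep, value = line.strip().partition(": ")
        if not sep or key not in WANTED:
            continue  # drops the extra lines matched by the broad egrep pattern
        if key == "Display Name":
            devices.append({})  # a Display Name line starts a new disk record
        if devices:
            devices[-1][key] = value.strip()
    return devices


for disk in parse_devices(SAMPLE):
    print(disk["Model"], disk["Revision"])
```

The model and revision pairs are what you compare against the HCL entry for the disk.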


Note down the disk model and visit the VMware HCL site to find out whether this type of disk is compatible with the vSAN/ESXi version to which you are upgrading the ESXi hosts.


In this example, the Cisco disk is compatible with vSAN All-Flash.

vSAN default repair delay ClomRepairDelay

Change the vSAN default repair delay setting, ClomRepairDelay, from 60 minutes to 120 or 180 minutes so that vSAN does not start rebuilding objects during the ESXi upgrade process.
This VMware vSAN advanced setting specifies how long vSAN waits before rebuilding a disk object after a host is either in a failed state or in Maintenance Mode. By default, the repair delay is 60 minutes; in the event of a host failure, vSAN waits 60 minutes before rebuilding any disk objects located on that host, because it cannot be certain whether the failure is transient or permanent.
In this case, we know the host is down because of the upgrade, so we do not want the object rebuild process to kick in.
Follow the VMware KB article for this setting to change the ClomRepairDelay value.
Alternatively, follow the steps below.
Using SSH, log in to each ESXi host that is part of the vSAN cluster.
Run the command below to set the required value and restart the clomd service.
#esxcli system settings advanced set -o /VSAN/ClomRepairDelay -i 120 ; /etc/init.d/clomd restart ; date
With this setting, vSAN will not try to rebuild vSAN objects for 120 minutes.
Once you finish upgrading all hosts, reset the value to the default of 60 minutes.




Monday, 30 October 2017

Ruby vSphere Console (RVC) vSAN Admins Friend

Ruby vSphere Console (RVC) is a console interface that helps you manage and monitor a vSAN environment.
RVC comes bundled with vCenter, and no additional installation is required.
It is available for both Windows-based vCenter and the vCenter Appliance.

RVC helps you manage your vSAN servers and gives you a detailed view of them. You can also run the vSAN Observer service using RVC.

How to log in to RVC?

1. Enable SSH on the vCenter Appliance server.

2. Log in to vCenter using SSH.

3. Enter the shell command to go to the bash shell on vCenter.
     >shell

4. At the bash prompt, enter the command below and press Enter to start the RVC console, then enter the SSO Administrator password to log in.
# rvc Administrator@vsphere.local@localhost

e.g. 
root@dc1-vcenter01 [ ~ ]# rvc Administrator@vsphere.local@localhost
Install the "ffi" gem for better tab completion.
password:
0 /
1 localhost/
>

Navigating using RVC

#cd - change directory. Use the object number to cd into it.
e.g. cd 1
#ls - display items; then use the cd command.
#mark - create an alias for any object.
e.g. the command below creates the ~cluster alias for my vSAN cluster.

#mark cluster /localhost/AllenDC-VA/computers/vsan01

To use an alias, add ~ in front of the alias/mark name.
#vsan.check_state ~cluster

root@dc1-vcenter01 [ ~ ]# rvc Administrator@vsphere.local@localhost
Install the "ffi" gem for better tab completion.
password:
0 /
1 localhost/
> cd 1
/localhost> ls
0 ChiDC-IL (datacenter)
1 MalDC-PA (datacenter)
2 VanDC, BC (datacenter)
3 Decomission (datacenter)
4 DCtown-MO (datacenter)
5 DCeigh-NC (datacenter)
6 DCton-MA (datacenter)
7 AllenDC-VA (datacenter)
/localhost> cd 7
/localhost/AllenDC-VA> ls
0 storage/
1 computers [host]/
2 networks [network]/
3 datastores [datastore]/
4 vms [vm]/
/localhost/AllenDC-VA> cd 1
/localhost/AllenDC-VA/computers> ls
0 gla-vsan01 (cluster): cpu 40 GHz, memory 215 GB
/localhost/AllenDC-VA/computers> cd 0
/localhost/AllenDC-VA/computers/gla-vsan01> ls
0 hosts/
1 resourcePool [Resources]: cpu 40.61/40.61/normal, mem 215.54/215.54/normal
/localhost/AllenDC-VA/computers/gla-vsan01> mark cluster /localhost/AllenDC-VA/computers/gla-vsan01
/localhost/AllenDC-VA/computers/gla-vsan01> vsan.check_state ~cluster

Or you can use the command below (notice the . after the command).
/localhost/AllenDC-VA/computers/gla-vsan01> vsan.check_state .


Useful vSAN Commands

vsan.disks_stats - manage disk groups and monitor the health of physical disks
vsan.check_state - troubleshoot data unavailability and understand object health in the vSAN cluster
vsan.resync_dashboard - check data resync progress, for example after changing storage policies
vsan.whatif_host_failures - understand vSAN's ability to tolerate node failures
vsan.proactive_rebalance - start a proactive data rebalance on physical disks
vsan.proactive_rebalance_info - check the status of a proactive data rebalance
vsan.check_limits - check vSAN object limits
vsan.cluster_info - get detailed information about the vSAN cluster
vsan.disk_object_info - get information about all vSAN objects on a given physical disk
vsan.disks_info - get physical disk information
vsan.vm_object_info - get vSAN information for a VM
vsan.vmdk_stats - get VMDK details
vsan.observer ~cluster --run-webserver --force - start a web service that collects vSAN performance statistics. You can access the vSAN Observer performance data by visiting https://vCenterServer_hostname_or_IP_Address:8010.




How to Upgrade vSAN ESXi Hosts

ESXi hosts participating in a vSAN cluster can be upgraded using Update Manager or ESXCLI, as you would upgrade any other ESXi host. However, a vSAN cluster needs special attention before you plan the upgrade.

Be extra careful while upgrading these ESXi hosts: make sure the hardware RAID controller is supported and that you install supported firmware and drivers, supported disk firmware, and so on.
If you are not sure about anything, do not proceed with the upgrade; create a support request with the VMware support team.

You can perform the vSAN health check using the vSAN Health plugin and the RVC console.
In newer versions of vSphere, VMware has integrated Update Manager with vSAN. Based on your vSAN configuration, Update Manager creates a patch baseline and assigns it to the vSAN cluster.

Do the upgrade precheck before upgrading each vSAN ESXi host. You can do the precheck using the web client and also using RVC.

I have divided the upgrade process into four parts. See the step-by-step process for upgrading the ESXi hosts of a vSAN cluster.

How to Upgrade vSAN ESXi Hosts




Sunday, 29 October 2017

How to Get ESXi Boot Device

How to get ESXi boot device?

If you need to see the current boot device of ESXi, that is, the device ESXi is running from, follow the steps below.


1. Enable SSH on the ESXi host.

2. Log in to the ESXi server as root.

3. Run the command below to get the boot volume.

#ls -l /bootbank | awk -F"-> " '{print $2}'

Example output:
[root@esxi01:~] ls -l /bootbank | awk -F"-> " '{print $2}'
/vmfs/volumes/50705ae3-b08ce25f-a1b8-bc56258253ad

4. Copy the volume path and run the command below to get more details about the volume.

#vmkfstools -P vmfsVolumeID

[root@esxi01:~] vmkfstools -P /vmfs/volumes/50705ae3-b08ce25f-a1b8-bc56258253ad
vfat-0.04 (Raw Major Version: 0) file system spanning 1 partitions.
File system label (if any):
Mode: private
Capacity 261853184 (63929 file blocks * 4096), 90677248 (22138 blocks) avail, max supported file size 0
UUID: 50705ae3-b08ce25f-a1b8-bc56258253ad
Partitions spanned (on "disks"):
       eui.00a0504658335330:5
Is Native Snapshot Capable: NO

5. From the above output, copy the disk ID (the part before the partition number) and run the command below to get the device details.

#esxcli storage core device list | grep -A27 deviceID

[root@esxi01:~] esxcli storage core device list | grep -A27 eui.00a0504658335330
eui.00a0504658335330
  Display Name: Local USB Direct-Access (eui.00a0504658335330)
  Has Settable Display Name: false
  Size: 30436
  Device Type: Direct-Access
  Multipath Plugin: NMP
  Devfs Path: /vmfs/devices/disks/eui.00a0504658335330
  Vendor: Cypress
  Model: SDRAID
  Revision: 0000
  SCSI Level: 2
  Is Pseudo: false
  Status: on
  Is RDM Capable: false
  Is Local: true
  Is Removable: true
  Is SSD: false
  Is VVOL PE: false
  Is Offline: false
  Is Perennially Reserved: false
  Queue Full Sample Size: 0
  Queue Full Threshold: 0
  Thin Provisioning Status: unknown
  Attached Filters:
  VAAI Status: unsupported
  Other UIDs: vml.0000000000766d68626133323a303a30
  Is Shared Clusterwide: false
  Is Local SAS Device: false
  Is SAS: false
  Is USB: true
  Is Boot USB Device: true
  Is Boot Device: true
  Device Max Queue Depth: 1
  No of outstanding IOs with competing worlds: 32

The above output shows that this ESXi host booted from, and is running from, a USB device.
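
Steps 4 and 5 can also be done on saved command output. Below is a minimal Python sketch, assuming the vmkfstools -P and esxcli output formats shown above. The function names and the shortened samples are my own. Only the trailing partition number is dropped from the disk entry, so an mpx-style name with extra colons is kept whole.

```python
# Shortened samples copied from the command outputs above.
VMKFSTOOLS_SAMPLE = """\
UUID: 50705ae3-b08ce25f-a1b8-bc56258253ad
Partitions spanned (on "disks"):
       eui.00a0504658335330:5
Is Native Snapshot Capable: NO
"""

DEVICE_SAMPLE = """\
eui.00a0504658335330
  Display Name: Local USB Direct-Access (eui.00a0504658335330)
  Is USB: true
  Is Boot USB Device: true
  Is Boot Device: true
"""


def boot_device_id(output):
    """Return the device ID from 'vmkfstools -P' output, without the partition number."""
    for line in output.splitlines():
        entry = line.strip()
        if entry.startswith(("mpx.", "naa.", "eui.")):
            return entry.rsplit(":", 1)[0]
    raise ValueError("no mpx/naa/eui device line found")


def device_flags(output):
    """Collect every true/false property of a device into a dict of booleans."""
    flags = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep and value in ("true", "false"):
            flags[key] = value == "true"
    return flags


print(boot_device_id(VMKFSTOOLS_SAMPLE))
print(device_flags(DEVICE_SAMPLE))
```

For the sample host, the flags confirm a USB boot device.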

The above process is simple and works well for a single host.
If you have a large number of ESXi hosts and want to export the boot device of each, you can automate it using PowerShell and shell scripting.

How to Export ESXi Boot Device?

  1. Enable SSH on all ESXi hosts for which you want to get the boot device. You can use a PowerShell script to enable the SSH service on many ESXi hosts at once.

  2. Create a text file with the commands below and save it as c:\temp\ESXi-boot-device-cmd.txt

b=`ls -l /bootbank | awk -F"-> " '{print $2}'`
d=`vmkfstools -P $b | egrep 'mpx|naa|eui' | awk -F":" '{print $1 }' | sed -e 's/^[ \t]*//'`
esxcli storage core device list | grep -A27  $d | egrep 'Display Name:|Size:|Multipath Plugin:|Devfs Path:|Vendor:|Model:|Revision:|Is RDM Capable:|Is Local:|Is Removable:|Is Shared Clusterwide:|Is USB:|Is Boot USB Device:|Is Boot Device:' | awk -F":" '{print $2}' |sed  's/ //g' | awk -vRS="" -vOFS=',' '$1=$1'

3. Download plink.exe and save it in the c:\temp\ folder.

https://the.earth.li/~sgtatham/putty/latest/w32/plink.exe
Plink is a command-line connection tool similar to Unix SSH.

What is plink?

4. Below is a PowerShell script to export the boot device of all ESXi hosts.

Change the variables below in the script -
$folder - folder where you saved the ESXi command file and plink.exe
$vcenters - vCenter name(s)
$password - your ESXi root password.

This script connects to vCenter and gets all ESXi hosts connected to it. It then connects to each ESXi host using SSH, executes the ESXi command file, and finally exports the boot device details to the $vcenter-boot-device-info.csv file.


$folder = "C:\temp"
$vcenters = "vcenter01"
$password = 'EsxiPassword'

# Full paths to plink.exe and the command file created in step 2
$plink = Join-Path $folder "plink.exe"
$cmdFile = Join-Path $folder "ESXi-boot-device-cmd.txt"

foreach ( $vcenter in $vcenters ) {

   Connect-VIServer $vcenter

   $vmhosts = Get-VMHost | Where-Object { $_.ConnectionState -ne "NotResponding" } | Sort-Object Name

   $report = @()

   foreach ( $vmhost in $vmhosts.Name ) {

      $details = " " | Select-Object HostName,DeviceName,SettableDisplay,Size,MPP,DevfSPath,Vendor,Model,Revision,RDM,Local,Removable,QueueSize,isShared,isUSB,BootUSB,isBootDevice

      $details.HostName = $vmhost

      # First connection answers "y" to the plink host key prompt so the key gets cached
      echo y | & $plink -ssh "root@$vmhost" -pw $password hostname
      # Second connection runs the command file on the host and captures the comma separated output
      $hostdetails = & $plink -ssh "root@$vmhost" -pw $password -m $cmdFile
      $details.DeviceName = $hostdetails

      $report += $details
   }
   $report | Export-Csv -NoTypeInformation -Path (Join-Path $folder "$vcenter-boot-device-info.csv")
}

Once you have the CSV, open it in Excel, select the cells in column B starting at B2, and use Text to Columns with comma as the delimiter to split the values into separate cells.
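
If you would rather not do the split in Excel, the same can be done in a script. Below is a minimal Python sketch, assuming the DeviceName column holds the 16 comma-separated values in the order of the column names used in the PowerShell script. The sample row is hypothetical: I built it from the esxi01 output shown earlier, and the function name is my own.

```python
import csv
import io

# Column names taken from the Select list in the PowerShell script (after HostName).
COLUMNS = ["DeviceName", "SettableDisplay", "Size", "MPP", "DevfSPath", "Vendor",
           "Model", "Revision", "RDM", "Local", "Removable", "QueueSize",
           "isShared", "isUSB", "BootUSB", "isBootDevice"]

# Hypothetical CSV content with only the two filled columns, for illustration.
SAMPLE_CSV = (
    '"HostName","DeviceName"\n'
    '"esxi01","LocalUSBDirect-Access(eui.00a0504658335330),false,30436,NMP,'
    '/vmfs/devices/disks/eui.00a0504658335330,Cypress,SDRAID,0000,false,true,'
    'true,0,false,true,true,true"\n'
)


def expand_rows(csv_text):
    """Split the packed DeviceName value of each row into its own named columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        values = row["DeviceName"].split(",")
        expanded = {"HostName": row["HostName"]}
        expanded.update(zip(COLUMNS, values))
        rows.append(expanded)
    return rows


for host in expand_rows(SAMPLE_CSV):
    print(host["HostName"], host["Vendor"], host["Model"], host["isBootDevice"])
```

To use it on the real export, read the CSV file into a string and pass it to expand_rows.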

E.g. output