Custom vSphere Template import into AWS as AMI

My current customer asked if they could use the same vSphere template as an AWS AMI. The current vSphere template has a custom disk layout to help them troubleshoot issues. The default single disk layout for AMI’s actually hinders their troubleshooting methodology.

Aside from the custom disk layout, I know VMtools would have to be replaced with cloud-init. Sure no problem. RIGHT! Well actually it wasn’t that hard.

Well I was finally able get it to work, and learned a bunch along the way. Those lessons include,

The RHEL default DHCP client is incompatible with AWS.
EFI bios is only supported in larger, more expensive instances.
AWS VM image import.
Make sure to enable ‘disable_vmware_customization’, if that made sense.

Requirements

AWS roles, policies and permissions per this document.
S3 bucket (packer-import-example) to store the VMDK until it is imported.
Basic IAM user (packer) with the correct permissions assigned (see above).
vSphere environment to build the image.
A RHEL 8.x DVD ISO for installation.
HTTP repo to store the kickstart file.

Now down to brass tacks. To be honest it took lots of trial and error (mostly error) to get this working right. For example, on one pass Cloud-Init wouldn’t run on the imported AMI. After looking at cloud.cfg I noticed ‘disable_vmware_customization’ was set to false instead of ‘true’. Another error occurred when my first import attempt failed as the machine did not have a ‘DHCP client’. That was odd as it booted up fine in vSphere and got an IP Address. Apparently AWS only supports certain DHCP clients. Go figure.

Eventually the machine booted properly in AWS, with the user-data applied correctly. The working user-data is in the repo’s cloud-init directory.

And my super simple vRAC blue print even worked. This simple BP adds a new user, assigns a password, and grants it SUDO permissions.

A couple of notes on the packer amazon-import post processor. Those include,

The images are encrypted by default, even tho the default for ‘ami_encrypt’ is false by default.
‘ami_name’ requires the AWS permission of ec2:CopyImage on the policy for the import role.
Don’t use the default encryption key if you wish to share this. You’ll need a Customer Managed Key (CMK). The import role (vmimport) will need to be a key user. You can set this with ‘ami_kms_key’ set to the Id of the CMK (i.e., ebea!!!!!!!!-aaaa-zzzz-xxxxxxxxxxxxxx)
The CMK needs to be shared with the target customer before sharing the AMI. ‘ami_org_arns’ allows you to set the organizations you’d like to share the AMI with.
There are lots of import options, you can check them out here.

This working example, plus others I’ve been working are available in this github repo.

Now onto another vRAC adventure.

Packer HCL and PVSCSI drivers

Just this last week I was updating an old Packer build configuration from JSON to HCL. But for the life of me could not get a new vSphere Windows 2019 machine to find a disk attached to Para Virtualized disk controller.

I repeatedly received this error after the machine new machine booted.

In researching error 0x80042405 in C:\Windows\pather\setuperr.log, I found it simply could not find the attached disk.

After some research I determined the PVSCSI drivers added to the floppy disk where not being discovered. Or more specifically the new machine didn’t know to search the floppy for additional drivers.

I finally found a configuration section for my autounattend.xml file which would fix it after an almost exhaustive online search.

The magic section reads as follows.

<unattend xmlns="urn:schemas-microsoft-com:unattend">
    <settings pass="windowsPE">        
       <component name="Microsoft-Windows-PnpCustomizationsWinPE" processorArchitecture="amd64" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS" xmlns:wcm="http://schemas.microsoft.com/WMIConfig/2002/State" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
            <DriverPaths>
                <PathAndCredentials wcm:action="add" wcm:keyValue="A">
                    <!-- pvscsi-Windows8.flp -->
                    <Path>A:\</Path>
                </PathAndCredentials>
            </DriverPaths>
        </component>
...
    </settings>

After adding this section, the new vSphere Windows machine easily found the additional drivers.

This was tested against Windows 2019 in both AWS and vSphere deployments.

The vSphere deployment took an hour, mostly waiting for the updates to be applied. AWS takes significantly less time as I’m using the most recently updated image they provide.

The working files are located in the packer-hcl-vsphere-aws github repo.

Code Stream Nested Esxi pipeline Part 2

In this second part, I’ll discuss the actual Code Stream pipeline.

As stated before, the inspiration was William Lams wonderful Power Shell scripts to deploy a nested environment from a CLI. His original logic was retained as much as possible, however due to the nature of K8S a few things had to be changed. I’ll try to address those as they come up.

After some thought I decided to NOT allow the requester to select the amount of Memory, vCPU, or VSAN size. Each Esxi host has 24G of Ram, 4 vCPU, and contributes a touch over 100G to the VSAN. The resulting cluster has 72G of RAM, 12 vCPUs and a roughly 300G VSAN. Only Standard vSwitches are configured in each host.

The code, pipeline and other information is available on this github repo.

Deployment of the Esxi hosts is initiated by ‘deployNestedEsxi.ps1’. There are few changes from the original script.

The OVA configuration is only grabbed once. Then only the specific host settings (IP Address and Name are changed.
The hosts are moved into a vApp once built.
The NetworkAdapter settings are performed after deployment.
Persisted the log to /var/workspace_cache/logs/vsphere-deployment-$BUILDTIME.log.

Deployment of the vCSA is handled by ‘deployVcsa.ps1’ Some notable changes from the original code include.

Hardcoded the SSO username to administrator@vsphere.local.
Hardcoded the size to ‘tiny’.
Save the log file to /var/workspace_cache/logs/NestedVcsa-$BUILDTIME.log.
Save the configuration template to /var/workspace_cache/vcsajson/NestedVcsa-$BUILDTIME.json.
Move the VCSA into the vApp after deployment is complete.

And finally ‘configureVc.ps1’ sets up the Cluster and VSAN. Some changed include.

Hardcoded the Datacenter name (DC), and Cluster (CL1).
Import the Esxi hosts by IP (No DNS records setup for the hosts or vCenter).
Append the configuration results to /var/workspace_cache/logs/vsphere-deployment-$BUILDTIME.log.

So there you go, down and simple Code Stream pipeline to deploy a nested vSphere environment in about an hour.

Stay tuned. The next article will include an NSX-T deployment.

Code Stream Nested Esxi pipeline Part 1

Been a while since my last post. Over the last couple of months I’ve been tinkering with using Code Stream to deploy a Nested Esxi / vCenter environment.

My starting point is William Lams excellent PowerShell script (vsphere-with-tanzu-nsxt-automated-lab-deployment). I also wanted to use the official vmware/poweclicore docker image.

Well let’s just say it’s been an adventure. Much has been learned through trial and (mostly) error.

For example in Williams script, all of the files are located on the workstation where the script runs. Creating a custom docker image with those files would have resulted in a HUGE file, almost 16GB (Nested ESXi appliance, vCSA appliance and supporting files, and NSX-T OVA files). As one of my co-worker says, “Don’t be that guy”.

At first I tried cloning the files into the container as part of the CI setup. Downloading the ESXi OVA worked fine, but failed when I tried copying over the vCSA files. I think it’s just too much.

I finally opted to use a Kubernetes Code Stream instead of a Docker pipeline. This allowed me to use a Persistent Volume Claim.

Kubernetes setup

Some of the steps may lack details, as this has been an ongoing effort and just can’t remember everything. Sorry peeps!

Create two Name Spaces, codestream-proxy and codestream-workspace. Codestream-proxy is used by Code Stream to host a Proxy pod.

Codestream-workspace will host the containers running the pipeline code.

Next came the service account for Code Stream. The path of least resistance was to simply assign ‘cluster-admin’ to the new service account. NOTE: Don’t do this in a production environment.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cs-cluster-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: codestream
  apiGroup: ""
  namespace: default

Next came the Persistent Volume (pv) and Persistent Volume Claim (pvc). My original pv was set to 20GI, which after some testing was determined too small. It was subsequently increased it to 30GI. The larger pv allowed me to retain logs and configurations between runs (for troubleshooting).

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
  name: cs-persistent-volume-cw
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 30Gi
  hostPath:
    path: /mnt/nested
  persistentVolumeReclaimPolicy: Retain

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cs-pvc-cw
  namespace: codestream-workspace
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 30Gi
  volumeName: cs-persistent-volume-cw

The final step in k8s is to get the Service Account token. In this example the SA is called ‘codestream’ (So creative).

k get secret codestream-token-blah!!! -o jsonpath={.data.token} | base64 -d | tr -d "\n"

eyJhbGciOiJSUzI1NiIsImtpZCI6IncxM0hIYTZndS1xcEdFVWR2X1Z4UFNLREdQcGdUWDJOWUF1NDE5YkZzb.........

Copy the token, then head off to Code Stream.

Codestream setup

There I added a Variable to hold the token, called DAG-K8S-Secret.

Then went over to Endpoints, where I added a new Kubernetes endpoint.

Repo setup

The original plan was to download the OVA/OVF files from a repo every time the pipeline ran. However an error would occur on every VCSA file set download. Adding more memory to the container didn’t fix the problem, so I had to go in another direction.

The repo is well connected to the k8s cluster, so the transfer is pretty quick. Here is the directory structure for the repo (http://repo.corp.local/repo/).

NOTE: You will need a valid account to download VCSA and NSX-T.

NOTE: NSX-T will be added to the pipeline later.

Simply copying the files interactively on the k8s node seemed like the next logical step. Yes the files copied over nicely, but any attempt to deploy the VCSA appliance would throw a python error complaining about a missing ‘vmware’ module.

However I was able to run the container manually, copy the files over and run the scripts successfully. Maybe a file permissions issue?

Finally I ran the pipeline with a long sleep at the beginning. Using an interactive session, and copied the files over. This fixed the problem.

Here are the commands I used to copy the files over interactively.

k -n codestream-workspace exec -it po/running-cs-pod-id bash
wget -mxnp -q -nH http://repo.corp.local/repo/ -P /var/workspace_cache/ -R "index.html*"
# /var/workspace_cache is the mount point for the persistent volume
# need to chmod +x a few files to get the vCSA to deploy
chmod +x /var/workspace_cache/repo/vcsa/VMware-VCSA-all-7.0.3/vcsa/ovftool/lin64/ovftool*
chmod +x /var/workspace_cache/repo/vcsa/VMware-VCSA-all-7.0.3/vcsa/vcsa-cli-installer/lin64/vcsa-deploy*

This should do it for now. The next article will cover some of the pipeline details, and some of the changes I had to make to William Lams Powershell code.

Happy holidays.

vRA Cloud Day 2 Resource Action using a Polyglot workflow

One of my peers came up with an interesting use case today. His customer wanted to mount an existing disk on a virtual machine using a vRA Cloud day 2 action.

I couldn’t find an out of the box workflow or action on my vRO, which meant I had to do this thing from scratch.

After a quick look around I found a PowerCli cmdlet (New-Hardisk) which allowed me to mount an existing disk.

My initial attempts to just run it as a scriptable task resulted in the following error.

Hmm, so how do you increase the memory in a scriptable task? Simple, you can’t. Thus I had to move the script into an action, which does allow me to increase the memory. After some tinkering I found that 256M was sufficient to run the code.

function Handler($context, $inputs) {
    # $inputs:
    ## vmName: string
    ## vcName: string (in configuration element)
    ## vcUsername: string (in configuration element)
    ## vcPassword: secureString (in configuration element)
    ## diskPath: string. Example in code. 
    # output:
    ## actionResult: Not used
    $inputsString = $inputs | ConvertTo-Json -Compress

    Write-Host "Inputs were $inputsString"

    $output=@{status = 'done'}

    # connect to viserver
    Set-PowerCLIConfiguration -InvalidCertificateAction:Ignore -Confirm:$false
    Connect-VIServer -Server $inputs.vcName -Protocol https -User $inputs.vcUsername -Password $inputs.vcPassword

    # Get vm by name
    Write-Host "vmName is $inputs.vmName"
    $vm = Get-VM -Name $inputs.vmName

    # New-HardDisk -VM $vm -DiskPath "[storage1] OtherVM/OtherVM.vmdk"
    $result = New-HardDisk -VM $vm -DiskPath $inputs.diskPath 
    Write-Host "Result is $result"

    return "It worked!"
}

Looking at the code, you will notice an input of vmName (used by PS to find the VM). Getting the vmName is actually pretty stupid simple using JavaScript. My first task in the WF takes care of this.

// get the vmName
// $inputs.vm
// output: vmName
vmName = vm.name

The next step was to setup a resource action. The settings are shown in the following snapshot. Please note the setting within the green box. ‘vm’ is set with a binding action.

Changing the binding is fairly simple. Just click the binding link, then change the value to ‘with binding action’. The default values work just fine.

The disk I used in the test was actually a copy of another VM boot disk. It was copied over to another datastore, then renamed to ‘ExistingDisk2.vmdk’. The full diskPath was [dag-nfs] ExistingDisk/ExistingDisk2.vmdk.

Running the day 2 action on deployed machine seemed to work, as the WF logs show.

So there you have a basic PolyGlot vRO workflow using PowerCli and JavaScript.

I trust this quick blog was helpful in some small way.

Changing vRealize Automation Cloud Proxy internal network ranges

My current customer needs to use 172.18.0.0/16 for their new VMWare Cloud on AWS cluster. However we tried this in the past and were getting a “NO ROUTE TO HOST” error when trying to add the VMC vCenter as a cloud account.

The problem was eventually traced back to the ‘on-prem-collector’ (br-57b69aa2bd0f) network in the Cloud Proxy which also uses the same subnet.

Let’s say the vCenters IP is 172.18.32.10. From inside cloudassembly-sddc-agent container, I try to connect to the vCenter. Eventually getting a ‘No route to host’ error. Can anyone say classic overlapping IP space?

We reached out to our VMWare Customer Success Team and TAM, who eventually provided a way to change the Cloud Proxy docker and on-prem-collector subnets.

Now for the obligatory warning. Don’t try this in production without having GSS sign off on it.

In this example I’m going to change the docker network to 192.168.0.0/24 and the on-prem-collector network to 192.168.1.0/24.

First to update the docker interface range.

Add the following two lines to /etc/docker/daemon.json. Don’t forget to add the necessary comma(s). Then save and close.

{
  "bip": "192.168.0.1/24",
  "fixed-cidr": "192.168.0.1/25"
}

Restart the docker service.

# systemctl restart docker

Now onto the on-prem-collector network.

Check to see which containers are using this network with docker network inspect on-prem-collector. Mine had two, cloudassembly-sddc-agent, cloudassembly-cmx-agent.

# docker network inspect on-prem-collector
[
    {
        "Name": "on-prem-collector",
        "Id": "57b69aa2bd0f694d76cc553769321deebcdb79e009e0964c4b5cc47aadb14684",
        "Created": "2021-02-10T16:05:21.953266873Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "05105324cff757d76de9e2f535cfb72d2e96094a630561aa141a40aa04095f00": {
                "Name": "cloudassembly-cmx-agent",
                "EndpointID": "8f6717a969b5a1edfea37b9e3d77565c38419de18774bebf4c3981e41c1ad017",
                "MacAddress": "02:42:ac:12:00:03",
                "IPv4Address": "172.18.0.3/16",
                "IPv6Address": ""
            },
            "b227cf1add6caca415b88f927fb10982b0cd846f71548f95071b65330e4024e1": {
                "Name": "cloudassembly-sddc-agent",
                "EndpointID": "4f802a81e0a5dfe50ca39675a5b5106a5fb647198f3bfa898f4f62793baad448",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {}
    }
]

Disconnect those two machines from the on-prem-collector network.

# docker ps
CONTAINER ID        IMAGE                                                                          COMMAND                  CREATED             STATUS              PORTS                      NAMES

05105324cff7        symphony-docker-external.jfrog.io/vmware/cloudassembly-cmx-agent:207           "./run.sh --lemansDa…"   4 days ago          Up 5 minutes        127.0.0.1:8004->8004/tcp   cloudassembly-cmx-agent

b227cf1add6c        symphony-docker-external.jfrog.io/vmware/cloudassembly-sddc-agent:4cda576      "./run.sh --lemansDa…"   4 days ago          Up 5 minutes        127.0.0.1:8002->8002/tcp   cloudassembly-sddc-agent

# docker network disconnect on-prem-collector b227cf1add6c
# docker network disconnect on-prem-collector 05105324cff7
# docker network inspect on-prem-collector
[
    {
        "Name": "on-prem-collector",
        "Id": "57b69aa2bd0f694d76cc553769321deebcdb79e009e0964c4b5cc47aadb14684",
        "Created": "2021-02-10T16:05:21.953266873Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {},
        "Options": {},
        "Labels": {}
    }
]

Delete the on-prem-collector network, then re-add using the new subnet (using 192.168.1.0/24)

# docker network rm on-prem-collector
on-prem-collector
# docker network create --subnet=192.168.1.0/24 --gateway=192.168.1.1 on-prem-collector
47e3d477a87c4459f57e3a7305754b1d91e4d13e645ad4c160de5b8e64fede1a

Reconnect the two containers to the new docker network.

# docker network connect on-prem-collector 05105324cff7
# docker network connect on-prem-collector b227cf1add6c
# 
# docker network inspect on-prem-collector
[
    {
        "Name": "on-prem-collector",
        "Id": "47e3d477a87c4459f57e3a7305754b1d91e4d13e645ad4c160de5b8e64fede1a",
        "Created": "2021-05-18T15:58:55.019732144Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "192.168.1.0/24",
                    "Gateway": "192.168.1.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "05105324cff757d76de9e2f535cfb72d2e96094a630561aa141a40aa04095f00": {
                "Name": "cloudassembly-cmx-agent",
                "EndpointID": "34df13b0accf2f561e0226918a7e84d02995a25f4cc3969758a913a3f6c4e8bb",
                "MacAddress": "02:42:c0:a8:01:02",
                "IPv4Address": "192.168.1.2/24",
                "IPv6Address": ""
            },
            "b227cf1add6caca415b88f927fb10982b0cd846f71548f95071b65330e4024e1": {
                "Name": "cloudassembly-sddc-agent",
                "EndpointID": "405e7e8e1a4ad09b4cc99b0661454a4b0f32687152ca2346daf72f5a424dcd4d",
                "MacAddress": "02:42:c0:a8:01:03",
                "IPv4Address": "192.168.1.3/24",
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {}
    }
]

Reboot and do the happy dance.

Happy not-coding.