vRAC Ansible Control Host Lessons Learned

Now that I've spent a couple of months with vRealize Automation Cloud (vRAC, AKA VMware Cloud Assembly), I figured it was a good time to jot down some lessons learned from deploying and using several Ansible Control Hosts (ACHs).

My first ACH was an Ubuntu machine deployed on our vSphere cluster. This deployment model requires connectivity to a Cloud Proxy. All of my ACHs since then have been AWS t2.micro Amazon Linux instances, and the rest of this post focuses on that environment.

First, assign an Elastic IP if you plan on shutting the instance down when not in use. Failing to do so makes your vRAC-integrated control host unreachable after a stop and start, as the public IP will change.
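
If you manage the instance from the AWS CLI, allocating and attaching the Elastic IP looks something like this (a sketch; the instance and allocation IDs are placeholders):

# Allocate a new Elastic IP in the VPC
aws ec2 allocate-address --domain vpc
# Associate it with the ACH instance using the returned allocation ID
aws ec2 associate-address --instance-id i-0123456789abcdef0 --allocation-id eipalloc-0123456789abcdef0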

A changed address leaves you with an unusable ACH, with no option but to replace it with a new one. A word of warning here: make sure to delete any deployments that used that ACH first. Failing to do so will leave you with an undeletable deployment, and you may not even be able to delete the ACH from the integrations page. Oh joy!

Next is the AWS security group. The offshore developers cannot provide a list of source IPs for vRAC calls into the ACH; in other words, you have to open your server up to the world. I've been told they are working on this, but I do not know when they'll have a fix.

So I've been experimenting with ConfigServer Security and Firewall (CSF). My current test is pretty much out of the box and uses two blocklists in csf.blocklists. So far so good: I was able to add the test ACH without an issue, and it is blocking tons of IPs. The host is using about 180K of RAM with the blocklists in place.
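
For reference, a blocklist entry in /etc/csf/csf.blocklists is a single pipe-delimited line; here is a sketch using the Spamhaus DROP list (the name and refresh interval are just examples):

# Format: NAME|refresh seconds|max IPs to load (0 = all)|list URL
SPAMDROP|86400|0|https://www.spamhaus.org/drop/drop.txt

After editing the file, restart CSF with csf -r to load the list.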

Changing the SSH server port to something other than port 22 doesn’t work right now. I brought this up on my weekly VMware call, so hopefully they’ll get it fixed in the near future.

I'm using an S3 bucket to back up my playbooks once a day. I did have to create an IAM role with S3 permissions and assign it to my ACH. I'll eventually have the files pushed up to a repository.
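
The backup job itself is nothing fancy; a daily cron script along these lines does the trick, assuming the aws CLI picks up the instance role automatically (the paths and bucket name are made up):

#!/bin/sh
# /etc/cron.daily/playbook-backup: sync the playbook tree to S3 once a day
aws s3 sync /home/ansible/playbooks s3://my-ach-playbook-backup/playbooks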

Now on to troubleshooting. One of my beefs is the naming convention of the log files in the ACH user's home directory (/var/tmp/vmware/provider/user_defined_script/). The only way to figure out which directory to look in is to list them by date (ls -alt) or grep for the machine IP.
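
In practice that boils down to one of these two commands (the IP is an example):

# List deployment folders, most recently modified first
ls -alt /var/tmp/vmware/provider/user_defined_script/
# Or find the folder that references a particular machine IP
grep -rl "10.10.1.42" /var/tmp/vmware/provider/user_defined_script/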

The ACH playbook log directory contains different files depending on the run state. A running deployment contains additional files, like the actual ansible command that was invoked (exec). The exec file disappears when the deployment completes or errors out, making it almost impossible to look at the ansible command and figure out why it failed. My workaround is to quickly jump into the directory and print out 'exec' while the run is still in flight.

The ansible command isn't the only thing deleted when a deployment fails; the host and user variables under /etc/ansible/host_vars/{ip address} are removed as well. I just keep two windows open and print out the contents of anything beginning with 'vra*'. Note that 'log.txt' in the deployment log folder does contain the extra host variables, but not the user variables.
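
My two-window routine amounts to something like this while the deployment is still running (the folder name and IP are placeholders):

# Window 1: grab the ansible command line before vRAC deletes it
cat /var/tmp/vmware/provider/user_defined_script/<deployment-folder>/exec
# Window 2: capture the host and user variables before they are removed
cat /etc/ansible/host_vars/10.10.1.42/vra*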

I'm still figuring out how much space I need in the user's home directory for log files. vRAC doesn't delete folders when destroying a deployment, so I suspect an active ACH will eventually fill up the log directory (/var/tmp/vmware/provider/user_defined_script/). Right now I'm seeing an average folder size of a bit less than 20 MB for completed deployments, or about 1 GB per 60 deployments. And no, this isn't on my gripe list yet, but it will be.

That’s it for now. Come back soon.

vRAC AWS Windows Domain Join using Ansible

My most recent vRealize Automation Cloud (vRAC) task was to leverage Ansible to join a new AWS Windows machine to an Active Directory domain.

First off I needed to figure out how to get WinRM working in an AWS AMI. Initially I just deployed a Windows instance, installed WinRM, added a new admin account, then created a private AMI. This did work, sometimes, but really wasn't the best solution for the customer. What I really needed was a way to install WinRM and create the new user in an EC2 instance deployed from a publicly available AMI.

The dots finally connected last week when it dawned on me that vRAC cloudConfig equals AWS instance User Data. Yes it was that simple.

After reviewing Running Commands on your Windows Instance at Launch and some tinkering, I came up with the basic vRAC cloudConfig PowerShell script below. It adds the new user to the local Administrators group, then installs and configures WinRM, leaving me with a clean, Ansible-ready machine.

  cloudConfig: |
    <powershell>
    # Add new user for ansible access
    $password = ConvertTo-SecureString "${input.new_user_password}" -AsPlainText -Force
    $newUser = New-LocalUser -Name "${input.new_user_name}" -Password $password -FullName "Ansible Remote User" -Description "Ansible remote user"
    Add-LocalGroupMember -Group "Administrators" -Member "${input.new_user_name}"
    # Setup WinRM using the script from the Ansible project
    Invoke-Expression ((New-Object System.Net.WebClient).DownloadString('https://raw.githubusercontent.com/ansible/ansible/devel/examples/scripts/ConfigureRemotingForAnsible.ps1'))
    </powershell>

The resulting instance User Data includes the expanded variables as seen below.

<powershell>
# Add new user for ansible access
$password = ConvertTo-SecureString "VMware123!" -AsPlainText -Force
$newUser = New-LocalUser -Name "ansibleuser" -Password $password -FullName "Ansible Remote User" -Description "Ansible remote user"
Add-LocalGroupMember -Group "Administrators" -Member "ansibleuser"
# Setup WinRM using the script from the Ansible project
Invoke-Expression ((New-Object System.Net.WebClient).DownloadString('https://raw.githubusercontent.com/ansible/ansible/devel/examples/scripts/ConfigureRemotingForAnsible.ps1'))
</powershell>

The playbook turned out to be fairly simple. It waits 5 minutes for the OS to settle, points DNS at the DC (which also runs DNS), renames the machine, and joins it to the domain.

- hosts: win
  gather_facts: no
  tasks:

  - name: Pause for OS
    pause:
      minutes: 5

  - name: Change DNS to DC
    win_dns_client:
      adapter_names: '*'
      ipv4_addresses:
        - 10.10.0.100

  - name: Rename machine
    win_hostname:
      name: "{{ hostname }}"
    register: res

  - name: Reboot if necessary
    win_reboot:
    when: res.reboot_required

  - name: Wait for WinRM to be reachable
    wait_for_connection:
      timeout: 900

  - name: Join to "{{ domain_name }}"
    win_domain_membership:
      hostname: "{{ hostname }}"
      dns_domain_name: "{{ domain_name }}"
      domain_admin_user: "{{ domain_user }}"
      domain_admin_password: "{{ domain_user_password }}"
      domain_ou_path: "{{ domain_oupath }}"
      state: domain
    register: domain_state

  - name: Wait for 2 minutes
    pause:
      minutes: 2

  - name: Reboot after domain join
    win_reboot:
      post_reboot_delay: 120
    when: domain_state.reboot_required
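
The variables the playbook consumes (hostname, domain_name, domain_user, and so on) arrive as host variables from vRAC. A hypothetical hostVariables block on the Cloud.Ansible resource that would feed this playbook might look like the following; the names match the playbook, but the values are examples rather than the actual blueprint (which is in the repo linked below):

hostVariables:
  hostname: '${input.hostname}'
  domain_name: corp.local
  domain_user: administrator@corp.local
  domain_user_password: '${input.domain_password}'
  domain_oupath: OU=Servers,DC=corp,DC=local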

Blending existing AWS User Data with vRAC cloudConfig finally provided a clean solution without having to write a super complex ansible playbook. Keeping it simple once again pays off.

The blueprint and playbook referenced in this article are available in this GitHub repo.

vRealize Automation Cloud Ansible Enhancements

VMware released some Ansible enhancements within the last couple of weeks.

First is the ability to use the private IP of the deployed machine. Prior to this fix, disabling the public IP threw an error and the deployment failed.

To disable the assignment of a public IP (one is assigned by default), simply add 'assignPublicIpAddress: false' in the network properties.

Cloud_Machine_1:
  type: Cloud.Machine
  properties:
    image: CentOS 7
    flavor: generic.tiny
    remoteAccess:
      authentication: keyPairName
      keyPair: id_rsa
    attachedDisks:
      - source: '${resource.Cloud_Volume_1.id}'
    networks:
      - network: '${resource.Cloud_Network_1.id}'
        assignPublicIpAddress: false

By default, vRAC will use the private IP address of the first NIC on the machine.

Just a few things about the placement of the machines. My Ansible Control Host (ACH) is on a public AWS subnet. My first attempt to install NGINX on a machine deployed to the same subnet failed because it could not reach the repo. After some troubleshooting I determined the new machine needs to be deployed on a private subnet behind a NAT Gateway. Also make sure the ACH can connect to the deployed machine on TCP port 22 (SSH).
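
If the deployed machines land in their own security group, a rule like this opens SSH from the ACH (the group ID and the ACH's private IP are placeholders):

# Allow SSH from the ACH's private address into the deployed-machine security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 22 \
  --cidr 10.10.0.5/32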

The second enhancement is the ability to send extra variables to the ACH. The use case here is joining an AWS-backed Windows server to a domain using an ansible playbook.

Ansible extra variables can be added under the properties of the Ansible component. Here I'm going to add several just to demonstrate what it looks like.

Cloud_Ansible_1:
  type: Cloud.Ansible
  properties:
    host: '${resource.Cloud_Machine_1.*}'
    osType: linux
    account: ansible-control-host
    username: centos
    privateKeyFile: /home/ansibleoss/.ssh/id_rsa
    playbooks:
      provision:
        - /home/ansibleoss/playbooks/centos-nginx/playbook.yml
    groups:
      - linux
    hostVariables:
      bluePrintName: 'BP-${env.blueprintName}'
      message: Hello World
      domain: corp.local
      orgUnit: ou=sample,dc=corp,dc=local
      disks:
        disk1:
          size: '${resource.Cloud_Volume_1.capacityGb}'
          label: '${input.disk1_label}'
        disk2:
          size: 20
          label: Fake disk

These variables are stored in /etc/ansible/host_vars/vra_user_host_vars.yml.

This is the resulting YAML file for this blueprint request.

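Reconstructed from the hostVariables above, the file looks roughly like this (illustrative values, not copied verbatim from my environment):

bluePrintName: BP-<blueprint name>
message: Hello World
domain: corp.local
orgUnit: ou=sample,dc=corp,dc=local
disks:
  disk1:
    size: <Cloud_Volume_1 capacity>
    label: <disk1_label input>
  disk2:
    size: 20
    label: Fake disk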

They also changed the default connection type: when osType is set to 'windows', the ACH connects over winrm instead of the default SSH.

This will be the topic of my next article.

Stay tuned.

CentOS Image for Cloud-init on VMware Cloud Assembly Services

My current customer is looking at using VMware Cloud Assembly Services (CAS) for their next generation SDDC.  It looked like CAS would be able to address some of their Ansible and other OS customization use cases.

The Ubuntu cloud-ready OVA works great, but unfortunately a cloud-ready CentOS OVA was not available (they use RHEL and CentOS as their primary Linux distros).

Well, it took me a bit, but I was able to build an OVA that worked. Here is how I did it.

First I built a clean CentOS 7.x image using the minimal install ISO. I'm not going through this step by step as it's been well documented elsewhere.

Secondly, make sure to change the CD-ROM back to Client Device after the reboot. Do not leave it pointed at an ISO on a datastore, even if it is not connected at power on. Why? A cloud-init ISO is mounted on the machine when it powers up, and cloud-init failed to run when I left the CD pointed at an ISO on a datastore.

After the initial reboot, I simply updated the machine, and installed open-vm-tools and cloud-init.

#yum update -y

#yum install -y open-vm-tools cloud-init

Then cleaned up the machine.

#cloud-init clean --logs

#sys-unconfig

This last command returns the machine to an unconfigured state and shuts it down.

Next, within vCenter I enabled the vApp Options.

[screenshot: vApp Options enabled]

Then gave the appliance a name and added a few properties.

[screenshot: vApp name and properties]

And finally, I enabled ISO as the environment transport.

[screenshot: vApp environment transport]
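
With ISO transport enabled, you can confirm the guest is actually receiving the OVF environment by mounting the attached CD inside the VM and looking for ovf-env.xml (a sketch; the CD device name may differ):

# Mount the OVF environment ISO that vSphere attaches at power-on
sudo mount -o ro /dev/sr0 /mnt
# cloud-init's OVF datasource reads its settings from this file
cat /mnt/ovf-env.xml
sudo umount /mnt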

After saving the settings, I converted it into a template and imported it into CAS.

From there I created a new Image Mapping, and gave it a try in CAS.

[screenshot: successful CentOS cloud-init deployment in CAS]

The blueprint and Ansible playbook can be found in this GitHub repository.

vRA Custom Forms Regex

A recent customer request was to have manually entered Linux machine names always be lower case and exactly 13 characters long.

Initially I tinkered with a basic Event Broker Service (EBS) Subscription to simply change the ‘Hostname’ to lower case, but then decided to enforce it in a custom form.

My first task was finding the correct expression using some search sites. After a few clicks I was able to find what I was looking for.

My working expression ended up looking like this.

^[a-z0-9]{13}$
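
A quick way to sanity-check the pattern before pasting it into the form is grep (examples only):

# 13 lowercase alphanumerics: matches
echo "webserver0001" | grep -E '^[a-z0-9]{13}$'
# Uppercase or wrong length: no match, non-zero exit code
echo "WebServer01" | grep -E '^[a-z0-9]{13}$'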

Click on the Hostname field on your custom form, then the Constraints tab; paste the expression into Regular expression, and provide a useful error in the Validation error message text box.

[screenshot: Hostname field regex constraint]

Make sure to click the form Save button before moving to another field.  I failed to do that a few times, only to discover a blank field when troubleshooting.

[screenshot: custom form Save button]

While the customer didn’t have a specific requirement for the other fields, I decided to research and test some other expressions for future reference. I’m sure some additional tweaking will be needed, but here goes.

IP Address / DNS Server / Gateway

^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$

Subnet Mask

^(((255\.){3}(255|254|252|248|240|224|192|128|0+))|((255\.){2}(255|254|252|248|240|224|192|128|0+)\.0)|((255\.)(255|254|252|248|240|224|192|128|0+)(\.0+){2})|((255|254|252|248|240|224|192|128|0+)(\.0+){3}))$

Domain name

(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z0-9][a-z0-9-]{0,61}[a-z0-9]

There you go. As always your mileage may vary.

VMware PKS Bosh CLI client SSL trust

The past week or so has been spent deploying VMware Enterprise PKS in my lab. My main installation guide was Pivotal's Installing Enterprise PKS on vSphere with NSX-T.

All was going well until I tried to deploy a cluster. I could see the machines being deployed in vCenter and various NSX-T components being created. However, the cluster deployment failed with the following error.

Name:                     cluster-03
Plan Name:                small
UUID:                     ad4ba957-35bc-4500-ace8-1cfda0238a83
Last Action:              CREATE
Last Action State:        failed
Last Action Description:  Instance provisioning failed: 
..... task-id: 289, ... result: 2 of 7 pre-start scripts failed. Failed Jobs: ...

As you can see task 289 failed.  Now how the heck do I get the details of the failed task?

The bosh CLI client appeared to be the answer. Reading further, it looked like I needed to set some environment variables to make it work properly.

After reading a few online documents, I was able to find the Bosh Commandline Credentials (actually the bosh environment variables) by clicking on the BOSH tile in Operations Manager, clicking on the Credentials tab, then clicking the link next to Bosh Commandline Credentials.

[screenshot: Bosh Commandline Credentials]

The provided BOSH_CA_CERT path and file do not exist on my jump machine. I was able to download the root CA following these steps (installing uaac is beyond the scope of this document).

uaac target https://opsman.corp.local/uaa --skip-ssl-validation

uaac token owner get
Client ID: opsman
Client secret:
User name: admin
Password: *******

uaac contexts

Copy the admin bearer token from the client_id section (the token is actually called access_token).

[0]*[https://pks.pks.corp.local:8443]
  skip_ssl_validation: true
  ca_cert: root_ca_certificate
  [0]*[admin]
      client_id: admin
      access_token: eyJhbGci .....
      token_type: bearer

Finally, I downloaded the certificate to my jump machine.

curl https://opsman.corp.local/api/v0/security/root_ca_certificate -X GET -H "Authorization: Bearer eyJhbGci ....." -k > root_ca_certificate

My reformatted bosh environment settings, along with the correct path to my certificate ended up like this.

export BOSH_CLIENT=ops_manager 
export BOSH_CLIENT_SECRET=MP0................Blah! 
export BOSH_CA_CERT=/root/root_ca_certificate <--- Correct path and file
export BOSH_ENVIRONMENT=bosh.corp.local

After pasting the variables into my console, I attempted to get the details from the failed task.

bosh task 289 
Validating Director connection config:
  Parsing certificate 1: Missing PEM block
Exit code 1

What the heck? Apparently the downloaded certificate is actually JSON, AND it uses '\n' escapes in place of line returns.

{"root_ca_certificate_pem":"-----BEGIN CERTIFICATE-----\nMIIDUDCC...
...
....
cQswzKxnm8ZfedoVheV9OBnYQyrHV2ePG/W+kfCoqXD\n ....
CeEzZD6ZicGuv7KcYNP\n...\n-----END CERTIFICATE-----\n"}

Using Notepad++ I replaced all of the '\n' escapes with a line return.

[screenshot: Notepad++ find and replace]

Then I removed the quotes, brackets, and the root_ca_certificate_pem key, and deleted all of the other stray newlines, leaving me with a clean certificate (each Base64 line in a PEM file is 64 characters, except possibly the last).

[screenshot: cleanly formatted certificate]
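
In hindsight, the manual Notepad++ surgery can be skipped if jq is available on the jump host, since jq -r unescapes the '\n' sequences as it extracts the field (a sketch using the same endpoint and token):

# Pull the PEM straight out of the JSON response with real newlines
curl -sk https://opsman.corp.local/api/v0/security/root_ca_certificate \
  -H "Authorization: Bearer eyJhbGci....." \
  | jq -r .root_ca_certificate_pem > root_ca_2.pem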

After saving this on my machine, I attempted to run the command again, this time using the --ca-cert option pointing at the new certificate.

bosh task 289 --ca-cert root_ca_2.pem 
Using environment 'bosh.corp.local' as client 'ops_manager'

Task 289
....

Task 289 | 16:58:24 | Updating instance master: master/51431548-e35a-471b-853f-26dc7eca9f7c (0) (canary) (00:02:06)

                    L Error: Action Failed get_task: Task ...
Task 289 | 17:00:30 | Error: Action Failed get_task: ...
Task 289 Started  Wed May  8 16:55:34 UTC 2019
Task 289 Finished Wed May  8 17:00:30 UTC 2019
Task 289 Duration 00:04:56
Task 289 error

Capturing task '289' output:
  Expected task '289' to succeed but state is 'error'
Exit code 1

Success!

Now all I need to do is figure out the error. Oh joy!

vRA Custom Form CSS rendering issues

I ran into some interesting issues when trying to develop a custom form in vRealize Automation.

The first one was noticed when I tried to apply font-size to a field. After doing some research and watching a few videos, it looked like some Field IDs simply could not be targeted in CSS. The main issue is that the Field ID has a tilde '~' in it, which is not valid in a CSS ID selector.

[screenshot: tilde in the Field ID]

A quick search turned up an amazingly simple solution: escape the tilde with a backslash '\'. Stupid easy!

My CSS up to this point contained the following:

body {
font-size: 18px;
}

#vSphere__vCenter__Machine_1\~size {
font-size: 14px; 
}

#integerField_cda957b7 {
font-size: 20px;
font-style: italic;
}

#vSphere__vCenter__Machine_1\~VirtualMachine.Disk1.Size {
font-size: 32px;
}

Which produced this form.

[screenshot: form with improperly sized Requested Disk Size field]

The integerField and Size (Component Profile) updated properly, but the Requested Disk Size didn't. (This field is actually populated by an external vRO action that converts the integer to a string, which is then applied to a custom property.) What I was actually after was getting rid of the label word wrap, but I stumbled across the font-size issue along the way.

After digging through some debugging info I ran into two classes: 'grid-item' and 'label.field-label.required'.

[screenshot: label.field-label.required class in the debugger]

After updating the CSS, I was finally able to change the Requested Disk Size font, as well as fix the word wrap issue.

My final result is a good starting point. Both the Size and Requested Disk Size fields have a font size of 18, and the integer field has an italicized font at the correct size. The shadow on the Size label was just for fun 🙂

[screenshot: formatted custom form]

My final CSS looks like this:

body {
font-size: 18px;
}

.label.field-label.required {
width: 80%!important;
}

.grid-item {
width: 80%!important;
}

#vSphere__vCenter__Machine_1\~size {
/* font-size: 14px; */
color: blue;
text-shadow: 3px 2px red;
}
#integerField_cda957b7 {
font-size: 20px;
font-style: italic;
}
#vSphere__vCenter__Machine_1\~VirtualMachine.Disk1.Size {
/* does not work with custom properties */
/* need to dig into label.field.label.* */
font-size: 32px;
}

Well, it's about that time of day. Be well.

Migrating from NSX-V to NSX-T Error

The project this week was to convert my lab environment from NSX-V to NSX-T to support some upcoming projects.

Apparently unconfiguring the hosts through the NSX-V interface didn’t work as expected, and in fact didn’t remove all of the NSX-V components.

When trying to install NSX-T (2.3.1) on my vSphere 6.7 hosts, I received the error 'File path of /etc/vmsyslog.conf.d/dfwpktlogs.conf is claimed by multiple non-overlay VIBs'.

[screenshot: vmsyslog claim error in NSX-T]

After logging into one of the hosts and running 'esxcli software vib list | grep -i nsx', I discovered a stray NSX-V package called esx-nsxv.

#esxcli software vib list | grep -i nsx
esx-nsxv 6.7.0-0.0.7563456 VMware VMwareCertified 2019-03-08

This was quickly rectified by placing the host in maintenance mode, then running the following command.

#esxcli software vib remove -n esx-nsxv

Which spit out the following.

Removal Result
Message: Operation finished successfully.
Reboot Required: false
VIBs Installed:
VIBs Removed: VMware_bootbank_esx-nsxv_6.7.0-0.0.7563456
VIBs Skipped:
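
For reference, the maintenance mode steps can be scripted with esxcli as well (a sketch; run over SSH on the affected host):

# Enter maintenance mode before touching the VIB
esxcli system maintenanceMode set --enable true
# Remove the stray NSX-V VIB
esxcli software vib remove -n esx-nsxv
# Exit maintenance mode once the removal completes
esxcli system maintenanceMode set --enable false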

After exiting maintenance mode I retried the installation by selecting ‘Resolve’ in NSX-T.

[screenshot: Resolve option in NSX-T host installation]

NSX-T installation on the host then completed successfully.

Nested NSX-T cluster on vSphere 6.7U1

I took some time this week to update William Lam's Nested vSphere 6.5 with NSX-T lab scripts to nested vSphere 6.7U1/NSX-T 2.3.1, to kickstart a new customer project.

This version updates some of the vCSA OVF JSON fields and adds support for PowerShell 6.1. It was tested with PowerCLI 11.2 on a Windows 10 machine.

The original post can be found at virtuallyghetto.

You can find the new, updated file(s) at nested-nsxt231-vsphere67u1.

NvIPAM is Open Source

NvIPAM is open source as of today.

After some significant work over the past few weeks, I'm pleased to announce the publication of NvIPAM on GitHub.

This is really an alpha release, as I'm sure lots of things will change over the course of time.

The installation is handled by an Ansible playbook, which installs and configures the CentOS machine. Once installed, you'll need to initialize the database, then start the service.

My plan is to start documenting the project on the Wiki page as I have time. Those articles will cover installing and configuring the vRO plugin, integrating it with vRA, and deploying an external vRO appliance (for vSphere IPAM integration).

You can download your copy at https://kovarus.github.io/NvIPAM/.

Please feel free to open any issues, or even fork it and take it in your own direction.

Stay tuned.