I’ve spent one week messing around with Docker on AWS. “Docker on AWS” is a joint effort between AWS and Docker that provides a quick start for Docker Datacenter (Universal Control Plane and Docker Trusted Registry), deployed in a single click via a CloudFormation template. It’s a reference deployment of sorts. Personally, I was quite excited about this, since I had spent a week or two hacking on my own CloudFormation template to bootstrap a UCP system. I got a trial license as soon as I realized invites were no longer required. This blog post summarizes my experience.

My use case is fairly simple: I want a swarm cluster the team can deploy an internal docker-compose based app to. I’m not interested in DTR because we have a paid account on the official registry. The organization has development teams in India and Sweden, so I wanted one UCP installation in each of eu-west-1 and ap-south-1. I also wanted to automate the CloudFormation deploys with our existing Ansible playbooks. My initial spike requirements were:

  1. Deploy the official CloudFormation stack with Ansible.
  2. Configure CloudFlare DNS records for UCP and DTR as part of the Ansible playbook.
  3. Get a “client bundle” from UCP.
  4. Load the bundle and run docker-compose commands against UCP (see the playbook sketch after this list).
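
Here’s a minimal sketch of what steps 1 and 2 look like in a playbook, using Ansible’s `cloudformation` and `cloudflare_dns` modules. The template URL, parameter names, and stack output name are illustrative placeholders rather than the quick start’s real values:

```yaml
- hosts: localhost
  connection: local
  tasks:
    - name: Deploy the Docker Datacenter quick start stack
      cloudformation:
        stack_name: ddc-spike
        state: present
        region: eu-west-1
        # Placeholder URL; use the template link from the official quick start.
        template_url: https://s3.amazonaws.com/example-bucket/docker_datacenter.json
        template_parameters:
          KeyName: spike-key                # parameter names vary by template version
          ControllerInstanceType: m3.medium
      register: ddc_stack

    - name: Point DNS at the UCP load balancer
      cloudflare_dns:
        zone: example.com
        record: ucp
        type: CNAME
        # Output name is a guess; check the template's Outputs section.
        value: "{{ ddc_stack.stack_outputs.UCPLoadBalancerDNS }}"
        account_email: "{{ cloudflare_email }}"
        account_api_token: "{{ cloudflare_token }}"
```

Steps 3 and 4 stay manual for now: download the client bundle from the UCP web UI, source the `env.sh` it contains to point your Docker client at the cluster, then run docker-compose as usual.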

I’m sad to report that the first week was fraught with problems. I’ve documented my problems as GitHub issues and submitted pull requests where possible. I was not able to get a complete success, for the reasons outlined below. The majority of the issues are reported in this long GitHub issue.

  1. Stack timeouts are too low. This took a long time to diagnose because of how long the stack takes to create. The stack enforces a 15 minute timeout on the bootstrap/install/configure operations. That is, installing all packages, starting Docker, and completing the UCP/DTR operations must finish within 15 minutes or the deploy fails. The repository does have CI, so in theory that window should be enough. However, all my deploys to ap-southeast-1 were consistently taking ~20 minutes. I’ve submitted a patch to bump the timeout to 30 minutes (see the template sketch after this list). This should cover the stack in all regions.
  2. The default settings (which most users will deploy with) do not provide enough disk space. The default instance type for controllers, UCP nodes, and replicas is m3.medium. This comes with a 4GB root volume, which leaves ~3GB once everything is installed. Our test application has ~25 containers, and my initial pull filled up one of the nodes after just a few images. The stack accepts instance types via Parameters, but that is a losing battle: you should not need to move up to some xxlarge instance just to get more disk space. I’ve submitted a patch to attach a 120GB SSD to each node (also shown in the sketch below).
  3. m4.* instances are not supported. I bumped the instance type and deployed. There was a bug in the template mappings: m4.* instances were allowed as parameter values but not mapped to an AMI. I reported an issue, which was fixed by the maintainers. The fix is out in the wild.
  4. DTR replicas simply do not bootstrap correctly. This has been a losing battle, and I’ve given up trying to debug it. All the logs and intermediate failures are documented in my main issue. I was only able to continue my spike by copying the official CloudFormation template and removing everything related to DTR, which was only possible because we do not need DTR. I’ve submitted a patch to include --debug flags on all UCP/DTR commands to help users in similar scenarios.
  5. Occasional “ERROR: No elected primary cluster manager” when pulling images to the swarm. I am uncertain why this happens, and I do not know where to look or how to debug it.
  6. The official stack template is poorly formatted, with incorrect and inconsistent indentation. This made it effectively impossible to read and study the implementation. I’ve submitted a patch to correct the formatting.
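
For reference, the fixes for points 1 and 2 each amount to a few lines of template. A minimal sketch, assuming a node resource that reports back via cfn-signal to a CreationPolicy; the real stack’s resource names and structure differ:

```yaml
Resources:
  UCPNode:                        # illustrative name, not the stack's
    Type: AWS::EC2::Instance
    CreationPolicy:
      ResourceSignal:
        Timeout: PT30M            # bumped from PT15M; slow regions need ~20 minutes
    Properties:
      InstanceType: m3.medium
      ImageId: ami-xxxxxxxx       # region-specific Ubuntu 14.04 AMI from the Mappings
      BlockDeviceMappings:
        - DeviceName: /dev/xvdb   # extra data volume for images and containers
          Ebs:
            VolumeType: gp2       # general purpose SSD
            VolumeSize: 120       # GB; the 4GB root volume fills up fast
```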

Luckily I could continue my spike by removing the DTR parts, so now I can carry on playing with swarm/UCP itself. That being said, there are still larger open issues with the reference deployment. Each point is also documented in a GitHub issue.

  1. You cannot add your own SSL certs at deploy time; you can only add them afterwards. I can understand why they are not mandatory for the quick start (how many people have SSL certs on hand for trial purposes?). However, this creates confusion: the UCP and DTR URLs are only accessible over HTTPS, and only with untrusted certs. It would be nice to allow user-specified certs with a fallback to generated ones.
  2. No SSH access. This has been a real annoyance when debugging why things break. Hopefully future versions of the stack create a public bastion (or “jump host”) to access the private nodes. I had to add security group rules and create a new instance on the public subnet every time a stack deploy failed (quite tiring, as you can imagine). This can be done easily in the CloudFormation template (see the sketch after this list). I vote that SSH access be considered in the reference deployment.
  3. All-region support. The current stack does not support ap-south-1, and a few other regions are unsupported as well. This is fixed by updating the inbuilt AMI mapping for each region. Luckily the stack uses the official Ubuntu 14.04 AMIs, which are available in all regions. I’ve submitted a patch to add ap-south-1 support (the mapping change is sketched below).
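
Points 2 and 3 are both small template changes. A sketch of each, with a hypothetical mapping name, a placeholder AMI ID, and assumed references to the stack’s VPC, public subnet, and key pair:

```yaml
Mappings:
  AWSRegion2AMI:                    # hypothetical name; match the real template's mapping
    ap-south-1:
      AMI: ami-xxxxxxxx             # official Ubuntu 14.04 LTS HVM AMI for ap-south-1

Resources:
  BastionSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: SSH access to the bastion
      VpcId: !Ref VPC               # assumes the stack's VPC resource is named VPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 203.0.113.0/24    # replace with your office CIDR

  BastionHost:                      # not in the official stack; sketch for point 2
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t2.micro
      KeyName: !Ref KeyName
      ImageId: !FindInMap [AWSRegion2AMI, !Ref "AWS::Region", AMI]
      SubnetId: !Ref PublicSubnet   # assumes the stack exposes a public subnet resource
      SecurityGroupIds:
        - !Ref BastionSecurityGroup
```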

My investigation will continue over the coming weeks. I’ll keep opening issues and submitting pull requests where possible. Stay tuned for more information.