by Michael Schwartz & David Kimery
With today’s worldwide cloud-based computing solutions, downtime is not an option! Not even for scheduled system maintenance and updates. Customers want ZERO downtime, no exceptions.
But how can we, as software developers, keep a system up and running while we are updating the code to add features or resolve bugs? The solution is not easy, but it is possible.
Ever since I learned the Intuit team was developing online tax solutions that could post an update while customers were using their software, without shutting down the servers, I said this was something I wanted my software to do. So we started investigating the solution and the requirements.
The first thing we had to do was completely rewrite the software’s user authentication layer and services architecture. Without getting too technical, most web applications running over HTTPS keep an encrypted connection between the web browser and the server, and they remember who is logged in from one request to the next. That memory is usually kept in session state.
If you have ever been using your browser and all of a sudden the server says you need to log back in, chances are the server has lost your session state. This was the first problem we had to solve: allowing the session state to move securely from one server to the next. This is not new technology; many of the larger websites and services already do it. It was just a matter of learning how to do it in our software.
We decided to go with Amazon Web Services (AWS), implementing a load balancer in front of either one server or multiple servers, with requests distributed across them based on site traffic. The load balancer’s job is to balance the workload between the servers. The first step to make this work was to move the user’s session state out of the web server application and into a database. This allows the load balancer to move a user from server A to server B without breaking the session state.
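To make the idea concrete, here is a minimal sketch of session state living in a shared database rather than inside one web server. It is not our production code: the SessionStore and WebServer classes, the table layout, and the in-memory SQLite database are all assumptions chosen so the example runs on its own. In a real setup the database would be a separate service that every server behind the load balancer can reach.

```python
import json
import secrets
import sqlite3


class SessionStore:
    """Hypothetical store that keeps sessions in a database shared by all servers."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions "
            "(id TEXT PRIMARY KEY, data TEXT NOT NULL)"
        )

    def save(self, session_id, data):
        # Insert a new session or overwrite an existing one.
        self.conn.execute(
            "INSERT OR REPLACE INTO sessions (id, data) VALUES (?, ?)",
            (session_id, json.dumps(data)),
        )
        self.conn.commit()

    def load(self, session_id):
        row = self.conn.execute(
            "SELECT data FROM sessions WHERE id = ?", (session_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None


class WebServer:
    """Stand-in for one web server; it holds no session data of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, user):
        # Random, hard-to-guess session identifier handed back to the browser.
        session_id = secrets.token_urlsafe(32)
        self.store.save(session_id, {"user": user})
        return session_id

    def handle_request(self, session_id):
        session = self.store.load(session_id)
        if session is None:
            return f"{self.name}: please log in again"
        return f"{self.name}: hello {session['user']}"


# One shared database (in-memory SQLite here, purely for illustration).
shared_db = sqlite3.connect(":memory:")
store = SessionStore(shared_db)
server_a = WebServer("server A", store)
server_b = WebServer("server B", store)

# The user logs in on server A, then the load balancer sends them to server B.
sid = server_a.login("michael")
print(server_b.handle_request(sid))  # prints "server B: hello michael"
```

Because neither server keeps the session in its own memory, either one can be taken out of service without forcing the user to log back in.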
Once this task is complete, some continuous integration magic can start to happen. With a little bit of scripting, we can now test new features and bug fixes in a live environment via our test server before deploying to production. When the update is ready, the AWS CodeDeploy agent’s application specification (AppSpec) file takes over and manages the deployment scripts via event hooks. The agent spins up a new instance on AWS, installs the update, runs our builds, and starts up the web server application. If the deployment succeeds with no errors, the load balancer stops sending traffic to the old instance and starts sending it to the new one.
Once the deployment is complete, the old instance is terminated and removed from the server farm. If a deployment fails for any reason, a rollback is performed, so the prior instance stays in service.
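The deploy-then-switch flow described above can be sketched in a few lines. This is a simplified simulation, not the CodeDeploy API: the Instance and LoadBalancer classes, the deploy function, and the hook names are all hypothetical, and a hook that raises an exception stands in for any failed deployment step.

```python
class Instance:
    """Hypothetical stand-in for one server instance."""

    def __init__(self, version):
        self.version = version
        self.running = True

    def terminate(self):
        self.running = False


class LoadBalancer:
    """Hypothetical load balancer that sends all traffic to one active instance."""

    def __init__(self, instance):
        self.active = instance

    def route(self):
        return self.active.version


def deploy(balancer, new_version, hooks):
    """Bring up a new instance, run the hooks, and switch traffic only on success."""
    old = balancer.active
    candidate = Instance(new_version)
    try:
        for hook in hooks:
            hook(candidate)  # e.g. install the update, run builds, start the server
    except Exception:
        # Rollback: discard the new instance; the old one never left service.
        candidate.terminate()
        return False
    balancer.active = candidate  # traffic now goes to the new instance
    old.terminate()              # the old instance leaves the server farm
    return True


# Hypothetical hooks, named only for this example.
def install_update(instance):
    pass


def start_web_server(instance):
    pass


def failing_build(instance):
    raise RuntimeError("build failed")


balancer = LoadBalancer(Instance("1.0"))

good = deploy(balancer, "1.1", [install_update, start_web_server])
after_good = balancer.route()

bad = deploy(balancer, "1.2", [install_update, failing_build])
after_bad = balancer.route()

print(good, after_good)  # prints "True 1.1"
print(bad, after_bad)    # prints "False 1.1"
```

The key design point is the order of operations: traffic is switched only after every step has succeeded, so a failed update never reaches a customer.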
Next, we set our system up with AWS Auto Scaling. This allows us to add and remove servers based on CPU usage. For example, if the servers are running at 70% CPU usage or higher, AWS will automatically spin up a new server, install the last successful deployment on it, and balance the load across the instances. Then, if we post a new update, AWS will roll the update out across the deployment groups one instance at a time.
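The scaling rule itself is simple enough to sketch. In the example below only the 70% scale-out threshold comes from our setup as described; the 30% scale-in threshold, the server limits, and the function name are assumptions made for illustration, and AWS evaluates its own policies rather than running code like this.

```python
from statistics import mean

SCALE_OUT_CPU = 70.0  # add a server at or above this average CPU usage
SCALE_IN_CPU = 30.0   # assumed threshold for removing a server
MIN_SERVERS = 1       # assumed lower limit
MAX_SERVERS = 4       # assumed upper limit


def desired_server_count(current, cpu_percentages):
    """Return how many servers should run, given each server's CPU usage."""
    average = mean(cpu_percentages)
    if average >= SCALE_OUT_CPU:
        return min(current + 1, MAX_SERVERS)
    if average <= SCALE_IN_CPU:
        return max(current - 1, MIN_SERVERS)
    return current


print(desired_server_count(1, [85.0]))        # prints 2
print(desired_server_count(2, [50.0, 55.0]))  # prints 2
print(desired_server_count(2, [20.0, 25.0]))  # prints 1
```

Keeping a gap between the two thresholds prevents the system from adding and removing servers over and over when the load hovers around a single value.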
So, the next question is “Really… Does it actually work?” The answer is “Hell Yea!” That is why I am writing this automation corner! When I was a calibration technician, I used to hate that I couldn’t run a calibration procedure overnight. The IT group had to shut down the database to perform a full backup. If my procedure lost connection to the database, all my calibration data was lost and I had to start over.
Last week I was in the process of doing two demos. In one of the demos, the customer was going to calibrate an HP 34401A. While the calibration was running, David sent me a WhatsApp message. He had just fixed an error the customer found in the previous demo. Should he post the update or wait until I was done with the demo?
I said, “Post it!” The deployment groups posted the update to our test and production servers, then updated the load balancer. Everything ran perfectly! The calibration we were running didn’t lose a single test point!
Links: