Recovering from a production site outage during peak hours can be a daunting task. While everyone wants 100% uptime, it can be nearly impossible to achieve because some failures are outside our control. However, we can plan for these potential outages and architect the application and infrastructure to allow for a quick recovery.
Notifications and Logging
Clients do not want to be the ones notifying us of an issue on their sites. We need to know the status of the application and be able to track down the root causes of issues.
We have logs in place that devs and ops can consult when debugging an issue, and that help identify the root cause of an outage. The two we use most are Serilog (with logging to Seq) and Application Insights.
In Application Insights we also have alerts set up at different thresholds to notify us of the overall application health. A few that we make sure to set up include:
- When the application is experiencing an unusual number of errors
- When the site availability tests start failing, indicating the site is down
- When the infrastructure is taking on more load than normal
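The three alert conditions above can be sketched as simple threshold checks. This is only an illustration of the idea, not the Application Insights API: the `HealthSnapshot` type, the alert names, and the threshold values are all hypothetical, and in practice the rules are configured in Application Insights itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSnapshot:
    """Hypothetical metrics collected for one evaluation window."""
    requests: int               # total requests in the window
    failed_requests: int        # requests that ended in an error
    availability_failures: int  # consecutive failed availability tests
    cpu_percent: float          # average CPU across instances


# Illustrative thresholds only; real values depend on the site's normal traffic.
MAX_ERROR_RATE = 0.05
MAX_AVAILABILITY_FAILURES = 2
MAX_CPU_PERCENT = 80.0


def triggered_alerts(snapshot: HealthSnapshot) -> list[str]:
    """Return the names of the alerts that should fire for this window."""
    alerts = []
    # Guard against division by zero when the window saw no traffic.
    error_rate = (
        snapshot.failed_requests / snapshot.requests if snapshot.requests else 0.0
    )
    if error_rate > MAX_ERROR_RATE:
        alerts.append("unusual-error-rate")
    if snapshot.availability_failures >= MAX_AVAILABILITY_FAILURES:
        alerts.append("site-unavailable")
    if snapshot.cpu_percent > MAX_CPU_PERCENT:
        alerts.append("high-load")
    return alerts
```

Keeping each condition as its own named alert makes it clear in the notification which kind of problem was detected.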
Having the right notifications and logging set up on a site puts us in a position to tell our client that we found something and that we are already working to resolve it.
Automated Deployments
Automated deployments to each environment improve visibility and reduce the potential for manual errors when deploying. We use Release Pipelines in Azure DevOps, so everyone on the team can see what the latest deployment to each environment was and click through to see the exact commit that triggered the release.
Acceptance and regression test steps give us the confidence that most errors will be caught before making it to production. The history of deployments gives us the ability to quickly roll back to a previous version of code in the event there is an issue.
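As a rough sketch of how a deployment history supports a quick rollback, the snippet below picks the most recent successful release before the one that is currently live. The `Deployment` type and `rollback_target` function are hypothetical and are not part of Azure DevOps; there, the equivalent step would be redeploying an earlier release from the pipeline history.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one release to an environment."""
    release_id: int
    commit: str
    succeeded: bool


def rollback_target(history: list[Deployment]) -> Optional[Deployment]:
    """Pick the most recent successful deployment before the current one.

    `history` is ordered oldest to newest; the last entry is what is live now.
    Returns None when there is nothing earlier to roll back to.
    """
    for deployment in reversed(history[:-1]):
        if deployment.succeeded:
            return deployment
    return None
```

Because each record carries its commit, the rollback target also tells the team exactly which version of the code they are returning to.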
Infrastructure as Code
Infrastructure as Code is pretty much what it sounds like: instead of only documenting the infrastructure configuration and having someone make changes manually when they are needed, the infrastructure is defined in code files and stored with the application code in version control. All changes to the infrastructure are made in these files and go through the same Pull Request process as the application code.
For our Azure infrastructure we are using ARM templates, and for AWS infrastructure we are using Terraform. There are steps in the Release that verify the existing infrastructure against these templates. If the two do not match, the infrastructure is updated to match the template. For example, if someone manually updated the settings and configuration of an App Service but did not update the template, the change would get overwritten.
An added benefit of using these templates and release steps is that, if we ever needed to recreate the infrastructure, we could do so in a matter of minutes.
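The verify-and-overwrite behaviour can be illustrated with a small sketch. This is a simplified model of the idea, not how ARM or Terraform work internally: settings are plain dictionaries, the names `find_drift` and `reconcile` are hypothetical, and a value of `None` stands for "not set".

```python
def find_drift(template: dict, actual: dict) -> dict:
    """Return {setting: (actual_value, template_value)} for every mismatch."""
    drift = {}
    # Check settings from both sides so manual additions are detected too.
    for key in template.keys() | actual.keys():
        if template.get(key) != actual.get(key):
            drift[key] = (actual.get(key), template.get(key))
    return drift


def reconcile(template: dict, actual: dict) -> dict:
    """The template always wins: any manual change in `actual` is discarded."""
    return dict(template)


# Hypothetical App Service settings: someone changed the SKU and added a
# debug flag by hand without updating the template.
template = {"sku": "S1", "always_on": True}
actual = {"sku": "S2", "always_on": True, "debug": True}

drift = find_drift(template, actual)
fixed = reconcile(template, actual)
```

Here `drift` reports both the changed SKU and the extra debug flag, and `fixed` is identical to the template, which is why manual changes do not survive a release.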
Application Architecture – Cloud First
The cloud gives us the ability to quickly spin up and scale resources, but it does require a change in thinking while building an application. In a microservice architecture, for example, the application needs to handle the scenario of a service being down. Files should be saved to an Azure Storage Account rather than to the file structure of the application, because with our infrastructure defined in IaC, that filesystem may be regularly purged.
When something unexpected does occur, you want to be able to react quickly. Once the immediate concerns have been addressed, it is worth taking some time to figure out what happened and what you can learn from the experience, either to prevent it from recurring or to allow for a quicker reaction and turnaround.
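One way to handle a dependent service being down is to retry briefly and then fall back to a safe default rather than failing the whole request. The sketch below is a minimal illustration under that assumption; `call_with_fallback` and the recommendations example are hypothetical, and a production system would more likely rely on an established resilience library.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: T, attempts: int = 2) -> T:
    """Try `primary` up to `attempts` times; return `fallback` if it keeps failing."""
    for _ in range(attempts):
        try:
            return primary()
        except ConnectionError:
            # The service is unreachable; try again or fall through to the default.
            continue
    return fallback


def fetch_recommendations() -> list[str]:
    """Stand-in for a call to a hypothetical service that is currently down."""
    raise ConnectionError("recommendation service unavailable")


# The page still renders, just without recommendations.
recommendations = call_with_fallback(fetch_recommendations, fallback=[])
```

The important design choice is that the failure is contained: the caller gets a degraded but valid result instead of an unhandled exception.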
This is not the end
This was a general overview of some of the tools and techniques for a smooth-running application – each of these items could be a post or five in themselves (and may be eventually – keep an eye on the blog). It is not a full list of possibilities either – each tool, technique or process will add a tangible benefit to keeping an application healthy.
What are some processes you have implemented, or know of, that help keep things running smoothly? Comment below.