Monitoring and Managing Production Applications

It was noon on a workday when an email came in: “Your Azure Monitor alert was triggered – availability test failed”. A few seconds later came a mention in Teams. The next thing I knew, I was on a call with a developer, helping to track down the issue and get the site back up.

Recovering from a production site outage during peak hours can be a daunting task. While everyone wants 100% uptime, it can be nearly impossible to achieve because some failures are out of our hands. However, we can plan for these potential outages and architect the application and infrastructure to allow for a quick recovery.

Notifications and Logging 

Clients do not want to be the ones notifying us of an issue on their sites. We need to know the status of the application and be able to track down the root causes of issues.

We have logs in place that devs and ops can consult when debugging an issue and that help identify the root cause of an outage. The two we use most are Serilog (with logging to Seq) and Application Insights.

In Application Insights we also have alerts set up at different thresholds to notify us of the overall application health. A few that we make sure to set up include:

  • When the application is experiencing an unusual number of errors
  • When the site availability tests start failing, indicating the site is down
  • If the infrastructure is taking on more load than normal 

Having the right notifications and logging set up on a site puts us in a position to tell our client that we found something and are already working to resolve it.
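To make the three alert conditions above concrete, here is a minimal Python sketch of the kind of threshold check an alerting rule performs. It is not the Application Insights API: the HealthSnapshot type, the evaluate_alerts function and every threshold value are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class HealthSnapshot:
    """Hypothetical metrics collected over one evaluation window."""
    failed_requests: int
    total_requests: int
    failed_availability_tests: int
    cpu_percent: float


def evaluate_alerts(snapshot, max_error_rate=0.05, max_failed_tests=2,
                    max_cpu_percent=80.0):
    """Return the names of the alerts that should fire for this snapshot.

    All thresholds are illustrative assumptions, not recommended values.
    """
    alerts = []

    # Unusual number of errors: compare the error rate with a threshold.
    if snapshot.total_requests:
        error_rate = snapshot.failed_requests / snapshot.total_requests
    else:
        error_rate = 0.0
    if error_rate > max_error_rate:
        alerts.append("error-rate")

    # Availability tests failing: several failures suggest the site is down.
    if snapshot.failed_availability_tests >= max_failed_tests:
        alerts.append("availability")

    # Infrastructure taking on more load than normal.
    if snapshot.cpu_percent > max_cpu_percent:
        alerts.append("load")

    return alerts
```

In practice the thresholds would be tuned per site, since a rule that fires too often tends to get ignored.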


Continuous Delivery 

Automated deployments to each environment help with visibility and reduce the potential for manual errors when deploying. We use Release Pipelines in Azure DevOps, so everyone on the team can see the latest deployment to each environment and click through to the exact commit that triggered the release.

Acceptance and regression test steps give us the confidence that most errors will be caught before making it to production.  The history of deployments gives us the ability to quickly roll back to a previous version of code in the event there is an issue.  
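As a rough illustration of how a deployment history supports a rollback, the Python sketch below picks the most recent successful release before the current one. The Deployment record and the rollback_target function are hypothetical and do not correspond to the Azure DevOps API; a real pipeline would read this history from the release tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one release to an environment."""
    release_id: int
    commit: str
    succeeded: bool


def rollback_target(history, current_release_id) -> Optional[Deployment]:
    """Return the latest successful deployment older than the current one.

    Returns None when there is nothing earlier to roll back to.
    """
    earlier = [
        d for d in history
        if d.release_id < current_release_id and d.succeeded
    ]
    return max(earlier, key=lambda d: d.release_id, default=None)
```

Failed releases are skipped so that a rollback never lands on a version that was itself broken.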

Infrastructure as Code 

Infrastructure as Code is pretty much what it sounds like – instead of just having documentation on the configuration of infrastructure and someone manually making changes when they are needed, the infrastructure is defined in code files and stored with the application code in version control. All changes to the infrastructure are done in these files and sent through the same Pull Request process as the application code.  

For our Azure infrastructure we use ARM templates, and for AWS infrastructure we use Terraform. There are steps in the Release that verify the existing infrastructure against these templates. If the two do not match, the infrastructure is updated to match the template. For example, if someone updated the settings and configuration of an App Service manually but did not update the template, the change would be overwritten.
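The verify-and-update step can be pictured as a comparison between the desired settings in the template and the settings currently live. The Python sketch below shows that idea only; it is not how ARM or Terraform are implemented, and the find_drift and reconcile functions and the setting names in the usage note are hypothetical.

```python
def find_drift(template, actual):
    """Return settings whose live value differs from the template.

    The result maps each drifted key to a (live value, desired value) pair.
    """
    drift = {}
    for key, desired in template.items():
        if key not in actual or actual[key] != desired:
            drift[key] = (actual.get(key), desired)
    return drift


def reconcile(template, actual):
    """Return the live settings with template-managed values overwritten.

    Manual changes to any setting the template defines are lost, which is
    the overwrite behavior described above.
    """
    updated = dict(actual)
    updated.update(template)
    return updated
```

For instance, with a template of {"sku": "S1", "always_on": True} and a live App Service where someone manually switched always_on to False, find_drift reports that one setting, and reconcile puts it back to True.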

An added benefit of using these templates and release steps is that if we ever needed to recreate the infrastructure, we could do so in a matter of minutes.


Application Architecture – Cloud First 

The cloud gives us the ability to quickly spin up and scale resources; however, it does require a change in thinking while building an application. In a microservice architecture, for example, the application must be able to handle a service being down. Files should be saved to an Azure Storage Account rather than to the file system of the application – with our infrastructure defined in IaC, that file system may be regularly purged.
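One common way to handle a dependent service being down is to fall back to a degraded result instead of failing the whole request. The Python sketch below shows that pattern under stated assumptions: call_with_fallback and fetch_recommendations are hypothetical names, and which exceptions count as "service down" depends on the client library in use.

```python
def call_with_fallback(primary, fallback,
                       handled=(ConnectionError, TimeoutError)):
    """Call primary(); if the service looks unavailable, use fallback().

    Only the exception types listed in `handled` trigger the fallback, so
    genuine bugs still surface instead of being silently swallowed.
    """
    try:
        return primary()
    except handled:
        return fallback()


def fetch_recommendations():
    """Hypothetical call to a downstream service that is currently down."""
    raise ConnectionError("recommendation service unavailable")


# The page still renders, just without recommendations.
recommendations = call_with_fallback(fetch_recommendations, lambda: [])
```

A production version would usually add timeouts and logging so that the fallback being used shows up in the alerts described earlier.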

Continuous Learning 

When something unexpected does occur, you want to be able to react quickly. Once the immediate concerns have been addressed, it is worth taking some time to figure out what happened and what you can learn from the experience, either to prevent it from recurring or to allow for a quicker reaction and turnaround.

With these processes and tools in place, we were able to quickly find the solution to the outage that opened this post. By deleting the production App Service and redeploying the production infrastructure and code, we were able to get the site back up and running.

This is not the end 

This was a general overview of some of the tools and techniques for a smooth-running application – each of these items could be a post or five in themselves (and may be eventually – keep an eye on the blog). It is not a full list of possibilities either – each tool, technique or process will add a tangible benefit to keeping an application healthy.

What are some processes you have implemented or know of that help keep things running smoothly?
