5 Best Practices for Creating Cloud Infrastructure That Breeds Success
Today, a successful business is no guarantee that the technology behind the scenes is up to standard. In my experience, I’ve seen plenty of thriving companies, from early-stage start-ups to unicorns, running on infrastructure that has drifted out of scope or lost focus. Even companies with great DevOps can sometimes overlook their infrastructure and utilization needs, and this can have a direct impact on business stability and growth.
Of course, there are many ways to write and build your infrastructure, and the “right way” to go about this process will differ between companies. However, there are certainly some best practices that make a great starting point for discussion, and I want to share 5 of the main ones right here.
#1: The Importance of Highly Available Infrastructure
I recently revamped the infrastructure of a company that had been suffering downtime over the course of 2 months. They couldn’t work out the problem, yet when I looked into their production environment, I saw a single instance of each service, no load balancers, and a database running at 100% CPU utilization. Yikes!
This is an extreme example, but I’ve seen many versions of these configuration problems before. Your infrastructure should be built with no single point of failure – this is what it means to be highly available. You want at least 2 instances, preferably in different Availability Zones depending on your cloud; you should put all applications behind a load balancer so that faulty instances can be taken out of rotation when necessary; and you need enough spare CPU that you don’t run into latency problems.

Altogether, a high-availability setup means you are far better placed to meet your SLAs, you can handle load and scaling on the fly, and new deployments don’t interrupt your service. Even in an emergency, if an application shuts down or throws an exception, your servers, databases and functions keep working as usual, ensuring continuity of service and no business impact.
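To make this concrete, here is a minimal Pulumi (Python) sketch of that pattern: an Auto Scaling group of at least two instances, spread across subnets in different Availability Zones, behind an Application Load Balancer with a health check. It assumes an existing VPC, two subnets and a pre-built AMI; the resource names, config keys and sizes are hypothetical placeholders, not a production-ready setup.

```python
import pulumi
import pulumi_aws as aws

cfg = pulumi.Config()
vpc_id = cfg.require("vpcId")                 # existing VPC (hypothetical config key)
subnet_ids = cfg.require_object("subnetIds")  # two subnets in different AZs
ami_id = cfg.require("amiId")                 # pre-built application AMI

# Launch template describing how each application instance is created.
template = aws.ec2.LaunchTemplate("app-template",
    image_id=ami_id,
    instance_type="t3.small")

# Target group + health check so faulty instances are taken out of rotation.
target_group = aws.lb.TargetGroup("app-tg",
    port=80,
    protocol="HTTP",
    vpc_id=vpc_id,
    health_check=aws.lb.TargetGroupHealthCheckArgs(path="/healthz"))

# Application Load Balancer spread across both Availability Zones.
alb = aws.lb.LoadBalancer("app-lb",
    load_balancer_type="application",
    subnets=subnet_ids)

aws.lb.Listener("app-listener",
    load_balancer_arn=alb.arn,
    port=80,
    protocol="HTTP",
    default_actions=[aws.lb.ListenerDefaultActionArgs(
        type="forward",
        target_group_arn=target_group.arn)])

# At least two instances, spread across the subnets/AZs, replaced automatically
# when the load balancer health check marks them unhealthy.
aws.autoscaling.Group("app-asg",
    min_size=2,
    desired_capacity=2,
    max_size=4,
    vpc_zone_identifiers=subnet_ids,
    target_group_arns=[target_group.arn],
    health_check_type="ELB",
    launch_template=aws.autoscaling.GroupLaunchTemplateArgs(
        id=template.id,
        version="$Latest"))
```

With something like this in place, a single failed instance or a rolling deployment never takes the whole service down, because the other instance keeps serving traffic behind the load balancer.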
#2: Visibility Inside Your Environment
Another best practice is being able to see, at a glance, everything that is happening inside your cloud environment. The core of your infrastructure should be exposed as metrics, logs and events, giving you a streamlined system where you can see everything you need.
A system with the right visibility tools will help you recover from bugs faster and find the actual root cause of a problem, rather than attempting to fix issues with trial and error or guesswork. It also means that even less experienced developers can learn their way around your infrastructure quickly and uncover problems without needing to escalate to senior developers.
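As one small illustration of the idea, emitting structured (JSON) logs with consistent fields turns “find the root cause” into a query rather than guesswork. This is just a sketch; the service name, field names and event names are an example convention, not a prescribed schema.

```python
import json
import logging
import time

logger = logging.getLogger("checkout-service")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def handle_order(order_id: str) -> None:
    started = time.monotonic()
    try:
        ...  # business logic goes here
    except Exception as exc:
        # One structured event per failure: easy to filter, group and alert on.
        logger.error(json.dumps({
            "event": "order_failed",
            "order_id": order_id,
            "error": str(exc),
            "duration_ms": round((time.monotonic() - started) * 1000),
        }))
        raise
    logger.info(json.dumps({
        "event": "order_processed",
        "order_id": order_id,
        "duration_ms": round((time.monotonic() - started) * 1000),
    }))
```

Once every service logs in the same shape, any developer can filter by `event` and `order_id` in your log tooling and follow a request end to end, instead of asking the person who wrote the code.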
In my opinion, there is nothing more frustrating than chasing a bug and its cause for days or weeks at a time, with the pressure of business continuity on my shoulders. It’s much better to be the hero who quickly isolates issues as they occur.
#3: Question the Number of Environments You Need
Many industries use the holy trio of environments, “development, staging and production”, to manage their infrastructure. Sometimes even more environments are added: pre-production, management, QA, and so on. In some very niche use cases, companies might need a separate environment for each customer, if they can’t host multi-tenant environments and each customer has unique needs. I’ve even heard of companies with 50 different environments under their roof! In some cases this might be necessary, but I like to challenge customers to ask themselves, “Could you reduce the number of environments that you hold and manage down to… two?” This may sound difficult, but remember, environments cost money and add complexity and overhead, not just in deployment but also in management. If you can reduce your environments down to development and production alone, you’ll see real savings and a reduction in complexity.
“But what about staging? Don’t I need it?” (I hear you ask!)
Often, the answer to that is a big no! Staging is meant to be a mirror of production, but the truth is that it never really is. It never sees real traffic, it comes with its own moving parts, and people have permissions there that they don’t have in production, which means they may keep developing in this environment, especially in a fast-paced company. The system is never static, so staging can never really catch up. Instead, make sure you have the right Gitflow so that your development environment looks the way you imagine staging, and say goodbye to the cost and overhead of managing staging for good.
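One way to keep two environments honest is to define them from the same code. Here is a minimal Pulumi (Python) sketch in which the same program is deployed as two stacks, “dev” and “prod”, differing only by configuration; the config keys, instance sizes and tags are illustrative assumptions.

```python
import pulumi
import pulumi_aws as aws

# One program, two stacks ("dev" and "prod") instead of a pile of environments.
cfg = pulumi.Config()
stack = pulumi.get_stack()  # e.g. "dev" or "prod"

server = aws.ec2.Instance(f"app-{stack}",
    ami=cfg.require("amiId"),
    instance_type=cfg.get("instanceType") or "t3.micro",  # small in dev, larger in prod
    tags={
        "Environment": stack,
        "Owner": "platform-team",
    })

pulumi.export("app_public_ip", server.public_ip)
```

Because both environments come from the same definition, development stays structurally close to production by construction, and anything that would only ever exist in “staging” has nowhere to hide.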
#4: Use a Raw Data Lake
One of the biggest problems I see is multiple departments using the same database while only one department maintains it. This commonly causes issues such as a lack of data integrity, downtime, resource-sharing misconfigurations and faulty data flows.
Companies usually consolidate data this way to save money and reduce their infrastructure costs, but unfortunately I’ve seen how it eventually leads to a lot of frustration and problems. For example, departments like R&D and BI usually have opposite needs when it comes to digesting data. BI may run heavy queries that put the database under load, and this stops production from functioning as it should.
Instead, to increase the integrity of the data, and to prevent queries from slowing down production servers, there should be a single source of truth in the form of a raw data lake. This is the first place your data lands, before any changes are made to it. Later on, each department can ETL from it for its own data flows as necessary, or choose specific resources to share, but not on the production database.
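As a small sketch of the “land it raw first” step, here is a Python (boto3) function that writes each event to an S3 bucket exactly as received, partitioned by source and date. The bucket name, key layout and event shape are hypothetical; the point is only that transformation happens later, downstream of the lake, not on the production database.

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
RAW_BUCKET = "acme-raw-data-lake"  # hypothetical bucket name

def store_raw_event(source: str, event: dict) -> str:
    """Land the event in the lake exactly as received, before any transformation."""
    now = datetime.now(timezone.utc)
    key = f"raw/{source}/{now:%Y/%m/%d}/{event['id']}.json"
    s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=json.dumps(event).encode("utf-8"))
    return key

# Later, BI or R&D run their own ETL jobs against the lake (for example with
# Athena or Spark), instead of querying the production database directly.
```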
#5: Avoid Infrastructure Mess
The final issue I’ll share is when companies let their engineers have full permissions across many cloud environments. In some companies, I’ve seen this happen with the majority of engineers. Usually, it leads to resource bloat, with engineers creating resources with no details attached. Once the engineer forgets about the resource, or leaves the company, the orphaned service is left without an owner, taking up storage, adding risk and complexity, and quietly costing the company money.
Cases like this can even lead to security breaches: source code that is open to the entire world, open ports, and tokens, credentials and passwords that are never rotated or changed.
When I see an environment like this, I suggest solutions such as maintaining your environment with IaC tools like Terraform or Pulumi. You could also set up a flow using an AWS Lambda or Azure Function that cleans up unused resources after a fixed period of time, and you can create policies so that resources can only be created with tags, allowing you to review your infrastructure on a monthly basis and keep everything in order.
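Here is one possible shape for that cleanup flow: a scheduled AWS Lambda handler, written in Python with boto3, that finds running EC2 instances missing the tags your policy requires and stops them. The required tag names are a hypothetical policy, and it stops rather than terminates as a deliberately cautious default; treat it as a sketch, not a drop-in tool.

```python
import boto3

ec2 = boto3.client("ec2")
REQUIRED_TAGS = {"Owner", "Project"}  # hypothetical tagging policy

def handler(event, context):
    """Scheduled Lambda: stop running instances that break the tagging policy."""
    untagged = []
    reservations = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    for reservation in reservations:
        for instance in reservation["Instances"]:
            tags = {t["Key"] for t in instance.get("Tags", [])}
            if not REQUIRED_TAGS.issubset(tags):
                untagged.append(instance["InstanceId"])
    if untagged:
        # Stop (rather than terminate) so a real owner can still reclaim the instance.
        ec2.stop_instances(InstanceIds=untagged)
    return {"stopped": untagged}
```

Run on a schedule (for example via an EventBridge rule), a flow like this surfaces orphaned resources quickly, and the monthly tag review becomes a confirmation rather than an archaeology project.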
Bringing These Best Practices Together
The right partner can help you set up your infrastructure from day one with these best practices in mind, or identify the smart changes to a current cloud setup that could alleviate existing issues such as downtime, latency, or high costs. Reach out to schedule a call to discuss your unique cloud environment!