Improving Terraform resiliency with automation to fix failed commands

PUBLISHED
January 11, 2024
READ TIME
10 min

Earlier this year, our product team encountered a trend with some of our users who rely on Terraform to define their application infrastructure.

After initiating certain commands, transient errors with Terraform or the cloud service provider caused those commands to fail.

In almost all of these cases, simply re-trying the Terraform command would fix the problem. This manual step, however, was still just as frustrating to our team as it was to the Terraform users we worked with.

Our users typically deploy environments in a matter of minutes, with basically no coding or orchestration. Even the simple step of retrying the Terraform commands was pulling them away from their day-to-day work. It was the kind of additional step that we try to remove for our users.

So we looked into what we can do to fix failed Terraform commands automatically.

How does Quali Torque leverage Terraform?

First, it helps to understand how our platform works.

Quali Torque connects to the user’s Git repositories, discovers the Terraform modules within them, and creates a new YAML file in the platform—which we call blueprints—that leverages the resource configuration defined in the module.

This allows administrators to define the configuration of complete application environments as code in YAML. For example, if an environment requires multiple cloud services which are delivered in different IaC or Kubernetes tools (e.g. Terraform, CloudFormation, and Helm), Torque would discover those resources in Git, normalize the infrastructure configuration, and create a blueprint that defines all infrastructure components and how they should work together to spin up the environment.

Learn more with our guide to scaling Infrastructure as Code

Once an administrator is comfortable with the inputs, dependencies, and outputs for the environment defined in the YAML, they can “publish” it, which releases it to the teams with permissions to deploy the environment. From there, those teams can simply click “Launch” on the environment in the catalog to initiate the creation of those cloud resources and generate the environment (integrations extend this access to developer tools like IDE and CLI, as well as operational tools like CI/CD and Internal Development Platforms).

When launching the environment, Torque executes the Terraform plan for every cloud resource. If a command fails, the process would break down. This also applies to the destroy command—any failure in this function will leave environments running (and accruing costs) longer than the developer intended, even though they attempted to terminate it.

Identifying and correcting common Terraform errors

After speaking with some of our users, we identified a initial list of common errors that were corrected by a simple retry of the command.

We then updated Torque to recognize when one of these errors caused a failed command, then automatically re-try the Terraform plan in response. Essentially, the platform automates the retry step so the DevOps engineer doesn’t get pulled away from more important work.

The update had pretty quick results. Shortly after pushing it live, our users reported improved success rate for environment deployments, fewer idle cloud resources, and better overall resiliency as a result.

This kind of automation is just one way DevOps teams can cut out redundant manual work.

Watch a demo of Quali Torque to learn more

Additional Insights