Using AI to correct IaC provisioning errors more efficiently

Published: May 17, 2024

When an Infrastructure as Code (IaC) command fails, DevOps teams often have to dig through line after line of code to find out what caused the issue.

Meanwhile, any work that depends on the successful launch of that resource is held up, which adds to the pressure and frustration for the DevOps team tasked with delivering it.

So our team asked: why spend time digging through code when an AI tool could do it for us?

To cut down the manual debugging needed to deploy application infrastructure successfully, we integrated ChatGPT to do some of that work on behalf of our users.

Here’s how it works.

Using AI to identify cloud provisioning errors

To see how AI can improve cloud resiliency, it helps to first look at how Quali Torque’s platform engineering tools provision infrastructure.

Torque provides a self-service catalog for any user to provision application infrastructure securely. To accomplish this, the platform connects to the user’s Git repository, discovers the IaC modules within, and generates new YAML files in Torque based on the resource configurations from their IaC modules.

Torque discovers and leverages the IaC modules in the user’s Git repository to provide a self-service developer portal for provisioning.
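
To make the discovery step more concrete, here is a minimal sketch in Python. It assumes the IaC modules are Terraform directories in a local checkout of the repository; the function names and the regular expression are illustrative assumptions and do not reflect Torque’s actual implementation.

```python
import re
import tempfile
from pathlib import Path

# Matches Terraform declarations such as: variable "region" {
VARIABLE_RE = re.compile(r'^\s*variable\s+"([^"]+)"', re.MULTILINE)


def discover_modules(repo_root):
    """Map each directory containing .tf files to the input variables it declares."""
    modules = {}
    for tf_file in sorted(Path(repo_root).rglob("*.tf")):
        module_dir = tf_file.parent.relative_to(repo_root).as_posix()
        found = VARIABLE_RE.findall(tf_file.read_text(encoding="utf-8"))
        # setdefault registers the module even when a file declares no variables
        modules.setdefault(module_dir, []).extend(found)
    return modules


def demo():
    """Build a throwaway repository layout and run discovery against it."""
    with tempfile.TemporaryDirectory() as tmp:
        bucket = Path(tmp) / "modules" / "s3-bucket"
        bucket.mkdir(parents=True)
        (bucket / "variables.tf").write_text(
            'variable "bucket_name" {}\nvariable "region" {}\n', encoding="utf-8"
        )
        (bucket / "main.tf").write_text(
            'resource "aws_s3_bucket" "this" {}\n', encoding="utf-8"
        )
        return discover_modules(tmp)


if __name__ == "__main__":
    print(demo())
```

A real discovery pass would also need to handle other IaC types and a proper HCL parser; the regular expression here is only enough to show the idea of turning module inputs into catalog metadata.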

With these resources in the platform, Torque administrators can design complete application environments (for example, a staging environment for an application that requires multiple cloud services defined in IaC), and the platform will automatically create a new YAML file that defines all the code needed to run it.

Administrators can select IaC assets for a unique environment and Torque will generate the code to run it automatically.
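
As a rough illustration of how several IaC modules could be combined into one environment definition, the sketch below renders a small YAML document from a dictionary of modules. The keys it uses (`environment`, `resources`, `source`, `inputs`) are made up for this example and are not Torque’s actual blueprint schema.

```python
def render_blueprint(name, modules):
    """Render a minimal YAML document describing one environment.

    `modules` maps a module path to the list of inputs it expects.
    The key names are hypothetical and chosen only for illustration.
    """
    lines = [f"environment: {name}", "resources:"]
    for path, inputs in modules.items():
        lines.append(f"  - source: {path}")
        if inputs:
            lines.append("    inputs:")
            lines.extend(f"      - {item}" for item in inputs)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    staging = {
        "modules/s3-bucket": ["bucket_name", "region"],
        "modules/rds": [],
    }
    print(render_blueprint("staging", staging))
```

Keeping the environment definition in a single generated file is what lets one catalog entry stand in for several underlying modules.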

These YAML files serve as the plan behind the self-service catalog. Users can find the cloud resource or environment they need in the catalog and simply click “launch,” and Torque will execute the code to provision every resource the plan defines.

Torque applies all security credentials needed to run the infrastructure and enforces role-based permissions to ensure that end users cannot create new configurations or modify existing ones unless given administrator access.
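
The role-based rule described above can be sketched as a simple permission lookup. The role names and action names below are assumptions made for this example; the platform’s real permission model is not documented here.

```python
from enum import Enum


class Role(Enum):
    END_USER = "end_user"
    ADMIN = "admin"


# Hypothetical action names: end users may only launch what the catalog offers.
PERMISSIONS = {
    Role.END_USER: {"launch"},
    Role.ADMIN: {"launch", "create_configuration", "modify_configuration"},
}


def is_allowed(role, action):
    """Return True if the given role may perform the given action."""
    return action in PERMISSIONS.get(role, set())


if __name__ == "__main__":
    print(is_allowed(Role.END_USER, "launch"))
    print(is_allowed(Role.END_USER, "modify_configuration"))
```

Defaulting to an empty permission set means an unrecognized role is denied everything, which is the safer failure mode for a self-service catalog.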

From the catalog, end users can provision infrastructure and environments by simply clicking “launch.”

Once the environment is live, the user can access the outputs generated by the plan through self-service.

They can also review the logs and dive into the configurations for each cloud resource that the platform launched.

This is where our ChatGPT integration steps in.

If one of the cloud resources fails to launch, ChatGPT automatically evaluates the code and notifies the user of any errors that occurred.

When an error occurs, AI insights recommend potential causes so the user can resolve them quickly.
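
One way to picture this step is as a prompt assembled from the failed resource’s definition and its error log, then handed to a chat model. The sketch below is an assumption about how such an integration could be structured, not a description of Torque’s internals. The `FailedResource` type, the prompt wording, and the `ask_model` callable are all hypothetical; the model call is injected so the example runs without any API client.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailedResource:
    name: str
    iac_source: str  # the IaC code that defines the resource
    error_log: str   # raw output captured from the failed run


def build_prompt(failure, max_log_chars=2000):
    """Assemble a diagnostic prompt, keeping only the tail of a long log."""
    log_tail = failure.error_log[-max_log_chars:]
    return (
        "A cloud resource failed to provision.\n"
        f"Resource: {failure.name}\n\n"
        "IaC definition:\n"
        f"{failure.iac_source}\n\n"
        "Error log (most recent output):\n"
        f"{log_tail}\n\n"
        "List the most likely causes and a suggested fix for each."
    )


def diagnose(failure, ask_model: Callable[[str], str]):
    """Send the prompt to whatever chat model the caller provides."""
    return ask_model(build_prompt(failure))


if __name__ == "__main__":
    failure = FailedResource(
        name="app-bucket",
        iac_source='resource "aws_s3_bucket" "this" { bucket = var.bucket_name }',
        error_log="Error: creating S3 bucket: BucketAlreadyExists",
    )
    # Stand-in for a real model call so the sketch is runnable offline.
    fake_model = lambda prompt: "Likely cause: the bucket name is already taken."
    print(diagnose(failure, fake_model))
```

Truncating the log to its tail keeps the prompt within a model’s input limit while preserving the lines closest to the failure, which usually carry the error message.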

With the error identified, the user can begin correcting the issue.

This integration builds on the platform’s other automation features to improve the resiliency of the infrastructure our users provision.

To get a better idea of how Torque’s platform engineering tools work, try launching a sample environment via the interactive Torque Playground.
