The $5 Decision That Wiped 2.5 Years of Production Data

March 9, 2026

When Agentic Freedom Meets Infrastructure, Governance Isn’t Optional

Earlier this month, Alexey Grigorev, an AI educator, founder of DataTalks.Club, and one of the most followed voices in the data engineering community, published an incident report that every team running AI agents against cloud infrastructure should read.

It matters because nothing about it was dramatic. It was ordinary. No malicious actor. No zero-day exploit. No rogue model. Just a capable AI agent, a missing Terraform state file, and no governance layer. In one session, 2.5 years of course data was gone.

What Actually Happened

Alexey was migrating his AI Shipping Labs website to AWS. To save a few dollars a month, he decided to share the existing Terraform setup with his already-running DataTalks.Club production infrastructure rather than create a separate environment.

Claude Code explicitly advised against combining the two setups. Alexey overrode the recommendation.

What followed was a cascade of small, individually reasonable decisions:

The Terraform state file was on his old laptop, not the new one. Without it, the agent assumed no infrastructure existed and began creating duplicate resources.
Alexey noticed the duplicates, stopped the agent, and transferred the state file from his old machine to fix the situation.
The agent, now armed with the full state, did what a well-trained agent would be expected to do: it set out to clean up the duplicate resources with the same tool that had created them. It ran terraform destroy.
The state file described the full DataTalks.Club production infrastructure. Everything went: the RDS database, the VPC, the ECS cluster, load balancers, the bastion host.
The automated snapshots were deleted alongside the database.
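
The mechanics behind this cascade can be illustrated with a toy model. This is a minimal sketch, not Terraform's actual algorithm: it assumes infrastructure can be represented as a set of resource names, and every name in it (plan_apply, plan_destroy, the resource labels) is hypothetical. It shows why an empty state makes everything look missing, and why a destroy run against a state file that tracks production targets production.

```python
# Toy model only: real Terraform compares configuration, state, and live
# provider data with far more nuance. All resource names are hypothetical.

def plan_apply(config: set, state: set) -> dict:
    """What an apply would do so that tracked resources match the config."""
    return {
        "create": sorted(config - state),  # in config, unknown to state
        "delete": sorted(state - config),  # tracked, no longer in config
    }


def plan_destroy(state: set) -> list:
    """A destroy targets every resource the state file tracks."""
    return sorted(state)


production = {"prod.rds_database", "prod.vpc", "prod.ecs_cluster"}
new_site = {"site.ecs_service"}
config = production | new_site  # one shared setup describes both

# New laptop, no state file: nothing is tracked, so everything looks missing
# and would be created again, duplicating what already runs in AWS.
empty_state_plan = plan_apply(config, state=set())

# State file copied over from the old laptop: it tracks production.
destroy_targets = plan_destroy(production)

print(empty_state_plan["create"])
print(destroy_targets)
```

In this model the tool never decides what is "production". It acts on whatever the state file lists, which is why where the state lives and what it tracks matter so much.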

When Alexey asked Claude where the database was, the answer was straightforward: it had been deleted.

He checked the RDS console for backups. Nothing visible. The AWS Events log showed a backup had been created at 2 AM, but clicking it opened nothing.

After upgrading to AWS Business Support (adding ~10% to his monthly cloud bill), a support engineer found a snapshot that wasn’t visible in his console. A 24-hour recovery process began. When it finished, 1,943,200 rows were back in the courses answer table alone. He was very fortunate.

The Agent Didn’t Fail. The Governance Did.

This is the detail that matters most: the agent behaved correctly at every step.

It warned against combining infrastructures. It reported what it was doing. It used Terraform, the right tool, to clean up Terraform-managed resources. Its logic was sound.

The failure was entirely in the conditions surrounding the agent:

No least-privilege access: the agent had permissions to destroy anything in the AWS account
No approval gate: a terraform destroy touching production required no human confirmation
No deletion protection: neither Terraform nor AWS had safeguards preventing accidental destruction
No isolated state: the state file lived locally on a single laptop, not in S3 where it could be consistently and safely referenced
No environment isolation: dev and production infrastructure shared the same Terraform workspace

These aren’t exotic enterprise requirements. They’re table stakes for any team operating infrastructure at scale, with or without AI agents. But when you add an agent with execution permissions, the cost of missing any one of them increases dramatically.

The Central Tension of Agentic AI in Infrastructure

There’s a reason this incident resonates so widely: it captures the core tension every engineering team will face as AI agents become standard infrastructure tools.

Agentic freedom is valuable precisely because it removes friction. An agent that can plan, execute, and iterate autonomously can compress hours of infrastructure work into minutes. That’s not a marginal improvement; it’s a fundamental shift in how engineering teams operate. But friction isn’t always waste. Sometimes friction is a safety boundary.

The terraform plan → manual review → terraform apply workflow that Alexey now follows post-incident isn’t bureaucracy. It’s governance. It’s the recognition that some actions, particularly destructive, irreversible ones, require a human checkpoint before execution.
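
One way to make that checkpoint mechanical is to inspect the plan before anyone runs apply. The sketch below assumes the plan has been exported as JSON (for example with terraform plan -out=tfplan followed by terraform show -json tfplan) and reads the resource_changes entries from it. The sample data and resource addresses are made up for illustration, and the function names are hypothetical.

```python
import json


def destructive_changes(plan: dict) -> list:
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement shows up as a delete paired with a create.
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged


def requires_human_approval(plan: dict) -> bool:
    """Any deletion or replacement sends the plan to a human reviewer."""
    return bool(destructive_changes(plan))


# Hand-written sample shaped like `terraform show -json` output.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_db_instance.courses", "type": "aws_db_instance",
     "change": {"actions": ["delete"]}},
    {"address": "aws_ecs_service.site", "type": "aws_ecs_service",
     "change": {"actions": ["create"]}},
    {"address": "aws_lb.main", "type": "aws_lb",
     "change": {"actions": ["delete", "create"]}}
  ]
}
""")

print(destructive_changes(sample_plan))
print(requires_human_approval(sample_plan))
```

A check like this does not replace the manual review. It only guarantees that a plan containing deletions cannot reach apply without one.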

The question isn’t whether to give agents infrastructure autonomy. The answer to that is clearly yes, and it’s only going to expand. The question is: at what level of autonomy does your governance need to operate, and are you sure it’s keeping pace?

What Changed and What It Tells Us

After the incident, Alexey rebuilt his safety posture from the ground up:

Backups outside Terraform state: snapshots that survive a terraform destroy
Daily automated restore testing: a Lambda that spins up a database from the latest backup every night to verify it’s actually restorable
Deletion protection at two levels: in Terraform configuration and in AWS itself
S3-backed Terraform state: no more state files living on a single laptop
Agents no longer execute Terraform commands: generate a plan, review manually, run commands himself
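
The AWS-side half of the deletion protection item can also be checked as policy rather than remembered. The following is a rough sketch, assuming the same JSON plan export and the AWS provider's aws_db_instance resource, whose deletion_protection and skip_final_snapshot arguments appear under each change's after values. The function name and sample data are hypothetical. The Terraform-side guard (for example a lifecycle block with prevent_destroy) lives in configuration and is not covered by this check.

```python
def unprotected_databases(plan: dict) -> list:
    """List RDS instances in a plan that could be deleted without a safety net."""
    problems = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_db_instance":
            continue
        after = rc.get("change", {}).get("after")
        if after is None:
            # The resource is being deleted; there is no resulting config.
            continue
        if not after.get("deletion_protection"):
            problems.append(f"{rc['address']}: deletion_protection is not enabled")
        if after.get("skip_final_snapshot"):
            problems.append(f"{rc['address']}: skip_final_snapshot is enabled")
    return problems


# Hand-written sample; addresses and values are illustrative only.
sample_plan = {
    "resource_changes": [
        {"address": "aws_db_instance.courses", "type": "aws_db_instance",
         "change": {"actions": ["update"],
                    "after": {"deletion_protection": False,
                              "skip_final_snapshot": True}}},
        {"address": "aws_db_instance.analytics", "type": "aws_db_instance",
         "change": {"actions": ["create"],
                    "after": {"deletion_protection": True,
                              "skip_final_snapshot": False}}},
    ]
}

for problem in unprotected_databases(sample_plan):
    print(problem)
```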

Every one of these changes is a governance control. And every one of them could have been in place before the incident, automatically, as policy, rather than implemented in the aftermath. This is exactly what platforms like Quali Torque are built for.

The Result

Savings: ~$5-10/month by not spinning up a separate AWS environment

Cost: A 24-hour outage, AWS Business Support upgrade (~10% added to his monthly bill permanently), and very nearly the permanent loss of 2.5 years of data for thousands of students

How Torque Addresses This Class of Problem

Torque is Quali’s environment-as-a-service platform built specifically for teams that need to give engineers, and increasingly, AI agents, access to cloud infrastructure without losing control of what they do with it.

Where Alexey had to manually rebuild his governance posture after an incident, Torque enforces it as the baseline:

Least-privilege access policies define exactly what an agent or engineer can touch in a given environment. An agent scoped to a dev workspace cannot reach production resources, by policy, not by hope.
Approval workflows for destructive actions mean that operations like terraform destroy, database deletion, or environment teardown require explicit human sign-off before execution. The agent proposes; a human approves.
A full audit trail logs every action taken in every environment: what ran, when, by whom (or what), and what it changed. When something goes wrong, the investigation starts with data, not guesswork.
Automated TTL and drift detection keep environments clean and controlled. Temporary environments expire. Drift from expected state is flagged before it becomes an incident.
Policy enforcement before execution means violations are caught at the plan stage, not discovered in the post-mortem.
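
To make the first three controls concrete, here is a generic sketch of a scope check combined with an approval requirement and an audit record. It is not Torque's API or implementation. Every name in it (Actor, authorize, the action labels, the audit_log list) is hypothetical and exists only to illustrate the pattern.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical action labels treated as destructive in this sketch.
DESTRUCTIVE_ACTIONS = frozenset({"destroy", "delete_database", "teardown"})

audit_log = []  # in practice this would be durable, append-only storage


@dataclass(frozen=True)
class Actor:
    name: str
    allowed_environments: frozenset


def authorize(actor: Actor, environment: str, action: str) -> str:
    """Return 'deny', 'needs_approval', or 'allow', and record the decision."""
    if environment not in actor.allowed_environments:
        decision = "deny"  # outside the actor's scope, whatever the action
    elif action in DESTRUCTIVE_ACTIONS:
        decision = "needs_approval"  # in scope, but a human must sign off
    else:
        decision = "allow"
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "actor": actor.name,
        "environment": environment,
        "action": action,
        "decision": decision,
    })
    return decision


agent = Actor("coding-agent", frozenset({"dev"}))
print(authorize(agent, "dev", "apply"))
print(authorize(agent, "dev", "destroy"))
print(authorize(agent, "production", "destroy"))
```

The order of the checks is the design choice worth noting: scope is evaluated first, so an out-of-scope request is denied outright instead of being queued for an approval that someone might grant by habit.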

The goal isn’t to slow agents down. It’s to give them a clearly defined operating space within which they can move fast, and to ensure that the boundaries of that space are enforced automatically, not assumed.

The Broader Lesson

Alexey published this incident as a warning, and it’s worth taking seriously precisely because of who he is. Alexey is an experienced engineer who runs one of the most active data engineering communities in the world, teaches infrastructure practices to thousands of practitioners, and works with these tools daily. He knew Terraform. He knew AWS. When Claude warned him not to combine the setups, he knew the tradeoff he was making. He just underestimated it.

That’s the point. If this can happen to him, it can happen to anyone on your team.

Agentic AI doesn’t raise the skill requirement; it raises the stakes. The same autonomy that compresses hours of infrastructure work into minutes can compress hours of damage into seconds. Governance isn’t a safeguard for people who don’t know what they’re doing. It’s a safeguard for the moments when context is incomplete, assumptions are wrong, or things move faster than expected.

Alexey’s post-mortem is a practical guide for the engineering community. The least we can do is act on it.

Quali Torque Agentic AI Platform for Infrastructure

https://www.quali.com/torque/

References: Alexey’s On Data article:

How I Dropped Our Production Database and Now Pay 10% More for AWS
