Session 4: Infrastructure as Code
Why IaC?
Imagine your laptop dies tomorrow and the new one needs to be set up exactly like the old one — same VS Code, same extensions, same SSH keys, same shell config. If you do it by clicking around, you’ll forget half of it and the new machine will quietly behave differently for months. That, scaled to a production cluster, is the problem Infrastructure as Code (IaC) solves.
| Without IaC | With IaC |
|---|---|
| Click around the cloud console | Describe everything in code |
| Snowflake servers — no two are alike | Identical environments — dev, staging, prod |
| “Who created this resource? When? Why?” | Git history answers all three |
| Recreate from a Word doc and prayer | terraform apply rebuilds it |
| Manual changes drift silently | Drift is detectable and reviewable |
The rule: if it’s not in code, it doesn’t exist. Anything you click into existence in a UI is a future incident waiting to happen — because no one else knows it exists, no one can reproduce it, and no one will remember why it was done.
What “as code” actually buys you
- Versioned: every change goes through git and is reviewed in PRs
- Auditable: git blame tells you who added that firewall rule and why
- Repeatable: same code produces the same infra, every time
- Reversible: bad change? git revert and re-apply
- Testable: you can lint and validate infra before applying it
Two Categories of IaC
People casually say “IaC” but there are really two distinct jobs, and the tools for each are different.
| | Provisioning | Configuration |
|---|---|---|
| What | Create the infrastructure | Configure what’s on the infrastructure |
| Question it answers | “Does this server exist?” | “Is nginx installed and running on it?” |
| Tool | Terraform, Pulumi, CloudFormation | Ansible, Chef, Puppet |
| Example | Spin up a DigitalOcean Kubernetes cluster | Install kubectl, helm, doctl on your laptop |
| State | Has long-lived state (a state file) | Mostly stateless (idempotent runs) |
In modern stacks the line blurs — Kubernetes itself does a lot of “configuration” work declaratively, so Ansible is less central than it was a decade ago. But the categorization still helps you pick the right tool when both seem to fit.
Terraform — The Recap
Terraform is declarative: you describe what you want to exist, not how to make it exist. You write .tf files, run terraform apply, and Terraform figures out the API calls to bring reality in line with your code.
The Core Workflow
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│   init   │──▶│   plan   │──▶│  apply   │──▶│ destroy  │
└──────────┘   └──────────┘   └──────────┘   └──────────┘
  download     "what would      "do it"       "tear it
  providers     change?"                      all down"
| Command | What It Does |
|---|---|
| terraform init | Downloads providers, initializes the backend (state storage) |
| terraform plan | Computes the diff between your code and reality. Does not change anything. |
| terraform apply | Executes the plan. Asks for confirmation unless you pass -auto-approve. |
| terraform destroy | Removes everything Terraform created (the inverse of apply). |
Always run plan before apply. The plan is your safety net. If the plan shows “5 resources to destroy” when you only meant to add a tag, that’s the moment to stop and figure out why, before you wreck production.
The Five Building Blocks
Every Terraform configuration is built from these:
| Block | Purpose | Example |
|---|---|---|
| provider | Which cloud/SaaS to talk to | provider "digitalocean" { ... } |
| resource | A thing to create and manage | resource "digitalocean_kubernetes_cluster" "k8s" { ... } |
| variable | An input you parameterize | variable "digitalocean_token" {} |
| output | A value to expose after apply | output "registry_endpoint" { ... } |
| data | Read-only lookup of existing infra | data "digitalocean_image" "ubuntu" { ... } |
That’s it. Five things. Every Terraform repo you’ll ever read is some combination of these.
A Real Example — loadsnap/infra/platform/cloud/do/
Let’s walk through the actual Terraform that runs the loadsnap production stack on DigitalOcean. Open platform/cloud/do/ and you’ll see this layout:
platform/cloud/do/
├── provider.tf # Which provider, which version
├── variables.tf # Inputs (the API token)
├── k8s.tf # The Kubernetes cluster
├── container-registery.tf # The Docker image registry
├── outputs.tf # Values to expose
├── terraform.tfstate # ⚠️ state file (more on this)
└── .terraform.lock.hcl # Provider version lockfile
provider.tf — declaring what we’re talking to
terraform {
  required_providers {
    digitalocean = {
      source  = "digitalocean/digitalocean"
      version = "~> 2.0"
    }
  }
}

provider "digitalocean" {
  token = var.digitalocean_token
}
Three things to notice:
- The required_providers block locks the provider source and version. ~> 2.0 means “any 2.x version, but never 3.x”, which protects you from a major breaking change landing overnight.
- provider "digitalocean" is the actual configuration: the API token comes from a variable.
- The token is never hardcoded. It’s injected via the TF_VAR_digitalocean_token environment variable.
variables.tf — declaring inputs
variable "digitalocean_token" {
  description = "The token for the DigitalOcean API."
}
Bare-minimum variable definition. In a more polished setup you’d also declare:
variable "digitalocean_token" {
  description = "DigitalOcean API token (read/write)."
  type        = string
  sensitive   = true # masks it in logs
}
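Marking secrets as sensitive = true is the safest default: it stops Terraform from echoing the value in plan and apply output. It does not keep the value out of the state file, which is one more reason to protect state.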
k8s.tf — the actual cluster
resource "digitalocean_kubernetes_cluster" "k8s_cluster" {
  name    = "loadsnap-cluster"
  region  = "tor1"
  version = "1.32.10-do.2"

  node_pool {
    name       = "loadsnap"
    size       = "s-2vcpu-4gb"
    node_count = 4
  }
}
Read this top-to-bottom and notice the density:
- resource "digitalocean_kubernetes_cluster" "k8s_cluster": the resource type comes from the provider; the name (k8s_cluster) is your local identifier for referencing it elsewhere.
- region = "tor1": Toronto datacenter. Pick one close to your users.
- version = "1.32.10-do.2": pinned Kubernetes version. Don’t use “latest”, or your cluster will silently upgrade and your manifests may break.
- node_pool: the worker nodes. Four nodes, each 2 vCPU / 4 GB RAM.
That’s a production Kubernetes cluster in 13 lines.
container-registery.tf — the Docker registry
resource "digitalocean_container_registry" "registry" {
  name                   = "loadsnap-registry"
  subscription_tier_slug = "basic"
}
subscription_tier_slug = "basic" is a typed enum from the provider — only specific values are allowed. If you typo it as "Basic", plan rejects it before it ever hits the API.
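The same fail-at-plan behavior can be reproduced for your own inputs with a validation block on a variable. A minimal sketch, assuming a hypothetical variable named registry_tier and assuming the allowed tiers are starter, basic, and professional (check the provider documentation for the current list):

```hcl
# Hypothetical input: the variable name and the tier list are assumptions.
variable "registry_tier" {
  description = "Container registry subscription tier."
  type        = string
  default     = "basic"

  validation {
    # contains() compares strings exactly, so "Basic" is rejected at plan time.
    condition     = contains(["starter", "basic", "professional"], var.registry_tier)
    error_message = "The registry_tier value must be one of: starter, basic, professional."
  }
}

resource "digitalocean_container_registry" "registry" {
  name                   = "loadsnap-registry"
  subscription_tier_slug = var.registry_tier
}
```

With this in place, terraform plan -var "registry_tier=Basic" should stop with the error message above before any API call is made.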
outputs.tf — exposing useful values
output "registry_endpoint" {
  value = digitalocean_container_registry.registry.endpoint
}
After apply, Terraform prints:
Outputs:
registry_endpoint = "registry.digitalocean.com/loadsnap-registry"
This is the registry URL you docker push to. You can also fetch it later without re-running apply:
terraform output registry_endpoint
Running it end-to-end
cd platform/cloud/do
export TF_VAR_digitalocean_token="dop_v1_..."
terraform init # once per working directory; re-run after provider or backend changes
terraform plan # see what would change
terraform apply # actually do it
That’s the whole cycle. A 50-line repo creates a Kubernetes cluster, a container registry, and exposes the registry URL.
State — Where Terraform Keeps Reality
Terraform needs to know what it created, otherwise the next plan would think everything is new and try to create duplicates. That memory lives in the state file — terraform.tfstate.
┌─────────────┐ ┌──────────────────┐ ┌─────────────┐
│ Your code │ ◀──diff──│ terraform.tfstate│──────▶ │ Real cloud │
│ (.tf files)│ │ ("what we made") │ api │ resources │
└─────────────┘ └──────────────────┘ └─────────────┘
The state file is the source of truth for what Terraform manages. It contains:
- IDs of every resource Terraform created
- Current attribute values (so plan can compute drift)
- Sometimes secrets (DB passwords, generated keys)
Three rules of state
- Never edit it by hand. Use terraform state commands instead.
- Never commit it to git. It contains secrets and creates merge conflicts.
- Use remote state in any team setting. Local state can’t be shared or locked across machines, so concurrent runs are unsafe.
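For the most common reason people are tempted to hand-edit state, renaming a resource, a moved block lets Terraform update the state for you. A minimal sketch, assuming Terraform 1.1 or newer; the new name primary is hypothetical and chosen only for illustration:

```hcl
# Assumes Terraform 1.1+, where moved blocks are available.
# "primary" is a hypothetical new local name for the existing cluster.
moved {
  from = digitalocean_kubernetes_cluster.k8s_cluster
  to   = digitalocean_kubernetes_cluster.primary
}

resource "digitalocean_kubernetes_cluster" "primary" {
  name    = "loadsnap-cluster"
  region  = "tor1"
  version = "1.32.10-do.2"

  node_pool {
    name       = "loadsnap"
    size       = "s-2vcpu-4gb"
    node_count = 4
  }
}
```

The next plan should report the resource as moved rather than proposing a destroy and a create, and the rename itself is reviewable in a PR.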
⚠️ Real example from our repo: Open platform/cloud/do/ and you’ll see terraform.tfstate and terraform.tfstate.backup checked into git. That’s not a feature, that’s a legacy debt item. It works today because only one person runs apply, but it’s a foot-gun: two people running apply from their own clones will end up with conflicting or corrupted state, and any secret in there is now in git history forever. The fix is moving to remote state.
What remote state looks like
For DigitalOcean, use Spaces (their S3-compatible object store):
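terraform {
  backend "s3" {
    # The endpoints map needs Terraform 1.6+
    endpoints                   = { s3 = "https://nyc3.digitaloceanspaces.com" }
    bucket                      = "loadsnap-tfstate"
    key                         = "do/terraform.tfstate"
    region                      = "us-east-1" # required, but ignored by DO
    skip_credentials_validation = true
    skip_region_validation      = true
    skip_metadata_api_check     = true
    skip_requesting_account_id  = true # there is no AWS account behind Spaces
    skip_s3_checksum            = true # avoids AWS-specific checksum headers
  }
}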
Or for AWS (the most common pattern):
terraform {
  backend "s3" {
    bucket         = "company-tfstate"
    key            = "loadsnap/do/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tfstate-locks" # locking: prevents concurrent applies
    encrypt        = true
  }
}
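The DynamoDB table is critical: it provides a lock so two engineers can’t apply simultaneously. Without it, you can shred your state. Recent Terraform releases can also lock natively in S3 with use_lockfile = true, which removes the need for the table.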
Variables, Outputs, and the .tfvars Pattern
Hardcoding values is fine for a toy. For a real repo you parameterize.
Four ways to set a variable
| Method | Use For |
|---|---|
| variable block default | Sensible default that rarely changes |
| terraform.tfvars file | Per-environment overrides (committed if no secrets) |
| TF_VAR_<name> env var | Secrets, CI runs, anything you don’t want on disk |
| -var "name=value" flag | One-off overrides |
Example from platform/cloud/atlas/
# variables.tf
variable "atlas_public_key" {}
variable "atlas_private_key" {}
variable "atlas_project_id" {}
variable "cluster_name" {
  default = "wedophobia-cluster"
}

# providers.tf
provider "mongodbatlas" {
  public_key  = var.atlas_public_key
  private_key = var.atlas_private_key
}

# cluster.tf
resource "mongodbatlas_cluster" "cluster_test" {
  project_id                  = var.atlas_project_id
  name                        = var.cluster_name
  provider_name               = "TENANT"
  backing_provider_name       = "AWS"
  provider_region_name        = "US_EAST_1"
  provider_instance_size_name = "M0"
  cluster_type                = "REPLICASET"
}
The Atlas keys come from environment variables in CI (TF_VAR_atlas_public_key, etc.) — they’re never committed.
Outputs that other tools consume
# output.tf
output "connection_string_srv" {
  value = mongodbatlas_cluster.cluster_test.connection_strings[0].standard_srv
}
After apply, the MongoDB connection string is printed and saved into state. CI can pull it with terraform output -raw connection_string_srv and feed it into the K8s secret that the API uses.
Multi-Environment Patterns
You’ll need dev, staging, and prod environments. Three common approaches:
Pattern A: Directory per environment (what loadsnap uses for workloads)
workload/loadsnap-back-end/
├── dev/
│ ├── deployment.yaml
│ └── secrets.yaml
└── prod/
├── deployment.yaml
└── secrets.yaml
Each environment has its own folder, its own state, its own everything. Pros: dead simple, no chance of cross-contamination. Cons: code duplication.
Pattern B: Workspaces (built into Terraform)
terraform workspace new dev
terraform workspace new prod
terraform workspace select prod
terraform apply
Same code, different state per workspace. Pros: no duplication. Cons: easy to forget which workspace you’re in and apply prod changes thinking you’re in dev. Use ${terraform.workspace} in resource names to keep them separate.
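One way to follow that naming advice is to derive both names and sizing from the workspace. A minimal sketch; the per-environment node counts are illustrative assumptions, not values from the loadsnap repo:

```hcl
locals {
  # terraform.workspace is "default" until you create and select another one.
  env = terraform.workspace

  # Hypothetical per-environment sizing; the numbers are illustrative only.
  node_counts = {
    dev  = 1
    prod = 4
  }
}

resource "digitalocean_kubernetes_cluster" "k8s_cluster" {
  name    = "loadsnap-${local.env}"
  region  = "tor1"
  version = "1.32.10-do.2"

  node_pool {
    name = "loadsnap-${local.env}"
    size = "s-2vcpu-4gb"

    # Fall back to a single node for any workspace not listed above.
    node_count = lookup(local.node_counts, local.env, 1)
  }
}
```

Because the workspace name is part of every resource name, an apply in the wrong workspace is at least visible in the plan output.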
Pattern C: Modules + per-env wrappers
modules/
└── do-cluster/
├── main.tf
├── variables.tf
└── outputs.tf
environments/
├── dev/main.tf # calls module with dev inputs
└── prod/main.tf # calls module with prod inputs
This is the gold standard for serious infra. The shared module is a black box; the per-env wrappers just pass different values. Pros: zero duplication, clear separation. Cons: more upfront design work.
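A rough sketch of how Pattern C could look for the cluster shown earlier. The file comments mark where each part would live; the module input names (env, node_count) and the dev/prod values are assumptions made for this example:

```hcl
# --- modules/do-cluster/main.tf ---
# A module that uses a non-HashiCorp provider must declare its source itself.
terraform {
  required_providers {
    digitalocean = {
      source = "digitalocean/digitalocean"
    }
  }
}

resource "digitalocean_kubernetes_cluster" "this" {
  name    = "loadsnap-${var.env}"
  region  = "tor1"
  version = "1.32.10-do.2"

  node_pool {
    name       = "loadsnap-${var.env}"
    size       = "s-2vcpu-4gb"
    node_count = var.node_count
  }
}

# --- modules/do-cluster/variables.tf ---
variable "env" {
  description = "Environment name, used in resource names."
  type        = string
}

variable "node_count" {
  description = "Number of worker nodes."
  type        = number
  default     = 1
}

# --- environments/prod/main.tf ---
module "cluster" {
  source     = "../../modules/do-cluster"
  env        = "prod"
  node_count = 4
}
```

The dev wrapper would be the same module block with env = "dev" and a smaller node_count, and each environment directory keeps its own state.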
For a small repo, Pattern A is fine. The moment you have 3+ environments or 5+ resources, jump to Pattern C.
Terraform Best Practices (the short list)
- Pin provider versions. version = "~> 2.0", not version = ">= 2.0".
- Pin module versions. Same logic: supply chain risk if you don’t.
- Always run plan before apply. Read the plan, even if it’s long.
- Use remote state with locking the moment a second person joins.
- Never put secrets in .tf files. Variables + env vars + sensitive flag.
- terraform fmt and terraform validate in CI. Costs nothing, catches bugs.
- Tag everything. tags = { managed_by = "terraform", env = "prod" } so a human can find unmanaged sprawl later.
- Don’t import resources just to satisfy state. Either Terraform owns it fully or it doesn’t.
- Small repos > one giant repo. Cluster, registry, DNS: split if blast radius matters.
- Comment the why, never the what. The HCL is self-documenting; the rationale isn’t.
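The pinning and tagging items above might look like this in practice. This is a sketch under stated assumptions: the required_version range, the module repository URL, and the v1.2.0 tag are hypothetical. Note that DigitalOcean resources take tags as a list of strings, whereas the key/value map form is what AWS-style providers use.

```hcl
terraform {
  # Pin the CLI as well as the provider; this range is an assumption.
  required_version = "~> 1.6"

  required_providers {
    digitalocean = {
      source  = "digitalocean/digitalocean"
      version = "~> 2.0"
    }
  }
}

# Pin modules to a fixed ref. The repository URL and tag are hypothetical.
module "cluster" {
  source = "git::https://example.com/loadsnap/terraform-modules.git//do-cluster?ref=v1.2.0"
  env    = "prod"
}

# Tags on DigitalOcean resources are plain strings, not key/value pairs.
resource "digitalocean_kubernetes_cluster" "k8s_cluster" {
  name    = "loadsnap-cluster"
  region  = "tor1"
  version = "1.32.10-do.2"
  tags    = ["managed-by-terraform", "env-prod"]

  node_pool {
    name       = "loadsnap"
    size       = "s-2vcpu-4gb"
    node_count = 4
  }
}
```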
IaC Testing
You wouldn’t ship app code without tests. Same applies to infra — though “tests” mean different things here.
| Tool | What It Catches |
|---|---|
| terraform fmt -check | Formatting drift (run in CI) |
| terraform validate | Syntax errors and undeclared references |
| terraform plan | The actual diff: your manual review is a test |
| tflint | Lint rules — typos in resource names, deprecated args, missing required fields |
| checkov | Security misconfigurations — public S3 buckets, world-open security groups, etc. |
| tfsec | Same space as checkov, different ruleset (use whichever your team prefers) |
| Terratest | Real integration tests — spins up infra, asserts behavior, tears it down |
A reasonable CI pipeline for Terraform looks like:
- terraform fmt -check
- terraform init -backend=false # enough for validate and lint, no state access
- terraform validate
- tflint
- checkov -d .
- terraform init # full init: plan needs the real backend
- terraform plan -out=tfplan # save plan as artifact
# manual approval step
- terraform apply tfplan # apply the *exact* plan that was reviewed
The -out=tfplan + apply tfplan combo is critical: it guarantees you apply the plan that was reviewed, not a fresh one computed at apply time (which might have drifted).
Ansible — Quick Recap
Where Terraform creates the box, Ansible configures what’s on it. It’s agentless (uses SSH) and idempotent (run it 100 times, same result).
The three core concepts
| | What |
|---|---|
| Inventory | List of hosts to manage — web1.example.com, db1.example.com |
| Playbook | YAML file describing what to do |
| Roles | Reusable bundles of playbook logic |
A minimal playbook
- hosts: webservers
  become: yes
  tasks:
    - name: Install nginx
      apt:
        name: nginx
        state: present
    - name: Make sure nginx is running
      service:
        name: nginx
        state: started
        enabled: yes
Run with:
ansible-playbook -i inventory.ini site.yml
When to use Ansible (and when not to)
| Use Ansible for | Don’t use Ansible for |
|---|---|
| Bootstrapping VMs (install packages, configure users) | Managing K8s resources (use kubectl/Helm/ArgoCD) |
| One-off operational tasks across many hosts | Anything Terraform already manages |
| Configuring legacy systems where you can’t run agents | App-level configuration (use ConfigMaps/Secrets) |
For a Kubernetes-native stack like ours, Ansible is mostly not needed — the cluster’s declarative model already does what Ansible would do.
GitOps — A Two-Minute Intro
GitOps is the pattern of using git as the single source of truth for not just your code but also your deployed state. The principle:
If it’s in git, it’s deployed. If it’s not in git, it’s not deployed.
Two flavors
| | Push-based (CI deploys) | Pull-based (agent reconciles) |
|---|---|---|
| How | CI pipeline runs kubectl apply | An in-cluster agent watches git, applies changes |
| Tools | GitHub Actions, GitLab CI, Jenkins | ArgoCD, Flux |
| Pros | Familiar, works with any cluster | Self-healing, no CI cluster credentials needed |
| Cons | Drift if someone runs kubectl apply manually | More moving parts |
The reconciliation loop (pull-based)
┌─────────┐ ┌──────────┐ ┌──────────┐
│ Git │ ◀── watches ──│ ArgoCD │ ── applies ──▶│ Cluster │
└─────────┘ └──────────┘ └──────────┘
▲ │ │
└────────── if drifts, re-applies from git ◀───────────┘
If someone manually edits a Deployment in the cluster, ArgoCD (with automated sync and self-heal enabled) notices the drift and reverts it, usually within seconds. Git is always what’s running.
Where loadsnap is today: push-based. kubectl apply is run from a developer laptop after sops -d. The roadmap (see docs/secrets-management.md) is to move to ArgoCD with sealed-secrets, which would make this fully pull-based.
Try This on Your Own
These exercises use the actual loadsnap/infra repo. Don’t run them against the real cluster — clone the repo locally and work from there. Most steps are read-only to keep you safe.
Exercise 1 — Read the existing Terraform (15 min)
- cd loadsnap/infra/platform/cloud/do
- Open every .tf file. Identify each resource, variable, and output.
- Run terraform init (it’ll download the provider).
- Run terraform plan. Without the token, Terraform prompts for it, and a dummy value fails with an auth error. Good. That confirms credentials aren’t hardcoded.
- Look at terraform.tfstate (it’s checked in). Find one resource ID. Find one secret. Now you understand why state shouldn’t be in git.
Goal: confidently navigate a real Terraform codebase.
Exercise 2 — Add a new variable + output (20 min)
In a fork or scratch copy:
- Add a variable "node_count" to variables.tf with a default of 4.
- Use it in k8s.tf: node_count = var.node_count.
- Add an output "cluster_endpoint" exposing digitalocean_kubernetes_cluster.k8s_cluster.endpoint.
- Run terraform plan and confirm it shows no resource changes, only the new output (because the value matches what’s deployed).
- Try terraform plan -var "node_count=5" and confirm it now shows a node pool change.
Goal: understand how variables flow from CLI → variable block → resource.
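If you get stuck, the end state of Exercise 2 could look roughly like the sketch below. The description text is an addition for illustration; everything else follows the steps above and the existing k8s.tf.

```hcl
# --- variables.tf ---
variable "node_count" {
  description = "Number of worker nodes in the node pool."
  type        = number
  default     = 4
}

# --- k8s.tf ---
resource "digitalocean_kubernetes_cluster" "k8s_cluster" {
  name    = "loadsnap-cluster"
  region  = "tor1"
  version = "1.32.10-do.2"

  node_pool {
    name       = "loadsnap"
    size       = "s-2vcpu-4gb"
    node_count = var.node_count # was the literal 4
  }
}

# --- outputs.tf ---
output "cluster_endpoint" {
  value = digitalocean_kubernetes_cluster.k8s_cluster.endpoint
}
```

Since the default equals the value already deployed, the resource itself should plan as unchanged until you override node_count on the command line.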
Exercise 3 — Refactor to remote state (planning only, no apply) (20 min)
- Sketch a backend "s3" block for DigitalOcean Spaces in a new backend.tf.
- Identify what new resources you’d need (the Space, an access key, an IAM equivalent).
- Write the migration plan: how do you move local state → remote state without losing it? (terraform init -migrate-state is the magic incantation.)
- List the rollback steps if it goes wrong.
Goal: learn to plan a stateful migration, not just write more HCL.
Where this leaves us
You should now be able to:
- Read a Terraform repo and explain what each block does
- Walk through init → plan → apply confidently
- Explain why state in git is a problem and what remote state fixes
- Add a variable, output, or new resource without breaking what’s there
- Pick the right multi-environment pattern for the size of your repo
- Spot security issues with tflint/checkov before they reach prod
- Sketch a GitOps roadmap for moving from “kubectl apply from laptop” to ArgoCD
Next session takes the artifacts you’d deploy with this infrastructure (Docker images) and goes deep on building them right.