Session 4: Infrastructure as Code

Why IaC?

Imagine your laptop dies tomorrow and the new one needs to be set up exactly like the old one — same VS Code, same extensions, same SSH keys, same shell config. If you do it by clicking around, you’ll forget half of it and the new machine will quietly behave differently for months. That, scaled to a production cluster, is the problem Infrastructure as Code (IaC) solves.

Without IaC	With IaC
Click around the cloud console	Describe everything in code
Snowflake servers — no two are alike	Identical environments — dev, staging, prod
“Who created this resource? When? Why?”	Git history answers all three
Recreate from a Word doc and prayer	`terraform apply` rebuilds it
Manual changes drift silently	Drift is detectable and reviewable

The rule: if it’s not in code, it doesn’t exist. Anything you click into existence in a UI is a future incident waiting to happen — because no one else knows it exists, no one can reproduce it, and no one will remember why it was done.

What “as code” actually buys you

Versioned — every change goes through git, reviewed in PRs
Auditable — git blame tells you who added that firewall rule and why
Repeatable — same code produces the same infra, every time
Reversible — bad change? git revert and re-apply
Testable — you can lint and validate infra before applying it

Two Categories of IaC

People casually say “IaC” but there are really two distinct jobs, and the tools for each are different.

	Provisioning	Configuration
What	Create the infrastructure	Configure what’s on the infrastructure
Question it answers	“Does this server exist?”	“Is nginx installed and running on it?”
Tool	Terraform, Pulumi, CloudFormation	Ansible, Chef, Puppet
Example	Spin up a DigitalOcean Kubernetes cluster	Install kubectl, helm, doctl on your laptop
State	Has long-lived state (a state file)	Mostly stateless (idempotent runs)

In modern stacks the line blurs — Kubernetes itself does a lot of “configuration” work declaratively, so Ansible is less central than it was a decade ago. But the categorization still helps you pick the right tool when both seem to fit.

Terraform — The Recap

Terraform is declarative: you describe what you want to exist, not how to make it exist. You write .tf files, run terraform apply, and Terraform figures out the API calls to bring reality in line with your code.

The Core Workflow

┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│   init   │──▶│   plan   │──▶│  apply   │──▶│ destroy  │
└──────────┘   └──────────┘   └──────────┘   └──────────┘
 download      "what would   "do it"       "tear it
 providers     change?"                    all down"

Command	What It Does
`terraform init`	Downloads providers, initializes the backend (state storage)
`terraform plan`	Computes the diff between your code and reality. Does not change anything.
`terraform apply`	Executes the plan. Asks for confirmation unless you pass `-auto-approve`.
`terraform destroy`	Removes every resource tracked in this configuration’s state (the inverse of apply).

Always run plan before apply. The plan is your safety net. If the plan shows “5 resources to destroy” when you only meant to add a tag, that’s the moment to stop and figure out why — before you wreck production.
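Reading the plan is a manual check. As a complementary guard rail, Terraform’s lifecycle meta-argument can make a destructive plan fail outright. The sketch below reuses the cluster definition from this repo’s k8s.tf (walked through later in this session); adding prevent_destroy to it is an assumption for illustration, not something the repo is known to do.

```hcl
resource "digitalocean_kubernetes_cluster" "k8s_cluster" {
  name    = "loadsnap-cluster"
  region  = "tor1"
  version = "1.32.10-do.2"

  node_pool {
    name       = "loadsnap"
    size       = "s-2vcpu-4gb"
    node_count = 4
  }

  lifecycle {
    # Assumption: not in the original repo. Any plan that would destroy
    # this resource now errors out instead of being offered for approval.
    prevent_destroy = true
  }
}
```

With this in place, a plan that wants to destroy or replace the cluster returns an error until the flag is removed, which turns an accidental teardown into a deliberate two-step change.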

The Five Building Blocks

Every Terraform configuration is built from these:

Block	Purpose	Example
provider	Which cloud/SaaS to talk to	`provider "digitalocean" { ... }`
resource	A thing to create and manage	`resource "digitalocean_kubernetes_cluster" "k8s" { ... }`
variable	An input you parameterize	`variable "digitalocean_token" {}`
output	A value to expose after apply	`output "registry_endpoint" { ... }`
data	Read-only lookup of existing infra	`data "digitalocean_image" "ubuntu" { ... }`

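That’s it. Five things. Nearly every Terraform repo you’ll read is built from these, plus a few supporting blocks: terraform (settings, seen in provider.tf below), module (Pattern C later in this session), and locals.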
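To see the five blocks working together, here is a minimal sketch in a single file. The droplet, the image slug, and all names are assumptions chosen for illustration; none of this is part of the loadsnap repo. The leading terraform block is only settings that tell Terraform where to download the provider from.

```hcl
terraform {
  required_providers {
    digitalocean = {
      source  = "digitalocean/digitalocean"
      version = "~> 2.0"
    }
  }
}

# variable: an input supplied at run time
variable "digitalocean_token" {
  type      = string
  sensitive = true
}

# provider: which API to talk to, and with what credentials
provider "digitalocean" {
  token = var.digitalocean_token
}

# data: read-only lookup of something that already exists
data "digitalocean_image" "ubuntu" {
  slug = "ubuntu-22-04-x64" # assumed image slug
}

# resource: a thing Terraform creates and manages
resource "digitalocean_droplet" "web" {
  name   = "example-web"
  region = "tor1"
  size   = "s-1vcpu-1gb"
  image  = data.digitalocean_image.ubuntu.id
}

# output: a value exposed after apply
output "web_ip" {
  value = digitalocean_droplet.web.ipv4_address
}
```

Notice how values flow: the variable feeds the provider, the data lookup feeds the resource, and the resource feeds the output. Terraform works out the ordering from those references.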

A Real Example — `loadsnap/infra/platform/cloud/do/`

Let’s walk through the actual Terraform that runs the loadsnap production stack on DigitalOcean. Open platform/cloud/do/ and you’ll see this layout:

platform/cloud/do/
├── provider.tf              # Which provider, which version
├── variables.tf             # Inputs (the API token)
├── k8s.tf                   # The Kubernetes cluster
├── container-registery.tf   # The Docker image registry
├── outputs.tf               # Values to expose
├── terraform.tfstate        # ⚠️  state file (more on this)
└── .terraform.lock.hcl      # Provider version lockfile

`provider.tf` — declaring what we’re talking to

terraform {
  required_providers {
    digitalocean = {
      source  = "digitalocean/digitalocean"
      version = "~> 2.0"
    }
  }
}

provider "digitalocean" {
  token = var.digitalocean_token
}

Three things to notice:

required_providers block locks the provider source and version. ~> 2.0 means “any 2.x version, but never 3.x” — protects you from a major breaking change landing overnight.
provider "digitalocean" is the actual configuration — the API token comes from a variable.
The token is never hardcoded. It’s injected via the TF_VAR_digitalocean_token environment variable.
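The ~> operator is shorthand, and it can help to see it spelled out. The sketch below writes the same constraint as an explicit range. The required_version line is an assumption added for illustration; the repo’s provider.tf does not show one.

```hcl
terraform {
  # Assumption: illustrative minimum CLI version, not taken from the repo.
  required_version = ">= 1.6.0"

  required_providers {
    digitalocean = {
      source = "digitalocean/digitalocean"

      # Equivalent to "~> 2.0": any 2.x release, never 3.0 or later.
      version = ">= 2.0, < 3.0"

      # For comparison, "~> 2.0.0" would be stricter:
      # >= 2.0.0 and < 2.1.0 (patch releases only).
    }
  }
}
```

The constraint is a range; the exact version that init picked within that range is recorded in .terraform.lock.hcl, which is why that file is committed.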

`variables.tf` — declaring inputs

variable "digitalocean_token" {
  description = "The token for the DigitalOcean API."
}

Bare-minimum variable definition. In a more polished setup you’d also declare:

variable "digitalocean_token" {
  description = "DigitalOcean API token (read/write)."
  type        = string
  sensitive   = true              # masks it in logs
}

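Marking secrets as sensitive = true is the safest default: it stops Terraform from echoing the value in plan and apply output. It is masking, not encryption, though. Any sensitive value that ends up in a resource attribute or an output is still written to the state file in plain text.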

`k8s.tf` — the actual cluster

resource "digitalocean_kubernetes_cluster" "k8s_cluster" {
  name    = "loadsnap-cluster"
  region  = "tor1"
  version = "1.32.10-do.2"

  node_pool {
    name       = "loadsnap"
    size       = "s-2vcpu-4gb"
    node_count = 4
  }
}

Read this top-to-bottom and notice the density:

resource "digitalocean_kubernetes_cluster" "k8s_cluster" — the resource type comes from the provider; the name (k8s_cluster) is your local identifier for referencing it elsewhere.
region = "tor1" — Toronto datacenter. Pick close to your users.
version = "1.32.10-do.2" — pinned Kubernetes version. Don’t use “latest” — your cluster will silently upgrade and your manifests may break.
node_pool — the worker nodes. Four nodes, each 2 vCPU / 4 GB RAM.

That’s a production Kubernetes cluster in 11 lines.
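The local identifier earns its keep as soon as a second resource needs to point at the cluster. As a sketch, here is an extra node pool that could sit in the same directory as k8s.tf. The pool itself (its name, size, and count) is hypothetical and not part of the repo.

```hcl
# Assumption: this extra pool is hypothetical and not part of the repo.
resource "digitalocean_kubernetes_node_pool" "batch" {
  # <resource type>.<local name>.<attribute> refers to the cluster in k8s.tf.
  cluster_id = digitalocean_kubernetes_cluster.k8s_cluster.id

  name       = "loadsnap-batch"
  size       = "s-2vcpu-4gb"
  node_count = 2
}
```

Because cluster_id references the cluster, Terraform knows to create the cluster first and the pool second. You never write the ordering down yourself.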

`container-registery.tf` — the Docker registry

resource "digitalocean_container_registry" "registry" {
  name                   = "loadsnap-registry"
  subscription_tier_slug = "basic"
}

subscription_tier_slug = "basic" is a typed enum from the provider — only specific values are allowed. If you typo it as "Basic", plan rejects it before it ever hits the API.
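You can get the same fail-early behaviour for your own inputs with a validation block. The sketch below assumes the tier is turned into a variable (the repo hardcodes "basic"), and the list of allowed tiers is an assumption to verify against the provider documentation.

```hcl
# Assumption: the repo hardcodes "basic"; this variable is hypothetical.
variable "registry_tier" {
  description = "DigitalOcean container registry subscription tier."
  type        = string
  default     = "basic"

  validation {
    # Assumed tier list; check the provider docs for the current values.
    condition     = contains(["starter", "basic", "professional"], var.registry_tier)
    error_message = "The registry_tier value must be one of: starter, basic, professional."
  }
}

resource "digitalocean_container_registry" "registry" {
  name                   = "loadsnap-registry"
  subscription_tier_slug = var.registry_tier
}
```

Passing -var "registry_tier=Basic" would then fail during plan with your own error message, because contains is case-sensitive.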

`outputs.tf` — exposing useful values

output "registry_endpoint" {
  value = digitalocean_container_registry.registry.endpoint
}

After apply, Terraform prints:

Outputs:
registry_endpoint = "registry.digitalocean.com/loadsnap-registry"

This is the registry URL you docker push to. You can also fetch it later without re-running apply:

terraform output registry_endpoint
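Not every output is safe to print. As a sketch, here is a hypothetical second output that exposes the cluster’s kubeconfig; it is not part of the repo’s outputs.tf. The DigitalOcean provider is understood to mark kube_config as sensitive, in which case Terraform refuses to plan the output unless it is flagged the same way.

```hcl
# Assumption: this output is hypothetical and not part of the repo's outputs.tf.
output "kubeconfig" {
  value = digitalocean_kubernetes_cluster.k8s_cluster.kube_config[0].raw_config

  # Masks the value in apply output. It is still stored in state.
  sensitive = true
}
```

After apply, the summary shows the value as <sensitive>. Running terraform output -raw kubeconfig prints the real value, so treat that command like reading a password.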

Running it end-to-end

cd platform/cloud/do

export TF_VAR_digitalocean_token="dop_v1_..."

terraform init             # once per working directory; re-run after provider or backend changes
terraform plan             # see what would change
terraform apply            # actually do it

That’s the whole cycle. A 50-line repo creates a Kubernetes cluster, a container registry, and exposes the registry URL.

State — Where Terraform Keeps Reality

Terraform needs to know what it created, otherwise the next plan would think everything is new and try to create duplicates. That memory lives in the state file — terraform.tfstate.

┌─────────────┐         ┌──────────────────┐         ┌─────────────┐
│  Your code  │ ◀──diff──│ terraform.tfstate│──────▶ │  Real cloud │
│  (.tf files)│         │ ("what we made") │  api   │  resources  │
└─────────────┘         └──────────────────┘         └─────────────┘

The state file is the source of truth for what Terraform manages. It contains:

IDs of every resource Terraform created
Current attribute values (so plan can compute drift)
Sometimes secrets (DB passwords, generated keys)

Three rules of state

Never edit it by hand. Use terraform state commands instead.
Never commit it to git. It contains secrets and creates merge conflicts.
Use remote state in any team setting. A local state file lives on one machine, so it can’t be shared or locked across a team.
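For the first rule, some state changes don’t even need the CLI: they can be written as code and reviewed in a PR. A rename is the classic case. Without help, Terraform sees a renamed resource as “destroy the old one, create a new one”. A moved block (Terraform 1.1 or later) tells it the two addresses are the same object. The rename from k8s_cluster to primary below is hypothetical.

```hcl
# Assumption: the rename from "k8s_cluster" to "primary" is hypothetical.
moved {
  from = digitalocean_kubernetes_cluster.k8s_cluster
  to   = digitalocean_kubernetes_cluster.primary
}

# The resource block itself is renamed to match the "to" address.
resource "digitalocean_kubernetes_cluster" "primary" {
  name    = "loadsnap-cluster"
  region  = "tor1"
  version = "1.32.10-do.2"

  node_pool {
    name       = "loadsnap"
    size       = "s-2vcpu-4gb"
    node_count = 4
  }
}
```

The next plan should report the address change and nothing to destroy. The CLI equivalent is terraform state mv, but the moved block leaves a record in git.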

⚠️ Real example from our repo: Open platform/cloud/do/ and you’ll see terraform.tfstate and terraform.tfstate.backup checked into git. That’s not a feature, that’s a legacy debt item. It works today because only one person runs apply, but it’s a foot-gun: two people running apply from separate checkouts each write their own copy of the state, so the copies diverge from each other and from reality, and any secret in there is now in git history forever. The fix is moving to remote state.

What remote state looks like

For DigitalOcean, use Spaces (their S3-compatible object store):

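terraform {
  backend "s3" {
    endpoints                   = { s3 = "https://nyc3.digitaloceanspaces.com" }   # Terraform 1.6+ syntax
    bucket                      = "loadsnap-tfstate"
    key                         = "do/terraform.tfstate"
    region                      = "us-east-1"        # required, but ignored by DO
    skip_credentials_validation = true
    skip_region_validation      = true
    skip_metadata_api_check     = true
    skip_requesting_account_id  = true               # Spaces has no AWS account ID to look up
    skip_s3_checksum            = true               # Spaces may reject AWS-specific checksum headers
  }
}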

Or for AWS (the most common pattern):

terraform {
  backend "s3" {
    bucket         = "company-tfstate"
    key            = "loadsnap/do/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tfstate-locks"   # locking — prevents concurrent applies
    encrypt        = true
  }
}

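The DynamoDB table is critical: it provides a lock so two engineers can’t apply simultaneously. Without it, you can shred your state. Note that the Spaces example above configures no lock at all, so it protects you from losing the file but not from concurrent applies.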
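Newer Terraform releases can also take the lock inside the bucket itself, without a DynamoDB table. The sketch below assumes Terraform 1.10 or later and a real AWS S3 bucket; whether an S3-compatible store such as Spaces supports this locking mode is not something this session verifies.

```hcl
terraform {
  backend "s3" {
    bucket  = "company-tfstate"
    key     = "loadsnap/do/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true

    # Assumption: requires Terraform 1.10 or later. Terraform writes a
    # .tflock object next to the state file instead of using DynamoDB.
    use_lockfile = true
  }
}
```

Either mechanism gives the same guarantee: a second apply waits or fails while the first one holds the lock.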

Variables, Outputs, and the `.tfvars` Pattern

Hardcoding values is fine for a toy. For a real repo you parameterize.

Four ways to set a variable

Method	Use For
`variable` block default	Sensible default that rarely changes
`terraform.tfvars` file	Per-environment overrides (committed if no secrets)
`TF_VAR_<name>` env var	Secrets, CI runs, anything you don’t want on disk
`-var "name=value"` flag	One-off overrides
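As a sketch of the file-based option, here is a declaration and a matching per-environment file. The variable names, the file name prod.tfvars, and the values are all assumptions for illustration.

```hcl
# variables.tf (hypothetical declarations)
variable "cluster_name" {
  type    = string
  default = "loadsnap-dev" # lowest priority: used only if nothing else sets it
}

variable "node_count" {
  type    = number
  default = 1
}
```

```hcl
# prod.tfvars (hypothetical file; no secrets, so safe to commit)
cluster_name = "loadsnap-prod"
node_count   = 4
```

Run it with terraform plan -var-file=prod.tfvars. When the same variable is set in several places, later sources win in this order: the default, TF_VAR_ environment variables, terraform.tfvars, any *.auto.tfvars files, then -var and -var-file flags.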

Example from `platform/cloud/atlas/`

# variables.tf
variable "atlas_public_key"  {}
variable "atlas_private_key" {}
variable "atlas_project_id"  {}
variable "cluster_name" {
  default = "wedophobia-cluster"
}

# providers.tf
provider "mongodbatlas" {
  public_key  = var.atlas_public_key
  private_key = var.atlas_private_key
}

# cluster.tf
resource "mongodbatlas_cluster" "cluster_test" {
  project_id                  = var.atlas_project_id
  name                        = var.cluster_name
  provider_name               = "TENANT"
  backing_provider_name       = "AWS"
  provider_region_name        = "US_EAST_1"
  provider_instance_size_name = "M0"
  cluster_type                = "REPLICASET"
}

The Atlas keys come from environment variables in CI (TF_VAR_atlas_public_key, etc.) — they’re never committed.

Outputs that other tools consume

# output.tf
output "connection_string_srv" {
  value = mongodbatlas_cluster.cluster_test.connection_strings[0].standard_srv
}

After apply, the MongoDB connection string is printed and saved into state. CI can pull it with terraform output -raw connection_string_srv and feed it into the K8s secret that the API uses.

Multi-Environment Patterns

You’ll need dev, staging, and prod environments. Three common approaches:

Pattern A: Directory per environment (what loadsnap uses for workloads)

workload/loadsnap-back-end/
├── dev/
│   ├── deployment.yaml
│   └── secrets.yaml
└── prod/
    ├── deployment.yaml
    └── secrets.yaml

Each environment has its own folder, its own state, its own everything. Pros: dead simple, no chance of cross-contamination. Cons: code duplication.

Pattern B: Workspaces (built into Terraform)

terraform workspace new dev
terraform workspace new prod

terraform workspace select prod
terraform apply

Same code, different state per workspace. Pros: no duplication. Cons: easy to forget which workspace you’re in and apply prod changes thinking you’re in dev. Use ${terraform.workspace} in resource names to keep them separate.
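A minimal sketch of that advice, assuming the cluster from k8s.tf is adapted for workspaces. The per-workspace node counts are illustrative values, not the repo’s.

```hcl
locals {
  # Assumption: per-workspace sizes are illustrative, not the repo's values.
  node_counts = {
    dev  = 1
    prod = 4
  }

  # Fall back to 1 node for any workspace not listed above (e.g. "default").
  node_count = lookup(local.node_counts, terraform.workspace, 1)
}

resource "digitalocean_kubernetes_cluster" "k8s_cluster" {
  name    = "loadsnap-${terraform.workspace}"
  region  = "tor1"
  version = "1.32.10-do.2"

  node_pool {
    name       = "loadsnap-${terraform.workspace}"
    size       = "s-2vcpu-4gb"
    node_count = local.node_count
  }
}
```

The workspace name ends up in every resource name, so a plan run in the wrong workspace is at least obvious when you read it.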

Pattern C: Modules + per-env wrappers

modules/
└── do-cluster/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf

environments/
├── dev/main.tf       # calls module with dev inputs
└── prod/main.tf      # calls module with prod inputs

This is the gold standard for serious infra. The shared module is a black box; the per-env wrappers just pass different values. Pros: zero duplication, clear separation. Cons: more upfront design work.
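Here is a sketch of what the two halves might contain. The module’s inputs and outputs are assumptions; only the directory layout comes from the tree above. The module declares the provider source itself because the DigitalOcean provider is not in Terraform’s default namespace.

```hcl
# modules/do-cluster/main.tf (hypothetical module; names are assumptions)
terraform {
  required_providers {
    digitalocean = {
      source = "digitalocean/digitalocean"
    }
  }
}

variable "name" { type = string }
variable "node_count" { type = number }

resource "digitalocean_kubernetes_cluster" "this" {
  name    = var.name
  region  = "tor1"
  version = "1.32.10-do.2"

  node_pool {
    name       = var.name
    size       = "s-2vcpu-4gb"
    node_count = var.node_count
  }
}

output "endpoint" {
  value = digitalocean_kubernetes_cluster.this.endpoint
}
```

```hcl
# environments/prod/main.tf (hypothetical wrapper)
module "cluster" {
  source     = "../../modules/do-cluster"
  name       = "loadsnap-prod"
  node_count = 4
}
```

The dev wrapper would be the same few lines with different values. Each wrapper also carries its own provider configuration and backend, which is what keeps the environments’ state separate.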

For a small repo, Pattern A is fine. The moment you have 3+ environments or 5+ resources, jump to Pattern C.

Terraform Best Practices (the short list)

Pin provider versions. version = "~> 2.0" — not version = ">= 2.0".
Pin module versions. Same logic — supply chain risk if you don’t.
Always run plan before apply. Read the plan, even if it’s long.
Use remote state with locking the moment a second person joins.
Never put secrets in .tf files. Variables + env vars + sensitive flag.
terraform fmt and terraform validate in CI. Costs nothing, catches bugs.
Tag everything. tags = { managed_by = "terraform", env = "prod" } so a human can find unmanaged sprawl later.
Don’t import resources just to satisfy state. Either Terraform owns it fully or it doesn’t.
Small repos > one giant repo. Cluster, registry, DNS — split if blast radius matters.
Comment the why, never the what. The HCL is self-documenting; the rationale isn’t.
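One provider-specific wrinkle on the tagging advice: the key/value map shown above is the AWS-style form. On DigitalOcean, tags are plain strings, so the same idea looks like the sketch below. The tag values are assumptions.

```hcl
resource "digitalocean_kubernetes_cluster" "k8s_cluster" {
  name    = "loadsnap-cluster"
  region  = "tor1"
  version = "1.32.10-do.2"

  # Assumption: tag values are illustrative. DigitalOcean tags are plain
  # strings, not key/value pairs.
  tags = ["managed-by-terraform", "env-prod"]

  node_pool {
    name       = "loadsnap"
    size       = "s-2vcpu-4gb"
    node_count = 4
  }
}
```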

IaC Testing

You wouldn’t ship app code without tests. Same applies to infra — though “tests” mean different things here.

Tool	What It Catches
`terraform fmt -check`	Formatting drift (run in CI)
`terraform validate`	Syntax errors and undeclared references
`terraform plan`	The actual diff — your manual review is a test
tflint	Lint rules: invalid instance types, deprecated syntax, unused declarations, naming conventions
checkov	Security misconfigurations — public S3 buckets, world-open security groups, etc.
tfsec	Same space as checkov, different ruleset; now maintained as part of Trivy (use whichever your team prefers)
Terratest	Real integration tests — spins up infra, asserts behavior, tears it down

A reasonable CI pipeline for Terraform looks like:

- terraform fmt -check
- terraform init -backend=false    # enough for validate and lint; no state access
- terraform validate
- tflint
- checkov -d .
- terraform init                   # plan needs the real backend and state
- terraform plan -out=tfplan       # save plan as artifact
# manual approval step
- terraform apply tfplan           # apply the *exact* plan that was reviewed

The -out=tfplan + apply tfplan combo is critical: it guarantees you apply the plan that was reviewed, not a fresh one computed at apply time (which might have drifted).
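Terraform also ships its own test runner, which sits between validate and Terratest: assertions written in HCL, run with terraform test. The sketch below is a hypothetical test file for the platform/cloud/do/ configuration. It assumes Terraform 1.7 or later (for mock_provider), and the minimum of 3 nodes is an arbitrary threshold chosen for illustration.

```hcl
# tests/cluster.tftest.hcl (hypothetical file; needs Terraform 1.7+)

# Replace the real provider with a mock, so the test needs no token
# and creates nothing.
mock_provider "digitalocean" {}

variables {
  digitalocean_token = "not-a-real-token"
}

run "cluster_is_pinned_and_sized" {
  command = plan

  assert {
    condition     = digitalocean_kubernetes_cluster.k8s_cluster.version == "1.32.10-do.2"
    error_message = "The Kubernetes version must stay pinned."
  }

  assert {
    # Assumption: 3 is an arbitrary minimum chosen for this example.
    condition     = digitalocean_kubernetes_cluster.k8s_cluster.node_pool[0].node_count >= 3
    error_message = "The production node pool needs at least 3 nodes."
  }
}
```

Because the run uses command = plan against a mocked provider, it only checks values that are known from the configuration. That makes it cheap enough to run on every PR, alongside validate and tflint.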

Ansible — Quick Recap

Where Terraform creates the box, Ansible configures what’s on it. It’s agentless (uses SSH) and idempotent (run it 100 times, same result).

The three core concepts

	What
Inventory	List of hosts to manage — `web1.example.com`, `db1.example.com`
Playbook	YAML file describing what to do
Roles	Reusable bundles of playbook logic

A minimal playbook

- hosts: webservers
  become: true
  tasks:
    - name: Install nginx
      ansible.builtin.apt:
        name: nginx
        state: present
        update_cache: true    # refresh the package index first

    - name: Make sure nginx is running
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

Run with:

ansible-playbook -i inventory.ini site.yml

When to use Ansible (and when not to)

Use Ansible for	Don’t use Ansible for
Bootstrapping VMs (install packages, configure users)	Managing K8s resources (use kubectl/Helm/ArgoCD)
One-off operational tasks across many hosts	Anything Terraform already manages
Configuring legacy systems where you can’t run agents	App-level configuration (use ConfigMaps/Secrets)

For a Kubernetes-native stack like ours, Ansible is mostly not needed — the cluster’s declarative model already does what Ansible would do.

GitOps — A Two-Minute Intro

GitOps is the pattern of using git as the single source of truth for not just your code but also your deployed state. The principle:

If it’s in git, it’s deployed. If it’s not in git, it’s not deployed.

Two flavors

	Push-based (CI deploys)	Pull-based (agent reconciles)
How	CI pipeline runs `kubectl apply`	An in-cluster agent watches git, applies changes
Tools	GitHub Actions, GitLab CI, Jenkins	ArgoCD, Flux
Pros	Familiar, works with any cluster	Self-healing, no CI cluster credentials needed
Cons	Drift if someone `kubectl apply`s manually	More moving parts

The reconciliation loop (pull-based)

┌─────────┐                ┌──────────┐                ┌──────────┐
│   Git   │ ◀── watches ──│  ArgoCD  │ ── applies ──▶│ Cluster  │
└─────────┘                └──────────┘                └──────────┘
     ▲                          │                           │
     └────────── if drifts, re-applies from git ◀───────────┘

If someone manually edits a Deployment in the cluster, ArgoCD notices the drift and, with automated sync and self-heal enabled, reverts it within seconds (by default it only flags the app as OutOfSync). Git is always what’s running.

Where loadsnap is today: push-based — kubectl apply is run from a developer laptop after sops -d. The roadmap (see docs/secrets-management.md) is to move to ArgoCD with sealed-secrets, which would make this fully pull-based.

Try This on Your Own

These exercises use the actual loadsnap/infra repo. Don’t run them against the real cluster — clone the repo locally and work from there. Most steps are read-only to keep you safe.

Exercise 1 — Read the existing Terraform (15 min)

cd loadsnap/infra/platform/cloud/do
Open every .tf file. Identify each resource, variable, and output.
Run terraform init (it’ll download the provider).
Run terraform plan. Because digitalocean_token has no default, Terraform prompts you for it; enter a placeholder and the DigitalOcean API rejects it with an auth error. Good. That confirms credentials aren’t hardcoded.
Look at terraform.tfstate (it’s checked in). Find one resource ID. Find one secret. Now you understand why state shouldn’t be in git.

Goal: confidently navigate a real Terraform codebase.

Exercise 2 — Add a new variable + output (20 min)

In a fork or scratch copy:

Add a variable "node_count" to variables.tf with a default of 4.
Use it in k8s.tf: node_count = var.node_count.
Add an output "cluster_endpoint" exposing digitalocean_kubernetes_cluster.k8s_cluster.endpoint.
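Run terraform plan and confirm it shows no resource changes, only the new cluster_endpoint output (the node count matches what’s deployed). This step needs a valid token; without one, try terraform plan -refresh=false with a placeholder token, which should plan against the checked-in state rather than the live API.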
Try terraform plan -var "node_count=5" — confirm it now shows a node pool change.

Goal: understand how variables flow from CLI → variable block → resource.

Exercise 3 — Refactor to remote state (planning only, no apply) (20 min)

Sketch a backend "s3" block for DigitalOcean Spaces in a new backend.tf.
Identify what new resources you’d need (the Space, an access key, an IAM equivalent).
Write the migration plan: how do you move local state → remote state without losing it? (terraform init -migrate-state is the magic incantation.)
List the rollback steps if it goes wrong.

Goal: learn to plan a stateful migration, not just write more HCL.

Where this leaves us

You should now be able to:

Read a Terraform repo and explain what each block does
Walk through init → plan → apply confidently
Explain why state in git is a problem and what remote state fixes
Add a variable, output, or new resource without breaking what’s there
Pick the right multi-environment pattern for the size of your repo
Spot lint and security issues with tflint / checkov before they reach prod
Sketch a GitOps roadmap for moving from “kubectl apply from laptop” to ArgoCD

Next session takes the artifacts you’d deploy with this infrastructure (Docker images) and goes deep on building them right.