Session 1: What is DevOps & Why It Matters
The Old World: Dev vs Ops
In traditional software companies, Development and Operations were completely separate teams with conflicting goals:
| Development | Operations |
|---|---|
| Ship features fast | Keep systems stable |
| Move fast, break things | Don’t touch what’s working |
| Measured by feature delivery | Measured by uptime |
The result? A wall between teams. Developers wrote code and threw it over to Ops. Ops had never seen the code. Deployments broke. Blame games followed. Customers waited.
What is DevOps?
DevOps is a culture and set of practices that brings Development and Operations together. It is NOT just a tool or a job title.
The goal: Deliver software faster AND more reliably - not one or the other, both.
The core idea: The team that builds the software is also responsible for running it in production. You build it, you run it.
The DevOps Lifecycle
Plan → Code → Build → Test → Release → Deploy → Operate → Monitor → (back to Plan)
This is a continuous loop, not a one-way street.
The CALMS Framework
CALMS captures what DevOps really means:
Culture
- Shared responsibility between Dev and Ops
- No more “that’s not my job”
- Blameless environment - when things break, we learn together
Automation
- If you do something manually more than twice, automate it
- Deployments, testing, infrastructure setup - all automated
- Reduces human error, increases speed
Lean
- Work in small batches
- Instead of releasing once a quarter with 500 changes, release daily with 5 changes
- Small changes are easier to debug when something breaks
Measurement
- You can’t improve what you don’t measure
- Track: deployment frequency, lead time, failure rate, recovery time
- Data-driven decisions, not gut feelings
Sharing
- No knowledge silos
- Documentation, shared runbooks
- When something breaks, we share the learning (postmortems)
DevOps vs SRE vs Platform Engineering
These three terms are related but different:
DevOps
- A cultural movement and philosophy
- Says: “Dev and Ops should work together”
- Broad principles and practices
SRE (Site Reliability Engineering)
- Invented at Google
- A specific discipline with concrete practices
- Applies software engineering to operations problems
- Key concepts: SLOs, error budgets, toil reduction
- “What happens when you ask a software engineer to design an operations function”
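To make "SLOs" and "error budgets" concrete, here is a minimal sketch of the arithmetic, assuming a simple availability SLO measured over a rolling window. The function names and the 30-day window are illustrative choices, not something prescribed by SRE practice.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After 10.8 minutes of downtime, 75% of the budget is left.
    print(round(budget_remaining(0.999, downtime_minutes=10.8), 2))
```

The idea is that while budget remains, the team can keep shipping. Once it is spent, reliability work takes priority over new features.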
Platform Engineering
- Building an internal self-service platform for developers
- Developers don’t need to understand every detail of infrastructure
- They push code, the platform handles the rest
How They Relate
| Approach | The Question It Answers |
|---|---|
| DevOps | How should Dev and Ops work together? (the culture) |
| SRE | How do we keep systems reliable? (the practice) |
| Platform Engineering | What do we build to scale this? (the product) |
They don’t compete - they complement each other.
Key DevOps Principles
1. Automate Everything
- Manual processes are slow, error-prone, and don’t scale
- Automate builds, tests, deployments, infrastructure
2. Continuous Improvement
- Always look for ways to improve
- Measure → Identify bottleneck → Fix → Repeat
3. Fail Fast, Learn Fast
- Small changes = small failures = easy to fix
- Failures are learning opportunities, not blame opportunities
4. Infrastructure as Code
- Treat infrastructure the same as application code
- Version controlled, reviewed, tested, reproducible
5. Monitoring and Feedback
- Know the state of your systems at all times
- Fast feedback loops for developers
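Principle 4 (Infrastructure as Code) is easier to grasp with a toy example. The sketch below assumes infrastructure is described as a dictionary of desired resources and compares it with what currently exists. This is loosely similar in spirit to the "plan" step in declarative tools, but the `plan` function and the resource format here are hypothetical and not any real tool's API.

```python
from typing import Dict, List


def plan(desired: Dict[str, dict], actual: Dict[str, dict]) -> List[str]:
    """Compare desired and actual resources and list the actions needed."""
    actions = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(f"create {name}")
        elif desired[name] != actual[name]:
            actions.append(f"update {name}")
    for name in sorted(actual):
        if name not in desired:
            actions.append(f"delete {name}")
    return actions


if __name__ == "__main__":
    # The desired state lives in version control; the actual state is what runs today.
    desired = {"web": {"size": "small", "count": 3}, "db": {"size": "large", "count": 1}}
    actual = {"web": {"size": "small", "count": 2}, "cache": {"size": "small", "count": 1}}
    print(plan(desired, actual))  # ['create db', 'update web', 'delete cache']
```

Because the desired state is just data in a file, it can be reviewed in a pull request and re-applied to reproduce an environment.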
The DevOps Toolchain Overview
This is what we’ll cover in upcoming sessions:
| Stage | Tools | Session |
|---|---|---|
| Version Control | Git, GitHub | Session 2 |
| CI/CD | Jenkins, GitHub Actions | Session 3 |
| Infrastructure as Code | Terraform, Ansible | Session 4 |
| Containers | Docker | Session 5 |
| Orchestration | Kubernetes | Session 6 |
| Monitoring | Prometheus, Grafana | Phase 2 |
Real-World Example: How DevOps Changes Everything
Without DevOps
- Developer writes code for 3 weeks
- Sends it to the QA team - they test for 1 week and find 20 bugs
- Developer fixes bugs for another week
- Sends it to the Ops team for deployment
- Ops deploys on a Saturday night (maintenance window)
- Deployment fails - wrong config, missing dependency
- Rollback. Start over.
- Total time: 6+ weeks for one release
With DevOps
- Developer writes a small change (few hours of work)
- Pushes code → automated tests run immediately
- Tests pass → automatically deployed to staging
- Quick review → deployed to production
- Monitoring confirms everything is healthy
- Total time: same day
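The "With DevOps" flow above can be sketched as a list of stages that run in order and stop at the first failure. This is a simplified illustration, assuming each stage is a function that returns True on success. The stage names and `run_pipeline` are hypothetical, and real CI/CD tools express this in their own configuration formats.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order; stop at the first one that fails."""
    log = []
    for name, step in stages:
        ok = step()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            return False, log
    return True, log


if __name__ == "__main__":
    # Each lambda stands in for real work (running tests, deploying, checking health).
    stages = [
        ("automated-tests", lambda: True),
        ("deploy-staging", lambda: True),
        ("deploy-production", lambda: True),
        ("health-check", lambda: True),
    ]
    success, log = run_pipeline(stages)
    print(success, log)
```

The key property is that a failing test stage prevents every later stage from running, so a broken change never reaches production.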
The Numbers (Industry Benchmarks)
| Metric | Traditional | DevOps (Elite) |
|---|---|---|
| Deployment frequency | Once per month | Multiple times per day |
| Lead time for changes | 1-6 months | Less than 1 hour |
| Change failure rate | 46-60% | 0-15% |
| Recovery time | 1 week - 1 month | Less than 1 hour |
Source: DORA (DevOps Research and Assessment) State of DevOps Reports. Exact thresholds vary by report year.
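To see where numbers like these come from, here is a minimal sketch that computes the four metrics from a list of deployment records. The `Deployment` record and its field names are assumptions made for illustration. In practice this data would come from your CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deploys: List[Deployment], window_days: int) -> dict:
    """Compute the four metrics over a window of window_days days."""
    if not deploys:
        raise ValueError("need at least one deployment")
    failures = [d for d in deploys if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 9, 0)
    deploys = [
        Deployment(t0, t0 + timedelta(hours=1)),
        Deployment(t0, t0 + timedelta(hours=3), failed=True,
                   restored_at=t0 + timedelta(hours=3, minutes=30)),
        Deployment(t0, t0 + timedelta(hours=2)),
    ]
    print(dora_metrics(deploys, window_days=1))
```

Medians are used here so that one unusually slow change does not dominate the result. That is a design choice of this sketch, not a requirement.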
Toil: The Enemy of Productivity
Toil is work that is:
- Manual
- Repetitive
- Automatable
- Reactive (not proactive)
- Devoid of lasting value
- Linear in scaling (it grows as the service grows)
Examples of Toil
| Toil | Automated Solution |
|---|---|
| Manually restarting crashed services | Auto-restart with health checks |
| Resizing disks when full | Auto-scaling storage |
| Rotating passwords every quarter | Automated secret rotation |
| Manually scaling servers before sales events | Auto-scaling policies |
| Manually running database backups every night | Managed automated backups |
| Checking if services are healthy | Automated monitoring + alerting |
The rule (from Google's SRE practice): keep toil below 50% of your time. If more than half your time goes to toil, there is no room left for the engineering work that eliminates future toil.
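As an illustration of the first row in the table (auto-restart with health checks), here is a minimal sketch of a supervisor. It assumes the health check and the restart action are passed in as plain functions. `supervise`, `max_restarts`, and the fake service in the usage example are hypothetical names. In production this job is usually done by a process manager or orchestrator, not hand-written code.

```python
from typing import Callable


def supervise(is_healthy: Callable[[], bool],
              restart: Callable[[], None],
              max_restarts: int = 3) -> int:
    """Restart the service until it is healthy; return the number of restarts used.

    Raises RuntimeError if the service is still unhealthy after max_restarts,
    which is the point where a human should be alerted.
    """
    restarts = 0
    while not is_healthy():
        if restarts >= max_restarts:
            raise RuntimeError("service still unhealthy; escalate to a human")
        restart()
        restarts += 1
    return restarts


if __name__ == "__main__":
    # Fake service that becomes healthy after two restarts.
    state = {"restarts_needed": 2}

    def check() -> bool:
        return state["restarts_needed"] == 0

    def do_restart() -> None:
        state["restarts_needed"] -= 1

    print(supervise(check, do_restart))  # 2
```

The restart limit matters: automation should handle the routine case and hand over to a person when the routine fix does not work.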
Platform Engineering: The Golden Path
Instead of every team figuring out deployments on their own:
Without a Platform:
- Team A uses shell scripts to deploy
- Team B uses Terraform
- Team C clicks around in the AWS console
- Team D has no monitoring
- Total chaos
With a Platform:
```yaml
# Developer just writes this:
name: my-service
language: python
port: 8000
database: postgres
replicas: 3
```
The platform handles everything: CI/CD, containers, networking, monitoring, security - all built-in with company standards.
A good platform makes the right thing the easy thing.
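One way to picture what "the platform handles everything" means: the platform takes the developer's short spec, validates it, and fills in company standards. The sketch below is purely illustrative. `PLATFORM_DEFAULTS`, `REQUIRED_FIELDS`, and `expand_spec` are hypothetical, and a real platform would go on to generate pipelines, manifests, and dashboards from the result.

```python
from typing import Any, Dict

# Hypothetical company standards applied to every service.
PLATFORM_DEFAULTS: Dict[str, Any] = {"replicas": 2, "monitoring": True, "tls": True}
# Hypothetical minimum a developer must provide.
REQUIRED_FIELDS = ("name", "language", "port")


def expand_spec(spec: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a developer's spec and merge it over the platform defaults."""
    missing = [field for field in REQUIRED_FIELDS if field not in spec]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    # Values in the developer's spec override defaults; everything else is inherited.
    return {**PLATFORM_DEFAULTS, **spec}


if __name__ == "__main__":
    spec = {
        "name": "my-service",
        "language": "python",
        "port": 8000,
        "database": "postgres",
        "replicas": 3,
    }
    full = expand_spec(spec)
    print(full["replicas"], full["monitoring"], full["tls"])  # 3 True True
```

The developer never asked for monitoring or TLS, yet the service gets both. That is the golden path in miniature.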
Session 1 Key Takeaways
- DevOps is a culture first, tools second
- The CALMS framework: Culture, Automation, Lean, Measurement, Sharing
- DevOps, SRE, and Platform Engineering complement each other
- Small, frequent changes are safer than big, infrequent ones
- Automate repetitive work (toil) to focus on engineering
- You build it, you run it
Discussion Questions
Think about these before next session:
- What does the current dev-to-production flow look like at your workplace?
- Where are the biggest bottlenecks?
- What manual tasks do you do repeatedly that could be automated?
- How long does it take for a code change to reach production?
Next Session: Git in Practice - branching strategies, pull request workflows, and why Git is the foundation for everything in DevOps.