DevOps Handbook
A working handbook for the DevOps practices that keep modern backend systems shipping safely — CI/CD, GitOps, infrastructure as code, observability, monitoring, logging, incident response, and deployment strategies.
Quick Reference
- ›CI/CD — automated build, test, deploy on every commit
- ›GitOps — Git is the source of truth; agents reconcile the cluster
- ›IaC — Terraform/Pulumi/Crossplane define infra declaratively
- ›Observability — metrics + logs + traces, correlated by id
- ›Monitoring — SLO-driven alerts, not noisy thresholds
- ›Incident response — runbooks, on-call rotations, blameless postmortems
- ›Deployment — rolling, blue/green, canary, feature flags
Learning Path
Recommended order
- 1. Beginner
- 2. Intermediate
- 3. Advanced
Prerequisites
- •Git fluency
- •Docker + Linux basics
- •At least one app in production
Skills you will learn
- ✓Designing CI/CD pipelines
- ✓Operating GitOps with ArgoCD
- ✓Provisioning infra with Terraform
- ✓Running incidents calmly
Estimated time
Months of practice; the handbook stays open for years.
Architecture Overview
CI/CD Deployment Pipeline
CI/CD
Automated build, test, deploy on every commit.
Common tools: GitHub Actions, GitLab CI, CircleCI, Jenkins. A typical pipeline runs tests, builds images, pushes them to a registry, deploys to staging, and gates production on approval.
Pros
- +Catches regressions early
- +Repeatable deploys
Cons
- –Pipeline maintenance overhead
Best for: Every team.
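The stage ordering described in this card can be sketched in a few lines of Python. This is a minimal illustration, not how any particular CI system works internally: the `Stage` type, the `run_pipeline` function, and the stage names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    """One pipeline step; `run` returns True on success (hypothetical type)."""
    name: str
    run: Callable[[], bool]


def run_pipeline(stages: List[Stage], prod_approved: bool) -> List[str]:
    """Run stages in order, stop at the first failure, gate prod on approval."""
    log: List[str] = []
    for stage in stages:
        if not stage.run():
            log.append(f"{stage.name}: failed")
            return log  # later stages never run after a failure
        log.append(f"{stage.name}: ok")
    # Production is only reached when every earlier stage passed.
    if prod_approved:
        log.append("deploy-prod: ok")
    else:
        log.append("deploy-prod: waiting for approval")
    return log
```

The point of the sketch is the fail-fast ordering: a failed test stage means no image is built, and a green pipeline still waits for a human before production.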
GitOps
Git is the source of truth for cluster state.
ArgoCD or Flux watch a Git repo of manifests and reconcile the cluster. Rollback = git revert.
Pros
- +Auditable
- +Easy rollback
- +Self-healing
Cons
- –Requires Git discipline
Best for: Production Kubernetes.
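The reconcile idea behind this card can be illustrated with a toy diff between desired state (from Git) and live state (from the cluster). This is a simplified sketch, not ArgoCD's or Flux's actual algorithm; the `reconcile` function and the name-to-version mapping are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: resource name -> version or spec hash.
Manifest = Dict[str, str]


def reconcile(desired: Manifest, live: Manifest) -> List[Tuple[str, str]]:
    """Return the actions needed to make `live` match `desired`."""
    actions: List[Tuple[str, str]] = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    for name in sorted(live):
        if name not in desired:
            # Present in the cluster but not in Git: remove it.
            actions.append(("prune", name))
    return actions
```

Under this model a rollback is just a change to `desired`: reverting the commit restores the old mapping, and the next reconcile pass emits the matching `update` actions.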
Infrastructure as Code
Declarative infra, reviewable in PRs.
Terraform, Pulumi, Crossplane. Define VPCs, databases, clusters in code. Plan in CI, apply on merge.
Pros
- +Reproducible environments
- +Drift detection
Cons
- –State management complexity
Best for: All production infra.
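"Plan in CI, apply on merge" comes down to choosing a Terraform command based on what triggered the pipeline. A minimal sketch, assuming the CI system exposes an event name and a branch name; the `terraform_command` helper, the event strings, and the `main` branch name are hypothetical, while the Terraform flags shown are standard CLI options.

```python
from typing import List


def terraform_command(event: str, branch: str) -> List[str]:
    """Pick the Terraform command for a CI trigger (event names are assumed)."""
    if event == "pull_request":
        # Show reviewers what would change, without touching real infra.
        return ["terraform", "plan", "-input=false"]
    if event == "push" and branch == "main":
        # The change was reviewed and merged, so apply it non-interactively.
        return ["terraform", "apply", "-input=false", "-auto-approve"]
    return []  # any other trigger does nothing
```

A real pipeline would typically save the reviewed plan and apply that exact plan file, so that what gets applied is what was approved.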
Observability
Metrics, logs, traces — correlated.
Use OpenTelemetry for instrumentation, Prometheus + Grafana for metrics, Loki for logs, and Tempo or Jaeger for traces. Propagate a shared trace ID so the three signals can be joined.
Pros
- +Faster MTTR
- +Latency attribution
Cons
- –Storage and cardinality costs
Best for: Any distributed system.
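Correlation is what makes the three signals useful together. The sketch below groups already-collected events by `trace_id` and finds the slowest span in a trace; the event dictionaries and their field names (`kind`, `name`, `duration_ms`) are a hypothetical shape, not the OpenTelemetry data model.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional


def group_by_trace(events: Iterable[dict]) -> Dict[str, List[dict]]:
    """Bucket log lines and spans by the trace_id they share."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for event in events:
        grouped[event["trace_id"]].append(event)
    return dict(grouped)


def slowest_span(events: List[dict]) -> Optional[str]:
    """Name of the longest span among the events, or None if there are none."""
    spans = [e for e in events if e.get("kind") == "span"]
    if not spans:
        return None
    return max(spans, key=lambda s: s["duration_ms"])["name"]
```

This is the manual version of what a tracing backend does for latency attribution: once every signal carries the same ID, finding the slow hop is a lookup rather than a search.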
Monitoring & Alerting
SLO-driven alerts that don't burn out humans.
Define SLOs (e.g., 99.9% of requests served in under 300 ms). Alert only when the error budget is burning fast enough to run out before the SLO window ends. Use Alertmanager → Slack/PagerDuty.
Pros
- +Fewer false alarms
- +Aligned with user experience
Cons
- –Requires baselining
Best for: Any production system.
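Burn rate is the observed error rate divided by the error rate the SLO allows. A minimal sketch, assuming a "bad" request is one that failed or missed the latency target; the function names are hypothetical, and the 14.4 default is a commonly cited paging threshold for a one-hour window (it consumes 2% of a 30-day budget in one hour), used here as an assumption rather than a recommendation from this handbook.

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (bad / total) / error_budget


def should_page(bad: int, total: int, slo_target: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Page a human only when the budget is burning at or above `threshold`."""
    return burn_rate(bad, total, slo_target) >= threshold
```

A burn rate of 1 means the budget lasts exactly the SLO window. With a 99.9% target, 50 bad requests out of 10,000 is a burn rate of about 5, which is worth a ticket but stays below the paging threshold assumed here.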
Logging
Structured, indexed, sampled.
Emit JSON logs, ship via Fluent Bit / Vector to Loki or Elasticsearch. Always include request_id, trace_id, user_id (hashed).
Pros
- +Searchable
- +Joinable with traces
Cons
- –Cost scales with volume
Best for: All services.
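A structured log line with the three IDs named in this card can be produced with the standard library alone. This is a sketch under assumptions: the `log_line` helper and its field layout are hypothetical, and the user ID is hashed with a keyed HMAC, since a plain unsalted hash of a guessable ID is easy to reverse.

```python
import hashlib
import hmac
import json


def hash_user_id(user_id: str, secret: bytes) -> str:
    """Keyed hash so the raw user ID never reaches the log store."""
    return hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def log_line(message: str, request_id: str, trace_id: str, user_id: str,
             secret: bytes, level: str = "info") -> str:
    """Render one JSON log record as a single line of text."""
    record = {
        "level": level,
        "message": message,
        "request_id": request_id,
        "trace_id": trace_id,
        "user_id": hash_user_id(user_id, secret),
    }
    return json.dumps(record, sort_keys=True)
```

Because the hash is deterministic for a given secret, all of one user's requests can still be found with a single query, without storing who that user is.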
Incident Response
Runbooks, on-call, blameless postmortems.
Define severity levels; rotate on-call; write runbooks for known failure modes; run blameless postmortems within 5 business days.
Pros
- +Faster recovery
- +Org learning loop
Cons
- –Cultural investment required
Best for: Any team running a 24/7 service.
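The "within 5 business days" deadline is easy to get wrong around weekends, so it helps to compute it. A minimal sketch that skips Saturdays and Sundays only; the `postmortem_due` name is hypothetical, and public holidays are deliberately ignored.

```python
from datetime import date, timedelta


def postmortem_due(resolved: date, business_days: int = 5) -> date:
    """Date that falls `business_days` working days after the incident is resolved."""
    current = resolved
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current
```

For example, an incident resolved on Friday 2024-01-05 would have its postmortem due on Friday 2024-01-12 under these assumptions.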
Deployment Strategies
Rolling, blue/green, canary, feature flags.
Rolling = default; blue/green = instant cutover with rollback; canary = ship to 1% of traffic, then 10%, then 100%; feature flags = decouple deploy from release.
Pros
- +Lower blast radius
- +Faster rollback
Cons
- –More moving parts
Best for: Any service with paying users.
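Canary releases and percentage-based feature flags both rely on assigning each user to a stable bucket. The sketch below hashes a user key into one of 100 buckets and steps through the 1%, 10%, 100% stages; the function names and the rule "roll back to 0% when unhealthy" are assumptions for illustration.

```python
import hashlib

STAGES = [1, 10, 100]  # rollout percentages, in order


def bucket(key: str) -> int:
    """Map a user or request key to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(key: str, percent: int) -> bool:
    """True if this key should see the new version at the given percentage."""
    return bucket(key) < percent


def next_percent(current: int, healthy: bool) -> int:
    """Advance to the next stage when healthy; drop to 0 (roll back) otherwise."""
    if not healthy:
        return 0
    index = STAGES.index(current)
    return STAGES[min(index + 1, len(STAGES) - 1)]
```

Because the bucket depends only on the key, a user who sees the new version at 1% keeps seeing it at 10% and 100%, which avoids flapping between versions during a rollout.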
Deployment strategy selection
| Strategy | Risk | Complexity | Best For |
|---|---|---|---|
| Rolling | Low | Low | Most services |
| Blue/Green | Low | Medium | Stateful or DB-coupled services |
| Canary | Lowest | High | High-traffic critical paths |
| Feature flags | Lowest | Medium | Product experimentation |
Common Mistakes
- !Treating monitoring as 'set up Grafana and forget'.
- !Alerting on CPU thresholds instead of user-impacting SLOs.
- !Letting CI pipelines balloon past 15 minutes — feedback loops die.
- !Skipping postmortems because 'we already fixed it'.
Production Tips
- ★Define one SLO per user-facing endpoint; alert on error-budget burn rate.
- ★Bake security scans (Trivy, gitleaks) into CI — never optional.
- ★Use ephemeral preview environments per PR (Vercel-style for backends).
- ★Practice failover quarterly — untested DR is no DR.
Frequently Asked Questions
DevOps vs Platform Engineering?
DevOps is the culture; platform engineering is the team that builds the internal developer platform supporting it.
Do I need GitOps for a small project?
Not strictly — but you'll outgrow `kubectl apply` quickly. Even a tiny ArgoCD instance pays off.
Best first observability stack?
Prometheus + Grafana + Loki + Tempo, wired via OpenTelemetry. All open source, all production-ready.
