DevOps Handbook

A working handbook for the DevOps practices that keep modern backend systems shipping safely — CI/CD, GitOps, infrastructure as code, observability, monitoring, logging, incident response, and deployment strategies.

Quick Reference

›CI/CD — automated build, test, deploy on every commit
›GitOps — Git is the source of truth; agents reconcile the cluster
›IaC — Terraform/Pulumi/Crossplane define infra declaratively
›Observability — metrics + logs + traces, correlated by id
›Monitoring — SLO-driven alerts, not noisy thresholds
›Incident response — runbooks, on-call rotations, blameless postmortems
›Deployment — rolling, blue/green, canary, feature flags

Learning Path

Recommended order

1.Beginner
2.Intermediate
3.Advanced

Prerequisites

•Git fluency
•Docker + Linux basics
•At least one app in production

Skills you will learn

✓Designing CI/CD pipelines
✓Operating GitOps with ArgoCD
✓Provisioning infra with Terraform
✓Running incidents calmly

Estimated time

Months of practice; the handbook stays open for years.

Architecture Overview

CI/CD Deployment Pipeline

Each push to the repository triggers an automated build and test run. The resulting artifact is published to a registry, deployed to staging, and then promoted to production.

CI/CD

Automated build, test, deploy on every commit.

Recommended

Common choices are GitHub Actions, GitLab CI, CircleCI, and Jenkins. A typical pipeline runs tests, builds images, pushes them to a registry, deploys to staging, and gates the production deploy on approval.
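The stage order and approval gate described here can be sketched in a few lines of Python. This is only an illustration of the control flow, not how any particular CI system is configured; `Stage`, `run_pipeline`, and `example_stages` are hypothetical names, and the stage bodies are stand-ins that always succeed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]       # returns True on success
    needs_approval: bool = False  # gate: stage waits for an explicit approval


def run_pipeline(stages: List[Stage], approved: bool) -> List[str]:
    """Run stages in order; stop at the first failure or unapproved gate."""
    log = []
    for stage in stages:
        if stage.needs_approval and not approved:
            log.append(f"{stage.name}: blocked (awaiting approval)")
            break
        ok = stage.run()
        log.append(f"{stage.name}: {'ok' if ok else 'failed'}")
        if not ok:
            break
    return log


def example_stages() -> List[Stage]:
    # Stand-in steps (assumption); a real pipeline would invoke test
    # runners, image builds, and deploy tooling here.
    return [
        Stage("test", lambda: True),
        Stage("build-image", lambda: True),
        Stage("push-registry", lambda: True),
        Stage("deploy-staging", lambda: True),
        Stage("deploy-prod", lambda: True, needs_approval=True),
    ]
```

Calling `run_pipeline(example_stages(), approved=False)` runs everything through staging and then stops at the production gate, which is the behavior the approval step is meant to give you.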

Pros

+Catches regressions early
+Repeatable deploys

Cons

–Pipeline maintenance overhead

Best for: Every team.

GitOps

Git is the source of truth for cluster state.

ArgoCD or Flux watches a Git repo of manifests and reconciles the cluster to match it. Rolling back is a `git revert`.
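The reconcile step can be pictured as a diff between desired state (what Git says) and actual state (what the cluster runs). The sketch below is a simplified model, not ArgoCD's or Flux's implementation; `Manifest`, `diff_state`, and `reconcile` are hypothetical names, and it assumes that resources missing from Git are deleted, which real tools usually make an opt-in setting.

```python
from typing import Dict, List, Tuple

Manifest = Dict[str, dict]  # resource name -> spec


def diff_state(desired: Manifest, actual: Manifest) -> List[Tuple[str, str]]:
    """Return the (action, name) steps that would make actual match desired."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            # Assumption: pruning is enabled, so unmanaged resources are removed.
            actions.append(("delete", name))
    return actions


def reconcile(desired: Manifest, actual: Manifest) -> Manifest:
    """Apply the diff to a copy of actual and return the converged state."""
    converged = dict(actual)
    for action, name in diff_state(desired, actual):
        if action == "delete":
            del converged[name]
        else:
            converged[name] = desired[name]
    return converged
```

Running `reconcile` repeatedly is what gives the self-healing property: a manual change to the cluster shows up as a diff on the next pass and is reverted to whatever Git declares.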

Pros

+Auditable
+Easy rollback
+Self-healing

Cons

–Requires Git discipline

Best for: Production Kubernetes.

Infrastructure as Code

Declarative infra, reviewable in PRs.

Common tools are Terraform, Pulumi, and Crossplane. Define VPCs, databases, and clusters in code. Run the plan in CI and apply on merge.
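To make "plan in CI, apply on merge" concrete, here is a toy model of a plan: it compares declared resources with recorded state and lists the changes, and a separate check decides whether to apply. It is a sketch of the idea only and does not reflect how Terraform, Pulumi, or Crossplane compute plans; `plan`, `should_apply`, the event names, and the resource addresses are all illustrative assumptions.

```python
from typing import Dict, List

Resources = Dict[str, Dict[str, object]]  # address -> attributes


def plan(declared: Resources, state: Resources) -> List[str]:
    """List changes: '+' create, '-' destroy, '~' attribute update."""
    lines = []
    for addr in sorted(declared.keys() | state.keys()):
        want, have = declared.get(addr), state.get(addr)
        if have is None:
            lines.append(f"+ {addr}")
        elif want is None:
            lines.append(f"- {addr}")
        else:
            for key in sorted(want.keys() | have.keys()):
                if want.get(key) != have.get(key):
                    lines.append(
                        f"~ {addr}.{key}: {have.get(key)!r} -> {want.get(key)!r}"
                    )
    return lines


def should_apply(event: str, plan_lines: List[str]) -> bool:
    """Apply only on merge (hypothetical event name) and only if something changed."""
    return event == "merge" and bool(plan_lines)
```

A non-empty plan on a branch with no code changes is one way drift shows up: the recorded state no longer matches what is declared.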

Pros

+Reproducible environments
+Drift detection

Cons

–State management complexity

Best for: All production infra.

Observability

Metrics, logs, traces — correlated.

OpenTelemetry for instrumentation, Prometheus + Grafana for metrics, Loki for logs, Tempo/Jaeger for traces.

Pros

+Faster MTTR
+Latency attribution

Cons

–Storage and cardinality costs

Best for: Any distributed system.

Monitoring & Alerting

SLO-driven alerts that don't burn out humans.

Define SLOs (e.g., 99.9% of requests served in under 300 ms). Alert only when the error budget is burning fast enough to run out well before the SLO window ends. Route alerts through Alertmanager to Slack or PagerDuty.
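Burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below assumes a "bad" request is one that failed or took longer than the latency target. The two-window check and the default threshold of 14.4 (2% of a 30-day budget spent in one hour) follow a commonly cited convention and should be tuned for your own service. `burn_rate` and `should_page` are hypothetical names.

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (bad / total) / budget


def should_page(long_window: float, short_window: float,
                threshold: float = 14.4) -> bool:
    """Page only if both the long and short windows exceed the threshold.

    Requiring the short window too means the alert clears soon after
    the problem stops, instead of firing on stale data.
    """
    return long_window >= threshold and short_window >= threshold
```

For example, 20 bad requests out of 1,000 against a 99.9% SLO is a 2% error rate against a 0.1% allowance, which is a burn rate of about 20.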

Pros

+Fewer false alarms
+Aligned with user experience

Cons

–Requires baselining

Best for: Any production system.

Logging

Structured, indexed, sampled.

Emit JSON logs and ship them via Fluent Bit or Vector to Loki or Elasticsearch. Always include request_id, trace_id, and a hashed user_id.
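A minimal sketch of this with Python's standard `logging` module follows. The field names come from the text; `JsonFormatter`, `build_logger`, `hash_user_id`, and the salt value are hypothetical. The salted SHA-256 is a simplification, and a keyed hash with a secret managed outside the code would be a more careful choice.

```python
import hashlib
import json
import logging
import sys


def hash_user_id(user_id: str, salt: str = "example-salt") -> str:
    """One-way hash so logs can be grouped per user without the raw id."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()[:16]


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    FIELDS = ("request_id", "trace_id", "user_id")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Correlation fields arrive via `extra=`; missing ones become null.
        for field in self.FIELDS:
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Return a logger that writes one JSON line per event to stdout."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Usage looks like `build_logger().info("order created", extra={"request_id": "req-1", "trace_id": "t-1", "user_id": hash_user_id("42")})`. Writing to stdout leaves shipping to the log agent.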

Pros

+Searchable
+Joinable with traces

Cons

–Cost scales with volume

Best for: All services.

Incident Response

Runbooks, on-call, blameless postmortems.

Define severity levels; rotate on-call; write runbooks for known failure modes; run blameless postmortems within 5 business days.
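The five-business-day deadline is easy to compute automatically when an incident is closed. The helper below is a small sketch that skips weekends only; it assumes a Monday-to-Friday work week and ignores public holidays, and `add_business_days` is a hypothetical name.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (weekends skipped)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current
```

For an incident resolved on Friday 2024-01-05, `add_business_days(date(2024, 1, 5), 5)` gives Friday 2024-01-12 as the postmortem due date.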

Pros

+Faster recovery
+Org learning loop

Cons

–Cultural investment required

Best for: Any team running a 24/7 service.

Deployment Strategies

Rolling, blue/green, canary, feature flags.

Rolling is the default. Blue/green gives an instant cutover with rollback. Canary ships to 1%, then 10%, then 100%. Feature flags decouple deploy from release.
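Both canary rollouts and percentage-based feature flags depend on assigning each user to a stable bucket, so the same user gets the same answer on every request. The sketch below uses hash-based bucketing plus a simple stage progression. The 1/10/100 stages come from the text; the rollback to 0% on an unhealthy signal and the names `bucket`, `is_enabled`, and `next_stage` are assumptions for illustration.

```python
import hashlib

CANARY_STAGES = [1, 10, 100]  # rollout percentages


def bucket(user_id: str, flag: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, flag: str, rollout_percent: int) -> bool:
    """True if the user's bucket falls inside the current rollout."""
    return bucket(user_id, flag) < rollout_percent


def next_stage(current_percent: int, healthy: bool) -> int:
    """Advance to the next stage if healthy; otherwise roll back to 0."""
    if not healthy:
        return 0
    for stage in CANARY_STAGES:
        if stage > current_percent:
            return stage
    return current_percent  # already at full rollout
```

Because buckets are fixed per user and flag, anyone included at 1% stays included at 10% and 100%, so widening the rollout never flips a user back to the old behavior.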

Pros

+Lower blast radius
+Faster rollback

Cons

–More moving parts

Best for: Any service with paying users.

Deployment strategy selection

Strategy	Risk	Complexity	Best For
Rolling	Low	Low	Most services
Blue/Green	Low	Medium	Stateful or DB-coupled services
Canary	Lowest	High	High-traffic critical paths
Feature flags	Lowest	Medium	Product experimentation

Common Mistakes

!Treating monitoring as 'set up Grafana and forget'.
!Alerting on CPU thresholds instead of user-impacting SLOs.
!Letting CI pipelines balloon past 15 minutes — feedback loops die.
!Skipping postmortems because 'we already fixed it'.

Production Tips

★Define one SLO per user-facing endpoint; alert on error-budget burn rate.
★Bake security scans (Trivy, gitleaks) into CI — never optional.
★Use ephemeral preview environments per PR (Vercel-style for backends).
★Practice failover quarterly — untested DR is no DR.

Frequently Asked Questions

DevOps vs Platform Engineering?

DevOps is the culture; platform engineering is the team that builds the internal developer platform supporting it.

Do I need GitOps for a small project?

Not strictly — but you'll outgrow `kubectl apply` quickly. Even a tiny ArgoCD instance pays off.

Best first observability stack?

Prometheus + Grafana + Loki + Tempo, wired via OpenTelemetry. All open source, all production-ready.
