DevOps Handbook

A working handbook for the DevOps practices that keep modern backend systems shipping safely — CI/CD, GitOps, infrastructure as code, observability, monitoring, logging, incident response, and deployment strategies.

Quick Reference

  • CI/CD — automated build, test, deploy on every commit
  • GitOps — Git is the source of truth; agents reconcile the cluster
  • IaC — Terraform/Pulumi/Crossplane define infra declaratively
  • Observability — metrics + logs + traces, correlated by id
  • Monitoring — SLO-driven alerts, not noisy thresholds
  • Incident response — runbooks, on-call rotations, blameless postmortems
  • Deployment — rolling, blue/green, canary, feature flags

Learning Path

Recommended order

  1. Beginner
  2. Intermediate
  3. Advanced

Prerequisites

  • Git fluency
  • Docker + Linux basics
  • At least one app in production

Skills you will learn

  • Designing CI/CD pipelines
  • Operating GitOps with ArgoCD
  • Provisioning infra with Terraform
  • Running incidents calmly

Estimated time

Months of practice; the handbook stays open for years.

Architecture Overview

CI/CD Deployment Pipeline

[Diagram: pipeline stages Source, CI, Registry, Deploy, Runtime. Nodes: GitHub (main branch), CI Pipeline (Build · Test), Container Registry (Image), Staging (Kubernetes), Production (Kubernetes), Users, Monitoring (Prometheus / Grafana). Edge labels: push, image, deploy, promote.]
Code pushed to the repository triggers automated build and test. The artifact is published to a registry then promoted from staging to production.

CI/CD

Automated build, test, deploy on every commit.

Recommended

GitHub Actions, GitLab CI, CircleCI, Jenkins. Run tests, build images, push to registry, deploy to staging, gate prod on approval.
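
The stage order described above can be sketched in a few lines of Python. This is only an illustration of the control flow (run stages in order, stop on the first failure, hold production behind an approval), not how any of the listed CI systems is configured. The `Stage` type, the stage names, and the always-passing lambdas are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]        # returns True on success
    needs_approval: bool = False   # manual gate before this stage runs


def run_pipeline(stages: List[Stage], approved: bool) -> List[str]:
    """Run stages in order; stop at the first failure or unapproved gate."""
    log: List[str] = []
    for stage in stages:
        if stage.needs_approval and not approved:
            log.append(f"{stage.name}: waiting for approval")
            break
        if not stage.run():
            log.append(f"{stage.name}: failed")
            break
        log.append(f"{stage.name}: ok")
    return log


# Hypothetical stages; real ones would invoke test runners, image builds, etc.
PIPELINE = [
    Stage("test", lambda: True),
    Stage("build-image", lambda: True),
    Stage("push-registry", lambda: True),
    Stage("deploy-staging", lambda: True),
    Stage("deploy-prod", lambda: True, needs_approval=True),
]

if __name__ == "__main__":
    for line in run_pipeline(PIPELINE, approved=False):
        print(line)
```

Because a failing stage stops the run, a broken test never reaches the registry, and production only deploys once someone has approved the run.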

Pros

  • Catches regressions early
  • Repeatable deploys

Cons

  • Pipeline maintenance overhead

Best for: Every team.

GitOps

Git is the source of truth for cluster state.

ArgoCD or Flux watch a Git repo of manifests and reconcile the cluster. Rollback = git revert.
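
To make the reconcile-and-revert idea concrete, here is a toy model in Python. It is not how ArgoCD or Flux are implemented; `GitRepo`, `reconcile`, and the `web` manifest are hypothetical stand-ins, and the "cluster" is just a dictionary.

```python
import copy
from typing import Dict, List

Manifest = Dict[str, dict]  # resource name -> spec


class GitRepo:
    """Toy stand-in for a manifests repo: a linear list of committed states."""

    def __init__(self, initial: Manifest) -> None:
        self.commits: List[Manifest] = [copy.deepcopy(initial)]

    def head(self) -> Manifest:
        return self.commits[-1]

    def commit(self, manifests: Manifest) -> None:
        self.commits.append(copy.deepcopy(manifests))

    def revert_head(self) -> None:
        # Like `git revert HEAD`: a NEW commit that restores the previous state.
        # Assumes at least two commits exist.
        self.commits.append(copy.deepcopy(self.commits[-2]))


def reconcile(repo: GitRepo, cluster: Manifest) -> List[str]:
    """Make `cluster` match the repo HEAD and report what changed."""
    desired = repo.head()
    actions: List[str] = []
    for name, spec in desired.items():
        if name not in cluster:
            actions.append(f"create {name}")
        elif cluster[name] != spec:
            actions.append(f"update {name}")
        cluster[name] = copy.deepcopy(spec)
    for name in list(cluster):
        if name not in desired:
            actions.append(f"delete {name}")
            del cluster[name]
    return actions


if __name__ == "__main__":
    repo = GitRepo({"web": {"image": "web:1.0", "replicas": 2}})
    cluster: Manifest = {}
    print(reconcile(repo, cluster))   # first sync creates the resource
    cluster["web"]["replicas"] = 5    # someone edits the cluster by hand
    print(reconcile(repo, cluster))   # drift is overwritten from Git
    repo.commit({"web": {"image": "web:1.1", "replicas": 2}})
    reconcile(repo, cluster)
    repo.revert_head()                # rollback is just another commit
    print(reconcile(repo, cluster), cluster)
```

The hand edit is undone on the next reconcile, which is the self-healing property, and the rollback leaves a commit in history, which is what makes it auditable.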

Pros

  • Auditable
  • Easy rollback
  • Self-healing

Cons

  • Requires Git discipline

Best for: Production Kubernetes.

Infrastructure as Code

Declarative infra, reviewable in PRs.

Terraform, Pulumi, Crossplane. Define VPCs, databases, clusters in code. Plan in CI, apply on merge.
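
The "plan in CI, apply on merge" split can be illustrated with a small Python sketch. It does not use or imitate any Terraform, Pulumi, or Crossplane API; `plan`, `ci_step`, the event names, and the resource addresses are all hypothetical.

```python
from typing import Dict, List, Tuple

Resources = Dict[str, dict]   # resource address -> attributes
Change = Tuple[str, str]      # (action, address)


def plan(desired: Resources, actual: Resources) -> List[Change]:
    """Compute changes without touching anything, so it is safe on every PR."""
    changes: List[Change] = []
    for addr in sorted(desired.keys() | actual.keys()):
        if addr not in actual:
            changes.append(("create", addr))
        elif addr not in desired:
            changes.append(("delete", addr))
        elif desired[addr] != actual[addr]:
            changes.append(("update", addr))
    return changes


def ci_step(event: str, desired: Resources, actual: Resources) -> str:
    """On a pull request only report the plan; on merge, apply it."""
    changes = plan(desired, actual)
    if event == "pull_request":
        return f"plan: {len(changes)} change(s), nothing applied"
    if event == "merge":
        actual.clear()
        actual.update({addr: dict(attrs) for addr, attrs in desired.items()})
        return f"applied {len(changes)} change(s)"
    raise ValueError(f"unknown event: {event}")


if __name__ == "__main__":
    desired = {"vpc.main": {"cidr": "10.0.0.0/16"}, "db.app": {"size": "small"}}
    actual = {"db.app": {"size": "large"}, "bucket.old": {}}
    print(plan(desired, actual))
    print(ci_step("pull_request", desired, actual))
    print(ci_step("merge", desired, actual))
```

The same `plan` function doubles as drift detection: if the code has not changed and the plan is non-empty, something was modified outside of code.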

Pros

  • Reproducible environments
  • Drift detection

Cons

  • State management complexity

Best for: All production infra.

Observability

Metrics, logs, traces — correlated.

OpenTelemetry for instrumentation, Prometheus + Grafana for metrics, Loki for logs, Tempo/Jaeger for traces.
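
What "correlated" buys you is easiest to see with data. The sketch below groups log lines and spans by a shared `trace_id` and then picks the longest span in a trace. The record shapes and field names (`trace_id`, `duration_ms`, `name`) are assumptions for illustration, not the OpenTelemetry, Loki, or Tempo data model, and the attribution is deliberately naive: it treats spans as a flat list and ignores parent/child nesting.

```python
from collections import defaultdict
from typing import Dict, List


def correlate(logs: List[dict], spans: List[dict]) -> Dict[str, dict]:
    """Group log lines and spans that share a trace_id."""
    by_trace: Dict[str, dict] = defaultdict(lambda: {"logs": [], "spans": []})
    for entry in logs:
        by_trace[entry["trace_id"]]["logs"].append(entry)
    for span in spans:
        by_trace[span["trace_id"]]["spans"].append(span)
    return dict(by_trace)


def longest_span(trace: dict) -> dict:
    """Naive latency attribution: the span with the largest duration."""
    return max(trace["spans"], key=lambda s: s["duration_ms"])


if __name__ == "__main__":
    # Hypothetical telemetry for two requests.
    logs = [
        {"trace_id": "t1", "level": "error", "msg": "payment timeout"},
        {"trace_id": "t2", "level": "info", "msg": "ok"},
    ]
    spans = [
        {"trace_id": "t1", "name": "api", "duration_ms": 40},
        {"trace_id": "t1", "name": "payments-db", "duration_ms": 910},
        {"trace_id": "t2", "name": "api", "duration_ms": 12},
    ]
    traces = correlate(logs, spans)
    print(traces["t1"]["logs"][0]["msg"], "->", longest_span(traces["t1"])["name"])
```

Starting from an error log, the shared id leads straight to the span that consumed the time, which is where the MTTR improvement comes from.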

Pros

  • Faster MTTR
  • Latency attribution

Cons

  • Storage and cardinality costs

Best for: Any distributed system.

Monitoring & Alerting

SLO-driven alerts that don't burn out humans.

Define SLOs (e.g., 99.9% of requests complete in under 300 ms). Alert only when the error budget is burning fast enough to threaten the SLO. Route alerts through Alertmanager → Slack/PagerDuty.
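
Burn rate is simple arithmetic, so a worked sketch may help. Here a "bad" request is one that failed or missed the latency target, and the burn rate is the observed bad fraction divided by the error budget (1 minus the SLO). The two-window check and the 14.4 threshold are assumptions borrowed from commonly cited multi-window burn-rate practice for a 30-day SLO window; tune them for your own service.

```python
def burn_rate(bad: int, total: int, slo: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be between 0 and 1, e.g. 0.999")
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo   # 0.001 for a 99.9% SLO
    return (bad / total) / error_budget


def should_page(long_window: float, short_window: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening. The 14.4 default is an assumption.
    """
    return long_window >= threshold and short_window >= threshold


if __name__ == "__main__":
    # Hypothetical counts: last hour and last five minutes.
    hour = burn_rate(bad=200, total=10_000, slo=0.999)
    five_min = burn_rate(bad=30, total=1_000, slo=0.999)
    print(round(hour, 1), round(five_min, 1), should_page(hour, five_min))
```

With a 99.9% SLO, a 2% bad-request rate spends the budget about 20 times faster than allowed, which is worth waking someone up for. A CPU spike with no bad requests yields a burn rate of zero and pages nobody.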

Pros

  • Fewer false alarms
  • Aligned with user experience

Cons

  • Requires baselining

Best for: Any production system.

Logging

Structured, indexed, sampled.

Emit JSON logs, ship via Fluent Bit / Vector to Loki or Elasticsearch. Always include request_id, trace_id, user_id (hashed).
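
A minimal sketch of such a log line using only the Python standard library follows. The logger name `app`, the field layout, and `HASH_KEY` are assumptions for illustration; in particular the key would come from a secret store, and a keyed hash (HMAC) is used here because plain hashes of short user ids are easy to reverse by brute force.

```python
import hashlib
import hmac
import json
import logging
import sys

HASH_KEY = b"replace-with-a-real-secret"  # hypothetical; load from a secret store


def hash_user_id(user_id: str) -> str:
    """Keyed one-way hash so logs can be grouped per user without the raw id."""
    digest = hmac.new(HASH_KEY, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            # These arrive via the `extra=` argument of the logging call.
            "request_id": getattr(record, "request_id", None),
            "trace_id": getattr(record, "trace_id", None),
            "user_id": getattr(record, "user_id", None),
        }
        return json.dumps(payload)


def make_logger(stream=sys.stdout) -> logging.Logger:
    logger = logging.getLogger("app")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_logger()
    log.info("checkout ok", extra={
        "request_id": "req-1",
        "trace_id": "t1",
        "user_id": hash_user_id("42"),
    })
```

Writing one JSON object per line to stdout keeps the service unaware of the shipping layer; the log shipper picks the lines up from there.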

Pros

  • Searchable
  • Joinable with traces

Cons

  • Cost scales with volume

Best for: All services.

Incident Response

Runbooks, on-call, blameless postmortems.

Define severity levels; rotate on-call; write runbooks for known failure modes; run blameless postmortems within 5 business days.

Pros

  • Faster recovery
  • Org learning loop

Cons

  • Cultural investment required

Best for: Any team running a 24/7 service.

Deployment Strategies

Rolling, blue/green, canary, feature flags.

Rolling is the default: instances are replaced gradually. Blue/green gives an instant cutover with a quick rollback to the old environment. Canary ships to 1% of traffic, then 10%, then 100%. Feature flags decouple deploy from release.
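
The canary progression can be sketched as a loop that widens traffic only while a health check keeps passing. This is a toy model: `canary_rollout` and the `is_healthy` callback are hypothetical, and a real rollout would read error rates or SLO burn from monitoring and wait between steps.

```python
from typing import Callable, Iterable, List


def canary_rollout(
    is_healthy: Callable[[int], bool],
    steps: Iterable[int] = (1, 10, 100),
) -> List[str]:
    """Shift traffic step by step; roll back to 0% on the first bad check."""
    events: List[str] = []
    for percent in steps:
        events.append(f"route {percent}% to new version")
        if not is_healthy(percent):
            events.append("rollback: route 0% to new version")
            return events
    events.append("rollout complete")
    return events


if __name__ == "__main__":
    # Hypothetical check: the new version starts failing at 10% traffic.
    for event in canary_rollout(lambda percent: percent < 10):
        print(event)
```

In the failing run only a tenth of traffic ever saw the bad version, which is the lower blast radius the strategy is chosen for.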

Pros

  • Lower blast radius
  • Faster rollback

Cons

  • More moving parts

Best for: Any service with paying users.
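
For the feature-flag strategy above, "decouple deploy from release" usually means the code is already deployed and a percentage setting decides who sees it. One common way to do that is deterministic bucketing, sketched here under assumptions: `bucket`, `is_enabled`, and the flag name `new-checkout` are hypothetical and not taken from any flag service.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """A user is in the rollout if their bucket falls below the percentage."""
    return bucket(flag, user_id) < rollout_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    enabled = sum(is_enabled("new-checkout", u, 10) for u in users)
    print(f"{enabled} of {len(users)} users see the new checkout")
```

Hashing the flag name together with the user id keeps each user's experience stable across requests while giving different flags independent cohorts. Raising `rollout_percent` only adds users; nobody who already has the feature loses it.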

Deployment strategy selection

Strategy        Risk     Complexity   Best for
Rolling         Low      Low          Most services
Blue/Green      Low      Medium       Stateful or DB-coupled services
Canary          Lowest   High         High-traffic critical paths
Feature flags   Lowest   Medium       Product experimentation

Common Mistakes

  • Treating monitoring as 'set up Grafana and forget'.
  • Alerting on CPU thresholds instead of user-impacting SLOs.
  • Letting CI pipelines balloon past 15 minutes: feedback loops die.
  • Skipping postmortems because 'we already fixed it'.

Production Tips

  • Define one SLO per user-facing endpoint; alert on error-budget burn rate.
  • Bake security scans (Trivy, gitleaks) into CI — never optional.
  • Use ephemeral preview environments per PR (Vercel-style for backends).
  • Practice failover quarterly — untested DR is no DR.

Frequently Asked Questions

DevOps vs Platform Engineering?

DevOps is the culture; platform engineering is the team that builds the internal developer platform supporting it.

Do I need GitOps for a small project?

Not strictly — but you'll outgrow `kubectl apply` quickly. Even a tiny ArgoCD instance pays off.

Best first observability stack?

Prometheus + Grafana + Loki + Tempo, wired via OpenTelemetry. All open source, all production-ready.