
Building for Scale: Kubernetes and DigitalOcean for Food Delivery

kubernetes · digitalocean · scaling · food-delivery · infrastructure

You've shipped. Users are signing up. Orders are flowing through. The question isn't whether you'll need to scale—it's whether you'll scale competently or watch your platform collapse.

This is the CTO's reality for food delivery platforms. We operate in a market where latency kills revenue, downtime costs money in refunds, and vendor lock-in is a slow bleed you can't afford.

Here's how we scale without betting the company.

The Architecture Snapshot

# The stack we're building toward
- DigitalOcean Kubernetes (DOKS)
- Persistent volumes for stateful services
- Redis Cluster for session/cache layer
- PostgreSQL managed database
- Kong API Gateway for traffic shaping
- Timescale for event analytics

This isn't exotic. It's battle-tested. But every component choice is a constraint—latency, cost, operational complexity.

Why DigitalOcean, Not AWS

AWS is power. DigitalOcean is coherence.

For a food delivery platform, the difference matters.

AWS gives you 200+ services. You spend months picking the right combination. Your ops team needs deep AWS expertise. Your Terraform grows into a 5,000-line monstrosity. Your bill arrives and surprises you because you forgot to tag resources properly.

DigitalOcean gives you:

  • Kubernetes that works — DOKS is simpler than EKS. Same concepts. Less configuration noise.
  • Transparent pricing — nodes start at $12/month. You know what you're paying for.
  • No service multiplication — Fewer choices means faster decisions.
  • Community solutions — Most food delivery platforms run on similar stacks. You find reference implementations.

The cost difference? Roughly 30-40% cheaper at comparable scale. Across a 50-node cluster and its managed services, that's thousands of dollars a month you don't burn.

AWS wins when you need:

  • Global CDN at scale
  • Advanced ML services
  • Exotic compliance requirements

We don't have those constraints yet. Optimize for the platform you have, not the one you might build in 3 years.

The Kubernetes Layer

Your food delivery app needs these primitives:

Stateless Microservices

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
      - name: order-service
        image: registry.digitalocean.com/myapp/order-service:v1.2.3
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - order-service
              topologyKey: kubernetes.io/hostname

Key decisions:

  • Resources: Set requests/limits. Without them, noisy neighbors crash your platform.
  • Pod Anti-Affinity: Spread replicas across nodes. One node failure doesn't cascade.
  • Probes: Liveness tells Kubernetes when to restart. Readiness tells it when to accept traffic. Get these wrong and you have zombie pods consuming resources.
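For reference, the two probes can be backed by handlers this small (a Python sketch; the paths and port match the manifest above, and the boolean dependency flags stand in for real Postgres/Redis pings):

```python
# Probe handlers for /health (liveness) and /ready (readiness) on port 8080.

def liveness() -> tuple[int, str]:
    # Liveness answers "is the process alive?" Keep it dependency-free:
    # if it pinged the database, a DB outage would restart every pod.
    return 200, "ok"

def readiness(db_ok: bool, cache_ok: bool) -> tuple[int, str]:
    # Readiness answers "can this pod take traffic right now?"
    # Failing it removes the pod from the Service without restarting it.
    if db_ok and cache_ok:
        return 200, "ready"
    return 503, "not ready"

print(liveness())                              # (200, 'ok')
print(readiness(db_ok=True, cache_ok=False))   # (503, 'not ready')
```

The asymmetry is the point: a pod that is alive but not ready gets drained, not killed.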

Horizontal Pod Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 5
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max

The philosophy: scale fast on the way up, slow on the way down.

During a lunch rush, orders spike. You want new pods spinning up within 15 seconds. But you don't want to thrash—spinning up and down pods every second burns CPU and disrupts connections. The stabilizationWindowSeconds on scale-down prevents that.

Critical: Target 60% CPU, not 80%. Kubernetes needs headroom. When the HPA triggers, it takes 15-30 seconds for new pods to be ready. If you're already at 80% usage when scaling starts, you'll hit 95%+ before the new capacity arrives.
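The headroom argument falls out of the HPA's core formula, desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). A quick sanity check (the arithmetic is what the controller does; the scenario numbers are illustrative):

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float) -> int:
    # The HPA scaling rule: ceil(current * currentUtilization / target)
    return math.ceil(current * current_util / target_util)

# Lunch rush: 5 pods running hot at 90% CPU.
print(desired_replicas(5, 90, 60))  # target 60% -> scale to 8 pods
print(desired_replicas(5, 90, 80))  # target 80% -> only 6 pods, little headroom
```

The lower target asks for more spare capacity up front, which is exactly what covers the 15-30 seconds of pod startup lag.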

The Data Layer

Food delivery generates data at scale. Orders, geolocation, delivery tracking, cancellations—every action is a data point.

PostgreSQL: Your Source of Truth

Use managed PostgreSQL on DigitalOcean. Don't run it in Kubernetes. Running databases in containers works for learning Kubernetes. In production, it's a liability.

-- Orders table: OLTP workload
CREATE TABLE orders (
  id BIGSERIAL PRIMARY KEY,
  customer_id BIGINT NOT NULL,
  restaurant_id BIGINT NOT NULL,
  status VARCHAR(20) NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  delivered_at TIMESTAMPTZ,

  FOREIGN KEY (customer_id) REFERENCES customers(id),
  FOREIGN KEY (restaurant_id) REFERENCES restaurants(id)
);

-- Postgres has no inline INDEX clause; indexes are created separately
CREATE INDEX idx_customer_created ON orders (customer_id, created_at DESC);
CREATE INDEX idx_restaurant_status ON orders (restaurant_id, status);
CREATE INDEX idx_status_created ON orders (status, created_at DESC);

-- Order items: denormalize for query performance
CREATE TABLE order_items (
  id BIGSERIAL PRIMARY KEY,
  order_id BIGINT NOT NULL,
  menu_item_id BIGINT NOT NULL,
  quantity INT NOT NULL,
  price_cents BIGINT NOT NULL,

  FOREIGN KEY (order_id) REFERENCES orders(id) ON DELETE CASCADE
);

CREATE INDEX idx_order_id ON order_items (order_id);

-- Deliveries: track driver, location, ETA
CREATE TABLE deliveries (
  id BIGSERIAL PRIMARY KEY,
  order_id BIGINT NOT NULL UNIQUE,  -- UNIQUE already gives us an index on order_id
  driver_id BIGINT NOT NULL,
  status VARCHAR(20) NOT NULL,
  current_lat NUMERIC(10,8),
  current_lng NUMERIC(11,8),
  eta_minutes INT,
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),

  FOREIGN KEY (order_id) REFERENCES orders(id),
  FOREIGN KEY (driver_id) REFERENCES drivers(id)
);

CREATE INDEX idx_driver_status ON deliveries (driver_id, status);

Optimizations:

  • Indexed for queries, not for writes: The status/created indexes are for customer queries ("show me my orders"). They slow writes slightly. Worth it.
  • Foreign keys: They enforce data integrity and add negligible overhead.
  • Denormalization where it matters: order_items lives separate from orders but price is denormalized. You need historical pricing. Don't calculate it.

Connection pooling in your app layer (use PgBouncer if deploying to Kubernetes):

# pgbouncer.ini
[databases]
production = host=prod-db.c.digitalocean.com port=5432 dbname=food_delivery user=app

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
min_pool_size = 5

Why transaction-level pooling? Because your services are stateless. Holding a connection across multiple transactions wastes resources.
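A toy in-memory model shows what that buys you (illustrative only; PgBouncer does this at the wire-protocol level). The `FakeConn` class is a stand-in for a real database connection:

```python
import contextlib
import queue

class TransactionPool:
    """Toy model of pool_mode = transaction: a connection is borrowed
    only for the duration of one transaction, so many short-lived
    client transactions share a handful of server connections."""
    def __init__(self, connect, size: int):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    @contextlib.contextmanager
    def transaction(self):
        conn = self._idle.get()       # borrow for exactly one transaction
        try:
            yield conn
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        finally:
            self._idle.put(conn)      # returned immediately, never held

class FakeConn:
    def __init__(self): self.commits = 0
    def commit(self): self.commits += 1
    def rollback(self): pass

pool = TransactionPool(FakeConn, size=2)
for _ in range(100):                  # 100 transactions over 2 connections
    with pool.transaction():
        pass
print(sum(c.commits for c in pool._idle.queue))  # 100
```

This is why `max_client_conn = 1000` can sit on top of `default_pool_size = 25`: clients hold a server connection for milliseconds, not for their whole session.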

Redis: Session & Cache Layer

apiVersion: v1
kind: Service
metadata:
  name: redis-cluster
spec:
  ports:
  - port: 6379
  clusterIP: None
  selector:
    app: redis
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis-cluster
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        command:
        - redis-server
        - "--cluster-enabled"
        - "yes"
        - "--cluster-config-file"
        - "/data/nodes.conf"
        - "--cluster-node-timeout"
        - "5000"
        - "--appendonly"
        - "yes"
        ports:
        - containerPort: 6379
          name: client
        - containerPort: 16379
          name: gossip
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "200m"
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi

Redis in Kubernetes is tricky. You need:

  • Persistent volume: cache loss on restart dumps every request onto PostgreSQL at once, which can cascade into failures.
  • Cluster mode: 3+ nodes for quorum.
  • Monitoring: Track evictions, memory pressure.

What do you cache?

  • Session state — authentication tokens, user preferences
  • Menu data — restaurant menus don't change every second
  • Geospatial indices — driver locations for "nearest restaurants/drivers"
  • Rate limiting — allow 30 orders per minute per user

Don't cache:

  • Order data — it's authoritative in PostgreSQL
  • Real-time delivery tracking — go to the source

Cache TTL strategy:

# Session: 1 day
cache.set(f"session:{auth_token}", user_data, ex=86400)

# Menu: 1 hour (updates are rare)
cache.set(f"menu:{restaurant_id}", menu_json, ex=3600)

# Driver location: 10 seconds (updates frequently)
cache.set(f"delivery:{order_id}:location", geo_data, ex=10)

# Rate limit: counter per user per minute, expiring after 60 seconds
# (INCR takes no TTL argument, so set the expiry alongside it)
pipe = cache.pipeline()
pipe.incr(f"rate_limit:{user_id}:{minute}")
pipe.expire(f"rate_limit:{user_id}:{minute}", 60)
pipe.execute()

Traffic Management: Kong API Gateway

apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: rate-limiting
plugin: rate-limiting
config:
  minute: 600  # 10 requests/second per consumer
  policy: local
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-gateway
  annotations:
    konghq.com/plugins: "rate-limiting"
spec:
  rules:
  - host: api.food-delivery.local
    http:
      paths:
      - path: /orders
        pathType: Prefix
        backend:
          service:
            name: order-service
            port:
              number: 8080
      - path: /restaurants
        pathType: Prefix
        backend:
          service:
            name: restaurant-service
            port:
              number: 8080

Kong (or nginx ingress) handles:

  • Rate limiting — prevent abuse
  • Request logging — audit trail
  • SSL termination — TLS at the edge
  • Request transformation — add/modify headers

Critical for food delivery:

  • Client-side retries: Your mobile app retries failed requests. Kong needs idempotency-key tracking so duplicate payment attempts are rejected.
  • Circuit breaking: If order-service is down, fail fast rather than holding connections.
  • Timeouts: Order placement should timeout in 5 seconds. Keep connections from hanging.
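Idempotency-key handling is simple to reason about in a sketch (in-memory here; in production the seen-keys map lives in Redis, e.g. via SET with NX and a TTL, so every gateway replica sees it). The key, handler, and response shapes below are illustrative:

```python
import uuid

class IdempotencyStore:
    """First request with a given Idempotency-Key executes; retries
    replay the stored response instead of charging again."""
    def __init__(self):
        self._responses: dict[str, dict] = {}

    def run_once(self, key: str, handler):
        if key in self._responses:
            return self._responses[key]   # duplicate retry: replay
        response = handler()
        self._responses[key] = response
        return response

store = IdempotencyStore()
key = str(uuid.uuid4())                   # client mints one per order attempt
charges = []

def charge_card():
    charges.append("charged")
    return {"order_id": 42, "status": "paid"}

first = store.run_once(key, charge_card)
retry = store.run_once(key, charge_card)  # mobile app retried on timeout
print(len(charges), first == retry)       # charged once, same response
```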

Observability: You Can't Manage What You Don't Measure

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

Metrics you need:

# Order placement latency (crucial for UX): p50/p99 via histogram_quantile
histogram_quantile(0.99, rate(order_placement_duration_seconds_bucket[5m]))

# Service error rates
order_service_errors_total{method="POST",endpoint="/orders",status="500"}

# Database connection pool exhaustion (early warning sign)
postgres_connections_busy / postgres_connections_max

# Cache hit ratios
redis_cache_hits / (redis_cache_hits + redis_cache_misses)

# Delivery SLA compliance
delivery_completed_within_eta_percent

Alert on business metrics, not just infrastructure:

- alert: HighOrderErrorRate
  # error *ratio*, not raw errors/sec; assumes a matching requests counter
  expr: >
    rate(order_service_errors_total[5m])
    / rate(order_service_requests_total[5m]) > 0.01
  for: 2m
  annotations:
    summary: "Order service error rate > 1%"

- alert: DeliveryETAMissed
  expr: delivery_completed_within_eta_percent < 0.95
  for: 5m
  annotations:
    summary: "Less than 95% of deliveries meeting SLA"

The Scaling Reality

You don't scale from 100 to 100K orders smoothly. You scale in phases:

Phase 1 (0-1K orders/day): Single node. PostgreSQL on shared instance. Redis for cache. Works fine.

Phase 2 (1-10K orders/day): 3-node Kubernetes cluster. Managed PostgreSQL. Redis cluster. You hit your first bottleneck: database writes during lunch rush.

Phase 3 (10-50K orders/day): Read replicas for analytics queries. Order service optimizations (batch writes, denormalization). Queue-based processing for async tasks.

Phase 4 (50K+ orders/day): Event-driven architecture. Orders become events. Delivery tracking is event-driven. Analytics queries read from a data warehouse, not production PostgreSQL.

Each phase costs money and engineering time. You don't jump to Phase 4 on day one.

The Hard Tradeoffs

Consistency vs. Availability: When your payment service is unavailable, do you fail the order or queue it? Most food delivery platforms queue and settle later. Your customer sees "order received, payment pending." The risk: double-charging if not handled carefully.

Latency vs. Cost: Replicating data into regions near your drivers cuts latency but adds infrastructure cost and operational complexity. Not worth it at 10K orders/day. Worth it at 100K.

Vendor Lock-in vs. Simplicity: Running on DigitalOcean means you're locked into their infrastructure. But the alternative—building for portability—costs engineering time. By the time you're big enough for portability to matter, you can afford the migration.

What Gets You Killed

  1. No monitoring — You won't know you're in trouble until customers tell you.
  2. Database without backups — One corrupted query and you lose your business.
  3. Single point of failure — One node, one database, one Redis instance. Fine until it isn't.
  4. Coupling payment to delivery — If delivery tracking fails, orders should still process.
  5. No circuit breakers — When a downstream service is down, it should fast-fail, not hang.
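Item 5 in miniature: a minimal circuit breaker (a sketch, not a drop-in library; Kong and most HTTP clients ship real ones). After `threshold` consecutive failures it opens and fails fast for `cooldown` seconds:

```python
import time

class CircuitBreaker:
    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None        # half-open: let one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                # success closes the circuit
        return result

def flaky():
    raise ConnectionError("order-service unreachable")

breaker = CircuitBreaker(threshold=2, cooldown=60.0)
for _ in range(2):                       # downstream service is down
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass
try:
    breaker.call(lambda: "ok")           # now fails fast instead of hanging
except RuntimeError as e:
    print(e)                             # circuit open: failing fast
```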

What Gets You Money

  1. Fast order placement — Sub-second order placement increases conversion.
  2. Accurate delivery ETAs — Reduces support tickets, increases customer lifetime value.
  3. Reliable payment processing — Failed payments = lost revenue and customer frustration.
  4. Low infrastructure costs — Direct margin improvement.

Starting Out: The Minimal Viable Scale

Don't build Phase 4 on day one.

# Day 1 production
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
      - name: order-service
        image: registry.digitalocean.com/myapp/order-service:v1
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
---
apiVersion: v1
kind: Service
metadata:
  name: order-service
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: order-service

One node. Two pod replicas for redundancy. PostgreSQL managed database. Done.

As traffic grows, you scale horizontally: add nodes, add replicas, add read replicas. The architecture doesn't change—it just gets broader.

The End State

Six months in, you're running:

  • 50-node DigitalOcean Kubernetes cluster
  • Managed PostgreSQL with read replicas
  • Redis cluster for caching
  • Kong for traffic management
  • Prometheus + Grafana for observability
  • Order volume: 50K orders/day

Cost: ~$15-20K monthly infrastructure. At 50K orders/day (roughly 1.5M a month), that's about $0.01-0.015 per order in infrastructure. Your margins are healthy if unit economics support it.

You're not at Uber scale. But you're past the "will we survive" phase. You're in the "how do we optimize" phase.

That's the goal. Not perfection. Viability.


This playbook assumes you've already shipped a functional food delivery platform. You have customers, revenue, and data. Now you're scaling it competently.

If you're still in the MVP phase, scale vertically first. One server. One database. Get product-market fit. Then optimize infrastructure.