You've shipped. Users are signing up. Orders are flowing through. The question isn't whether you'll need to scale—it's whether you'll scale competently or watch your platform collapse.
This is the CTO's reality for food delivery platforms. We operate in a market where latency kills revenue, downtime costs money in refunds, and vendor lock-in is a slow bleed you can't afford.
Here's how we scale without betting the company.
The Architecture Snapshot
# The stack we're building toward
- DigitalOcean Kubernetes (DOKS)
- Persistent volumes for stateful services
- Redis Cluster for session/cache layer
- PostgreSQL managed database
- Kong API Gateway for traffic shaping
- Timescale for event analytics
This isn't exotic. It's battle-tested. But every component choice is a constraint—latency, cost, operational complexity.
Why DigitalOcean, Not AWS
AWS is power. DigitalOcean is coherence.
For a food delivery platform, the difference matters.
AWS gives you 200+ services. You spend months picking the right combination. Your ops team needs deep AWS expertise. Your Terraform grows into a 5,000-line monstrosity. Your bill arrives and surprises you because you forgot to tag resources properly.
DigitalOcean gives you:
- Kubernetes that works — DOKS is simpler than EKS. Same concepts. Less configuration noise.
- Transparent pricing — nodes start at $12/month. You know what you're paying for.
- No service multiplication — Fewer choices means faster decisions.
- Community solutions — Most food delivery platforms run on similar stacks. You find reference implementations.
The cost difference? Roughly 30-40% cheaper at comparable scale. For the 50-node cluster described at the end of this piece, that works out to high four to five figures monthly you don't burn on the AWS equivalent.
AWS wins when you need:
- Global CDN at scale
- Advanced ML services
- Exotic compliance requirements
We don't have those constraints yet. Optimize for the platform you have, not the one you might build in 3 years.
The Kubernetes Layer
Your food delivery app needs these primitives:
Stateless Microservices
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: registry.digitalocean.com/myapp/order-service:v1.2.3
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - order-service
                topologyKey: kubernetes.io/hostname
Key decisions:
- Resources: Set requests/limits. Without them, noisy neighbors crash your platform.
- Pod Anti-Affinity: Spread replicas across nodes. One node failure doesn't cascade.
- Probes: Liveness tells Kubernetes when to restart. Readiness tells it when to accept traffic. Get these wrong and you have zombie pods consuming resources.
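What do those probes actually hit? A minimal sketch of the two endpoints, assuming a Flask service and psycopg2; the paths match the manifest above, everything else is illustrative:

from flask import Flask, jsonify
import psycopg2

app = Flask(__name__)
DSN = "host=prod-db.example port=5432 dbname=food_delivery user=app"  # placeholder

@app.route("/health")
def health():
    # Liveness: "is the process alive". Keep it dependency-free, or a
    # database blip restarts every pod in the fleet at once.
    return jsonify(status="ok"), 200

@app.route("/ready")
def ready():
    # Readiness: gate traffic on real dependencies. A real service would
    # check its existing pool rather than dialing a fresh connection.
    try:
        conn = psycopg2.connect(DSN, connect_timeout=1)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
        finally:
            conn.close()
        return jsonify(status="ready"), 200
    except psycopg2.Error:
        return jsonify(status="not ready"), 503

The asymmetry is deliberate: liveness failures restart pods, so that check must never depend on anything outside the process; readiness failures only pull the pod out of rotation, so that's where dependency checks belong.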
Horizontal Pod Autoscaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 5
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max
The philosophy: scale fast on the way up, slow on the way down.
During a lunch rush, orders spike. You want new pods spinning up within 15 seconds. But you don't want to thrash—spinning up and down pods every second burns CPU and disrupts connections. The stabilizationWindowSeconds on scale-down prevents that.
Critical: Target 60% CPU, not 80%. Kubernetes needs headroom. When the HPA triggers, it takes 15-30 seconds for new pods to be scheduled, pulled, and pass readiness. If you're already at 80% utilization when scaling starts, a lunch-rush ramp pushes you past 95% before the new capacity arrives; starting from 60% leaves enough slack to absorb those 15-30 seconds of growth.
The Data Layer
Food delivery generates data at scale. Orders, geolocation, delivery tracking, cancellations—every action is a data point.
PostgreSQL: Your Source of Truth
Use managed PostgreSQL on DigitalOcean. Don't run it in Kubernetes. Running databases in containers works for learning Kubernetes. In production, it's a liability.
-- Orders table: OLTP workload
CREATE TABLE orders (
    id BIGSERIAL PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    restaurant_id BIGINT NOT NULL,
    status VARCHAR(20) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    delivered_at TIMESTAMPTZ,
    FOREIGN KEY (customer_id) REFERENCES customers(id),
    FOREIGN KEY (restaurant_id) REFERENCES restaurants(id)
);

-- PostgreSQL has no inline INDEX clause; indexes are separate statements
CREATE INDEX idx_customer_created ON orders (customer_id, created_at DESC);
CREATE INDEX idx_restaurant_status ON orders (restaurant_id, status);
CREATE INDEX idx_status_created ON orders (status, created_at DESC);

-- Order items: denormalize for query performance
CREATE TABLE order_items (
    id BIGSERIAL PRIMARY KEY,
    order_id BIGINT NOT NULL,
    menu_item_id BIGINT NOT NULL,
    quantity INT NOT NULL,
    price_cents BIGINT NOT NULL,
    FOREIGN KEY (order_id) REFERENCES orders(id) ON DELETE CASCADE
);

CREATE INDEX idx_order_items_order ON order_items (order_id);

-- Deliveries: track driver, location, ETA
CREATE TABLE deliveries (
    id BIGSERIAL PRIMARY KEY,
    order_id BIGINT NOT NULL UNIQUE,  -- UNIQUE already creates the order_id index
    driver_id BIGINT NOT NULL,
    status VARCHAR(20) NOT NULL,
    current_lat NUMERIC(10,8),
    current_lng NUMERIC(11,8),
    eta_minutes INT,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    FOREIGN KEY (order_id) REFERENCES orders(id),
    FOREIGN KEY (driver_id) REFERENCES drivers(id)
);

CREATE INDEX idx_driver_status ON deliveries (driver_id, status);
Optimizations:
- Indexed for queries, not for writes: The status/created indexes are for customer queries ("show me my orders"). They slow writes slightly. Worth it.
- Foreign keys: They enforce data integrity and add negligible overhead.
- Denormalization where it matters: order_items lives in its own table, but the price is copied into each row as price_cents. You need the price as it was at order time; don't recompute it from the current menu.
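To make the index tradeoff concrete, here's the "show me my orders" read path that idx_customer_created exists for, sketched with psycopg2 (connection setup omitted):

def recent_orders(conn, customer_id: int, limit: int = 20):
    # (customer_id, created_at DESC) matches the index column order,
    # so Postgres can walk the index instead of sorting the table.
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, restaurant_id, status, created_at
            FROM orders
            WHERE customer_id = %s
            ORDER BY created_at DESC
            LIMIT %s
            """,
            (customer_id, limit),
        )
        return cur.fetchall()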
Put connection pooling between your services and the database (PgBouncer works well and runs happily inside Kubernetes):
# pgbouncer.ini
[databases]
production = host=prod-db.c.digitalocean.com port=5432 dbname=food_delivery user=app
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
min_pool_size = 5
Why transaction-level pooling? Because your services are stateless. Holding a connection across multiple transactions wastes resources.
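What that buys you from the service side, sketched with psycopg2 and PgBouncer on its conventional port 6432 (hostname and credentials are placeholders):

import psycopg2

# Connect to PgBouncer, not Postgres directly.
conn = psycopg2.connect(
    host="pgbouncer.internal", port=6432,
    dbname="production", user="app", password="change-me",
)

def place_order(customer_id: int, restaurant_id: int) -> int:
    # One short transaction per unit of work. In transaction mode,
    # PgBouncer returns the server connection to the pool on COMMIT,
    # so session state (SET, LISTEN, named prepared statements)
    # won't survive between transactions.
    with conn:  # commits on success, rolls back on exception
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO orders (customer_id, restaurant_id, status) "
                "VALUES (%s, %s, 'pending') RETURNING id",
                (customer_id, restaurant_id),
            )
            return cur.fetchone()[0]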
Redis: Session & Cache Layer
apiVersion: v1
kind: Service
metadata:
  name: redis-cluster
spec:
  clusterIP: None
  ports:
    - port: 6379
  selector:
    app: redis
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis-cluster
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          command:
            - redis-server
            - "--cluster-enabled"
            - "yes"
            - "--cluster-config-file"
            - "/data/nodes.conf"
            - "--cluster-node-timeout"
            - "5000"
            - "--appendonly"
            - "yes"
          ports:
            - containerPort: 6379
              name: client
            - containerPort: 16379
              name: gossip
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "200m"
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 10Gi
Redis in Kubernetes is tricky. You need:
- Persistent volumes: Cache loss causes cascading failures; a cold cache sends every request straight to PostgreSQL.
- Cluster mode: At least three master nodes for quorum.
- Monitoring: Track evictions and memory pressure (a quick check follows this list).
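For that monitoring bullet, the quickest signal comes straight from INFO. A sketch using redis-py; in production you'd scrape these numbers with a Prometheus exporter instead:

import redis

r = redis.Redis(host="redis-cluster", port=6379)

stats = r.info("stats")
memory = r.info("memory")
# Any nonzero eviction count means the cache is undersized
# (or TTLs are too generous) and hot keys are being dropped.
print("evicted_keys:", stats.get("evicted_keys", 0))
print("used_memory:", memory["used_memory_human"])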
What do you cache?
- Session state — authentication tokens, user preferences
- Menu data — restaurant menus don't change every second
- Geospatial indices — driver locations for "nearest restaurants/drivers"
- Rate limiting — allow 30 orders per minute per user
Don't cache:
- Order data — it's authoritative in PostgreSQL
- Real-time delivery tracking — go to the source
Cache TTL strategy:
# Session: 1 day
cache.set(f"session:{auth_token}", user_data, ex=86400)
# Menu: 1 hour (updates are rare)
cache.set(f"menu:{restaurant_id}", menu_json, ex=3600)
# Driver location: 10 seconds (updates frequently)
cache.set(f"delivery:{order_id}:location", geo_data, ex=10)
# Rate limit: 60 seconds. INCR takes no ex argument in redis-py,
# so set the TTL when the counter is first created.
key = f"rate_limit:{user_id}:{minute}"
if cache.incr(key) == 1:
    cache.expire(key, 60)
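The geospatial entry in the cache list deserves a concrete example. A sketch of driver dispatch with Redis GEO commands, assuming redis-py 4+; key names and coordinates are illustrative, and GEOSEARCH needs Redis 6.2+ (the redis:7-alpine image above qualifies):

import redis

r = redis.Redis(host="redis-cluster", port=6379)

# Driver location pings land in a geo set (member is the driver id).
r.geoadd("drivers:online", (77.5946, 12.9716, "driver:42"))  # lng, lat, member

# Dispatch: every driver within 5 km of the restaurant, nearest first.
nearby = r.geosearch(
    "drivers:online",
    longitude=77.60,
    latitude=12.97,
    radius=5,
    unit="km",
    withdist=True,
    sort="ASC",
)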
Traffic Management: Kong API Gateway
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: rate-limiting
plugin: rate-limiting
config:
  minute: 600   # 10 requests/second per client
  policy: local
---
apiVersion: configuration.konghq.com/v1
kind: KongIngress
metadata:
  name: api-upstream   # attach to a Service via the konghq.com/override annotation
upstream:
  hash_on: none
  algorithm: round-robin
  slots: 10
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-gateway
  annotations:
    konghq.com/plugins: "rate-limiting"
spec:
  ingressClassName: kong
  rules:
    - host: api.food-delivery.local
      http:
        paths:
          - path: /orders
            pathType: Prefix
            backend:
              service:
                name: order-service
                port:
                  number: 8080
          - path: /restaurants
            pathType: Prefix
            backend:
              service:
                name: restaurant-service
                port:
                  number: 8080
Kong (or nginx ingress) handles:
- Rate limiting — prevent abuse
- Request logging — audit trail
- SSL termination — TLS at the edge
- Request transformation — add/modify headers
Critical for food delivery:
- Client-side retries: Your mobile app retries failed requests. You need idempotency-key tracking so duplicate payment attempts are rejected, not replayed (sketched after this list).
- Circuit breaking: If order-service is down, fail fast rather than holding connections.
- Timeouts: Order placement should timeout in 5 seconds. Keep connections from hanging.
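Idempotency can live in Kong or in the app layer. A sketch of the app-layer version backed by Redis; process_order and the key schema are illustrative, not a specific library:

import json
import redis

r = redis.Redis(host="redis-cluster", port=6379)

def create_order_idempotent(idempotency_key: str, payload: dict) -> dict:
    key = f"idem:{idempotency_key}"
    # Reserve the key atomically. If it already exists, this is a retry:
    # return the stored response instead of charging the card twice.
    if not r.set(key, "in-flight", nx=True, ex=86400):
        cached = r.get(key)
        if cached is None or cached == b"in-flight":
            raise RuntimeError("original request still in flight; retry later")
        return json.loads(cached)
    response = process_order(payload)  # your real order + payment logic
    r.set(key, json.dumps(response), ex=86400)
    return response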
Observability: You Can't Manage What You Don't Measure
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
Metrics you need:
# Order placement latency (crucial for UX): p99 shown; repeat for p50, p99.9
histogram_quantile(0.99, rate(order_placement_duration_seconds_bucket[5m]))
# Service error rates
rate(order_service_errors_total{method="POST",endpoint="/orders",status="500"}[5m])
# Database connection pool saturation (early warning sign)
postgres_connections_busy / postgres_connections_max
# Cache hit ratio
redis_cache_hits / (redis_cache_hits + redis_cache_misses)
# Delivery SLA compliance
delivery_completed_within_eta_percent
Alert on business metrics, not just infrastructure:
- alert: HighOrderErrorRate
  # error ratio, not raw error rate; assumes a matching requests counter
  expr: >
    rate(order_service_errors_total[5m])
      / rate(order_service_requests_total[5m]) > 0.01
  for: 2m
  annotations:
    summary: "Order service error rate > 1%"
- alert: DeliveryETAMissed
  expr: delivery_completed_within_eta_percent < 0.95
  for: 5m
  annotations:
    summary: "Less than 95% of deliveries meeting SLA"
The Scaling Reality
You don't scale from 100 to 100K orders smoothly. You scale in phases:
Phase 1 (0-1K orders/day): Single node. PostgreSQL on shared instance. Redis for cache. Works fine.
Phase 2 (1-10K orders/day): 3-node Kubernetes cluster. Managed PostgreSQL. Redis cluster. You hit your first bottleneck: database writes during lunch rush.
Phase 3 (10-50K orders/day): Read replicas for analytics queries. Order service optimizations (batch writes, denormalization). Queue-based processing for async tasks.
Phase 4 (50K+ orders/day): Event-driven architecture. Orders become events. Delivery tracking is event-driven. Analytics queries read from a data warehouse, not production PostgreSQL.
Each phase costs money and engineering time. You don't jump to Phase 4 on day one.
The Hard Tradeoffs
Consistency vs. Availability: When your payment service is unavailable, do you fail the order or queue it? Most food delivery platforms queue and settle later. Your customer sees "order received, payment pending." The risk: double-charging if retries aren't handled carefully (a sketch follows).
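A sketch of the queue-and-settle path, assuming psycopg2 and any queue client with an enqueue method; the load-bearing detail is keying the charge on the order id, not the particular queue:

def accept_order(conn, queue, order: dict) -> int:
    # 1. Record the order immediately; the customer sees
    #    "order received, payment pending".
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO orders (customer_id, restaurant_id, status) "
                "VALUES (%s, %s, 'payment_pending') RETURNING id",
                (order["customer_id"], order["restaurant_id"]),
            )
            order_id = cur.fetchone()[0]
    # 2. Hand the charge to a worker. The worker charges at most once per
    #    idempotency key, which is what prevents a double-charge when the
    #    payment service comes back and queued retries replay.
    queue.enqueue({"order_id": order_id, "idempotency_key": f"pay:{order_id}"})
    return order_id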
Latency vs. Cost: Replicating data into regions close to your drivers cuts read latency, but every replica adds infrastructure and operational cost. Not worth it at 10K orders/day. Worth it at 100K.
Vendor Lock-in vs. Simplicity: Running on DigitalOcean means you're locked into their infrastructure. But the alternative—building for portability—costs engineering time. By the time you're big enough for portability to matter, you can afford the migration.
What Gets You Killed
- No monitoring — You won't know you're in trouble until customers tell you.
- Database without backups — One bad migration or fat-fingered DELETE and you lose your business.
- Single point of failure — One node, one database, one Redis instance. Fine until it isn't.
- Coupling payment to delivery — If delivery tracking fails, orders should still process.
- No circuit breakers — When a downstream service is down, callers should fail fast, not hang (a minimal sketch follows this list).
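For reference, the pattern fits in a page. A minimal circuit-breaker sketch in plain Python; in production you'd reach for a library or do this at the Kong/Envoy layer rather than rolling your own:

import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        # usage: breaker.call(charge_card, order_id) -- charge_card is hypothetical
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.failures = 0  # half-open: let one probe request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result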
What Gets You Money
- Fast order placement — Sub-second order placement increases conversion.
- Accurate delivery ETAs — Reduces support tickets, increases customer lifetime value.
- Reliable payment processing — Failed payments = lost revenue and customer frustration.
- Low infrastructure costs — Direct margin improvement.
Starting Out: The Minimal Viable Scale
Don't build Phase 4 on day one.
# Day 1 production
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 2
  selector:            # required; must match the pod template labels
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: registry.digitalocean.com/myapp/order-service:v1
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
---
apiVersion: v1
kind: Service
metadata:
  name: order-service
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: order-service
One node. Two pod replicas, so a crashed pod or a rolling deploy doesn't take you offline. PostgreSQL managed database. Done.
As traffic grows, you scale horizontally: add nodes, add replicas, add read replicas. The architecture doesn't change—it just gets broader.
The End State
Six months in, you're running:
- 50-node DigitalOcean Kubernetes cluster
- Managed PostgreSQL with read replicas
- Redis cluster for caching
- Kong for traffic management
- Prometheus + Grafana for observability
- Order volume: 50K orders/day
Cost: ~$15-20K monthly infrastructure. At 50K orders/day (roughly 1.5M orders/month), that's about $0.01-0.015 per order in infrastructure. Your margins are healthy if the rest of your unit economics hold.
You're not at Uber scale. But you're past the "will we survive" phase. You're in the "how do we optimize" phase.
That's the goal. Not perfection. Viability.
This playbook assumes you've already shipped a functional food delivery platform. You have customers, revenue, and data. Now you're scaling it competently.
If you're still in the MVP phase, scale vertically first. One server. One database. Get product-market fit. Then optimize infrastructure.