You charge the customer's card. The payment processor confirms the transaction. Your database is updated. Three things happened sequentially.
That is not how distributed systems work.
Three independent systems communicated over the network. State changes did not align. Your payment processor accepted the charge one second before your database updated. During that second, the system exists in an impossible state: money gone, database unaware.
This is the consistency problem. It lives at the heart of every distributed system and kills production systems when ignored.
The Incident Pattern
Customer: "You charged me but my order is still pending."
Diagnostics:
- Payment processor: transaction succeeded. Charge is real.
- Database: order status is "pending." Not "paid"—waiting for payment.
- Webhook arrived at gateway. Returned HTTP 200.
- Order status never transitioned to "paid."
Three facts are simultaneously true. They should not be.
State Diagram of Failure:
Normal Flow (Expected):
┌─────────────┐ webhook ┌──────────────┐ update ┌──────────────┐
│   Payment   │────────→│ Order System │───────→│   Database   │
│  Processor  │         │  (Handler)   │        │   (Order)    │
└─────────────┘         └──────────────┘        └──────────────┘
 status: paid              processed              status: paid
Failure State (Reality):
┌─────────────┐ webhook ┌──────────────┐ update ┌──────────────┐
│   Payment   │────────→│ Order System │┄ ┄ ┄ ┄→│   Database   │
│  Processor  │ received│  (Handler)   │ CRASH  │   (Order)    │
└─────────────┘         └──────────────┘        └──────────────┘
 status: paid             acknowledged           status: pending
System is split: payment taken, order not marked paid
The Failure Modes
1. Webhook Handler Crashes
Handler receives payment confirmation. Parses message. Tries to update order status—database connection pool is exhausted.
Handler crashes. Never acknowledges the message.
Payment processor resends webhook after 30 minutes. Customer already complained.
Fix: Write webhook to a durable message queue immediately, then acknowledge. Separate worker reads queue and updates database. If the worker crashes, the queue persists the message and the worker picks it up on restart. The inconsistency is temporary, not permanent.
Most systems don't do this. They call database directly. When handler crashes, inconsistency is baked in.
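A minimal sketch of the queue-then-acknowledge fix, assuming a SQLite table stands in for the durable queue. The table names `webhook_queue` and `orders`, the payload shape, and the function names are hypothetical, chosen only for illustration; a real system would likely use a dedicated message broker.

```python
import json
import sqlite3


def init_queue(conn: sqlite3.Connection) -> None:
    """Create the hypothetical queue and orders tables."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS webhook_queue (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               payload TEXT NOT NULL,
               processed INTEGER NOT NULL DEFAULT 0
           )"""
    )
    conn.execute(
        """CREATE TABLE IF NOT EXISTS orders (
               order_id TEXT PRIMARY KEY,
               status TEXT NOT NULL
           )"""
    )
    conn.commit()


def receive_webhook(conn: sqlite3.Connection, payload: dict) -> int:
    """Persist the raw webhook first; report 200 only once it is stored."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO webhook_queue (payload) VALUES (?)",
                (json.dumps(payload),),
            )
    except sqlite3.Error:
        return 500  # not persisted, so the processor should resend
    return 200


def drain_queue(conn: sqlite3.Connection) -> int:
    """Worker: apply each queued webhook and mark it processed atomically."""
    rows = conn.execute(
        "SELECT id, payload FROM webhook_queue WHERE processed = 0 ORDER BY id"
    ).fetchall()
    for row_id, raw in rows:
        event = json.loads(raw)
        with conn:  # order update and queue bookkeeping commit together
            conn.execute(
                "UPDATE orders SET status = 'paid' WHERE order_id = ?",
                (event["order_id"],),
            )
            conn.execute(
                "UPDATE webhook_queue SET processed = 1 WHERE id = ?", (row_id,)
            )
    return len(rows)
```

If the worker dies partway through, unprocessed rows are still in the table and the next `drain_queue` call picks them up.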
2. Handler Returns 200 Before Write Confirms
Sequence:
- Opens database transaction
- Updates order status
- Closes database connection (commit queued but not confirmed)
- Returns HTTP 200 to payment processor
- Database may or may not confirm the commit
Payment processor trusts the 200 and doesn't resend. But if the commit failed, the order never transitioned to "paid."
The worse variant:
- Opens transaction
- Updates order status
- Commits the transaction
- Error occurs in the response-building phase (serialization fails, validation check fails)
- Returns an error (500, 503, 409, any non-2xx) even though the commit already succeeded
Now the payment processor sees failure and resends the webhook. Your system processes the same payment twice.
Both come down to the acknowledgment disagreeing with the commit. Hard to detect without distributed tracing.
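One way to avoid both variants is to let the HTTP status depend only on whether the commit succeeded. The sketch below assumes SQLite and a hypothetical `orders` table; `handle_payment_webhook` and `build_response` are illustrative names, not part of any framework.

```python
import sqlite3
from typing import Callable, Tuple


def handle_payment_webhook(
    conn: sqlite3.Connection,
    order_id: str,
    build_response: Callable[[], str] = lambda: "ok",
) -> Tuple[int, str]:
    """Return (status_code, body); the status reflects the commit outcome only."""
    try:
        # The with-block exits only after COMMIT succeeded or ROLLBACK ran.
        with conn:
            conn.execute(
                "UPDATE orders SET status = 'paid' WHERE order_id = ?",
                (order_id,),
            )
    except sqlite3.Error:
        return 500, "retry"  # nothing was committed, so a resend is safe

    try:
        body = build_response()
    except Exception:
        body = ""  # the commit is durable; a response bug must not trigger a resend
    return 200, body
```

The design choice is that anything failing after the commit is logged and swallowed, never reported to the payment processor as a failure.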
3. Multiple Microservices, No Coordination
Order service receives payment, updates order status, then calls fulfillment service to ship.
Fulfillment service is down. Ship request times out and gets discarded.
Customer's money is gone. Order marked paid. No fulfillment was requested. Fulfillment service comes back online later. No retry logic exists. The order service kept no record that the fulfillment request failed.
Order sits in paid state forever, never shipped. This is a cascading failure across service boundaries requiring explicit compensation: either the service retries systematically, or you have a human process to detect and fix mismatched orders.
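A rough sketch of "retries systematically", assuming the fulfillment call is a plain function that raises `ConnectionError` or `TimeoutError` when the service is unreachable. The function name, the parameters, and the backoff values are hypothetical.

```python
import time
from typing import Callable


def request_fulfillment_with_retry(
    send: Callable[[str], None],
    order_id: str,
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send(order_id) up to `attempts` times with exponential backoff."""
    for attempt in range(attempts):
        try:
            send(order_id)
            return True
        except (ConnectionError, TimeoutError):
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    # Out of attempts: the caller should record the order for manual follow-up.
    return False
```

In-process retries do not survive a crash of the order service itself, so a `False` result (or a crash) still needs a persisted record that a reconciliation job or a human can act on.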
4. Dual Write Problem
Receive webhook. Must do three things:
- Update the order database
- Send notification email
- Mark payment in accounting system
Write code that does all three. One fails. Database updated. Email failed. Customer's order marked paid, but they never got confirmation, and accounting doesn't know the payment came in.
Or: all succeed, but email delivery is slow. Customer reloads page and sees paid status. The confirmation email arrives 3 minutes later, and to the customer it can look like a second payment.
You're trying to update two or more independent systems atomically. Without a shared transaction, it's impossible. The network can partition between the first write and the second. The second system might be slow. One system might reject the update for local reasons the first system didn't predict.
Every distributed system has dual writes hidden somewhere, and they are always wrong.
Why Consistency Matters
Inconsistency is not academic. Customer is out real money. Business owes fulfillment or refund. Until reconciled, the inconsistency represents untracked liability.
At scale: 0.1% of payment orders inconsistent. At 100,000 orders per month, that's 100 orders per month charged but not fulfilled.
At $50 per order: $60,000 per year in lost revenue plus support tickets, refunds, chargebacks.
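The arithmetic behind those figures, written out with the numbers from the text (the variable names are just labels):

```python
orders_per_month = 100_000
inconsistent_rate = 0.001  # 0.1% of payment orders
order_value_usd = 50

affected_per_month = round(orders_per_month * inconsistent_rate)
annual_exposure_usd = affected_per_month * order_value_usd * 12

print(affected_per_month, annual_exposure_usd)  # prints: 100 60000
```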
Consistency is not a feature. It's the difference between a functioning business and one that hemorrhages money silently.
The Solutions
These are not theoretical. Used by every payment system that survived production.
Solution 1: Idempotency
Every operation must have a unique identifier. The payment processor assigns a transaction ID. The webhook handler uses that ID as a key for the idempotent operation.
When the handler receives the webhook:
- Check if already processed this transaction ID
- If yes, return 200 immediately (idempotent success)
- If no, process the request
Even if the webhook arrives twice, the second request returns the same result as the first. The database stays consistent.
This requires storing the transaction ID in the database alongside the order. Check it before every state change.
Implementation:
INSERT INTO order_payments (transaction_id, order_id, status)
VALUES ($1, $2, 'paid')
ON CONFLICT (transaction_id) DO NOTHING;
Transaction ID already exists? The insert is skipped without raising an error and affects zero rows. Order is already paid. Everything is fine.
Idempotency is not optional. Every external call that modifies state must be idempotent.
Solution 2: Event Sourcing / Append-Only Logs
Stop thinking of the order as a record with a "status" field. Instead, think of it as a sequence of events: PaymentReceived, OrderConfirmed, ShipmentInitiated, DeliveryCompleted.
When the webhook arrives, don't update the order. Write an event to an append-only log:
{
  "timestamp": "2024-10-15T14:23:14Z",
  "order_id": "ORD-123",
  "type": "PaymentReceived",
  "amount": 50.00
}
The event is immutable. It cannot fail partially. Either the event is written to the log, or it is not.
A separate process reads the event log. When it sees "PaymentReceived," it updates the order status to "paid." If that update fails, it retries. The event is still in the log. It eventually processes.
The source of truth becomes the event log, not the derived state in the orders table. The state table can be rebuilt from the log. Inconsistencies are eventually resolved.
Operational benefit: Replay the order history. Understand exactly what happened and when.
Payment Received @ 14:23:14 → Order Confirmed @ 14:23:15 → Ship Initiated @ 14:25:00 → Delivered @ 16:30:00
This approach is more operationally complex, but it's a well-established pattern for payment systems handling large volumes.
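To illustrate how the derived state can be rebuilt from the log, here is a small sketch that folds events into a status. The event types come from the text; the status strings other than "paid" and the names `Event`, `STATUS_AFTER`, and `rebuild_status` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)  # frozen: an event cannot be modified after creation
class Event:
    timestamp: str
    order_id: str
    type: str


# Hypothetical mapping from event type to the status it implies.
STATUS_AFTER = {
    "PaymentReceived": "paid",
    "OrderConfirmed": "confirmed",
    "ShipmentInitiated": "shipped",
    "DeliveryCompleted": "delivered",
}


def rebuild_status(log: Iterable[Event], order_id: str) -> str:
    """Replay the log in append order and return the order's current status."""
    status = "pending"
    for event in log:
        if event.order_id == order_id and event.type in STATUS_AFTER:
            status = STATUS_AFTER[event.type]
    return status
```

Because the function only reads the log, it can be rerun at any time to repair an orders table that has drifted.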
Solution 3: Transactional Outbox
You need to update two things atomically: the database and the message system.
Write both in the same database transaction:
- Update order table to mark status as "paid"
- Insert a row into an outbox table: {notification_id, email_address, type}
- Commit the transaction
A separate process polls the outbox table. It finds unprocessed rows, sends emails, and marks them done.
The key insight: the outbox write happens in the same transaction as the order update. They both succeed or both fail, atomically.
Failure handling: If the email send fails, the outbox row persists. A retry process sends the email later. The order state is already correct.
This pattern decouples the order update (which must be fast and atomic) from the notification dispatch (which can be slow and unreliable).
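A sketch of the outbox flow, assuming SQLite. The outbox columns follow the text; the `sent` flag, the `orders` table, the `payment_confirmation` type value, and the function names are hypothetical, and `send_email` stands in for whatever mail client is in use.

```python
import sqlite3
from typing import Callable


def setup_outbox(conn: sqlite3.Connection) -> None:
    """Create the illustrative schema."""
    conn.executescript(
        """
        CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            notification_id INTEGER PRIMARY KEY AUTOINCREMENT,
            email_address TEXT NOT NULL,
            type TEXT NOT NULL,
            sent INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def mark_paid(conn: sqlite3.Connection, order_id: str, email_address: str) -> None:
    """Order update and outbox row share one transaction: both or neither."""
    with conn:
        conn.execute(
            "UPDATE orders SET status = 'paid' WHERE order_id = ?", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (email_address, type) "
            "VALUES (?, 'payment_confirmation')",
            (email_address,),
        )


def dispatch_outbox(
    conn: sqlite3.Connection, send_email: Callable[[str, str], None]
) -> int:
    """Poller: send unsent rows, mark each done, return how many were sent."""
    rows = conn.execute(
        "SELECT notification_id, email_address, type FROM outbox "
        "WHERE sent = 0 ORDER BY notification_id"
    ).fetchall()
    sent = 0
    for notification_id, email_address, kind in rows:
        try:
            send_email(email_address, kind)
        except Exception:
            continue  # row stays unsent; the next poll retries it
        with conn:
            conn.execute(
                "UPDATE outbox SET sent = 1 WHERE notification_id = ?",
                (notification_id,),
            )
        sent += 1
    return sent
```

Delivery here is at-least-once: a crash between sending and marking the row would send the email again on the next poll, so the receiving side should tolerate duplicates.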
Solution 4: Saga Pattern
When multiple microservices must coordinate, you cannot have a single transaction across them. The network is unreliable and services may be down.
Instead, define a sequence of steps and make each step idempotent.
Orchestration (explicit coordinator):
- Central service receives payment webhook
- Calls service A, then service B, then service C
- If any step fails, it triggers compensating actions
- The sequence of operations is explicitly logged
Choreography (event-driven):
- Service A receives payment and emits an event
- Service B listens for that event and performs the next step
- Service C listens and performs the final step
- Each step is independent and idempotent
Choreography is simpler (no central coordinator) but harder to debug (you follow events across logs). Orchestration is more visible but adds a central point of failure.
Both require compensation logic: If step 3 fails, what happens? Do you refund the payment? Re-queue the order? Alert a human?
Define this before shipping.
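A minimal orchestration sketch, assuming each step is a pair of plain callables: an action and the compensation that undoes it. The `Step` type and `run_saga` are hypothetical names; a production orchestrator would also persist its progress so it can resume after a crash.

```python
from typing import Callable, List, Tuple

# (name, action, compensation); all three are supplied by the caller.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            log.append(f"{name} failed: {exc}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"{done_name} compensated")
            return False
        completed.append((name, compensate))
        log.append(f"{name} done")
    return True
```

The failed step itself is not compensated, only the steps that completed before it, which is why each action should either finish fully or leave nothing behind.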
Solution 5: Distributed Transactions (Generally: Don't)
Two-phase commit (2PC) is the academic solution: a coordinator asks every participant to prepare (take locks and promise to commit), then tells them all to commit.
It's also fragile. If the coordinator fails after the participants have prepared but before they learn the decision, they sit blocked, holding their locks, until it recovers.
In practice, keep atomic work inside a single database's ordinary local transaction. Reserve 2PC for a small number of resources you control that support it natively (for example XA, or PostgreSQL's PREPARE TRANSACTION). Don't use it across independent services. The failure modes are worse than the inconsistency you're trying to prevent.
Operational Discipline
Technical patterns require discipline to work:
- Every webhook handler must be idempotent. Not negotiable. Test by simulating duplicate webhook arrivals.
- Acknowledge webhook after queueing, not after processing. Return 200 immediately. Process asynchronously.
- Log everything. Every state transition logged with timestamp. You should be able to replay the order history and understand what happened and when.
- Reconciliation jobs. Run a nightly job that checks for inconsistencies between order database and payment processor records. If a transaction is recorded in the processor but not in your database, investigate and reconcile.
- Monitoring and alerts. If an order sits in "paid" state for more than N hours without a corresponding fulfillment request, alert someone. The system has failed.
- Runbook for manual recovery. Document: Who handles refunds? Who contacts customer? Who updates the ledger?
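The reconciliation job in the list above can be as simple as a set comparison. This sketch assumes both sides can be exported as a mapping from transaction ID to amount in cents; the function name and the result keys are hypothetical.

```python
from typing import Dict, List


def reconcile(
    processor: Dict[str, int], database: Dict[str, int]
) -> Dict[str, List[str]]:
    """Compare transaction_id -> amount_in_cents maps from both systems."""
    shared = set(processor) & set(database)
    return {
        # Charged by the processor, unknown to us: the case from the incident.
        "missing_in_database": sorted(set(processor) - set(database)),
        # Recorded by us, unknown to the processor.
        "missing_in_processor": sorted(set(database) - set(processor)),
        # Present on both sides but with different amounts.
        "amount_mismatch": sorted(t for t in shared if processor[t] != database[t]),
    }
```

Any non-empty list in the result is something to investigate, following the runbook.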
Consistency Failure Detection Pattern:
Every Hour: Run Reconciliation
┌─────────────────────────────────────────┐
│ Select orders in 'paid' state           │
│ WHERE fulfillment_requested IS NULL     │
│ AND created_at < NOW() - '2 hours'      │
└─────────────────────────────────────────┘
                    │
                    ├─ Found orders? Alert immediately
                    │
                    └─ None found? System consistent
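The check in the diagram could be sketched as follows, assuming SQLite and an `orders` table whose `created_at` column holds Unix epoch seconds. The column types, the function name, and passing `now` in explicitly are assumptions made so the example is easy to test.

```python
import sqlite3
from typing import List

TWO_HOURS_SECONDS = 2 * 60 * 60


def find_stuck_orders(
    conn: sqlite3.Connection, now: int, max_age: int = TWO_HOURS_SECONDS
) -> List[str]:
    """Return IDs of paid orders older than max_age with no fulfillment request."""
    rows = conn.execute(
        "SELECT order_id FROM orders "
        "WHERE status = 'paid' "
        "AND fulfillment_requested IS NULL "
        "AND created_at < ? "
        "ORDER BY order_id",
        (now - max_age,),
    ).fetchall()
    return [row[0] for row in rows]
```

A scheduler would call this every hour and raise an alert whenever the returned list is not empty.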
The Cost of Ignoring This
Ship a payment system without thinking through consistency, and this is what follows:
- Customers encounter orders charged but not fulfilled
- Support team spends time investigating inconsistencies instead of helping customers
- Revenue leaks to refunds and chargebacks
- Auditors ask uncomfortable questions about financial controls
Worse: consistency failures cluster. When your system is under load (holidays, marketing campaigns), the failure rate increases. You are least prepared to handle chaos exactly when the cost is highest.
Conclusion
Distributed systems do not guarantee consistency. Network partitions happen. Messages get delayed. Processes crash.
Acknowledge these facts in your architecture and plan accordingly:
- Use idempotency for every state-changing operation
- Use event logs as your source of truth
- Use transactional outbox for notifications
- Use sagas for multi-service coordination
- Acknowledge webhooks once they are durably queued, before processing them
- Log everything meticulously
- Run reconciliation jobs regularly
These are not new patterns. Every payment system that survived more than a few weeks in production uses them. They're not free either—they add complexity, operational overhead, and storage.
But the cost of not using them is higher: you lose money, you lose customers, and you lose the ability to understand what happened to your own data.
Build it right the first time. Or explain to your CEO why the charge went through but the order didn't.