The Consistency Problem: Why Distributed Systems Charge Without Fulfilling

You charge the customer's card. The payment processor confirms the transaction. Your database is updated. Three things happened sequentially.

That is not how distributed systems work.

Three independent systems communicated over a network, and their state changes did not line up. Your payment processor accepted the charge one second before your database updated. During that second, the system was in a state your code treats as impossible: money gone, database unaware.

This is the consistency problem. It lives at the heart of every distributed system and kills production systems when ignored.

The Incident Pattern

Customer: "You charged me but my order is still pending."

Diagnostics:

Payment processor: transaction succeeded. Charge is real.
Database: order status is "pending." Not "paid"—waiting for payment.
Webhook arrived at gateway. Returned HTTP 200.
Order status never transitioned to "paid."

All four facts are true at the same time. They should not be.

State Diagram of Failure:

Normal Flow (Expected):
┌─────────────┐ webhook ┌──────────────┐ update ┌──────────────┐
│   Payment   │────────→│ Order System │───────→│   Database   │
│ Processor   │         │   (Handler)  │        │   (Order)    │
└─────────────┘         └──────────────┘        └──────────────┘
     status: paid            processed            status: paid


Failure State (Reality):
┌─────────────┐ webhook ┌──────────────┐ update ┌──────────────┐
│   Payment   │────────→│ Order System │┄ ┄ ┄ ┄→│   Database   │
│ Processor   │ received│   (Handler)  │ CRASH  │   (Order)    │
└─────────────┘         └──────────────┘        └──────────────┘
     status: paid          acknowledged          status: pending

System is split: payment taken, order not marked paid

The Failure Modes

1. Webhook Handler Crashes

Handler receives payment confirmation. Parses message. Tries to update order status—database connection pool is exhausted.

Handler crashes. Never acknowledges the message.

Payment processor resends webhook after 30 minutes. Customer already complained.

Fix: Write the webhook to a durable message queue first, then acknowledge. A separate worker reads the queue and updates the database. If the worker crashes, the queue still holds the message, so the update is delayed rather than lost.

Most systems don't do this. They call database directly. When handler crashes, inconsistency is baked in.
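
A minimal sketch of the queue-first fix, under stated assumptions: an SQLite table named `webhook_inbox` stands in for a durable message queue, and the function names `init`, `receive_webhook`, and `drain_inbox` are hypothetical. A real deployment would likely use a message broker, but the ordering is the point: persist, acknowledge, then process.

```python
import json
import sqlite3


def init(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: an inbox acting as the queue, plus the orders table.
    conn.execute(
        "CREATE TABLE webhook_inbox ("
        "id INTEGER PRIMARY KEY, payload TEXT NOT NULL, "
        "processed INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute(
        "CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )


def receive_webhook(conn: sqlite3.Connection, payload: dict) -> int:
    """Persist the raw webhook, then acknowledge. No business logic here."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO webhook_inbox (payload) VALUES (?)",
                (json.dumps(payload),),
            )
    except sqlite3.Error:
        return 503  # not stored, so ask the processor to resend
    return 200


def drain_inbox(conn: sqlite3.Connection) -> int:
    """Worker: apply each unprocessed message. A crash leaves it queued."""
    rows = conn.execute(
        "SELECT id, payload FROM webhook_inbox WHERE processed = 0 ORDER BY id"
    ).fetchall()
    for row_id, raw in rows:
        message = json.loads(raw)
        with conn:  # order update and 'processed' flag commit together
            conn.execute(
                "UPDATE orders SET status = 'paid' WHERE order_id = ?",
                (message["order_id"],),
            )
            conn.execute(
                "UPDATE webhook_inbox SET processed = 1 WHERE id = ?", (row_id,)
            )
    return len(rows)
```

The handler does one small write and nothing else, so an exhausted pool or a slow downstream call in the worker cannot prevent the acknowledgment or lose the message.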

2. Handler Returns 200 Before Write Confirms

Sequence:

Opens database transaction
Updates order status
Closes database connection (commit queued but not confirmed)
Returns HTTP 200 to payment processor
Database may or may not confirm the commit

Payment processor trusts the 200 and doesn't resend. But if the commit failed, the order never transitioned to "paid."

The worse variant:

Opens transaction
Updates order status
Commits the transaction
Error occurs in the response-building phase (serialization fails, validation check fails)
Returns an error (500, 503, 409, or any other non-2xx)

Now the payment processor sees failure and resends the webhook. Unless the handler is idempotent, your system processes the same payment twice.

Both are ordering bugs between the commit and the acknowledgment. They are hard to detect without distributed tracing or reconciliation.
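
One way to avoid both variants is to tie the status code to the outcome of the commit itself. The sketch below assumes an SQLite `orders` table and a hypothetical `handle_payment_webhook` function; it is an illustration of the ordering, not a complete handler.

```python
import sqlite3


def handle_payment_webhook(conn: sqlite3.Connection, order_id: str) -> int:
    """Return 200 only once the commit has succeeded; otherwise 500."""
    try:
        with conn:  # commit happens when this block exits without error
            cur = conn.execute(
                "UPDATE orders SET status = 'paid' WHERE order_id = ?",
                (order_id,),
            )
            if cur.rowcount != 1:
                # Raising inside the block rolls the transaction back.
                raise LookupError(f"unknown order {order_id}")
    except (sqlite3.Error, LookupError):
        return 500  # nothing was committed, so a resend is safe
    # Past this point the write is committed. Response building must not
    # turn a committed write into an error status.
    return 200
```

The 200 is returned only after the `with` block has exited cleanly, and nothing that can fail sits between the commit and the return.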

3. Multiple Microservices, No Coordination

Order service receives payment, updates order status, then calls fulfillment service to ship.

Fulfillment service is down. Ship request times out and gets discarded.

Customer's money is gone. Order marked paid. No fulfillment was requested. Fulfillment service comes back online later. No retry logic exists. The order service saw the timeout but kept no record that fulfillment is still owed.

Order sits in paid state forever, never shipped. This is a partial failure across a service boundary, and it needs explicit handling: either the order service retries systematically, or you have a human process to detect and fix mismatched orders.

4. Dual Write Problem

Receive webhook. Three things must happen:

Update the order database
Send the notification email
Record the payment in the accounting system

Write code that does all three. One fails. Database updated. Email failed. Customer's order marked paid, but they never got confirmation, and accounting doesn't know the payment came in.

Or: all three succeed, but email delivery is asynchronous and takes 3 minutes. The customer reloads the page and sees the paid status. The email arrives 3 minutes later. For those 3 minutes the page and the inbox disagree, and a late confirmation can look like a second charge.

You're trying to update two or more independent systems atomically. Without a shared transaction, that's impossible. The network can partition between the first write and the second. The second system might be slow. One system might reject the update for local reasons the first system didn't predict.

Every distributed system has dual writes hidden somewhere, and they are always wrong.

Why Consistency Matters

Inconsistency is not academic. Customer is out real money. Business owes fulfillment or refund. Until reconciled, the inconsistency represents untracked liability.

At scale: 0.1% of payment orders inconsistent. At 100,000 orders per month, that's 100 orders per month charged but not fulfilled.

At $50 per order, that is $5,000 per month, or $60,000 per year, in revenue at risk, before counting support tickets, refunds, and chargebacks.
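
The arithmetic behind those figures, written out with the numbers from the text. The variable names are illustrative only, and the rate is kept as an integer ratio to avoid floating-point rounding.

```python
orders_per_month = 100_000
inconsistent_per_thousand = 1   # 0.1% expressed as 1 in 1,000
order_value_usd = 50

affected_per_month = orders_per_month * inconsistent_per_thousand // 1000
annual_exposure_usd = affected_per_month * order_value_usd * 12

print(affected_per_month)    # 100
print(annual_exposure_usd)   # 60000
```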

Consistency is not a feature. It's the difference between a functioning business and one that hemorrhages money silently.

The Solutions

These are not theoretical. They are standard practice in payment systems that have survived production.

Solution 1: Idempotency

Every operation must have a unique identifier. The payment processor assigns a transaction ID. The webhook handler uses that ID as a key for the idempotent operation.

When the handler receives the webhook:

Check if already processed this transaction ID
If yes, return 200 immediately (idempotent success)
If no, process the request

Even if the webhook arrives twice, the second request returns the same result as the first. The database stays consistent.

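This requires storing the transaction ID in the database alongside the order, under a unique constraint. Let the database enforce it, so that the check and the write happen as one atomic step even when two deliveries arrive at once.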

Implementation:

INSERT INTO order_payments (transaction_id, order_id, status)
VALUES ($1, $2, 'paid')
ON CONFLICT (transaction_id) DO NOTHING;

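Transaction ID already exists? The insert is skipped: zero rows are affected and no error is raised. Order is already paid. Everything is fine. This only works if transaction_id has a unique constraint or is the primary key.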

Idempotency is not optional. Every external call that modifies state must be idempotent.
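
A sketch of how a handler might use that insert, assuming SQLite (which accepts the same `ON CONFLICT ... DO NOTHING` clause in version 3.24 and later) and a hypothetical `record_payment` function. The row count tells the handler whether this delivery was the first one.

```python
import sqlite3

# Hypothetical schema; transaction_id is the primary key, so it is unique.
SCHEMA = """
CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE order_payments (
    transaction_id TEXT PRIMARY KEY,
    order_id TEXT NOT NULL,
    status TEXT NOT NULL
);
"""


def record_payment(conn: sqlite3.Connection, transaction_id: str, order_id: str) -> bool:
    """Apply a payment once. Returns False for a duplicate delivery."""
    with conn:  # both statements commit together, or neither does
        cur = conn.execute(
            "INSERT INTO order_payments (transaction_id, order_id, status) "
            "VALUES (?, ?, 'paid') ON CONFLICT (transaction_id) DO NOTHING",
            (transaction_id, order_id),
        )
        if cur.rowcount == 0:
            return False  # already processed: caller still returns 200
        conn.execute(
            "UPDATE orders SET status = 'paid' WHERE order_id = ?", (order_id,)
        )
        return True
```

Either return value maps to HTTP 200 for the payment processor. The boolean only tells your own code whether side effects ran on this call.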

Solution 2: Event Sourcing / Append-Only Logs

Stop thinking of the order as a record with a "status" field. Instead, think of it as a sequence of events: PaymentReceived, OrderConfirmed, ShipmentInitiated, DeliveryCompleted.

When the webhook arrives, don't update the order. Write an event to an append-only log:

{
  "timestamp": "2024-10-15T14:23:14Z",
  "order_id": "ORD-123",
  "type": "PaymentReceived",
  "amount": 50.00
}

The event is immutable. It cannot fail partially. Either the event is written to the log, or it is not.

A separate process reads the event log. When it sees "PaymentReceived," it updates the order status to "paid." If that update fails, it retries. The event is still in the log. It eventually processes.

The source of truth becomes the event log, not the derived state in the orders table. The state table can be rebuilt from the log. Inconsistencies are eventually resolved.
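
Rebuilding derived state from the log can be as simple as a fold over the events. In this sketch the `Event` class, the `STATUS_AFTER` mapping, and every status name other than "paid" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: events are immutable once created
class Event:
    timestamp: str
    order_id: str
    type: str


# Hypothetical mapping from event type to the derived order status.
STATUS_AFTER = {
    "PaymentReceived": "paid",
    "OrderConfirmed": "confirmed",
    "ShipmentInitiated": "shipped",
    "DeliveryCompleted": "delivered",
}


def rebuild_status(log: list[Event]) -> dict[str, str]:
    """Fold the log, in append order, into the current status per order."""
    status: dict[str, str] = {}
    for event in log:
        if event.type in STATUS_AFTER:
            status[event.order_id] = STATUS_AFTER[event.type]
    return status
```

Because the function depends only on the log, running it again after a failed update gives the same answer, which is what makes retries safe.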

Operational benefit: Replay the order history. Understand exactly what happened and when.

Payment Received @ 14:23:14 → Order Confirmed @ 14:23:15 → Ship Initiated @ 14:25:00 → Delivered @ 16:30:00

This approach is more operationally complex, but it is a well-established pattern in systems that move money and need a complete audit trail.

Solution 3: Transactional Outbox

You need to update two things atomically: the database and the message system.

Write both in the same database transaction:

Update order table to mark status as "paid"
Insert a row into an outbox table: {notification_id, email_address, type}
Commit the transaction

A separate process polls the outbox table. It finds unprocessed rows, sends emails, and marks them done.

The key insight: the outbox write happens in the same transaction as the order update. They both succeed or both fail, atomically.

Failure handling: If the email send fails, the outbox row persists. A retry process sends the email later. The order state is already correct.

This pattern decouples the order update (which must be fast and atomic) from the notification dispatch (which can be slow and unreliable).
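
A compact sketch of the outbox flow, assuming SQLite and hypothetical names (`mark_paid`, `relay_outbox`, a `sent` flag column, and a `send_email` callable supplied by the caller). The outbox columns follow the ones listed above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE outbox (
    notification_id INTEGER PRIMARY KEY,
    email_address TEXT NOT NULL,
    type TEXT NOT NULL,
    sent INTEGER NOT NULL DEFAULT 0
);
"""


def mark_paid(conn: sqlite3.Connection, order_id: str, email_address: str) -> None:
    """Order update and outbox row share one transaction."""
    with conn:
        conn.execute(
            "UPDATE orders SET status = 'paid' WHERE order_id = ?", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (email_address, type) "
            "VALUES (?, 'payment_confirmation')",
            (email_address,),
        )


def relay_outbox(conn: sqlite3.Connection, send_email) -> int:
    """Send pending rows; a row is marked sent only after send_email returns."""
    pending = conn.execute(
        "SELECT notification_id, email_address, type FROM outbox "
        "WHERE sent = 0 ORDER BY notification_id"
    ).fetchall()
    sent = 0
    for notification_id, email_address, kind in pending:
        try:
            send_email(email_address, kind)
        except Exception:
            continue  # leave the row for the next poll
        with conn:
            conn.execute(
                "UPDATE outbox SET sent = 1 WHERE notification_id = ?",
                (notification_id,),
            )
        sent += 1
    return sent
```

Delivery here is at-least-once: a crash between the send and the `sent = 1` update causes a resend on the next poll, so the notification itself should tolerate duplicates.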

Solution 4: Saga Pattern

When multiple microservices must coordinate, you cannot have a single transaction across them. The network is unreliable and services may be down.

Instead, define a sequence of steps and make each step idempotent.

Orchestration (explicit coordinator):

Central service receives payment webhook
Calls service A, then service B, then service C
If any step fails, it triggers compensating actions
Step order and outcomes are explicitly logged

Choreography (event-driven):

Service A receives payment and emits an event
Service B listens for that event and performs the next step
Service C listens and performs the final step
Each step is independent and idempotent

Choreography is simpler (no central coordinator) but harder to debug (you follow events across logs). Orchestration is more visible but adds a central point of failure.

Both require compensation logic: If step 3 fails, what happens? Do you refund the payment? Re-queue the order? Alert a human?

Define this before shipping.
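
To make the compensation idea concrete, here is a minimal orchestrator sketch. The `Step` shape and the `run_saga` name are assumptions for illustration. Real orchestrators also persist their progress so they can resume after a crash, which this sketch does not do.

```python
from typing import Callable

# (name, action, compensation): each callable takes no arguments.
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order. On failure, undo completed steps in reverse."""
    log: list[str] = []
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return log
        completed.append(step)
        log.append(f"done:{name}")
    return log
```

The returned log doubles as the explicit record of what ran and what was undone. Compensations must themselves be idempotent, since a crash mid-rollback means they will be retried.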

Solution 5: Distributed Transactions (Generally: Don't)

Two-phase commit (2PC) is the academic solution: a coordinator asks every participant to prepare (lock and promise to commit), then tells them all to commit.

It's also fragile. If the coordinator fails after participants have prepared but before they hear the decision, they sit blocked, holding locks, until it recovers.

In practice, keep atomic updates inside a single database, where an ordinary local transaction already covers multiple tables and schemas. Don't use 2PC across independent services. The failure modes are worse than the inconsistency you're trying to prevent.

Operational Discipline

Technical patterns require discipline to work:

Every webhook handler must be idempotent. Not negotiable. Test by simulating duplicate webhook arrivals.
Acknowledge webhook after queueing, not after processing. Return 200 as soon as the message is durably queued. Process asynchronously.
Log everything. Every state transition logged with timestamp. You should be able to replay the order history and understand what happened and when.
Reconciliation jobs. Run a nightly job that checks for inconsistencies between order database and payment processor records. If a transaction is recorded in the processor but not in your database, investigate and reconcile.
Monitoring and alerts. If an order sits in "paid" state for more than N hours without a corresponding fulfillment request, alert someone. The system has failed.
Runbook for manual recovery. Document: Who handles refunds? Who contacts customer? Who updates the ledger?
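
The core of the reconciliation job in the list above is a set comparison. This sketch assumes you can export transaction IDs from both sides; the `reconcile` function and its result keys are hypothetical names.

```python
def reconcile(
    processor_txn_ids: set[str], recorded_txn_ids: set[str]
) -> dict[str, set[str]]:
    """Compare processor records against local records, in both directions."""
    return {
        # Charged by the processor, unknown to us: customer paid, order did not move.
        "charged_not_recorded": processor_txn_ids - recorded_txn_ids,
        # Recorded locally, unknown to the processor: order marked paid without a charge.
        "recorded_not_charged": recorded_txn_ids - processor_txn_ids,
    }
```

Both directions matter. The first set is the incident described at the top of this article; the second is fulfillment you are giving away.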

Consistency Failure Detection Pattern:

Every Hour: Run Reconciliation

┌─────────────────────────────────────────────┐
│ Select orders in 'paid' state               │
│ WHERE fulfillment_requested IS NULL         │
│ AND created_at < NOW() - INTERVAL '2 hours' │
└─────────────────────────────────────────────┘
          │
          ├─ Found orders? Alert immediately
          │
          └─ None found? System consistent
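
The check in the diagram could look like the following, assuming an SQLite `orders` table whose timestamps are stored as ISO-8601 strings in a single consistent UTC format (so string comparison matches time order). The function name `find_stuck_orders` is hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def find_stuck_orders(
    conn: sqlite3.Connection,
    now: datetime,
    max_age: timedelta = timedelta(hours=2),
) -> list[str]:
    """Return IDs of paid orders older than max_age with no fulfillment request."""
    cutoff = (now - max_age).isoformat()
    rows = conn.execute(
        "SELECT order_id FROM orders "
        "WHERE status = 'paid' "
        "AND fulfillment_requested IS NULL "
        "AND created_at < ? "
        "ORDER BY order_id",
        (cutoff,),
    ).fetchall()
    return [row[0] for row in rows]
```

A scheduler would call this hourly with `datetime.now(timezone.utc)` and alert on any non-empty result. The filter uses `created_at` as in the diagram. If orders can be paid long after creation, a separate paid-at timestamp would give fewer false alarms.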

The Cost of Ignoring This

Ship a payment system without thinking through consistency:

Customers encounter orders charged but not fulfilled
Support team spends time investigating inconsistencies instead of helping customers
Revenue leaks to refunds and chargebacks
Auditors ask uncomfortable questions about financial controls

Worse: consistency failures cluster. When your system is under load (holidays, marketing campaigns), the failure rate increases. You are least prepared to handle chaos exactly when the cost is highest.

Conclusion

Distributed systems do not guarantee consistency. Network partitions happen. Messages get delayed. Processes crash.

Acknowledge these facts in your architecture and plan accordingly:

Use idempotency for every state-changing operation
Use event logs as your source of truth
Use transactional outbox for notifications
Use sagas for multi-service coordination
Acknowledge webhooks once they are durably queued, before processing them
Log everything meticulously
Run reconciliation jobs regularly

These are not new patterns. Every payment system that survived more than a few weeks in production uses them. They're not free either—they add complexity, operational overhead, and storage.

But the cost of not using them is higher: you lose money, you lose customers, and you lose the ability to understand what happened to your own data.

Build it right the first time. Or explain to your CEO why the charge went through but the order didn't.
