Case Study: Fintech Platform Migration
12 November 2023 — Agbey Team
The Problem We Inherited
When the client came to us, their payments platform had been running for six years on a PHP 5.6 monolith backed by a single MySQL instance. It processed roughly 4,000 transactions a day across mobile money, card, and bank transfer channels — and it was held together by patches on top of patches.
The symptoms were hard to ignore:
- Deployment took 3–4 hours, required a maintenance window, and broke something roughly one in three times
- A single slow database query during peak hours would stall the entire payment pipeline
- There was no environment separation — developers tested against a sanitised copy of production
- Error handling was scattered: some payment failures silently updated nothing; others double-charged customers
- The team had stopped shipping features. Every sprint was dominated by firefighting
The business had grown to the point where the platform was the ceiling, not the floor.
Constraints That Shaped Every Decision
Before writing a single line of new code, we spent two weeks mapping the existing system. The constraints we uncovered defined the entire migration strategy:
- Zero downtime tolerance. The client processed salary disbursements for three corporate customers on the 25th–27th of every month. Any outage during that window was a contractual breach.
- Regulatory obligations. Transaction audit logs had to be immutable and accessible to regulators for seven years. We couldn't just migrate — we had to migrate and preserve the full historical record.
- Third-party integrations. Eight external payment providers were integrated via bespoke, undocumented HTTP calls. Several of them used deprecated API versions that the providers had sunset but not shut down.
- A three-person engineering team. The client couldn't pause feature development entirely during the migration. The new system had to be built alongside the old one.
> Constraints are not obstacles to good architecture. They are the architecture. — Internal engineering note
The Migration Strategy: Strangler Fig
We rejected a big-bang rewrite immediately. Rewriting everything in parallel and switching over on a fixed date is how projects fail publicly. Instead we used the Strangler Fig pattern: intercept traffic at the edge, route it to the new system one slice at a time, and retire the old system incrementally as each slice is proven stable.
The phases looked like this:
| Phase | Scope | Duration | Risk Level |
|---|---|---|---|
| 1 — Observability | Instrument the monolith, establish baselines | 2 weeks | None |
| 2 — Data layer | Migrate to PostgreSQL, run dual-writes | 3 weeks | Low |
| 3 — Payment routing | New service handles one provider; proxy routes to it | 4 weeks | Medium |
| 4 — Core pipeline | All payment channels on new stack | 6 weeks | High |
| 5 — Admin & reporting | Internal tooling migrated last | 3 weeks | Low |
| 6 — Decommission | Monolith taken offline | 1 week | Low |
No phase required downtime. Each one was reversible: if a phase introduced instability, the proxy router could be flipped back to the monolith within 60 seconds.
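The routing idea behind that reversibility can be sketched in a few lines. This is an illustration, not the client's actual proxy; the provider names and the `Backend` type are hypothetical:

```typescript
// Hypothetical per-provider routing table for the strangler fig proxy.
// Flipping a flag back to "monolith" is the fast rollback path.
type Backend = "monolith" | "new-stack";

const routingTable: Record<string, Backend> = {
  "provider-a": "new-stack", // migrated and proven stable
  "provider-b": "monolith", // not yet migrated
};

function routeFor(provider: string): Backend {
  // Unknown providers default to the monolith: the safe, existing path.
  return routingTable[provider] ?? "monolith";
}

function rollback(provider: string): void {
  // Reversibility: one flag flip, no redeploy of either system.
  routingTable[provider] = "monolith";
}
```

Because rollback is a data change rather than a code change, reverting a phase never depends on a build or deploy pipeline being healthy.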
Phase 1: Observability First
You cannot migrate what you cannot measure. Before touching the architecture, we instrumented the monolith.
We added structured logging to every payment state transition, wrapped every outbound HTTP call in a timing decorator, and imported the logs into a simple dashboard. Within 48 hours we had baselines:
- P50 transaction latency: 340ms
- P99 transaction latency: 4,200ms
- Error rate: 0.8% of all transactions (mix of provider timeouts and internal failures)
- Peak throughput: ~12 transactions/second during salary week
- Slowest recurring query: a `JOIN` across the transactions and reconciliation tables that ran on every status-check poll — 900ms average, no index on `created_at`
That last finding alone gave us a quick win. We added one composite index and dropped the P99 from 4,200ms to 1,100ms in the monolith before the new system touched a single payment. It bought goodwill with the client and demonstrated we understood the system.
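The fix was a single statement. The exact table and column names aren't reproduced in this write-up, so the names below are illustrative:

```sql
-- Illustrative composite index for the status-check poll: the leading
-- column serves the filter, created_at serves the sort/range scan.
CREATE INDEX idx_transactions_status_created_at
  ON transactions (status, created_at);
```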
Phase 2: The Data Migration
The monolith's MySQL schema had six years of organic growth written into it — nullable columns that should never be null, VARCHAR fields storing JSON strings, a status column that held 23 distinct string values with no enforced enum, and foreign key relationships that existed only in comments in PHP files.
We built the new PostgreSQL schema from first principles: what does a payment actually need to represent? The answer was a clean event-sourced model:
```sql
-- Every state change is a recorded event, never an overwrite
CREATE TABLE payment_events (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  payment_id  UUID NOT NULL REFERENCES payments(id),
  event_type  payment_event_type NOT NULL, -- enforced enum
  payload     JSONB NOT NULL,
  recorded_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  recorded_by TEXT NOT NULL -- service or user that triggered the event
);

CREATE INDEX idx_payment_events_payment_id ON payment_events(payment_id);
CREATE INDEX idx_payment_events_recorded_at ON payment_events(recorded_at DESC);
```
The current state of any payment is derived by replaying its events. This satisfied the regulator's immutability requirement: we never update or delete rows, so the audit trail is structural, not a policy.
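Replaying events to derive state is a pure fold over the event list. A minimal sketch, assuming simplified event and state shapes (the event names echo the table above, but the reducer itself is illustrative):

```typescript
// Illustrative replay: current payment state is a fold over its events.
type PaymentEvent = { eventType: string; payload: Record<string, unknown> };
type PaymentState = { status: string; amount?: number };

function replay(events: PaymentEvent[]): PaymentState {
  return events.reduce<PaymentState>(
    (state, event) => {
      switch (event.eventType) {
        case "initiated":
          return { ...state, status: "pending", amount: event.payload.amount as number };
        case "processing":
          return { ...state, status: "processing" };
        case "completed":
          return { ...state, status: "completed" };
        default:
          return state; // unknown events are ignored, never destructive
      }
    },
    { status: "none" },
  );
}
```

Because the reducer never mutates or discards events, adding a new derived field later only requires a replay, not a backfill migration.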
Dual-write period: For three weeks, every payment write went to both databases. A reconciliation job ran hourly and flagged any row that existed in one but not the other, or where derived state diverged. We found and fixed six edge cases during this period — all involving the monolith's silent failure paths — before we routed a single live transaction to the new stack.
Phase 3: Isolating the Payment Providers
Each of the eight payment providers had its own integration quirks. Rather than wrapping them all in a single "provider adapter", we treated each as its own isolated service with a shared interface:
```typescript
interface PaymentProvider {
  initiate(request: PaymentRequest): Promise<ProviderResponse>;
  verify(reference: string): Promise<VerificationResult>;
  handleWebhook(payload: unknown): Promise<WebhookResult>;
}
```
This forced every provider integration to implement the same contract, which meant:
- Swapping a provider required changing exactly one file
- Testing a provider integration required mocking exactly one interface
- A provider outage was isolated — it couldn't cascade into the rest of the pipeline
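To make the contract concrete, here is what one such integration might look like. The provider itself is hypothetical and the request/response types are simplified stand-ins; only the interface shape comes from the write-up:

```typescript
// Minimal types mirroring the shared contract (simplified for this sketch).
type PaymentRequest = { idempotencyKey: string; amount: number };
type ProviderResponse = { reference: string; accepted: boolean };
type VerificationResult = { status: "pending" | "completed" | "failed" };
type WebhookResult = { handled: boolean };

interface PaymentProvider {
  initiate(request: PaymentRequest): Promise<ProviderResponse>;
  verify(reference: string): Promise<VerificationResult>;
  handleWebhook(payload: unknown): Promise<WebhookResult>;
}

// Hypothetical provider: each integration lives in one module, so swapping
// or mocking it touches exactly one file.
class ExampleProvider implements PaymentProvider {
  async initiate(request: PaymentRequest): Promise<ProviderResponse> {
    // A real integration would call the provider's HTTP API here.
    return { reference: `ex-${request.idempotencyKey}`, accepted: true };
  }
  async verify(reference: string): Promise<VerificationResult> {
    return { status: "pending" };
  }
  async handleWebhook(payload: unknown): Promise<WebhookResult> {
    return { handled: typeof payload === "object" && payload !== null };
  }
}
```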
We migrated the lowest-volume provider first (roughly 80 transactions/day), running it through the new service while the proxy routed all other providers to the monolith. We watched it for two weeks under real traffic before moving to the next. By the time we tackled the highest-volume provider, we had seven previous migrations behind us and every failure mode already encountered.
Phase 4: The Core Pipeline
The core payment pipeline in the new system was built around three principles we don't compromise on in financial systems:
1. Idempotency everywhere. Every payment initiation accepts an idempotency_key. Duplicate requests with the same key return the original result — they never create a second transaction. This eliminated double-charge scenarios that had cost the client real money in customer refunds.
```typescript
async function initiatePayment(request: PaymentRequest): Promise<Payment> {
  const existing = await db.payments.findByIdempotencyKey(
    request.idempotencyKey,
  );
  if (existing) return existing; // safe to return — same result guaranteed

  return db.transaction(async (tx) => {
    const payment = await tx.payments.create({ ...request, status: "pending" });
    await tx.paymentEvents.create({
      paymentId: payment.id,
      eventType: "initiated",
    });
    return payment;
  });
}
```
2. Explicit state machines. The old system's 23-value status string was replaced with a validated state machine. Illegal transitions throw — they don't silently write bad state.
```
pending → processing → completed
pending → processing → failed
pending → cancelled
processing → refund_pending → refunded
```
Any code that tries to move a payment from completed directly to processing gets an exception, not a corrupted row.
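The guard itself can be as small as an adjacency table. A sketch using the status names from the diagram above (the encoding is illustrative, not the production implementation):

```typescript
// Legal transitions as an explicit adjacency table; anything else throws.
const transitions: Record<string, string[]> = {
  pending: ["processing", "cancelled"],
  processing: ["completed", "failed", "refund_pending"],
  refund_pending: ["refunded"],
};

function assertTransition(from: string, to: string): void {
  if (!(transitions[from] ?? []).includes(to)) {
    // Illegal moves fail loudly instead of writing corrupted state.
    throw new Error(`Illegal payment transition: ${from} -> ${to}`);
  }
}
```

Terminal states (`completed`, `failed`, `cancelled`, `refunded`) simply have no entry, so every transition out of them is rejected by default.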
3. Reconciliation as a first-class feature. An automated reconciliation service runs every 15 minutes, comparing our internal state against provider records. Discrepancies are flagged in a queue for human review within the hour. During the first month, it caught 14 reconciliation gaps — all edge cases involving provider webhook delivery failures that would previously have been invisible.
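At its core, that service is a diff of two reference-to-status maps: our records against the provider's. The shapes below are hypothetical, but they show where webhook delivery failures surface as discrepancies:

```typescript
// Illustrative reconciliation: compare internal state against provider
// records and emit one discrepancy per mismatch or missing reference.
type Discrepancy = { reference: string; ours?: string; theirs?: string };

function reconcile(
  ours: Map<string, string>, // our reference -> status
  theirs: Map<string, string>, // provider reference -> status
): Discrepancy[] {
  const flagged: Discrepancy[] = [];
  const refs = new Set(
    Array.from(ours.keys()).concat(Array.from(theirs.keys())),
  );
  for (const reference of refs) {
    const a = ours.get(reference);
    const b = theirs.get(reference);
    if (a !== b) flagged.push({ reference, ours: a, theirs: b });
  }
  return flagged;
}
```

A payment the provider completed but whose webhook we never received shows up as `ours: "processing", theirs: "completed"`, which is exactly the class of gap the service caught 14 times in its first month.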
Results After Six Months
| Metric | Before | After |
|---|---|---|
| Deployment time | 3–4 hours + window | 8 minutes, zero downtime |
| P50 transaction latency | 340ms | 95ms |
| P99 transaction latency | 4,200ms | 380ms |
| Transaction error rate | 0.8% | 0.04% |
| Double-charge incidents | ~3/month | 0 |
| Feature release cadence | ~1 per quarter | 2–3 per month |
| Salary week incidents | 2–3 per cycle | 0 in 6 cycles |
The team went from spending roughly 60% of sprint capacity on incidents and maintenance to less than 15%. The rest went to product.
What We Would Do Differently
No migration is perfect. Three things we'd approach differently with the benefit of hindsight:
1. Migrate the webhook layer earlier. We left provider webhook handling in the monolith longer than we should have, which meant running two reconciliation pipelines in parallel during Phases 3 and 4. It wasn't a crisis, but it added complexity we could have avoided.
2. Build the admin tooling in parallel, not last. The operations team was working in the old admin UI while payments ran on the new stack for two months. This created a support burden — they were reading logs from two systems to trace a single transaction. We should have prioritised their tooling earlier.
3. Automate the load test from day one. We had good observability, but our load testing was manual until Phase 4. Scripted load tests from Week 1 would have caught the database connection pool exhaustion we hit during the first salary week on the new stack — which we fixed in 20 minutes but could have found before it ever reached production.
The Bigger Point
A migration like this isn't primarily a technical exercise. It's a risk management exercise that happens to involve writing code. The decisions that mattered most — strangler fig over big bang, dual-writes over cutover, event sourcing over overwrites, idempotency keys on every endpoint — were all answers to business risk questions, not engineering preferences.
Systems get legacy because businesses grow faster than their initial assumptions. The goal isn't to build the system that will never need migration. The goal is to build the next system so that the following migration is significantly cheaper.