Database Failover Strategies: Passive vs Active

Tags: postgresql, high availability, failover, consistency

For a financial system, data must be available 99.99% of the time, which leaves a budget of roughly 53 minutes of downtime per year. When the primary database fails, the system must switch to a standby replica automatically and without losing data. This transition is known as failover.

Replication Modes: The Consistency Trade-off

Choosing between synchronous and asynchronous replication depends primarily on your RPO (Recovery Point Objective): the maximum amount of data you can afford to lose when a failure occurs.

| Mode | RPO | Latency | Use Case |
|---|---|---|---|
| Synchronous | 0 (No Data Loss) | High | Internal Ledger, Banking Core |
| Asynchronous | > 0 (Small Loss) | Low | Logging, User Profiles, Analytics |

[!CAUTION] Synchronous replication can cause write stalls: the primary cannot acknowledge a commit until the standby confirms it, so the primary can become unresponsive to writes if the network between nodes is unstable.
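
The trade-off between the two modes can be sketched as a simple latency and data-loss model. This is an illustrative approximation, not a measurement: the `ReplicationLink` fields and every timing value are assumptions, and real systems overlap some of these steps.

```python
from dataclasses import dataclass


@dataclass
class ReplicationLink:
    """Illustrative timings in milliseconds; all values are assumptions."""
    local_flush_ms: float    # primary flushes the commit record locally
    network_rtt_ms: float    # round trip between primary and standby
    standby_flush_ms: float  # standby flushes the record before confirming


def commit_latency_ms(link: ReplicationLink, synchronous: bool) -> float:
    """Estimate how long a client waits for a commit acknowledgement."""
    if not synchronous:
        # Asynchronous: acknowledge once the primary has flushed locally.
        return link.local_flush_ms
    # Synchronous: also wait for the standby to receive, flush, and confirm.
    return link.local_flush_ms + link.network_rtt_ms + link.standby_flush_ms


def worst_case_loss_bytes(lag_seconds: float, write_rate_bytes_per_s: float,
                          synchronous: bool) -> float:
    """Rough upper bound on unreplicated data if the primary dies now."""
    if synchronous:
        return 0.0  # every acknowledged commit is already on the standby
    return lag_seconds * write_rate_bytes_per_s


healthy = ReplicationLink(local_flush_ms=1.0, network_rtt_ms=0.5, standby_flush_ms=1.0)
degraded = ReplicationLink(local_flush_ms=1.0, network_rtt_ms=200.0, standby_flush_ms=1.0)

print(commit_latency_ms(healthy, synchronous=True))    # sync on a healthy link
print(commit_latency_ms(degraded, synchronous=True))   # sync on a degraded link
print(commit_latency_ms(degraded, synchronous=False))  # async ignores the link
```

In this model a degraded link inflates every synchronous commit, while asynchronous commits stay fast but leave a window of unreplicated writes.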

High-Availability Cluster Design

A modern HA setup relies on distributed consensus to avoid the dreaded split-brain scenario.

The Split Brain Problem

When two database nodes both think they are the "Primary" due to a network partition, they can both accept writes, leading to irreversible data corruption.

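The Solution: A quorum-based mechanism (a majority of ⌊N/2⌋ + 1 out of N nodes) or a dedicated cluster manager (like Patroni for PostgreSQL) that ensures only one node is elected as the leader at any time. Patroni typically delegates consensus to a distributed configuration store such as etcd or Consul (both Raft-based) or ZooKeeper (which uses its own ZAB protocol); Paxos is the classic alternative in the same family of algorithms.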
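
The quorum rule can be checked with a few lines of arithmetic. The sketch below is a minimal illustration; the function names `quorum_size` and `may_elect_leader` are hypothetical and not part of any cluster manager's API.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def may_elect_leader(reachable_nodes: int, cluster_size: int) -> bool:
    """A partition may elect a leader only if it holds a majority."""
    return reachable_nodes >= quorum_size(cluster_size)


# A 3-node cluster split into partitions of 2 and 1 nodes:
print(may_elect_leader(2, 3))  # majority side
print(may_elect_leader(1, 3))  # minority side

# A 4-node cluster split evenly into 2 and 2:
print(may_elect_leader(2, 4))  # neither side holds a majority
```

Because two disjoint partitions can never both hold a majority, at most one side can elect a leader. The even split also shows why odd cluster sizes are usually preferred: a 2/2 partition leaves the whole cluster without a leader.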

Failure Scenario Analysis

| Incident | Detection | Resulting Action |
|---|---|---|
| Primary Process Crash | Immediate (PID lost) | Standby promoted in under 5 s. |
| Network Partition | Timeout (Keep-alive) | Cluster manager drops the leader lock; new election begins. |
| Storage Failure | I/O Error | Node enters "Failed" state; manual intervention required. |
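
The network partition case depends on a leader lock that expires when it is not renewed. The sketch below models that idea with an in-memory lease; the `LeaderLock` class is a hypothetical stand-in for a TTL-based key in a consensus store, and the clock is passed in explicitly so the behavior is deterministic.

```python
from typing import Optional


class LeaderLock:
    """In-memory stand-in for a TTL-based leader key (illustrative only)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def _expire(self, now: float) -> None:
        # Drop the lock once the holder has failed to renew it in time.
        if self.holder is not None and now >= self.expires_at:
            self.holder = None

    def acquire_or_renew(self, node: str, now: float) -> bool:
        """Take a free lock, or extend it if `node` already holds it."""
        self._expire(now)
        if self.holder is None or self.holder == node:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False

    def leader(self, now: float) -> Optional[str]:
        self._expire(now)
        return self.holder


lock = LeaderLock(ttl_seconds=10)
lock.acquire_or_renew("node-a", now=0)         # node-a becomes leader
print(lock.acquire_or_renew("node-b", now=5))  # rejected: lock is still held
# node-a is partitioned away and stops renewing...
print(lock.leader(now=10))                     # lock has expired
print(lock.acquire_or_renew("node-b", now=11)) # node-b wins the new election
```

The TTL bounds how long the cluster stays leaderless after a partition; a shorter TTL speeds up failover but makes false positives more likely on a briefly congested network.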

Critical Monitoring Metrics

To ensure a healthy failover, monitor these KPIs:

  • Replication Lag (Bytes/Seconds): How far behind is the replica?
  • Failover Attempt Counts: Frequency of automatic node switches.
  • Disk I/O Latency: Ensure the replica can keep up with the primary's write throughput.
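
Replication lag in bytes can be derived from two WAL positions. PostgreSQL prints a log sequence number (LSN) as two hexadecimal numbers separated by a slash, representing the high and low 32 bits of a 64-bit position. The sketch below assumes the two LSN strings have already been fetched, for example from `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on the replica; the helper names and the alert threshold are hypothetical.

```python
def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica has not yet replayed (never negative)."""
    return max(0, parse_lsn(primary_lsn) - parse_lsn(replica_lsn))


def lag_exceeds(primary_lsn: str, replica_lsn: str, threshold_bytes: int) -> bool:
    """True when the lag is large enough to warrant an alert."""
    return replication_lag_bytes(primary_lsn, replica_lsn) > threshold_bytes


# Hypothetical sample positions; the 16 MiB threshold is an assumed value.
print(replication_lag_bytes("0/3000060", "0/3000000"))
print(lag_exceeds("1/0", "0/3000000", threshold_bytes=16 * 1024 * 1024))
```

On the server side, `pg_wal_lsn_diff()` performs the same subtraction, so a client-side helper like this is mainly useful when the two positions are collected from different nodes by a monitoring agent.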

[!TIP] Engineering Advice: Always test your failover strategy in a "Chaos Engineering" experiment. Unplug a node in staging and observe how your application handles the connection reset.
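
One behavior worth observing in such an experiment is whether the application reconnects after the reset. The sketch below shows one possible client-side approach, retrying with exponential backoff; `run_with_reconnect` and its parameters are hypothetical, and `operation` stands in for whatever call your database driver exposes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_reconnect(operation: Callable[[], T],
                       attempts: int = 5,
                       base_delay_s: float = 0.1,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `operation` on connection errors, doubling the delay each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            # ConnectionResetError is a subclass of ConnectionError.
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            sleep(base_delay_s * 2 ** attempt)
    raise AssertionError("unreachable")


# Simulated failover: the first two calls hit a reset, the third succeeds.
calls = {"count": 0}


def flaky_query() -> str:
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ConnectionResetError("connection reset by peer")
    return "ok"


delays = []  # record the backoff delays instead of actually sleeping
result = run_with_reconnect(flaky_query, sleep=delays.append)
print(result, delays)
```

Blind retries are only safe for idempotent operations. A write whose connection dropped mid-commit may or may not have been applied, so the application needs a way to verify the outcome before it resubmits.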