Monitoring The PostgreSQL WAL Log
What is the PostgreSQL WAL?
Write-Ahead Logging (WAL) is how PostgreSQL records changes before applying them to the database. This ensures data durability, consistency, and recovery. WAL also enables Change Data Capture (CDC) for real-time streaming solutions like Streamkap.
Starting with PostgreSQL 16, read replicas support logical replication, allowing pipelines to run from a replica—depending on your cloud provider.
How WAL Works
WAL continuously generates log files capturing all database changes. These logs help restore transactions after a crash and enable replication. Each WAL file is usually 16MB, and the number generated depends on database activity. Since WAL writes sequentially, it reduces random I/O and improves performance.
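The 16MB segment size is only the default (it can be changed at initdb time), so it is worth confirming what your cluster actually uses. Both of the following are standard PostgreSQL commands:

```sql
-- Show this cluster's WAL segment size (16MB by default)
SHOW wal_segment_size;

-- Show the current end of the WAL stream
SELECT pg_current_wal_lsn();
```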
Factors Affecting WAL Growth
WAL size increases with database activity, particularly:
- Frequent Transactions: More inserts, updates, and deletes generate more WAL.
- Retention Policies: Keeping WAL files for recovery or replication increases storage needs. Streamkap recommends a 3-day retention.
- Publications: Define which tables and changes are logged for replication.
- Replication Slots: Ensure WAL is retained for subscribers (e.g., Streamkap). Unused or slow slots cause WAL buildup.
- Replication Delays: If subscribers lag, WAL accumulates, consuming disk space.
As subscribers consume data, retained WAL shrinks. The `restart_lsn` column in `pg_replication_slots` shows how far back WAL must be retained for each slot.
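To see how much WAL each slot is holding back, diff the current write position against each slot's `restart_lsn`. On PostgreSQL 13+ the `wal_status` and `safe_wal_size` columns add an at-a-glance health check:

```sql
-- WAL retained on disk for each replication slot
SELECT slot_name,
       active,
       wal_status,  -- 'reserved', 'extended', 'unreserved', or 'lost' (PG13+)
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```

A slot showing `lost` can no longer be used by its subscriber and should be investigated or dropped.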
Disk Space Management
Ensure at least 3 days of WAL storage to handle replication delays. Retaining 5+ days is ideal for troubleshooting issues.
Determining Disk Space Requirements
Use `pg_wal_lsn_diff()` to measure WAL generation. The following query reports the total WAL generated since the cluster was initialized; sampling it at intervals gives the generation rate:
`SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/00000000'));`
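Because that query reports a cumulative total, the generation rate comes from taking two samples a known interval apart. One way to do this interactively (this sketch uses the psql-specific `\gset` to store the first sample):

```sql
-- Capture the current LSN into a psql variable
SELECT pg_current_wal_lsn() AS start_lsn \gset

-- Wait 60 seconds, then measure how much WAL was generated in that window
SELECT pg_sleep(60);
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :'start_lsn')) AS wal_per_minute;
```

Run the sample during both typical and peak load, since the rate can vary widely.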
Consider these factors when determining storage requirements:
- Transaction Volume: Analyze typical and peak transaction rates to estimate WAL growth.
- Retention Needs: Multiply average WAL generation by the desired retention period (e.g., 3 days).
- Auto-Growth Limits: Configure auto-growth policies but don’t rely solely on them for space management.
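As a sketch of the retention math: if sampling shows the cluster generates roughly 2 GB of WAL per hour (a hypothetical figure — substitute your measured rate), a 3-day buffer works out as follows:

```sql
-- 2 GB/hour (assumed rate) * 72 hours of retention headroom
SELECT pg_size_pretty(2::numeric * 1024 * 1024 * 1024 * 72) AS wal_disk_needed;
-- 144 GB
```

Add comfortable margin on top of this figure, since peak activity and stalled subscribers can push WAL well above the average.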
Reducing WAL Growth
To prevent excessive WAL accumulation:
- Limit publications to essential tables and columns.
- Retain at least 3 days of WAL but avoid unnecessary extensions.
- Monitor replication slots—inactive slots cause WAL buildup. List slots with `SELECT * FROM pg_replication_slots;` and drop unused ones with `SELECT pg_drop_replication_slot('{slot}');`
- Run `VACUUM` and `ANALYZE` operations regularly.
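For example, restricting a publication to the tables (and, on PostgreSQL 15+, the columns) a pipeline actually needs keeps the replicated change stream lean. The publication, table, and column names below are placeholders:

```sql
-- Publish only the tables a pipeline needs, instead of FOR ALL TABLES
CREATE PUBLICATION streamkap_pub FOR TABLE orders, customers;

-- PostgreSQL 15+: publish only selected columns of a table
ALTER PUBLICATION streamkap_pub SET TABLE orders (id, status, updated_at);
```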
Essential Metrics for Oversight
WAL Disk Utilization
- Datadog: postgresql.wal_bytes
- New Relic: Custom query needed, e.g., SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/00000000'));
- Grafana: Use pg_stat_wal.wal_bytes from the pg_stat_wal view
WAL Generation Velocity
- Datadog: postgresql.wal_write (to approximate velocity)
- New Relic: Custom query comparing two LSN samples taken a known interval apart, e.g., SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '<lsn sampled one minute earlier>'));
- Grafana: Use pg_stat_wal.wal_bytes and calculate the change over time
Replication Lag
- Datadog: postgresql.replication_delay (or monitor pg_stat_replication.flush_lag directly)
- New Relic: Custom query on pg_stat_replication, e.g., SELECT pg_size_pretty(pg_wal_lsn_diff(sent_lsn, write_lsn)) AS lag FROM pg_stat_replication;
- Grafana: Monitor pg_stat_replication.flush_lag or pg_stat_replication.write_lag from the pg_stat_replication view
Replication Slot Status
- Datadog: Custom metric needed, query pg_replication_slots
- New Relic: Custom query, e.g., SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag FROM pg_replication_slots;
- Grafana: Query pg_replication_slots for active, restart_lsn, etc.
WAL File Count
- Datadog: Custom metric based on pg_ls_dir('pg_wal'), e.g., SELECT count(*) FROM pg_ls_dir('pg_wal') WHERE pg_ls_dir ~ '^[0-9A-F]{24}$';
- New Relic: Custom query, e.g., SELECT count(*) FROM pg_ls_dir('pg_wal') WHERE pg_ls_dir ~ '^[0-9A-F]{24}$';
- Grafana: Use the same count(*) query against pg_ls_dir('pg_wal') to chart the number of WAL files