Monitoring The PostgreSQL WAL Log

Guide to monitoring the PostgreSQL WAL Log

What is the PostgreSQL WAL?

Write-Ahead Logging (WAL) is how PostgreSQL records changes before applying them to the database. This ensures data durability, consistency, and recovery. WAL also enables Change Data Capture (CDC) for real-time streaming solutions like Streamkap.

Starting with PostgreSQL 16, read replicas support logical replication, allowing pipelines to run from a replica—depending on your cloud provider.

How WAL Works

WAL continuously generates log files capturing all database changes. These logs help restore transactions after a crash and enable replication. Each WAL file is usually 16MB, and the number generated depends on database activity. Since WAL writes sequentially, it reduces random I/O and improves performance.

Factors Affecting WAL Growth

WAL size increases with database activity, particularly:

  • Frequent Transactions: More inserts, updates, and deletes generate more WAL.
  • Retention Policies: Keeping WAL files for recovery or replication increases storage needs. Streamkap recommends a 3-day retention.
  • Publications: Define which tables and changes are logged for replication.
  • Replication Slots: Ensure WAL is retained for subscribers (e.g., Streamkap). Unused or slow slots cause WAL buildup.
  • Replication Delays: If subscribers lag, WAL accumulates, consuming disk space.

As subscribers consume data, WAL size decreases. The restart_lsn in pg_replication_slots shows how far back WAL is retained per slot.

Disk Space Management

Ensure at least 3 days of WAL storage to handle replication delays. Retaining 5+ days is ideal for troubleshooting issues.

Determining Disk Space Requirements

Use pg_wal_lsn_diff() to measure WAL generation over time:

`SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/00000000'));`

Consider these factors when determining storage requirements:

  • Transaction Volume: Analyze typical and peak transaction rates to estimate WAL growth.
  • Retention Needs: Multiply average WAL generation by the desired retention period (e.g., 3 days).
  • Auto-Growth Limits: Configure auto-growth policies but don’t rely solely on them for space management.

Reducing WAL Growth

To prevent excessive WAL accumulation:

  • Limit publications to essential tables and columns.
  • Retain at least 3 days of WAL but avoid unnecessary extensions.
  • Monitor replication slots—inactive slots cause WAL buildup.
-- Monitor slots

SELECT * FROM pg_replication_slots;

-- Drop Slot
SELECT pg_drop_replication_slot('{slot}');
  • Ensure you are using VACUUM and ANALYZE operations regularly

Essential Metrics for Oversight

WAL Disk Utilization

  • Datadog: postgresql.wal_bytes
  • New Relic: Custom query needed, e.g., SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/00000000'));
  • Grafana: Use pg_stat_wal.wal_bytes from the pg_stat_wal view

WAL Generation Velocity

  • Datadog: postgresql.wal_write (to approximate velocity)
  • New Relic: Custom query for rate, e.g., SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/00000000')) - pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/00000000')::pg_lsn - interval '1 minute')
  • Grafana: Use pg_stat_wal.wal_bytes and calculate the change over time

Replication Lag

  • Datadog: pg_stat_replication.flush_lag
  • New Relic: Custom query on pg_stat_replication, e.g., SELECT sent_lsn - write_lsn AS lag FROM pg_stat_replication;
  • Grafana: Monitor pg_stat_replication.flush_lag or pg_stat_replication.write_lag from the pg_stat_replication view

Replication Slot Status

  • Datadog: Custom metric needed, query pg_replication_slots
  • New Relic: Custom query, e.g., SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag FROM pg_replication_slots;
  • Grafana: Query pg_replication_slots for active, restart_lsn, etc.

WAL File Count

  • Datadog: Custom metric based on pgls_dir('pg_wal'), e.g., SELECT count() FROM pg_ls_dir('pg_wal') WHERE pg_ls_dir ~ '^[0-9A-F]{24}$';
  • New Relic: Custom query, e.g., SELECT count(_) FROM pg_ls_dir('pg_wal') WHERE pg_ls_dir ~ '^[0-9A-F]{24}$';
  • Grafana: Use a query like SELECT count(*) FROM pg_ls_dir('pg_wal') WHERE pg_ls_dir ~ '^[0-9A-F]{24}$'; to get the count of WAL files