Monitoring The PostgreSQL WAL Log

Guide to monitoring the PostgreSQL WAL Log

What is the PostgreSQL WAL?

Write-Ahead Logging (WAL) is an internal mechanism within PostgreSQL, engineered to guarantee durability, consistency, and recoverability. WAL ensures that any modification to the database is recorded sequentially in a log file before being committed to the data files. Because every change is recorded in the WAL first, it can also be read to stream real-time updates out of PostgreSQL using Change Data Capture (CDC).

From PostgreSQL 16 onwards, standby replicas support logical replication, so you can run your CDC pipeline off a replica and reduce the load and risk on the production primary.
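
Logical decoding, which CDC relies on, requires the server's WAL level to be set to logical. A minimal check and configuration sketch (the setting only takes effect after a restart, and managed services typically expose it through their parameter settings):

-- Check the current WAL level; CDC via logical decoding requires 'logical'
SHOW wal_level;

-- Enable logical decoding (requires a server restart to take effect)
ALTER SYSTEM SET wal_level = 'logical';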

WAL Mechanics

WAL operates through the continuous generation of log files that encapsulate all transactional changes. These log files allow PostgreSQL to replay transactions post-crash, thereby maintaining database consistency. Each WAL file adheres to a fixed size, typically 16MB, and their generation rate corresponds directly to database activity levels. Moreover, WAL serves as a primary medium for data replication, transmitting changes to replica nodes efficiently. Sequential writing reduces random I/O, thereby enhancing performance.
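
You can observe these mechanics directly; for example, the current write position, the segment file it maps to, and the configured segment size:

-- Current WAL write position (an LSN)
SELECT pg_current_wal_lsn();

-- The WAL segment file that position falls in
SELECT pg_walfile_name(pg_current_wal_lsn());

-- Configured segment size (16MB by default)
SHOW wal_segment_size;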

WAL Growth and Contributing Factors

The volume of WAL files escalates with increased database activity, including transactional operations such as inserts, updates, and deletes. Here are some of the factors that drive WAL growth:

  • Transaction Intensity: High-frequency data modifications result in accelerated WAL generation.
  • Retention Policies: Extended retention of WAL files for recovery or replication guarantees can substantially amplify storage demands. Streamkap recommends retaining at least three days of WAL.
  • Publications: Define which tables and change types are made available for logical replication, and therefore which changes subscribers receive (see the catalog queries after this list).
  • Replication Slots: Ensure data availability for replication by preserving WAL files until they are consumed by subscribers, including Streamkap. Unused or lagging slots exacerbate WAL growth, consuming disk resources unnecessarily.
  • Replication Delays: Logical replication depends on publications and replication slots; if subscribers lag or are inactive, slots cannot advance and WAL accumulates.
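
For example, the publication catalogs show what is currently published for logical replication:

-- Publications and the change types they publish
SELECT pubname, puballtables, pubinsert, pubupdate, pubdelete
FROM pg_publication;

-- Tables covered by each publication
SELECT pubname, schemaname, tablename
FROM pg_publication_tables;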

WAL volumes naturally decrease as subscribers such as Streamkap consume logged data. However, inadequate subscriber activity or the persistence of unused slots can precipitate unchecked growth, straining storage capacity.

The restart_lsn in pg_replication_slots provides insight into how far back WAL files must be retained for each slot.
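
For example, the following query shows how much WAL each slot is currently holding back, measured from its restart_lsn:

-- WAL retained on behalf of each replication slot
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;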

Disk Space Management for WAL

To accommodate WAL growth, allocate disk space deliberately. Provision sufficient space to hold at least three days of WAL data for CDC workflows, which mitigates the risk of replication lag or subscriber downtime filling the disk.

Determining Disk Space Requirements

Use pg_wal_lsn_diff() to measure how much WAL has been generated. For example, the total WAL written since LSN 0/0:

SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/00000000'));
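
That query reports the cumulative total. To measure generation over a specific window, capture pg_current_wal_lsn() at the start and end of the window and diff the two values; a minimal sketch, where the LSNs below are placeholder values captured at each point in time:

-- Capture the position at the start of the window
SELECT pg_current_wal_lsn();  -- e.g. returns 0/4A000000

-- Capture it again at the end of the window, then diff the two values
SELECT pg_size_pretty(pg_wal_lsn_diff('0/5D000000', '0/4A000000')) AS wal_generated;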

Then factor in the following to determine how much spare disk space you require:

  • Transaction Analysis: Evaluate typical and peak transaction rates to understand WAL generation patterns.
  • Retention Period Calculation: Multiply the average generation rate by the desired retention period (e.g., three days); see the worked example after this list.
  • Database Auto-Growth Constraints: Assess and configure disk auto-growth policies, keeping in mind limits on maximum growth rates and thresholds; do not rely on auto-growth alone.
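
As a worked example with assumed figures: if the database generates roughly 2 GB of WAL per hour at peak, a three-day retention window needs about 2 GB x 24 x 3 = 144 GB, plus headroom for spikes. The same arithmetic in SQL:

-- Assumed peak rate of 2 GB/hour retained for 3 days (72 hours)
SELECT pg_size_pretty(2::numeric * 1024 * 1024 * 1024 * 72) AS wal_retention_estimate;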

Techniques for WAL Growth Mitigation

To curtail excessive WAL expansion, implement the following methodologies:

  • Restrict publications to essential tables and columns (see the example after this list)
  • Retain a minimum of three days of WAL; it is unlikely you will need more than this
  • Monitor replication slots and ensure they are active. If a slot is inactive, the WAL will keep growing until the slot catches up or is dropped:

-- Monitor slots
SELECT * FROM pg_replication_slots;

-- Drop an unused slot (the slot name is a placeholder)
SELECT pg_drop_replication_slot('slot_name');

  • Run VACUUM and ANALYZE regularly
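
For the first point, a publication limited to the tables you actually need keeps the logical change stream small; the publication and table names below are placeholders, and column lists require PostgreSQL 15 or later:

-- Publish only the tables required for CDC
CREATE PUBLICATION streamkap_pub FOR TABLE public.orders, public.customers;

-- PostgreSQL 15+ also supports column lists, e.g.
-- CREATE PUBLICATION streamkap_pub FOR TABLE public.orders (id, status, updated_at);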

Essential Metrics for Oversight

WAL Disk Utilization

  • Datadog: postgresql.wal_bytes
  • New Relic: Custom query needed, e.g., SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/00000000'));
  • Grafana: Use pg_stat_wal.wal_bytes from the pg_stat_wal view

WAL Generation Velocity

  • Datadog: postgresql.wal_write (to approximate velocity)
  • New Relic: Custom query for rate; sample pg_current_wal_lsn() on a schedule and compute pg_wal_lsn_diff() between consecutive samples
  • Grafana: Use pg_stat_wal.wal_bytes and calculate the change over time (see the query after this list)
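
pg_stat_wal (PostgreSQL 14+) is cumulative, so velocity is the difference between two samples divided by the sampling interval; the underlying query the dashboard needs is simply:

-- Cumulative WAL bytes since the last stats reset; sample periodically
-- and compute the per-second delta in the dashboard
SELECT wal_bytes, stats_reset FROM pg_stat_wal;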

Replication Lag

  • Datadog: pg_stat_replication.flush_lag
  • New Relic: Custom query on pg_stat_replication, e.g., SELECT application_name, pg_wal_lsn_diff(sent_lsn, write_lsn) AS lag_bytes FROM pg_stat_replication;
  • Grafana: Monitor pg_stat_replication.flush_lag or pg_stat_replication.write_lag from the pg_stat_replication view
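
A fuller view combines byte lag against the server's current position with the built-in time-based lag columns; for example:

-- Byte and time lag per connected replication client
SELECT application_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS bytes_behind,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;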

Replication Slot Status

  • Datadog: Custom metric needed, query pg_replication_slots
  • New Relic: Custom query, e.g., SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag FROM pg_replication_slots;
  • Grafana: Query pg_replication_slots for active, restart_lsn, etc.

WAL File Count

  • Datadog: Custom metric based on pg_ls_dir('pg_wal'), e.g., SELECT count(*) FROM pg_ls_dir('pg_wal') WHERE pg_ls_dir ~ '^[0-9A-F]{24}$';
  • New Relic: Custom query, e.g., SELECT count(*) FROM pg_ls_dir('pg_wal') WHERE pg_ls_dir ~ '^[0-9A-F]{24}$';
  • Grafana: Use a query like SELECT count(*) FROM pg_ls_dir('pg_wal') WHERE pg_ls_dir ~ '^[0-9A-F]{24}$'; to get the count of WAL files
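
On PostgreSQL 10 and later, pg_ls_waldir() returns the same information without pattern matching (it requires superuser or the pg_monitor role):

-- Count and total size of files in pg_wal
SELECT count(*) AS wal_files,
       pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();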