> ## Documentation Index
> Fetch the complete documentation index at: https://docs.streamkap.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Monitoring The PostgreSQL WAL Log

> Guide to monitoring the PostgreSQL WAL Log

## What is the PostgreSQL WAL?

Write-Ahead Logging (WAL) is how PostgreSQL records changes before applying them to the database. This ensures data durability, consistency, and recovery. WAL also enables Change Data Capture (CDC) for real-time streaming solutions like Streamkap.

Starting with PostgreSQL 16, read replicas support **logical replication**, allowing pipelines to run from a replica—depending on your cloud provider.

## How WAL Works

WAL continuously generates log files capturing all database changes. These logs help restore transactions after a crash and enable replication. Each WAL file is usually 16MB, and the number generated depends on database activity. Since WAL writes sequentially, it reduces random I/O and improves performance.

## Factors Affecting WAL Growth

WAL size increases with database activity, particularly:

* **Frequent Transactions**: More inserts, updates, and deletes generate more WAL.
* **Retention Policies**: Keeping WAL files for recovery or replication increases storage needs. Streamkap recommends a 3-day retention.
* **Publications**: Define which tables and changes are logged for replication.
* **Replication Slots**: Ensure WAL is retained for subscribers (e.g., Streamkap). Unused or slow slots cause WAL buildup.
* **Replication Delays**: If subscribers lag, WAL accumulates, consuming disk space.

As subscribers consume data, WAL size decreases. The `restart_lsn` in `pg_replication_slots` shows how far back WAL is retained per slot.

## Disk Space Management

Ensure at least 3 days of WAL storage to handle replication delays. Retaining 5+ days is ideal for troubleshooting issues.

## Determining Disk Space Requirements

Use `pg_wal_lsn_diff()` to measure WAL generation over time:

<CodeGroup>
  ```SQL SQL theme={null}
  `SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/00000000'));`
  ```
</CodeGroup>

Consider these factors when determining storage requirements:

* **Transaction Volume**: Analyze typical and peak transaction rates to estimate WAL growth.
* **Retention Needs**: Multiply average WAL generation by the desired retention period (e.g., 3 days).
* **Auto-Growth Limits**: Configure auto-growth policies but don’t rely solely on them for space management.

## Reducing WAL Growth

To prevent excessive WAL accumulation:

* Limit publications to essential tables and columns.
* Retain at least 3 days of WAL but avoid unnecessary extensions.
* Monitor replication slots—inactive slots cause WAL buildup.

<CodeGroup>
  ```SQL SQL theme={null}
  -- Monitor slots

  SELECT * FROM pg_replication_slots;

  -- Drop Slot
  SELECT pg_drop_replication_slot('{slot}');
  ```
</CodeGroup>

* Ensure you are using `VACUUM` and `ANALYZE` operations regularly

## Essential Metrics for Oversight

### WAL Disk Utilization

* **Datadog**: postgresql.wal\_bytes
* **New Relic**: Custom query needed, e.g., SELECT pg\_size\_pretty(pg\_wal\_lsn\_diff(pg\_current\_wal\_lsn(), '0/00000000'));
* **Grafana**: Use pg\_stat\_wal.wal\_bytes from the pg\_stat\_wal view

### WAL Generation Velocity

* **Datadog**: postgresql.wal\_write (to approximate velocity)
* **New Relic**: Custom query for rate, e.g., SELECT pg\_size\_pretty(pg\_wal\_lsn\_diff(pg\_current\_wal\_lsn(), '0/00000000')) - pg\_size\_pretty(pg\_wal\_lsn\_diff(pg\_current\_wal\_lsn(), '0/00000000')::pg\_lsn - interval '1 minute')
* **Grafana**: Use pg\_stat\_wal.wal\_bytes and calculate the change over time

### Replication Lag

* **Datadog**: pg\_stat\_replication.flush\_lag
* **New Relic**: Custom query on pg\_stat\_replication, e.g., SELECT sent\_lsn - write\_lsn AS lag FROM pg\_stat\_replication;
* **Grafana**: Monitor pg\_stat\_replication.flush\_lag or pg\_stat\_replication.write\_lag from the pg\_stat\_replication view

### Replication Slot Status

* **Datadog**: Custom metric needed, query pg\_replication\_slots
* **New Relic**: Custom query, e.g., SELECT slot\_name, active, pg\_size\_pretty(pg\_wal\_lsn\_diff(pg\_current\_wal\_lsn(), restart\_lsn)) AS lag FROM pg\_replication\_slots;
* **Grafana**: Query pg\_replication\_slots for active, restart\_lsn, etc.

### WAL File Count

* **Datadog**: Custom metric based on pg*ls\_dir('pg\_wal'), e.g., SELECT count(*) FROM pg\_ls\_dir('pg\_wal') WHERE pg\_ls\_dir \~ '^\[0-9A-F]\{24}\$';
* **New Relic**: Custom query, e.g., SELECT count(\_) FROM pg\_ls\_dir('pg\_wal') WHERE pg\_ls\_dir \~ '^\[0-9A-F]\{24}\$';
* **Grafana**: Use a query like SELECT count(\*) FROM pg\_ls\_dir('pg\_wal') WHERE pg\_ls\_dir \~ '^\[0-9A-F]\{24}\$'; to get the count of WAL files
