PostgreSQL Source FAQ

PostgreSQL Sources FAQ for Streamkap

This FAQ focuses on using PostgreSQL as a source in Streamkap, including general self-hosted setups and cloud variants (AWS RDS/Aurora, Azure, Google Cloud SQL, Neon). Streamkap's PostgreSQL connector provides real-time CDC with managed features like automatic scaling, UI setup, and ETL transformations.

What is a PostgreSQL source in Streamkap?

A PostgreSQL source in Streamkap enables real-time Change Data Capture (CDC) from PostgreSQL databases, capturing row-level inserts, updates, and deletes with sub-second latency. It uses logical replication to stream changes to destinations, supporting snapshots for initial loads, schema evolution, and heartbeats for low-traffic DBs. Streamkap abstracts complexity, offering a serverless setup via UI or API.

What PostgreSQL versions are supported as sources?

  • PostgreSQL 10+ for basic CDC; 13+ for advanced features like read-only snapshots and partitioned tables with publish_via_partition_root. Compatible with PostgreSQL 15–17+ in cloud deployments.
  • PostgreSQL 9.4+ for logical decoding; 10+ for pgoutput plugin; 13+ for read-only incremental snapshots; 16+ for replica server slots; 17+ for failover-enabled slots.

What PostgreSQL deployments are supported?

Streamkap supports:

  • Self-hosted (on-prem/VM).
  • AWS RDS PostgreSQL (including Aurora and Serverless).
  • Azure Database for PostgreSQL.
  • Google Cloud SQL for PostgreSQL.
  • Neon PostgreSQL.

Streamkap also supports these, plus CrunchyBridge and Amazon RDS, with similar configs.

What are the key features of PostgreSQL sources in Streamkap?

  • CDC: Log-based via pgoutput decoder; captures inserts/updates/deletes.
  • Snapshots: Ad-hoc/initial backfills using read-only (PG 13+) or signal table methods; watermarking for minimal impact.
  • Schema Evolution: Automatic handling of adds/drops/type changes.
  • Heartbeats: Enabled by default to prevent WAL buildup in low-traffic DBs via a streamkap_heartbeat table.
  • Data Types: Standard mappings (numerics, temporals, strings, binary as bytes/base64/hex, arrays/JSON/hstore); supports pgvector/VECTOR types in latest Streamkap.
  • Ingestion Modes: Inserts (append) or upserts.
  • Security: SSL, SSH/VPN, IP allowlisting.
  • Monitoring: Latency, lag, WAL metrics in-app.
  • Streamkap adds transaction metadata, ad-hoc snapshots with filters, and custom types like PostGIS/GEOMETRY.

How does CDC work for PostgreSQL sources?

Streamkap reads PostgreSQL's write-ahead log (WAL) via logical decoding, emitting changes as events. It uses Streamkap's pgoutput plugin for native support. REPLICA IDENTITY FULL ensures complete before/after images for updates/deletes.

What is REPLICA IDENTITY and why does it matter?

REPLICA IDENTITY controls logged data for changes:

  • DEFAULT: Logs PK only (partial deletes).
  • FULL: Logs all columns (complete records, recommended for audits/soft deletes).
    Set via ALTER TABLE ... REPLICA IDENTITY FULL;. Required for full delete handling

How do snapshots work for PostgreSQL sources?

  • Trigger ad-hoc at source/table level. Methods: Read-only (PG 13+, no signal table) or streamkap_signal table. Uses watermarking for incremental loads.
  • Modes like initial, always, when_needed; supports incremental (chunked, parallel) and read-only (PG 13+); ad-hoc via signaling. Streamkap simplifies triggering via UI.

What are heartbeats and how do they work?

Heartbeats prevent WAL buildup in low-traffic DBs by periodically updating a streamkap_heartbeat table, advancing the replication slot.

  • Setup: Create schema/table, grant permissions, insert initial row, add to publication if needed.
  • How it Works: Connector updates table periodically, generating change events.
  • Verification: Check table timestamps; troubleshoot permissions/publication inclusion.
  • Limitations: Requires manual addition to specific-table publications. Streamkap has configurable heartbeat.interval.ms for similar events.

What data types are supported?

  • Basics: Integers, floats, strings, dates/timestamps (micro/nano precision modes).
  • Advanced: Arrays, JSON, binary (bytes/hex), decimals (precise/double/string modes).
  • Custom: Domain types, network addresses, PostGIS (GEOMETRY/GEOGRAPHY), pgvector (VECTOR/HALFVEC/SPARSEVEC).
  • Unsupported: Non-UTF8 encodings; some spatial/custom without config.

How to monitor WAL for PostgreSQL sources?

  • Use queries like SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) FROM pg_replication_slots WHERE slot_name = 'streamkap_slot'; for size/lag.
  • Tools: Datadog (wal_bytes), New Relic/Grafana (custom queries).
  • Best Practices: Retain 3–5 days WAL; alert on growth; drop inactive slots.
  • Streamkap: Monitor via JMX; WAL issues from inactive slots common.

What are common limitations?

  • Non-UTF8 unsupported; data loss risk on upgrades (recreate slots); read replicas limit snapshots; no DDL events.
  • No DDL; PK changes need manual handling; TOASTed values may be incomplete; consistency issues pre-commit; generated columns missing in pgoutput.
  • WAL buildup without heartbeats; schema changes during snapshots unsupported.

How to handle deletes?

Use REPLICA IDENTITY FULL for complete records. Supports soft deletes. In upserts, deletes propagate as events.

Troubleshooting common issues

  • WAL Buildup: Enable heartbeats; monitor/drop slots; retain 3–5 days.
  • Upgrade Data Loss: Stop writes, capture events, recreate slot, resnapshot.
  • Connection Failures: Verify pg_hba.conf, firewalls, SSL.
  • Missing Events: Ensure publication includes tables; check REPLICA IDENTITY.

Best practices for PostgreSQL sources

  • Use dedicated replication user.
  • Limit publications to needed tables.
  • Enable auto-vacuum; set WAL retention.
  • Test snapshots in staging.
  • For cloud: Monitor provider tools; use heartbeats for low traffic.
  • Use PG 17+ for failover resilience.