PostgreSQL Sources FAQ for Streamkap

This FAQ focuses on using PostgreSQL as a source in Streamkap, including general self-hosted setups and cloud variants (AWS RDS/Aurora, Azure, Google Cloud SQL, Neon). Streamkap's PostgreSQL connector provides real-time CDC with managed features like automatic scaling, UI setup, and ETL transformations.

What is a PostgreSQL source in Streamkap?

A PostgreSQL source in Streamkap enables real-time Change Data Capture (CDC) from PostgreSQL databases, capturing row-level inserts, updates, and deletes with sub-second latency. It uses logical replication to stream changes to destinations, supporting snapshots for initial loads, schema evolution, and heartbeats for low-traffic DBs. Streamkap abstracts complexity, offering a serverless setup via UI or API.

What PostgreSQL versions are supported as sources?

PostgreSQL 10+ for basic CDC; 13+ for advanced features like read-only snapshots and partitioned tables with publish_via_partition_root. Compatible with PostgreSQL 15–17+ in cloud deployments.
PostgreSQL 9.4+ for logical decoding; 10+ for pgoutput plugin; 13+ for read-only incremental snapshots; 16+ for replica server slots; 17+ for failover-enabled slots.

What PostgreSQL deployments are supported?

Streamkap supports:

Self-hosted (on-prem/VM).
AWS RDS PostgreSQL (including Aurora and Serverless).
Azure Database for PostgreSQL.
Google Cloud SQL for PostgreSQL.
Neon PostgreSQL.

Streamkap also supports these, plus CrunchyBridge and Amazon RDS, with similar configs.

What are the key features of PostgreSQL sources in Streamkap?

CDC: Log-based via pgoutput decoder; captures inserts/updates/deletes.
Snapshots: Ad-hoc/initial backfills using read-only (PG 13+) or signal table methods; watermarking for minimal impact.
Schema Evolution: Automatic handling of adds/drops/type changes.
Heartbeats: Enabled by default to prevent WAL buildup in low-traffic DBs via a streamkap_heartbeat table.
Data Types: Standard mappings (numerics, temporals, strings, binary as bytes/base64/hex, arrays/JSON/hstore); supports pgvector/VECTOR types in latest Streamkap.
Ingestion Modes: Inserts (append) or upserts.
Security: SSL, SSH/VPN, IP allowlisting.
Monitoring: Latency, lag, WAL metrics in-app.
Streamkap adds transaction metadata, ad-hoc snapshots with filters, and custom types like PostGIS/GEOMETRY.

How does CDC work for PostgreSQL sources?

Streamkap reads PostgreSQL's write-ahead log (WAL) via logical decoding, emitting changes as events. It uses Streamkap's pgoutput plugin for native support. REPLICA IDENTITY FULL ensures complete before/after images for updates/deletes.

What is REPLICA IDENTITY and why does it matter?

REPLICA IDENTITY controls logged data for changes:

DEFAULT: Logs PK only (partial deletes).
FULL: Logs all columns (complete records, recommended for audits/soft deletes).
Set via ALTER TABLE ... REPLICA IDENTITY FULL;. Required for full delete handling

How do snapshots work for PostgreSQL sources?

Trigger ad-hoc at source/table level. Methods: Read-only (PG 13+, no signal table) or streamkap_signal table. Uses watermarking for incremental loads.
Modes like initial, always, when_needed; supports incremental (chunked, parallel) and read-only (PG 13+); ad-hoc via signaling. Streamkap simplifies triggering via UI.

What are heartbeats and how do they work?

Heartbeats prevent WAL buildup in low-traffic DBs by periodically updating a streamkap_heartbeat table, advancing the replication slot.

Setup: Create schema/table, grant permissions, insert initial row, add to publication if needed.
How it Works: Connector updates table periodically, generating change events.
Verification: Check table timestamps; troubleshoot permissions/publication inclusion.
Limitations: Requires manual addition to specific-table publications. Streamkap has configurable heartbeat.interval.ms for similar events.

What data types are supported?

Basics: Integers, floats, strings, dates/timestamps (micro/nano precision modes).
Advanced: Arrays, JSON, binary (bytes/hex), decimals (precise/double/string modes).
Custom: Domain types, network addresses, PostGIS (GEOMETRY/GEOGRAPHY), pgvector (VECTOR/HALFVEC/SPARSEVEC).
Unsupported: Non-UTF8 encodings; some spatial/custom without config.

How to monitor WAL for PostgreSQL sources?

Use queries like SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) FROM pg_replication_slots WHERE slot_name = 'streamkap_slot'; for size/lag.
Tools: Datadog (wal_bytes), New Relic/Grafana (custom queries).
Best Practices: Retain 3–5 days WAL; alert on growth; drop inactive slots.
Streamkap: Monitor via JMX; WAL issues from inactive slots common.

What are common limitations?

Non-UTF8 unsupported; data loss risk on upgrades (recreate slots); read replicas limit snapshots; no DDL events.
No DDL; PK changes need manual handling; TOASTed values may be incomplete; consistency issues pre-commit; generated columns missing in pgoutput.
WAL buildup without heartbeats; schema changes during snapshots unsupported.

How to handle deletes?

Use REPLICA IDENTITY FULL for complete records. Supports soft deletes. In upserts, deletes propagate as events.

Troubleshooting common issues

WAL Buildup: Enable heartbeats; monitor/drop slots; retain 3–5 days.
Upgrade Data Loss: Stop writes, capture events, recreate slot, resnapshot.
Connection Failures: Verify pg_hba.conf, firewalls, SSL.
Missing Events: Ensure publication includes tables; check REPLICA IDENTITY.

Best practices for PostgreSQL sources

Use dedicated replication user.
Limit publications to needed tables.
Enable auto-vacuum; set WAL retention.
Test snapshots in staging.
For cloud: Monitor provider tools; use heartbeats for low traffic.
Use PG 17+ for failover resilience.