
Amazon DocumentDB Sources FAQ for Streamkap

This FAQ focuses on using Amazon DocumentDB as a source in Streamkap, including general AWS-hosted setups (compatible with MongoDB). Streamkap’s DocumentDB connector provides real-time CDC with managed features like automatic scaling, UI setup, and ETL transformations.
An Amazon DocumentDB source in Streamkap enables real-time Change Data Capture (CDC) from DocumentDB databases, capturing document-level inserts, updates, and deletes with sub-second latency. It uses MongoDB-compatible change streams to stream changes to destinations, and supports snapshots for initial loads, schema evolution, and nested JSON data. Streamkap offers a serverless setup via UI or API.
Supported versions and deployments:
  • DocumentDB 4.0+ for basic CDC; 5.0+ for advanced features like enhanced change streams and array encoding options.
  • Compatible with MongoDB 3.6/4.0 compatibility modes.
  • AWS-hosted DocumentDB clusters (single instance or replica sets).
  • Streamkap handles sharded setups and multi-region replicas with automatic shard/membership tracking.
Key features:
  • CDC: Change streams for inserts/updates/deletes; oplog-based resume tracking.
  • Snapshots: Ad-hoc/initial backfills using incremental or blocking methods; phased chunking for minimal impact.
  • Schema Evolution: Automatic handling of document structure changes; field renaming/exclusion.
  • Data Types: Supports integers, floats, strings, dates, arrays, objects, binary (configurable as bytes/base64/hex), JSON; extended JSON for identifiers.
  • Ingestion Modes: Inserts (append) or upserts.
  • Security: SSL, IAM authentication, access control.
  • Monitoring: Latency, lag, queue sizes in-app; heartbeat messages.
  • Streamkap adds transaction metadata, filtering by collections, and aggregation pipelines.
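Collection filtering and aggregation pipelines are configured in the Streamkap UI, but at the driver level they correspond to a pipeline passed to a change stream. A minimal pymongo sketch, assuming a hypothetical cluster endpoint, user, and an inventory.orders collection:

from pymongo import MongoClient

# Hypothetical endpoint and credentials; DocumentDB requires TLS and a
# replica-set style connection string.
client = MongoClient(
    "mongodb://streamkap_user:secret@docdb-cluster.example.com:27017/"
    "?ssl=true&replicaSet=rs0&readPreference=secondaryPreferred"
)

# An aggregation pipeline applied to the change stream itself: keep only
# inserts and updates whose full document has status == "active".
pipeline = [
    {"$match": {
        "operationType": {"$in": ["insert", "update"]},
        "fullDocument.status": "active",
    }}
]

with client["inventory"]["orders"].watch(pipeline, full_document="updateLookup") as stream:
    for change in stream:
        print(change["operationType"], change["documentKey"])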
How it works:
Streamkap uses DocumentDB change streams to capture and decode oplog data, emitting changes as events. It starts from the last recorded transaction, performs a snapshot if needed, then streams from the saved oplog position. Full-document updates are supported with pre/post-images (DocumentDB 5.0+).
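The resume behavior can be pictured with a change stream resume token. A minimal sketch, assuming the same hypothetical cluster and collection as above (Streamkap manages this offset bookkeeping internally):

from pymongo import MongoClient

client = MongoClient(
    "mongodb://streamkap_user:secret@docdb-cluster.example.com:27017/?ssl=true&replicaSet=rs0"
)
orders = client["inventory"]["orders"]

def process(change):
    # Hypothetical downstream handler.
    print(change["operationType"], change["documentKey"])

# First pass: remember the resume token after each delivered event.
last_token = None
with orders.watch(full_document="updateLookup") as stream:
    for change in stream:
        process(change)
        last_token = stream.resume_token
        break  # stop early for illustration

# Restart: continue exactly after the last recorded position, mirroring
# how the connector resumes from its stored offset instead of re-reading.
with orders.watch(resume_after=last_token) as stream:
    for change in stream:
        process(change)
        break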
Snapshots:
  • Trigger ad-hoc at the source or collection level (see the signal sketch after this list).
  • Methods: incremental (phased, chunked by _id, default 1024 documents) or blocking (pauses streaming).
  • Uses watermarking to track progress; supports partial snapshots via conditions.
  • Modes: initial (default), always, initial_only, no_data, when_needed, configuration_based, custom.
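Ad-hoc snapshots are triggered from the Streamkap UI or API. Under the hood, connectors of this style read signal documents from the streamkap_signal collection; the document below is a hedged sketch of the Debezium-style execute-snapshot format, with the database name and field values assumed for illustration:

from pymongo import MongoClient

client = MongoClient(
    "mongodb://streamkap_user:secret@docdb-cluster.example.com:27017/?ssl=true&replicaSet=rs0"
)

# Assumed signal-document shape; Streamkap normally writes this for you
# when you trigger a snapshot in the UI.
client["streamkap"]["streamkap_signal"].insert_one({
    "type": "execute-snapshot",
    "data": {
        "data-collections": ["inventory.orders"],  # collections to backfill
        "type": "incremental",                     # chunked; avoids pausing streaming
    },
})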
Data types:
  • Basics: Integers (INT32/64), floats (FLOAT32/64), strings, dates/timestamps.
  • Advanced: Arrays (configurable encoding: array, document, string), objects (STRUCT/Tuple), binary (BYTES/base64/hex), decimals, JSON (STRING/io.debezium.data.Json).
  • Identifiers: _id (ObjectId, string, etc.), binary (extended JSON strict mode).
  • Unsupported: Inconsistent nested structures without preprocessing; non-UTF-8 strings; oversized BSON documents (strategies: fail/skip/split).
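The extended JSON handling for identifiers and binary values can be seen with the bson package that ships with pymongo. A minimal sketch:

from bson import Binary, ObjectId
from bson.json_util import dumps, CANONICAL_JSON_OPTIONS

doc = {"_id": ObjectId(), "payload": Binary(b"\x00\x01\x02")}

# Canonical (strict-mode) extended JSON preserves BSON types that plain
# JSON cannot express, e.g. {"_id": {"$oid": "..."}, "payload": {"$binary": ...}}.
print(dumps(doc, json_options=CANONICAL_JSON_OPTIONS))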
Setup steps:
  1. Ensure the DocumentDB cluster is active and change streams are enabled on the target databases/collections (DocumentDB disables them by default; enable via the modifyChangeStreams admin command).
  2. Create an IAM or database user with read permissions on the cluster and the streamkap_signal collection.
  3. Create the streamkap_signal collection for snapshots (it can live in a different database on the same instance).
  4. In the Streamkap UI: add the source, enter the connection string (e.g., mongodb://<user>:<pass>@<host>:27017/?ssl=true&replicaSet=rs0), databases/collections, snapshot mode, and array encoding.
  5. Allow Streamkap IPs in the VPC security group.
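A minimal pymongo sketch of steps 1-3, assuming hypothetical admin credentials, an inventory database to capture, and a streamkap database holding the signal collection (exact role grants may differ; DocumentDB users are created in the admin database):

from pymongo import MongoClient

client = MongoClient(
    "mongodb://admin_user:admin_pass@docdb-cluster.example.com:27017/?ssl=true&replicaSet=rs0"
)

# Step 1: enable change streams for all databases and collections
# (DocumentDB leaves them disabled by default).
client["admin"].command(
    {"modifyChangeStreams": 1, "database": "", "collection": "", "enable": True}
)

# Step 2: a least-privilege connector user; readWrite on the signal
# database is an assumption so snapshot watermarks can be written.
client["admin"].command(
    "createUser", "streamkap_user",
    pwd="choose-a-strong-password",
    roles=[
        {"role": "read", "db": "inventory"},
        {"role": "readWrite", "db": "streamkap"},
    ],
)

# Step 3: the signal collection used to coordinate ad-hoc snapshots.
client["streamkap"].create_collection("streamkap_signal")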
Monitoring: Use AWS CloudWatch for oplog size/lag and the Streamkap app for queue metrics. Best practice: retain the oplog for 7 days (minimum 48 hours) and alert on growth.
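DocumentDB exposes change stream retention (its oplog analogue) as the change_stream_log_retention_duration cluster parameter, in seconds. A hedged boto3 sketch, with the region and parameter-group name assumed:

import boto3

docdb = boto3.client("docdb", region_name="us-east-1")

# 604800 seconds = 7 days, the retention recommended above.
docdb.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="streamkap-docdb-params",  # illustrative name
    Parameters=[{
        "ParameterName": "change_stream_log_retention_duration",
        "ParameterValue": "604800",
        "ApplyMethod": "immediate",
    }],
)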
Limitations:
  • Standalone instances unsupported (requires replica set)
  • Oplog purging during downtime may lose events
  • BSON size limits (fail/skip/split; see the size pre-check sketch after this list)
  • No transactions pre-4.0
  • Sharded clusters need config server access
  • Incremental snapshots require stable _id (non-strings preferred)
  • UTF-8 only
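The oversized-BSON strategies only matter once a document approaches the 16 MB BSON limit. A minimal sketch for pre-checking document size with the bson package (the helper name is made up):

import bson

MAX_BSON_BYTES = 16 * 1024 * 1024  # DocumentDB/MongoDB document size limit

def exceeds_bson_limit(doc: dict) -> bool:
    # Encode to BSON and compare against the limit before relying on
    # the connector's fail/skip/split handling.
    return len(bson.encode(doc)) > MAX_BSON_BYTES

print(exceeds_bson_limit({"k": "v"}))  # False for a small document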
Delete handling: Streamkap captures deletes as events with before-images, and supports full records with pre-images (5.0+).
Security: Encrypted connections (SSL/TLS), IAM authentication, role-based access, and VPC security groups.
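A minimal pymongo sketch of an encrypted connection, assuming the Amazon CA bundle (global-bundle.pem) has been downloaded to an illustrative path:

from pymongo import MongoClient

client = MongoClient(
    "mongodb://streamkap_user:secret@docdb-cluster.example.com:27017/"
    "?replicaSet=rs0&readPreference=secondaryPreferred",
    tls=True,
    tlsCAFile="/opt/certs/global-bundle.pem",  # AWS CA bundle; path assumed
)
print(client.admin.command("ping"))  # verifies the TLS handshake and auth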
Troubleshooting:
  • Oplog Buildup: Monitor retention (AWS Console); resume from last position
  • Connection Failures: Verify VPC, SSL, IAM roles
  • Missing Events: Check include/exclude lists; ensure change streams enabled
  • Streamkap-Specific: Check logs for resume token issues; validate signal collection
No, CDC cannot capture views or most virtual database objects.
Why views cannot be captured:
CDC captures changes by reading the database transaction log (binlog, WAL, oplog, redo log, etc.). Views are query-time computations over base tables: when you query a view, the engine executes the underlying query against the base tables. Because views store no data, they never generate transaction log entries for CDC to read.
What cannot be captured:
  • Views: Virtual collections defined by aggregation pipelines, no physical storage or oplog entries
  • System Collections (system.*, admin.*, config.*): Metadata and internal state, not user data
  • Time Series Collections: Amazon DocumentDB does not natively support MongoDB 5.0+ time series collections. Where time-stamped data is stored with specialized or optimized storage, change streams may be limited or unavailable because those optimizations do not preserve document-level change granularity in the oplog. Solution: use regular collections with appropriate time-based indexes for CDC on time-stamped data.
  • On-Demand Materialized Views ($merge, $out results): Generated data, not original sources
Solution:
Configure CDC on the underlying base collections that power your views. The view logic can then be recreated in your destination or transformation layer.
DocumentDB-specific notes:
  • Aggregation pipelines on views: Capture the source collections and apply the pipeline logic downstream
  • Standalone instances: Not supported for CDC—must use a replica set configuration
Example:
If you have a view order_summary created from the orders collection with filters and projections, capture the orders collection instead, then apply the same aggregation logic in your destination.
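A minimal sketch of that pattern, with hypothetical filters and projections standing in for the order_summary view definition (in practice the pipeline would run in your destination or transformation layer over the captured orders data):

from pymongo import MongoClient

client = MongoClient(
    "mongodb://streamkap_user:secret@docdb-cluster.example.com:27017/?ssl=true&replicaSet=rs0"
)

# Capture the base collection "orders" with CDC; the view logic becomes
# an ordinary aggregation applied downstream.
order_summary_pipeline = [
    {"$match": {"status": "complete"}},                             # assumed view filter
    {"$project": {"customer_id": 1, "total": 1, "ordered_at": 1}},  # assumed projection
]

for row in client["inventory"]["orders"].aggregate(order_summary_pipeline):
    print(row)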
Best practices:
  • Use replica sets (min 3 nodes for production)
  • Enable pre/post-images for full updates (5.0+)
  • Limit collections to reduce load
  • Test snapshots in staging
  • Monitor via CloudWatch; set 7-day oplog retention