Amazon DocumentDB Sources FAQ for Streamkap
This FAQ focuses on using Amazon DocumentDB as a source in Streamkap, including general AWS-hosted setups (compatible with MongoDB). Streamkap’s DocumentDB connector provides real-time CDC with managed features like automatic scaling, UI setup, and ETL transformations.
What is an Amazon DocumentDB source in Streamkap?
An Amazon DocumentDB source in Streamkap enables real-time Change Data Capture (CDC) from DocumentDB databases, capturing document-level inserts, updates, and deletes with sub-second latency. It uses MongoDB-compatible change streams to stream changes to destinations, and supports snapshots for initial loads, schema evolution, and handling of nested JSON data. Streamkap offers a serverless setup via UI or API.
What Amazon DocumentDB versions are supported as sources?
- DocumentDB 4.0+ for basic CDC; 5.0+ for advanced features like enhanced change streams and array encoding options.
- Compatible with MongoDB 3.6/4.0 compatibility modes.
What Amazon DocumentDB deployments are supported?
- AWS-hosted DocumentDB clusters (single instance or replica sets).
- Streamkap handles sharded setups and multi-region replicas with automatic shard/membership tracking.
What are the key features of Amazon DocumentDB sources in Streamkap?
- CDC: Change streams for inserts/updates/deletes; oplog-based resume tracking.
- Snapshots: Ad-hoc/initial backfills using incremental or blocking methods; phased chunking for minimal impact.
- Schema Evolution: Automatic handling of document structure changes; field renaming/exclusion.
- Data Types: Supports integers, floats, strings, dates, arrays, objects, binary (configurable as bytes/base64/hex), JSON; extended JSON for identifiers.
- Ingestion Modes: Inserts (append) or upserts.
- Security: SSL, IAM authentication, access control.
- Monitoring: Latency, lag, queue sizes in-app; heartbeat messages.
- Streamkap adds transaction metadata, filtering by collections, and aggregation pipelines.
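The difference between the two ingestion modes can be illustrated with a small sketch. The event shape (a dict with `op`, `_id`, and `doc`) and the function names are assumptions made for illustration, not Streamkap's actual event format, and removing the row on delete is only one possible way a destination might treat deletes in upsert mode.

```python
from typing import Any

# Hypothetical change-event shape: {"op": "insert"|"update"|"delete", "_id": ..., "doc": {...}}
Event = dict[str, Any]


def apply_append(table: list[Event], event: Event) -> None:
    """Insert (append) mode: every change becomes a new row, preserving history."""
    table.append(event)


def apply_upsert(table: dict[Any, dict], event: Event) -> None:
    """Upsert mode: keep only the latest state per _id (assumed: deletes remove the row)."""
    if event["op"] == "delete":
        table.pop(event["_id"], None)
    else:
        table[event["_id"]] = event["doc"]


events = [
    {"op": "insert", "_id": 1, "doc": {"_id": 1, "status": "new"}},
    {"op": "update", "_id": 1, "doc": {"_id": 1, "status": "paid"}},
    {"op": "insert", "_id": 2, "doc": {"_id": 2, "status": "new"}},
    {"op": "delete", "_id": 2, "doc": None},
]

history: list[Event] = []      # destination table in append mode
latest: dict[Any, dict] = {}   # destination table in upsert mode
for e in events:
    apply_append(history, e)
    apply_upsert(latest, e)
```

Append mode ends with one row per change, while upsert mode ends with one row per surviving document.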
How does CDC work for Amazon DocumentDB sources?
Streamkap uses DocumentDB change streams to capture and decode oplog data, emitting changes as events. It starts from the last recorded position, performs a snapshot if needed, then streams from that oplog position. It also supports full document updates with pre/post-images (DocumentDB 5.0+).
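The start-up decision described above (resume from the saved position, or snapshot first) can be sketched with a toy model. This is a simplified illustration, not Streamkap's implementation: `ChangeLog`, `plan_start`, and the use of contiguous integer positions in place of real resume tokens are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChangeLog:
    """Toy stand-in for a retained change log, keyed by increasing integer position."""
    entries: dict = field(default_factory=dict)

    def oldest(self) -> Optional[int]:
        """Oldest position still retained, or None if the log is empty."""
        return min(self.entries) if self.entries else None

    def read_after(self, position: int) -> list:
        """Return (position, event) pairs after the given position, in order."""
        return [(p, self.entries[p]) for p in sorted(self.entries) if p > position]


def plan_start(log: ChangeLog, saved_position: Optional[int]) -> str:
    """Decide whether to snapshot first or resume streaming."""
    if saved_position is None:
        return "snapshot"  # first run: no recorded position yet
    oldest = log.oldest()
    if oldest is not None and oldest > saved_position + 1:
        return "snapshot"  # events right after the saved position were purged
    return "resume"


# Positions 1-4 have already been purged; 5-8 are still retained.
log = ChangeLog({p: {"op": "update", "_id": p} for p in range(5, 9)})
```

With this log, a connector that last processed position 4 can resume, while one that stopped at position 2 has a gap and needs a snapshot to catch up.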
How do snapshots work for Amazon DocumentDB sources?
- Trigger ad-hoc at source/collection level.
- Methods: Incremental (phased, chunked by _id, default 1024 documents) or blocking (pauses streaming).
- Uses watermarking for progress; supports partial snapshots via conditions.
- Modes: initial (default), always, initial_only, no_data, when_needed, configuration_based, custom.
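A minimal sketch of chunking by _id, assuming an in-memory list of documents in place of a real collection. The "watermark" here is simplified to the last _id of each chunk, which is an assumption; the actual watermarking mechanism is not described in this FAQ. `snapshot_chunks` and `resume_after` are hypothetical names.

```python
from typing import Any, Iterator

CHUNK_SIZE = 1024  # default chunk size mentioned above


def snapshot_chunks(
    docs: list[dict[str, Any]],
    chunk_size: int = CHUNK_SIZE,
    resume_after: Any = None,
) -> Iterator[tuple[list[dict[str, Any]], Any]]:
    """Yield (chunk, watermark) pairs in ascending _id order.

    The watermark is the last _id in the chunk; passing it back as
    resume_after continues the snapshot where it stopped.
    """
    ordered = sorted(docs, key=lambda d: d["_id"])
    if resume_after is not None:
        ordered = [d for d in ordered if d["_id"] > resume_after]
    for start in range(0, len(ordered), chunk_size):
        chunk = ordered[start:start + chunk_size]
        yield chunk, chunk[-1]["_id"]


docs = [{"_id": i} for i in range(2500)]
chunk_sizes = [len(chunk) for chunk, _ in snapshot_chunks(docs)]
```

Reading in small, ordered chunks is what keeps the load on the source low, and it is also why a stable, sortable _id matters for incremental snapshots.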
What data types are supported?
- Basics: Integers (INT32/64), floats (FLOAT32/64), strings, dates/timestamps.
- Advanced: Arrays (configurable encoding: array, document, string), objects (STRUCT/Tuple), binary (BYTES/base64/hex), decimals, JSON (STRING/io.debezium.data.Json).
- Identifiers: _id (ObjectId, string, etc.), binary (extended JSON strict mode).
- Unsupported: Inconsistent nested structures without preprocessing; non-UTF8; oversized BSON (strategies: fail/skip/split).
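To make the three binary representations concrete, the sketch below renders the same bytes in each form. `encode_binary` and its `mode` argument are hypothetical names used for illustration; they do not correspond to a Streamkap setting name.

```python
import base64


def encode_binary(value: bytes, mode: str = "bytes"):
    """Render a binary field in one of the three representations listed above."""
    if mode == "bytes":
        return value  # passed through unchanged
    if mode == "base64":
        return base64.b64encode(value).decode("ascii")
    if mode == "hex":
        return value.hex()
    raise ValueError(f"unknown binary mode: {mode!r}")


raw = b"hi"
as_base64 = encode_binary(raw, "base64")
as_hex = encode_binary(raw, "hex")
```

The base64 and hex forms are plain strings, which is useful when the destination has no native binary column type.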
How to set up a general Amazon DocumentDB source?
- Ensure the DocumentDB cluster is in an active state with change streams enabled for the databases/collections you want to capture (Amazon DocumentDB requires enabling them explicitly, for example with the modifyChangeStreams admin command)
- Create an IAM user with read permissions on the cluster and the streamkap_signal collection
- Create the streamkap_signal collection for snapshots (can be in a different DB on the same instance)
- In the Streamkap UI: Add source, enter connection string (e.g., mongodb://<user>:<pass>@<host>:27017/?ssl=true&replicaSet=rs0), databases/collections, snapshot mode, array encoding
- Allow Streamkap IPs in the VPC security group.
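Credentials containing characters such as @ or / must be percent-encoded before they go into the connection string. The helper below is a small sketch that builds a URI in the format shown above; `build_uri` and the example host are hypothetical.

```python
from urllib.parse import quote_plus


def build_uri(user: str, password: str, host: str, port: int = 27017,
              replica_set: str = "rs0") -> str:
    """Build a URI in the format shown above, percent-encoding the credentials."""
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}"
        f"/?ssl=true&replicaSet={replica_set}"
    )


# Example with a password containing reserved characters (placeholder values).
uri = build_uri("app", "p@ss/word", "docdb.example.com")
```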
How to monitor for Amazon DocumentDB sources?
Use AWS CloudWatch for oplog size/lag and the Streamkap app for queue metrics.
Best practices: Retain the oplog for 7 days (minimum 48 hours) and alert on growth.
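The retention guidance can be turned into a simple check for an alerting script. This is a sketch only: the function names and status labels are hypothetical, and the retention value would have to come from your own cluster configuration.

```python
MIN_RETENTION_HOURS = 48              # minimum from the guidance above
RECOMMENDED_RETENTION_HOURS = 7 * 24  # 7 days


def retention_status(retention_hours: float) -> str:
    """Classify a configured retention window against the guidance above."""
    if retention_hours < MIN_RETENTION_HOURS:
        return "too_short"
    if retention_hours < RECOMMENDED_RETENTION_HOURS:
        return "acceptable"
    return "recommended"


def events_at_risk(downtime_hours: float, retention_hours: float) -> bool:
    """True if an outage outlasts retention, so unread changes may already be purged."""
    return downtime_hours > retention_hours
```

For example, a 60-hour pipeline outage against a 48-hour retention window means some changes may have been purged before they were read.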
What are common limitations?
- Standalone instances unsupported (requires replica set)
- Oplog purging during downtime may lose events
- BSON size limits (fail/skip/split)
- No transactions pre-4.0
- Sharded clusters need config server access
- Incremental snapshots require stable _id (non-strings preferred)
- UTF-8 only
How to handle deletes?
Streamkap captures deletes as events with before images; full deleted records are available with pre-images (DocumentDB 5.0+).
What security features are available?
Encrypted connections (SSL), IAM authentication, role-based access; VPC security groups.
Troubleshooting common issues
- Oplog Buildup: Monitor retention (AWS Console); resume from last position
- Connection Failures: Verify VPC, SSL, IAM roles
- Missing Events: Check include/exclude lists; ensure change streams enabled
- Streamkap-Specific: Check logs for resume token issues; validate signal collection
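When events seem to be missing, it helps to test a collection name against your include/exclude lists offline. The sketch below assumes the lists hold regular expressions matched against the full "database.collection" name and that an empty include list means "include everything"; confirm the actual matching rules in your connector settings. `is_captured` is a hypothetical helper.

```python
import re


def is_captured(namespace: str, include: list[str], exclude: list[str]) -> bool:
    """Return True if 'db.collection' passes the include list and is not excluded.

    Patterns are treated as regular expressions that must match the whole name;
    an empty include list means "include everything" (assumed semantics).
    """
    included = not include or any(re.fullmatch(p, namespace) for p in include)
    excluded = any(re.fullmatch(p, namespace) for p in exclude)
    return included and not excluded


include = [r"shop\..*"]
exclude = [r"shop\.audit_.*"]
orders_captured = is_captured("shop.orders", include, exclude)
audit_captured = is_captured("shop.audit_log", include, exclude)
```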
Can CDC capture database Views and other virtual objects?
No, CDC cannot capture views or most virtual database objects.
Why views cannot be captured:
CDC captures changes by reading the database transaction log (binlog, WAL, oplog, redo log, etc.). Views are query-time computations over base tables: they don’t store data, so they generate no transaction log entries. When you query a view, the database engine executes the underlying query against the base tables.
What cannot be captured:
- Views: Virtual collections defined by aggregation pipelines, no physical storage or oplog entries
- System Collections (system.*, admin.*, config.*): Metadata and internal state, not user data
- Time Series Collections: Amazon DocumentDB does not natively support MongoDB 5.0+ time series collections. However, if using DocumentDB 5.0-compatible mode or custom implementations of time-stamped data with specialized storage, change streams may be limited or unavailable due to storage optimizations that don’t maintain document-level change granularity in the oplog. Solution: Use regular collections with appropriate time-based indexes for CDC on time-stamped data.
- On-Demand Materialized Views ($merge, $out results): Generated data, not original sources
Configure CDC on the underlying base collections that power your views. The view logic can be recreated in your destination or transformation layer.
DocumentDB-specific notes:
- Aggregation pipelines on views: Capture the source collections and apply the pipeline logic downstream
- Standalone instances: Not supported for CDC—must use a replica set configuration
If you have a view order_summary created from the orders collection with filters and projections, capture the orders collection instead, then apply the same aggregation logic in your destination.
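As a rough illustration of recreating view logic downstream, the sketch below applies a filter and a projection to change events captured from the orders collection. The field names (status, customer, total), the filter condition, and the event shape are all hypothetical; substitute the logic of your actual view.

```python
from typing import Any, Iterable, Iterator


def order_summary(order_events: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    """Apply view-style filter and projection to change events from 'orders'.

    Hypothetical view logic: keep completed orders and project three fields.
    """
    for event in order_events:
        doc = event["doc"]
        if doc.get("status") == "completed":  # filter stage
            # projection stage
            yield {"_id": doc["_id"], "customer": doc["customer"], "total": doc["total"]}


events = [
    {"op": "insert", "doc": {"_id": 1, "customer": "a", "total": 10, "status": "completed", "notes": "x"}},
    {"op": "insert", "doc": {"_id": 2, "customer": "b", "total": 20, "status": "pending"}},
]
summary_rows = list(order_summary(events))
```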
Best practices for Amazon DocumentDB sources
- Use replica sets (min 3 nodes for production)
- Enable pre/post-images for full updates (5.0+)
- Limit collections to reduce load
- Test snapshots in staging
- Monitor via CloudWatch; set 7-day oplog retention