DocumentDB Source FAQ
Amazon DocumentDB Sources FAQ for Streamkap
This FAQ focuses on using Amazon DocumentDB as a source in Streamkap, including general AWS-hosted setups (compatible with MongoDB). Streamkap's DocumentDB connector provides real-time CDC with managed features like automatic scaling, UI setup, and ETL transformations.
What is an Amazon DocumentDB source in Streamkap?
An Amazon DocumentDB source in Streamkap enables real-time Change Data Capture (CDC) from DocumentDB databases, capturing document-level inserts, updates, and deletes with sub-second latency. It uses change streams (MongoDB-compatible) to stream changes to destinations, supporting snapshots for initial loads, schema evolution, and handling of nested JSON data. Streamkap offers a serverless setup via UI or API.
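For illustration, here is a minimal PyMongo sketch of the change-stream mechanism the connector builds on; the endpoint, credentials, and the `inventory.orders` namespace are placeholders, and Streamkap performs this consumption for you as a managed service rather than requiring such code.

```python
# Illustrative only: reading DocumentDB change streams directly with PyMongo.
# Endpoint, credentials, and the inventory.orders namespace are placeholders.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://user:password@docdb-cluster.cluster-example.us-east-1.docdb.amazonaws.com:27017",
    tls=True,
    tlsCAFile="global-bundle.pem",   # Amazon DocumentDB CA bundle
    replicaSet="rs0",
    retryWrites=False,               # DocumentDB does not support retryable writes
)

# Each change event describes a single document-level insert, update, or delete.
with client["inventory"]["orders"].watch() as stream:
    for change in stream:
        print(change["operationType"], change["documentKey"])
```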
What Amazon DocumentDB versions are supported as sources?
- DocumentDB 4.0+ for basic CDC; 5.0+ for advanced features such as enhanced change streams and array encoding options. Works with the MongoDB 3.6/4.0 compatibility modes.
What Amazon DocumentDB deployments are supported?
- AWS-hosted DocumentDB clusters (single instance or replica sets).
Streamkap handles sharded setups and multi-region replicas with automatic shard/membership tracking.
What are the key features of Amazon DocumentDB sources in Streamkap?
- CDC: Change streams for inserts/updates/deletes; oplog-based resume tracking.
- Snapshots: Ad-hoc/initial backfills using incremental or blocking methods; phased chunking for minimal impact.
- Schema Evolution: Automatic handling of document structure changes; field renaming/exclusion.
- Data Types: Supports integers, floats, strings, dates, arrays, objects, binary (configurable as bytes/base64/hex), JSON; extended JSON for identifiers.
- Ingestion Modes: Inserts (append) or upserts.
- Security: SSL, IAM authentication, access control.
- Monitoring: Latency, lag, queue sizes in-app; heartbeat messages.
- Streamkap adds transaction metadata, filtering by collection, and aggregation pipelines (sketched below).
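As a rough sketch of what collection filtering and aggregation pipelines look like at the change-stream level (the pipeline, namespaces, and connection details here are illustrative assumptions, not Streamkap configuration):

```python
# Hedged sketch: server-side filtering of change events with an aggregation
# pipeline at the database level. Namespaces and the client URI are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://user:password@docdb-cluster.cluster-example.us-east-1.docdb.amazonaws.com:27017",
                     tls=True, tlsCAFile="global-bundle.pem",
                     replicaSet="rs0", retryWrites=False)

pipeline = [
    {"$match": {
        "operationType": {"$in": ["insert", "update", "delete"]},
        "ns.coll": {"$in": ["orders", "customers"]},   # keep only these collections
    }}
]

with client["inventory"].watch(pipeline=pipeline) as stream:
    for change in stream:
        print(change["ns"], change["operationType"])
```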
How does CDC work for Amazon DocumentDB sources?
Streamkap uses DocumentDB change streams to capture and decode oplog data, emitting changes as events. It starts from the last recorded transaction, performs a snapshot if needed, then streams from the oplog position. Supports full document updates with pre/post-images (DocumentDB 5.0+).
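A hedged sketch of that resume behavior with PyMongo follows; `load_token`, `save_token`, and `process` are hypothetical helpers standing in for durable offset storage and the downstream pipeline, and the connection details are placeholders.

```python
# Sketch of resuming from a stored position and requesting full post-images.
# load_token/save_token/process are hypothetical helpers, not Streamkap APIs.
from pymongo import MongoClient

client = MongoClient("mongodb://user:password@docdb-cluster.cluster-example.us-east-1.docdb.amazonaws.com:27017",
                     tls=True, tlsCAFile="global-bundle.pem",
                     replicaSet="rs0", retryWrites=False)

def load_token():
    return None      # e.g., read the last resume token from durable storage

def save_token(token):
    pass             # e.g., persist the token so streaming survives restarts

def process(event):
    print(event["operationType"], event.get("fullDocument"))

with client["inventory"]["orders"].watch(
    full_document="updateLookup",   # emit the full post-image for updates
    resume_after=load_token(),      # continue from the last recorded position
) as stream:
    for change in stream:
        process(change)
        save_token(stream.resume_token)
```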
How do snapshots work for Amazon DocumentDB sources?
- Trigger ad-hoc at the source or collection level. Methods: incremental (phased, chunked by _id, default 1024 documents) or blocking (pauses streaming). Uses watermarking for progress; supports partial snapshots via conditions (see the sketch after this list).
- Modes: initial (default), always, initial_only, no_data, when_needed, configuration_based, custom.
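A simplified sketch of the chunked-by-_id pattern behind incremental snapshots is shown below; it is not Streamkap's internal implementation, and the collection name and connection details are placeholders.

```python
# Simplified illustration of incremental snapshotting: read the collection in
# _id-ordered chunks (default 1024 per the FAQ) and track a watermark.
from pymongo import MongoClient

client = MongoClient("mongodb://user:password@docdb-cluster.cluster-example.us-east-1.docdb.amazonaws.com:27017",
                     tls=True, tlsCAFile="global-bundle.pem",
                     replicaSet="rs0", retryWrites=False)

def snapshot_in_chunks(collection, chunk_size=1024):
    last_id = None
    while True:
        query = {"_id": {"$gt": last_id}} if last_id is not None else {}
        chunk = list(collection.find(query).sort("_id", 1).limit(chunk_size))
        if not chunk:
            break
        yield chunk                  # hand each chunk to the downstream pipeline
        last_id = chunk[-1]["_id"]   # watermark: resume after the last _id seen

for chunk in snapshot_in_chunks(client["inventory"]["orders"]):
    print(f"snapshotted {len(chunk)} documents")
```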
What data types are supported?
- Basics: Integers (INT32/64), floats (FLOAT32/64), strings, dates/timestamps.
- Advanced: Arrays (configurable encoding: array, document, string), objects (STRUCT/Tuple), binary (BYTES/base64/hex), decimals, JSON (STRING/io.debezium.data.Json).
- Identifiers: _id (ObjectId, string, etc.), binary (extended JSON strict mode); see the serialization sketch after this list.
- Unsupported: inconsistent nested structures without preprocessing; non-UTF-8 strings; oversized BSON documents (strategies: fail/skip/split).
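To illustrate the extended JSON representations mentioned above, here is a small PyMongo/bson sketch; the document is made up, and the actual encoding choices are configured in Streamkap rather than in code.

```python
# Illustration of extended JSON output for BSON types such as ObjectId and
# Binary, using bson.json_util. The document below is a made-up example.
from bson import Binary, ObjectId, json_util

doc = {
    "_id": ObjectId(),
    "payload": Binary(b"\x00\x01\x02"),
    "price": 19.99,
}

# Canonical (strict) mode keeps type information, e.g. {"$oid": ...}, {"$binary": ...}.
print(json_util.dumps(doc, json_options=json_util.CANONICAL_JSON_OPTIONS))

# Relaxed mode favors plain JSON values where possible.
print(json_util.dumps(doc, json_options=json_util.RELAXED_JSON_OPTIONS))
```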
How to set up a general Amazon DocumentDB source?
- Ensure the DocumentDB cluster is active with change streams enabled (default in 4.0+).
- Create an IAM user with read permissions on the cluster and the streamkap_signal collection.
- Create the streamkap_signal collection for snapshots (it can live in a different database on the same instance); a connection sketch follows this list.
- In the Streamkap UI: add the source, enter the connection string (e.g., mongodb://<user>:<pass>@<host>:27017/?ssl=true&replicaSet=rs0), then select databases/collections, snapshot mode, and array encoding (array is optimal; document/string for mixed types).
- Allow Streamkap IPs in the VPC security group.
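The sketch below is a hedged pre-flight check you could run yourself before adding the source: verify TLS connectivity and create the streamkap_signal collection if it is missing. The endpoint, credentials, and the choice of database for the signal collection are assumptions.

```python
# Hedged setup check: confirm connectivity over TLS and ensure the
# streamkap_signal collection exists. Endpoint/credentials are placeholders.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://user:password@docdb-cluster.cluster-example.us-east-1.docdb.amazonaws.com:27017",
    tls=True,
    tlsCAFile="global-bundle.pem",   # Amazon DocumentDB CA bundle
    replicaSet="rs0",
    retryWrites=False,
)

client.admin.command("ping")  # fails fast if networking, SSL, or credentials are wrong

signal_db = client["streamkap"]   # assumption: any database on the same instance works
if "streamkap_signal" not in signal_db.list_collection_names():
    signal_db.create_collection("streamkap_signal")
```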
How to monitor for Amazon DocumentDB sources?
Use AWS CloudWatch for oplog size/lag and the Streamkap app for queue metrics. Best practices: retain the oplog for 7 days (minimum 48 hours); alert on growth.
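A hedged boto3 sketch of that CloudWatch check follows; the AWS/DocDB namespace and DBClusterIdentifier dimension are standard, while the ChangeStreamLogSize metric name and the cluster identifier are assumptions to adapt to your setup.

```python
# Hedged monitoring sketch: pull a DocumentDB cluster metric from CloudWatch.
# ChangeStreamLogSize is assumed as the retention-growth metric; adjust as needed.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/DocDB",
    MetricName="ChangeStreamLogSize",   # assumption: change-stream log size metric
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "my-docdb-cluster"}],
    StartTime=now - timedelta(hours=6),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```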
What are common limitations?
- Standalone instances unsupported (requires replica set); oplog purging during downtime may lose events; BSON size limits (fail/skip/split); no transactions pre-4.0.
- Sharded clusters need config server access; incremental snapshots require stable _id (non-strings preferred); UTF-8 only.
How to handle deletes?
Streamkap captures deletes as events with before-images; full deleted records are available when pre-images are enabled (5.0+).
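A short, hedged sketch of consuming delete events is below; full_document_before_change requires pre-image support (5.0+ per this FAQ) and PyMongo 4.2+, and the namespace and connection details are placeholders.

```python
# Hedged sketch: watch only delete events and read the pre-image when available.
from pymongo import MongoClient

client = MongoClient("mongodb://user:password@docdb-cluster.cluster-example.us-east-1.docdb.amazonaws.com:27017",
                     tls=True, tlsCAFile="global-bundle.pem",
                     replicaSet="rs0", retryWrites=False)

with client["inventory"]["orders"].watch(
    pipeline=[{"$match": {"operationType": "delete"}}],
    full_document_before_change="whenAvailable",   # pre-image of the deleted document
) as stream:
    for change in stream:
        print(change["documentKey"], change.get("fullDocumentBeforeChange"))
```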
What security features are available?
Encrypted connections (SSL), IAM authentication, role-based access; VPC security groups.
Troubleshooting common issues
- Oplog Buildup: Monitor retention (AWS Console); resume from last position.
- Connection Failures: Verify VPC, SSL, IAM roles.
- Missing Events: Check include/exclude lists; ensure change streams enabled.
- Streamkap-Specific: Check logs for resume token issues; validate the signal collection (see the checks sketched below).
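The checks below are a hedged sketch for the missing-events and signal-collection items above: confirm the signal collection exists and list or enable change streams using DocumentDB's admin operations ($listChangeStreams, modifyChangeStreams). Database names and connection details are placeholders.

```python
# Hedged troubleshooting checks; database/collection names are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://user:password@docdb-cluster.cluster-example.us-east-1.docdb.amazonaws.com:27017",
                     tls=True, tlsCAFile="global-bundle.pem",
                     replicaSet="rs0", retryWrites=False)

# 1. Is the signal collection present?
print("streamkap_signal" in client["streamkap"].list_collection_names())

# 2. Which namespaces have change streams enabled? (DocumentDB-specific stage)
for entry in client["admin"].aggregate([{"$listChangeStreams": 1}]):
    print(entry)

# 3. Enable change streams for every collection in a database if missing.
client["admin"].command({"modifyChangeStreams": 1, "database": "inventory",
                         "collection": "", "enable": True})
```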
Best practices for Amazon DocumentDB sources
- Use replica sets (min 3 nodes for production).
- Enable pre/post-images for full updates (5.0+).
- Limit collections to reduce load.
- Test snapshots in staging.
- Monitor via CloudWatch; set 7-day oplog retention.