
Amazon DocumentDB Sources FAQ for Streamkap

This FAQ focuses on using Amazon DocumentDB as a source in Streamkap, including general AWS-hosted setups (compatible with MongoDB). Streamkap’s DocumentDB connector provides real-time CDC with managed features like automatic scaling, UI setup, and ETL transformations.
An Amazon DocumentDB source in Streamkap enables real-time Change Data Capture (CDC) from DocumentDB databases, capturing document-level inserts, updates, and deletes with sub-second latency. It uses MongoDB-compatible change streams to stream changes to destinations, and supports snapshots for initial loads, schema evolution, and nested JSON data. Streamkap offers a serverless setup via UI or API.
Supported versions and deployments:
  • DocumentDB 4.0+ for basic CDC; 5.0+ for advanced features like enhanced change streams and array encoding options.
  • Compatible with MongoDB 3.6/4.0 compatibility modes.
  • AWS-hosted DocumentDB clusters (single instance or replica sets).
  • Streamkap handles sharded setups and multi-region replicas with automatic shard/membership tracking.
Key features:
  • CDC: Change streams for inserts/updates/deletes; oplog-based resume tracking.
  • Snapshots: Ad-hoc/initial backfills using incremental or blocking methods; phased chunking for minimal impact.
  • Schema Evolution: Automatic handling of document structure changes; field renaming/exclusion.
  • Data Types: Supports integers, floats, strings, dates, arrays, objects, binary (configurable as bytes/base64/hex), JSON; extended JSON for identifiers.
  • Ingestion Modes: Inserts (append) or upserts.
  • Security: SSL, IAM authentication, access control.
  • Monitoring: Latency, lag, queue sizes in-app; heartbeat messages.
  • Streamkap adds transaction metadata, filtering by collections, and aggregation pipelines.
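Collection filtering and aggregation pipelines are configured in the Streamkap UI, but at the driver level they correspond to a pipeline passed to a change stream. A minimal pymongo sketch, assuming a hypothetical cluster endpoint, user, and an inventory.orders collection:

from pymongo import MongoClient

# Hypothetical endpoint and credentials; DocumentDB requires TLS and a
# replica-set style connection string.
client = MongoClient(
    "mongodb://streamkap_user:secret@docdb-cluster.example.com:27017/"
    "?ssl=true&replicaSet=rs0&readPreference=secondaryPreferred"
)

# An aggregation pipeline applied to the change stream itself: keep only
# inserts and updates whose full document has status == "active".
pipeline = [
    {"$match": {
        "operationType": {"$in": ["insert", "update"]},
        "fullDocument.status": "active",
    }}
]

with client["inventory"]["orders"].watch(pipeline, full_document="updateLookup") as stream:
    for change in stream:
        print(change["operationType"], change["documentKey"])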
How it works:
Streamkap uses DocumentDB change streams to capture and decode oplog data, emitting changes as events. It starts from the last recorded transaction, performs a snapshot if needed, then streams from the saved oplog position. Full-document updates are supported with pre/post-images (DocumentDB 5.0+).
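The resume behavior can be pictured with a change stream resume token. A minimal sketch, assuming the same hypothetical cluster and collection as above (Streamkap manages this offset bookkeeping internally):

from pymongo import MongoClient

client = MongoClient(
    "mongodb://streamkap_user:secret@docdb-cluster.example.com:27017/?ssl=true&replicaSet=rs0"
)
orders = client["inventory"]["orders"]

def process(change):
    # Hypothetical downstream handler.
    print(change["operationType"], change["documentKey"])

# First pass: remember the resume token after each delivered event.
last_token = None
with orders.watch(full_document="updateLookup") as stream:
    for change in stream:
        process(change)
        last_token = stream.resume_token
        break  # stop early for illustration

# Restart: continue exactly after the last recorded position, mirroring
# how the connector resumes from its stored offset instead of re-reading.
with orders.watch(resume_after=last_token) as stream:
    for change in stream:
        process(change)
        break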
Snapshots:
  • Trigger ad-hoc at the source or collection level (see the signal sketch after this list).
  • Methods: incremental (phased, chunked by _id, default 1024 documents) or blocking (pauses streaming).
  • Uses watermarking to track progress; supports partial snapshots via conditions.
  • Modes: initial (default), always, initial_only, no_data, when_needed, configuration_based, custom.
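Ad-hoc snapshots are triggered from the Streamkap UI or API. Under the hood, connectors of this style read signal documents from the streamkap_signal collection; the document below is a hedged sketch of the Debezium-style execute-snapshot format, with the database name and field values assumed for illustration:

from pymongo import MongoClient

client = MongoClient(
    "mongodb://streamkap_user:secret@docdb-cluster.example.com:27017/?ssl=true&replicaSet=rs0"
)

# Assumed signal-document shape; Streamkap normally writes this for you
# when you trigger a snapshot in the UI.
client["streamkap"]["streamkap_signal"].insert_one({
    "type": "execute-snapshot",
    "data": {
        "data-collections": ["inventory.orders"],  # collections to backfill
        "type": "incremental",                     # chunked; avoids pausing streaming
    },
})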
Data types:
  • Basics: Integers (INT32/64), floats (FLOAT32/64), strings, dates/timestamps.
  • Advanced: Arrays (configurable encoding: array, document, string), objects (STRUCT/Tuple), binary (BYTES/base64/hex), decimals, JSON (STRING/io.debezium.data.Json).
  • Identifiers: _id (ObjectId, string, etc.), binary (extended JSON strict mode).
  • Unsupported: Inconsistent nested structures without preprocessing; non-UTF-8 strings; oversized BSON documents (strategies: fail/skip/split).
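The extended JSON handling for identifiers and binary values can be seen with the bson package that ships with pymongo. A minimal sketch:

from bson import Binary, ObjectId
from bson.json_util import dumps, CANONICAL_JSON_OPTIONS

doc = {"_id": ObjectId(), "payload": Binary(b"\x00\x01\x02")}

# Canonical (strict-mode) extended JSON preserves BSON types that plain
# JSON cannot express, e.g. {"_id": {"$oid": "..."}, "payload": {"$binary": ...}}.
print(dumps(doc, json_options=CANONICAL_JSON_OPTIONS))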
Setup steps:
  1. Ensure the DocumentDB cluster is active and change streams are enabled on the target databases/collections (DocumentDB disables them by default; enable via the modifyChangeStreams admin command).
  2. Create an IAM or database user with read permissions on the cluster and the streamkap_signal collection.
  3. Create the streamkap_signal collection for snapshots (it can live in a different database on the same instance).
  4. In the Streamkap UI: add the source, enter the connection string (e.g., mongodb://<user>:<pass>@<host>:27017/?ssl=true&replicaSet=rs0), databases/collections, snapshot mode, and array encoding.
  5. Allow Streamkap IPs in the VPC security group.
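A minimal pymongo sketch of steps 1-3, assuming hypothetical admin credentials, an inventory database to capture, and a streamkap database holding the signal collection (exact role grants may differ; DocumentDB users are created in the admin database):

from pymongo import MongoClient

client = MongoClient(
    "mongodb://admin_user:admin_pass@docdb-cluster.example.com:27017/?ssl=true&replicaSet=rs0"
)

# Step 1: enable change streams for all databases and collections
# (DocumentDB leaves them disabled by default).
client["admin"].command(
    {"modifyChangeStreams": 1, "database": "", "collection": "", "enable": True}
)

# Step 2: a least-privilege connector user; readWrite on the signal
# database is an assumption so snapshot watermarks can be written.
client["admin"].command(
    "createUser", "streamkap_user",
    pwd="choose-a-strong-password",
    roles=[
        {"role": "read", "db": "inventory"},
        {"role": "readWrite", "db": "streamkap"},
    ],
)

# Step 3: the signal collection used to coordinate ad-hoc snapshots.
client["streamkap"].create_collection("streamkap_signal")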
Monitoring: Use AWS CloudWatch for oplog size/lag and the Streamkap app for queue metrics. Best practice: retain the oplog for 7 days (minimum 48 hours) and alert on growth.
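DocumentDB exposes change stream retention (its oplog analogue) as the change_stream_log_retention_duration cluster parameter, in seconds. A hedged boto3 sketch, with the region and parameter-group name assumed:

import boto3

docdb = boto3.client("docdb", region_name="us-east-1")

# 604800 seconds = 7 days, the retention recommended above.
docdb.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="streamkap-docdb-params",  # illustrative name
    Parameters=[{
        "ParameterName": "change_stream_log_retention_duration",
        "ParameterValue": "604800",
        "ApplyMethod": "immediate",
    }],
)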
Limitations:
  • Standalone instances unsupported (requires replica set)
  • Oplog purging during downtime may lose events
  • BSON size limits (fail/skip/split; see the size pre-check sketch after this list)
  • No transactions pre-4.0
  • Sharded clusters need config server access
  • Incremental snapshots require stable _id (non-strings preferred)
  • UTF-8 only
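The oversized-BSON strategies only matter once a document approaches the 16 MB BSON limit. A minimal sketch for pre-checking document size with the bson package (the helper name is made up):

import bson

MAX_BSON_BYTES = 16 * 1024 * 1024  # DocumentDB/MongoDB document size limit

def exceeds_bson_limit(doc: dict) -> bool:
    # Encode to BSON and compare against the limit before relying on
    # the connector's fail/skip/split handling.
    return len(bson.encode(doc)) > MAX_BSON_BYTES

print(exceeds_bson_limit({"k": "v"}))  # False for a small document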
Delete handling: Streamkap captures deletes as events with before-images, and supports full records with pre-images (5.0+).
Security: Encrypted connections (SSL/TLS), IAM authentication, role-based access, and VPC security groups.
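A minimal pymongo sketch of an encrypted connection, assuming the Amazon CA bundle (global-bundle.pem) has been downloaded to an illustrative path:

from pymongo import MongoClient

client = MongoClient(
    "mongodb://streamkap_user:secret@docdb-cluster.example.com:27017/"
    "?replicaSet=rs0&readPreference=secondaryPreferred",
    tls=True,
    tlsCAFile="/opt/certs/global-bundle.pem",  # AWS CA bundle; path assumed
)
print(client.admin.command("ping"))  # verifies the TLS handshake and auth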
Troubleshooting:
  • Oplog Buildup: Monitor retention (AWS Console); resume from last position
  • Connection Failures: Verify VPC, SSL, IAM roles
  • Missing Events: Check include/exclude lists; ensure change streams enabled
  • Streamkap-Specific: Check logs for resume token issues; validate signal collection
No, CDC cannot capture views or most virtual database objects.
Why views cannot be captured:
CDC captures changes by reading the database transaction log (binlog, WAL, oplog, redo log, etc.). Views are query-time computations over base tables: when you query a view, the engine executes the underlying query against the base tables. Because views store no data, they never generate transaction log entries for CDC to read.
What cannot be captured:
  • Views: Virtual collections defined by aggregation pipelines, no physical storage or oplog entries
  • System Collections (system.*, admin.*, config.*): Metadata and internal state, not user data
  • Time Series Collections: Amazon DocumentDB does not natively support MongoDB 5.0+ time series collections. Where time-stamped data is stored with specialized or optimized storage, change streams may be limited or unavailable because those optimizations do not preserve document-level change granularity in the oplog. Solution: use regular collections with appropriate time-based indexes for CDC on time-stamped data.
  • On-Demand Materialized Views ($merge, $out results): Generated data, not original sources
Solution:
Configure CDC on the underlying base collections that power your views. The view logic can then be recreated in your destination or transformation layer.
DocumentDB-specific notes:
  • Aggregation pipelines on views: Capture the source collections and apply the pipeline logic downstream
  • Standalone instances: Not supported for CDC—must use a replica set configuration
Example:
If you have a view order_summary created from the orders collection with filters and projections, capture the orders collection instead, then apply the same aggregation logic in your destination.
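A minimal sketch of that pattern, with hypothetical filters and projections standing in for the order_summary view definition (in practice the pipeline would run in your destination or transformation layer over the captured orders data):

from pymongo import MongoClient

client = MongoClient(
    "mongodb://streamkap_user:secret@docdb-cluster.example.com:27017/?ssl=true&replicaSet=rs0"
)

# Capture the base collection "orders" with CDC; the view logic becomes
# an ordinary aggregation applied downstream.
order_summary_pipeline = [
    {"$match": {"status": "complete"}},                             # assumed view filter
    {"$project": {"customer_id": 1, "total": 1, "ordered_at": 1}},  # assumed projection
]

for row in client["inventory"]["orders"].aggregate(order_summary_pipeline):
    print(row)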
Best practices:
  • Use replica sets (min 3 nodes for production)
  • Enable pre/post-images for full updates (5.0+)
  • Limit collections to reduce load
  • Test snapshots in staging
  • Monitor via CloudWatch; set 7-day oplog retention