
MongoDB Sources FAQ for Streamkap

This FAQ focuses on using MongoDB as a source in Streamkap, including general self-hosted setups and cloud variants (MongoDB Atlas, Amazon DocumentDB). Streamkap’s MongoDB connector provides real-time CDC with managed features like automatic scaling, UI setup, and ETL transformations.
A MongoDB source in Streamkap enables real-time Change Data Capture (CDC) from MongoDB databases, capturing document-level inserts, updates, and deletes with sub-second latency. It uses change streams to stream changes to destinations, supporting snapshots for initial loads, schema evolution, and handling for nested data. Streamkap offers a serverless setup via UI or API.
  • MongoDB 4.0+ for basic CDC; 6.0+ for advanced features like post-images and full document lookups.
  • Compatible with MongoDB 3.6+ in some modes.
Streamkap supports:
  • Self-hosted (on-prem/VM).
  • MongoDB Atlas.
  • Amazon DocumentDB (MongoDB-compatible).
  • Streamkap also supports sharded clusters and replica sets, with automatic handling of shard additions/removals and membership changes.
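Connection details for these deployments are typically supplied as a connection string. A hedged sketch of a Debezium-style connector configuration fragment (the SRV host and credentials below are placeholders, and Streamkap's UI fields may name these properties differently):

```json
{
  "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
  "mongodb.connection.string": "mongodb+srv://user:pass@cluster0.example.mongodb.net/?replicaSet=rs0",
  "topic.prefix": "mongo"
}
```

For Atlas the `mongodb+srv://` scheme resolves the replica set members automatically; self-hosted setups can list hosts explicitly instead.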
  • CDC: Change streams for inserts/updates/deletes; oplog-backed resume tokens for position tracking.
  • Snapshots: Ad-hoc/initial backfills using incremental or blocking methods; phased chunking for minimal impact.
  • Schema Evolution: Automatic handling of document changes; field renaming/exclusion.
  • Data Types: Supports integers, floats, strings, dates, arrays, objects, binary (configurable), JSON; extended JSON for identifiers.
  • Ingestion Modes: Inserts (append) or upserts.
  • Security: SSL, authentication, access control.
  • Monitoring: Latency, lag, queue sizes in-app; heartbeat messages.
  • Streamkap adds transaction metadata, filtering by collections/fields, and aggregation pipelines.
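Filtering by collection or field maps to connector configuration. A minimal sketch using Debezium-style property names (the `shop.orders` / `shop.customers.ssn` names are illustrative):

```json
{
  "collection.include.list": "shop.orders,shop.customers",
  "field.exclude.list": "shop.customers.ssn"
}
```

Include/exclude lists take regex-style patterns in the form `database.collection` (and `database.collection.field` for field exclusion).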
Streamkap uses MongoDB change streams to capture and decode oplog data, emitting changes as events. It starts from the last recorded transaction, performs a snapshot if needed, then streams from the oplog position. Supports full document updates with pre/post-images (MongoDB 6.0+).
  • Trigger ad-hoc snapshots at the source or collection level.
    Methods: Incremental (phased, chunked by ID, default 1024 documents) or blocking (stops streaming temporarily).
    Uses watermarking for progress; supports partial snapshots via conditions.
  • Modes: initial (default), always, initial_only, no_data, when_needed, configuration_based, custom.
    Streamkap simplifies triggering via UI.
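The phased, chunked-by-ID approach above can be sketched in a few lines of Python. This is an illustrative model only: a plain list stands in for a MongoDB collection, and the 1024-document chunk size mirrors the default mentioned above.

```python
# Minimal sketch of incremental-snapshot chunking: order documents by _id,
# then emit fixed-size chunks bounded by low/high watermarks so streaming
# can continue between chunks.
CHUNK_SIZE = 1024

def snapshot_chunks(docs, chunk_size=CHUNK_SIZE):
    """Yield (low_watermark, high_watermark, chunk) tuples, chunked by _id."""
    ordered = sorted(docs, key=lambda d: d["_id"])
    for start in range(0, len(ordered), chunk_size):
        chunk = ordered[start:start + chunk_size]
        yield chunk[0]["_id"], chunk[-1]["_id"], chunk

docs = [{"_id": i, "v": i * 2} for i in range(2500)]
chunks = list(snapshot_chunks(docs))
# 2500 documents split into chunks of 1024, 1024, and 452
```

The watermarks let the connector deduplicate a chunk against any live change events that arrive for the same ID range while the chunk is being read.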
  • Basics: Integers (INT32/64), floats (FLOAT32/64), strings, dates/timestamps.
  • Advanced: Arrays, objects (STRUCT/Tuple), binary (BYTES/base64/hex), decimals, JSON (STRING/io.debezium.data.Json).
  • Identifiers: Integer, float, string, document, ObjectId, binary (extended JSON strict mode).
  • Unsupported: Inconsistent nested structures without preprocessing; non-UTF8; oversized BSON (strategies: fail/skip/split in 6.0.9+).
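The fail/skip/split strategies for oversized BSON map to a connector setting. A hedged sketch, assuming Debezium's property naming for oversize handling (verify the exact name against your connector version; `split` additionally requires MongoDB 6.0.9+ server support):

```json
{
  "oversize.handling.mode": "split"
}
```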
Query the server for oplog size and window (e.g. rs.printReplicationInfo()); use tools like Datadog or New Relic for lag and queue metrics.
Best Practices: Retain the oplog for 3–5 days; alert on oplog growth.
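The retention guideline above can be checked programmatically. A minimal sketch, assuming you have the timestamps of the first and last oplog entries (e.g. from rs.printReplicationInfo()); all numbers here are illustrative, not real server output:

```python
# Estimate the oplog retention window and flag when it drops below the
# lower bound of the 3-to-5-day guideline.
from datetime import datetime, timedelta

def oplog_window_hours(first_entry_ts: datetime, last_entry_ts: datetime) -> float:
    """Return the oplog retention window in hours."""
    return (last_entry_ts - first_entry_ts).total_seconds() / 3600

def should_alert(window_hours: float, min_days: float = 3.0) -> bool:
    """Alert when the window falls below min_days (default 3 days = 72 hours)."""
    return window_hours < min_days * 24

first = datetime(2024, 1, 1, 0, 0)
last = first + timedelta(days=2, hours=12)   # a 60-hour window
hours = oplog_window_hours(first, last)
# 60.0 hours is below the 72-hour threshold, so this would alert
```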
  • Standalone servers unsupported (convert to replica set)
  • Oplog purging during downtime may lose events
  • BSON size limits (fail/skip/split)
  • No transactions in older versions
  • General: Sharded clusters need careful config; incremental snapshots require stable primary keys (non-strings recommended)
Captures deletes as events with before images; supports full records with pre-images.
Encrypted connections (SSL), keystore/truststore, authentication; role-based access.
  • Oplog Buildup: Monitor retention; resume from last position.
  • Connection Failures: Verify firewall, SSL, authentication.
  • Missing Events: Check include/exclude lists; resnapshot.
  • Streamkap-Specific: Check logs for resume token issues.
No, CDC cannot capture Views or most virtual database objects.
Why Views cannot be captured:
CDC captures changes by reading the database transaction log (binlog, WAL, oplog, redo log, etc.). Views are query-time computations over base tables: they don’t store data, so they never generate transaction log entries. When you query a view, the database engine simply executes the underlying SELECT statement (or aggregation pipeline) against the base tables.
What cannot be captured:
  • Views: Virtual collections defined by aggregation pipelines, no physical storage or oplog entries
  • System Collections (system.*, admin.*, config.*): Metadata and internal state, not user data
  • Time Series Collections (MongoDB 5.0+): These use a specialized internal storage format optimized for time-stamped data (IoT sensors, metrics, logs). Change streams are NOT supported on time series collections because the bucketing storage engine compresses and reorganizes documents internally, making individual document change tracking incompatible with the oplog-based change stream mechanism. Solution: If you need CDC on time series data, store it in regular collections (with appropriate indexes on timestamp fields), or use aggregation pipelines to periodically export data snapshots.
  • On-Demand Materialized Views ($merge, $out results): Generated data, not original sources
What CAN be captured but with caveats:
  • Capped Collections: These fixed-size collections automatically overwrite the oldest documents when full. CDC can capture them, but events can be missed: if the connector is offline or falls behind and the oplog position it needs to resume from is overwritten by newer operations, those change events are lost permanently. Capped collections are typically used for high-throughput logs where some loss is acceptable. Recommendation: For mission-critical data requiring guaranteed capture, use regular collections instead.
Solution:
Configure CDC on the underlying base collections that power your views. The view logic can be recreated in your destination or transformation layer.
MongoDB-specific notes:
  • Aggregation pipelines on views: Capture the source collections and apply the pipeline logic downstream
  • Standalone MongoDB servers: Not supported for CDC at all—must use a replica set (minimum 1 member)
  • Time Series Collections (MongoDB 5.0+): As noted above, the internal bucketing mechanism groups documents by time window and metadata, then compresses them, so individual document inserts/updates/deletes are not tracked in the oplog. Change streams require document-level granularity, which time series collections don’t provide. Workaround: Use regular collections with compound indexes on {metadata_field: 1, timestamp: 1} for CDC-compatible time series data.
Example:
If you have a view active_users_summary created from the users collection with filters and projections, capture the users collection instead, then apply the same aggregation logic in your destination.
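The "apply the view logic downstream" step can be sketched with plain Python. This is an illustrative model: the field names, the `active` filter, and the event shape (a change event carrying a `fullDocument`) are made-up stand-ins for whatever active_users_summary actually selects and projects.

```python
# Recreate a view's $match + $project stages over captured change events
# from the base collection, instead of trying to capture the view itself.
def to_summary(event):
    """Map a users change event to the view's shape, or None if filtered out."""
    doc = event["fullDocument"]
    if not doc.get("active"):
        return None                                   # the view's $match stage
    return {"_id": doc["_id"], "name": doc["name"]}   # the view's $project stage

events = [
    {"operationType": "insert",
     "fullDocument": {"_id": 1, "name": "ada", "active": True, "pw": "x"}},
    {"operationType": "insert",
     "fullDocument": {"_id": 2, "name": "bob", "active": False, "pw": "y"}},
]
summaries = [s for e in events if (s := to_summary(e)) is not None]
# Only the active user survives, with the sensitive field projected away.
```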
  • Use replica sets (min 3 members for production)
  • Enable pre/post-images for full updates
  • Limit collections to needed ones
  • Test snapshots in staging
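Pre/post-images are enabled per collection on MongoDB 6.0+ via collMod. A minimal sketch of the command document (the collection name "users" is an example):

```json
{
  "collMod": "users",
  "changeStreamPreAndPostImages": { "enabled": true }
}
```

Run it with db.runCommand(...) on each collection whose updates and deletes should carry full before/after images.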