Batch vs Streaming

Traditional ETL is built around batch loading: data is extracted from the source in bulk on a schedule.

You might achieve this either with a nightly job that runs a single query to extract all the data from your database and refresh your data warehouse, or by polling your database on some periodic cadence (for instance, every 60 minutes) to pick up new rows and load them into your data warehouse.
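As a rough illustration, here is a minimal sketch of the polling approach, assuming a source table named orders with id, total, and updated_at columns; the table, the columns, and the load_to_warehouse() helper are hypothetical stand-ins, not anything from a specific system:

```python
# A minimal sketch of periodic batch extraction using a timestamp
# high-water mark. Table and column names are illustrative assumptions.
import sqlite3
import time

POLL_INTERVAL_SECONDS = 60 * 60  # e.g. poll every 60 minutes


def load_to_warehouse(rows):
    # Stand-in for the actual warehouse load (bulk insert, COPY, etc.).
    print(f"loaded {len(rows)} rows into the warehouse")


def poll_forever(db_path: str):
    watermark = "1970-01-01 00:00:00"  # last extracted updated_at value
    while True:
        conn = sqlite3.connect(db_path)
        try:
            # One large query per cycle -- this is the periodic load spike.
            rows = conn.execute(
                "SELECT id, total, updated_at FROM orders "
                "WHERE updated_at > ? ORDER BY updated_at",
                (watermark,),
            ).fetchall()
        finally:
            conn.close()
        if rows:
            load_to_warehouse(rows)
            watermark = rows[-1][2]  # advance the high-water mark
        time.sleep(POLL_INTERVAL_SECONDS)
```

Everything between two polls lands in one burst, which is exactly what produces the problems below.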

There are at least three big downsides:

  • Periodic spikes in load: These large extraction queries compete with your application’s own queries, hurting latency and ultimately the user experience, which is why many companies schedule these jobs during low-traffic periods, e.g. overnight.
  • Network provisioning: Sending all that data at once puts a lot of strain on your network. Because the bytes you send arrive in large bursts, you have to provision your network for the peaks rather than the average.
  • Delayed business decisions: Decisions based on the data lag behind by your job frequency. If you refresh your warehouse every night, today’s activity isn’t queryable until tomorrow.

This is where streaming has a big advantage!

Using change data capture (CDC) to stream data from your database to your data warehouse solves all three problems (see the sketch after this list):

  • No load spikes: CDC does not require executing high-load queries on a periodic basis, so the load on your source database stays smooth rather than spiky.
  • Lower network costs: Because the data is sent continuously and in much smaller batches, you don’t need to provision as much network capacity, and you save money on network costs.
  • Fresh data: Because changes stream continuously from your database to your data warehouse, the data in your warehouse is always up to date, allowing you to create real-time insights and giving you a leg up on your competitors because you’re making business decisions on fresher data.
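To make the streaming side concrete, here is a minimal sketch of a CDC consumer, assuming change events arrive as JSON records with "op", "table", "key", and "row" fields; the event shape and the in-memory warehouse stand-in are illustrative assumptions, not any particular CDC tool’s format:

```python
# A minimal sketch of applying CDC events, one small change at a time,
# instead of one large periodic query. The event format is an assumption.
import json


def apply_event(warehouse: dict, event: dict) -> None:
    """Apply a single change event to an in-memory stand-in for the warehouse."""
    table = warehouse.setdefault(event["table"], {})
    if event["op"] in ("insert", "update"):
        table[event["key"]] = event["row"]   # upsert the changed row
    elif event["op"] == "delete":
        table.pop(event["key"], None)        # remove the deleted row


def consume(stream, warehouse: dict) -> None:
    # Events arrive continuously, so each unit of work is tiny and the
    # warehouse is always current up to the last event processed.
    for line in stream:
        apply_event(warehouse, json.loads(line))


# Usage with a few example events standing in for a real change stream.
events = [
    '{"op": "insert", "table": "orders", "key": 1, "row": {"total": 42}}',
    '{"op": "update", "table": "orders", "key": 1, "row": {"total": 50}}',
    '{"op": "delete", "table": "orders", "key": 1, "row": null}',
]
warehouse = {}
consume(events, warehouse)
print(warehouse)  # {'orders': {}} -- inserted, updated, then deleted
```

Because each event carries a single changed row, the work spreads evenly over time: no load spikes on the source, small steady network traffic, and a warehouse that is queryable the moment a change lands.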