Batch vs Streaming Data
Traditional ETL is based on batch loading of data. You typically achieve this in one of two ways: a nightly job that runs one big query to extract all the data from your database and refresh your data warehouse, or polling your database on some periodic cadence, for instance every half hour or every hour, to pull only the new rows and load them into your data warehouse (the polling variant is sketched after the list below). Either way, there are three big downsides to this process:
Periodic spikes in load: These large extraction queries impact database latency and ultimately the user experience, which is why a lot of companies schedule them during low-traffic periods.
Network provisioning: Sending all that data in one burst puts a lot of strain on your network. Because the traffic arrives in big spikes, you have to provision your network to handle peak load during those batch transfers, which drives up network costs.
Delayed business decisions: Business decisions based on the data are delayed by your polling frequency. If you refresh your warehouse only once a night, you can't query today's activity until tomorrow.
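To make the polling approach concrete, here is a minimal sketch in Python. It assumes a PostgreSQL source reachable via psycopg2; the orders table, its updated_at column, the connection string, and the load_into_warehouse helper are hypothetical stand-ins for your own schema and loading logic.

```python
from datetime import datetime, timezone

import psycopg2  # assumes a PostgreSQL source database


def load_into_warehouse(rows):
    """Hypothetical helper: bulk-load the extracted rows into the warehouse."""
    ...


def poll_new_rows(last_run: datetime) -> datetime:
    """Extract rows changed since the previous poll and load them."""
    conn = psycopg2.connect("dbname=app user=etl")  # hypothetical DSN
    try:
        with conn.cursor() as cur:
            # High-water-mark query: pull only rows touched since the last poll.
            cur.execute(
                "SELECT * FROM orders WHERE updated_at > %s",
                (last_run,),
            )
            load_into_warehouse(cur.fetchall())
    finally:
        conn.close()
    return datetime.now(timezone.utc)


# Typically scheduled (e.g., by cron) every 30 to 60 minutes; each run produces
# the spike in database load and network traffic described above.
```

Each run is a burst of work on the source database and the network, and anything written between runs is invisible to the warehouse until the next poll.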
Using change data capture (CDC) to stream data from your primary database to your data warehouse solves these three problems for the following reasons:
CDC does not require you to execute high-load queries on a periodic basis, so you don't get spiky load behavior. You may need to raise your database's transaction-log level so that the log contains enough detail to be scanned for changes, which can add marginal performance overhead.
Because the data is sent continuously and in much smaller batches, you don't need to provision as much network capacity to make that work, and you save money on network costs.
Because you're continuously streaming data from your database to your data warehouse, the data in your warehouse stays up to date, allowing you to create real-time insights and giving you a leg up on your competitors because you're making business decisions on fresher data.
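As a rough illustration of the streaming side, here is a minimal consumer sketch in Python. It assumes a Debezium-style CDC pipeline that publishes row-level change events from the primary database to a Kafka topic and that the kafka-python client is available; the topic name, the event envelope fields, and the apply_change_to_warehouse helper are assumptions rather than part of any particular setup.

```python
import json

from kafka import KafkaConsumer  # assumes the kafka-python client


def apply_change_to_warehouse(change: dict) -> None:
    """Hypothetical helper: upsert or delete a single row in the warehouse."""
    ...


# Assumed Debezium-style topic carrying changes for the orders table.
consumer = KafkaConsumer(
    "app.public.orders",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

# Each message is one small change event read from the transaction log, so
# database load and network usage stay smooth instead of spiking.
for message in consumer:
    if message.value is None:  # skip tombstone messages
        continue
    event = message.value["payload"]  # envelope shape depends on converter config
    if event["op"] in ("c", "u", "r"):  # create, update, or snapshot read
        apply_change_to_warehouse(event["after"])
    elif event["op"] == "d":  # delete
        apply_change_to_warehouse(event["before"])
```

Because each event is applied as it arrives, the warehouse lags the primary database by seconds rather than by a full polling interval.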