A data pipeline is effectively an automated assembly line for taking raw data from different sources, processing it so it becomes usable, and moving it to where it will be used.
Why do you need data pipelines?
For the same reason you can’t take a hunk of metal and say, “Okay, now you’re a car! Get into drive!”
A car may be made out of a hunk of metal, but it needs to undergo some heavy processing before you can use it as a car.
The data that powers your predictive analytics, data visualizations or machine learning models may be made out of the raw data that appears in your XML files, spreadsheets and SaaS platform databases – but it, too, needs to undergo some serious processing before you can use it.
And just like it would be a pain in the neck to have to go all the way to Detroit to purchase a car, you want your usable data to be transported to where you plan on using it, without having to do that manually.
Enter data pipelines.
A data pipeline is any automated, linked set of processes that ingests data from a source system, performs the processing needed to make the data usable for an intended application, and then loads it into the target system to be used.
Data pipeline tools enable users to define and manage automated data pipelines for their data environment.
What types of automated data pipelines are there?
There are two distinct categories of automated data pipelines:
- Batch processing pipelines, normally used for historical data
- Streaming pipelines, normally used for real-time data
Batch processing data pipelines are usually based on ETL (extract, transform, load) processes. They handle large quantities of data at scheduled intervals, which means high latency (the target system is only as fresh as the last run) but also high reliability. At each scheduled run, these pipelines take everything that is in the source datasets, perform the specified transformations, and load the results into the target system.
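As a rough illustration of the batch pattern, here is a minimal Python sketch of one ETL run. Everything in it is an assumption made for the example: the `RAW_CSV` sample data, the `orders` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the target system. A real pipeline would be triggered by a scheduler rather than called by hand.

```python
import csv
import io
import sqlite3

# Hypothetical raw export standing in for a source dataset.
RAW_CSV = """order_id,amount,currency
1,19.99,usd
2,5.00,usd
3,,usd
"""


def extract(raw_text):
    """Read every row currently in the source (a full snapshot)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Drop rows with no amount, then normalize types and casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # unusable record: no amount
        cleaned.append(
            (int(row["order_id"]), float(row["amount"]), row["currency"].upper())
        )
    return cleaned


def load(conn, records):
    """Write the transformed batch into the (hypothetical) target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_batch(conn, raw_text):
    """One scheduled run: extract everything, transform it, load it."""
    records = transform(extract(raw_text))
    load(conn, records)
    return len(records)


if __name__ == "__main__":
    target = sqlite3.connect(":memory:")
    print(run_batch(target, RAW_CSV), "rows loaded")
```

Because the load step uses `INSERT OR REPLACE` keyed on `order_id`, rerunning the same batch does not create duplicate rows. That is one simple way to keep scheduled runs repeatable.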
Streaming data pipelines are usually based on CDC (change data capture). These pipelines pick up any change in the source data as it occurs and classify it as an event. Events are grouped into topics, which are continually streamed out to the target systems that are defined as recipients of those topics. CDC pipelines have lower latency than ETL pipelines, but they are generally considered less reliable.
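The event-and-topic flow can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular CDC product works: `ChangeEvent`, `Broker`, `SourceTable`, and `TargetReplica` are hypothetical names, and the source here emits events directly from its write methods, whereas production CDC tools typically detect changes another way, for example by reading the database's transaction log.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    topic: str             # here, the name of the source table
    op: str                # "insert", "update", or "delete"
    key: str
    value: Optional[dict]  # None for deletes


class Broker:
    """Routes each event to the targets subscribed to its topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, event):
        for handler in self._subscribers[event.topic]:
            handler(event)


class SourceTable:
    """Toy source system that emits an event for every change."""

    def __init__(self, name, broker):
        self.name = name
        self._rows = {}
        self._broker = broker

    def upsert(self, key, value):
        op = "update" if key in self._rows else "insert"
        self._rows[key] = value
        self._broker.publish(ChangeEvent(self.name, op, key, value))

    def delete(self, key):
        if self._rows.pop(key, None) is not None:
            self._broker.publish(ChangeEvent(self.name, "delete", key, None))


class TargetReplica:
    """Toy target system that applies events as they arrive."""

    def __init__(self):
        self.rows = {}

    def apply(self, event):
        if event.op == "delete":
            self.rows.pop(event.key, None)
        else:
            self.rows[event.key] = event.value


if __name__ == "__main__":
    broker = Broker()
    replica = TargetReplica()
    broker.subscribe("customers", replica.apply)

    customers = SourceTable("customers", broker)
    customers.upsert("c1", {"name": "Ada"})
    customers.upsert("c1", {"name": "Ada L."})
    customers.delete("c1")
    print(replica.rows)
```

Each change reaches the target as soon as it happens, which is where the low latency comes from. The sketch has no retries, ordering guarantees, or persistence, so an event lost in transit would simply be missing from the target.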
What kinds of processes can be part of the data pipeline architecture?
Data pipeline processes include:
- Aligning with target system schema and structure
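To make schema alignment concrete, here is a small Python sketch. The `TARGET_SCHEMA` columns, the `FIELD_MAP` renames, and the `align` function are all hypothetical, chosen only for illustration; real targets define their own schemas and rules for handling missing or invalid values.

```python
# Hypothetical target schema: column name -> type used to coerce values.
TARGET_SCHEMA = {"customer_id": int, "full_name": str, "signup_year": int}

# Hypothetical mapping from source field names to target column names.
FIELD_MAP = {"id": "customer_id", "name": "full_name", "year": "signup_year"}


def align(record):
    """Rename source fields, coerce types, and fill missing columns with None."""
    # Keep only the fields the target knows about, under their target names.
    renamed = {FIELD_MAP[k]: v for k, v in record.items() if k in FIELD_MAP}

    aligned = {}
    for column, caster in TARGET_SCHEMA.items():
        raw = renamed.get(column)
        aligned[column] = caster(raw) if raw not in (None, "") else None
    return aligned


if __name__ == "__main__":
    print(align({"id": "7", "name": "Ada", "extra": "ignored"}))
```

Fields the target does not expect are dropped, and columns the source did not supply come through as `None`, so every output record has exactly the shape the target expects.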
Not all data pipelines need to process the data in between ingesting and moving it. Many cloud data pipelines are ELT (extract, load, transform), as opposed to ETL pipelines. A cloud data pipeline like this first moves the ingested data into the target cloud-based repository, then conducts all the necessary processing there.
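The ELT ordering can be sketched in Python as follows. This is only an illustration under assumptions: an in-memory SQLite database stands in for the cloud-based repository, and the `raw_payments` and `spend_by_customer` tables, the sample rows, and the function names are hypothetical.

```python
import sqlite3

# Hypothetical ingested rows, kept exactly as they arrived (amounts are text).
RAW_ROWS = [("ada", "19.99"), ("bob", "5.00"), ("ada", "0.01")]


def load_raw(conn, rows):
    """Load step: land the data in the target untouched."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_payments (customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_payments VALUES (?, ?)", rows)
    conn.commit()


def transform_in_target(conn):
    """Transform step: run inside the target, using its own SQL engine."""
    conn.execute("DROP TABLE IF EXISTS spend_by_customer")
    conn.execute(
        """
        CREATE TABLE spend_by_customer AS
        SELECT customer, ROUND(SUM(CAST(amount AS REAL)), 2) AS total
        FROM raw_payments
        GROUP BY customer
        """
    )
    conn.commit()


if __name__ == "__main__":
    target = sqlite3.connect(":memory:")
    load_raw(target, RAW_ROWS)       # extract + load first
    transform_in_target(target)      # transform afterwards, in place
    print(target.execute("SELECT * FROM spend_by_customer ORDER BY customer").fetchall())
```

The raw table stays in the target after the transformation runs, so a different transformation can be applied later without ingesting the data again.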