Data has become the norm. It’s literally everywhere, but are we really able to fully understand what our data is telling us? Are we even seeing the whole picture?
Probably not. Unless you’re using a tool that provides automated data lineage, the story you’re able to glean is, well, let’s just say incomplete. Why? Keep reading to find out.
Let’s Define Data Lineage
What is data lineage? We can define data lineage as the data lifecycle, or the data journey. This lifecycle includes where the data originates, how it moves from point to point, and where it is today. Through data lineage, organizations can better understand what happens to data as it travels through different pipelines (ETL, files, reports, databases, etc.) and therefore make more informed business decisions. Data lineage also enables companies to trace the sources of specific business data in order to track down errors, change processes, and carry out system migrations. That saves significant time and resources and improves BI efficiency.
How Do We Get Data Lineage?
We’ve already established just how important it is to understand the origin and flow of one’s data, but how do organizations get this information? Well, that’s the problem: BI teams today tend to have to map out data lineage manually, since they are usually dealing with multi-vendor environments. For example, a team might use Informatica for ETL, Oracle for the data warehouse and Tableau for reporting, each with its own metadata labeling system managed by a different team. In that setup, figuring out where a specific data element in a Tableau report came from can be impossible. And if not impossible, then you can bet it’ll take the data analysts a LONG time to figure out.
What Does Data Lineage Look Like?
The image at the top of this post shows how Octopai presents and compares the lineage of two different reports (either from the same system or from different systems). The comparison highlights the differences between the reports and enables users to quickly understand exactly how two or more reports ended up being different. Specifically, we see that the report on the bottom of the screen has an additional ETL process and table that are missing from the report on the top. This is the point at which the two reports began to deviate.
Data lineage is a visual representation of the overall flow of data and provides a look at how data is manipulated via the ETL process so that organizations can assess the quality of their data before it is loaded into an analytics tool. Data lineage visualization is an overview and a journey map of our data.
How Metadata Fits Into Automated Data Lineage
Not surprisingly, just as metadata’s role in the larger data governance realm has become central, metadata and metadata lineage (the metadata lifecycle) are also key players when it comes to data lineage.
Data lineage is the visual representation of the data journey, but the data presented in the lineage must first be located and verified. This is done via none other than our dear friend, metadata. Indeed, metadata and lineage are intertwined: it is by way of metadata that we are able to find all the data items related to a specific report or ETL process, see all of their dependencies and trace their entire lifecycle. In short, metadata is to data lineage what wheels are to a car. Metadata is what makes it possible to have data lineage, and the demand for tools to manage metadata is growing rapidly.
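To make the idea concrete, here is a minimal sketch of how lineage can be traced once metadata has been collected. It assumes the metadata boils down to simple "source feeds target" pairs. The item names, the `EDGES` list and both functions are hypothetical and do not reflect any particular vendor's format or API.

```python
from collections import defaultdict

# Hypothetical metadata: (source, target) pairs harvested from each tool.
EDGES = [
    ("oracle.sales_raw", "etl.load_sales"),
    ("etl.load_sales", "dw.fact_sales"),
    ("dw.dim_customer", "tableau.revenue_report"),
    ("dw.fact_sales", "tableau.revenue_report"),
]


def build_upstream_index(edges):
    """Map each item to the set of items that feed it directly."""
    upstream = defaultdict(set)
    for source, target in edges:
        upstream[target].add(source)
    return upstream


def trace_upstream(item, upstream):
    """Return every item that feeds `item`, directly or indirectly."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in upstream.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


upstream_index = build_upstream_index(EDGES)
lineage = trace_upstream("tableau.revenue_report", upstream_index)
print(sorted(lineage))
# ['dw.dim_customer', 'dw.fact_sales', 'etl.load_sales', 'oracle.sales_raw']
```

In a real multi-vendor environment the hard part is collecting and reconciling those pairs across tools. The traversal itself is the easy bit.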
Some Data Lineage Use Cases
Regulations. When it comes to regulatory compliance, whether it’s GDPR, the California Consumer Privacy Act (CCPA) going into effect in 2020 or any of the numerous personal data protection laws on the horizon, you need a better grasp of your data, and to get it you must have a data lineage tool in place. You absolutely must know where every single piece of data originated, and you must have a clear understanding of every change it underwent and every stop it made along the way. Knowing the entire story behind each and every data item is a clear case of ‘knowledge is power,’ as the more information an organization has, the smarter and more able it becomes, and the more control it has over its data. This is key for compliance.
System Migration. Migrating seamlessly from a legacy BI tool to a modern one requires advanced data lineage as well. A data lineage tool can show which reports or ETL processes are duplicates, and which rely on data sources that are obsolete, questionable or non-existent. With this information, you can reduce the number of reports and ETL processes that you migrate. Data lineage visualization not only reduces time, effort and error in the migration process but also enables faster execution of the migration project.
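As a rough illustration of the migration case, the sketch below flags candidate duplicates and reports that depend on obsolete sources. It assumes the set of upstream sources for each report is already known. The report names, `REPORT_SOURCES` and `OBSOLETE` are made up for the example, and identical source sets only suggest a duplicate, so a person should still confirm before dropping anything.

```python
from collections import defaultdict

# Hypothetical lineage summary: each report mapped to its upstream sources.
REPORT_SOURCES = {
    "sales_weekly": {"dw.fact_sales", "dw.dim_date"},
    "sales_weekly_copy": {"dw.fact_sales", "dw.dim_date"},
    "inventory": {"dw.fact_stock", "legacy.stock_2009"},
}

# Hypothetical list of sources that will not exist after the migration.
OBSOLETE = {"legacy.stock_2009"}


def find_duplicate_candidates(report_sources):
    """Group reports that draw on exactly the same set of sources."""
    groups = defaultdict(list)
    for report, sources in report_sources.items():
        groups[frozenset(sources)].append(report)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def find_obsolete_dependents(report_sources, obsolete):
    """Return reports that rely on at least one obsolete source."""
    return {
        report: sorted(sources & obsolete)
        for report, sources in report_sources.items()
        if sources & obsolete
    }


print(find_duplicate_candidates(REPORT_SOURCES))
# [['sales_weekly', 'sales_weekly_copy']]
print(find_obsolete_dependents(REPORT_SOURCES, OBSOLETE))
# {'inventory': ['legacy.stock_2009']}
```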
Reporting Errors. If the sales team is claiming a deal flow that simply doesn’t align with the finance department’s numbers, then the BI Manager is going to hear about it. The BI team has to find out why the sales numbers are different from the finance numbers, and they do this with data lineage. Automated data lineage enables them to visualize the entire data flow, pinpoint the root cause and assess the impact in just a few moments.
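The sales-versus-finance scenario can be sketched as a simple comparison of two lineages. This assumes each report's lineage has already been reduced to a set of item names. The names below and the `compare_lineage` helper are hypothetical, and a real tool would show the same information visually rather than as sets.

```python
# Hypothetical lineage of the two conflicting reports.
sales_lineage = {"crm.deals", "etl.load_deals", "dw.fact_deals"}
finance_lineage = {
    "crm.deals",
    "etl.load_deals",
    "etl.exclude_unsigned",
    "dw.fact_deals",
}


def compare_lineage(first, second):
    """Split two lineages into shared items and items unique to each side."""
    return {
        "shared": first & second,
        "only_in_first": first - second,
        "only_in_second": second - first,
    }


diff = compare_lineage(sales_lineage, finance_lineage)
print(sorted(diff["only_in_second"]))
# ['etl.exclude_unsigned']
```

Here the extra ETL step in the finance lineage is the first place to look when explaining why the two numbers disagree.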
Updated December 2019.