In the world today, data has become the norm. It’s truly everywhere, but are we able to fully comprehend what our data is telling us? Are we even seeing the whole picture?
Probably not. Unless you’re using a tool that provides automated data lineage, the story you’re able to understand is, well, let’s just say incomplete. Why is that? Keep reading to find out.
In this article, you will learn:
- The definition of data lineage
- How data lineage is generated
- What data lineage looks like
- How metadata fits into automated data lineage
- Data lineage use cases
Data Lineage Definition
We can define data lineage as data’s life cycle, or its full journey. This life cycle covers where the data originates, how it got from point A to point B, and where it exists today.
By utilizing data lineage, organizations can better understand what happens to data as it travels through different pipelines (ETL, files, reports, databases, etc.). During its journey, data interacts with other pieces of information, is transformed, and is used in various reports. Understanding that journey allows businesses to make more informed decisions. It also lets companies trace the sources of specific business data in order to track down errors, implement process changes, and streamline system migrations. This saves organizations significant time and resources, which improves BI efficiency and speeds up time to insights. Without understanding their data lineage, companies can’t predict the impact a given change might have on reports or ETL processes elsewhere in the data environment.
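To make the impact-analysis idea concrete, here is a minimal sketch that treats lineage as a list of source-to-target edges and walks downstream from one asset. The asset names and the `downstream` helper are hypothetical and only illustrate the concept; they are not taken from any particular lineage tool.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (source -> target); names are illustrative only.
EDGES = [
    ("crm.orders", "etl.load_orders"),
    ("etl.load_orders", "dwh.fact_sales"),
    ("dwh.fact_sales", "report.monthly_sales"),
    ("dwh.fact_sales", "report.sales_forecast"),
    ("erp.invoices", "etl.load_invoices"),
    ("etl.load_invoices", "dwh.fact_finance"),
    ("dwh.fact_finance", "report.finance_summary"),
]


def downstream(edges, start):
    """Return every asset reachable from `start`, i.e. what a change could affect."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)

    seen, queue = set(), deque([start])
    while queue:  # breadth-first walk along the data flow
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


impacted = downstream(EDGES, "crm.orders")
print(impacted)
```

In this toy environment, a change to `crm.orders` touches the sales ETL process, the sales fact table and both sales reports, while the finance report is unaffected.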
How Is It Generated?
Now that we have established just how important it is to understand the origin and flow of a company’s data, you may be wondering how organizations can obtain this information.
Well, that’s the problem: BI teams today often have to map out lineage manually, because they are usually dealing with multi-vendor environments. For example, if they’ve got Informatica (ETL), Oracle (DWH) and Tableau (reporting), each has its own metadata labeling system, managed by a different team. Figuring out where a specific data element in a Tableau report came from can be impossible, and even when it is possible, you can bet it will take data analysts a really long time.
Instead of manually compiling all of the data sources, companies can utilize an automated data lineage tool. This allows data teams to receive relevant information from all of their different data sources, instantly.
What Does Data Lineage Look Like?
Data lineage is a visual representation of the overall flow of data: a visualization of the journey that different data points take. It shows how data is manipulated during the ETL process, which allows organizations to assess the quality of their data before it is loaded into an analytics tool.
If you are still unable to picture this, refer to the image at the top of this post. It shows how Octopai’s automated BI intelligence platform compares and presents the lineage of two different reports (either from the same system or different systems – in this case it’s SSRS and Power BI).
This clearly illustrates any variations between the reports and enables users to quickly understand exactly how two or more reports ended up showing differing results. Specifically, we see that an additional ETL process and table appear in the lineage of the report at the bottom of the picture, and both are missing from the report at the top. This is the point where the two reports began to diverge.
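The comparison described above can be sketched in a few lines: collect everything upstream of each report, then take the set difference. This is only an illustration under assumed names (the edges, asset names and `upstream` helper are hypothetical), not a description of how Octopai’s platform works internally.

```python
# Hypothetical lineage edges (source -> target) for two reports that share
# a warehouse table but diverge afterwards. All names are illustrative.
EDGES = [
    ("src.sales", "etl.load_sales"),
    ("etl.load_sales", "dwh.sales"),
    ("dwh.sales", "ssrs.revenue_report"),
    ("dwh.sales", "etl.adjust_returns"),
    ("etl.adjust_returns", "dwh.sales_adjusted"),
    ("dwh.sales_adjusted", "powerbi.revenue_report"),
]


def upstream(edges, target):
    """Return the set of all assets that feed `target`, directly or indirectly."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, []).append(src)

    seen, stack = set(), [target]
    while stack:  # depth-first walk against the data flow
        node = stack.pop()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


ssrs_lineage = upstream(EDGES, "ssrs.revenue_report")
powerbi_lineage = upstream(EDGES, "powerbi.revenue_report")

# Assets that feed only the Power BI report: the likely point of divergence.
only_in_powerbi = powerbi_lineage - ssrs_lineage
print(sorted(only_in_powerbi))
```

Here the difference consists of one extra ETL process and one extra table, which mirrors the kind of divergence described above.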
How Metadata Fits Into The Process
Whereas data lineage is the visual representation of how the data flows throughout various systems, the actual data presented in the lineage must first be located and verified. This is done through metadata management.
Metadata is essentially the information about a company’s different assets and the relationships between them. With metadata, we are able to find all data items related to a specific report or ETL process, see all of its dependencies, and then trace its entire life cycle.
In short, metadata is to data lineage what wheels are to a car. Metadata is what makes it possible to have data lineage, and the demand for tools to effectively manage metadata is growing rapidly.
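As a rough illustration of metadata as "assets plus relationships", the sketch below stores one record per asset and follows each record’s inputs back to the original source. The `Asset` fields, the catalog contents and the `trace` helper are assumptions made for this example; they do not reflect the real metadata formats of Informatica, Oracle or Tableau, and the walk assumes the lineage contains no cycles.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """One hypothetical metadata record: what the asset is and what feeds it."""
    name: str
    kind: str      # e.g. "table", "etl", "report"
    system: str    # tool the metadata was harvested from
    inputs: list = field(default_factory=list)


# Illustrative catalog stitched together from three different tools.
CATALOG = {a.name: a for a in [
    Asset("orders_src", "table", "Oracle"),
    Asset("load_orders", "etl", "Informatica", ["orders_src"]),
    Asset("fact_orders", "table", "Oracle", ["load_orders"]),
    Asset("orders_dashboard", "report", "Tableau", ["fact_orders"]),
]}


def trace(catalog, name, depth=0):
    """Yield (depth, asset) pairs, walking from an asset back to its sources.

    Assumes the lineage is acyclic; a cycle would recurse forever.
    """
    asset = catalog[name]
    yield depth, asset
    for parent in asset.inputs:
        yield from trace(catalog, parent, depth + 1)


for depth, asset in trace(CATALOG, "orders_dashboard"):
    print("  " * depth + f"{asset.name} [{asset.kind}, {asset.system}]")
```

The point of the design is that the lineage itself is never stored as a picture: it is derived from the relationships recorded in the metadata, so the view can be rebuilt whenever the metadata changes.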
Some Data Lineage Use Cases
Discovering the Root Cause of Reporting Errors. If the sales team is claiming a deal flow that simply doesn’t align with the Finance Department’s figures, you can be sure that the BI manager will be asked to get involved. BI has to find out why the sales numbers differ from the finance numbers. With data lineage, the team can visualize the entire data flow and perform root-cause and impact analysis in just a few moments.
System Migration & Upgrades. Migrating from a legacy BI tool to a modern one, or upgrading to a new version of a system, becomes significantly easier when advanced data lineage gives data teams full visibility into their BI environment. With automated lineage capabilities, teams can see which reports or ETL processes are duplicates and which rely on data sources that are obsolete, questionable or non-existent. That lets them reduce the number of data items that must be migrated. There is no need to migrate duplicates or obsolete reports, right? Lineage visualization reduces the time, effort and errors involved, and it also enables faster execution of the migration project.
According to Forbes, lineage analysis helps identify “islands of data” that are not currently in use. This allows companies to understand the data they are actually utilizing and stop wasting money, time and effort on irrelevant stored information.
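A pre-migration review along these lines could be sketched as follows. The inventory, table names and `migration_review` helper are hypothetical, and treating reports with identical source tables as duplicates is a simplifying assumption: in practice they would only be candidates for a closer look.

```python
from collections import defaultdict

# Hypothetical inventory: report -> tables it reads. Names are illustrative.
REPORT_SOURCES = {
    "sales_by_region": {"fact_sales", "dim_region"},
    "regional_sales_v2": {"fact_sales", "dim_region"},
    "legacy_margin": {"old_margin_tbl"},
    "finance_summary": {"fact_finance"},
}
OBSOLETE_TABLES = {"old_margin_tbl"}
ALL_TABLES = {
    "fact_sales", "dim_region", "old_margin_tbl", "fact_finance", "tmp_export_2015",
}


def migration_review(report_sources, obsolete, all_tables):
    """Flag candidate duplicates, reports on obsolete sources, and unused tables."""
    # Reports reading exactly the same tables are candidate duplicates.
    by_sources = defaultdict(list)
    for report, sources in report_sources.items():
        by_sources[frozenset(sources)].append(report)
    duplicates = sorted(sorted(group) for group in by_sources.values() if len(group) > 1)

    # Reports that read at least one obsolete table.
    on_obsolete = sorted(r for r, s in report_sources.items() if s & obsolete)

    # "Islands of data": tables that no report reads at all.
    used = set().union(*report_sources.values())
    unused = sorted(all_tables - used)
    return duplicates, on_obsolete, unused


duplicates, on_obsolete, unused = migration_review(
    REPORT_SOURCES, OBSOLETE_TABLES, ALL_TABLES
)
print(duplicates, on_obsolete, unused)
```

Everything flagged here is a candidate to leave out of the migration scope, subject to confirmation by the teams that own the reports.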
Data Privacy Regulations. When it comes to compliance, whether it’s GDPR, the California Consumer Privacy Act (CCPA), or any of the numerous personal data protection regulations on the horizon, you need to gain a better grasp of your data. In order to do that, you must have a data lineage tool in place. It is vital that you know where every single piece of your data originated.
You must also have a clear understanding of any change the data encountered along the way. Knowing the entire story behind each and every data item is a clear case of ‘knowledge is power.’ The more information an organization has regarding its data, the better prepared it will be for the future.
Updated August 2020.