Data lineage tracking is the process of actively tracing your data’s journey from one point to another within your data landscape.
In order to track data lineage, your data systems need to leave trail markers. You then need a data lineage tracking tool that can read those markers and understand what they mean in terms of where the data has been and how it was processed. Good data lineage tracking tools convey this lineage information to users in clear, visual ways.
Different methods for tracking data lineage
Tracking data lineage is like being a law enforcement officer trying to track a suspect as he travels. There are multiple ways one could stay on a suspect’s tail:
Overhear the suspect’s travel plans:
If you eavesdrop on your suspect’s device and see him use Waze to plot a course from Manhattan (where he is now) to Philadelphia, then order an Uber to Philadelphia, you have a pretty good idea of where he is going. Similarly, data lineage tracking tools can use query history and ETL scripts to see what travel instructions are guiding a data asset’s course.
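To make the idea concrete, here is a minimal sketch of reading lineage out of query history. It assumes the history is available as plain SQL strings and only handles simple INSERT ... SELECT statements; the table names and the helper functions are hypothetical, and a real tool would rely on a full SQL parser rather than regular expressions.

```python
import re

# Hypothetical query-history entries; the table names are illustrative only.
QUERY_HISTORY = [
    "INSERT INTO staging.orders SELECT * FROM raw.orders",
    "INSERT INTO marts.revenue SELECT o.id, c.region "
    "FROM staging.orders o JOIN raw.customers c ON o.cust_id = c.id",
]

# Naive patterns: good enough for simple statements, not for subqueries or CTEs.
TARGET_RE = re.compile(r"\bINSERT\s+INTO\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_edges(sql):
    """Return (source, target) pairs for one INSERT ... SELECT statement."""
    target = TARGET_RE.search(sql)
    if target is None:
        return []  # not a statement this sketch understands
    return [(source, target.group(1)) for source in SOURCE_RE.findall(sql)]


def build_lineage(queries):
    """Collect lineage edges from every query, in the order they were run."""
    edges = []
    for sql in queries:
        edges.extend(extract_edges(sql))
    return edges


LINEAGE_EDGES = build_lineage(QUERY_HISTORY)
for source, target in LINEAGE_EDGES:
    print(f"{source} -> {target}")
```

Each printed edge is one leg of the data’s planned route: which table it was read from and which table it was written to.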
Keep tabs on the suspect’s means of transport:
If you get your hands on a report or receipt issued by the Uber driver, noting his Manhattan pickup and subsequent transport to an address in Philadelphia, you’ll know exactly where your suspect ended up after that ride. Similarly, data lineage tracking tools can use Kafka events emitted by source systems to know how the data was processed by that system and where it was sent.
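As a rough illustration of the event-based approach, the sketch below folds a few events into a lineage graph. It assumes each event arrives as a JSON message value with `job`, `inputs` and `outputs` fields; that schema, the dataset names and the helper functions are all hypothetical, and the Kafka consumer itself is left out so the example runs on its own.

```python
import json
from collections import defaultdict

# Hypothetical message values, as bytes, the way a consumer would hand them over.
RAW_EVENTS = [
    b'{"job": "ingest_orders", "inputs": ["crm.orders"], "outputs": ["lake.orders"]}',
    b'{"job": "build_report", "inputs": ["lake.orders", "lake.customers"], '
    b'"outputs": ["bi.revenue"]}',
]


def apply_event(downstream, raw_value):
    """Record every input -> output hop described by one event."""
    event = json.loads(raw_value)
    for source in event["inputs"]:
        for target in event["outputs"]:
            downstream[source].add(target)


def downstream_of(downstream, start):
    """Return every dataset reachable from `start` by following the hops."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in downstream.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


graph = defaultdict(set)
for raw in RAW_EVENTS:
    apply_event(graph, raw)

print(sorted(downstream_of(graph, "crm.orders")))
```

Because each event is a receipt for a completed ride, the graph reflects what actually happened to the data rather than what a script said should happen.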
Put a GPS or RFID tag on the suspect that will note and record any changes in location:
It’s a little trickier and more hands-on to place some type of tracking tag on your suspect’s person. But if you can manage it, the tag will send back excellent, direct tracking information as long as it stays on the suspect. Similarly, some data lineage tracking tools use version-control systems to build the lineage of data objects. Any change to an object is registered and stored, building a comprehensive, immutable audit trail for tracking what happened to the data.
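One way to picture such an audit trail is a chain of change records in which each record stores the hash of the one before it, loosely in the spirit of commits in a version-control system. The sketch below is an assumption about how this could look, not a description of any particular tool; the function names and the sample changes are hypothetical.

```python
import hashlib
import json


def _digest(body):
    """Hash a change record deterministically."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def record_change(history, object_name, change):
    """Append a change record that points at the previous record's id."""
    parent = history[-1]["id"] if history else None
    body = {"object": object_name, "change": change, "parent": parent}
    entry = {**body, "id": _digest(body)}
    history.append(entry)
    return entry["id"]


def verify(history):
    """Check that no record was altered and that the chain is unbroken."""
    parent = None
    for entry in history:
        body = {
            "object": entry["object"],
            "change": entry["change"],
            "parent": entry["parent"],
        }
        if entry["parent"] != parent or entry["id"] != _digest(body):
            return False
        parent = entry["id"]
    return True


history = []
record_change(history, "orders", "add column discount")
record_change(history, "orders", "rename column qty to quantity")
print(verify(history))
```

Editing any earlier record changes its hash, so `verify` fails from that point on. That is what makes the trail tamper-evident.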
Key qualities for data lineage tracking tools
The most important quality for a data lineage tracking tool is that it integrates with the other components of your data stack, such as ETL, storage, analysis and reporting. If a data lineage tool cannot integrate with a component of the stack, that piece will remain a black box for the purposes of data lineage tracking, considerably lowering your data visibility.
Another key quality is the user-friendliness of the data lineage tracking tool and its user interface. Easy-to-understand visualizations are essential for non-technical users, but even technical users can benefit from clarity and ease of use, enabling them to accomplish more, faster.