Are we talking about a data pipeline or a rollercoaster?
Because believe me, data pipelines that act like rollercoasters – stop, start, speed up, slow down – excite no one.
What do you want out of a pipeline?
For most resources that are delivered through pipelines, success is the pipeline not being noticed. The moment you start thinking about the plumbing in your walls, it’s not a good sign. When you ponder the gas lines that run into your home, it’s probably because your stovetop or boiler failed to turn on.
Data pipelines are no exception. In general, you want your data consumers to be blissfully unaware of your data pipeline management. The goal is to keep them satisfied while remaining invisible.
But there are also aspects of data pipeline management that can go beyond expectations and give consumers a thrill they'll appreciate.
Let’s take a look at how you accomplish both.
How to satisfy data consumers
Most of what it takes to satisfy users is to keep that pipeline flowing with accurate, timely, complete data. Since the complexity of the pipeline will inevitably be ignored by the average user (when was the last time you marveled at the valves, joints, pumps, sensors and meters needed just to fill a cup of water at your sink?), it’s up to you to be on top of that complexity and make sure all the constituent parts are doing what they need to do.
Your home water pipeline moves into action whenever you open the faucet. Your data pipeline runs according to its sequencing of jobs and their dependence on each other.
When setting up a workflow within your data pipeline and workflow management tools, you’ll need to decide whether a given job should have a schedule trigger (run every X period of time, regardless of what else is happening) or an algorithmic trigger (run only when certain conditions are met). Make smart decisions by taking into account when, practically speaking, the new data is needed by its users, the amount of resources available, and how to set the conditions to minimize unexpected job fails.
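To make the two trigger styles concrete, here is a minimal sketch in Python. The job definitions and field names (`trigger`, `interval`, `min_new_rows`) are hypothetical, not any particular orchestrator's API — they just illustrate the decision each style encodes:

```python
import datetime

def should_run(job, now, new_rows_waiting):
    """Decide whether a job is due, based on its trigger type."""
    if job["trigger"] == "schedule":
        # Schedule trigger: run every `interval`, regardless of conditions.
        return now - job["last_run"] >= job["interval"]
    if job["trigger"] == "conditional":
        # Algorithmic trigger: run only when enough new data has arrived.
        return new_rows_waiting >= job["min_new_rows"]
    return False

nightly_load = {
    "trigger": "schedule",
    "interval": datetime.timedelta(hours=24),
    "last_run": datetime.datetime(2024, 1, 1, 2, 0),
}
enrich_job = {"trigger": "conditional", "min_new_rows": 10_000}

now = datetime.datetime(2024, 1, 2, 3, 0)
print(should_run(nightly_load, now, 0))    # True: 25 hours since last run
print(should_run(enrich_job, now, 4_200))  # False: not enough new rows yet
```

In a real orchestrator the same choice shows up as, say, a cron schedule versus a sensor or event-based trigger; the point is that the condition you pick should reflect when users actually need fresh data.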
We say “minimize unexpected job fails,” because jobs will fail. Guaranteed. But if at first you don’t succeed, you should have an alert to tell you to try, try again. Which brings us to…
Consistent monitoring for data accuracy and completeness
The earlier you catch a data pipeline issue, the smoother the recovery will be. That’s why you want to be able to identify leading indicators of data pipeline problems, so you can proactively address the problem before it creates a workflow backlog or (even worse) causes a user to complain.
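One simple leading indicator is data freshness: a feed that is merely late is often the first symptom of a broken upstream job. Here is a small sketch, assuming a hypothetical `data_is_late` check with a configurable slack factor:

```python
def data_is_late(last_arrival, now, expected_interval_s, slack=1.5):
    """Leading indicator: flag a feed whose data is overdue, before a
    downstream job fails or a user notices the gap.

    `slack` is a tolerance multiplier so normal jitter doesn't page anyone.
    Timestamps here are plain epoch seconds for simplicity.
    """
    return (now - last_arrival) > expected_interval_s * slack

now = 10_000
# An hourly feed whose last file arrived 2 hours ago: overdue, raise a flag.
print(data_is_late(last_arrival=now - 7_200, now=now,
                   expected_interval_s=3_600))  # True
```

Checks like this, run on a schedule against each source, turn "a user complained" into "we noticed first."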
Using a data pipeline management system that enhances your data observability is key here, especially when you depend on data from external sources over which you have little or no control.
Comprehensive data pipeline management tools should enable you to set checkpoints for intermediate stages of a data pipeline. Not reaching a checkpoint should set off an alert, facilitating an early warning that something has gone wrong and should be looked into more closely. By saving results at intermediate stages of the pipeline, checkpoints can also make it simpler and quicker to re-execute failed jobs.
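The checkpoint idea can be sketched in a few lines of Python. Everything here is a stand-in: the `checkpoints` directory, the `alert` function, and the toy stages are hypothetical, but the shape — save each stage's result, alert on failure, skip already-completed stages on rerun — is the technique described above:

```python
import json
import pathlib

CHECKPOINT_DIR = pathlib.Path("checkpoints")  # hypothetical location
CHECKPOINT_DIR.mkdir(exist_ok=True)

def alert(message):
    # Stand-in for your real alerting channel (email, Slack, pager, ...).
    print(f"ALERT: {message}")

def run_stage(name, func, inputs):
    """Run a pipeline stage, saving its result so failed runs can resume here."""
    checkpoint = CHECKPOINT_DIR / f"{name}.json"
    if checkpoint.exists():
        # A previous run already completed this stage; reuse its result.
        return json.loads(checkpoint.read_text())
    try:
        result = func(inputs)
    except Exception as exc:
        alert(f"stage '{name}' failed: {exc}")
        raise
    checkpoint.write_text(json.dumps(result))
    return result

# A toy two-stage pipeline: if "transform" fails and is rerun,
# "extract" is not repeated.
raw = run_stage("extract", lambda _: [1, 2, 3], None)
clean = run_stage("transform", lambda rows: [r * 10 for r in rows], raw)
print(clean)  # [10, 20, 30]
```

Production tools persist checkpoints more robustly (object storage, a metadata database), but the payoff is the same: an early warning when a stage fails, and a rerun that starts from the last good stage instead of from scratch.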
Speak the same language
Make sure that your data pipelines are delivering a final product that can be read and used by the target system. Yes, that would seem to be self-evident, but all too often a pipeline indicates that the data is ready to be queried when, in reality, more work needs to be done to prepare it for its intended use.
Also important for readying data for its intended use is updating the metadata in the data catalog, so users can have an accurate picture of when this data asset was last updated and by which source, along with any other issues or relevant information. If your data pipeline framework does not take care of data catalog updates automatically, make sure you have another data management tool that does.
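As a sketch of what that catalog update looks like, here is a minimal in-memory stand-in. The `catalog` dict and `publish` function are hypothetical; a real catalog would be updated through its own API at the end of the pipeline, but the fields are the ones users care about:

```python
import datetime

# Hypothetical in-memory stand-in for a data catalog.
catalog = {}

def publish(dataset, source, row_count, issues=None):
    """Mark a dataset as ready and record when and from where it was refreshed."""
    catalog[dataset] = {
        "last_updated": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "source": source,
        "row_count": row_count,
        "issues": issues or [],  # anything users should know before querying
    }

publish("sales.daily_orders", source="orders_api", row_count=58_312,
        issues=["3 rows dropped: malformed order_id"])
print(catalog["sales.daily_orders"]["source"])  # orders_api
```

Making this the final step of every pipeline run — rather than a manual afterthought — is what keeps the catalog trustworthy.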
Always be testing
Here’s an odd piece of advice: try to break your pipeline. Not when it’s running, obviously.
Do plan specific maintenance times for your pipeline management team to test how your pipelines respond to unusual quantities or types of data. Document everything you learn – and use it to create more robust pipelines and more specific alerts. This is especially important when it comes to scaling pipelines.
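A stress test along these lines can be as simple as feeding a pipeline step an unusually large, deliberately malformed batch and checking that it degrades gracefully. The `load_rows` step below is a toy, but the pattern — huge volume plus wrong types and missing fields — is the test worth running:

```python
def load_rows(rows):
    """Toy pipeline step: parse amounts, rejecting anything non-numeric."""
    parsed, rejected = [], []
    for row in rows:
        try:
            parsed.append(float(row["amount"]))
        except (KeyError, TypeError, ValueError):
            rejected.append(row)
    return parsed, rejected

# Deliberately hostile input: unusual quantity plus unusual shapes.
stress_cases = (
    [{"amount": "19.99"}] * 1_000_000            # volume far above normal
    + [{"amount": None}, {"amount": "N/A"}, {}]  # bad types, missing fields
)
parsed, rejected = load_rows(stress_cases)
print(len(parsed), len(rejected))  # 1000000 3
```

What you learn here — how long the big batch took, whether the bad rows were rejected or crashed the run — is exactly the documentation that feeds better alerts and more confident scaling.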
An ounce of testing is worth about a ton of cure (and clean-up, and user appeasement, and…)
How to thrill data consumers
Most of the ways you can move beyond “satisfaction” to “thrill” involve communication.
Early, clear communication about problems
It is never comfortable to inform a consumer that something is wrong with the data they rely on, or with its delivery. There’s the discomfort of volunteering the fact that you messed up. (Even if you haven’t, that’s probably how they’ll see it, so that’s how it feels.)
But the immediate discomfort is far outweighed by the long-term gains. Your proactivity in informing your user about problems establishes you as a paragon of trustworthiness who is out for their benefit. They didn’t need to come to you; you came to them!
When you reach out, be as clear as can be. Give details; give a time frame for remediation. Err on the side of overestimating the problem and the time it will take to fix it. We know: that’s really easy to say and very hard to do when you’re under pressure to look good and deliver. But better your users get a pleasant surprise than they come to suspect you of whitewashing and not being reliable.
Prompt, personable responses
When your data consumers do reach out to you, don’t keep them waiting. Get back to them as soon as possible, even if it’s just to say, “I don’t have an answer right now, but I’m looking into it and I’ll get back to you within <whatever length of time is reasonable>.”
In addition, be a person! And a personable person, at that. I still have memories of a particular email service I used years ago. The support staff, every single one with whom I ever corresponded, all gave off a cheery, can-do attitude. And this was accomplished purely with text and a well-placed emoticon here and there! I don’t even remember how good the actual features of the service were, but my overall memory of the service is 100% positive, thanks to the support staff’s personable responses.
Yes, if you’re a data engineer, you’re probably a hard data person. But you do know how to use a well-placed emoji, right? 😜
Share how you’re improving
Let your consumers know how the way you’re handling an error will give them better data in the future. If your approach to data pipeline issues is to perform root cause analysis on each one, improvements from each error should be par for the course. The key is to share those improvements with your users, so they can appreciate what you’re doing for them behind the scenes.
Don’t wait for problems
When your communication is focused on bad news, that’s not good news for your relationship. But it doesn’t have to be that way. Proactively reach out to your data consumers to get feedback on pipelines. Ask about their needs and wishes. Collaborate to create a new pipeline for their benefit.
This does take time from your busy schedule. But the impact of such communication can be thrilling for your users… and for you.
To stand out, be outstanding
We’ve moved from pipeline success being “functional and unnoticed” to being “proactive, communicative and noticeable.”
Either one will make your data pipeline management a success. But stopping at the first will merely satisfy your data consumers, while adding the second will likely thrill them.
The choice is yours.