Jim Powell: Hello, everyone. Welcome to the TDWI webinar program. I’m Jim Powell, Editorial Director at TDWI, and I’ll be your moderator. For today’s program, we’re going to talk about Using Advanced Data Lineage to Improve Hybrid BI Environments. Our sponsor is Octopai. For our presentation today, we’ll first hear from Dave Stodder with TDWI. After Dave speaks, we have a presentation from Amnon Drori with Octopai.
Before I turn over the time to our speakers, I’d like to go over a few basics. Today’s webinar will be about an hour long. At the end of their presentations, our speakers will host a question and answer period. If at any time during these presentations, you’d like to submit a question, just use the Ask a Question area on your screen to type in your question, and send it over.
If you have any technical difficulties during the webinar, click on the Help area located below the slide window, and you’ll receive technical assistance. If you’d like to discuss this webinar on Twitter with fellow attendees, just include #TDWI in your tweets. Finally, if you’d like a copy of today’s presentation, there’s a Click Here for a PDF link on the left middle of your console.
In addition, we are recording today’s event and we’ll be emailing you a link to an archived version, so you can view the presentation again later if you choose, or share it with a colleague. Again, today we’re going to be talking about Using Advanced Data Lineage to Improve Hybrid BI Environments. Our first speaker is David Stodder, Senior Director of Research for Business Intelligence at TDWI.
As an analyst, writer, and researcher, Dave has provided thought leadership on key topics in BI, analytics, IT, and information management for over two decades. Previously, he headed up his own independent firm, and served as vice president and research director with Ventana Research. He was the founding Chief Editor of Intelligent Enterprise, a major publication and media site dedicated to the BI and data warehousing community, and served as Editorial Director there for nine years.
With TDWI Research, Dave focuses on providing research-based insight and best practices for organizations implementing BI, analytics, performance management, and related technologies and methods. Dave, I’ll turn things over to you here.
Dave Stodder: Thank you very much, Jim. Welcome to everybody today. We have a good webinar, Using Advanced Data Lineage to Improve Hybrid BI Environments, and thanks to Octopai for sponsoring. We’ve got a lot of interesting things to talk about today that revolve around data lineage, so I’ll just get going. I’ll be speaking, as Jim mentioned, and then we’ll move on to other parts of our presentation.
Be thinking about questions you might want to ask later on as we go along. To set the scene for data lineage, I think it’s good to look more broadly at trends that are going on, how they impact what organizations are doing, and why data lineage is important. One of the big things that comes to mind is certainly digital transformation. We’re seeing in our research that it is affecting lots of different kinds of operations, applications, and processes, particularly over the past year, with the cloud being so important to remote workers and to different ways of organizing.
Organizations have had to respond to the pandemic and so forth by adjusting how they approach markets and interact with customers and business partners. Migration to the cloud has been intense in the past 12 months or so. We expect it to be the same going forward. This is causing a lot of replacement of legacy systems or, really, what I think we’re going to be talking about first today, a lot of hybrid environments. Legacy systems, of course, often don’t really go to die; they hang around, and so it creates a situation where there’s going to be integration between the on-premises legacy systems and those that are in the cloud.
Digital transformation is driving a lot of new kinds of business applications and processes. This, of course, is producing lots of new types of data. Just in this very simple illustration, you can see that data is now really in the center of interactions with people, the development of business processes, and the use of technology platforms and data platforms. Analytics is also a big driver.
As organizations begin to move their business applications and processes to different platforms, migrate to the cloud, try new ones, one of the top priorities is to develop analytics. This is putting a lot of emphasis on knowing more about the data. Of course, that really is behind everything we’re going to be talking about today. Data insights for improved resiliency and improved efficiency, particularly, in response to the situations that many organizations have been in the last few months.
This shows some research to back that up a little bit: the high demand for cloud services and platforms. I hope this is interesting. For example, enterprise BI, reporting, and dashboards can often be legacy systems in organizations, because they were established early on. We see here that over half of organizations in our research have these systems now running on cloud-based or software-as-a-service (SaaS) platforms.
Of course, again, as I mentioned before, often the older systems don’t completely go away. It’s a matter of trying to integrate all of this, so that users, and there are many more different types of personas in the user base, can get their business intelligence, their reporting, their dashboards. We see here also that very close behind, at 51%, is business-driven, self-service BI and analytics. These are the systems that are really set up more on the business side, with less involvement of IT.
We’ve got here basically enterprise systems and then the self-service systems, both at more than half. You can see the other important platforms: 41% have data warehouses in the cloud and 39% have data lakes in the cloud. Again, these capture data for lots of different reasons, and lots of different types of data, with data lakes often being there to handle the mixture of different data.
Maybe not knowing quite what the data is, with more exploration needing to be done before it can really be safely used in reporting and analytics, and so forth. Lots of data is pouring into the cloud. You can maybe see where I’m going with this: you need to know where that data came from. This is becoming an issue in organizations. You can see here just how it ranks with some of the other kinds of processes going on: predictive analytics has a pretty good percentage there too, and so forth.
Hybrid BI: hybrid is really about having on-premises systems and cloud-based systems. We also see in our research that there is usually not just one cloud system but multiple cloud systems, often because of piecemeal expansion. That’s common. As I mentioned before, we’ve had self-service systems that can sometimes be siloed, as well as enterprise systems.
Organizations, from the early days, have been moving into the cloud in more of a piecemeal fashion. It might be one department here or one project there. Maybe they have established a data lake, again, to look at data they want to explore first before putting it into the data warehouse, or just to explore for data science. It becomes this thing rolling into the cloud, but not in a very planned way.
There are different infrastructures and different APIs being used, some working with containers and more modern cloud architecture, some not, plus security issues, and so forth. There’s a lot of complexity. Again, it makes administration, with data lineage being part of that task, very complicated, because there’s lots of different data integration. Governance, which I’ll be talking about, becomes a factor, and it becomes more difficult.
On the broad business impact of this, just to pick out two examples, think first about financial services: decisions about loans, decisions about different types of accounts that customers may want to open or that you may want to offer to different customers. It’s important to be able to see where the data about customers, about their accounts, and about decisions being made came from, and then to be able to understand the relationships of the data across sources, because you’ve now got a mixture of sources up there.
Often, one of the first questions asked when looking at things like this is, where did that data come from? Then, thinking about online retail, there’s nothing that can aggravate customers more than if you don’t have good data about them, and they have to go through it all over again. Customer-facing excellence, particularly as these sources multiply and you’ve got a hybrid environment, can be difficult to achieve.
There tend to be more data inconsistencies across these touchpoints, as more customers are interacting with your organization across different channels. This can be a problem too. The last thing you want is outward-facing chaos as you’re working with customers. Governance is also, of course, a big driver behind what we’re talking about today, and it has certainly become increasingly important as the years have gone by, with things like the General Data Protection Regulation (GDPR) and then the California Consumer Privacy Act (CCPA), and, of course, other states and other countries have their own. Governance, just backing up again, is mainly about articulating rules and policies for protecting sensitive data.
That’s a big issue if you don’t know where the data is coming from, don’t know where it is now. This can be very difficult. GDPR, CCPA, and other regulations are putting a lot of pressure on organizations, certainly those that are working with customer and consumer data, but also with any data that could be turned into personally identifiable information. It’s also about trust in the data, so it’s our trust.
We mentioned that customers, consumers want to know that you’re taking care of their data, but inwardly as well, the very personas I’m talking about, whether it be frontline workers, line of business managers, or an executive, lots of data scientists, business analysts, and different parts of the organization, they’re using all these different reports and dashboards, developing analytics.
They need assurance that they can trust both that the data adheres to governance and regulatory principles, so that they’re not going to get the organization, or in some cases themselves, in trouble for using the data, and also that they can trust elements around the quality of the data, its sourcing, where it came from, and so forth. It all comes down to, of course, not just the financial penalties, but also the reputation.
The reputation is almost the most difficult thing to recover if there’s been a problem. Organizations want to know that they’re working with companies that can deliver that kind of value, and they want to know that their data is protected. A challenge worth mentioning here, when you have a hybrid multicloud environment with on-premises and multiple cloud systems, is not knowing where the sensitive data is or who might be sharing it.
We can see in our research that 45% say that regulatory compliance and protecting sensitive data is a top challenge with hybrid multicloud environments. That was one of the top two or three. Talking about data lineage, what is it? Well, basically, it’s about knowing where the data comes from, where it is now, and then knowing who is using it and sharing it, and what’s happening to it, how it’s been changed, and so forth.
It’s critical for governance because most of the regulations, GDPR in particular, but others are similar, have requirements around having a data inventory, knowing where the data is, being able to respond to audits, and being able to control this sensitive data. So, right at the top of the list of reasons to know the data lineage is, of course, governance. There are two major types of data lineage to talk about. One is horizontal data lineage.
This is really about tracking the data from its source to its target destinations, and knowing where it came from. This is where organizations often need to begin as they start to develop this data inventory, respond to audits, and gain more control: simply knowing what the source of the data was and what the target or targets were, and being able to answer questions around data ownership.
Then, who brought the data in, or what process or application was bringing in the data, how it is being consumed, how it is being shared, just the fundamental idea of what is the data’s journey in the organization and maybe outside the organization. Vertical data lineage, it goes along with horizontal. That’s about once it is sourced, once you know the information about where the source came from, then it’s about what happened to the data? What kind of transformations have occurred to the data?
What calculations are based on this data? How has it been aggregated? How might it have been changed along the way? How has the data been enriched? These are all the kinds of things that those different personas, the different users, might be doing with the data once they have it, to meet the different requirements they have for analytics or reporting. This is critical when it comes to analytics. There are certainly regulatory and governance reasons to know vertical data lineage.
It’s also important for analytics to understand where this data came from that’s in the analytic model, the predictive models and in the calculations. Can we begin to track it back if there’s questions about it? Data lineage, of course, really demands both. Often, the regulatory issues put pressure on the horizontal data lineage to know just where the data came from, where it is now first, but the vertical data lineage obviously is very important for some of those reasons, as well as for improving analytics and improving the understanding of data visualizations, dashboards, and so forth in the organization.
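To make the two views concrete, here is a minimal sketch, not a description of any particular product. It assumes lineage is recorded as a list of source-to-target hops (the horizontal view), with each hop carrying the transformations applied along it (the vertical view). The asset names, the `LineageEdge` class, and the `trace_upstream` function are all hypothetical, invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LineageEdge:
    """One hop in the data's journey (hypothetical structure)."""
    source: str                                   # horizontal: where the data came from
    target: str                                   # horizontal: where it landed
    transformations: List[str] = field(default_factory=list)  # vertical: what was done


# Made-up example assets, purely for illustration.
EDGES = [
    LineageEdge("crm.customers", "staging.customers", ["trim whitespace from names"]),
    LineageEdge("staging.customers", "warehouse.dim_customer",
                ["deduplicate on email", "add region from postal code"]),
    LineageEdge("warehouse.dim_customer", "report.customer_churn", ["aggregate by region"]),
]


def trace_upstream(asset: str, edges: List[LineageEdge]) -> List[LineageEdge]:
    """Walk backward from an asset and collect every hop that feeds it.

    Hops are returned in reversed discovery order, so for a simple chain
    the original source comes first.
    """
    by_target: Dict[str, List[LineageEdge]] = {}
    for edge in edges:
        by_target.setdefault(edge.target, []).append(edge)

    hops: List[LineageEdge] = []
    visited = set()
    stack = [asset]
    while stack:
        current = stack.pop()
        if current in visited:
            continue
        visited.add(current)
        for edge in by_target.get(current, []):
            hops.append(edge)
            stack.append(edge.source)
    return list(reversed(hops))


for hop in trace_upstream("report.customer_churn", EDGES):
    print(f"{hop.source} -> {hop.target}: {', '.join(hop.transformations)}")
```

Reading the printed hops left to right answers the horizontal question (source and target), while the text after each colon answers the vertical one (what happened to the data on that hop).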
Horizontal and vertical lineage come together to be key to tracking the data’s journey and then knowing what’s been done with it. Now, to help support efforts to record data lineage, metadata management is critical. They often go together in organizations. If you have metadata management, it certainly makes it a lot easier. In fact, it’s really a key step in knowing the data lineage, tracking data lineage, and controlling where the data is.
Metadata management can be for the enterprise. That’s often where we hear about it: an entire enterprise working with, for example, a data catalog. But it can also be important for departments and for different projects. There are lots of different levels where it can be critical. If the organization, for example, is not ready to do enterprise metadata management, don’t be shy about working with it for a particular project or a particular department.
It’s about gathering and documenting the metadata, both technical and business metadata. Just technical about the data definitions and so forth, but also, some of the business definitions around the data. This is all key to knowing, particularly, as you’re looking at a hybrid environment where there’s lots of different sources of data, and lots of different applications and so forth, the data location, the ownership, the administration.
A core technology for metadata management is a data catalog. We’ve seen a lot of growth in data catalogs over the last few years, and certainly the past year, and we certainly expect that going forward. Data catalogs use metadata and other information, lots of other business information, some of it crowdsourced, to provide an inventory of the data and make it easier for users to find the data.
Then, of course, there is data lineage tracking. A data catalog can be critical for that because it’s a very important first step, like a Rosetta Stone, for figuring out what the data is and how we can find it. Some of the modernizing trends around this are automation and the use of AI. These help metadata systems crawl metadata in all these different systems, whether it be traditional on-premises systems, cloud data sources, data lakes, text, image, and so forth. They help handle those kinds of diverse data scenarios and, of course, scale: doing it faster, so that it’s less manual, and more consistently, so that you begin to establish consistency in collecting the metadata and developing the data catalog.
Looking at some of our research, some of the top priorities in metadata management: at the top is really just making it easier to search for and find data, this being a common goal.
We can see 75% there at the top when we ask, “What are your goals for setting up a centralized metadata catalog, glossary, or metadata repository?” We can see here 43% are saying centrally monitoring data usage and lineage. That’s a pretty substantial number seeing this as an important reason. Of course, number two in the list is 57% saying they want to improve governance, security, and regulatory adherence.
As I just pointed out, data lineage is critical to doing that, so they go together in any case. This is interesting to see. You can see inventorying the data assets, with half saying that this is one of the most important goals that they have. Of course, as we see in the regulations, that’s really one of the first steps organizations have to take to adhere to GDPR and some of the other regulations.
Here you can probably measure this against your own organization’s priorities. There is some other research specifically around data lineage. We’ve asked some questions, one being: how well can people use the data catalog to find data and understand data relationships, and then to tap it for information about data and its lineage? We don’t see a lot of satisfaction right now, so there’s a lot of room for improvement here.
27% say they’re dissatisfied. On the other hand, 23% are satisfied, and 21% are in the middle, neither satisfied nor dissatisfied. Then, we see that 29% don’t even have a data catalog or just don’t know. This is an area where organizations really need to devote attention to be able to move forward with tracking data lineage.
How well can organizations address requirements for data governance and lineage tracking? 43% say they need a major upgrade. In many cases, we see that this is because it’s a manual process and it’s inconsistent. Maybe they’re even using spreadsheets, particularly in small to medium-sized organizations, and they know they’re not doing a very good job. They’re looking for technology platforms that can help them, and this often requires a major upgrade. Only 12% are very satisfied.
Then, the third question here: how satisfied are you, basically, with governance and tracking data lineage across a hybrid environment? This is the hybrid environment that many organizations have, with on-premises and cloud-based BI and data warehousing systems. 42%, again, say they need a major upgrade here, and just 9% are very satisfied.
This is clearly an area where organizations see that they need to be paying a lot more attention, both in terms of best practices and technologies. How does data lineage contribute, summing up some of the thoughts we’ve had so far here? Governance and regulatory adherence: keeping track of sensitive data, being able to respond to audits, defense against legal actions, and this issue of explainability.
In many cases, with CCPA, for example, consumers can bring a legal action against an organization if they feel their data is being misused. Lineage is important to answering those kinds of actions, and to being able to respond to audits. Knowing the data origins, ownership, and administration is critical to these types of activities. So is ensuring effective, consistent procedures in sourcing new data. Data lineage tracking can become part of this data ingestion process.
Often, for example, an organization will set up a data lake and just want to stream data in or flow it in, worrying about governance later, and then governance can often be haphazard and incomplete. We’re starting to see many organizations look at how they can begin to track data lineage as they’re ingesting the data. This is an interesting development.
Then, dealing with different regulations, different data classifications, and taxonomies. This is complex too. As I mentioned, there are a lot of different regulations, both industry regulations and the data protection and privacy regulations. Data lineage can be very helpful to this aspect of governance. This goes along with metadata management, of course, to be able to handle all the different regulations and classification systems that are at work.
Customer service and engagement, I’ve mentioned that. Proactive improvement of data accuracy: you can use data lineage to see where there are flaws in the customer data, track back where they came from, and correct them at the sources. Now, just looking closely at the issue of data trust. This is a major contribution that data lineage can provide: improving data trust, as I mentioned earlier.
We can see in our research here, we asked, “To improve users’ trust in the data, which of the following actions are being undertaken in your organization?” Data quality, of course, is at the top, which is not surprising. As I mentioned, data lineage can be very important to improving data quality. We see, specifically around tracking data lineage, including changes to the data, transformations, and calculations, 41% seeing that as an action that could be helpful in improving data trust.
Lineage can really be helpful to a lot of the activities that are on this list, to improve accuracy and make it more consistent and faster. Data lineage with cloud migration and change: this is getting to the hybrid issue. In hybrid environments, continuous data lineage monitoring is important as part of data movement and migration from on-premises systems to the cloud.
It’s a good idea to put that in place because, often, as organizations are moving systems to the cloud, they lose that data lineage. They lose track of what it is, and so this should be something that’s being considered as organizations are moving data to the cloud. Then also, the process of examining these data silos that are forming in the cloud, or that may, in some cases, have been on-premises as well, is an opportunity to bring them into the scope of governance and understand the data lineage there, so it’s important for that.
Movement and migration of BI reporting and analytics to the cloud. There are the data sources, and there are also the enterprise reporting systems and self-service reporting systems. It’s about being able to track how the data flows into these systems, and using data lineage technology and practices for that. Then, try to get the big picture, the holistic metadata management view of what is going on, to bring more efficiency and consistency to ETL and ELT processes, so you’re not doing things that aren’t valuable.
You can see beyond those single workloads and project workloads, spot inconsistencies and discrepancies across different processes, and try to bring some rationality and consistency to it. Change management, as I mentioned, is important as you’re moving these systems to the cloud, but it also helps more broadly. Often, as there are a lot of changes in data platforms and movement, the question comes up: what happened to the data here?
We can’t track it back anymore. This is a way of being able to spot what kind of reports or applications could be affected if you’re making a move into the cloud, and to use lineage for that visibility, to anticipate and remedy those problems in a proactive way. Behind all this is automation, in terms of advancing the technology. As I noted before, in many cases in organizations, there’s just too much manual effort.
In fact, a lot of times we see it being done on spreadsheets, or being captured in a database somewhere, but it’s inconsistent. It’s not even clear who’s really in charge of it. Automation can be very valuable in bringing greater consistency and relieving people of the often very time-consuming, tedious chores of tracking this data lineage. It’s not a lot of fun, and it doesn’t feel very innovative to do this. It’s a good idea to start thinking about how software automation could be helpful here.
In our research, we see, as I mentioned, organizations looking for major upgrades. What they’re really looking to do is automate: automate the process of developing data catalogs and metadata management, and automate the tracking of data lineage. 72% say data governance could be successfully improved through automation of these systems, and specifically by improving data lineage and metadata management. Very high percentages. We can see where that’s stacking up there.
I’m going to offer closing recommendations. One is to make it a priority to improve understanding of data lineage, to look at where you can use automation to relieve people of these manual, time-consuming, often tedious tasks, and also just to build in consistency and be able to handle the speed, the scale, and complexity of the demands that organizations have.
Track and monitor data lineage as part of cloud data movement and migration. Don’t think of it as an afterthought, but think of it as you’re beginning to move these systems to the cloud, as you’re beginning to stream data, say, into a data lake, think about it early on, because then it’ll certainly relieve problems that come up later. Think about it as part of change management. Use data lineage to spot all the outcomes of change, and reduce delays and errors that often come up there.
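The change-management recommendation, using lineage to spot which reports and dashboards a change could affect, can be sketched as a downstream walk over a lineage graph. This is only an illustration under stated assumptions: the `DOWNSTREAM` mapping, the asset names, and the `impacted_assets` function are hypothetical, and a real environment would load this graph from a metadata repository rather than hard-code it.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets that consume it.
DOWNSTREAM = {
    "onprem.sales_db": ["etl.load_sales"],
    "etl.load_sales": ["warehouse.fact_sales"],
    "warehouse.fact_sales": ["report.weekly_revenue", "dashboard.exec_summary"],
    "onprem.hr_db": ["report.headcount"],
}


def impacted_assets(changed, downstream):
    """Breadth-first walk returning everything downstream of a changed asset."""
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Which reports and dashboards should be checked before migrating the sales database?
impacted = impacted_assets("onprem.sales_db", DOWNSTREAM)
to_review = sorted(a for a in impacted if a.startswith(("report.", "dashboard.")))
print(to_review)  # ['dashboard.exec_summary', 'report.weekly_revenue']
```

The headcount report does not appear in the result because it is fed only by the HR database, which is the kind of scoping that helps reduce delays and errors during a migration.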
A few others: improve project organization for data lineage. You’re thinking, “We’re going to do data lineage and try to track it.” You obviously can’t do everything all at once. My number two item here is prioritizing data for governance. That’s often really the driving reason to begin to take this a lot more seriously, because the organization needs to respond to governance and to regulations.
Prioritize it for governance, for sure, but also for key BI reports, dashboards, and analytics. As I mentioned, you can’t do it all at once, so look for quick wins in this area, start to gain confidence, and also gain management support for these things. Plan on this as something that’s going to be continuous. Data lineage tracking, metadata management, and so forth are a continuous process, not something you can just address once and then walk away from.
Address data lineage and metadata management at the business level, not just the IT and technical level. This is where we need to get the buy-in and support from upper management, particularly because this is something that often has to serve across departments and across platforms, with lots of different users impacted. It can be difficult to get backing for these things.
I think it’s pretty clear how to show that data lineage knowledge can impact the business, because if you don’t have it, then you have poor governance. With data quality problems, getting to the root cause would be easier if you had good data lineage tracking. Then, certainly in customer and partner relationships, customers and partners are depending on your organization to take care of their data and have good data.
Then, establish business-side ownership and accountability. This is important because where the data is sourced and how it’s being used is not just an IT matter. Obviously, we’re talking about what the users in the organization are doing with the data. They are looking at new sources they want to bring in. This concept of data ownership and accountability is very important to bring in from the beginning.
Well, thank you very much. I think that’ll close out my presentation. Although, actually, before we move on, we’ve got a poll question to involve our audience here. Let’s turn around the question to those in our audience today. What is your biggest challenge in tracking and managing data lineage? Number one, is it scalability? You have lots of data, lots of users, you have lots of workloads, and so it’s just difficult to scale up data lineage to handle these.
You have a hybrid environment. You have on-premises and cloud systems that are making data lineage tracking difficult. Cloud data migration and movement is causing you to lose track of data lineage. As I mentioned before, often when these systems are moved to the cloud, the lineage is lost. Project scope and prioritization, so you want to do data… you’re unsure where to begin, even within governance: what kind of data sources, what kind of data systems do you need to be looking at for governance, security, or regulatory concerns, for example. Too much manual work, not enough automation, so you don’t really have the good technology you need. Or other: if there’s another one, you can use the Q&A format to write in an answer.
Let’s take a look at some of the answers here. Pretty much all of these issues are concerns, it looks like. 27% say scalability is the issue; it’s just simply the amount. 24.3% say it’s the hybrid environments, so you have all these systems. Just 5.4% say the cloud migration itself is a difficult problem. 16.2% say project scope and prioritization. Also, 24.3% say too much manual work and not enough automation, so it’s interesting to see. Good, we have some answers. We’ll take a look at those as we move into our Q&A. Thanks very much for all that. Jim, back to you.
Jim: Thank you, Dave. Just a quick reminder to our audience: if you have a question, you can enter it at any time in the “ask question” window. We’ll be answering audience questions in the final portion of our program. Our next speaker is Amnon Drori with Octopai. Amnon is the CEO and co-founder of the company, a leader in metadata management automation for business intelligence. He has over 20 years of leadership experience in technology companies. Hello, Amnon.
Amnon: Hi, thank you. Thanks, David, for this really great information. I just want to say hello. I’m going to shut down my camera just to save some bandwidth, and I’m going to move forward to sharing what Octopai does. I want to pick up on what David said about all these drivers and compelling events. This is something that we already recognized five years ago. We used to lead BI groups in other large companies in insurance, banking, and healthcare.
At that point in time, we saw the growth of data and the complexity of data, combined with a new generation of environments, what you just said about moving from on-prem to the cloud, or changing generations from traditional BI tools to more modern ones. At some point, five years ago, we felt that if we didn’t do something about it, we were going to come to some kind of threshold where the way we used to work would not be sufficient or good enough. This is where we very much related to your point about automation.
I’m going to share my screen to show a couple of things. You should be able to see my screen at this point. If not, let me know. The essence of Octopai is basically working with any organization that has what we call a business intelligence and analytics team. In practical terms, what we do is enable organizations to really not worry about what kind of tools they have in their BI landscape.
What we wanted to do is to enable organizations, as they add, change, modify, and modernize their BI environments, to give their business users better access to their data, either to make business decisions or because they need to trust the data that then serves the organization as part of regulatory requirements or audits. We wanted to automate the entire management of the BI. We wanted to be able to offer clients what we wanted to have five years ago: an option to understand the BI with software.
What we’ve done is that we’ve started to analyze a huge number of business intelligence systems that involves different vendors, and each one of these systems are doing different things to the data. The way we deliver our offering is in the cloud, which is very, very robust. It enables the client just to extract metadata that takes about 30 minutes, upload them to their accountant Octopai, and from that point on, they start to use Octopai and enjoy analyze metadata shrinkwrap in form of products. Now, today, we’re going to talk only about data lineage.
Obviously, there are other offerings that we can do, but the whole point is to serve these use cases for these types of beneficiaries who, at the end of the day, collectively, are responsible for the data movement process within the organization. What I want to do is go here and show you a live environment of our demo that can basically illustrate exactly what I mean. This is what it looks like. This is our website, and this is our demo environment. By the way, should you want to have a personalized demo, by all means let us know. We’d be more than happy to schedule 20 to 30 minutes to show you what we’ve got.
In this case, you can see an environment that has about 400 ETL processes shipping data. The data extracted from the data sources is stored in about 2,500 database tables and views, and there are 24 reports using part of the data that is stored here after running these ETL processes. Just to give you some numbers, some of our clients, not the big ones, have about 1,000 ETL processes that are storing data in about 10,000 tables and views, and there are about 1,000 reports on top of it. If you do some very quick math, we’re talking about 10 billion possibilities of data pipes, or connections, or lineages.
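Editor’s illustration (not part of the spoken presentation): the 10 billion figure appears to come from multiplying the three counts quoted above. The sketch below assumes the worst case, in which any ETL process could feed any table or view and any table or view could feed any report. The variable names are illustrative only.

```python
# Counts quoted in the talk for a mid-sized client environment.
etl_processes = 1_000
tables_and_views = 10_000
reports = 1_000

# Upper bound on distinct ETL -> table/view -> report paths, assuming
# every combination is possible (an assumption made for illustration).
possible_paths = etl_processes * tables_and_views * reports

print(f"{possible_paths:,}")  # 10,000,000,000
```

Real environments contain far fewer actual paths, but the bound shows why documenting lineage by hand does not scale.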
Now, it’s very, very difficult to document all of that. I actually question whether it can be done at all. The problem starts when you need to find a data mapping. For example, if a business user picks up the phone to his analyst and says, “For some reason I’m looking at my report, and data is missing, or the data is not complete, or the data doesn’t make sense,” what is it that we’re doing, in business intelligence and analytics language? We’re translating the business problem into a BI capability, and that capability is lineage, or, to be more accurate, reverse lineage.
We want to find the report quickly, and we want to understand how the data landed in that report by identifying the exact tables and views where the data is stored, and also the exact named ETL processes that are shipping the data and then landing it in the report. This is how you do it with Octopai. In this case, you can see that we have a section of reports and analytics. You just go here and find the report that you’re looking for. For the sake of the example, we’re going to look at a report called customer products, which should show which clients bought which of the products that I ever sold.
I’m going to start typing in the name of the report. Customer products. As I’m typing, Octopai has already found that report for me. Here it is. That’s step number one. I can see it’s running in SSRS, so I know exactly where this report is generated, I can access it, I can see where it is, and so on and so forth. Now, I want to understand, the data that lands in that report, where is it coming from? Which, out of the 2,500 tables and views and almost 400 ETLs, are explicitly responsible for landing the data in that report? The only thing I need to do is click “lineage”, and this is what we get.
Within a few seconds, you get the following picture. This report, colored in green, and you can see the legend here, is actually based on a view, which is based on these tables. Here are their names. The data lands in these tables by running these ETL processes. In one shot, you can see ETLs, databases, and reports on one screen. The second notable thing here is that you can see ETL processes from SSIS (Microsoft), Informatica, and DataStage. At any given point in time, you can click and get more details about that specific object. It’s very, very easy.
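Editor’s illustration (not part of the spoken presentation): the reverse lineage described above can be thought of as walking a directed graph backwards from a report. This is a minimal sketch with made-up object names and a hypothetical `upstream` helper. It does not reflect how Octopai is implemented.

```python
from collections import deque

# Hypothetical edges: each pair means "data flows from source to target".
EDGES = [
    ("etl:ssis_load_dwh", "table:dim_customer"),
    ("etl:informatica_products", "table:dim_product"),
    ("etl:datastage_sales", "table:fact_sales"),
    ("table:dim_customer", "view:v_customer_products"),
    ("table:dim_product", "view:v_customer_products"),
    ("view:v_customer_products", "report:customer_products"),
    ("table:fact_sales", "report:sales_report"),
]


def upstream(target, edges):
    """Return every object that feeds `target`, directly or indirectly."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, []).append(src)

    seen, queue = set(), deque([target])
    while queue:  # breadth-first walk against the direction of data flow
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


lineage = upstream("report:customer_products", EDGES)
print(sorted(lineage))
```

Only the view, the two tables behind it, and the two ETL processes that load those tables are returned. The unrelated sales ETL is left out, which is the filtering the speaker describes.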
A cool thing about it is that you can always drill into each one of these objects as you like. If this is a view, and I want to see a visual of that view, I’m going to get a mapping of the view, where maybe I want to see exactly how to track certain fields within that view, but I can also document it. We talked about regulations, we talked about audits, we talked about documentation, setting up the BI with proper information, to be able to control what it is that you have in your BI. Very, very valuable. The second thing is, if you want to drill into what we call the deep lineage, what’s going on in this ETL, you can always click and dive in.
Just like we did with the view, in the SSIS language, I’m going to drill into the container. I can see that a Load Data Warehouse ETL contains maps. That’s great, but I want to dive into the object level, to the column level. This is what I get. You can track a certain field from the source table, which could be CRM, through the transformations, all the way to the destination table. Again, you can print, you can share, you can document. Everything that runs in your BI is captured and viewed within Octopai within seconds. The last thing I want to show you about how automation helps is the fact that you can do lineages from anywhere to anywhere.
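Editor’s illustration (not part of the spoken presentation): tracking a single field from a destination table back through its transformations to a source such as a CRM can be sketched as following a chain of column mappings. The column names, the transformations, and the `trace_column` helper are all hypothetical.

```python
# Hypothetical column-level mappings:
# destination column -> (source column, transformation applied).
COLUMN_MAP = {
    "dwh.dim_customer.full_name": ("staging.customer.name", "TRIM + UPPER"),
    "staging.customer.name": ("crm.contacts.contact_name", "copy"),
}


def trace_column(column, mapping):
    """Walk from a destination column back to its original source.

    Returns a list of (source, transformation, destination) steps,
    ordered from the destination backwards.
    """
    steps = []
    visited = {column}
    while column in mapping:
        source, transform = mapping[column]
        steps.append((source, transform, column))
        if source in visited:  # guard against cyclic mappings
            break
        visited.add(source)
        column = source
    return steps


for source, transform, destination in trace_column(
    "dwh.dim_customer.full_name", COLUMN_MAP
):
    print(f"{source} --[{transform}]--> {destination}")
```

The last step returned names the original source column, here a field in a made-up CRM table.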
What we’ve done here is the lineage from the report backwards, because we wanted to understand if something is wrong with the business user’s report by understanding the data flow, and share this with the BI team for them to see if something has been changed. It can also be the other way around. What if I want to ship new data that has been generated, or I want to modify a certain ETL process, or I want to enrich that ETL process? One of the popular questions is, what would be the impact of the planned changes that I want to make to this ETL or to this ETL? Is this report the only report that is going to be affected by it? Well, let’s see.
How long would it take to understand which of the 23 remaining reports are related to this ETL? Now, in other practices, this type of lineage right here may take a few hours or a day, but let’s do the lineage the other way around. This is what I get. What you see here is the opposite picture. If I were to make any changes or modifications to this ETL or to one of the maps in this ETL, I do want to know all the possible relationships, either direct or indirect, that may be affected by anything that I’m planning on doing with this ETL.
Now, if you’re an ETL developer, if you’re a data architect, if you want to make sure things are documented, planned, and assessed correctly, and to avoid problems that might happen to one of the business users, what we call use case number one, then you can see that all this mapping is just a click of a button. In this case you can see very clearly, here’s the report that I used for the previous use case. If you remember, that was running on SSRS, but maybe the VP of sales is looking at a report called sales report. This one is actually being generated in BusinessObjects.
The fact that two users are looking at two different reports, generated in two different reporting systems, probably from two different departments, and that both are related to anything you are planning to do to this ETL, can lead to a lot of mapping time, inaccuracies, and multiple cycles of development. From our standpoint, you can save all of that. You can save errors and you can save time because you have automated your entire data lineage by properly analyzing metadata from a variety of different tools from different vendors.
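Editor’s illustration (not part of the spoken presentation): the impact analysis described above is the same graph walk run in the direction of data flow, starting from an ETL process and collecting every report it reaches. Object names and the `downstream` helper are hypothetical, and this is a sketch of the general idea rather than the product’s implementation.

```python
from collections import deque

# Hypothetical edges: each pair means "data flows from source to target".
EDGES = [
    ("etl:load_dwh", "table:dim_customer"),
    ("etl:load_dwh", "table:fact_sales"),
    ("table:dim_customer", "view:v_customer_products"),
    ("view:v_customer_products", "report:ssrs/customer_products"),
    ("table:fact_sales", "report:businessobjects/sales_report"),
    ("etl:load_hr", "table:dim_employee"),
    ("table:dim_employee", "report:ssrs/headcount"),
]


def downstream(start, edges):
    """Return every object that `start` feeds, directly or indirectly."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)

    seen, queue = set(), deque([start])
    while queue:  # breadth-first walk along the direction of data flow
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


affected = downstream("etl:load_dwh", EDGES)
affected_reports = sorted(n for n in affected if n.startswith("report:"))
print(affected_reports)
```

In this made-up graph, a change to one ETL process touches reports in two different reporting systems, while the unrelated headcount report is left out.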
Now, from other best practices or examples that we’ve seen, this map that you see in front of you may take anything from 10 days to two weeks. Now the question is, how many lineages are you performing, or want to perform, and you’re not capable of doing this because you’re working in ways other than automation? I just want to show you some examples. The power of automation, the power of using software, is the ability not to worry about your landscape, but rather to get intelligence about your landscape, whatever it may look like.
It can be any collection of systems that you have in your landscape, from any collection of vendors. What we see is that organizations are modernizing themselves due to different use cases and drivers, moving from traditional to new tools, moving the BI from on-prem to the cloud, and enriching their BI environment. They just don’t have to worry about the systems that they have. Let us analyze the entire BI and provide you with insights so you can get the service and be more accurate, faster, and serve the business with more, I would say, modern tools.
That’s it for us. That was really quick info about Octopai. We’ll be more than happy if you want to see a personalized demo, or to further discuss how to automate lineage within your organization. Back to you guys.
Jim: Thanks, Amnon. Let’s move into our Q&A period now and answer some audience questions. This first one is for Dave. Governance seems to be the main driver behind tracking data lineage. How do I convince my management that we need to track data lineage for other objectives such as those involving analytics?
David: Yes, that’s a good question. I think Amnon really showed some good examples of where it can be useful. Certainly I think you can demonstrate the business cost of not knowing the data lineage, particularly as organizations try to be more analytics-driven. They want their decisions being– using the data and the impact of analytics to be greater. As I mentioned before, questions that often come up immediately are: where did the data come from? How was it transformed? To be able to answer those kinds of questions, it’s important to have the data lineage. Those really can raise business understanding of what the problem would be.
Then I also want to compliment Octopai, I think, on the very visual interface, because I think there you can even show that to business users to see the lineage. Often, seeing it makes it a lot easier to understand, and to understand why it’s important, particularly as you’ve got different types of users looking at it, whether it be a business executive who doesn’t have a lot of deep knowledge about the data, or analysts or even data scientists who are looking at it, who do.
That’s important too. If you’re able to visualize it, that can be helpful in explaining it to upper management and making them see the role it can play in the analytics. Of course, governance itself has a big business impact, so it shouldn’t be too difficult to explain, to get the executives concerned about what it might mean in terms of reputation and financial penalties.
Jim: Amnon, do you have any advice to add?
Amnon: Yes, I think that David brought up really interesting points. What we see our clients, I would say, share about their ability to get budgets for data lineage is both on the upside and on the risk and efficiency side. The cost of not having automated data lineage, or data lineage at all, might involve risk. But if you know that you need lineage and you need the mapping, and you need to choose between continuing to do this manually, in the old-fashioned way, and replacing it with automation, it’s really easy to quantify this.
When we do trials and we show an analysis of prospects’ environments, they actually use Octopai to benchmark examples of their mapping versus how it looks with Octopai or any other automation. When they take some of the examples that I just showed and say, “We had to lineage our system for these use cases, and it took us five days here, 10 days here, 12 days here. Now we can replace all that with five seconds,” it’s very, very easy to quantify, either on the benefit side, saving money, or on the risk side of not being automated: room for errors, increasing risks, and so on and so on.
Jim: Our next question is for you, Amnon. What about compliance? Is it possible in highly regulated environments to use lineage for data traceability?
Amnon: The answer is yes. We find a lot of our clients in the financial sector, in the banking sector and insurance sector, and also in pharma and healthcare, are under a regulated environment or are audited. In some cases, the BI manager or the CDO has to prove that the data movement process is controlled. For that, they’re using Octopai to map this and actually print it out as part of their evidence that they have automation in place in order to understand the data mapping whenever it is needed.
This is something that we hear from our clients. We also hear that from Deloitte, we hear that from PwC. They are asking their clients to prove that they have good control of the data movement process. They also use Octopai or any other form of automation to help their clients get better control of the data. The answer is yes.
Jim: Thanks. Dave, you mentioned spreadsheets in your presentation. There’s a lot of Excel and additional offline work with all kinds of tools that produce results without obvious disclosure of the methods and lineage. How can data lineage for these processes be handled better?
David: Let’s see. Let me just look at the question a little more closely. A lot of Excel offline work. I think that actually has something in common with the fact that in many cases where organizations are tracking data lineage, they are doing it through Excel and through custom administration of it, and maybe capturing some of the metadata. I think that if you’re going to continue to work with these processes without more robust metadata management, it could be challenging, just because you’re a little bit vulnerable to not having captured all the metadata that you need, and to not being able to handle the scalability issues.
Then vulnerable also to whoever set up the system, the spreadsheets, and so forth. If they’re gone and they don’t– There might be a power user who doesn’t tell others exactly how to track this. I think that probably in our research we see that that can be sometimes a source of difficulties in organizations. I think that it’s possible to continue to use best practices around making sure you’re making priorities around governance data, maybe particularly reports and dashboards that are being used, and maybe around core critical analytics project, say around some forecasting that needs to be done, and be able to just focus on those areas and make sure you’ve got that. Then build out from there.
Often, I think organizations eventually come to a point where they need a better platform to manage the metadata and to make it possible to do data lineage across lots of different sources, and then of course to handle different use cases and the different personas that are working with the data, from those that are just looking at a dashboard to people who are doing deeper analytics. That would probably be my answer there: probably thinking about evolving to more robust technology, but certainly some things can continue to be done the [unintelligible 00:52:59]
Jim: Amnon, this next question is for you. Why do you focus on lineage just for BI and not for the entire organization?
Amnon: It’s the classic question: can you solve all the problems in the world? We needed to start somewhere, and we wanted to enable the business users through the BI, because they are closest to the business side, to, first of all, understand the BI. We originally came from managing BI. We did understand that the BI is getting more complex and more complicated. First of all, we wanted to focus on that. We feel that we’re doing a very good job on that note, even though this is a changing environment. Definitely, we’re going to expand beyond the BI at some point in time.
Jim: Dave, here’s one for you. Do we need to have metadata management first before we can track any data lineage?
David: I think, as Amnon really showed in his demonstration, they really do go together. As you’re expanding the ability to track data lineage and to account for where the data is, and doing proper inventories and so forth, you do want to have an ability to store and document that information, and make it easier for different types of users in the organization to find it.
That’s where metadata management really goes right along with it. As I mentioned in my previous answer, you could do some data lineage tracking just using a spreadsheet. In fact, we often see some organizations beginning with metadata management by containing that in the spreadsheet as well because obviously, it’s there, it’s part of their suite of applications anyway.
Again, as the scale issues come up, and as you try to develop more consistency around it, it often starts to look like a good idea to move on to a more robust metadata platform that can handle more scale and more use cases, is faster, has automation, and so forth. That’s why I guess the answer would be that it’s probably good to have them going together, with the organization seeing that it needs to manage metadata, and with one of the primary purposes being that you can track the data lineage across different sources and document it properly. They go together, I’d say.
Jim: Amnon, we have a couple of questions that want to drill down into your product. Here’s the first one: Is Octopai a platform service which has governance, compliance, and glossary built within the software, or can it only help with platform maintenance?
Amnon: If I understand the question correctly, we are not a data governance solution, we are a BI management platform, but we do help data governance teams to enjoy metadata for their needs, if that’s the proper answer to your question. Data governance is another compelling event from our point of view. At this point in time we’re not intending to become, I would say, a data governance solution, but we do have cooperation with other vendors who are focused on data governance, because data governance is not addressing the use cases that we address, the BI.
Jim: Does this work for on-premises infrastructures, for example, on-premises Power BI Report Server or on-premises SQL Server?
Amnon: The answer is yes. We serve hybrid environments, either on-prem or cloud, or both. We have clients who use Octopai to analyze their on-prem systems or native cloud systems, or both. If you remember the list of tools that we support, you can see more traditional tools that are naturally on-prem, like DataStage and Informatica, and other traditional tools. There are some that natively run in the cloud. The answer is we analyze on-prem, we analyze cloud, and both.
Jim: Here’s a question for both of you. Does data lineage include the employee ID, for example, of a person who ran a report and used the data, or does it simply log that an application such as Tableau was used?
David: Amnon, do you want to try to answer that first?
Amnon: Yes. Could you repeat the question please, again?
Jim: Sure. Does data lineage include the employee ID, for example, of someone who ran a report and used presumably the data, or does it simply log that Tableau was used?
Amnon: It very much depends on the client. We typically do not show the name of the user or things like that. We can only show at this point that this report in Tableau was used, and we don’t expose who used it. There are some clients who also share the usage metadata with Octopai. In that case, we are capable of doing this. It very much depends on the preference of the client. If you don’t share the usage metadata, then we can show that this report was generated in Tableau, here’s the structure, and this is how the data flow works.
Jim: We’re just about out of time here, so let me take a moment to thank our speakers today. We heard from David Stodder with TDWI, and Amnon Drori from Octopai. Also again, thank you to Octopai for sponsoring today’s webinar. Please remember that we recorded today’s webinar and we will be emailing you a link to an archived version of the presentation. Feel free to share that with colleagues. Don’t forget, if you’d like a copy of today’s presentation, use the “click here” for a PDF line.
Finally, I want to remind you that TDWI offers a wealth of information including the latest research, reports, webinars, and information on business intelligence, data warehousing, and a host of related topics. I encourage you to tap into that expertise at tdwi.org. With that, from all of us here today, let me say thank you very much for attending. This concludes today’s event.