David Marco: Start streaming this. Anne Marie, I’m going to hand the baton to you before we get you or your family members in trouble. Kick it off, and let’s get it going on data catalogs.
Anne Marie Smith: Hello, everyone. Welcome to this session of Data managementU.com’s webinar series. I’m Anne Marie Smith, and I’m the host/moderator for this event. As usual, we have some housekeeping points to cover. If you have a question to pose to our presenters, David Marco and Amichai Fenner today, please put it in Q&A. We monitor Q&A more closely than we do the chat. I want to make sure we address all the questions that we can get to in today’s session.
You’re here because you’re interested in data management. If you are interested in pursuing a course in data governance, metadata management, or data management in general, please head on over, after the webinar, to datamanagementeducation.com and look at the courses we have available. If you are on LinkedIn and follow social media, please do that. Register for data management news webpage on LinkedIn. We’re streaming live to YouTube. If you decide you really want to watch this webinar again, please go to our channel on YouTube, EWSolutions, and find the webinar that you want. You can download–
David: Anne Marie, this is a valuable feature. There are times when I cannot sleep at night. I put on a webinar, I’m out cold in minutes. It’s a fantastic service that we offer, but please continue.
Anne: It is. It does help me with insomnia as well. Octopai has a white paper on our website right now, for the presentation today. PDU certificates. If you attend the entire hour session, you will receive in about a week to 10 days, a certificate for one professional development unit that can be used for a variety of certifications. I want to thank our sponsor today, Octopai. I am truly looking forward to seeing what they have to present on data catalogs. We also thank our partner organizations for helping us to promote these valuable educational sessions.
Today’s webinar is about how we’re going to boost analytics and make more effective business decisions with a data catalog. We’ll learn about the major benefits of a data catalog and about how integrated data lineage can support a data catalog. How you unlock the value of data assets, and how to leverage business and technical users’ experience through the implementation of a data catalog. We’ll explore some typical use cases for effective BI analytics. You can see we are already in an interactive mode for discussions. That’s the normal way we do things here at Data Management U.
If you have a question, once again, please post it in Q&A. We will endeavor to answer all the questions we can during our time. I’d like to introduce today’s main presenter, Amichai Fenner, who is the product manager for Octopai, our sponsor, and our presenter today. Amichai supports companies with data lineage and data catalog implementations. He has deep experience in enabling the greater use of data through these technologies. At this point, I’d like to introduce our other presenter, David Marco, president of EWSolutions, who will walk us through some of the introduction to the data catalog environment to support what Amichai will present.
David, I’ll stop my sharing and let you share.
David: Excellent. First off, welcome, everybody, to our monthly webinar. This one is about data catalogs. We are excited to talk about this. This is just a background slide on EWSolutions and some of the client partners we’ve had. This is a topic passionate to my heart. I actually worked at the first company that ever had a product that they sold for data catalogs way back in– I don’t want to name the year, it was quite a while ago.
David: It really was. I have seen this segment of the marketplace just grow radically over the decades now. To me, it’s an exciting time to be a practitioner. Never before have we had tools that have as many capabilities as the tools that are available today. This is just a personal background slide. Please don’t hesitate to follow me on LinkedIn. If you have a question on something, here is my email address at the bottom of the screen. Let’s look at data catalogs. Before we can understand a data catalog, we need to understand metadata management.
What is metadata? Metadata is a type of data, first and foremost. Metadata digitally describes the who, what, when, where, how, and why of an organization’s data, processes, applications, assets, business concepts, and other things of interest. I’d like to say that metadata is knowledge. It’s just all that stuff that data isn’t. Let me evolve this concept a little bit more. These fundamentals of the who, what, when, where, how, and why of data, are foundational for any data management effort. You cannot build a data governance program without the technical foundation of metadata management.
You cannot build a master and reference data management program without these fundamentals. Good luck doing advanced analytics or machine learning. None of those machine learning algorithms work when we do not have a precise understanding of our data. A good way to think about it is data indicates the facts, but metadata gives the story. It’s very easy to say, “Hey, here are the facts. A couple hobbits walked around, found a ring, chucked it into a volcano.” Those are the facts, but that’s not quite the level of the story that Lord of the Rings is, is it? The metadata is all that cool stuff Tolkien wrote in Lord of the Rings.
Anne: Starting with the definition of what is a hobbit.
David: Exactly correct.
Anne: How big is the ring?
David: Exactly, and what kind of a ring? What does it look like? What are its purposes? More importantly, what are its powers, Anne Marie?
Anne: It has powers? Wow.
David: Yes, corrupting. It provides invisibility. If my Lord of the Rings memory is correct, it also corrupts its wielder. That is not good. That is a cursed item, as we might say. Anyway, so, good metadata on that ring is needed. I would think if Frodo had the complete metadata, the story of that ring, I would think he would have looked at Gandalf and said, “This is your job, buddy.”
Anne: “Never mind, bye-bye.”
David: Exit stage left. [chuckles]
David: Anyway, I like to think about it like this. Outside of my Lord of the Rings example, if I were to go up to an analyst and give them data, 28, 7, 40, and say, “Make some decisions on your company,” they would be lost. They couldn’t do it because what did I give them? I gave them content, but what didn’t I give them? Context. Maybe those are sales figures for their unit for the last year in millions. Content without context is meaningless. Data without metadata is equally meaningless. This is how analytics programs fail. Many of these programs are great at throwing data at a viewer or an analyst, but they don’t give a story.
They don’t provide the context. This is one of the reasons data catalogs have become so popular, but let me keep going. I have much to talk about, but oh so little time. In our field today, there are many terms we’ll hear about. We’ll hear terms like metadata repository, and managed metadata environment, and business glossary, and data dictionary, and data catalog. I just want to spend a little bit of time defining some of the basics here, and then I’m going to hand it over, the presentation over. First off, when you hear about a business glossary, I want you to think business metadata.
That’s what it holds. When we define a business term, which is a classic challenge companies have– I have an opportunity to consult at quite a few different organizations, whether it’s a large federal agency or a Global 2000 company. I will tell you, I never walk in the door and get a clear answer when you ask just a basic question: “What do we mean by customer? What is a customer? What is a product? What is a location?” These basic business terms in most companies are not even defined. What you find is there’s a great deal of debate, and discussion, and misunderstanding of these throughout the organization.
A good business glossary provides a capability for data stewards to define business definitions for business terms, to put in valid domain values. Security and risk classifications, almost every company is worried about PII. We do a lot of federal work as well. One of our federal clients, they are not even subject to PII laws, yet they still classify their data as PII or classify their PII data because they realize, at some point, at least in the court of public opinion, you really don’t want to have a data breach for critical, personally identifiable information.
Synonyms, abbreviations, acronyms, business groups, responsible data stewards, these are all marvelous examples of the type of business metadata we store in a business glossary.
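The kinds of business metadata David lists for a glossary entry can be pictured as a single record. The following is only a minimal sketch: the `GlossaryTerm` class, its field names, and the sample values are assumptions made for illustration and are not taken from any particular glossary tool.

```python
from dataclasses import dataclass, field


@dataclass
class GlossaryTerm:
    """One business term as a data steward might record it (hypothetical schema)."""
    name: str
    definition: str
    valid_values: list[str] = field(default_factory=list)   # valid domain values
    classification: str = "public"                           # e.g. "public", "internal", "PII"
    synonyms: list[str] = field(default_factory=list)
    abbreviations: list[str] = field(default_factory=list)
    business_group: str = ""
    data_steward: str = ""

    def matches(self, word: str) -> bool:
        """True if `word` is the name, a synonym, or an abbreviation (case-insensitive)."""
        candidates = [self.name, *self.synonyms, *self.abbreviations]
        return word.lower() in (c.lower() for c in candidates)


# Illustrative values only; a real definition would be agreed on by the stewards.
customer = GlossaryTerm(
    name="Customer",
    definition="A party that has purchased at least one product.",
    valid_values=["active", "inactive"],
    classification="PII",
    synonyms=["Client"],
    abbreviations=["CUST"],
    business_group="Sales",
    data_steward="Jane Doe",
)
```

Keeping synonyms and abbreviations on the same record as the definition is what lets someone who searches for "client" land on the agreed meaning of "customer".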
Typically, when we use the term business glossary, we are thinking about some type of automated tool that helps us manage these, and for our data stewards to collaborate on this information. Because the reality is when we talk about valid domain values and business definitions, no one person can define all these. Let me progress. My goodness, I have so much more to talk about.
Data catalog. This term has been around a long time, but of late, it has really gotten to be a much more highlighted term in our industry. Really, it’s come to mean, and to be, an evolution of our classic metadata management tools. These are tools that will find and manage large amounts of metadata stored in all our various systems and social media. Centralizing your business and technical metadata in one location is absolutely vital because we need to provide that full view of our metadata across the organization, and we need that business and technical metadata to be integrated.
A good data catalog integrates both business glossary and data definition functionality. Right now I’m doing a data governance tool evaluation for one of our client partners. To me, this is one of those pieces of functionality. If the tool doesn’t have it, if our glossary is not integrated with the data dictionary, I don’t really view it as a tool I personally, as a professional, would want to move forward with. When we look at other cool data governance or data catalog functions and features, I mentioned having that unified view of our data assets.
Having control for data security is important. One of the problems we have in many of our companies is we let everybody see all the data. That’s a problem when you have people’s phone numbers, social security numbers, addresses, and names all together. Having a good tool, a great data catalog that provides monitoring, auditing, data lineage, is absolutely important. What is data lineage? How data flows from its onset, which is data heritage, where did it originate, all the way through our systems. That’s what data lineage does. Data lineage, many would say, is the most expensive part of any IT budget.
Manually figuring out what data we have, what does it mean, where does it come from, valid domain values, all vital for the work that we do. Here’s just a little bit of an example of a classic data dictionary. Think data dictionary more your technical metadata; column names, field names, transformation rules, titles, formats, field widths. All of these things technical folks like myself really enjoy. Let me wrap this up because we have so much more we have to speak about. If you’re building a data management program in your company, don’t build it based on manual processes.
As a data management professional, it is your job to remove them, not to add to them. Data catalogs are a must-have capability for any company or federal agency looking to do data management and really grow. Be smart, be diligent, be patient, you will achieve amazing results. None of this is easy, but through diligence and good effort, you can achieve amazing results with that. I’m going to stop sharing, and I’m going to pass it over.
Amichai Fenner: All right. Can you see my screen?
David: Yes, I see it.
Amichai: Perfect. Thank you, David, for those wonderful insights. Thanks, everyone, for joining. As mentioned, my name is Amichai Fenner, and I’m a Product Manager here at Octopai. Today, we’re here to talk about literacy that creates business value. I think as data professionals, we focus on creating one source of the truth for data. We’ve been doing that for a long time. That’s great, but as the data landscape grows, and it spreads, and becomes multi-vendor, mixed cloud, and on-prem environments– Even, actually, our teams now are working remotely in part; no one is in the same place. Data literacy is really, really important so that we can all get the value from the data we’ve worked so hard on gathering, creating, organizing, and so on. One source of the truth about the data is what enables us to get the value from that data we’ve worked so hard on without it being–
David: Amichai?
David: Can I jump in quickly?
Amichai: Go for it.
David: You mentioned data literacy. Is that something that you’re hearing more and more from your customers?
Amichai: Yes, for sure. Data literacy is basically our customers understanding that there’s a lot of information about their data that they need documented in whatever different way they see fit, but they understand that they have information they need about the data, and referring to that as the literacy.
David: Yes, because we are seeing that now. We just won a piece of business where we’re working with a large federal agency on their data literacy. It is something that more CDOs are talking about. That we need to up our data literacy, that most people really don’t understand how to work with and interact with their data. I just thought it was so interesting that you brought it up right out of the gate.
Amichai: Yes, for sure. Without it, basically, we create bottlenecks in consuming the data that we’re creating. We’ve been working so hard as data professionals on relieving those bottlenecks by creating self-service and so on. This is really the enabler of that self-service, and of people being able to make efficient, accurate use of that data, you could say. I’m really honored to show you how Octopai is putting this into practice. First, I’m going to give you a quick brief about Octopai, and then we’ll have a live demo of the actual catalog.
First of all, BI and Analytics teams face major challenges. We have lack of visibility into how the data is getting from one place to another, what the data means, how it can be used, where it’s being used. All these different questions about the data, which are really not transparent unless we have proper tools to help us do these things. Ooh, sorry about that.
A little bit about our platform. Our platform is a SaaS platform. It’s automated, meaning it automatically harvests the metadata from all the different systems in your BI landscape. It gives you the ability to really read into the BI landscape and understand how the data is being created, populated, being used, and what it means. It’s easy to use. Its refresh is automated, so that means that it’s not something that you build one time and then doesn’t change and becomes obsolete. It’s constantly updating. Intuitive and user-friendly. I guess you’ll be the judge of that, but since we have so many different types of users in our organization, we have many different types of data citizens.
You have more technical users, more business users. They need a platform where they can all easily look in one place and understand what the data means and how it got there. Oh, a little bit about why BI groups need transparency. As you can see here on the left, this is a typical BI landscape. You’ve got your source systems with different ETL systems that are used to bring that data and ingest it into the databases, analysis layers, reporting layers. These are some of the tools that we support, by the way. You can see more of those on the website.
Understanding how the data is getting from one place to another over here, that’s a true challenge along with where the data can be found in the different systems, and what it means, who’s responsible for it, and so on. Here are some of the reasons that customers are coming to Octopai. They’re looking for one source of the truth. They are seeing inefficient use of data. They’ve got loss of tribal knowledge about the data within their company, got lack of independence in using the data. Of course, regulations and compliances that you’ve mentioned, David.
They want to create new business processes that are based on the data that they already have, but they don’t know how to locate it. They need to fix broken processes and locate and understand what those processes mean, and what they are doing. Of course, perform impact analysis to understand if they make a change anywhere, what else would be impacted. Fix reporting errors, which happen occasionally in the BI landscape. Not too often, but those happen too. Octopai basically provides a platform where our users can have a discovery, which helps them understand where the data is located within all the different systems.
We provide data lineage, which shows how the data flows from one place to another at different levels. Octopai has three different types of lineage where you can actually see column-level lineage, end-to-end, all the way from the source system to all the different reporting and analytics tools. I’ll try to get to that today as well if time permits. And, of course, there is the data catalog that we’re all here to see right now. Since we’re running late on time, I’m going to try to get to this later. Let’s talk a little bit about the automated data catalog.
We’re talking about one source of the truth about our data. Here are the key capabilities that Octopai provides for a data catalog, and what we believe is a must for any data catalog. First of all, you need the automation. Automation means that we automatically harvest the assets from the different systems, centralize them in one place where they’re easily accessed, easily managed, and so on. They’re also automatically refreshed, so it’s not something that goes stale, as mentioned before. It’s something that keeps updating all the time.
Without this, it’s pretty much impossible to create a data catalog that’s useful. Trying to manually harvest the assets from all the different systems that are constantly changing is a true challenge. Next, we need collaboration and democratization. The data catalog, the way I look at it is it’s really the hub where everything comes together. It’s where business goes to understand what data they can use. It’s where self-service users go when they want to understand what data they can use to build their analytics, and build their reports, and so on.
It’s the place to have all these different conversations about the data where it’s all in context, providing that true transparency into what data is out there, what it means, how it can be used. We also believe that to achieve this, you must have integrated data lineage at all different layers to support different use cases you may have in the BI environment. Whether you’re looking to understand how data became the way it is in a certain column, looking to understand what is dependent on a specific object in your landscape, understanding what the logic looks like for a specific process, that’s all lineage.
Anne: Amichai, we have a question.
Anne: “What is integrated data lineage?”
Amichai: Great question.
Anne: “What’s the difference between that and regular data lineage?”
Amichai: Sure. Thank you for that question. Integrated data lineage is basically the concept that a data catalog on its own, without a multi-layered lineage, is going to be very hard to maintain. Integrated over here means I can, at any point, see the lineage for any data asset, and at any point if I’m looking at lineage, I can easily see the definitions of that asset from the catalog. All right?
Anne: Okay, thank you.
Amichai: Combining the two, combining the catalog with the lineage, creates that value, and that completeness in understanding the data for all different types of users. I’ll demonstrate that in a few moments. Great, thank you for that question. Lastly, over here it’s monitoring. Catalogs can be automated, but you will have people that are responsible for different assets, and they need the capabilities to monitor exactly the health of their catalog, the health of the assets they’re responsible for, and so on. Let’s see what the data catalog looks like in practice.
Let me stop sharing the screen.
Amichai: I believe you can all see my screen now?
Amichai: Perfect, great. This is Octopai’s automated data catalog over here. I’m going to start out by just explaining that we have different layers in our catalog. As we started mentioning before, you have many different types of users that are interested in understanding the data that’s out. You also have many different types of tools in your BI landscape. Octopai harvests the assets from all the different tools, whether they be reporting tools, database tools, ETL tools. We also allow creating assets within the actual catalog. For reports, we actually divide the assets into different layers.
We’ve got the actual reports, we’ve got presentation layer. This is what the actual end-user sees in his report. We’ve got the semantic layer where all the logic happens, the magic. We’ve got the physical layer, which actually ties back to the database over here. The reason I’m showing this to you is to give everyone an idea that you have users that are only interested in specific types of assets. They can focus. Octopai gives them the option to focus exactly on the assets that are relevant to them. Why is that important?
In an average data catalog, you’re going to have hundreds of thousands of assets, so being able to navigate through everything to locate exactly the asset that you’re interested in, that’s relevant to you as a user, whether you’re a business user, a self-service user, a technical user, or a data steward, is key for the success of a data catalog. We automatically harvest all these assets, and anyone can set this up so they see exactly what’s relevant to them. Next, what I’d like to show you is our catalog through a use case. This use case is something that anyone who works with data is probably very familiar with.
We have a business user who’s interested in using data that has to do with sales. Basically, he’s looking for a sales report over here. Normally, without a data catalog, this user would reach out to the BI team asking them where they can find such and such data. You would have some back and forth between the BI team and this user before they understand exactly what report they can use. Usually, it doesn’t really end there. They have questions about the report, and questions about the different assets. Eventually, they understand what asset they can use.
Except that entire conversation that they had took a long time. It wasn’t documented anywhere. The same questions are going to come up a week from now, and take all this time up again. Here’s how to do it with Octopai’s data catalog. We’re a business user looking for a sales report, so naturally, I’m going to search for sales. I’m going to filter out reports. I get a list of all the different reports from the different tools. We can see we’ve got SSRS, and Power BI, and Tableau. This is a multi-vendor landscape. That’s not a challenge for Octopai.
I can see this badge over here that indicates these assets have actually been approved. I may want to lean into these first. I’m going to choose this Power BI report over here, and take a look at it. We can see here, we’re looking at a Power BI report called Order by Sales Rep Summer. Up top, we can see these different tags this report has been tagged with. It’s been tagged with Sales, Salesforce, EMEA, GDPR, Project Number, and PII, so it must include some PII information, and its line of business, Orders, over here. We can see its rating.
This is a user rating of 4.5 from two users. If I click on it, I can actually see which users gave the ratings over here. This is important because remember, we have collaboration options here within the platform, and users always need to collaborate about data. This also gives them an idea of who they may want to collaborate with, whether it be a high score or a low score, to understand if this is data they can use for their needs. Next, over here, we can see the status. We knew this, of course; we selected the asset that was approved, as shown by the badge over here.
We’ve got some built-in statuses over here, which are also customizable to match our internal language.
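The search Amichai walks through (type a keyword, narrow to reports, look at approved assets first) can be sketched in a few lines. This is an illustration under stated assumptions, not Octopai's implementation: the `Asset` fields, the `search` function, and every catalog entry except the Power BI report named in the demo are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    asset_type: str          # e.g. "report", "column", "table"
    tool: str                # source tool, e.g. "Power BI"
    tags: tuple[str, ...]
    approved: bool
    rating: float            # average user rating


def search(assets, keyword, asset_type=None):
    """Match the keyword against names and tags, optionally filter by type,
    and list approved assets first, then the highest rated."""
    keyword = keyword.lower()
    hits = [
        a for a in assets
        if (keyword in a.name.lower() or any(keyword in t.lower() for t in a.tags))
        and (asset_type is None or a.asset_type == asset_type)
    ]
    return sorted(hits, key=lambda a: (not a.approved, -a.rating))


# Hypothetical catalog contents; only the first entry echoes the demo.
catalog = [
    Asset("Order by Sales Rep Summer", "report", "Power BI", ("Sales", "EMEA", "PII"), True, 4.5),
    Asset("Quarterly Sales", "report", "Tableau", ("Sales",), False, 4.8),
    Asset("Sales Detail", "report", "SSRS", ("Sales",), True, 3.9),
    Asset("TotalDueSum", "column", "Power BI", ("Sales",), True, 4.0),
    Asset("Headcount", "report", "SSRS", ("HR",), True, 4.2),
]

results = search(catalog, "sales", asset_type="report")
for asset in results:
    print(asset.tool, asset.name, "approved" if asset.approved else "not approved", asset.rating)
```

Sorting on approval before rating mirrors the point made in the demo: a user may want to lean into approved assets first, even when an unapproved report has a higher score.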
David: I wanted to jump in just quickly because you’re showing here something very valuable. Most organizations have hundreds, actually, thousands of different BI reports available. Trying to sift through these is a nightmare. I’ve actually seen some statistics on it. Data scientists typically spend more than half of their time trying to find the data they need rather than building up beautiful AB case studies and calculating P-value, whatever the heck P-value is. This ability to find the data to sift through, I think is extremely valuable. I just wanted to chime in because I think this is a very real-world kind of situation.
Amichai: For sure. That’s exactly why customers are coming to us. You mentioned data scientists. Any organization now who’s looking to be data-driven, who’s got all these different data initiatives, is lost without a way to understand what data is out there and locate it, understand who’s responsible for it, easily be able to use it, collaborate about it, understand how it’s being used. Yes, for sure. Thanks for that input. Next, over here we can see, it’s also indicated as sensitive over here. The combination of being sensitive and PII makes sense.
It gives you an idea of why this information, or why this report includes sensitive information. This right here already gives your user a good idea of whether they want to keep on looking into this asset, whether it’s relevant to them according to how it can be used, its status, what it’s associated with, and so on. Our business user wants to keep on looking into this asset. He can see here a full description that’s been documented about this asset. I’m not going to read the whole thing, but it’s a long description about this report. We’ve got a technical description about the report.
A little bit shorter, yearly sales for last year at account level. We’ve got a calculation description used differently, of course, according to different types of assets. Over here, it’s been used to describe all data filtered for EMEA only. Makes sense. We saw the tag up here for EMEA. Perfect. If there is a description in the source system, Octopai automatically harvests it over here along with the asset. That’s going to show up here, as well as calculations for semantic layer, and so on. It’s a report, so, of course, it’s got no data type. We can see its path over here.
We can see the source system has been documented as Salesforce over here. We can see two really important roles about this data. We can see the data owner, who’s Sophia. We can see the data steward, Jeff. Now we know who we need to reach out to if we have a business question about this data asset, or if we have some type of technical information or issue with this data, we know exactly who we need to collaborate with. We can see who updated it and when, and when it first entered the system. This, by the way, is how we manage those tags you saw up here.
You know what? Our user looks into this, and they’re like, “Okay, this makes sense. I think I can use this report, but I still got questions about that.” Data citizens have questions, no matter how long the description is. This is where the collaboration starts. We click over here on the Post option, and you can see here, the user, in this case, was Holly. She’s reached out to Sophia Davis by mentioning her here in the post, asking, “Does this report represent fiscal or calendar year?” Legit question. Sophia reaches out to Holly, mentioning Holly over here, saying, “The report uses fiscal year.”
Now, anytime a user is mentioned in the post, by the way, they get, of course, an automated notification by email, with the message, with the asset, with a link to continue collaborating about the asset within Octopai. The benefit of this is– Well, there are a few. First one is, this entire conversation actually happened in context. Holly didn’t need to go to her email and start describing where she found this data, what it was called, and so on. It’s just straight here from the post while she was looking at the asset, so everyone knows exactly what they’re talking about.
Of course, it saves time. Sophia got an automated message, so that was instant. When they continue collaborating over here in context, this entire conversation is actually preserving this tribal knowledge that otherwise would get lost in their emails or other types of chats. The next user that logs in and has a question about this report will also know that the report uses fiscal year. If this was information that’s important enough to include in the description right over here, Sophia would do that as the data owner. If not, this information is still documented over here, preserving that tribal knowledge that’s so valuable.
Of course, preserving tribal knowledge is a huge challenge for many organizations. As different roles advance and so on, within the company, this information is lost. Besides wasting time on answering the same question over and over, eventually, Sophia may not be in the position to answer these questions anymore, and all that information that she had gets lost. Next. Down here, we can see all the different assets that are automatically, from Octopai’s analysis, linked to the report. These are the presentation columns from the report.
You can see them over here, including the asset TotalDueSum, which is a presentation column that’s automatically linked to the report. There are different types of links in Octopai. This is an automated link. If I click on it, it’s a hyperlink so it’s going to link me directly to that asset over here to continue understanding the data that’s included in this report. Now, I’m looking at the specific presentation column called TotalDueSum. I can see all the information we saw previously, and again, collaboration. This time, Holly reaches out to Jeff. Jeff is the data steward over here, saying, “Hey, it looks like the sum is rounded.
Can you let me know if the amount is rounded up or down? The numbers aren’t matching up with other reports.” She’s got a question now about how this data got here, what happened to it, and so on. Without a proper lineage tool that’s completely integrated with the catalog, you would need to start hunting for this information to understand how it got there, where else it is being used to understand how it’s showing up different numbers in different places, in different reports. The good news is that in Octopai, this is completely integrated and happens in a click of a button.
Let me show you how. Any asset is automatically linked to its relative lineage. Over here, remember we’re looking at a column in the report, in the presentation layer of the report, and by clicking on the three dots over here, I get two options. I get the option to search for it in discovery. Discovery is another module in Octopai. We can cover that in a separate session. We’ve got the option to click here and get to the end-to-end column lineage for this specific asset. I’m going to go ahead and click on that. Jeff, over here, the data steward, is trying to understand how the data got where it is and what else is using this data.
You can see here the column-to-column lineage, end-to-end, starting from the TotalDueSum in the report we were just looking at, drilling all the way back to the original source system, through all the different processes that took place in manipulating and populating that data through the different objects, all the way onto this report. We can also choose any asset we see within the lineage to say, “Hey, I see for instance that the data is coming from TotalDue in DWH FactSales. Let me see what else is using this column.”
Click over here, override the column lineage, and see, hey, okay, this time I started from the DWH FactSales. I can see there are other objects using this data as well. I can see here, there are analysis objects that are using this column. I can see here another report using this column directly from the table as opposed to a view that was being used here in between the object and the report. Now I understand exactly what else is impacted by this column directly.
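The two lineage questions described above (where did this column come from, and what else uses it) amount to walking a directed graph in opposite directions. The following is a minimal sketch of that idea, not Octopai's implementation. TotalDue, TotalDueSum, and DWH FactSales are named in the demo; every other node name, and the edge list itself, is a hypothetical stand-in.

```python
from collections import defaultdict, deque

# Hypothetical column-level lineage edges (upstream column -> downstream column).
# Only TotalDue, TotalDueSum and DWH FactSales appear in the demo; the rest are invented.
EDGES = [
    ("SourceSystem.Orders.TotalDue", "Staging.FactSales.TotalDue"),
    ("Staging.FactSales.TotalDue", "DWH.FactSales.TotalDue"),
    ("DWH.FactSales.TotalDue", "DWH.vSales.TotalDue"),
    ("DWH.vSales.TotalDue", "SalesReport.TotalDueSum"),
    ("DWH.FactSales.TotalDue", "OtherReport.TotalDue"),
    ("DWH.FactSales.TotalDue", "AnalysisCube.TotalDue"),
]


def build_graphs(edges):
    """Index the edges in both directions so either question is a simple walk."""
    downstream, upstream = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream


def reachable(adjacency, start):
    """Breadth-first walk returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


DOWNSTREAM, UPSTREAM = build_graphs(EDGES)


def trace_to_source(column):
    """Everything the column was derived from (end-to-end lineage, backwards)."""
    return reachable(UPSTREAM, column)


def impact_of(column):
    """Everything that consumes the column, directly or indirectly."""
    return reachable(DOWNSTREAM, column)


if __name__ == "__main__":
    print(sorted(trace_to_source("SalesReport.TotalDueSum")))
    print(sorted(impact_of("DWH.FactSales.TotalDue")))
```

In this toy graph, the impact walk from the warehouse column reaches one report through a view and another report directly from the table, which mirrors the distinction pointed out in the demo.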
David: Amichai, this is really, I think, important functionality. Maybe I could share a client partner example. There was a bank that we were working with, and they needed to expand a key field with one of their partners. Their partner was a credit card company. As a developer, you’re always expanding a key field. That’s part and parcel of what you do, but it ended up being a fascinating case study because the financial services provider that we were working with had fantastic data lineage across all of their major systems, but the credit card company didn’t.
It took the credit card company– They had 4 people work for 5 months to identify 122 changes they needed to make in their environment. Not to make the changes, only to identify them, whereas for a company that has invested in this, I think it took less than 15 minutes to spit out a report showing them. They actually had to make 144 changes, if my memory is correct. I just wanted to bring in that kind of real-world example because we do this stuff all the time.
Amichai: For sure. Yes, thanks for sharing that. Naturally, the reality is that when it takes that long to manually map, and it does take that long to manually map it out, it’s actually not accurate by the time you’re done. Everyone knows the data landscape is constantly changing. Processes have changed by the time you’ve done your analysis. It’s, I wouldn’t say obsolete, but definitely not accurate by the time you’re done with that manual work. Thanks for sharing [crosstalk].
David: No problem.
Amichai: Yes. Next. Remember, Holly had a question about how that data was rounded. She had a few questions in that post. You can see here the processes that are involved in populating this data. This here, for instance, is an SSIS data flow called FactSales. If I double-click on it, we can see it over here. We can see exactly its path, and so on. If I click over here, it actually gives me an option to see inner-system lineage, as opposed to the lineage we were looking at just now, which tells you, across your entire landscape, how one column got from one place to another.
This actually breaks down this specific process at column level, showing you the logic of what happened to that column within the process. Over here, you can see TotalDue came from SIG FactSales, went through this Union All transformation over here, another union over here within the SSIS package, and all the way down to this target table within this SSIS data flow task.
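Inner-system lineage, as described above, follows one column through the ordered components of a single process. Here is a minimal sketch of that idea, assuming each component records which input column feeds each of its output columns. The component names and the renamed column `TotalDue_out` are hypothetical and are not taken from the package shown in the demo.

```python
# Hypothetical components of one data flow, in execution order.
# Each step maps an output column to the input column it was derived from;
# None marks the point where the column enters the data flow.
DATA_FLOW = [
    {"component": "Source: staging FactSales", "columns": {"TotalDue": None}},
    {"component": "Union All", "columns": {"TotalDue": "TotalDue"}},
    {"component": "Union All 1", "columns": {"TotalDue_out": "TotalDue"}},
    {"component": "Destination: DWH FactSales", "columns": {"TotalDue": "TotalDue_out"}},
]


def column_path(data_flow, target_column):
    """Walk the data flow backwards from the target column to its origin."""
    path = []
    column = target_column
    for step in reversed(data_flow):
        if column not in step["columns"]:
            raise KeyError(f"{column!r} is not produced by {step['component']!r}")
        path.append((step["component"], column))
        column = step["columns"][column]
        if column is None:  # reached the component where the column originates
            break
    path.reverse()  # present the path from source to target
    return path


if __name__ == "__main__":
    for component, column in column_path(DATA_FLOW, "TotalDue"):
        print(f"{component}: {column}")
```

A path like this gives a steward one place to check each component for logic such as rounding, instead of opening each box in the ETL tool separately.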
Anne: Amichai, we have a question.
Amichai: Oh, sure.
Anne: The question posed is, “Is all of this automatically discovered, or does a team have to manually document this? How do you ingest all this metadata?”
Amichai: Great question. All the metadata for the lineage and for populating all the different assets is automated. Doing this manually is almost impossible. It’s all automated. We have a client that connects to the different BI systems in the landscape, extracts metadata. Metadata only, no data. Then we have complex processes on our side, an analysis that knows how to put all this together. I’m assuming that at least some of our participants over here are familiar with different ETL tools such as SSIS. Visualizing something like this isn’t possible, even if you have amazing knowledge of SSIS.
It just doesn’t map out this way, all the way from source to target. You have to keep on clicking on lots of little boxes. You don’t get one picture that shows you the entire thing, and that’s the real benefit of using Octopai to understand this. You can see exactly where the column went through, the different tasks, the different components it went through within the package to understand exactly where it is, where things may have gone wrong, or may need to be adjusted, and so on. Thank you for that question.
Anne: We have another question.
Anne: “Since the systems are ingesting technical metadata, the business metadata has to be entered manually. Is that correct?”
Amichai: It’s partially correct. If the business metadata already exists in the source systems, we automatically harvest that as well and show it in Octopai. The reality is, though, that most of that business metadata is not out there. If it is, we can automatically harvest it from the systems. Octopai can also automatically load it from different spreadsheets, and so on, if they exist. Most of the companies that we’ve been in touch with, however, have very limited existing documentation of business definitions, usually in spreadsheets that aren’t necessarily up to date, and so on.
That’s the idea of putting all this in a catalog where it can be managed up to date, available to everyone, and so on.
Anne: Thank you.
David: I want to jump in. I have the privilege of getting a chance to see Martin’s question. Martin, this ain’t easy. When you talk about building out business definitions and all this, as your question implies, it is a lot of work. What I think is relevant, and what’s critical, is you don’t want to do this via spreadsheets. A lot of times when we work with companies, we see it sitting in Word docs, spreadsheets, notes, all kinds of different places. Having it integrated, I think is really important. To answer your question, it is a ton of hard work because most companies didn’t put in the brain work to do it.
Some of the definitions, Amichai, I’m sure you see this, you bring in a spreadsheet, it says Customer ID, and then the definition is ID of the customer. Awful. That definition tells you nothing. It has no value whatsoever. Just to Martin, I read his original question, a bunch of hard work, but you want to be able to do it once, not over and over again. Amichai, I will let you continue.
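The “Customer ID is the ID of the customer” problem can be screened for mechanically before definitions are loaded into a catalog. The following is a rough heuristic sketch, not a feature of any product mentioned here: it flags a definition as circular when, after dropping a small assumed stopword list, it contains no word that is not already in the term. The stopword list and function names are hypothetical.

```python
import re

# Assumed, deliberately small stopword list; extend it for real glossaries.
STOPWORDS = {"a", "an", "the", "of", "for", "to", "in"}


def tokens(text):
    """Lowercased alphanumeric words in the text, minus stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def is_circular(term, definition):
    """True when the definition adds no word beyond those in the term itself.

    An empty definition is also flagged, since it adds nothing either.
    """
    return tokens(definition) <= tokens(term)


if __name__ == "__main__":
    print(is_circular("Customer ID", "ID of the customer"))
    print(is_circular("Customer ID", "Unique key assigned to each customer when an account is opened"))
```

The check is crude: it does not recognise synonyms or abbreviations, so it is best treated as a prompt for a steward to review a definition rather than a verdict.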
Amichai: Thank you for that for sure. To expand a little bit on what you just described, the reason companies that have spreadsheets are reaching out to us is because it’s hard for them to manage. It’s all manual work. Not all users have access to that spreadsheet or appropriate permissions to that spreadsheet, and so on. They have multiple spreadsheets sometimes that have conflicting definitions, and so on. It’s really hard for their users, even if they have access, to filter out and get exactly to the assets that they wanted. Remember, your landscape is going to have hundreds of thousands of assets within it.
Next, over here, back to our use case. We’re looking at the inner-system lineage for TotalDue over here for the specific SSIS data flow task. Just to show you, the third type of lineage that Octopai provides is cross-system. Now let’s say we see the information is flowing into DWH FactSales. If I double-click on it, I’ll get all the additional information about it, such as schema, database, and so on, which also integrates directly with the object lineage. If I click on it, I’m going to get to the cross-system lineage for this object over here. We can see DWH FactSales over here is actually used by the stored procedure over here.
It’s also used in this measure group in OLAP. It’s used in this tabular table over here, tabular model. It’s also used in these different reports from different systems, used in a view that’s used in a report. This is giving you the cross-system lineage for the asset, which is for completely different use cases. [crosstalk]
Anne: We have one more question.
Amichai: Oh, yes.
Anne: Since we’re almost at the top of the hour, I wanted to make sure we got the question in. “Is it possible to export lineage metadata to other solutions? How difficult is it to provide in a customized and usable way?”
Amichai: Yes. First of all, we have different ways of downloading the analysis. We have direct collaboration with some partners, but the reality is that Octopai’s lineage is, to the best of my knowledge, the deepest and the most complex. Other platforms don’t always know how to ingest the level of detail that Octopai knows how to provide, but, yes, it can all be exported. Whether you can upload it, that depends on the other system.
Anne: Thank you so much.
Amichai: Sure. Over here, for instance, just to complete the integration topic, again, any asset you see in Octopai, whether it be in a cross-system lineage, inner-system lineage, or end-to-end lineage, is integrated completely with the catalog. You can see over here I’m looking at a view, which is called vSales Customer Products. Actually, let me grab this view over here, which is vSale Products Summary. If I double-click on it, we see the information over here. I can click on the integration button with the automated data catalog.
By clicking on it, I’m back here in the catalog looking at the view’s definitions over here, understanding what exactly it means, automatically linked to its asset, and everything comes together. At any point in the system, you can always connect to understand either how the data is being used, where it can be found, how it got there at any different level, whether it be process, column, object, and so on. Last word here about the platform. You can see here we’ve got these built-in filters at all kinds of different levels.
The idea over here is to give your users the easiest, simplest way to navigate through the entire landscape so that they can pinpoint exactly the asset they’re looking for. From the hundreds of thousands of assets you’ve got over here, you’re going to say, “Hey, I only want to see assets that have been approved. I’m only looking for reports right now. I don’t care whether the information is sensitive or about the rating right now.” I can also say, “I’m only interested in assets that have been tagged as Sales,” for instance, or tagged with a specific project, or with GDPR.
Maybe I’m a data owner or steward who’s in charge of managing these assets, I can look exactly at the assets that are relevant to me, hit Apply, and see all of the assets that correspond with the filter that I created over here, which resembles kind of an online shopping experience for your different users. The reason we did that is it needs to be really simple for any type of user, whether they’re really technical or completely business user type. That’s what I had to show to you here within the use case. Let me just pop up a quick slide for a moment, and I’ll see if we have any additional questions.
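The “online shopping” style of filtering described above is essentially faceted search: each facet the user sets narrows the result, and facets left unset are ignored. Here is a minimal sketch under that assumption. The `Asset` fields, the sample catalog entries, and the `filter_assets` function are all hypothetical and do not reflect Octopai’s data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Asset:
    """Hypothetical catalog entry with a few filterable facets."""
    name: str
    asset_type: str   # e.g. "report", "table", "view"
    status: str       # e.g. "approved", "draft"
    tags: frozenset = field(default_factory=frozenset)


# Invented sample catalog; a real landscape would hold far more assets.
CATALOG = [
    Asset("Sales Summary", "report", "approved", frozenset({"Sales"})),
    Asset("Customer Export", "report", "approved", frozenset({"Sales", "GDPR"})),
    Asset("Draft Sales Forecast", "report", "draft", frozenset({"Sales"})),
    Asset("FactSales", "table", "approved", frozenset({"Sales"})),
]


def filter_assets(assets, asset_type=None, status=None, tags=()):
    """Keep assets matching every facet that was set; unset facets match anything."""
    wanted = set(tags)
    return [
        a for a in assets
        if (asset_type is None or a.asset_type == asset_type)
        and (status is None or a.status == status)
        and wanted <= a.tags  # the asset must carry all requested tags
    ]


if __name__ == "__main__":
    for asset in filter_assets(CATALOG, asset_type="report", status="approved", tags={"Sales"}):
        print(asset.name)
```

Leaving a facet unset corresponds to the “I don’t care whether the information is sensitive” case in the demo: the filter simply does not constrain on it.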
Anne: No, we do not at the moment.
David: We’re four minutes from the hour, so you can do the slide, and then, Anne Marie, you’ll wrap it up.
Anne: That’s right.
David: Take us home.
Amichai: Okay, one moment.
Amichai: Okay. Sorry about that. Now that we understand what the catalog looks like, here is what you can do once you have the catalog in place. You’ve got one source of truth about your data, as we said, and your users can understand where they should look for their data, whether the data matters, what this data represents, whether it’s relevant and important, and how they can use this data. The end result is right over here. There’s more use of data, efficient use of data, accurate use of data, which ultimately drives data monetization, which is what we’re all really out to do from my perspective.
We maximize the value of data by staying up to date through automation, providing full visibility into the data journey through lineage, and preserving tribal knowledge through collaboration. Thank you, guys, so much. My name again, Amichai Fenner. You’re all welcome to reach out to me. This is my email here below.
Anne: Amichai, thank you so much. I really appreciate it. I know everybody at the Data Management U community appreciates Octopai’s sponsorship of Data Management U. This was a really valuable presentation. I thoroughly enjoyed it. I hope that everybody else got a lot out of it. Let’s just see. We have one more question from Martin again. “Can the users enrich metadata through the collaboration inside Octopai?”
Amichai: I’m sorry, again? “Can users–”
Anne: “Can users enrich the metadata through the collaboration feature inside Octopai?”
Amichai: Of course. There are different types of users in Octopai, different levels of permissions, and so on, and that’s exactly how a data catalog evolves, through the collaboration.
Anne: Thanks again to everyone. Have a wonderful rest of your day.
David: Take care. Goodbye, everybody.
Amichai: Bye-bye, take care.