
Decoding the Mystery: How to Know if You Need a Data Catalog, Data Dictionary, or Business Glossary?


There’s so much confusion out there about the difference between a Data Catalog, a Business Glossary, and a Data Dictionary. Well, we’re here to help clear it all up and help you understand which is right for your organization’s needs and challenges.


Watch our webinar with DATAVERSITY, during which Malcolm Chisholm, President of Data Millennium, and Amichai Fenner of Octopai delve into the differences, use cases, and more.

Video Transcript

Shannon Kempe: Hello and welcome. My name is Shannon Kempe and I’m the Chief Digital Manager at DATAVERSITY. We would like to thank you for joining this DATAVERSITY webinar, Decoding the Mystery: How to Know if You Need a Data Catalog, a Data Dictionary, or a Business Glossary, sponsored today by Octopai. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. For questions, we’ll be collecting them via the Q and A in the bottom right-hand corner of your screen, or if you’d like to tweet, we encourage you to share highlights or questions via Twitter using #DATAVERSITY. If you’d like to chat with us or with each other, we certainly encourage you to do so. To open and access the Q and A or the chat panel, you’ll find the icons for those features in the bottom middle of your screen.

Just to note that the chat defaults to sending to just the host and panelists, but you may absolutely change that to chat with everyone. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and additional information requested throughout the webinar. Now let me introduce to you our speakers for today, Dr. Malcolm Chisholm and Amichai Fenner. Amichai is a product manager with Octopai with over seven years’ experience working as a full-stack BI expert. He has expertise in BI methodology and architecture, as well as technical skills in various BI tools, from ETLs to Reporting and Analytics.

He currently leads the development of Octopai’s automated data catalog. Malcolm is the president of Data Millennium. He’s a thought leader, author, and speaker in data governance and data management. Malcolm has over 25 years of experience in data-related disciplines. He has worked in a variety of sectors, including finance, manufacturing, government, pharmaceuticals, and telecoms. With that, I will give the floor to Malcolm to start today’s webinar. Hello and welcome.

Malcolm Chisholm: Thank you very much, Shannon, and it’s great to be here with the community today to speak about this important topic. If we can go to the first slide, I think it’s a very hot topic today, but it’s one which exists in a historical context, and that is the shift to Data-Centricity. If you think about when the computerized age really took off, which was in the mid-1960s, for many years thereafter, the focus was on automating the enterprise’s processes. Data was thought of as a byproduct of automation. There was very much a rush to do things like automate the books and records of an enterprise. Because if you think about a bank in those days, people had to go into a bank and a human being would write in a ledger how much money you were taking out or putting into your account. That’s completely unthinkable today.

The banks couldn’t scale and these clerical operations were very expensive. That’s when computers came in and provided a tremendous amount of efficiency. It also allowed scale. It allowed speed to increase, but the focus was on automating processes. To some extent, that focus has stuck with us today, but the reality has changed over the decades. Today we can find packages to automate practically anything. The core problem we’re faced with is getting value out of data. Data is increasingly at the heart of business models. You can think of companies like Google and Facebook and Twitter: whatever you might think about them, they’re working with data. That’s what they get value out of.

That is what they monetize. Now not all enterprises are directly monetizing data, but they are using data to do things like predictive analytics, their BI, et cetera, to run their companies. That’s what’s happening today, albeit with somewhat of a process-centric mentality, still, in IT. We have massive data volumes, and data from not only internal enterprise sources but external sources as well. Data acquisition has become a whole subdiscipline of data management and data governance. That’s affecting the way in which we need to process data, and it feeds into our artificial intelligence and ML technologies, which are increasingly available and increasingly within reach of small to medium-size enterprises.

This makes everybody very data-hungry. Data is at the center of the modern technological world. It’s been called the new oil, the new gold, the fuel that runs our digital economy or information age. That has a profound impact because we need to know how to manage it, which is where data catalogs, business glossaries, and data dictionaries come in.

Next slide, please. What is metadata? Well, metadata is the information we need to understand and manage the data assets we have, but metadata isn’t like a single uniform substance. There’s a lot of different kinds of metadata. That tends to be reflected in these different kinds of tools which manage different kinds of metadata.

The business glossary does things like manage terminology for both information and data concepts, manage definitions, and manage classifications. Terminology is important. What do we call something? We don’t want different report labels on all our reports for the same piece of information. It would be nice to have that standardized. Well, you can do that with a business glossary where you’ve got a single place to go and look up the actual terms, maybe with the synonyms and homonyms in there that get disambiguated. Then we have definitions. Definitions are no longer like the couple of lines we saw in printed dictionaries. Definitions when it comes to data management and data governance are much more than that.

They are facts of business significance. Think about a metric like stock on hand. How is that calculated mathematically? What’s the methodology for that calculation? Do we do it in the morning? Do we do it in the evening in our shop? How is it to be done? Those facts of business significance that help us to understand data are in part in the business glossary. Then there are classifications. We can classify data all kinds of ways. The need for that is growing. We’ve seen that with data privacy, and even within data privacy regulations, you see sub-classifications.

For instance, the California Privacy Rights Act has 12 classifications of personal information which you have to disclose to data subjects when you tell them what you are doing when you collect their personal data. So there are a lot of things going on in the business glossary. The data dictionary is very familiar and goes back a long way. Typically, it’s the schema, table, and column information, that is, the structural metadata from our relational databases, but that’s grown up too over the years. Now we have data profiling information in there where you can do simple things, like say what’s the minimum value, the maximum value?
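To make the profiling idea concrete, here is a minimal sketch of the kind of column statistics a data dictionary might store alongside structural metadata. The function name `profile_column` and the sample values are illustrative assumptions, not taken from any particular tool.

```python
def profile_column(values):
    """Return simple profiling statistics for one column's values.

    Hypothetical helper for illustration; real profilers compute far more.
    """
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        # min/max are undefined for an all-null column, so report None
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


# Made-up sample data standing in for a numeric database column
order_amounts = [120.0, 75.5, None, 310.25, 75.5]
profile = profile_column(order_amounts)
print(profile)
```

Storing a result like this next to the column's name and data type is one way the "minimum value, maximum value" information could sit in a data dictionary.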

You can get all kinds of information out of profiling. Then there are data universes, which we’ll come to again in a moment, and other relational objects like views. We see these stored in a data dictionary. Traditionally, I think it’s fair to say the business glossary has been much more on the business side of the house and the data dictionary has been used by IT professionals. Certainly, I used it when I was more active in development activities. Then we have the data catalog, and the data catalog is an agglomeration (can we just go back one slide, please?) of major data assets: files, data sets, which are logical groupings of data, reports, and other data assets. It attaches definitions to these data assets at this more aggregated level. Can we go to the next slide, please?

That’s a brief look at them, but what do you need? Well, it depends on who you are. Traditionally, usage by role looked like this. The business glossary has certainly been used by business users and increasingly by self-service users. I think self-service business intelligence has really been a pioneering area in self-service, but we’re starting to see self-service expand into other areas with the idea of data democratization: that anyone can use the data, if they can get hold of it, to improve the business for the benefit of the companies or enterprises we work for. Data architects also use business glossaries and so do data governance professionals.

With the data catalog, you’re now having to look at more technical-level data. Certainly, a self-service user, particularly in BI, might need to use that to look up columns that they intend to use. A data architect clearly is spending a lot of time dealing with columns, or attributes if you’re at the logical level. I certainly did when I was doing a lot of data modeling. Data engineers, the guys now who are moving data around and building data pipelines and integrating things, they need it too.

Again, the data governance professionals have an interest in it, maybe again for something like privacy. That’s the Data Catalog. The Data Dictionary is much more on the technical level again. The Data Architects need it, the Data Engineers need it, and now our friends the DBAs obviously need it because they’re primarily working at this level. Again, you’ll find the data governance professionals need it also. They need everything. Traditionally we’ve seen different roles use these things in different ways, but I think what we’re seeing today is more of a coalescence of the usage. It’s probably still role-based, but it’s getting more subtle.

I think it’ll be interesting to see how this develops over the next few years as we get more and more technology available and data becomes even more prominent in just regular business usage. I think one of the things that we’ve seen in the past is when somebody goes into a new role, a new job, they ask, “Well, what do I do?” Increasingly they’re asking, “What data do I work with? I need to understand it.” These three capabilities are providing the answers that are needed to understand the data that each of us as business people now will have to work with. Next slide, please.

Data Catalogs, which I think are emerging as the primary vehicle today, need content. This is something that you have to think very carefully about. You can put out a data catalog just as a capability and say, “Hey folks, here’s our bright new shiny data catalog. Please start putting information into it.” Well, that’s not necessarily going to work too well because you are asking people to take time and effort to put the metadata into the data catalog, but there’s not enough content yet to get any value out of it. That’s not going to be received very well.

I think this is something we have to think very carefully about in the data profession. We have tools and they have great capabilities, but if it’s just a capability and it’s devoid of a lot of what makes it useful, the content, then we are likely to have a problem. As we can see from this graph, each of us in our enterprises needs to understand the minimum level of content, the minimum viable content needed for business adoption, past which the business says, “Aha, this may not have everything, but it’s got a lot of stuff that’s very valuable to me and I can use it. Now I’m going to get value out of the data catalog.” That, in turn, gives value to our data management and data governance programs. The way to do that is to have an initial provisioning stage, which is done with automation because it can only really be done with automation.

We’ll see a bit more of that later too. This is something that I really want to emphasize. Please think not just in terms of capabilities, but level of content, minimum viable content too. Next slide, please. A little bit more about data universes. We’ve talked about data definitions, and people say, “Yes, I know what an employee is. I’ve got the definition.” But is that all that you would want to know? Here we have three databases: a Global Combined Employee Database, the Canadian Employee Database, and the US Employee Database. These have some data movement and integration between them. Okay. The definition of employee may well be the same for all of these databases.

They’re all employees from a data definition perspective, but the data universe, the population of things they contain, is different. The Canadian Employee Database only contains Canadian employees, contractors, interns, whatever, directors, members of the board. The US Employee Database is the same thing, but just for the US, and then the Global Combined Database has everything in it. Populations, what the data contains, are never going to be given to you from a definition alone. A definition explains the concept. It doesn’t explain the extension, the coverage, the universe, what the populations are. How are we going to get that information? That has to be in data dictionaries and/or data catalogs also.
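The distinction between a definition and a data universe can be sketched in a few lines. This is only an illustration under assumed names: the `DatasetEntry` class, its fields, and the definition text are hypothetical, and only the three database names come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """A hypothetical catalog entry that records both meaning and coverage."""
    name: str
    definition: str  # what the concept means (glossary-style)
    universe: str    # which population the data set actually contains


# Illustrative definition text; the wording is an assumption
EMPLOYEE_DEFINITION = "A person engaged by the enterprise, including contractors and interns"

entries = [
    DatasetEntry("Global Combined Employee Database", EMPLOYEE_DEFINITION, "all countries"),
    DatasetEntry("Canadian Employee Database", EMPLOYEE_DEFINITION, "Canada only"),
    DatasetEntry("US Employee Database", EMPLOYEE_DEFINITION, "US only"),
]

# One shared definition, but three different populations
distinct_definitions = {entry.definition for entry in entries}
distinct_universes = {entry.universe for entry in entries}
print(len(distinct_definitions), "definition;", len(distinct_universes), "universes")
```

The point of keeping `universe` as a separate field is that it cannot be derived from `definition`: it has to be captured as its own piece of metadata.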

As you can kind of see here, we can infer it from lineage. We’re also already starting to see that lineage is playing a role in helping us collect metadata to improve our understanding of our data assets. That’s important because data universes are very often overlooked. Again, I’d really encourage everybody to think about them in their day-to-day work. Next slide, please.

Okay. What we are seeing, as I mentioned earlier, as a trend today, is a move towards consolidation. The data catalog appears currently to be the point at which this consolidation of metadata is happening. I think you can see this is a bit of a busy slide. It turns out the whole product in the data catalog is more than the individual parts summed together. Here’s how it might work. We have a business glossary and it has business terms in it. The business terms will have definitions. Synonyms will be identified, et cetera, and so on. We have a database which we can understand in terms of the structural metadata that we’re going to harvest from it, which would traditionally be held in a data dictionary. Then there’s something we haven’t really discussed yet, but reports are out there. Reports would be seen today in a data catalog. There are ways of capturing report metadata. Well, what happens in reports?

Well, in reports we see labels, report headings, and the actual fields that are populated into a report. There’s a linkage between those fields and the databases from which they come. We may have a column that’s just called CL in our database, but Customer Last Name in the reports. Now what this is solving here, which is very important to note, and may often be overlooked, is this traditional, tremendous chasm in metadata management was to say, “Okay, I have my business terms and I’ve got a sea of those and I have my database columns, and I’ve got another ocean of those.

How do I relate the two? Am I going to have people do this manually?” That’s not going to be easy to do at all, unless you want to spend vast amounts of money. By the time you’ve got through it, everything will have changed. You’ll have to do it all over again. That’s impractical, but you can see that there is data catalog functionality that’s going to unite all of these different items of metadata and push them into this consolidated view, which we can now see, where we’ve got business terms, reports, and database structural information all together in one place. That provides tremendous value in terms of metadata management.

We have to manage these highly complex, gigantic environments of information that we have today. You would not, for instance, run an oil refinery without controls that measured fluid levels, temperatures, and pressures, and you’d have them in a control room where all would be monitored. We need to do the same and more for our metadata environments. Not just operational monitoring, but to get this additional value unlocked from the data that we’re dealing with. Next slide, please. Here are our problems; I’ve kind of hinted at them already. How do you collect the metadata? Difficult. How does all the metadata get related? How do you establish relationships among it?

That’s a key question. How do you keep it updated? Also, we’ve hinted at that one. Let’s have a look at one solution. You’re going to go to automation because you cannot do this manually. The scale and complexity of data ecosystems is just too large for human effort. I have seen it done. It’s an enormously expensive, time-consuming effort that gets lots of people angry and doesn’t necessarily yield great results. Just to get everything documented, before you establish the relationships among it and then how do you find the relationships?

This is a really difficult nut to crack but let’s take a look at the next slide and see how you might be able to do it well. At a very high level, data lineage can do this. It can help you. Why? You think about data lineage in its full extent, you’re starting with something in a database, it flows through processes. We should come back to talk about processes in a minute. Yes, those are ETL processes, they’re technical processes, or they could be SQL scripts or something else, I get it. They’re nevertheless processes, and then they maybe go to another database, another hop skip jump through other processes.

Eventually, we hope, they end up in reports. It’s very interesting: we find columns that just dead-end and nobody knows what they’re used for. Ending up in reports, as we’ve seen earlier, is where we can make that link between a database column and a report label, so we now understand, aha, this report label is the business term that goes along with this database column. Then I can go backwards through that chain of data lineage and say, “Oh, all these things are therefore the same column, because they are all feeding through without transformations, without changes, all the way back to the original source.” The book of record.
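The backward walk described here can be sketched as a small graph traversal. This is a simplified illustration, not how any particular product does it: the edge list, column names, and the `transformed` flag are all assumptions made up for the example (only the `CL` column and the "Customer Last Name" label echo the earlier slide).

```python
from collections import deque

# Hypothetical lineage edges: (source column, target column, transformed?)
LINEAGE_EDGES = [
    ("crm.customer.CL", "staging.customer.CL", False),
    ("staging.customer.CL", "dw.dim_customer.cust_last_nm", False),
    ("dw.dim_customer.cust_last_nm", "report.customer_list.field_3", False),
    # A derived column: the data is changed here, so the term must not propagate
    ("crm.customer.CL", "dw.dim_customer.full_name", True),
    ("crm.customer.CF", "dw.dim_customer.full_name", True),
]

# Report field -> label shown to business users
REPORT_LABELS = {"report.customer_list.field_3": "Customer Last Name"}


def untransformed_ancestors(report_field, edges):
    """Walk lineage backwards from a report field through unchanged hops only."""
    upstream = {}
    for source, target, transformed in edges:
        if not transformed:
            upstream.setdefault(target, []).append(source)
    seen, queue = set(), deque([report_field])
    while queue:
        node = queue.popleft()
        for source in upstream.get(node, []):
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return seen


def propose_business_terms(labels, edges):
    """Suggest the report label as the business term for every unchanged ancestor."""
    return {
        column: label
        for field, label in labels.items()
        for column in untransformed_ancestors(field, edges)
    }


proposals = propose_business_terms(REPORT_LABELS, LINEAGE_EDGES)
for column, term in sorted(proposals.items()):
    print(f"{column} -> {term}")
```

In practice the output would be treated as a set of suggestions for a data steward to confirm rather than as settled fact, since a shared label does not guarantee a shared meaning.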

This also provides other things that are important, such as the processes. The business processes today are really represented by data flows, which get implemented, again, by things like SQL scripts and ETL processes. Those may seem trivial to business people, or maybe not trivial but rather overly technical, but they represent business processes. If you want to do business process reengineering, and you want to know what processes you have, a good deal of that information lies in the process steps in the data lineage.

You can see again, for instance, think about GDPR: you need to have a process register for what you’re doing with personal information. This can help you. We’re starting to see that we have the automation to gather this information, and we have the capacity to populate the Business Glossary, the Data Catalog, and the Data Dictionary. Because the lineage has these relationships, which again, yes, I know they’re flow relationships, but they’re also logical relationships, because like data elements are populating like data elements, or are being transformed in some way. We’re making the relationships too.

That gives us this consolidated, integrated whole view of metadata, which is what we want to see. The Data Catalog is increasingly the place where you go to get that. We get the terminology, we get the semantic set of Business Glossary functionality, we get the structural metadata from Data Dictionary-type technologies, but it’s becoming manifested in the Data Catalog, the one-stop shop for everybody now. Everybody is going to include the citizen data scientists, the citizen developers, and the citizen analysts who work with the data. That’s why this new paradigm is so important: it’s all coming together. Data Lineage can harvest metadata and build the relationships among it. Next slide, please.

Data traceability. This is something else I just want to bring out, which is another major reason that we need data catalogs because I think as we’re all probably aware, traceability for impact assessment is very much needed. It’s to say if I change something upstream, what is it going to change downstream? That’s important and going the other way, traditional lineage, something broke, what’s feeding into it that could have broken, for instance, my ETL processes? I want to point out that data traceability is becoming more of a general data governance requirement.
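Both directions of traceability mentioned here (impact assessment going downstream, and root cause going upstream) come down to the same reachability question on a lineage graph. The following is a minimal sketch under assumed data: the node names and the `FLOWS` list are invented for illustration.

```python
from collections import defaultdict

# Hypothetical data flows: (source, target)
FLOWS = [
    ("orders_db.orders", "etl.load_orders"),
    ("etl.load_orders", "dw.fact_orders"),
    ("dw.fact_orders", "report.sales_summary"),
    ("dw.fact_orders", "report.risk_dashboard"),
    ("crm.accounts", "etl.load_accounts"),
    ("etl.load_accounts", "dw.dim_account"),
    ("dw.dim_account", "report.sales_summary"),
]


def reachable(start, flows, downstream=True):
    """Return every node reachable from start, following flows forwards or backwards."""
    adjacency = defaultdict(list)
    for source, target in flows:
        if downstream:
            adjacency[source].append(target)
        else:
            adjacency[target].append(source)
    found, stack = set(), [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node]:
            if neighbor not in found:
                found.add(neighbor)
                stack.append(neighbor)
    return found


# Impact assessment: if I change this table, what could change downstream?
impact = reachable("orders_db.orders", FLOWS, downstream=True)

# Root cause: this report broke, what feeds into it?
root_causes = reachable("report.sales_summary", FLOWS, downstream=False)

print(sorted(impact))
print(sorted(root_causes))
```

The same upstream walk is the kind of evidence a traceability requirement asks for: a documented path from a report back to the operational systems that feed it.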

It’s beyond the realm of the technical folks, who are very important, don’t get me wrong. There are things out there such as BCBS 239, from the Basel Committee on Banking Supervision, which says, “Look, guys, you are doing things like reporting on risk or reporting on capital adequacy ratios. You have to prove that the data that you’re reporting actually came from operational systems, without changes, without people doing manual changes to it, and so on.” We’ve got to see the flows from the operational systems to the risk reports, and that’s got to be proven. You see that traceability itself is going beyond the realm of the more technical environment into very much business needs in many enterprises. That’s something else that you will see increasingly overlaid in the story of the Business Glossary, Data Dictionary, and Data Catalog. Next slide, please.

In conclusion, the Business Glossary, Data Dictionary, and Data Catalog have different foci or focuses in terms of the metadata they manage. That’s been very traditional, but there are relationships. The Business Glossary gives us meaning but the automation is going to be needed to harvest particularly technical metadata of the kind we see in Data Dictionaries and Data Catalogs. Data Lineage is a great way to do that and it also helps us to create trust in the data because of this full traceability. I know people will talk about data quality as being important for trust, but traceability is too. The Data Catalog then becomes the place where all this information is integrated and it’s the one-stop-shop to understand and collaborate about data. That’s a very brief overview of this very complex topic. Hope that helped. With that, I’ll send it over to Amichai.

Amichai Fenner: Great. Thank you, Malcolm, for that really in-depth comparison. Thanks, again, everyone, for joining. Malcolm, as you mentioned, data teams face major challenges. These challenges include a lack of visibility and control of data, and of course, lots of knowledge that’s just scattered throughout the entire data ecosystem. The main causes of these challenges include the ever-growing amount and diversity of data and tests that the team is faced with, along with the growing demand from the business as it becomes more and more data-driven, as you mentioned earlier.

It’s to not only make decisions based on accurate data, but also incorporate data within the company’s offerings, such as in product recommendations, and so on. To be able to meet these challenges, the company expects users to be more self-sufficient. In most cases, though, without proper tools and processes in hand, data citizens are not truly independent in using the data. There’s no ultimate single source of the truth about the data. There’s tremendous loss of tribal knowledge, which is all that knowledge that the different subject matter experts share in undocumented ways, or at least not widely accessible ways. We’ll take a look soon at how we address these in Octopai.

These are all points that you should keep in mind when you’re evaluating any data literacy platform. Just make sure that it alleviates these challenges. I’m sure that this is familiar to many of you, if not all. This is what we refer to as the data hunt. Business reaches out to the data team asking about data, and then there’s a whole undocumented loop of communication and collaboration that prevents the data users from quickly and accurately using the data independently. This process wastes a lot of time for data team members. What’s worse is that this process basically repeats itself every so often, for the same data, and we all go through this exercise again. This is what our customers have shared as main drivers for implementing an optimized data catalog. It’s easy to see that successfully adopting a catalog is a win-win for all data citizens, technical and business users. Everyone can easily see what’s in it for them. Everyone gets time back to do the job they were hired to do instead of hunting for data or explaining data redundantly, depending on what side of the data you’re on, of course: consuming, using, creating, or maintaining.

An effective way, some may say the only way, as you touched on this, Malcolm, to truly achieve data literacy is by leveraging automation. Without automation, most attempts fail.

By the time you’re done manually centralizing an inventory, it has become pretty much stale. Because that process is just so time-consuming, keeping the inventory up to date without automation is almost impossible.

Octopai automates data discovery, which describes where data is used, and data lineage, which describes how data flows through the different systems: what are the sources of the data, and what happened to it? Lineage most commonly serves use cases such as root cause and impact analysis. We’ll take a brief look at that if we’ve got some time, but today we will dive deeply into the data catalog. Let me go ahead and share a demo.

This here is what Octopai’s data catalog looks like. Basically, it’s the one-stop shop for the data. Let’s start out by just briefly running through the different layers of data assets that Octopai automatically harvests. We harvest assets automatically from the different reporting tools, different databases, different ETL systems. These are just samples of the different technologies that we support automation for.

The reason I’m showing this to you is that it relates back to what you were describing before, Malcolm. We have different types of users that are all going to end up collaborating in one catalog instead of in siloed systems, instead of having a dictionary for the technical users and a business glossary for the business users, and maintaining all these many different tools, which has really not proven effective in the past few years. We’re going to want everyone to work in the same place, but we want to help each type of user focus on the type of assets that’s relevant to them. Different users can set this up to see the types of assets that are relevant to them. Let me give you an idea of what that is.

If I’m strictly a business user, I’m interested mostly in reporting tools, specifically, presentation layers of reporting tools. That’s those final, different columns and KPIs that show up on reports. I’m also interested in the actual reports. If I’m a self-service user, I’m probably also interested in the semantic level where all the logic is, as well as the physical end, which is how it relates back to the databases. You can see that every user has their own type of assets and layers of assets that they would be interested in using.

Octopai enables focusing exactly on the type of asset that you’re interested in, which is super important because in an average catalog, to give you guys an idea, there are going to be around a million assets. That means you’ve got to have good capabilities to focus on the assets you’re interested in, and to be able to search through them and filter through them, and we’re going to go through that in a moment. Now we understand that this entire inventory has been created automatically from our entire ecosystem, from the ETLs, from the reports, from the databases.

Let’s run through this with a use case. Say we’ve got a business user who’s interested in a sales report, and he wants to know which report would match his needs. In Octopai, what you would do is use the filter over here, which is the same as in any marketplace, to filter out and say, “Hey, I’m only looking for reports at the moment that have been tagged as sales,” and click on Apply. We can also add the term summary for instance, and say, “Here we go.”
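The marketplace-style filtering in this use case can be pictured with a small sketch. This is not Octopai's API: the `Asset` class, the `filter_assets` function, and the sample inventory are hypothetical, with only the report name and tags borrowed from the demo.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """A hypothetical catalog asset; field names are assumptions."""
    name: str
    asset_type: str
    tags: set = field(default_factory=set)


def filter_assets(assets, asset_type=None, tag=None, name_contains=None):
    """Keep assets matching every filter that was supplied (case-insensitive)."""
    results = []
    for asset in assets:
        if asset_type and asset.asset_type.lower() != asset_type.lower():
            continue
        if tag and tag.lower() not in {t.lower() for t in asset.tags}:
            continue
        if name_contains and name_contains.lower() not in asset.name.lower():
            continue
        results.append(asset)
    return results


# Made-up inventory for illustration
inventory = [
    Asset("Order by Sales Rep Summary", "report", {"Sales", "EMEA", "GDPR", "PII", "Orders"}),
    Asset("Order Detail", "report", {"Sales", "Orders"}),
    Asset("Headcount Summary", "report", {"HR"}),
    Asset("dw.fact_orders", "table", {"Sales"}),
]

matches = filter_assets(inventory, asset_type="report", tag="sales", name_contains="summary")
print([asset.name for asset in matches])
```

With around a million assets in a typical catalog, a real implementation would rely on a search index rather than a linear scan; the sketch only shows how combined filters narrow the list.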

Here’s the order-by-sales rep summary. Great. That’s what I was looking for. I’m looking for a sales report that is summarized. Let me go ahead and click on this and see the different definitions for this report. By clicking on it, what the user sees right away is, all these different tags that this report has been associated with and this Power BI report has been associated with sales, Salesforce, EMEA, GDPR, a specific project, it’s been associated with PII, it’s been associated with orders.

We can see the rating over here, which is 4.5, given by two different users. By clicking on it, you can actually see who’s been using it and rating it. These are typically users with high engagement, with whom he may want to collaborate about this data, and we’ll show you how you collaborate within Octopai in a few moments. You can also, of course, rate through this functionality as well.

Next, you can see the status is approved, so this asset has been approved for use. By the way, that’s why I’ve got this badge over here to make it easy for him to select these assets from the list. We can see it’s been flagged as sensitive. This over here gets the business user an idea of whether this is the type of report that he should be further looking into.

Next, we can scroll down and see the different descriptions that were provided for this report. There’s this long description over here that says, “Yearly EMEA sales report contains detailed sales information by sale–” long description. It’s got this technical description over here, which is a lot shorter, “Yearly sales for last year at account level.”

We can see here an origin description. For tools that support origin descriptions within the actual tool, Octopai automatically harvests them and shows them right over here. We can see the calculation description, so since this is a report, we’ve entered here the filter condition, “All data filtered for EMEA only.” Again, if it’s a logical data asset, and it’s already got a calculation in it, for semantic layers, for reports, for instance, the origin calculation will already show up over here.

We can see the asset as a report. We can see a datatype when it’s relevant. We can see the path to it, the source system it’s been documented for. We can see two really important roles about this data asset. We can see the data owner responsible for the business aspects of this asset and the business definitions. We can see the data steward responsible for the technical aspects of this asset and the correctness of it. We can see who updated and when, and so on.

Next, down here, we can see all these assets that have been linked to this report. Since this is a report, Octopai automatically links assets that come from this report, the different KPIs, the different columns, and so on. We’ll take a look at this in a moment. We can also add additional links here, right within Octopai. Say you want to relate this report to some type of project, which is also an asset in Octopai. You can click on the Add, add the specific project here, to the linked assets for the report.

Let’s say as a business user, this report really seems to answer my needs. I still have some questions about the data. I’m sure that all of you are familiar with that. In Octopai, we’ve got this built-in collaboration that allows your users to collaborate within the platform. Let’s take a look at this example. We’ve got Holly Miller over here, who’s reached out to Sophia, right here. She’s the data owner. “Does this report represent fiscal or calendar year?” She’s got additional questions about this report. We can see Sophia has mentioned Holly over here, replying that the report uses fiscal year.

By mentioning each other, they each get a notification with a link to continue collaborating about the data right here within the catalog. What’s great about that is that not only is everyone collaborating in one place and everyone knows what exactly they’re talking about, everyone’s on the same page, this gets saved. This is tribal knowledge that otherwise gets lost. This is 10 other users that would end up reaching out to Sophia if they didn’t have this available to them, asking the same question, with Sophia replying to each of them separately, and maybe even having to check for it separately.

What happens when Sophia gets promoted to a different role and she’s the subject matter expert? Now who knows this information and needs to look it up? By documenting this here in context, giving the option to collaborate about it in context over here, it has huge benefits by preserving all that terminology and all of those discussions and really creating that tribal knowledge.

Let’s reach back down to these linked assets. Assuming that Holly wants to continue investigating this report and feels this report is a good fit for her needs, now she wants to see what it includes. She reaches down here to the linked assets and sees, for example, the asset total due sum. By clicking on it, she actually goes to now look at the details for total due sum, an additional column within that Power BI report. Of course, it’s linked to the actual report. It’s also linked to the semantic layer of where the logic lies for this column. All these same attributes that we just spoke about exist here as well, including the tags, rating, and so on.

Now, she’s got additional questions. She reaches out to Jeff Smith this time, asking, “It looks like the sum is rounded. Can you let me know if the amount is rounded up or down? The numbers aren’t matching up with other reports.” What this means is, Jeff now got notified to answer right here within the catalog. He probably needs to check this out. The way to check this out, well, that’s traceability.

In Octopai, we have a lineage that’s basically built in and integrated to the catalog. What he can do in this case is click on the three dots over here simply, click on the End to End Column Lineage and go directly to the Column Level Lineage for the total due sum in this report and be able to trace the data flow all the way back through the different database objects, the different ETLs over here through another database object, through additional ETLs, all the way back to its original source over here.

This is completely, I would say, technology-agnostic. You’re connecting databases of different types, maybe Oracle and SQL Server and Snowflake all in the same lineage, different types of ETL tools. You may be using both SSIS and ADF, for instance, that’s all the same to Octopai. We bring everything to that unified view, the visualization to see everything at one level. We can see total due is ultimately coming from the total due in sales order header in Adventure Works in this demo environment.

Let’s say that’s not enough. Jeff, we said, the technical user, he’s the steward. He wants to see the logic within this specific data flow over here of this ETL, this SSIS data flow. He can then click over here and say, “Take me to the Inner System Lineage to visualize the entire Column Level Lineage for this process and see the logic for that.” I’m going to go ahead and click on the Inner System Lineage. What he sees over here is actually the column-level mapping of the entire package that shows how the data is getting from any column, all the way to its target through all the different components and transformations within this data flow to the final destination in this SSIS package.

Once he’s here, let’s say he wants to see, “What else is using this table, DWH Fact Sales?” I click on it. You can see here it’s DWH Fact Sales in Schema-DBO in a Database E2E_Dwh_Sales. This is the component name that was given here within the ETL. Let me go ahead and close that. By clicking on the three dots over here, he can say, “Let me see the lineage for this entire table as a whole, not at column level this time.” Click on the lineage object and see how this table is being populated by these two different ETLs, again, from completely different systems, and being used in these analytic models, in OLAP and tabular.

It’s also being used in these different views, in this procedure. This view is actually being used for these reports over here. By the way, this all ties back together, as you probably guessed by now. If I click on this view, I can say, “Hey, let me see what the definitions are for this view.” That’s easy. Click on it, click on the ADC View Automated Data Catalog. Now, we’re looking at all the different definitions for this view in SQL Server. It’s easy to see how all the different types of users collaborate in this one space that answers all these different types of needs, whether they be more technical or more business oriented.
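The switch from column-level to table-level lineage described here can be pictured as collapsing column-to-column edges into table-to-table edges. The following is only a minimal sketch under that assumption, not how Octopai implements it. The edge list, the `table.column` naming convention, and the names `vw_Sales` and `SalesReport` are hypothetical and merely echo the demo.

```python
from collections import defaultdict

# Hypothetical column-level lineage edges: (source, target), each "table.column".
# The names loosely echo the demo and are illustrative only.
COLUMN_EDGES = [
    ("SalesOrderHeader.TotalDue", "DWH_Fact_Sales.TotalDue"),
    ("SalesOrderHeader.OrderDate", "DWH_Fact_Sales.OrderDate"),
    ("DWH_Fact_Sales.TotalDue", "vw_Sales.TotalDueSum"),
    ("DWH_Fact_Sales.TotalDue", "SalesReport.TotalDueSum"),
]


def table_of(column_ref: str) -> str:
    """Return the table part of a 'table.column' reference."""
    return column_ref.rsplit(".", 1)[0]


def table_level_lineage(column_edges):
    """Collapse column-level edges into table-level neighbours."""
    feeds = defaultdict(set)      # table -> tables it is populated from
    consumers = defaultdict(set)  # table -> tables, views, or reports reading it
    for source, target in column_edges:
        src_table, tgt_table = table_of(source), table_of(target)
        if src_table != tgt_table:  # ignore mappings within the same table
            feeds[tgt_table].add(src_table)
            consumers[src_table].add(tgt_table)
    return feeds, consumers


feeds, consumers = table_level_lineage(COLUMN_EDGES)
print("populated by:", sorted(feeds["DWH_Fact_Sales"]))
print("used by:", sorted(consumers["DWH_Fact_Sales"]))
```

Keeping the column-level edges as the stored form and deriving the table view on demand means both views always agree with each other.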

Let me go ahead and share my slide again. [silence] Here we go. Basically, as you can see, the data catalog creates independence in using data while preserving that tribal knowledge through collaboration, as well as the traceability through the lineage. An effective catalog will enable data citizens to independently answer questions such as, “Where should I look for my data? Does this data matter? What does this data represent? Is this data relevant and important? How can I use this data?” I can go on and on. That is where the true value is. Adopting a collaborative data catalog is the ultimate enabler of any data-driven organization. [silence] I think we’re ready for Q&A?

Shannon: Okay. Malcolm, thank you so much for this great presentation. It’s been fabulous. We got a lot of questions coming in. Just to answer the most commonly asked question, just a reminder, I will send a follow-up email by end of day Thursday for this webinar, with links to the slides and the recording along with anything else requested. Diving in here, what is the difference between lineage and traceability?

Malcolm: I think I brought that up, so I’ll give you my definition. Lineage is standing at the far end of a data flow, like in a report, and saying, “Where did this data come from?” and trying to look upstream. Impact is standing upstream somewhere and saying, “I’m going to change something. I wonder what it can affect downstream from where I am.” That’s my contribution to the topic.

Amichai: Certainly. Malcolm, I agree. That’s a really good definition. I think that also you can look at lineage as part of the traceability for the data. As Malcolm mentioned, lineage will provide you with that data flow and understanding the origins exactly of the data. There are additional aspects to traceability, but that’s going to be the backbone of it.
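Malcolm’s distinction maps onto the two directions in which a data flow graph can be walked. As a minimal sketch, assuming the flow is stored as a plain list of directed edges (the node names below are hypothetical and only echo the demo), lineage is a walk upstream from a report field and impact is a walk downstream from a source column:

```python
from collections import defaultdict, deque

# Hypothetical data flow edges (source -> target), loosely based on the demo.
EDGES = [
    ("SalesOrderHeader.TotalDue", "SSIS: Load Fact Sales"),
    ("SSIS: Load Fact Sales", "DWH_Fact_Sales.TotalDue"),
    ("DWH_Fact_Sales.TotalDue", "Sales Report.Total Due Sum"),
    ("DWH_Fact_Sales.TotalDue", "vw_Sales.TotalDue"),
]


def build_index(edges):
    """Index the edges in both directions for upstream and downstream walks."""
    upstream, downstream = defaultdict(list), defaultdict(list)
    for source, target in edges:
        downstream[source].append(target)
        upstream[target].append(source)
    return upstream, downstream


def reachable(start, neighbours):
    """Breadth-first walk returning every node reachable from start."""
    visited, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


UPSTREAM, DOWNSTREAM = build_index(EDGES)

# Lineage: stand at the report and look upstream ("where did this come from?").
lineage = reachable("Sales Report.Total Due Sum", UPSTREAM)

# Impact: stand at the source and look downstream ("what could a change affect?").
impact = reachable("SalesOrderHeader.TotalDue", DOWNSTREAM)

print(lineage)
print(impact)
```

The same edge list answers both questions; only the direction of the walk changes.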

Shannon: Awesome. Which of the three uses apply to data privacy professionals? Again, I think that was part of your section there, Malcolm?

Malcolm: The data catalog, the business glossary, and the data dictionary are all going to be important for data privacy professionals. You can think about the data catalog at the data set level. You would want to know what are the points at which data is given to service providers that might include personal information? What data sets are we giving to service providers? Because we would have to pass on a data subject access request to them. That’s an example of the data catalog.

The data dictionary is going to be, “Well, what are our actual data elements that contain personal information? Where are they?” The business glossary would be, I have a business term called, I don’t know, employment, previous employer’s name. That might be in two or three tables in a human resources database, but previous employer’s name is subcategorized as employment history. Employment history is one of those– if I remember correctly, I’m sure people will correct me if I’m not, I think it’s one of those categories that you have to disclose to people from whom you collect personal information under the CCPA, or the CPRA as it is going to be shortly. You can see that there are different uses for these three capabilities for data privacy professionals, but they’re all used in some way, albeit different ways.

Shannon: Amichai, anything you want to add there?

Amichai: I think that described it really well. Thanks.

Shannon: Awesome. A term thrown around like data domains linking to specific lines of businesses, where do those fit? I find there are some aspects from a business and technical perspective. Is it in data catalog, or business glossary?

Amichai: I’ll take that one. I think that’s a really good question. I believe that the answer is that everyone needs to look at the same system. The last place you want to be is in a place where you’re maintaining different systems for different types of users. Maintaining one is difficult enough. The catalog ultimately should be the place where the technical users reach out to the business users who can have their answers there as well, and then everyone can collaborate in that place to answer all those different use cases.

Malcolm: I always think that the word, ‘domain’ is the most overused word in data management and data governance. Could you repeat the question for me, please?

Shannon: Yes, sure. A term thrown around, ‘data domains’ linking to specific lines of businesses, where do those fit? I find there are some aspects from a business and technical perspective. Is it in data catalog or business glossary?

Malcolm: Depends what you mean by data domain. Some people think it’s like reference data, like list of valid values. Others will say it’s subject areas. If it’s subject areas, then probably in something like a business glossary. No, actually I take that back. Probably in the data catalog. Anyway, that’s my thoughts on it, Shannon.

Shannon: I love it. So many great questions coming in. I’m just trying to move rapidly through them here. Is the consolidated view data catalog a metamodel because it has detailed row based for each instance of first, middle, and last name?

Malcolm: I–

Amichai: Go for it.

Malcolm: By definition, a product like Octopai is dealing with metadata and housing it in a structured way. It has a metamodel. Anything that is going to house metadata has to have a metamodel because we would define a metamodel as the data model for metadata. I think the answer to the question is yes, you do need a metamodel.

Shannon: Amichai, anything you want to add to that?

Amichai: Yes, I think that when we move to think of catalogs, in a way, we stop looking at the technical aspects really of what’s going on behind the scenes. We start speaking of it in terms of the value and the different types of users and use cases that they can really use it for. Different catalogs use different frameworks. Everything we’ve shown today, which speaks about being able to provide the definitions, traceability, collaboration, those are all the things that you really should be looking for in it.

Shannon: We anticipated this. There’s quite a few questions on what Octopai connects to. Is there a good resource, Amichai, for that kind of thing?

Amichai: Certainly, you can find them on our website, Octopai.com. You’ll see all the different supported systems. Those are systems that we support for automation of harvesting the different assets. Octopai can also ingest assets externally. Both of those options are available. You can see them on the website. You’re welcome to reach out, of course, if there are any additional questions about it.

Shannon: Perfect. With automation, is manually added metadata preserved or overwritten? What about with product updates? How are customer attributes, for example, manually added metadata preserved?

Amichai: Perfect. Yes, anything added manually is, of course, preserved; everything automated gets overwritten in a way. Meaning that if you’ve got some type of original calculation, as in the example I gave before, and then we’ve got a description for this calculation, the description that’s been provided in Octopai manually, that gets preserved, of course. The actual calculation, if it changes in the actual metadata, that gets updated automatically.
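The rule described in this answer, harvested values refresh while manually entered values survive, can be sketched as a field-by-field merge. This is only an illustration under assumed field names. `AUTOMATED_FIELDS`, `refresh_asset`, and the sample records are hypothetical and are not Octopai’s data model or API.

```python
# Hypothetical set of fields that the automated harvest is allowed to overwrite.
# Every other field (description, owner, tags, ...) is treated as manual.
AUTOMATED_FIELDS = {"calculation", "data_type", "path"}


def refresh_asset(stored: dict, harvested: dict) -> dict:
    """Return the catalog entry after a harvest.

    Automated fields take the newly harvested value; manually entered
    fields are left exactly as they were stored.
    """
    refreshed = dict(stored)  # copy so the stored entry is not mutated
    for field, value in harvested.items():
        if field in AUTOMATED_FIELDS:
            refreshed[field] = value
    return refreshed


stored_entry = {
    "calculation": "SUM(TotalDue)",
    "description": "All data filtered for EMEA only.",  # entered manually
    "owner": "Sophia",                                   # entered manually
}
harvested_entry = {
    "calculation": "ROUND(SUM(TotalDue), 0)",  # the logic changed at the source
    "description": "",                          # the harvest has no description
}

print(refresh_asset(stored_entry, harvested_entry))
```

Using an explicit allow-list of automated fields errs on the side of preserving human input: a field the harvest was never meant to own cannot be overwritten by accident.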

Shannon: Do we need to model the reference data in Power BI before we upload the sheet in data catalog tool?

Amichai: Not at all. Octopai does that entire process as part of the automation.

Shannon: Awesome. I love it. Can Octopai do data lineage for Python ETL code? Data lineage is normally SQL-based. Nowadays all Cloud ETL happens using Python data frames. Do you have a solution for this?

Amichai: Octopai will automatically, of course, create the lineage for the SQL and for all the different ETL tools that we support, which are many. Python code is not supported with the automation, but the lineage for it can be injected to reflect it with the same visualization and enrich the already automated lineage that you all have from the rest of the ecosystem.
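One way to picture injected lineage sitting alongside automated lineage is to merge two edge lists while recording where each edge came from. The sketch below assumes that model. The function `merge_lineage`, the edge names, and the `forecast.py` script are all hypothetical, and the actual injection mechanism is not described in the webinar.

```python
def merge_lineage(automated_edges, injected_edges):
    """Combine harvested and manually injected edges into one lineage.

    Returns a dict mapping each (source, target) edge to its provenance.
    If an edge appears in both lists, the automated harvest is treated
    as the authority for it.
    """
    merged = {edge: "injected" for edge in injected_edges}
    for edge in automated_edges:
        merged[edge] = "automated"
    return merged


# Edges assumed to be harvested automatically from SQL and ETL metadata.
automated = [
    ("SalesOrderHeader.TotalDue", "DWH_Fact_Sales.TotalDue"),
    ("DWH_Fact_Sales.TotalDue", "Sales Report.Total Due Sum"),
]

# Edges declared by hand for a Python step the harvest cannot parse.
injected = [
    ("DWH_Fact_Sales.TotalDue", "forecast.py: total_due"),
    ("forecast.py: total_due", "Forecast Report.Projected Due"),
]

lineage = merge_lineage(automated, injected)
for (source, target), provenance in lineage.items():
    print(f"{source} -> {target} [{provenance}]")
```

Tagging provenance on every edge keeps it visible which parts of the picture are refreshed automatically and which must be maintained by hand.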

Shannon: Awesome. How does the tool manage multilingual definitions?

Amichai: We are just in the final stages of adding additional customizable attributes which will support exactly that use case.

Shannon: So many great tool questions coming in. In maintaining the data lineage up to date, what steps would be performed automatically, and what steps need to be performed manually?

Amichai: Great. Extracting the metadata, analyzing it, all the machine learning happens automatically on our end. There’s really no manual effort for all the different tools that we support automation for. That’s all automated. As I mentioned before, if you would like to enrich that lineage that’s already created, that’s possible to do manually through our UI, or through dedicated APIs, and so on.

Shannon: I’m going to try and squeeze in one more question here at least. I’ll get any questions we don’t have time for over to Octopai. Does it make sense to store reports in the catalog when the reports are just views of some database at the end? Losing a report is a small loss, no value, but losing data is a big loss.

Amichai: Oh, that’s a really good question. Yes, an asset that’s important to one role is not necessarily the asset that’s important to another role. Since the catalog serves so many different types of users and so many different types of data citizens, you’re going to want any type of data asset that needs to have some type of description and so on, and needs to be associated with any other terms. You want to have that documented in the catalog so that it really is the one source for all of the different types of use cases and all the different types of data assets, regardless of the importance of that actual data to a specific individual.

Shannon: Amichai, thank you so much. Malcolm, thank you so much. As always, another great, great presentation. I’m afraid that is all the time we have for today. Again, I will get these questions over to Octopai if there are any remaining questions we didn’t have time to get to. Again, I will send a follow-up email by end of day Thursday with links to the slides, the recording, and additional information that you all have been asking for. I appreciate it so much. Thanks to all of our attendees for being so engaged in everything we do. As always, another great webinar with you all. I hope you all have a great day. Thanks so much, everybody.

Malcolm: Thank you.

Amichai: Thanks a lot, everyone.

Video Transcript

Shannon Kempe: Hello and welcome. My name is Shannon Kempe and I’m the Chief Digital Manager at DATAVERSITY. We would like to thank you for joining this DATAVERSITY webinar, Decoding the Mystery: How to Know if You Need a Data Catalog, a Data Dictionary, or a Business Glossary. Sponsored today by Octopai. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. For questions, we’ll be collecting them by the Q and A in the bottom right-hand corner of your screen or if you’d like to tweet, we encourage you to share highlights of your questions by Twitter using #DATAVERSITY. If you’d like to chat with us with each other, we certainly encourage you to do so and to open and access the Q and A or the chat panel, you find those icons in the bottom middle of your screen for those features.

Just to note that the chat defaults to send the host and panelists but you may absolutely change that to chat with everyone. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and additional information requested throughout the webinar. Now let me introduce to you our speakers for today, Dr. Malcolm Chisholm and Amichai Fenner. Amichai is a product manager with Octopai with over seven years experience working as a full-stack BI expert. He has expertise in BI methodology and architecture, as well as technical skills in various BI tools, from ETLs to Reporting and Analytics.

He currently has the development of Octopai’s automated data catalog. Malcolm is the president of Data Millennium. He’s a thought leader, author speaker in data governance and data management. Malcolm has over 25 years of experience in data-related disciplines. It has worked in a variety of sectors, including finance, manufacturing, government, pharmaceuticals, and telecoms. With that, I will give the floor to Malcolm to start today’s webinar. Hello and welcome.

Malcolm Chisholm: Thank you very much. Shannon, and it’s great to be here with the community today to speak about this important topic. If we can go to the first slide, I think it’s a very hot topic today, but it’s one which exists in the historical context, and that is the shift to Data-Centricity. If you think about when the computerized age really took off, which was in the mid-1960s, for many years thereafter, the focus was on automating the enterprise’s processes. Data was thought of as a byproduct of automation. There was very much a rush to do things like automate the books and records of an enterprise. Because if you think about a bank in those days, people had to go into a bank and a human being would write in a ledger how much money you were taking out or putting into your account. That’s completely unthinkable today.

The banks couldn’t scale and these clerical operations were very expensive. That’s when computers came in and provided a tremendous amount of efficiency. It also allowed scale. It allowed speed to increase, but the focus was on automating processes. To some extent, that process has stuck with us today, but the reality has changed over the decades. Today we can find packages to automate practically anything. The core problem we’re faced with is getting value out of data. Data is increasingly at the heart of business models. You can think of companies like Google and Facebook and Twitter, whatever you might think about them, they have- they’re working with data. That’s what they get value out.

That is what they monetize. Now not all enterprises are directly monetizing data, but they are using data to do things like predictive analytics, they’re BI, et cetera, run their companies. That’s what’s happening today, albeit with somewhat of a process-centric mentality, still, in IT. We have massive data volumes and data from not only internal enterprises sources as well, but external enterprise sources as well. Data acquisition has become a whole subdiscipline of data management and data governance that’s affecting the way in which we need to process data and feeds into our artificial intelligence and ML technologies which are increasingly available and are increasingly within reach of in small to medium-size enterprises.

This makes everybody very data-hungry. Data is at the center of the modern technological world it’s been called the new oil, the new gold, the fuel that runs our digital or economy or information age. That has a profound impact because we need to know how to manage it, which is where data catalogs, business class, glossaries, and data dictionaries come in.

Next slide, please. What is metadata? Well, metadata is the information we need to understand and manage the data assets we have, but metadata isn’t like a single uniform substance. There’s a lot of different kinds of metadata. That tends to be reflected in these different kinds of tools which manage different kinds of metadata.

The business glossary does things like managed terminology for both information and data concept managed definitions, manages classifications. Terminology is important. What do we call something? We don’t want different report labels on all our reports for the same piece of information. It would be nice to have that standardized. Well, you can do that with a business glossary where you’ve got a single place to go and look up their actual terms, maybe the synonyms and homonyms in there that get disambiguated. Then we have definitions. Definitions are not anymore like the couple of lines we saw in printed dictionaries. Definitions when it comes to data management and data governance are much more than that.

They are facts of business significance. Think about a metric like a stock on hand. How is that calculated mathematically? What’s the methodology for that calculation? Do we do it in the morning? Do we do it in the evening in our shop? How is it to be done? Those facts of business significance that help us to understand data are in part in the business glossary and classifications. We can classify data all kinds of ways. The need for that is growing. We’ve seen that with data privacy and even within data privacy regulations, you see sub-classifications.

For instance, the California Privacy Rights Act has 12 classifications of personal information which you have to disclose to data subjects when you tell them what you are doing when you collect their personal data. A lot of things going on in business glossary, data dictionary, very familiar, goes back a long way. Typically, it’s the schema table and column information, the structural metadata that is, from our relational databases, but that’s grown up too over the years. Now we have data profiling information in there where you can do simple things, like say what’s the minimum value, the maximum value?

You can get all kinds of information out of profiling. Data universes, which we’ll come to again in a moment, and then other relational objects like View. We see these stored in a data dictionary. Traditionally, I think it’s fair to say the business glossary has been much more on the business side of the house and the data dictionary has been used by IT professionals. Certainly, I did when I was more active in development activities. Then we have the data catalog and the data catalog is an agglomeration- can we just go back one slide, please? On major data assets files, data sets, which are logical groupings of data reports, other data assets, and it attaches definitions to these data assets at this more aggregated level. Can we go to the next slide, please?

That’s a brief look at them, but it’s what do you need? Well, depends on who you are. Traditionally, usage by roles look like this. The business glossary has certainly been used by business users and increasingly by self-service users. I think self-service business intelligent has really been a pioneering area in self-service, but we’re starting to see self-service come- expand into other areas with the idea of data democratization that anyone can use the data if they can get hold of it to improve the business for the benefit of the companies or enterprises we work for. Data architects also use business glossaries and so do data governance professionals.

The data catalog, you’re now having to look at more technical level data. Certainly, a self-service user, particularly in BI, might need to use that to look up columns that they intend to use. Data architect clearly is spending a lot of time dealing with columns or attributes if you’re at the logical level. I certainly did when I was doing a lot of data modeling. Data engineers, the guys now who are moving data around and building data pipelines and integrating things, they need it too.

Again, the data governance professionals have an interest in it. Maybe again for something like privacy. There’s Data Catalog, the Data Dictionary much more on the technical level again. The Data Architects need it, the Data Engineers need it, now our friends, the DBAs obviously need that because they’re primarily working at this level. Again, you’ll find the data governances professionals need it also. They need everything. Traditionally we’ve seen different roles, used these things in different ways, but I think what we’re seeing today is more of a coalesce of the usage. It’s probably still role-based, but it’s getting more subtle.

I think it’ll be interesting to see how this develops over the next few years as we get more and more technology available. Data becomes even more prominent in just regular business usage. I think one of the things that we’ve seen in the past is when somebody goes into a new role, a new job, they ask, “Well, what do I do?”. Increasingly they’re asking, “What data do I work with, I need to understand it.” These three capabilities are providing the answers that are needed to understand the data that each of us as business people now will have to work with. Next slide, please.

Data Catalogs, which I think are emerging as the primary vehicle today, need content. This is something that you have to think very carefully about. You can put out a data catalog just as a capability and say, “Hey folks, here’s our bright new shiny data catalog”. Please start putting in information into it. Well, that’s not going to necessarily work too well because you are asking people to take time and effort to put the metadata into the data catalog, but there’s not enough content yet to get any value out of it. That’s not going to be received very well.

I think this is something we have to think very carefully about in the data profession. We have tools and they have great capabilities, but if it’s just a capability and it’s devoid of a lot of what makes it useful, the content, then we are likely to have a problem. As we can see from this graph, each of us in our enterprises needs to understand the minimum level of content, the minimum viable content needed for business adoption past which the business says, “Aha, this actually does, may not have everything, but it’s got a lot of stuff. That’s very valuable to me and I can use it. Now I’m going to get value out of the data catalog and that, in turn, gives value to our data, manage into data governance programs”. The way to do that is to have an initial provisioning stage, which is done with automation because it can only really be done with automation.

We’ll see a bit more of that later too. This is something that I really want to emphasize. Please think not just in terms of capabilities, but level of content, minimum viable content too. Next slide, please. A little bit more about data universes, as we’ve talked about data definitions and people say, “Yes, I know what an employee is”. I’ve got the definition, but is that all that you would want to know? Here we have three databases, a Global Combined Employee Database, the Canadian Employee Database, and the US Employee Database. These have a, some data movement and integration between them. Okay. The definition of employee may well be the same for all of these databases.

They’re all employed from a data definition perspective, but the data universe, the population of things they contain is different. The Canadian Employee Database only contains Canadian employees, contractors, interns, whatever, directors, members of the board, the US employee database, same thing, but just for the US, and then the Global Combined Database has everything in it. Populations, what the data contains is never going to be given to you from a definition alone. A definition explains the concept. It doesn’t explain the extension, the coverage, the universe, what the populations are, how are we going to get that information, that has to be in data dictionaries and, or data catalogs also.

As you can kind of see here, we can infer it from lineage. We’re also already starting to see that lineage is providing a role in helping us collect metadata to improve our understanding of our data assets. That’s important because data universities are very often overlooked. Again, certainly, I’d really encourage everybody to think about in their day-to-day work. Next slide, please.

Okay. What we are seeing, as I mentioned earlier, as a trend today, is a move towards consolidation. The data catalog appears currently to be the point at which this consolidation of metadata is happening. It turns out, I think you can see this as a bit of a busy slide. The whole product in the data catalog is more than the individual parts summed together. Here’s how it might work. We have a business glossary and it has business terms in it. The business terms will have definitions. Synonyms will be identified, et cetera, and so on. We have a database which we can understand through in terms of the structural metadata that we’re going to harvest from it, which would traditionally be held in a data dictionary, and then something we haven’t really discussed yet, but reports are out there. Reports would be seen today in a data catalog. There are ways of capturing report metadata, well, what happens in reports?

Well, in reports we see labels, report headings, and the actual fields that are populated into a report. There’s a linkage between those fields and the databases from which they come. We may have a column that’s just called CL in our database, but Customer Last Name in the reports. Now what this is solving here, which is very important to note, and may often be overlooked, is this traditional, tremendous chasm in metadata management was to say, “Okay, I have my business terms and I’ve got a sea of those and I have my database columns, and I’ve got another ocean of those.

How do I relate the two? Am I going to be able, am I going to have people do this manually?” That’s not going to be easy to do at all, unless you want to spend vast amounts of money. By the time you’ve got through it, everything will have changed. You’ll have to do it all over again. That’s impractical, but you can see that there is data catalog functionality that’s going to unite all of these different items of metadata, and it pushed them into this consolidated view, which we can now see where we’ve got business terms, reports and database structural information altogether in one place. That gives us, that provides tremendous value in terms of metadata management.

We have to manage these highly complex, gigantic environments of information that we have today. You would not, for instance, have an oil refinery that you would run without controls that measured fluid levels, temperature, pressures, and you’d have them in a control room and all would be monitored. We need to do the same and more for our metadata environments. Not just operational monitoring, but to get this additional value unlocked from the data that we’re dealing with. Next slide, please. Here’s our problems, kind of hinted at them already. How do you collect the metadata. Difficult. How does all the metadata get related? How do you establish relationships among it?

That’s a key question. How do you keep it updated? Also, we’ve hinted at that one. Let’s have a look at one solution. You’re going to go to automation because you cannot do this manually. The scale and complexity of data ecosystems is just too large for human effort. I have seen it done. It’s an enormously expensive, time-consuming effort that gets lots of people angry and doesn’t necessarily yield great results. Just to get everything documented, before you establish the relationships among it and then how do you find the relationships?

This is a really difficult nut to crack but let’s take a look at the next slide and see how you might be able to do it well. At a very high level, data lineage can do this. It can help you. Why? You think about data lineage in its full extent, you’re starting with something in a database, it flows through processes. We should come back to talk about processes in a minute. Yes, those are ETL processes, they’re technical processes, or they could be SQL scripts or something else, I get it. They’re nevertheless processes, and then they maybe go to another database, another hop skip jump through other processes.

Eventually, we hope, ending up in reports. It’s very interesting: we find columns that just dead-end, and nobody knows what they’re used for. Ending up in reports, as we’ve seen earlier, is where we can make that link between a database column and the report label it is attached to. We now understand, aha, this report label is the business term that goes along with this database column. Then I can go backwards through that chain of data lineage and say, “Oh, all these things are therefore the same column, because they are all feeding through without transformations, without changes, all the way back to the original source.” The book of record.
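To make the backward walk concrete, here is a minimal sketch in Python. It assumes lineage is available as a list of column-to-column hops, each flagged as transformed or not; the column names, the edge list, and the `propagate_terms` helper are all hypothetical and are not taken from any product discussed in the webinar.

```python
from collections import defaultdict

# Hypothetical lineage hops: (source_column, target_column, transformed)
EDGES = [
    ("crm.customer.CL", "staging.customer.CL", False),
    ("staging.customer.CL", "dwh.dim_customer.last_name", False),
    ("crm.customer.FN", "dwh.dim_customer.full_name", True),
    ("crm.customer.CL", "dwh.dim_customer.full_name", True),
]

# Hypothetical links found in reports: database column -> report label
REPORT_LABELS = {
    "dwh.dim_customer.last_name": "Customer Last Name",
    "dwh.dim_customer.full_name": "Customer Full Name",
}


def propagate_terms(edges, report_labels):
    """Carry each report label upstream through untransformed hops only."""
    upstream = defaultdict(list)
    for source, target, transformed in edges:
        if not transformed:  # a transformed hop is no longer "the same column"
            upstream[target].append(source)

    terms = defaultdict(set)
    for column, label in report_labels.items():
        stack, seen = [column], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            terms[current].add(label)
            stack.extend(upstream[current])
    return dict(terms)


terms = propagate_terms(EDGES, REPORT_LABELS)
for column in sorted(terms):
    print(column, sorted(terms[column]))
```

In this sketch the cryptic source column `crm.customer.CL` inherits the term "Customer Last Name", while `crm.customer.FN` gets no term because it only reaches a report through a transformation.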

This also provides other things that are important, such as the processes. The business processes today are really represented by data flows, which get implemented by things like SQL scripts and ETL processes. Those may seem trivial to business people, or rather overly technical, but they are not trivial: they represent business processes. If you want to do business process reengineering, and you want to know what processes you have, a good deal of that information lies in the process steps in the data lineage.

You can see again, for instance, think about GDPR, you need to have a process register for what you’re doing with personal information. This can help you. We’re starting to see that we have the automation to gather this information, we have the capacity to populate the Business Glossary, the Data Catalog, and the Data Dictionary. Because the lineage has these relationships, which again, yes, I know they’re flow relationships, but they’re also logical relationships, because like data elements are populating like data elements, or are being transformed in some way, we’re making the relationships too.

That gives us this consolidated, integrated whole view of metadata, which is what we want to see, and the Data Catalog is increasingly the place where you go to get that. We get the terminology, we get the semantic set of Business Glossary functionality, we get the structural metadata from Data Dictionary-type technologies, but it’s becoming manifested in the Data Catalog, the one-stop-shop for everybody now. Everybody is going to include the citizen data scientists, the citizen developers, the citizen analysts to work with the data. That’s why this new paradigm is so important, it’s all coming together. Data Lineage could harvest metadata and build the relationships among it. Next slide, please.

Data traceability. This is something else I just want to bring out, which is another major reason that we need data catalogs because I think as we’re all probably aware, traceability for impact assessment is very much needed. It’s to say if I change something upstream, what is it going to change downstream? That’s important and going the other way, traditional lineage, something broke, what’s feeding into it that could have broken, for instance, my ETL processes? I want to point out that data traceability is becoming more of a general data governance requirement.

It’s beyond the realm of the technical folks who are very important, don’t get me wrong. There’s things out there such as BCBS 239, from the Basel Committee on Banking Supervision, which says, “Look, guys, you are doing things like reporting on risk or reporting on capital adequacy ratios, you have to prove that the data that you’re reporting actually came from operational systems, without changes, without people doing manual changes to it, and so on.” We’ve got to see the flows from the operational systems to the risk reports, and that’s got to be proven. You see that traceability itself is going beyond the realm of the more technical environment into very much business needs that are in many enterprises. Something else that you will see increasingly overlaid in the story of the Business Glossary, Data Dictionary, and Data Catalog. Next slide, please.

In conclusion, the Business Glossary, Data Dictionary, and Data Catalog have different foci or focuses in terms of the metadata they manage. That’s been very traditional, but there are relationships. The Business Glossary gives us meaning but the automation is going to be needed to harvest particularly technical metadata of the kind we see in Data Dictionaries and Data Catalogs. Data Lineage is a great way to do that and it also helps us to create trust in the data because of this full traceability. I know people will talk about data quality as being important for trust, but traceability is too. The Data Catalog then becomes the place where all this information is integrated and it’s the one-stop-shop to understand and collaborate about data. That’s a very brief overview of this very complex topic. Hope that helped. With that, I’ll send it over to Amichai.

Amichai Fenner: Great. Thank you, Malcolm, for that really in-depth comparison. Thanks, again, everyone, for joining. Malcolm, as you mentioned, data teams face major challenges. These challenges include a lack of visibility and control of data, and of course, lots of knowledge that is just scattered throughout the entire data ecosystem. The main causes of these challenges include the ever-growing amount and diversity of data and tasks that the team is faced with, along with the growing demand from the business, as it becomes more and more data-driven, as you mentioned earlier.

It’s to not only make decisions based on accurate data, but also incorporate data within the company’s offerings, such as in product recommendations, and so on. To be able to meet these challenges, the company expects users to be more self-sufficient. In most cases, though, without proper tools and processes in hand, data citizens are not truly independent in using the data. There’s no ultimate single source of the truth about the data. There’s tremendous loss of tribal knowledge, which is all that knowledge that the different subject matter experts share in undocumented ways, or at least not widely accessible ways. We’ll take a look soon at how we address these in Octopai.

These are all points that you should keep in mind when you’re evaluating any data literacy platform. You just make sure that it alleviates these challenges. I’m sure that this is familiar to many of you, if not all. This is what we refer to as the data hunt. Business reaches out to the data team asking about data, and then there’s a whole undocumented loop of communication and collaboration that prevents the data users from quickly and accurately using the data independently. This process wastes a lot of time for data team members. What’s worse is that this process basically repeats itself every so often, and for the same data, and we all go through this exercise again. This is what our customers have shared as main drivers for implementing an optimized data catalog. It’s easy to see that successfully adopting a catalog is a win-win for all data citizens, technical and business users. Everyone can easily see what’s in it for them. Everyone gets time back to do the job they were hired to do instead of hunting for data or explaining data redundantly, depending on what side of the data you’re on, of course, consuming, using, creating, or maintaining.

An effective way, some may say the only way, as you touched on this, Malcolm, to truly achieve data literacy is by leveraging automation. Without automation, most attempts fail.

By the time you’re done manually centralizing an inventory, it has become pretty much stale. Because that process is just so time-consuming, keeping the inventory up to date without automation is almost impossible.

Octopai automates creating data discovery, which describes where data is used, and data lineage, which describes how data flows through the different systems: what are the sources of the data, and what happened to it? It most commonly serves use cases such as root cause and impact analysis. We’ll take a brief look at that if we’ve got some time, but today we will dive deeply into the data catalog. Let me go ahead and share a demo.

This here is what Octopai’s data catalog looks like. Basically, it’s the one-stop shop for the data. Let’s start out by just briefly running through the different layers of data assets that Octopai automatically harvests. We harvest assets automatically from the different reporting tools, different databases, different ETL systems. These are just samples of the different technologies that we support automation for.

The reason I’m showing this to you is, it relates back to what you were describing before, Malcolm. We have different types of users that are all going to end up collaborating in one catalog instead of in siloed systems, instead of having technical users using a dictionary and our business glossary for the business users, just maintaining all these many different tools, which has really not proven effective in the past few years. We’re going to want everyone to work in the same place, but we want to help each type of user focus on the type of assets that’s relevant to them. Different users can set this up to see the types of assets that are relevant to them. Let me give you an idea of what that is.

If I’m strictly a business user, I’m interested mostly in reporting tools, specifically, presentation layers of reporting tools. That’s those final, different columns and KPIs that show up on reports. I’m also interested in the actual reports. If I’m a self-service user, I’m probably also interested in the semantic level where all the logic is, as well as the physical end, which is how it relates back to the databases. You can see that every user has their own type of assets and layers of assets that they would be interested in using.

Octopai enables focusing exactly on the type of asset that you’re interested in, which is super important because in an average catalog, to give you guys an idea, there are going to be around a million assets, 1 million assets, that means you got to get good capabilities to focus on the assets you’re interested in, be able to search through them, filter through them, and we’re going to go through that in a moment. Now we understand that this entire inventory has been created automatically from our entire ecosystem, from the ETLs, from the reports, from the databases.

Let’s run through this with a use case. Say we’ve got a business user who’s interested in a sales report, and he wants to know which report would match his needs. In Octopai, what you would do is use the filter over here, which is the same as in any marketplace, to filter out and say, “Hey, I’m only looking for reports at the moment that have been tagged as sales,” and click on Apply. We can also add the term summary for instance, and say, “Here we go.”
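The marketplace-style filter described here can be pictured with a small Python sketch. The `Asset` record, the sample catalog entries, and the `search` function are hypothetical illustrations of filtering by asset type, tag, and search term; they do not represent Octopai’s API.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    name: str
    asset_type: str
    tags: set = field(default_factory=set)


# Hypothetical catalog entries for illustration only
CATALOG = [
    Asset("Order by Sales Rep Summary", "report", {"sales", "EMEA", "PII"}),
    Asset("Sales Detail by Region", "report", {"sales"}),
    Asset("Headcount Summary", "report", {"hr"}),
    Asset("DWH Fact Sales", "table", {"sales"}),
]


def search(catalog, asset_type, tag, term=""):
    """Return assets of one type that carry a tag and whose name contains term."""
    term = term.lower()
    return [
        asset
        for asset in catalog
        if asset.asset_type == asset_type
        and tag in asset.tags
        and term in asset.name.lower()
    ]


matches = search(CATALOG, asset_type="report", tag="sales", term="summary")
print([asset.name for asset in matches])
```

With a catalog of around a million assets, a real implementation would rely on an index rather than a linear scan, but the combination of type, tag, and term is the same idea.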

Here’s the order-by-sales rep summary. Great. That’s what I was looking for. I’m looking for a sales report that is summarized. Let me go ahead and click on this and see the different definitions for this report. By clicking on it, what the user sees right away is, all these different tags that this report has been associated with and this Power BI report has been associated with sales, Salesforce, EMEA, GDPR, a specific project, it’s been associated with PII, it’s been associated with orders.

We can see the rating over here, which has been rated 4.5 by two different users. By clicking on it, you can actually see who’s been using it and rating it. These are typically users with high engagement, whom he may want to collaborate with about this data, and we’ll show you how you collaborate within Octopai in a few moments. You can also, of course, rate through this functionality as well.

Next, you can see the status is approved, so this asset has been approved for use. By the way, that’s why I’ve got this badge over here to make it easy for him to select these assets from the list. We can see it’s been flagged as sensitive. This over here gives the business user an idea of whether this is the type of report that he should be further looking into.

Next, we can scroll down and see the different descriptions that were provided for this report. There’s this long description over here that says, “Yearly EMEA sales report contains detailed sales information by sale–” long description. It’s got this technical description over here, just all shorter, “Yearly sales for last year at account level.”

We can see here an origin description. For tools that support origin descriptions within the actual tool, Octopai automatically harvests them and shows them right over here. We can see the calculation description, so since this is a report, we’ve entered here the filter condition, “All data filtered for EMEA only.” Again, if it’s a logical data asset, and it’s already got a calculation in it, for semantic layers, for reports, for instance, the origin calculation will already show up over here.

We can see the asset as a report. We can see a datatype when it’s relevant. We can see the path to it, the source system it’s been documented for. We can see two really important roles about this data asset. We can see the data owner responsible for the business aspects of this asset and the business definitions. We can see the data steward responsible for the technical aspects of this asset and the correctness of it. We can see who updated and when, and so on.

Next, down here, we can see all these assets that have been linked to this report. Since this is a report, Octopai automatically links assets that come from this report, the different KPIs, the different columns, and so on. We’ll take a look at this in a moment. We can also add additional links here, right within Octopai. Say you want to relate this report to some type of project, which is also an asset in Octopai. You can click on the Add, add the specific project here, to the linked assets for the report.

Let’s say as a business user, this report really seems to answer my needs. I still have some questions about the data. I’m sure that all of you are familiar with that. In Octopai, we’ve got this built-in collaboration that allows your users to collaborate within the platform. Let’s take a look at this example. We’ve got Holly Miller over here, who’s reached out to Sophia, right here. She’s the data owner. “Does this report represent fiscal or calendar year?” She’s got additional questions about this report. We can see Sophia has mentioned Holly over here, replying that the report uses fiscal year.

By mentioning each other, they each get a notification with a link to continue collaborating about the data right here within the catalog. What’s great about that is that not only is everyone collaborating in one place, where everyone knows exactly what they’re talking about and everyone’s on the same page, but this also gets saved. This is tribal knowledge that otherwise gets lost. If they didn’t have this available to them, 10 other users would end up reaching out to Sophia asking the same question, with Sophia replying to each of them separately, and maybe even having to check for it separately.

What happens when Sophia gets promoted to a different role and she’s the subject matter expert? Now who knows this information and needs to look it up? By documenting this here in context, giving the option to collaborate about in context over here, it has huge benefits by preserving all that terminology and all of those discussions and really creating that tribal knowledge.

Let’s reach back down to these linked assets. Assuming that Holly wants to continue investigating this report and feels this report is a good fit for her needs, now she wants to see what it includes. She reaches down here to the linked assets and sees, for example, the asset total due sum. By clicking on it, she actually goes to now look at the details for total due sum, an additional column within that Power BI report. Of course, it’s linked to the actual report. It’s also linked to the semantic layer of where the logic lies for this column. All these same attributes that we just spoke about exist here as well, including the tags rating and so on.

Now, she’s got additional questions. She reaches out to Jeff Smith this time, asking, “It looks like the sum is rounded. Can you let me know if the amount is rounded up or down? The numbers aren’t matching up with other reports.” What this means is, Jeff now got notified to answer right here within the catalog. He probably needs to check this out. The way to check this out, well, that’s traceability.

In Octopai, we have a lineage that’s basically built in and integrated to the catalog. What he can do in this case is click on the three dots over here simply, click on the End to End Column Lineage and go directly to the Column Level Lineage for the total due sum in this report and be able to trace the data flow all the way back through the different database objects, the different ETLs over here through another database object, through additional ETLs, all the way back to its original source over here.

This is completely, I would say, technology agnostic. You’re connecting databases of different types, maybe Oracle and SQL Server and Snowflake all in the same lineage, and different types of ETL tools. You may be using both SSIS and ADF, for instance; that’s all the same to Octopai. We bring everything to that unified view, the visualization to see everything at one level. We can see total due is ultimately coming from the total due in sales order header in Adventure Works in this demo environment.

Let’s say that’s not enough. Jeff, we said, the technical user, he’s the steward. He wants to see the logic within this specific data flow over here of this ETL, this SSIS data flow. He can then click over here and say, “Take me to the Inner System Lineage to visualize the entire Column Level Lineage for this process and see the logic for that.” I’m going to go ahead and click on the Inner System Lineage. What he sees over here is the column-level mapping of the entire package, which shows how the data is getting from any column, all the way to its target through all the different components and transformations within this data flow, to the final destination in this SSIS package.

Once he’s here, let’s say he wants to see, “What else is using this table, DWH Fact Sales?” I click on it. You can see here it’s DWH Fact Sales in Schema-DBO in a Database E2E_Dwh_Sales. This is the component name that was given here within the ETL. Let me go ahead and close that. By clicking on the three dots over here, he can say, “Let me see the lineage for this entire table as a whole, not at column level this time.” Click on the lineage object and see how this table is being populated by these two different ETLs, again, from completely different systems, and being used in these analytic models, OLAP and tabular.

It’s also being used in these different views, in this procedure. This view is actually being used for these reports over here. By the way, this all ties back together, as you probably guessed by now. If I click on this view, I can say, “Hey, let me see what the definitions are for this view.” That’s easy. Click on it, click on the ADC View Automated Data Catalog. Now, we’re looking at all the different definitions for this view in SQL Server. It’s easy to see how all the different types of users collaborate in this one space that answers all these different types of needs, whether they’d be more technical or more business oriented.

Let me go ahead and share my slide again. [silence] Here we go. Basically, as you can see, the data catalog creates independence in using data while preserving that tribal knowledge through collaboration, as well as the traceability through the lineage. An effective catalog will enable data citizens to independently answer questions such as, “Where should I look for my data? Does this data matter? What does this data represent? Is this data relevant and important? How can I use this data?” I can go on and on. That is where true value is, adopting a collaborative data catalog is the ultimate enabler of any data driven organization. [silence] I think we’re ready for Q&A?

Shannon: Okay. Malcolm, thank you so much for this great presentation. It’s been fabulous. We got a lot of questions coming in. Just to answer the most commonly asked question, just a reminder, I will send a follow-up email by end of day Thursday for this webinar, with links to the slides and the recording along with anything else requested. Diving in here, what is the difference between lineage and traceability?

Malcolm: I think I brought that up, so I’ll give you my definition. Lineage is standing at the far end of a data flow, like in a report, and saying, “Where did this data come from?” and trying to look upstream. Impact is standing upstream somewhere and saying, “I’m going to change something. I wonder what it can affect downstream from where I am.” That’s my contribution to the topic.

Amichai: Certainly. Malcolm, I agree. That’s a really good definition. I think that also you can look at lineage as part of the traceability for the data. As Malcolm mentioned, lineage will provide you with that data flow and understanding the origins exactly of the data. There are additional aspects to traceability, but that’s going to be the backbone of it.
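The two directions in this exchange can be sketched as walks over the same graph of data flows. The following Python example is a minimal illustration under assumed inputs: the flow list, the node names, and the `reachable` helper are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical data flows: (source, target)
FLOWS = [
    ("source.sales_order_header", "etl.load_sales"),
    ("etl.load_sales", "dwh.fact_sales"),
    ("dwh.fact_sales", "view.sales_summary"),
    ("view.sales_summary", "report.sales_by_rep"),
    ("dwh.fact_sales", "model.sales_tabular"),
]


def reachable(start, flows, direction):
    """Breadth-first walk from start; direction is 'upstream' or 'downstream'."""
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")
    neighbors = defaultdict(list)
    for source, target in flows:
        if direction == "downstream":
            neighbors[source].append(target)
        else:
            neighbors[target].append(source)

    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Lineage: stand at the report and look upstream.
lineage = reachable("report.sales_by_rep", FLOWS, "upstream")
# Impact: stand at the table and look downstream.
impact = reachable("dwh.fact_sales", FLOWS, "downstream")
print(sorted(lineage))
print(sorted(impact))
```

Both questions use the same stored relationships; only the direction of the walk changes, which is why one harvested lineage graph can serve root cause analysis and impact analysis alike.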

Shannon: Awesome. Which of the three uses apply to data privacy professionals? Again, I think that was part of your section there, Malcolm?

Malcolm: The data catalog, the business glossary, and the data dictionary are all going to be important for data privacy professionals. You can think about the data catalog at the data set level. You would want to know what are the points at which data is given to service providers that might include personal information? What data sets are we giving to service providers? Because we would have to pass on a data subject access request to them. That’s an example of the data catalog.

The data dictionary is going to be, “Well, what are our actual data elements that contain personal information? Where are they?” The business glossary would be, I have a business term called, I don’t know, employment, previous employer’s name. That might be in two or three tables in a human resources database, but previous employer’s name is subcategorized as employment history. Employment history is one of those– if I remember correctly, I’m sure people will correct if I’m not, I think it’s one of those categories that you have to disclose it to people from whom you collect personal information under the CCPA or CPRA, is going to be shortly. You can see that there’s different uses for these three capabilities for data privacy professionals, but they’re all used in some way, albeit different ways.

Shannon: Amichai, anything you want to add there?

Amichai: I think that described it really well. Thanks.

Shannon: Awesome. A term thrown around like data domains linking to specific lines of businesses, where do those fit? I find there are some aspects from a business and technical perspective. Is it in data catalog, or business glossary?

Amichai: I’ll take that one. I think that’s a really good question. I believe that the answer is that everyone needs to look at the same system. The last place you want to be is in a place where you’re maintaining different systems for different types of users. Maintaining one is difficult enough. The catalog ultimately should be the place where the technical users reach out to the business users who can have their answers there as well, and then everyone can collaborate in that place to answer all those different use cases.

Malcolm: I always think that the word, ‘domain’ is the most overused word in data management and data governance. Could you repeat the question for me, please?

Shannon: Yes, sure. A term thrown around, ‘data domains’ linking to specific lines of businesses, where do those fit? I find there are some aspects from a business and technical perspective. Is it in data catalog or business glossary?

Malcolm: Depends what you mean by data domain. Some people think it’s like reference data, like list of valid values. Others will say it’s subject areas. If it’s subject areas, then probably in something like a business glossary. No, actually I take that back. Probably in the data catalog. Anyway, that’s my thoughts on it, Shannon.

Shannon: I love it. So many great questions coming in. I’m just trying to move rapidly through them here. Is the consolidated view data catalog a metamodel because it has detailed row based for each instance of first, middle, and last name?

Malcolm: I

Amichai: Go for it.

Malcolm: By definition, a product like Octopai is dealing with metadata and housing in a structured way. It has a metamodel. Anything that is going to house metadata has to have a metamodel because we would define a metamodel as the data model for metadata. I think that would answer the question, is yes, you do need a metamodel.
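A metamodel in this sense, a data model for metadata, can be sketched with a few record types. The Python classes below are a hypothetical, deliberately tiny illustration that reuses the CL and Customer Last Name example from earlier in the talk; the attribute names, the sample definition text, and the data type are assumptions, and a real product’s metamodel would be far richer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BusinessTerm:
    """Business Glossary side: meaning."""
    name: str
    definition: str


@dataclass(frozen=True)
class Column:
    """Data Dictionary side: structure."""
    database: str
    table: str
    name: str
    data_type: str


@dataclass(frozen=True)
class ReportField:
    """Catalog entry tying a report label to its column and its term."""
    report: str
    label: str
    source: Column
    term: Optional[BusinessTerm] = None


# Hypothetical instances for illustration
last_name = BusinessTerm("Customer Last Name", "Family name of a customer.")
cl = Column("crm", "customer", "CL", "varchar(50)")
report_field = ReportField("Customer List", "Customer Last Name", source=cl, term=last_name)

print(f"{report_field.report} / {report_field.label} <- {cl.database}.{cl.table}.{cl.name}")
```

Note that the metamodel holds one entry per column or report field, not one entry per row of customer data.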

Shannon: Amichai, anything you want to add to that?

Amichai: Yes, I think that when we move to think of catalogs, in a way, we stop looking at the technical aspects really of what’s going on behind the scenes. We start speaking of it in terms of the value and the different types of users and use cases that they can really use it for. Different catalogs use different frameworks. Everything we’ve shown today, which speaks about being able to provide the definitions, traceability, collaboration, those are all the things that you really should be looking for in it.

Shannon: We anticipated this. There’s quite a few questions on what Octopai connects to. Is there a good resource, Amichai, for that kind of thing?

Amichai: Certainly, you can find them on our website, Octopai.com. You’ll see all the different supported systems. Those are systems that we support for automation of harvesting the different assets. Octopai can also ingest assets externally. Both of those options are available. You can see them on the website. You’re welcome to reach out, of course, if there are any additional questions about it.

Shannon: Perfect. With automation, is manually added metadata preserved or overwritten? What about with product updates? How are customer attributes, for example, manually added metadata preserved?

Amichai: Perfect. Yes, anything added manually is, of course, preserved, everything automated gets overwritten in a way, meaning that if you’ve got some type of original calculation in the example I gave before, and then we’ve got a description for this calculation, the description that’s been provided in Octopai manually, that gets preserved, of course. The actual calculation, if it changes in the actual metadata, that gets updated automatically.
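The rule described in this answer, harvested values refreshed while manual values are kept, can be sketched as a field-level merge. This Python example is a hypothetical illustration: the field names, the `HARVESTED_FIELDS` set, and the `refresh` function are assumptions and do not describe how Octopai stores or updates assets.

```python
# Fields assumed to be owned by automation; everything else is treated as manual.
HARVESTED_FIELDS = {"calculation", "data_type", "path"}


def refresh(existing, harvested):
    """Return the asset record after a harvest run.

    Harvested fields take the latest value from the source system.
    Manually entered fields are carried over unchanged.
    """
    updated = dict(existing)
    for key in HARVESTED_FIELDS:
        if key in harvested:
            updated[key] = harvested[key]
    return updated


existing = {
    "calculation": "SUM(total_due)",
    "calculation_description": "All data filtered for EMEA only.",
    "data_owner": "Sophia",
}
# The latest harvest sees a changed calculation and carries no manual description.
harvested = {"calculation": "ROUND(SUM(total_due), 0)", "calculation_description": ""}

result = refresh(existing, harvested)
print(result)
```

Keeping an explicit list of automation-owned fields is one simple way to make sure a harvest run can never blank out a description someone typed in.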

Shannon: Do we need to model the reference data in Power BI before we upload the sheet in data catalog tool?

Amichai: Not at all. Octopai does that entire process as part of the automation.

Shannon: Awesome. I love it. Can Octopai do data lineage for Python ETL code? Data lineage is normally SQL-based. Nowadays all Cloud ETL happens using Python data frames. Do you have a solution for this?

Amichai: Octopai will automatically, of course, create the lineage for the SQL and for all the different ETL tools that we support, which are many. Python code is not supported with the automation, but the lineage for it can be injected to reflect it with the same visualization and enrich the already automated lineage that you all have from the rest of the ecosystem.

Shannon: Awesome. How does the tool manage multilingual definitions?

Amichai: We are just in the completing steps of adding additional customizable attributes which will support exactly that use case.

Shannon: So many great tool questions coming in. In maintaining the data lineage up to date, what steps would be performed automatically, and what steps need to be performed manually?

Amichai: Great. Extracting the metadata, analyzing it, all the machine learning happens automatically on our end. There’s really no effort for all the different tools that we support automation for that’s manual. That’s all automated. As I mentioned before, if you would like to enrich that lineage that’s already created, that’s possible to do manually through our UI, or through dedicated APIs, and so on.

Shannon: I’m going to try and squeeze in one more question here at least. I’ll get any questions we don’t have time for over to Octopai. Does it make sense to store reports in the catalog when the reports are just views of some database at the end? Losing a report is a small loss, no value, but losing data is a big loss.

Amichai: Oh, that’s a really good question. Yes, an asset that’s important to one role is not necessarily the asset that’s important to another role. Since the catalog serves so many different types of users and so many different types of data citizens, you’re going to want any type of data asset that needs to have some type of description and so on, and needs to be associated with any other terms. You want to have that documented in the catalog so that it really is the one source for all of the different types of use cases and all the different types of data assets, regardless of the importance of that actual data to a specific individual.

Shannon: Amichai, thank you so much. Malcolm, thank you so much. As always, another great, great presentation. I’m afraid that is all the time we have for today. Again, I will get these questions over to Octopai if there are any remaining questions we didn’t have time to get to. Again, I will send a follow-up email by end of day Thursday with links to the slides, the recording, and additional information that you all have been asking for. I appreciate it so much. Thanks to all of our attendees for being so engaged in everything we do. As always, another great webinar with you all. I hope you all have a great day. Thanks so much, everybody.

Malcolm: Thank you.

Amichai: Thanks a lot, everyone.