Data profiling is the process of analyzing the characteristics of datasets in order to discover patterns, relationships, or issues with the data. The resulting data quality profiling reports are used for drawing conclusions about how best to use the data or what needs to be corrected before use.
Why would you do data profiling?
While the use cases of data profiling are many (see the next section for a few of them), almost all use cases fall into the following two categories:
To know you shouldn’t use a dataset
Data profiling will show you when your dataset has missing information, structural issues, a high likelihood of errors, or other data quality problems. That’s the kind of thing you want to know about before you (mistakenly) use it to power business operations or decision-making.
To fix a dataset so it can be used
Sometimes you just can’t fix a broken dataset and the only thing to do is put it sadly to the side. Other times, happily, the dataset can be cleansed, corrected, adjusted or otherwise made fit for use. In these cases, data profiling will point out where the issues are, enabling you to find and fix them.
When would you use data profiling?
Some specific use cases for data profiling include:
- Evaluating if a potential data project is worthwhile, or can be undertaken with the data at hand
- Improving the organization’s overall data quality
- In preparation for migration, in deciding what datasets to migrate, requirements for the target system and any corrections or transformations that must happen during the migration process
- After migration, to check that datasets migrated successfully, retaining their level of data integrity
- Determining whether collections of data or schemata can be merged and the most efficient way of doing so
- Better understanding of relationships among data in order to optimize queries
Data profiling processes and techniques
Data profiling processes fall into one of three categories:
- Structure discovery – making sure the general pattern of the data (e.g. is it a string? How long? Numbers, letters or both?) and the basic statistics of the data (e.g. mean, median, mode, standard deviation) indicate that the data is valid. This is usually accomplished through column profiling.
- Content discovery – making sure the individual pieces of data are where they are supposed to be and in the right format. This is usually accomplished through data rule validation.
- Relationship discovery – establishing the connections, relationships and dependencies between different data sources, such as two database tables or two cells within the same table. This is usually accomplished through cross-column profiling (within one table) and cross-table profiling (between two or more tables).
Data profiling metrics explained
How do you understand the data profiling report that you’re given at the end of the data profiling process? How do you apply all those numbers and ranges to practically improve your data quality and processing?
Let’s break down some of the more common data profiling metrics:
- String length (minimum, maximum, average) – how many characters are in the fields of a given column. String length metrics are helpful both for checking data validity (e.g. if the column is for zip codes and you have a minimum string length of 3 and a maximum string length of 12, you know some of those fields have invalid entries) and for optimizing your data transfer and storage (e.g. selecting appropriate data sizes during migration to a target database).
- Aggregate values (minimum, maximum, median, mean, mode, extreme) – give you a bird’s-eye view of the distribution of values and outliers in a given column. Aggregate value metrics are helpful for checking data validity (e.g. if the column is for age, and the maximum is 175, at least one of the fields has an invalid entry).
- Count and percent (filled, null, distinct, unique) – give you an overview of the completeness and variability of the data. Count and percent metrics are helpful for identifying natural keys as well as missing or unknown data.
- Pattern and frequency distributions – do fields match a set pattern (e.g. email addresses, phone numbers). Pattern and frequency distribution metrics are important for validating data, especially that which is used for outbound communications and for which an error will result in delivery failure.