Data validation is the process of checking a dataset to confirm it is free of defects. It verifies that the dataset is fit for use and contains no errors or problems that would make it unusable or produce inaccurate results.
Validating data before use is as necessary as running basic quality checks on manufactured products before delivering them to customers – or on your raw materials before using them in products. If you manufacture cars, and a given car has holes in the tires or an engine that won’t run on standard fuel, you need to catch it before it reaches a dealership. Your customers are not going to be happy if you put defective products into their hands.
Your data is your product, or at the very least the raw material out of which your product is built. Your end result will be only as good as the quality of the materials, making data validation a critical part of any data management process.
Types of data validation checks (with examples)
There are multiple types of data validation checks that can be run on datasets to ensure they meet your usability standards. Depending on the nature of the data, a check may run against an entire dataset, against a segment such as a table, column, row, or field, or against a specified range within one of those segments.
Types of data validation checks include:
- Data type validation: confirms that the data matches a specific data type. For example, a column that is supposed to contain Boolean values (true or false) should contain only those two values. A column that is supposed to contain integers (e.g. “house number”) should not have any decimal points.
- Constraint validation: confirms that the data is within a specified range. For example, if the fields in a column are supposed to contain between 5 and 12 characters, constraint validation will flag any fields that fall outside that range, such as those with 4 or 14 characters.
- Consistency validation: confirms that the data format is consistent within the range. For example, that dates are always entered as DD-MM-YYYY, and not as MM-DD-YYYY, or vice versa.
- Code validation: confirms that data that is supposed to be based on a standard coding system actually matches the available codes in that system. For example, a column that is supposed to contain country codes from the ISO 3166 international standard, Alpha-2 code type, can contain US, but not UN (which is not a valid country code) and not USA (which is a valid ISO 3166 code, but of the Alpha-3 type rather than Alpha-2).
- Presence validation: confirms that the data does not contain null values. For example, a required age column should not have any empty fields.
- Uniqueness validation: confirms that there are no duplicates in the data fields in the specified range. For example, a column that is supposed to contain a unique customer ID number should not have any numbers that appear twice. Uniqueness validation is important in checking that tables contain primary keys.
- File format validation: confirms that the file format and structure are compatible with the programs and systems that will store and use the data. For example, if a system can only handle spreadsheets, format validation will identify JSON files as invalid.
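Several of the checks above can be sketched in plain Python. The snippet below is a minimal, illustrative example, not a production validator: the field names, the tiny subset of ISO 3166 Alpha-2 codes, and the sample rows are all hypothetical, and the date regex checks only the DD-MM-YYYY shape, not calendar validity.

```python
import re

# Hypothetical rows to validate; field names are illustrative.
rows = [
    {"customer_id": "C001", "age": 34,   "country": "US",  "signup_date": "01-02-2023"},
    {"customer_id": "C002", "age": None, "country": "UN",  "signup_date": "2023-02-01"},
    {"customer_id": "C001", "age": 41,   "country": "USA", "signup_date": "15-07-2023"},
]

VALID_ALPHA2 = {"US", "GB", "DE", "FR"}  # tiny illustrative subset of ISO 3166 Alpha-2
DATE_DDMMYYYY = re.compile(r"^\d{2}-\d{2}-\d{4}$")  # consistency: DD-MM-YYYY shape only

def validate(rows):
    errors = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Presence validation: a required field must not be null.
        if row["age"] is None:
            errors.append((i, "age is missing"))
        # Data type validation: age must be an integer when present.
        elif not isinstance(row["age"], int):
            errors.append((i, "age is not an integer"))
        # Constraint validation: age must fall within a plausible range.
        elif not 0 <= row["age"] <= 120:
            errors.append((i, "age out of range"))
        # Code validation: country must be a known Alpha-2 code.
        if row["country"] not in VALID_ALPHA2:
            errors.append((i, f"invalid country code {row['country']!r}"))
        # Consistency validation: dates must all use the DD-MM-YYYY format.
        if not DATE_DDMMYYYY.match(row["signup_date"]):
            errors.append((i, "date not in DD-MM-YYYY format"))
        # Uniqueness validation: customer_id must not repeat.
        if row["customer_id"] in seen_ids:
            errors.append((i, f"duplicate customer_id {row['customer_id']!r}"))
        seen_ids.add(row["customer_id"])
    return errors
```

Running `validate(rows)` on the sample data surfaces one missing age, two invalid country codes, one inconsistent date format, and one duplicate ID. In practice these checks are usually expressed as declarative rules in a validation library or database constraints rather than hand-written loops.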
Where in data processing does the data validation process fit?
The data validation process can come into play any time a dataset passes from one system to another. One of the most important points is at the entry to your data environment, where you can identify and filter out or fix bad or unusable data before it enters your systems. Data validation checks are often built into ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) processes.
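One common pattern for such an entry-point check is a validation gate between the extract and load steps: rows that pass are loaded, and rows that fail are quarantined for review rather than silently entering the system. The sketch below assumes hypothetical field names (`id`, `amount`) and a deliberately minimal rule set.

```python
# Hypothetical validation gate in an ETL-style pipeline.
# Rows failing validation are quarantined instead of loaded.

def is_valid(row):
    # Minimal illustrative rules: presence check on "id",
    # data type check on "amount".
    return row.get("id") is not None and isinstance(row.get("amount"), (int, float))

def load_with_validation(rows):
    loaded, quarantined = [], []
    for row in rows:
        (loaded if is_valid(row) else quarantined).append(row)
    return loaded, quarantined

extracted = [
    {"id": 1, "amount": 9.99},
    {"id": None, "amount": 5.00},  # fails presence validation
    {"id": 3, "amount": "free"},   # fails data type validation
]
loaded, quarantined = load_with_validation(extracted)
```

Keeping a quarantine list, rather than dropping bad rows outright, preserves the defective records so they can be inspected and fixed upstream.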