Harvest Data Quality Information

First, import the data quality information directly from the third-party tool using the appropriate import bridge:

  • From a commercial tool, such as Informatica Cloud Data Quality, using its import bridge

  • From any other unsupported tool (or an in-house DQ solution), using the Meta Integration Data Quality (DQ) CSV Format import bridge.

The result is a Data Quality type model that contains data quality information related to a specific data store (e.g., a database).

Steps

  1. Create a new Imported Model using the data quality import bridge for the particular source of that data.

Example

In this case, we will import from the Meta Integration Data Quality format directory. The data quality information is for the Data Lake file system files; thus, the Data Lake model has already been imported and is in the configuration:

Import the data quality model, to be named "Data Lake DQ", using the Meta Integration Data Quality import bridge.

Click OPEN and take a look at the data quality model as imported.

Go to More Actions and select Show in Object Explorer.

Here you may see the rules used to determine the data quality metrics:

Click the Data Quality tab.

The next step is to stitch this model to the actual imported model for the Data Lake.

Meta Integration Data Quality (DQ) CSV Format

The Meta Integration generalized DQ solution allows you to augment previously imported file system or relational data stores with Data Quality, Data Profiling, and Data Sampling information, either using a CSV file format imported by the "Meta Integration Metadata Management (MM) Data Quality (CSV)" bridge or by extending a Java interface directly.

The Data Quality CSV format includes multiple CSV files organized in directories as follows (an illustrative sample is shown after the format listing below):

Data Quality CSV Format

As the CSV file contents are derived from different aspects of the source tool(s), possibly produced using different methodologies, they cannot be directly correlated across these different aspects, including:

  • Data Quality Rules (e.g., Rule Application.csv)

  • Data Sampling (e.g., Data Sampling.csv)

  • Profiling (e.g., Table.csv)

  • Rules (folder with Data Quality information)

      • Folder.csv (allows you to recreate a hierarchy of folders in order to organize DQ rules)

          • Id (folder unique identifier)

          • Name (folder name)

          • Parent Folder Id (identifier of the parent folder, optional)

          • Description (folder description, optional)

      • Rule.csv (provides DQ rule declarations)

          • Id (rule unique identifier)

          • Name (rule name; the name must be unique under the parent folder)

          • Folder Id (identifier of the parent folder)

          • Description (rule description, optional)

          • Operation (rule operation, optional)

          • Dimension (rule dimension name; see the list of possible values below)

          • Url (hyperlink to the rule in the original DQ tool, optional)

      • Rule Application.csv (allows you to specify DQ rule instances and associate them with individual columns)

          • Id (rule application unique identifier)

          • Rule Id (identifier of the associated DQ rule)

          • Type (column path, connection type; needed to interpret and then correctly stitch the referenced column in MM; see the list of possible values below)

          • Server (column path, server name: database host:port, Snowflake warehouse, or file service, such as gs://bucket)

          • Catalog (column path, catalog name; it can be a Snowflake or SQL Server database; empty for Oracle and file systems)

          • Schema (column path, schema name; it can be a database schema or a file folder path, such as /dir1/dir2)

          • Table (column path, table name; it can also be a file or partitioned directory name)

          • Column (column path, column name)

          • Description (rule application description, optional)

          • Threshold (rule application threshold: a floating point value used to compute the DQ Status in MM (Good, Acceptable, or Bad); use a comma as the delimiter if more than one value is required)

      • Rule Application Column.csv (provides additional associations between DQ Rule Applications and the corresponding columns when a Rule Application refers to two or more columns)

          • Rule Application Id (rule application identifier)

          • Type (column path, connection type; needed to interpret and then correctly stitch the referenced column in MM; see the list of possible values below)

          • Server (column path, server name: database host:port, Snowflake warehouse, or file service, such as gs://bucket)

          • Catalog (column path, catalog name; it can be a Snowflake or SQL Server database; empty for Oracle and file systems)

          • Schema (column path, schema name; it can be a database schema or a file folder path, such as /dir1/dir2)

          • Table (column path, table name; it can also be a file or partitioned directory name)

          • Column (column path, column name)

      • Rule Measure.csv (provides rule execution results)

          • Id (rule measure unique identifier)

          • Rule Application Id (rule application identifier)

          • Date (the execution date when the values were collected; see the supported date/time format below)

          • Rows Total (the overall number of processed rows)

          • Rows Failed (the number of rows that failed the DQ rule)

  • Profiling (folder with Data Profiling information, optional)

      • Table.csv (table level profiling information)

          • Type (table path, connection type; needed to interpret and then correctly stitch the referenced table in MM; see the list of possible values below)

          • Server (table path, server name: database host:port, Snowflake warehouse, or file service, such as gs://bucket)

          • Catalog (table path, catalog name; it can be a Snowflake or SQL Server database; empty for Oracle and file systems)

          • Schema (table path, schema name; it can be a database schema or a file folder path, such as /dir1/dir2)

          • Table (table path, table name; it can also be a file or partitioned directory name)

          • Creation Date (the date/time when the data profiling information was collected; see the supported date/time format below)

          • Data Update Date (the last date/time when data was updated in the source table; see the supported date/time format below)

          • Records All (the number of records in the source table)

          • Records Processed (the number of records processed by the profiler)

      • Column.csv (column level profiling information)

          • Type (column path, connection type; needed to interpret and then correctly stitch the referenced column in MM; see the list of possible values below)

          • Server (column path, server name: database host:port, Snowflake warehouse, or file service, such as gs://bucket)

          • Catalog (column path, catalog name; it can be a Snowflake or SQL Server database; empty for Oracle and file systems)

          • Schema (column path, schema name; it can be a database schema or a file folder path, such as /dir1/dir2)

          • Table (column path, table name; it can also be a file or partitioned directory name)

          • Column (column path, column name)

          • Creation Date (the date/time when the data profiling information was collected; see the supported date/time format below)

          • Count (KPI with the count of all values in the column; long value)

          • Distinct (KPI with the counter for distinct values; long value)

          • Duplicate (KPI with the counter for duplicate values; long value)

          • Blank (KPI with the counter for blank string values; long value)

          • Zero (KPI with the counter for zero numeric values; long value)

          • Null (KPI with the counter for NULL values; long value)

          • Min (KPI with the minimum value in the column; usually applicable to numeric values)

          • Max (KPI with the maximum value in the column; usually applicable to numeric values)

          • Mean (KPI with the mean value in the column; usually applicable to numeric values)

          • Variance (KPI with the variance of the values in the column; usually applicable to numeric values)

          • Median (KPI with the median value in the column; usually applicable to numeric values)

          • Lower quantile (KPI with the lower quantile value in the column; usually applicable to numeric values)

          • Upper quantile (KPI with the upper quantile value in the column; usually applicable to numeric values)

          • Avg length (KPI with the average value length in the column; usually applicable to string values; double value)

          • Min length (KPI with the minimum value length in the column; usually applicable to string values; long value)

          • Max length (KPI with the maximum value length in the column; usually applicable to string values; long value)

          • Valid (KPI with the counter for valid values; usually computed based on the column's data type; long value)

          • Invalid (KPI with the counter for invalid values; usually computed based on the column's data type; long value)

          • Data Type Invalid Values (serialized list of the invalid values, usually computed based on the column's data type; use the semicolon ';' as the delimiter and the backslash '\' as the escape character to serialize multiple values)

          • Data Type Frequency (serialized list of data type frequencies in the format "[value],[count],[ratio]", e.g. INTEGER,6,60.00;STRING,4,40.00; use the semicolon ';' as the delimiter and the backslash '\' as the escape character to serialize multiple values)

          • Pattern Frequency (serialized list of pattern frequencies in the format "[value],[count],[ratio]", e.g. 999-99-9999,19972,1.00000; use the semicolon ';' as the delimiter and the backslash '\' as the escape character to serialize multiple values)

          • Value Frequency (serialized list of value frequencies in the format "[value],[count],[ratio]", e.g. 987-66-3365,1,0.00005;987-66-5400,1,0.00005; use the semicolon ';' as the delimiter and the backslash '\' as the escape character to serialize multiple values)

  • Sampling (folder with Data Sampling information, optional)

      • Data Sampling.csv (mapping file that allows you to look up Data Sampling information for the individual tables/files; it is used to resolve special characters that can be part of table names but are not allowed in file system file names)

      • *[file path recreated from 'Data Sampling.csv'] (e.g. "\JDBC\172.0.0.1\FinanceDWDimensional\dbo\Customer.csv")

  • Table Url.csv (file that allows you to specify a hyperlink to the original DQ source for the specified table, optional)

      • Type (table path, connection type; needed to interpret and then correctly stitch the referenced table in MM; see the list of possible values below)

      • Server (table path, server name: database host:port, Snowflake warehouse, or file service, such as gs://bucket)

      • Catalog (table path, catalog name; it can be a Snowflake or SQL Server database; empty for Oracle and file systems)

      • Schema (table path, schema name; it can be a database schema or a file folder path, such as /dir1/dir2)

      • Table (table path, table name; it can also be a file or partitioned directory name)

      • Url (a hyperlink to the corresponding table record in the original DQ tool)
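
For illustration only, here is a minimal, hypothetical sample of the Rules directory built from the column names listed above. The identifiers, values, and the presence of a header row are assumptions for the sake of the example, not output of any particular tool:

    Rules/Folder.csv
        Id,Name,Parent Folder Id,Description
        f1,Customer Rules,,Rules for the Customer table

    Rules/Rule.csv
        Id,Name,Folder Id,Description,Operation,Dimension,Url
        r1,SSN not null,f1,SSN column must not be NULL,IS NOT NULL,Completeness,

    Rules/Rule Application.csv
        Id,Rule Id,Type,Server,Catalog,Schema,Table,Column,Description,Threshold
        ra1,r1,JDBC,172.0.0.1,FinanceDWDimensional,dbo,Customer,SSN,SSN completeness check,0.95

    Rules/Rule Measure.csv
        Id,Rule Application Id,Date,Rows Total,Rows Failed
        rm1,ra1,2024-01-31 10:00:00,20000,28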

Data Quality CSV Format Values

  • Predefined Rule Dimensions

      • Accuracy

      • Validity

      • Completeness

      • Consistency

      • Uniqueness

      • Timeliness

  • Supported date time formats

      • yyyy-MM-dd HH:mm:ss

  • Supported connection types

      • Snowflake

      • JDBC

      • Microsoft Azure Synapse SQL

      • Databricks Delta

      • Amazon Redshift V2

      • Amazon Athena

      • Oracle

      • Amazon S3 v2

      • Azure Data Lake Store Gen2

      • Google Cloud Storage

Meta Integration Data Quality Java-based interface

This data quality solution can also be extended using a Java interface that allows you to manipulate the DQ CSV source files through the Java APIs available in MetaIntegrationDataQualityCsv.jar. See the original Java library for further API documentation.

package MITI.bridges.MetaIntegrationDataQuality

// Entry point: opens the DQ CSV directory and writes the Folder, Rule, Rule Application,
// Rule Measure, and Table Url files described above.
public class DQRules implements Closeable {
    public static DQRules getInstance(String directory)
    public Rule setRule(Rule value)
    public RuleApplication setRuleApplication(RuleApplication value)
    public RuleMeasure setRuleMeasure(RuleMeasure value)
    public void setTableUrl(Table table, String url)
    public void write()
    public void close() throws IOException
}

// Common base class carrying the object identifier.
public abstract class DQObject {
    public void setId(String id)
    public String getId()
    public String getOriginalId()
}

// DQ rule declaration (corresponds to Rule.csv).
public class Rule extends DQObject {
    public Rule(String id, String name, String description, String operation, DataQualityDimension dimension, String url, Folder folder)
    public String getName()
    public String getDescription()
    public void setDescription(String description)
    public String getOperation()
    public void setOperation(String operation)
    public String getUrl()
    public void setUrl(String url)
    public DataQualityDimension getDimension()
    public void setDimension(DataQualityDimension dimension)
    public Folder getFolder()
}

// Folder used to organize DQ rules (corresponds to Folder.csv).
public class Folder extends DQObject {
    public Folder(String id, String name, String description, Folder parentFolder)
    public String getName()
    public void setDescription(String description)
    public String getDescription()
    public void setParentFolder(Folder parentFolder)
    public Folder getParentFolder()
}

// DQ rule instance associated with one or more columns (corresponds to Rule Application.csv).
public class RuleApplication extends DQObject {
    public RuleApplication(String id, Rule rule, Column[] columns, String description, float[] thresholds)
    public Rule getRule()
    public Column[] getColumns()
    public String getDescription()
    public void setDescription(String description)
    public float[] getThresholds()
    public void setThresholds(float[] threshold)
    public SortedSet getRuleMeasures()
}

// Column path (connection type, server, catalog, schema, table, column) used to stitch the column in MM.
public class Column extends DQObject {
    public Column(String name, String table, String schema, String catalog, String server, DataStoreType type)
    public String getName()
    public String getTable()
    public String getSchema()
    public String getCatalog()
    public String getServer()
    public DataStoreType getType()
}

// Rule execution result (corresponds to Rule Measure.csv).
public class RuleMeasure extends DQObject implements Comparable {
    public RuleMeasure(String id, RuleApplication ruleApplication, Date date, long rowsTotal, long rowsFailed)
    public RuleApplication getRuleApplication()
    public Date getDate()
    public long getRowsTotal()
    public void setRowsTotal(long rowsTotal)
    public long getRowsFailed()
    public void setRowsFailed(long rowsFailed)
    @Override
    public int compareTo(RuleMeasure o)
}
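
As an example, the following sketch shows how this API might be used to produce the DQ CSV files for the hypothetical rule from the sample shown earlier. It is illustrative only, assuming the classes above live in the MITI.bridges.MetaIntegrationDataQuality package as listed; the directory path and identifiers are made up, and the exact enum constant names in DataQualityDimension and DataStoreType (here Completeness and JDBC) are assumptions based on the predefined dimensions and supported connection types listed above.

import java.io.IOException;
import java.util.Date;
import MITI.bridges.MetaIntegrationDataQuality.*;

public class DataLakeDQExport {

    public static void main(String[] args) throws IOException {
        // Open the target DQ CSV directory (path is illustrative).
        try (DQRules dq = DQRules.getInstance("C:/temp/DataLakeDQ")) {

            // Folder.csv: a folder to organize the rules.
            Folder folder = new Folder("f1", "Customer Rules", "Rules for the Customer table", null);

            // Rule.csv: declare the DQ rule (dimension constant name is an assumption).
            Rule rule = new Rule("r1", "SSN not null", "SSN column must not be NULL",
                    "IS NOT NULL", DataQualityDimension.Completeness, null, folder);
            dq.setRule(rule);

            // Rule Application.csv: apply the rule to a specific column path
            // (connection type constant name is an assumption).
            Column column = new Column("SSN", "Customer", "dbo", "FinanceDWDimensional",
                    "172.0.0.1", DataStoreType.JDBC);
            RuleApplication application = new RuleApplication("ra1", rule,
                    new Column[] { column }, "SSN completeness check", new float[] { 0.95f });
            dq.setRuleApplication(application);

            // Rule Measure.csv: record one execution result (rows processed vs. rows failed).
            dq.setRuleMeasure(new RuleMeasure("rm1", application, new Date(), 20000L, 28L));

            // Write the CSV files; close() is then called automatically by try-with-resources.
            dq.write();
        }
    }
}

The resulting directory can then be imported with the Meta Integration Data Quality import bridge as described in the Steps section above.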