Data Sampling and Profiling
While technical and descriptive metadata contain a wealth of information about objects, this is only true if that information has actually been documented. In many cases, metadata is incomplete, and the best way to determine what it should be (e.g., semantic data type, valid values, etc.) is to look at the data itself.
MetaKarta provides the option to sample and/or profile the actual data contained in files and tables, in addition to the metadata captured from a source format or tool. One may specify the number of records to profile and how many should be maintained as a sample for visualization later, and whether to randomly sample or start at the top.
Data sampling will provide sample rows from the dataset.
Data profiling helps discover business knowledge embedded in the data itself, improves users' understanding of the data, and enables them to classify data with confidence. The data profiling process creates a summary of a model's data, consisting mainly of statistics and charts. It helps users determine whether the correct data is available at the appropriate level of detail.
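The kind of per-column summary described above (row counts, empty counts, distinct values and their frequencies) can be sketched in a few lines. This is an illustrative stand-in, not MetaKarta's internal profiler; the function name and output keys are hypothetical.

```python
from collections import Counter

def profile_column(values):
    """Build a simple profiling summary for one column of data.
    Hypothetical helper for illustration only."""
    total = len(values)
    non_empty = [v for v in values if v not in (None, "")]
    counts = Counter(non_empty)
    return {
        "count": total,                     # rows profiled
        "empty": total - len(non_empty),    # null or empty rows
        "distinct": len(counts),            # distinct values
        "top_values": counts.most_common(3) # values and their frequencies
    }

summary = profile_column(["M", "F", "F", None, "M", "M", ""])
```

A chart like the Values distribution shown on an object page is essentially a rendering of `top_values`.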
That information is then available when one navigates to the file or table's object page or when looking at individual fields or columns from the file or table. The application can store and display the following data profiling details for table/view and column objects:
- Counts (standard and custom counts, like empty and valid rows)
- Values (distinct values and their counts)
- Patterns (patterns and their counts)
- Data types (inferred data types and their counts)
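The Patterns detail generalizes each value into a shape string and counts how often each shape occurs. A common convention in profilers is to map digits to `9` and letters to `A`; that alphabet is an assumption for illustration, and the product's exact convention may differ.

```python
from collections import Counter

def pattern_of(value):
    """Generalize a value into a pattern string: digits become '9',
    letters become 'A', other characters are kept as-is.
    Illustrative convention, not necessarily the product's."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

ssns = ["123-45-6789", "987-65-4321", "12-3456789"]
patterns = Counter(pattern_of(v) for v in ssns)
```

Two of the three values share the pattern `999-99-9999`, so the odd one out (a likely data quality issue) is immediately visible in the pattern counts.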
Sampled data and the profiling results are hidden from most users by default. One must be assigned the Data Viewing capability object role assignment for the model in question. One may also hide the sample data and profiling results for specific models.
For data sampling and profiling, you may:
- Schedule the sampling and profiling separately using the Data sampling and profiling operation. This process will only sample and profile what is specified in the MQL STATEMENT.
- Sample and profile on demand, e.g., on a schema or table via the user interface, at any subset of the model you wish to specify.
- Check the Data sample and profile after metadata import checkbox and specify the MQL Statement as part of the Import Options when defining the metadata import (harvesting) of a data source, to cause the profiling and sampling to occur every time the model is imported. However, this is not a best practice, as sampling and profiling large databases can take orders of magnitude more time than the metadata import.
In a best practices environment, the application or model administrators should define the data import policy (profiling, sampling, and classification), and the users or scheduled tasks should follow that policy. Not surprisingly, companies use different frequencies for metadata and data imports, and thus running the data import (data sampling and profiling) immediately after every import of metadata is discouraged.
Thus, in the user interface, the option to import data is kept separate from the data import policy: the policy is displayed in the Data Defaults tab, while the option remains in the Import Options tab. In this way, the option and policy are independent. The option is turned off by default, but administrators can enable sampling and profiling according to policy. Administrators may also turn off sampling and profiling by default but enable data classification and data import. Note, however, that data sampling and profiling is required in order to perform auto-tagging via data-driven data classification.
Data Sampling and Profiling Options
One may customize the data sampling and profiling request or scheduled action. Please see the technical details for clarity.
- Data Sampling -- Enable data sampling and specify the number of rows to sample.
- Data Profiling -- Enable data profiling and specify the number of rows to use in profiling.
- Data Select Method -- Choose between the fast method Top (the default) and Random (reservoir sampling when available on the database).
- Profile only objects that are not profiled yet -- Enable data profiling only on imported objects which have not yet been profiled.
- Data Classification -- Enable data classification.
- Hide data using Sensitivity Label -- The selected sensitivity label will be applied to all newly imported objects in the scope (in order to hide them).
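The Random select method mentioned above refers to reservoir sampling, the classic one-pass technique for drawing a uniform sample of k rows from a stream of unknown length. A minimal sketch (Algorithm R); this is the general technique, not the product's actual implementation.

```python
import random

def reservoir_sample(rows, k, seed=0):
    """Draw a uniform sample of k rows in a single pass (Algorithm R).
    Sketch of the general technique behind the 'Random' select method."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)       # replace with decreasing probability
            if j < k:
                sample[j] = row
    return sample

sample = reservoir_sample(range(100000), 10)
```

Unlike Top, which simply reads the first k rows, this visits every row once but never needs to know the total row count in advance.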
In addition, there are inferred sensitivity labels: when you apply a sensitivity label to an imported object, e.g., a column, all the imported objects "downstream" in the data flow lineage will be given at least that level of sensitivity as "Sensitivity Label Lineage Proposed". This means you will see automatic sensitivity label tagging by inference across the enterprise architecture. As with "Sensitivity Label Data Proposed", the "Sensitivity Label Lineage Proposed" can be rejected, thereby stopping the propagation of inferred sensitivity labels in that data flow direction. Note also that propagation of the inferred sensitivity level stops at any data masking discovered within the ETL/DI/Script imports involved in that data flow.
Data Sampling and Profiling Options on Subsets of a Model
You may override the Data Select Method on a subset of the imported model.
You must first import the model once, so you have the metadata structure of the data source.
Then, you may navigate to the object page of a schema or table or file in the imported model and specify the Data Select Method for that object.
In addition, you may specify a Data Select Query on the model subset.
Steps
- Sign in as a user with at least the Metadata Management capability object role assignment on the Configuration you are in and on the model for which you wish to specify options.
- Navigate to the object page for the model subset.
- Update the two settings:
  - Update the Data Select Method, with a choice of the fast method Top (the default) vs. Random (reservoir sampling when available on the database).
  - Update the Data Select Query, the SQL SELECT statement that should be used when sampling and profiling.
- Schedule or manually invoke the sampling and profiling.
Example
Sign in as Administrator and navigate to the Staging DW.dbo.Customer table.
Update the two settings:
Schedule or manually invoke the sampling and profiling.
Scheduling Data Sampling and Profiling
The profiling process can take a lot of resources and time. You can schedule the process to run periodically and limit it either by maximum duration (e.g., 2h) or by amount of data (e.g., 300 MB). The bridge saves profiling results in the MIMB cache as soon as possible. When it profiles multiple files, it saves the profiling results of each file as soon as they are ready. If the bridge fails, it can restart at the latest point available in the cache. When the bridge detects that a file did not change (e.g., the checksum is the same), it simply updates the profile time for the file.
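The checksum-based skip described above can be sketched as follows. The cache structure and function names are hypothetical, and MD5 is an illustrative choice of checksum, not necessarily the bridge's algorithm.

```python
import hashlib

cache = {}  # file name -> checksum of the version already profiled

def file_checksum(data: bytes) -> str:
    """Content checksum used to decide whether a file changed
    since the last profiling run (illustrative: MD5)."""
    return hashlib.md5(data).hexdigest()

def needs_profiling(name: str, data: bytes) -> bool:
    """True when the file content changed or was never profiled;
    unchanged files only get their profile timestamp refreshed."""
    digest = file_checksum(data)
    if cache.get(name) == digest:
        return False
    cache[name] = digest
    return True

first = needs_profiling("Employee.csv", b"id,name\n1,Ann\n")
second = needs_profiling("Employee.csv", b"id,name\n1,Ann\n")
```

The first call profiles the file and records its checksum; the second call sees an identical checksum and skips the expensive profiling work.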
The bridge tries to return as much as possible to the caller when it:
- Completes
- Fails
- Reaches the time or volume limit
Steps
- See Manage Schedules for guidance on scheduling an operation.
- Specify the model to sample as the OBJECT.
- Select Data sampling and profiling as the OPERATION.
- Here you may specify all the profiling and sampling options.
Example
Sign in as Administrator and go to MANAGE > Schedules. Click create new. Provide "S and P" as the NAME. Specify the model to sample as the OBJECT. Select Data sampling and profiling as the OPERATION. Specify the other profiling and sampling options.
Data Sampling and Profiling On Demand
You may initiate or refresh the data profiling and sampling metadata at most levels of an imported model, including:
- Tables/Files/Views (Classifier)
- Schema/Package
- Model
- File System Folder
Steps
- Go to the object page of the object for which you wish to generate or refresh the data sampling and profiling.
- Go to More Actions > Generate Data Sampling and Profiling.
- Here you may specify all the profiling and sampling options.
- Click OK and the operation will start.
Example
Search for the Employee.csv file and go to its object page.
Go to More Actions > Refresh data sampling and profiling.
Click OK and the operation will start. After it has finished, the sampling and profiling information is shown.
Data Sampling and Profiling Following Harvesting
Please see details in the import metadata section.
Database Specific Sampling and Profiling Considerations
Data sampling and data profiling is available on all dedicated database bridges (e.g., Oracle, SQL Server, Hive, Teradata, etc.). It is implemented by automatically including the following options in the Miscellaneous bridge parameter:
Miscellaneous = -dp 1000 -dps 10
This configures data profiling based upon 1000 rows and sampling of 10 rows.
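The Miscellaneous value is just a flat string of flags and values; a toy parser makes the encoding explicit. This parser is purely illustrative (the real bridges accept many more options than -dp and -dps).

```python
def parse_misc(misc: str) -> dict:
    """Extract the -dp (rows to profile) and -dps (rows to sample)
    flags from a Miscellaneous option string. Simplified sketch;
    real bridges support many additional options."""
    tokens = misc.split()
    opts = {}
    # pair each token with its successor; keep only known flags
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-dp", "-dps"):
            opts[flag] = int(value)
    return opts

opts = parse_misc("-dp 1000 -dps 10")
```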
However, the generic database JDBC import (e.g., HP Vertica) cannot predict the proper query to access data, and therefore does not support data profiling and sampling out of the box. The user may nevertheless try by specifying the special option in the Miscellaneous bridge parameter, as follows:
Miscellaneous = -dp.query-template
Examples of query templates for specific databases:
HP Vertica:
SELECT @column FROM @schema.@table LIMIT @recordsLimit
IBM DB2 iSeries (AS/400):
SELECT @column FROM @schema.@table FETCH FIRST @recordsLimit ROWS ONLY
Microsoft SQL Server:
SELECT TOP @recordsLimit [@column] FROM [@schema].[@table]
Oracle database:
SELECT "@column" FROM "@schema"."@table" WHERE ROWNUM <= @recordsLimit
Oracle MySQL:
SELECT `@column` FROM `@schema`.`@table` LIMIT @recordsLimit
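Each template above parameterizes the same four placeholders. A small substitution sketch shows how a bridge might turn a template into a concrete query; the placeholder names come from the examples above, but the rendering function itself is hypothetical.

```python
def render_template(template: str, schema: str, table: str,
                    column: str, records_limit: int) -> str:
    """Fill the @schema/@table/@column/@recordsLimit placeholders of a
    -dp.query-template string. Illustrative sketch of the substitution."""
    return (template
            .replace("@recordsLimit", str(records_limit))
            .replace("@column", column)
            .replace("@schema", schema)
            .replace("@table", table))

sql = render_template(
    "SELECT @column FROM @schema.@table LIMIT @recordsLimit",
    schema="public", table="customers", column="name", records_limit=1000)
```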
Data profiling and sampling requires access to the data in the data store. Thus, you must specify a user name / password with the appropriate access rights. For example:
- SAP HANA database:
GRANT the SELECT privileges to a role the user has.
https://help.sap.com/viewer/4fe29514fd584807ac9f2a04f6754767/2.0.03/en-US/20f674e1751910148a8b990d33efbdc5.html
Steps
- Specify profiling information when harvesting the metadata source.
- Execute the harvest.
Example
Table or file sample data
Log in as a user with the Data Viewing capability object role assignment for the Data Lake.
View the sample data records for the Employee.csv file on its object page in the Sample Data tab.
Sign in as Dan, who is a Data Lake Viewer:
The SSN field is obfuscated, because the US Social Security Number data class has a Default Sensitivity of Classified and the Classified sensitivity label has the Hide Data flag set.
Field or column profiling data
Now, log in as a user with the Data Management capability object role assignment for the Data Lake or one that is a member of the Administrators or Data Administrators group.
Go to the Employee.csv file object page and then the Fields tab.
Click on the Gender field to go to its object page.
Data classes
A field or column may be assigned one or more data classes. E.g., the field in the example above is assigned the data class Gender. Once that data class is assigned, it is a property of the element and may be searched on, filtered by, and further edited (e.g., the assignment may be removed).
Some actions can apply to all objects of a certain data class, in particular the Hide/Show behavior referenced above.
You may manage the list of data classes with proper admin privileges.
See data classification for more details.
Removing Data Sampling and Profiling Metadata
You may remove the data profiling and sampling metadata. This may be done at most levels of an imported model, including:
- Tables/Files/Views (Classifier)
- Schema/Package
- Model
- File System Folder
Steps
- Go to the object page of the object for which you wish to remove the data sampling and profiling.
- Go to More Actions > Remove data sampling and profiling.
- The operation will start.
Example
Search for the Employee.csv file and go to its object page.
Go to More Actions > Remove data sampling and profiling.
After the operation has finished, the sampling and profiling information is no longer shown.
Data Profiling Properties
The MetaKarta repository, API and UI support the following profiling details (row counts):
- Count: number of rows actually profiled, which is either the total number in the source or the limit set when defining the harvesting options
- Null rows: rows which are null
- Distinct rows: number of distinct values for this field (non-distinct = total - distinct - empty). For example, when there are one million rows and the column has far fewer (e.g., 10) distinct values, the data is considered to be distinct.
- Duplicate rows: rows with identical values for this field
- Valid rows: rows with valid contents for this field
- Empty rows: null in database or empty in files
- Invalid rows: rows without valid contents for this field. Validity depends upon the data type that has been auto-detected for the column. For example, if the first column was identified as an INTEGER data type but the value in the last record contains the value "a", which is not a valid INTEGER, it would contribute to the invalid counter.
- Average length: average of the lengths of each value profiled
- Min length: lowest of the lengths of each value profiled
- Max length: highest of the lengths of each value profiled
- Min value: lowest value
- Max value: highest value
- Values [value, rows]: distribution of values and their frequency
- Patterns [pattern, rows]: list of the different patterns of data presentation discovered in the source and their frequency
- Data Types [type, rows]: list of data type matches and their frequency. This is the column data type detected by the profiler. When a column has data of different data types, the profiler picks the most used one. You can override the value manually. The value may contradict the data type declared by the database; for example, when a VARCHAR database column contains only date values, the profiler sets the DATE data type. Here is the list of supported types:
  - Text
  - Date
  - Time
  - DateTime
  - Geographical
  - Number (with percentile, mean, median, variance, and standard deviation statistics)
- Inferred Data Type: inferred data types after data profiling the object
- Data classes: list of data classes matched and likelihood as a percentage
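The interplay of the Data Types and Invalid rows details above can be sketched as follows: classify each value, let the most frequent type win, and count the values that do not fit it as invalid. The two-type classifier here is a deliberately minimal stand-in for the profiler's real type detection.

```python
from collections import Counter

def infer_type(value: str) -> str:
    """Classify one value; minimal stand-in for the profiler's
    type detection (only Number vs Text here)."""
    try:
        float(value)
        return "Number"
    except ValueError:
        return "Text"

values = ["1", "2", "3", "a"]
type_counts = Counter(infer_type(v) for v in values)
dominant = type_counts.most_common(1)[0][0]   # most frequent type wins
invalid = sum(1 for v in values if infer_type(v) != dominant)
```

With three integers and one letter, the column is detected as Number, and the stray "a" lands in the invalid counter, exactly as in the INTEGER example above.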
Data Sampling and Profiling Sources
MetaKarta allows you to profile the data of database and file system models. See harvesting metadata for details.
Data Sampling and Profiling Process Considerations
When you specify the profiling settings, MetaKarta recursively profiles all tables in the import scope (MQL SUBSET SCOPE and object from which you are starting if performing manually) and collects sample rows. Data profiling and metadata import processes share the datastore connectivity and scope details. You do not need to configure the data profiling connectivity explicitly.
The data profiling process collects information that contains data values. It makes an effort to protect this information, showing it to authorized users only and encrypting it in its database. See Showing and Hiding.
Generic profiling statistics, like "% of distinct values" are available to all users that can view the content.
Showing and Hiding Sample Data and Profiling Results
Sampled data and the profiling results are hidden from most users by default. One must be assigned the Data Viewing or Data Owner capability object role assignment for the model in question.
One may mask or hide this information even from a Data Viewer. To do so, one must do one of the following:
- Have the Data Classification Editing capability object role assignment for that metadata and assign a sensitivity label which has data hiding.
- Classify imported objects with a data class that has a sensitivity label assigned to it which has data hiding.
- Specify that all imported objects in a model are automatically hidden by default by way of a sensitivity label.
There are, however, inferred sensitivity labels: when you apply a sensitivity label to an imported object, e.g., a column, all the imported objects "downstream" in the data flow lineage will be given at least that level of sensitivity as "Sensitivity Label Lineage Proposed". This means you will see automatic sensitivity label tagging by inference across the enterprise architecture. As with "Sensitivity Label Data Proposed", the "Sensitivity Label Lineage Proposed" can be rejected, thereby stopping the propagation of inferred sensitivity labels in that data flow direction. Note also that propagation of the inferred sensitivity level stops at any data masking discovered within the ETL/DI/Script imports involved in that data flow.
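The downstream propagation described above is essentially a breadth-first walk of the lineage graph. A minimal sketch, assuming a simple edge map and integer sensitivity levels (the graph, names, and levels are hypothetical; real propagation also honors rejections and discovered data masking, which this sketch omits):

```python
from collections import deque

def propose_labels(edges, source, level):
    """Propagate a sensitivity level downstream through data-flow
    lineage edges, mimicking 'Sensitivity Label Lineage Proposed'."""
    proposed = {}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for target in edges.get(node, []):
            # each downstream object gets at least the source's level
            if proposed.get(target, -1) < level:
                proposed[target] = level
                queue.append(target)
    return proposed

# hypothetical lineage: staging column feeds a warehouse column feeds a report
lineage = {"stg.SSN": ["dw.SSN"], "dw.SSN": ["rpt.SSN"]}
proposed = propose_labels(lineage, "stg.SSN", level=3)
```

Labeling the staging column thus proposes the same level on every object downstream of it.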
Data Sampling and Profiling Technical Details
MetaKarta reuses the model bridge infrastructure and metamodel for data profiling. Database and file system bridges provide "concealed" support for data profiling. They run in the metadata import mode by default. You can run them in the profiling mode by specifying dedicated Miscellaneous options.
When the bridges are running in the metadata mode, they import not only basic structural details, like tables and columns, but also advanced details, like keys and indexes. When they are running in the profiling mode, they import the same basic structural details to carry profiling statistics (e.g. UDPs on MIR Attribute). This allows MetaKarta to integrate the profiling statistics into already loaded metadata using the basic structure.
The bridges use the data profiling library. The library is derived from and depends on the open source Talend data quality library. When the bridge runs in the metadata mode, it does not depend on the data profiling library.
The bridge uses two queries for data sampling/profiling:
- The first query, if the count of rows is less than 100 000:
SELECT * FROM TableName DISTRIBUTE BY rand() SORT BY rand() LIMIT 100
- The second query, if the count of rows is greater than or equal to 100 000:
SELECT * FROM TableName TABLESAMPLE(n PERCENT)
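The choice between these two queries can be sketched as a simple threshold check. The 100 000-row cutoff and the query shapes come from the text above; the function itself, and its limit and percent defaults, are illustrative.

```python
def sampling_query(table: str, row_count: int, limit: int = 100,
                   percent: int = 1) -> str:
    """Pick between the two sampling queries based on the table's
    row count, following the 100 000-row threshold described above."""
    if row_count < 100_000:
        # small table: fully shuffle and take a fixed number of rows
        return (f"SELECT * FROM {table} "
                f"DISTRIBUTE BY rand() SORT BY rand() LIMIT {limit}")
    # large table: let the database sample a percentage of blocks/rows
    return f"SELECT * FROM {table} TABLESAMPLE({percent} PERCENT)"

small = sampling_query("Customer", 5_000)
large = sampling_query("Customer", 2_000_000)
```

Shuffling with rand() is affordable only on small tables; TABLESAMPLE delegates the sampling to the database engine, which scales to very large tables.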