Data classification is one of the cornerstones of data governance. It allows you to classify harvested data elements or objects with your shared terminology, represented as glossary terms. These relationships, or data classifications, give the harvested objects business names and definitions. Data classification may also help you find hidden relationships between harvested objects.
Your data estate can have many millions of harvested objects. In general, it is not practical to classify them all manually, i.e., one by one. MetaKarta helps you automate the process of data classification using its data profiling technology and data classes.
Data vs. Metadata (or Semantic) classification
Of special importance is Personally Identifiable Information (PII). Automated and complete identification and classification of the PII objects in the data is fundamental to data governance. In many cases, data classes may be used to profile the data and match the criteria by which PII may be isolated. In this case, it is the actual data that is analyzed and used to classify the data element. This feature is referred to as data-detected data classification. Data-detected data classification is good at detecting common data patterns automatically but less focused on providing definitions.
Some PII harvested objects, like Maiden Name and Date of Birth, do not have unique data patterns and cannot be discovered using data-detected data classification. MetaKarta helps you identify these types of harvested objects using their metadata instead. This feature is referred to as metadata-detected data classification. Metadata-detected data classification is good at providing authoritative and common definitions. It is more flexible but less precise than data-detected data classification.
A harvested object can have multiple metadata-detected data classifications (relationships with business terms). It can have one definition or data-detected data classification and many other semantic or metadata-detected data classifications. That is, you may classify several harvested objects that have different data types and patterns with the same business term.
A harvested object can have multiple proposed, approved and assigned data classifications (relationship with data classes). MetaKarta encourages users to be as precise as possible with the data classifications and strive for a harvested object to have one approved or assigned data classification.
In addition, in terms of semantic lineage, MetaKarta uses data and metadata-detected data classifications to implement lookups of the inferred definition and related elements.
Manage Data Classes
You may manage the data-detected and metadata-detected classes from the list of classes.
Steps
-
Go to MANAGE > Data Classes in the banner.
-
The list of data classes is presented.
-
You may also search by Name or Description.
Example
Sign in as Administrator and go to MANAGE > Data Classes.
Enter "Belgium" in the Search box.
The search text may match text in both the Name and Description.
Add a Data Class
Steps
-
Sign in as a user with at least the Application Administrator capability global role assignment.
-
Go to MANAGE > Data Classes in the banner.
-
Click the plus sign to Add a new data class.
-
Specify the Type, which in this case is Data.
-
Specify the Name.
-
Pick one or more Groups that this data class is to be a member of.
You may classify the data of a model by group.
-
Specify the Description.
-
Click Save.
Example
Sign in as Administrator and go to MANAGE > Data Classes.
Click the Add plus sign. Click Data as the Type of data class and enter
-
"Product Number Pattern" as the NAME
-
"Product" in the GROUPS
-
"General regular expression for product number classification" as the DESCRIPTION
Click OK.
Edit a Data Class
The options differ depending upon the Type of data class you are working with. However, for all types, there are common actions and properties.
Steps
-
If the data class does not yet exist, add the data class.
-
Select the status of the data class:
-
Enable: This data class will be included in the next data classification operation.
-
Disable: This data class will not be included in the next data classification operation.
Even though a data class is disabled, it may still be manually assigned and unassigned from an imported object.
-
Edit the following text fields:
-
NAME: Name of the data class
-
DESCRIPTION: Primary text description of the data class
-
Pick one or more data CLASSIFICATION GROUPS this data class is to be a member of.
You may classify the data of a model by group.
-
Pick a TERM from a glossary this data class will be associated with. This term will be used as part of the semantic lineage just as one that is semantically mapped or classified.
-
Select DEFAULT SENSITIVITY: specifies the Sensitivity Label that will be assigned to any data element that is classified by this data class. Sensitivity Label assignment can control the hiding of data profiling and sampling information on the object page, even for data viewers.
When a harvested object has data profiling and sampling information, its object page shows that information by default if you have the Data Viewer capability object role assignment. However, when a harvested object has a proposed or assigned data class resulting in a Sensitivity Label that has the HIDE DATA flag, its data profiling and sampling information is not shown on the object page for data viewers.
When the Data Hide attribute of the harvested object is set to True, it ignores the Data Hide flag of its data classes.
-
Select AUTO LEARNING to allow the data class to be auto-populated with a pattern based upon existing imported objects.
-
Click SAVE.
Example
Sign in as Administrator and go to MANAGE > Data Classes.
Enter "Product" in the Search box.
Click the line for the Product Number Pattern class.
Enter the pattern "AA-A999" in the DATA PATTERN box. Select "20" in the MATCHING THRESHOLD (%) box. Click SAVE.
Data-detected Data Classes
A data class is defined to identify a data pattern. You can define the data pattern manually or ask the application to learn it automatically from the data and approval actions.
A field or column may be assigned one or more Data-detected and/or Metadata-detected Data Classes. Once that class type is assigned, it is a property of the element and may be searched on, filtered by and further edited (remove the assignment).
E.g., the field in the example above is assigned the data class Gender.
Data classes are based upon a pool, defined repository-wide. This pool includes a unique name for the class and either:
-
Enumeration - list of valid values, e.g., Red, Blue, Green.
-
Pattern - list of possible patterns, generally discovered by the software, e.g., A{2}9{3}-9{3}-9{3}
-
Regular Expression - syntactical rule set.
These are used to infer data classification for objects based upon sampling and profiling the data.
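The three kinds of matching rule can be pictured with a short Python sketch. It is illustrative only: the function names are hypothetical, and the reading of the pattern notation (A for one letter, 9 for one digit, {n} for a repeat count, anything else literal) is an assumption drawn from the example pattern above, not a statement of how MetaKarta parses patterns.

```python
import re

def pattern_to_regex(pattern: str) -> str:
    """Translate pattern notation such as A{2}9{3}-9{3}-9{3} into a regular expression.

    Assumed notation (not confirmed by the product documentation):
    A = one letter, 9 = one digit, {n} = repeat count, other characters literal.
    """
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "A":
            parts.append("[A-Za-z]")
        elif ch == "9":
            parts.append(r"\d")
        elif ch == "{":
            close = pattern.index("}", i)
            parts.append(pattern[i:close + 1])  # copy the repeat count unchanged
            i = close
        else:
            parts.append(re.escape(ch))  # treat everything else as a literal
        i += 1
    return "".join(parts)

def matches_enumeration(value: str, allowed: set) -> bool:
    """Enumeration rule: the value must be one of the listed values."""
    return value in allowed

def matches_pattern(value: str, pattern: str) -> bool:
    """Pattern rule: the whole value must have the described shape."""
    return re.fullmatch(pattern_to_regex(pattern), value) is not None

def matches_regex(value: str, expression: str) -> bool:
    """Regular expression rule: the expression must match the value."""
    return re.search(expression, value) is not None

colour_ok = matches_enumeration("Red", {"Red", "Blue", "Green"})
shape_ok = matches_pattern("AB123-456-789", "A{2}9{3}-9{3}-9{3}")
```

Under these assumptions, a value such as AB123-456-789 fits the pattern A{2}9{3}-9{3}-9{3}, while AB12-456-789 does not because the first group has only two digits.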
Some actions can apply to all objects of a certain data class, in particular the Hide/Show property.
Edit a Data-detected Data Class
Steps
-
If the data class does not yet exist, add the data class.
-
You may edit all the properties in common for a data class.
You may not edit the Type after it has been set. You must create a new data class instead.
-
Set the MATCHING THRESHOLD to specify the minimum percentage of values (among all values of that field/column) that must match any of the enumeration values, patterns or regular expressions.
-
Set the UNIQUENESS THRESHOLD to specify the minimum number of unique values among all values (of that field/column), to require enough diversity in the data set.
By default, the UNIQUENESS THRESHOLD is set to 1 on enumerations (and limited to the maximum number of enumeration values) and set to 6 otherwise.
-
Enter the DATA PATTERN, which may be one of the following:
-
Enumeration: a list of values for the data to match.
-
Pattern: Patterns for the data to match.
-
Regular Expression: RegEx format expression for the data to match.
-
Click SAVE.
Usage
To understand these settings, consider an all-women's college student database that has thousands of rows, all with Female in the Gender column. In this case, the UNIQUENESS THRESHOLD should be set to 1 to match the Gender data class.
The International Gender enumeration data class has Male and Female values in different languages. When the customer has a column that uses Male and Female values in one language the application will match it with confidence less than 100% because of other languages. It is recommended that you use "International" data classes with care and employ them only when you have truly multilingual columns. Otherwise, you should define a data class for each language used and group them in an "International" compound data class. For example:
-
English Gender (enumeration): Male, Female
-
French Gender (enumeration): Mâle, Femelle
-
International Gender (compound): English Gender, French Gender
When the matching rule is Enumeration and the number of its possible values is less than the number specified in the UNIQUENESS THRESHOLD, the application uses the number of possible values as the UNIQUENESS THRESHOLD.
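The interplay of the two thresholds can be sketched in Python. This is a simplified model, not the product's algorithm: the function names are hypothetical, the match percentage is taken as the share of sampled values that satisfy the rule, and treating the matching threshold as inclusive is an assumption based on the word "minimum" in the setting description.

```python
def effective_uniqueness_threshold(configured: int, enumeration=None) -> int:
    """Cap the threshold at the number of enumeration values, as described above."""
    if enumeration is not None:
        return min(configured, len(enumeration))
    return configured

def column_matches(values, is_match, matching_threshold_pct,
                   uniqueness_threshold, enumeration=None):
    """Return (matched, match_pct) for the sampled values of one column."""
    if not values:
        return False, 0.0
    match_pct = 100.0 * sum(1 for v in values if is_match(v)) / len(values)
    distinct = len(set(values))  # unique values among all sampled values
    needed = effective_uniqueness_threshold(uniqueness_threshold, enumeration)
    matched = match_pct >= matching_threshold_pct and distinct >= needed
    return matched, match_pct

# The all-women's college example: every row holds the same value.
gender = {"Male", "Female"}
college = ["Female"] * 1000

with_threshold_1 = column_matches(college, lambda v: v in gender, 80, 1, gender)
with_threshold_6 = column_matches(college, lambda v: v in gender, 80, 6, gender)
```

With a uniqueness threshold of 1 the column matches. With a configured threshold of 6, the cap lowers the requirement to 2 (the number of enumeration values), which a column holding only Female still cannot meet, so in this model the class is not proposed even though every value matches.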
Example
Sign in as Administrator and go to MANAGE > Data Classes.
Enter "Product" in the Search box.
Click the line for the Product Number RegEx class.
Click the Regular Expression radio button and enter "^\D{2}-\d{4}$" as the first line in the DATA PATTERN box. Select "20" in the MATCHING THRESHOLD (%) box. Click SAVE.
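To get a feel for what this regular expression accepts, you can try it outside the product. The sketch below uses Python's re module and made-up sample values. It assumes the product's regular expression engine treats these particular constructs (\D, \d, {n}, ^ and $) the same way, which is common across engines but is not confirmed here.

```python
import re

# The expression from the example: two non-digit characters, a hyphen, four digits.
PRODUCT_NUMBER = re.compile(r"^\D{2}-\d{4}$")

# Hypothetical sample values, not taken from any real data set.
samples = ["AB-1234", "ab-0001", "#!-1234", "A1-1234", "AB-12345", "AB1234"]

results = {value: bool(PRODUCT_NUMBER.match(value)) for value in samples}

for value, ok in results.items():
    print(f"{value:10} {'match' if ok else 'no match'}")
```

Note that \D means "any character that is not a digit", so punctuation such as #! is accepted in the first two positions. If only letters should be allowed there, an expression such as ^[A-Za-z]{2}-\d{4}$ would be stricter.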
Metadata-detected Data Classes
The data classification process tries to detect and match unique formats of harvested objects. Some harvested objects have nothing unique or detectable about their data. Harvested objects of type DATE or BLOB are good examples. In this case, you can try to identify similar harvested objects using metadata-detected classes.
A metadata-detected class matches harvested objects by their metadata attributes, like name. For example, we can try to classify date of birth columns by their data type, DATE, and a name that contains DOB.
These date of birth columns can contain PII. Customers can play it safe by marking any of these columns as PII and instructing the application to hide their data. Data-detected and metadata-detected classes share the same PII and Data Hide infrastructure.
A column can have DOB name and DATE data type but have nothing to do with date of birth columns (e.g. date of bankruptcy). You can approve and reject a matched Metadata-detected class the same way you can do with a Data Class.
A Metadata-detected class matches objects by their attributes using a Metadata Query Language (MQL) query.
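As a mental model only, the date of birth rule described above can be written as a plain Python predicate. In the product the rule is an MQL query. The Column type, the column names and the case-insensitive comparison below are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Column:
    name: str
    data_type: str

def looks_like_date_of_birth(column: Column) -> bool:
    """Match on metadata only: DATE data type and a name containing DOB."""
    return column.data_type.upper() == "DATE" and "DOB" in column.name.upper()

# Hypothetical columns; DOB_FILING stands for a date of bankruptcy filing.
columns = [
    Column("CUSTOMER_DOB", "DATE"),
    Column("DOB_FILING", "DATE"),
    Column("DOB", "VARCHAR"),
    Column("HIRE_DATE", "DATE"),
]

proposed = [c.name for c in columns if looks_like_date_of_birth(c)]
```

The rule proposes both CUSTOMER_DOB and DOB_FILING. The second is the kind of false positive mentioned above, which is why a proposed metadata-detected class can be rejected for an individual column.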
Data-detected data classification is an operation that users start explicitly. In contrast, MetaKarta performs the metadata-detected data classification automatically each time you import a harvested model. Harvested metadata is static (it cannot be changed between imports). You can change Metadata-detected classes and decide to rerun metadata-detected data classification on the whole repository.
You do not need to invoke metadata-detected data classification. Instead, the application proposes new matching metadata-detected data classes that were not rejected before either upon harvesting of a model or upon update of a metadata-detected data class.
You may invoke it manually when you wish to confirm that it has been performed.
Edit a Metadata-detected Data Class
Steps
-
If the data class does not yet exist, add the data class.
-
You may edit all the properties in common for a data class.
You may not edit the Type after it has been set. You must create a new data class instead.
- Set the QUERY to specify a Metadata Query Language (MQL) query that must be met before a data object that is being classified is associated with this data class.
You may enter the query by hand, copy the MQL from a worksheet, or use the EDIT button to query by example through a worksheet-like dialog.
- Click SAVE.
Example
Sign in as Administrator and go to MANAGE > Data Classes.
Click the Add plus sign. Click Metadata as the Type of data class and enter
-
"Product Number Query" as the NAME
-
"Product" in the GROUPS
-
"General metadata-detected data class for product numbers" as the DESCRIPTION
Click EDIT to build a query and enter "Product Number" in the Search text box.
There are a number of false positives because the assumed condition is "Product" OR "Number" and "Number" has a huge number of hits.
Let's force an AND condition.
Remove "Number" from the Search text box and press Enter. Then click ADVANCED.
Simply copy the MQL text, type " AND ", and then paste the MQL text (again) after that.
Replace the second 'Product' with 'Number', click EXECUTE, and you should have:
text = 'Product' WITHIN ('Name', 'Physical Name') AND text = 'Number' WITHIN ('Name', 'Physical Name')
Click OK. Then SAVE.
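If you maintain several metadata-detected data classes of this kind, the copy, paste and replace routine from the example can be scripted before pasting the result into the QUERY box. The helper below is hypothetical and only reproduces the query shape shown in this example. It makes no claim about MQL grammar beyond that shape and does not escape quotes inside the words.

```python
def mql_all_words(words, attributes=("Name", "Physical Name")):
    """Build one 'text = ... WITHIN (...)' clause per word and join them with AND."""
    attribute_list = ", ".join(f"'{a}'" for a in attributes)
    clauses = [f"text = '{word}' WITHIN ({attribute_list})" for word in words]
    return " AND ".join(clauses)

query = mql_all_words(["Product", "Number"])
print(query)
```

The printed text is the same query that the example arrives at by hand.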
Compound Data Classes
Compound data classes are defined as a collection of data and metadata-detected data classes. If any one of those data classes is a match to an object (that is being programmatically classified) then the compound type is a match.
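The "any one member matches" rule can be sketched as follows. The two member functions are simplified stand-ins for a data-detected and a metadata-detected data class. Their names, the dictionary layout of the object and the matching logic inside them are assumptions for illustration.

```python
import re

def pattern_member(obj: dict) -> bool:
    """Stand-in for a data-detected class: every sampled value has the expected shape."""
    values = obj.get("samples", [])
    return bool(values) and all(re.fullmatch(r"[A-Z]{2}-\d{4}", v) for v in values)

def query_member(obj: dict) -> bool:
    """Stand-in for a metadata-detected class: the name mentions product and number."""
    name = obj["name"].lower()
    return "product" in name and "number" in name

def compound_matches(members, obj: dict) -> bool:
    """A compound data class matches when any one of its member classes matches."""
    return any(member(obj) for member in members)

product_number = [pattern_member, query_member]

by_name = compound_matches(product_number, {"name": "ProductNumber", "samples": []})
by_data = compound_matches(product_number, {"name": "Code", "samples": ["AB-1234"]})
neither = compound_matches(product_number, {"name": "Code", "samples": ["x"]})
```

Here the first object matches through its name alone and the second through its data alone, which is the benefit of combining both kinds of class in one compound.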
Edit a Compound Data Class
Steps
-
If the data class does not yet exist, add the data class.
-
You may edit all the properties in common for a data class.
You may not edit the Type after it has been set. You must create a new data class instead.
- Select the existing data classes which are to be included in the compound data class.
- Click SAVE.
Example
Sign in as Administrator and go to MANAGE > Data Classes.
Click the Add plus sign. Click Compound as the Type of data class and enter
-
"Product Number" as the NAME
-
"Product" in the GROUPS
-
"Compound data class with data pattern and metadata query" as the DESCRIPTION
Click OK. Then pick the two other Product Number data classes.
Click SAVE.
Delete a Data Class
-
Click the line with the data class.
-
Click the Delete (x) icon.
-
Click OK.
Data Classification Groups
You may associate data class groups with data classes. You may do so when editing a data class.
Data Classification
Once you have data classes defined, you may apply these to harvested data elements:
-
Manually: You may do this through the object page or a worksheet and even in bulk for data-detected data classes, and on the manage data classes page for metadata-detected data classes.
-
Programmatically: You may invoke a data classification process where data classes are proposed based upon the patterns and metadata queries defined for the different data classes.
Written another way:
-
Metadata-detected data classes are used to auto-tag (propose classifications); this can also be performed manually on demand.
-
Data-detected data classes are used to auto-tag (propose classifications) as a side effect of import, but only if the data classification import option is checked.
-
Compound data classes may consist of both metadata-detected and data-detected data classes and thus may be used to auto-tag (propose classifications) through any of the above processes.
After there are data classes proposed, you may also approve or remove particular assignments.
Data Classification Learning Methodology
MetaKarta provides machine learning and a data class inference system centered around learning from the activities you perform, as well as continuing to learn from users accepting and rejecting inferred semantic types, by the following:
-
Automatic Data Classification uses Sample and Profiling data to assign "class" values (formerly semantic types) to data columns to identify what kind of data these columns contain.
-
You can instruct MetaKarta to classify an object, model, or folder for the first time or again.
-
You can accept or reject inferred data classes or add existing or new classes. You can specify/accept multiple data classes per column.
-
The application remembers your data classification decisions and uses them to improve classification suggestions in the future.
Any Learning algorithm for data classification will have a data-driven origin. Therefore, MetaKarta captures as much information associated with the classes as possible. Given sensitivity, the matching ratio controls the data classification algorithm, which you can adjust with the "learning" index according to the predefined weight.
Data Class Proposal and Approval Process
Data classification auto-tagging proposal:
-
For data-detected data classes, when the confidence level is higher than the MATCHING THRESHOLD specified for that data class the application proposes to classify the harvested object with the data-detected data class (e.g. Country Code (98%)). You can accept or reject the proposal.
-
For metadata-detected data classes, when the associated Metadata Query Language (MQL) query produces the harvested object as a match the application proposes to classify the harvested object with the metadata-detected data class (e.g. Maiden Name). You can accept or reject the proposal.
-
For compound data classes, when either of the two above conditions applies to any of the contained data classes the application proposes to classify the harvested object with the compound data class (e.g. PII). You can accept or reject the proposal.
When you accept the proposal the application creates the "classifies" relationship between the data class and harvested object. The application creates the same relationship when you assign a data class to a harvested object manually.
When you reject the proposal the application remembers it and does not propose the match from then on. If you rejected the match by mistake you can instruct the application to forget about the rejection by classifying the column with the term manually.
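For data-detected data classes, the proposal rule above can be sketched as a small Python function. The function name and the input dictionaries are hypothetical, and the confidence values are assumed to have been computed already by profiling. The comparison is strict because the text says "higher than".

```python
def propose(confidences: dict, thresholds: dict, rejected: set) -> list:
    """Return display labels for the data classes to propose for one harvested object.

    confidences: data class name -> confidence percentage for this object
    thresholds:  data class name -> MATCHING THRESHOLD percentage
    rejected:    data class names already rejected for this object
    """
    labels = []
    # Highest confidence first, so the most likely class is listed first.
    for name, confidence in sorted(confidences.items(), key=lambda kv: -kv[1]):
        if name in rejected:
            continue  # remembered rejections are never proposed again
        if confidence > thresholds[name]:
            labels.append(f"{name} ({confidence:.0f}%)")
    return labels

proposals = propose(
    confidences={"Country Code": 98, "Language Code": 85, "Currency Code": 40},
    thresholds={"Country Code": 80, "Language Code": 80, "Currency Code": 80},
    rejected={"Language Code"},
)
```

In this made-up case only Country Code (98%) is proposed: Language Code is above its threshold but was rejected earlier, and Currency Code is below its threshold.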
Invoking Data-detected Data Classification
You may invoke the data-detected data classification process for:
-
An entire model (all objects in the model)
-
An object within a model (and all objects contained within it, e.g., columns in a table)
You do not invoke metadata-detected data classification. Instead, the application proposes new matching metadata-detected data classes that were not rejected before, either on import of a model or on update of a metadata-detected data class, including as the result of importing metadata-detected data classes.
Invoking Data-detected Data Classification of a Model
Once you have your data classes defined and you have imported your model with Generate Data Sampling and Profiling information, you may invoke the data classification process.
Steps
-
Sign in as a user with at least Metadata Editing capability object role assignment for that model you wish to classify.
-
Navigate to the object page of the model.
-
Go to More Actions... and select Data Classification.
An operation is invoked. Once completed all metadata objects which have profiling and sampling data will have data classes proposed for them based upon the matching criteria.
Example
Sign in as Administrator and navigate to the object page of the Data Lake model.
There is also the option to Generate Data Sampling and Profiling. This action will ensure that there is data sampling and profiling information for the data classification process to work with.
Go to More Actions... and select Generate Data Sampling and Profiling.
See data sampling and profiling options details.
Click OK.
Once that process is done, go to More Actions... and select Data Classification.
You may classify by a single group or all groups.
Choose Product as the data CLASSIFICATION GROUP and click OK.
The action kicks off an operation which runs as a separate process.
Search for ProductNumber in the Data Lake model.
The Product Number Query and Product Number data classifications are proposed for this data element. However, again, the Product Number Query was proposed as soon as the metadata-detected data class was defined, as you do not have to invoke metadata-detected data classification. Instead, the application proposes new matching metadata-detected data classes that were not rejected before, either on import of a model or on update of a metadata-detected data class, including as the result of importing metadata-detected data classes.
As assigning both is redundant, you can set Product Number Query to be a virtual data class.
Invoking Data-detected Data Classification of an Object in a Model
Once you have your data classes defined and you have imported your model with data sampling and profiling information, you may invoke the data classification process of individual objects or container objects and all those contained within (e.g., columns in a table).
The process is identical to that for data classification of an entire model, except that you navigate to the object page of the object in the model, rather than the top level of the model.
Invoking Metadata-detected Data Classification
You do not need to invoke metadata-detected data classification. Instead, the application proposes new matching metadata-detected data classes that were not rejected before either upon harvesting of a model or upon update of a metadata-detected data class.
You may invoke it manually when you wish to confirm that it has been performed, by following these steps.
Steps
-
Either:
- For all metadata-detected data classes:
i. Go to MANAGE > Data Classes
ii. Go to More Actions... and select Classify Metadata.
- For specific data classes
i. Go to MANAGE > Data Classes
ii. Select one or more Metadata type data classes in the list
iii. Right-click and select Classify Metadata.
- You may also perform data classification in bulk from a list (worksheet) of feature type objects.
An operation is invoked. Once completed all metadata objects will have data classes proposed for them based upon the matching criteria.
Example
Sign in as Administrator and go to MANAGE > Data Classes. Select one or more Metadata type data classes in the list.
Right-click and select Classify Metadata.
The action kicks off an operation which runs as a separate process.
Metadata-detected Data Classification in Bulk
Editing Data Classifications
Data classification assignments may be assigned manually or automatically proposed to an object and appear in the object's Data Classifications. If automatically proposed, then one may approve or reject the assignment.
Approving the assignment changes the state of that data class assignment to approved, and you may filter by that information in worksheets.
Rejecting the assignment causes the product to remember this action, and future automatic data classification of that object will never assign that same data class to that object, as it was rejected.
To clear this rejection, simply re-assign the data class manually in the object's Data Classifications.
One may simply remove a data class proposal by editing the object's Data Classifications and removing the data class, rather than rejecting it. In this case, the product does not remember this action and future automatic data classification of that object will assign that same data class to that object, as it was not rejected.
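The difference between approving, rejecting and removing can be modelled as a small ledger kept per object. The class below is a hypothetical sketch of the behavior described above, not the product's data model. The state names proposed, approved and assigned follow the wording used in this section.

```python
class ClassificationLedger:
    """Tracks the data class assignments of one harvested object."""

    def __init__(self):
        self.state = {}        # data class name -> "proposed" | "approved" | "assigned"
        self.rejected = set()  # rejections the product remembers

    def propose(self, name: str) -> bool:
        """Automatic proposal; skipped if rejected earlier or already present."""
        if name in self.rejected or name in self.state:
            return False
        self.state[name] = "proposed"
        return True

    def approve(self, name: str) -> None:
        self.state[name] = "approved"

    def reject(self, name: str) -> None:
        """Drop the proposal and remember the decision."""
        self.state.pop(name, None)
        self.rejected.add(name)

    def remove(self, name: str) -> None:
        """Drop the proposal without remembering anything."""
        self.state.pop(name, None)

    def assign(self, name: str) -> None:
        """Manual assignment; also clears an earlier rejection."""
        self.rejected.discard(name)
        self.state[name] = "assigned"

ledger = ClassificationLedger()
ledger.propose("PII")
ledger.reject("PII")
proposed_after_reject = ledger.propose("PII")   # rejection is remembered
ledger.assign("PII")                            # manual assignment clears it
ledger.remove("PII")
proposed_after_remove = ledger.propose("PII")   # removal was not remembered
```

In this model a rejected class is not proposed again until it has been assigned manually, whereas a class that was merely removed is proposed again on the next automatic data classification.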
Steps
-
Sign in as a user with at least Metadata Editing capability object role assignment for that model you wish to classify.
-
Navigate to the object page of the object with the proposed data class.
When editing data classifications in spreadsheet format, you must include the Data Classifications column in order to edit it.
You may also wish to add the Data Classifications Approved, Data Classifications Matched and Data Classifications Rejected columns (These replace the older concept of Semantic Type and Inferred Semantic Type).
Approving Data Classifications
To approve a proposed data class, click the check mark next to the data class.
Rejecting Data Classifications
To reject a proposed data class, click the "X" next to the data class.
Removing Data Classifications
To remove (without rejecting) a proposed data class, DO NOT click the "X" next to the data class, but instead double-click on the data classification editing box, then click again in the box and a pull down is presented where you can add or remove data classes.
Auto Learning Data Patterns
The data classification operation uses the data pattern to match data classes to harvested objects with some confidence. You can approve or reject proposed data classes. When you approve or reject a learning data class, the application absorbs the information and refines its understanding of the data pattern.
The data classification algorithm handles dictionary and regex data patterns. An example of the dictionary data pattern is [red, blue, green]. The pattern applies to a column when it has a smaller number of distinct values than total sampled rows. A phone number column is an example of a column with regex data patterns. In this case, the majority of sampled rows have unique values that share the same data pattern, like NNN NNN-NNNN.
Just as you may enter a pattern for a data-detected data class, the system may learn from what is in the sampling data and suggest patterns based upon what is learned.
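One way to picture how patterns can be derived from sampled values is to reduce each value to its shape and count how often each shape occurs. The sketch below is an illustration only, not the product's learning algorithm. The notation (A for a letter, 9 for a digit, other characters kept as they are) and the function names are assumptions.

```python
from collections import Counter

def shape(value: str) -> str:
    """Reduce a value to its shape: letters become A, digits become 9."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def pattern_shares(values) -> dict:
    """Percentage of sampled values covered by each shape, most common first."""
    counts = Counter(shape(v) for v in values)
    return {p: 100.0 * n / len(values) for p, n in counts.most_common()}

def looks_like_dictionary(values) -> bool:
    """Fewer distinct values than sampled rows suggests a dictionary-style column."""
    return len(set(values)) < len(values)

# Hypothetical samples for a phone number column.
phones = ["555 123-4567", "555 987-6543", "212 555-0100", "n/a"]
shares = pattern_shares(phones)
```

Here three of the four values share the shape 999 999-9999, so that shape covers 75% of the sample. A share of this kind is what a matching threshold can be compared against when deciding which learned patterns to keep.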
Steps
-
Ensure that you have created a data-detected data class with the Auto Learning flag set.
-
Navigate to the object page of the object (data element) you wish to use as a basis to learn from.
You must have already sampled and profiled the data for that object.
-
Assign the data class manually to that object.
-
Later, when you have a good set of patterns, you may invoke data classification on other objects to automatically associate the data class with those other objects.
Example
Sign in as Administrator and create a data-detected data class with the Auto Learning flag set as shown:
Search for ProductNumber and pick the one in the Data Lake > DataCataloging > AdventureWorks Data Lake > Production > Product.csv file. Open its object page Overview tab and manually assign the Product Number Learned Pattern data class to ProductNumber.
Once the data class is assigned to the object, return to the Product Number Learned Pattern data class in MANAGE > Data Classes and note the patterns are learned:
It picks up all the patterns that fit the THRESHOLDS you specified. We can improve (shorten) this list by adjusting them.
The numbers in blue next to the patterns are the matched weight of that particular pattern, with a minimum of 10.
The weight changes according to the following rules:
-
Increment weight by one for each matching possible value or pattern when somebody approves a data class for the column.
-
Decrement weight by one for each matching possible value or pattern when somebody removes previously approved data class from a column. We also automatically remove the pattern or possible value when the weight turns 0.
-
Set weight to 10 for all the current values/patterns when one enables auto learning mode for a data class. This way we treat them as user inputs with the higher weight. It helps to prevent them from being removed from a data class when one rejects the data class from the associated column (see second point above).
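The three weight rules can be sketched as a small Python class. This is a simplified model under stated assumptions: the class and method names are hypothetical, and a pattern that is seen for the first time on approval is assumed to start from a weight of 0 before the increment.

```python
class LearnedPatterns:
    """Weights of the patterns (or possible values) learned for one data class."""

    USER_WEIGHT = 10  # weight given to existing entries when auto learning is enabled

    def __init__(self, weights=None):
        self.weights = dict(weights or {})

    def enable_auto_learning(self) -> None:
        """Treat all current entries as user input with the higher weight."""
        for pattern in self.weights:
            self.weights[pattern] = self.USER_WEIGHT

    def on_approve(self, matching) -> None:
        """A data class was approved for a column: add one per matching pattern."""
        for pattern in matching:
            self.weights[pattern] = self.weights.get(pattern, 0) + 1

    def on_remove_approved(self, matching) -> None:
        """An approved data class was removed: subtract one, drop entries at zero."""
        for pattern in matching:
            if pattern not in self.weights:
                continue
            self.weights[pattern] -= 1
            if self.weights[pattern] <= 0:
                del self.weights[pattern]

learned = LearnedPatterns({"AA-A999": 3})
learned.enable_auto_learning()               # AA-A999 -> 10
learned.on_approve(["AA-A999", "AA-9999"])   # AA-A999 -> 11, AA-9999 -> 1
learned.on_remove_approved(["AA-9999"])      # AA-9999 reaches 0 and is dropped
```

The user-entered pattern survives a later removal because it started from 10, while the pattern that was learned from a single approval disappears as soon as that approval is withdrawn.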
Turn off Auto Learning and clear out the DATA PATTERN box of all but one pattern: AA-A999. Click SAVE.
Finally, change the MATCHING THRESHOLD to 10%, as this pattern only represented 15% of the data, and click SAVE.
We have set up the data class, so it is ready to be used for data classification, rather than learning.
Then, go to the object page of the ProductNumber field in the Product.psv file (NOT the Product.csv file we were working with before).
Run data classification on that object.
The Product Number Learned Pattern data class is now assigned to this field, as well.
Accuracy
With learned data class patterns, as more and more objects suggest a pattern, an accuracy number appears next to the pattern. This allows one to identify which patterns are likely to be more accurate, so one may clean out less accurate patterns.
Import Data Classes
You may import data classes that were exported earlier or from another environment.
There are a number of standard extensions to the basic data classes provided with the product. You may import these from the installation path at /conf/Classification.
Steps
-
Go to MANAGE > Data Classes in the banner.
-
Click IMPORT and select From file.
-
Browse for a file and click OK.
The import action will produce a log and will update and merge, reporting on the number of roles affected.
Export Data Classes
You may export data classes to be imported later or in another environment.
Steps