Data Class Discovery

MetaKarta has a concept of data classes. Data classes may be applied like tags to column-level objects (e.g., columns in a database or fields in a file) to indicate that the object belongs to a particular class, e.g., Social Security Number or Gender. In this way, you can categorize objects by data class and thus identify, sort, and operate on all objects of the same type.

You may manually assign data classes to an object from its object page or when browsing in grid mode. In addition, as part of the harvesting and data profiling process, MetaKarta will suggest data class assignments that may be confirmed and made permanent.

Data classes were referred to as semantic types in the past. With the inclusion of metadata-detected data classes and other improvements, the concept has been generalized into data classes, and all data classification is now based upon them.

Steps

Ensure that you have specified the appropriate data sampling and profiling options before harvesting.
Navigate to the object page for the object you wish to work with.

You may also review and edit data class assignments in grid mode. However, they cannot be assigned in bulk.

MetaKarta will have proposed data classes.
To confirm a proposed data class, click the check.
To reject a data class, click the X.

Rejecting a data class proposal is permanent: it will not be suggested again in future harvests. You may, however, still assign it manually in the future.

To specify a data class that is not currently assigned, click in the box and start typing. A pull-down list with options of valid data classes will be provided to pick from.

Example

Navigate to the object page of the Gender field in the Employee.csv file.

There are two suggested data classes. Confirm the Gender type by clicking the check mark next to that type. Then reject the Civility type by clicking the X next to it.

You will receive a warning that this action is permanent.

Are you sure you want to reject Civility data class?

It will not be proposed again for this object if you reject it but you will still be able to manually add it.

And the result is a single confirmed type.
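The confirm and reject behavior in this example can be pictured with a small model. This is only an illustrative sketch, not MetaKarta's implementation or API: the class ObjectClassification and its methods are hypothetical names, and the only behavior it encodes is what is described above (a rejected proposal is not proposed again, but can still be assigned manually).

```python
from dataclasses import dataclass, field


@dataclass
class ObjectClassification:
    """Hypothetical per-object state for data class assignments."""
    proposed: set = field(default_factory=set)
    confirmed: set = field(default_factory=set)
    rejected: set = field(default_factory=set)

    def propose(self, data_class: str) -> bool:
        """Called by a (hypothetical) harvest; skips rejected or confirmed classes."""
        if data_class in self.rejected or data_class in self.confirmed:
            return False
        self.proposed.add(data_class)
        return True

    def confirm(self, data_class: str) -> None:
        """Clicking the check: the proposal becomes a confirmed assignment."""
        self.proposed.discard(data_class)
        self.confirmed.add(data_class)

    def reject(self, data_class: str) -> None:
        """Clicking the X: the proposal is remembered as rejected."""
        self.proposed.discard(data_class)
        self.rejected.add(data_class)

    def assign_manually(self, data_class: str) -> None:
        """Manual assignment remains possible even after a rejection."""
        self.proposed.discard(data_class)
        self.confirmed.add(data_class)


# The Gender field of Employee.csv, as in the example above.
gender_field = ObjectClassification()
gender_field.propose("Gender")
gender_field.propose("Civility")
gender_field.confirm("Gender")
gender_field.reject("Civility")

# A later harvest tries to propose Civility again and is ignored.
print(gender_field.propose("Civility"))
print(sorted(gender_field.confirmed))
```

Keeping the rejected set separate from the proposed set is what lets a later harvest skip the proposal while leaving manual assignment open.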

Valid Data Classes

The available set of data classes is strictly controlled, so you may not simply type in a new one when assigning data classes to an object. A data class definition is more than just a name. It includes rules to match against (textual pattern matching rules or a list of valid values).
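To make the two kinds of rule concrete, here is a minimal sketch of a data class that carries either a textual pattern or a list of valid values. It is an assumption-laden illustration, not MetaKarta's internal representation: the DataClass type, the match_ratio helper, the regular expression, and the value list are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataClass:
    """Hypothetical data class: a name plus one matching rule."""
    name: str
    pattern: Optional[str] = None          # textual pattern rule (a regex here)
    valid_values: frozenset = frozenset()  # list-of-valid-values rule

    def matches(self, value: str) -> bool:
        """Return True if a single sampled value satisfies the rule."""
        if self.pattern is not None:
            return re.fullmatch(self.pattern, value) is not None
        return value.strip().lower() in self.valid_values


def match_ratio(data_class: DataClass, sample: list) -> float:
    """Fraction of sampled values that satisfy the data class rule."""
    if not sample:
        return 0.0
    hits = sum(1 for value in sample if data_class.matches(value))
    return hits / len(sample)


# Illustrative definitions only; real definitions are managed in the product.
ssn = DataClass("US Social Security Number", pattern=r"\d{3}-\d{2}-\d{4}")
gender = DataClass("Gender", valid_values=frozenset({"m", "f", "male", "female"}))

print(match_ratio(ssn, ["123-45-6789", "987-65-4321", "n/a"]))
print(match_ratio(gender, ["M", "F", "female", "unknown"]))
```

Because a definition bundles a name with a rule like this, a free-typed name alone would give the profiler nothing to match against, which is one way to read the restriction above.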

The current set of valid data classes may be reviewed, edited and removed using the manage data classes feature.

Hiding profiling and sample data by data class

You may ensure that sample and profile data are hidden from the casual user by setting a Hide flag on that object.

In addition to manually setting this value, you may also define a data class to hide the data sampled and profiled on subsequent harvests. Thus, e.g., you could define the data class US Social Security Number to be hidden for all objects of that data class. Then, as the data is profiled in subsequent harvests, and MetaKarta determines that an element is of that data class, its flag will be set to hidden. Go to manage data classes to manage this feature.
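The effect of marking a data class as hidden can be sketched as a simple rule applied after profiling. This is a hypothetical illustration, not the product's logic or API: the function apply_hide_flags and the object names are made up for the example.

```python
def apply_hide_flags(detected: dict, hidden_classes: set) -> dict:
    """Map each object to True if any of its detected data classes is hidden.

    detected maps an object name to the data classes determined for it.
    """
    return {
        obj: bool(set(classes) & hidden_classes)
        for obj, classes in detected.items()
    }


# Hypothetical result of a harvest with profiling.
detected = {
    "Employee.csv/SSN": ["US Social Security Number"],
    "Employee.csv/Gender": ["Gender"],
}

flags = apply_hide_flags(detected, {"US Social Security Number"})
print(flags)
```

In this sketch only the SSN field ends up flagged, because it is the only object whose data class was defined as hidden.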

Auto-tagging or Data Classification

Classification is one of the cornerstones of data governance. It allows one to classify different harvested data elements or objects with your shared terminology, represented as glossary terms. These relationships or data classifications provide the harvested objects with business names and definitions. Data classification may also help you to find hidden relationships between these harvested objects.

Your data estate can have many millions of harvested objects. In general, it is not practical to classify them all manually, i.e., one by one. Auto-tagging with data classes helps you to automate the process of data classification using MetaKarta's data profiling technology and data classes.

Data vs. Metadata-detected data classification

Of special importance is Personally Identifiable Information (PII). Automated and complete identification and classification of the PII objects in the data is fundamental to these activities. In many cases, data classes may be used to profile the data and match the criteria by which PII may be isolated. In this case, it is the actual data that is analyzed and used to classify the data element. This feature is referred to as data-detected data classification. Data-detected data classification is good at detecting common data patterns automatically but is less focused on providing definitions.

Some PII harvested objects, like Maiden Name and Date of Birth, do not have unique data patterns and cannot be discovered using data-detected data classification. MetaKarta helps you to identify these types of harvested objects by analyzing their metadata rather than their data. This feature is referred to as metadata-detected data classification. Metadata-detected data classification is good at providing authoritative and common definitions. It is more flexible but less precise than data-detected data classification.
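The difference between the two approaches can be illustrated with a short sketch. It rests on assumptions that are not taken from the product: the function names, the 0.8 threshold, the regular expressions, and the idea that the metadata rule is a pattern on the column name are all hypothetical.

```python
import re


def data_detected(sample: list, value_pattern: str, threshold: float = 0.8) -> bool:
    """Classify from the sampled values: enough of them must match the pattern.

    The threshold is an assumed parameter for this sketch.
    """
    if not sample:
        return False
    hits = sum(1 for value in sample if re.fullmatch(value_pattern, value))
    return hits / len(sample) >= threshold


def metadata_detected(column_name: str, name_patterns: list) -> bool:
    """Classify from the metadata: the column name must match a name pattern."""
    return any(re.search(p, column_name, re.IGNORECASE) for p in name_patterns)


# SSN values have a distinctive shape, so the data itself is enough.
print(data_detected(["123-45-6789", "987-65-4321"], r"\d{3}-\d{2}-\d{4}"))

# A date of birth looks like any other date, so the name is used instead.
print(metadata_detected("EMP_DOB", [r"dob", r"birth"]))
print(metadata_detected("HIRE_DATE", [r"dob", r"birth"]))
```

The sketch also hints at the trade-off noted above: a name rule applies to any column regardless of its data type, but it is only as precise as the naming conventions it relies on.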

A harvested object can have multiple metadata-detected data classifications (relationships with business terms). It can have one definition (data-detected) classification and many other semantic (metadata-detected) classifications. That is, you may classify several harvested objects that have different data types and patterns with the same business term.

A harvested object can have multiple proposed, approved and assigned data classifications (tagged with data classes). MetaKarta encourages users to be as precise as possible with the data classifications and strive for a harvested object to have one approved or assigned data classification.

In addition, in terms of semantic lineage, MetaKarta uses data-detected and metadata-detected data classifications to implement lookups of the inferred definition and related elements.

Metadata-detected data classification occurs automatically as you create or edit those data classes or when metadata is harvested. In addition, you may invoke the metadata-detected data classification explicitly. For data-detected data classification, you may manage it and invoke it by following the steps below.

Please see further details at:

Steps

Add and edit a data class as needed.
Browse for a glossary term to associate with the data class as necessary.
For the source metadata, be sure to harvest the metadata with data profiling information captured.
Invoke the automatic data classification either as a part of the harvest or using the Classify Data options.
Navigate to the object page for the object you wish to work with.

You may also review and edit data class assignments in grid mode. However, they cannot be assigned in bulk.

MetaKarta will have proposed data classes.
Either approve or reject the already proposed data class, or specify the data class manually.
Go to the Lineage tab. Click on Definition and note the term associated with the data class is now presented.
Click on the term. Go to the Lineage tab. Click on Usage and note the metadata object (and others assigned to the same data class) are now presented.
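The Definition and Usage lookups in the last two steps can be thought of as two directions over the same links: object to data class to term, and term back to objects. The following is a minimal sketch under that assumption. The function names, the dictionaries, and the object names are hypothetical and do not reflect how MetaKarta stores these relationships.

```python
def definitions_for(obj: str, assignments: dict, class_terms: dict) -> list:
    """Terms reachable from an object through its assigned data classes."""
    classes = assignments.get(obj, ())
    return sorted({class_terms[c] for c in classes if c in class_terms})


def usage_of(term: str, assignments: dict, class_terms: dict) -> list:
    """Objects whose data classes are associated with the given term."""
    return sorted(
        obj
        for obj, classes in assignments.items()
        if any(class_terms.get(c) == term for c in classes)
    )


# Data class -> associated glossary term (hypothetical identifiers).
class_terms = {"US Social Security Number": "term: US Social Security Number"}

# Object -> assigned data classes (hypothetical objects).
assignments = {
    "Employee.csv/SSN": ["US Social Security Number"],
    "ID": ["US Social Security Number"],
    "Employee.csv/Gender": ["Gender"],
}

print(definitions_for("Employee.csv/SSN", assignments, class_terms))
print(usage_of("term: US Social Security Number", assignments, class_terms))
```

An object whose data class has no associated term, like the Gender field here, simply yields no definition through this path.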

Example

Go to MANAGE > Data classes in the banner.

Click the line with the US Social Security Number data class.

Click the browse icon next to TERM. Select the term US Social Security Number in the Object Explorer dialog.

Click OK, then SAVE.

Navigate to the object page of the SSN field in the Employee.csv file.

There are no data classes proposed yet as we have not auto-tagged.

Go to More Actions > Data Classification.

Select PII-SSN for the CLASSIFICATION GROUP.

It is not necessary to select a data classification group, as leaving it blank simply uses all the data classes defined. However, when you are classifying an entire model or configuration, you may wish to improve performance by specifying a group.
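The role of the classification group can be sketched as a filter over the defined data classes. This is a hypothetical illustration: the function classes_to_evaluate is not a product API, and the group name PII-Demographics is invented for the example (only PII-SSN appears in this walkthrough).

```python
from typing import Optional


def classes_to_evaluate(all_classes: dict, group: Optional[str] = None) -> list:
    """Return the data classes to evaluate.

    all_classes maps a data class name to the set of groups it belongs to.
    With no group given, every defined data class is evaluated.
    """
    if not group:
        return sorted(all_classes)
    return sorted(name for name, groups in all_classes.items() if group in groups)


all_classes = {
    "US Social Security Number": {"PII-SSN"},
    "Gender": {"PII-Demographics"},  # hypothetical group name
    "Civility": set(),
}

print(classes_to_evaluate(all_classes, "PII-SSN"))
print(classes_to_evaluate(all_classes))
```

Fewer data classes to evaluate means fewer rules to test against each object, which is where the performance benefit for large models would come from.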

The SSN field is assigned the US Social Security Number data class. Thus, the Name and Business Definition are populated with the information of the SSN term.

Go to the Lineage tab. Click Definition in the upper right.

The US Social Security Number term is provided as one of the definitions.

Click on the term US Social Security Number (first line). Go to the Lineage tab. Click Usage in the upper right.

We see that SSN is classified with this term (because of the auto-tagging).

We can classify the entire data lake at once and see what else is auto-tagged. Navigate to the object page of the Data Lake model and select More Actions > Data Classification. Again select PII-SSN for the CLASSIFICATION GROUP and click OK.

Wait for the operation to finish.

Return to the US Social Security Number term and the semantic usage report. Click Diagram on the left.

The ID field is shown as one of the metadata objects that use the definition and name of this term.