Semantic Table Overview
Semantic tables provide business-friendly views of raw database tables and views. They host semantic columns, metrics, and hierarchies. A model can employ a semantic table as a fact or dimension.
A semantic table can optionally define joins to represent a snowflake dimension (e.g., nation and region) or a dependent fact (e.g., lineitem and order).
Semantic Table defines:
- Source database table
- Unique Keys for the table to be used in Joins
- Joins to relate the table with other tables (e.g., dependent multi-fact or snowflake dimension).
- Semantic Columns to select database columns that the semantic table can use and define calculation expressions that metrics and levels utilize.
- Levels as columns in a database that represent a different granularity of detail within a dimension semantic table.
- Hierarchies to organize data from the highest level of summarization (the "all" or root level) down to the most granular level (the "leaf" level) within a dimension semantic table
- Metrics to represent an aggregated semantic column expression, such as the sum of sales or average order quantity.
Dimension Semantic Tables
Dimension semantic tables provide descriptive details (e.g., Customer, Product, Date) for filtering and grouping metric data.
MSL supports the following types of dimension semantic tables:
- Normal: Sources from one database table to help implement a classic star schema with single-level joins. All data for a normal dimension is normalized into a single table or view. There are two types of normal dimensions:
- Standard: Can have any type of hierarchy.
- Time: Must have a time hierarchy. The semantic table is of the TIME dimension type when it has a metric with the
time_typeparameter set to a non-default value. - Snowflake: A logical dimension that is composed of multiple underlying physical datasets.
- Outrigger: Secondary dimensions referenced from other dimensions (e.g., Customer dimension with an embedded Nation lookup).
- Role-Playing Dimensions: Same dimension table used in different contexts (e.g., Date as Order Date, Ship Date).
- Degenerate: Uses columns in a fact source database table and shares the semantic table with metrics.
- Junk Dimensions: Low-cardinality flags and indicators grouped together to simplify the schema.
- Slowly Changing Dimensions (SCDs): Handling historical changes in attributes (Type 1 overwrite, Type 2 versioning, Type 3 alternate columns).
- Many-to-many: Also called multi-valued. This occurs when a fact dataset row references more than one row in a dimension dataset. In MSL, this is modeled by defining a dimensional bridge or junction table to resolve the many-to-many relationship.
Entity Relationships
---
config:
class:
hideEmptyMembersBox: true
---
classDiagram
`MSL Object` <|-- `Semantic Table`
`MSL Object` <|-- `BI Object`
`MSL Object` <|-- Level
`BI Object` <|-- `Generic Semantic Column`
`BI Object` <|-- Metric
`BI Object` <|-- Hierarchy
`Generic Semantic Column` <|-- `Semantic Column`
`Generic Semantic Column` <|-- `Calculated Semantic Column`
class `MSL Object` {
<<interface>>
String name [1..1]
String description
String objectID
}
class `BI Object`{
<<interface>>
Boolean hide
String folder
}
classDiagram
`Semantic Table` *-- `Generic Semantic Column`: columns
`Semantic Table` *-- `Metric` : metrics
`Semantic Table` *-- `Level` : levels
`Semantic Table` *-- `Hierarchy`: hierarchies
`Semantic Table` *-- `Unique Key`
`Semantic Table` *-- Join
`Join` *-- RolePlay
`Level` o-- `Generic Semantic Column`
`Hierarchy` o-- `Level`: levels
`Metric` o-- `Generic Semantic Column` :expression
class `Semantic Table` {
String source [1..1]
Array~Unique Key~ unique_keys [1..1]
Array~Join~ joins
Array~Generic Semantic Column~ columns
Array~Metric~ metrics
Array~Level~ levels
Array~Hierarchy~ hierarchies
String synonyms
}
class `Unique Key` {
Array~String~ columns [1..1]
}
class Join {
Array~String~from_fk_columns [1..1]
String to_semantic_table [1..1]
String to_unique_key
Array~Join~ joins
RolePlay role_play
}
class RolePlay {
String prefix
String suffix
Array~String~ attributes
Array~String~ hierarchies
}
class `Level` {
Array~String~ key_columns [1..1]
Array~String~ alternate_key_columns
String value_column
String order_by_column
Array~String~ dependent_columns
String time_type
}
class `Hierarchy` {
Array~String~ levels
}
class `Generic Semantic Column` {
String bi_data_type
String bi_category
String default_aggregation
}
`Generic Semantic Column` <|-- `Semantic Column`
`Generic Semantic Column` <|-- `Calculated Semantic Column`
class `Semantic Column` {
String source_column [1..1]
}
class `Calculated Semantic Column` {
Array~Expression~ expression [1..1]
}
class `Metric` {
Array~Expression~ expression [1..1]
%% Push Non-Additive Metrics to the Client (e.g., BI) Semantic Layer
Boolean nonAdditive
}
Semantic Table Definition
source
- Type: string
- Required: Yes
Identifies a source database table (or view) for this semantic table. When a dimension semantic table involves multiple tables (e.g., snowflake), this attribute identifies the grain table (e.g., connected to the fact table).
TBD: Should we enable users to specify a SQL query to implement custom scenarios?
unique_keys
- Type: Array<Unique Key>
- Required: Yes
Defines one or more sets of columns that uniquely identify rows in the semantic table. The first key, or one explicitly named "default", is used as the default unique key for joins. Joins can optionally reference non-default keys.
- name: Customer
source: customer
unique_keys:
- name: customer # The first key or one called default is the default unique key used for joins.
# Joins can optionally use non-default keys.
columns: # Keys supports either "multiple" columns or "single" column (shorter) property
- c_custkey
- name: second key
columns:
- c_custname
- name: Orders
source: orders
unique_keys:
- column: o_orderkey # Fact tables do not need to define keys unless the target database is Snowflake
- name: Nation
source: nation
unique_keys:
- name: default # default key
columns:
- n_nationkey
- name: Region
source: region
unique_keys: # The key name is 'default' when the name property is missing.
- column:
- r_regionkey
Note: Fact tables do not need to define keys unless the target database is Snowflake.
joins
- Type: Array< Join>
- Required: No
Join relationships this table or model has. See Join for details.
synonyms
- Type: string
- Required: No
The comma-separated list of synonyms. For example, the active_user_count can have synonyms: active users, monthly active users, MAU
Synonyms are used for informational purposes only and act as alternative, business-friendly names for objects. like semantic tables. They provide a layer of abstraction that helps Large Language Models (LLMs) and business users interpret data, without affecting the underlying database objects. Semantic Column
Semantic Columns enables you to either select a database source_column or define a calculation expression.
A basic column uses the source_column property to select a database column in the source table.
A calculated column determines its row values using the expression parameter that can involve database columns and semantic columns. They are helpful for row-level computations, such as combining text, calculating new values, or creating categories. The expression parameter enables you to define formulas for different database dialects.
You use semantic columns to define levels and metrics. Unless you decide to hide them (e.g., ID or key columns), they are available in the BI semantic layer.
Join
A Join represents a MANY_TO_ONE relationship between FROM (many) and TO (one) semantic tables using the from_unique_key and to_fk_columns properties.
The TO semantic table (e.g., dimension in the relationship to a fact) can define multiple unique_keys referenced by different joins. When the to_unique_key is missing, the join uses the default unique key.
You can optionally name a join using the name property. When it is missing, the join assumes the name of the to_semantic_table.
TBD: Why do we need to name joins, and what rules should these names follow (e.g., uniqueness)?
The from_columns point to the foreign key column(s) of the FROM table.
You can define joins in the model and semantic tables.
You can use nested joins to depict precise relationships between tables.
Example illustrating Model, Fact, and Dimension joins:
- name: TPC-H Lineitem-Order Customer-Nation Catalog
definition: Example of all kinds of joins (only - not dimension and fact details).
dimensions:
- name: Nation
source: nation
unique_keys: # the default unique key
column: n_nationkey
- name: Region
source: region
unique_keys: # the default unique key
- column: r_regionkey
- name: Customer Nation Region
source: customer
unique_keys: # the default unique key
- column: c_custkey
# 1. Dimension Semantic Table - snowflake joins
joins:
- from_fk_column: c_nationkey
to_semantic_table: Nation
joins:
- from_fk_column: n_regionkey
to_semantic_table: Region
facts:
- name: Orders
source: orders
unique_keys: # the default unique key
- column: o_orderkey
- name: Lineitem Orders
source: lineitem
unique_keys:
columns:
- l_orderkey
- l_linenumber
# 2. Fact Semantic Table - dependent multi-fact joins
joins:
- from_fk_column: l_orderkey
to_semantic_table: Orders
models:
- name: Model
fact_semantic_table: Lineitem Order
# 3. Model - fact and dimension joins
joins:
- from_fk_column: o_orderkey # FK
to_semantic_table: Customer Nation Region
- from_fk_column: l_suppkey # FK
to_semantic_table: Supplier
Array~String~from_fk_columns [1..1]
String to_semantic_table [1..1]
String to_unique_key
Array~Join~ joins
to_semantic_table
- Type: String
- Required: Yes
The TO semantic table that uses the source table with the to_unique_key.
from_fk_columns
- Type: Array
- Required: Yes
List of database columns in the source table that make up the foreign key.
to_unique_key
- Type: String
- Required: No
The name of the unique key in the TO semantic table. When the property is missing, the DEFAULT key is used.
joins
- Type: Array
- Required: No
Optional nested Joins.
role_play
- Type: RolePlay
- Required: No
For role-playing dimensions only. Defines the role-playing template for the TO dimension in the semantic table join.
RolePlay
The role-playing template is the prefix and/or suffix added to every name attribute in the role-playing dimension, including the names of hierarchies.
You can subset the list of hierarchies in a role-playing dimension semantic tables using the hierarchies property. When it is missing, all hierarchies are included.
The following example defines "Order Date" and "Ship Date UPS" role-playing dimension semantic tables.
- name: Sales - Role-playing Date dimension
fact_semantic_table: Sales
semantic_tables:
- name: Date
source: dim_date
unique_key:
- column: d_datekey
joins:
- to_semantic_table: Date
from_column: o_orderdatekey
role_play:
- prefix: Order
- to_semantic_table: Date
from_column: o_shipdatekey
role_play:
- prefix: Ship
suffix: UPS
hierarchies:
- Retail 445 Week Hierarchy
prefix
- Type: string
- Required: Yes (when the suffix is missing) TBD: The prefix text .
suffix
- Type: string
- Required: Yes (when the prefix is missing)
TBD: The suffix text.
columns
columns
- Type: Array
An optional list of semantic columns to include in the role-playing dimension. When the property is missing, all semantic columns are included.
hierarchies
- Type: Array
- Required: No
Optional list of hierarchies to include into the role playing dimension. When the property is missing, all hierarchies are included. Including a hierarchy includes all semantic columns its levels depend on.
Semantic Column
A semantic column is defined from the source database column and can be used as a 'base fact' (when default_aggregation is set) or a 'dimension attribute' (otherwise).
source_column
- Type: String
- Required: Yes
The name of the database column in the source table specified in the parent semantic table.
Calculated Semantic Column
A calculated semantic column is defined using an expression and can be used as 'base fact' (when default_aggregation is set) or 'dimension attribute' (otherwise).
Expression in Calculated Semantic Column
- Type: array
- Required: Yes
Row-level dialect-specific expression defining this semantic column from database columns or other semantic columns using $(sem_col_name)$ macro. For example, when x=t.c1+t.c2 and y=$(x)$/100, then y=(t.c1+t.c2)/100).
See Expression for details.
Generic Semantic Column
The base abstract definition, which Semantic Column and Calculated Semantic Column implement.
serves as a layer bonding logical Dimension's Attribute and Base Fact to their physical
database column using expression that can reference these columns or other "semantic columns".
Expression in Generic Semantic Column
- Type: array
- Required: Yes (when
source_columnis missing)
A dialect-specific expression defining a semantic column from database columns and other semantic columns.
See Expression for details.
default_aggregation
- Type: string
- Required: No
- Default: NONE
Default aggregation function name to apply to the base fact semantic column. MSL defines generic (not tool-specific) enumeration of aggregation functions and their mapings to tool-specific functions. You can specify an enumeration value or a string for unsupported tool-specific function.
Common Aggregation enumeration:
| Common | Description | Oracle | SAP HANA | Snowflake | Databricks (Spark SQL) | Power BI (DAX / visual) | Tableau |
|---|---|---|---|---|---|---|---|
| SUM | Sum of values | SUM(x) | SUM(x) | SUM(x) | SUM(x) | SUM(x) | SUM(x) |
| AVERAGE | Arithmetic mean | AVG(x) | AVG(x) | AVG(x) | AVG(x) | AVERAGE(x) | AVG(x) |
| MIN | Smallest value | MIN(x) | MIN(x) | MIN(x) | MIN(x) | MIN(x) | MIN(x) |
| MAX | Largest value | MAX(x) | MAX(x) | MAX(x) | MAX(x) | MAX(x) | MAX(x) |
| COUNT_ROWS | Count of all rows | COUNT(*) | COUNT(*) | COUNT(*) | COUNT(*) | COUNTROWS(table) (visual: Count) | Number of Records |
| COUNT_NON_NULL | Count of non-nulls in column | COUNT(x) | COUNT(x) | COUNT(x) | COUNT(x) | (visual) Count (Not blank) — COUNT(x)/COUNTA(x) | CNT(x) |
| COUNT_DISTINCT | Count of distinct values | COUNT(DISTINCT x) | COUNT(DISTINCT x) | COUNT(DISTINCT x) | COUNT(DISTINCT x) | DISTINCTCOUNT(x) | CNTD(x) |
| MEDIAN | 50th percentile (median) | MEDIAN(x) | MEDIAN(x) | MEDIAN(x) | percentile_approx(x, 0.5) | MEDIAN(x) | MEDIAN(x) |
| STDDEV_SAMP | Std dev, sample | STDDEV_SAMP(x) | STDDEV_SAMP(x) | STDDEV_SAMP(x) | STDDEV_SAMP(x) | STDEV.S(x) | STDEV(x) |
| STDDEV_POP | Std dev, population | STDDEV_POP(x) | STDDEV_POP(x) | STDDEV_POP(x) | STDDEV_POP(x) | STDEV.P(x) | STDEVP(x) |
| VAR_SAMP | Variance, sample | VAR_SAMP(x) | VAR_SAMP(x) | VAR_SAMP(x) | VAR_SAMP(x) | VAR.S(x) | VAR(x) |
| VAR_POP | Variance, population | VAR_POP(x) | VAR_POP(x) | VAR_POP(x) | VAR_POP(x) | VAR.P(x) | VARP(x) |
| PERCENTILE_CONT | Continuous percentile p∈[0,1] | PERCENTILE_CONT(p) WITHIN GROUP (ORDER BY x) | same | same | percentile(x, p) or percentile_approx(x, p) | PERCENTILEX.INC(table, expr, p) | PERCENTILE(x, p) (table calc) |
| PERCENTILE_DISC | Discrete percentile p∈[0,1] | PERCENTILE_DISC(p) WITHIN GROUP (ORDER BY x) | same | same | — (use approx) | PERCENTILEX.EXC(...) | — (use PERCENTILE calc) |
| APPROX_COUNT_DISTINCT | Fast approx distinct count | APPROX_COUNT_DISTINCT(x) | — | APPROX_COUNT_DISTINCT(x) | approx_count_distinct(x) | — | — |
| APPROX_PERCENTILE | Fast approx percentile | APPROX_PERCENTILE(x, p) | — | APPROX_PERCENTILE(x, p) | percentile_approx(x, p[, accuracy]) | — | — |
| ANY_VALUE | Arbitrary representative value | — (use MIN/MAX) | — | ANY_VALUE(x) | — (use first/last or MIN) | SELECTEDVALUE(x) (or logic) | ATTR(x) |
Example:
bi_data_type
- Type: string
- Required: Yes
Generic (not tool-specific) data type classification that defines the kind of data a column contains, such as text, whole numbers, decimals, or dates. categorization In Power BI Desktop, the 'Data type'.
Common Data Type enumeration with descriptions, and BI tool examples:
| Data type | Description | Power BI | Tableau |
|---|---|---|---|
| INTEGER | A whole number with no decimals. Maps to the largest integer data type in the tool. | Whole Number | Number (whole) |
| DECIMAL | A number with a fractional component. | Decimal Number Fixed Decimal Number | Number (decimal) |
| BOOLEAN | A value that can be either true or false. | True/False (Boolean) | Boolean |
| TEXT | A string of alphanumeric characters. | Text | String |
| DATE | A calendar date (e.g., 2025-10-17). | Date | Date |
| DATETIME | A value containing both date and time. | Date/Time | Date & Time |
| TIME | A value representing the time of day. | Time | Date & Time (at time granularity) |
| BINARY | Binary large object (image, document). | Binary | Not natively supported |
bi_category
- Type: string
- Required: No
Generic (not tool-specific) categorization on top of the bi_data_type to influence the visualization behavior of BI tools. MSL maps them to tool-specific categories. Some tools support user defined categories. You can specify them as a string.
Common Category with descriptions, and BI tool examples:
| CATEGORY | Description | Tableau | Power BI |
|---|---|---|---|
| AIRPORT | Airport codes/locations (IATA/ICAO). | Airport | — |
| AREA_CODE_US | U.S. telephone area code; geo-ish but not a location polygon. | Area Code (US) | — |
| CBSA_MSA_US | U.S. Core/Base Statistical Areas & Metro Statistical Areas. | CBSA/MSA (US) | — |
| CITY | City/locality. | City | City |
| CONGRESSIONAL_DISTRICT_US | U.S. House districts. | Congressional District (US) | — |
| COUNTRY_REGION | Nation-level. | Country/Region | Country/Region |
| COUNTY | County/shire/prefecture equivalents. | County | County |
| NUTS_EUROPE | EU NUTS levels (NUTS1/2/3). | NUTS Europe | — |
| STATE_PROVINCE | First-order admin division (state, province, region, oblast, etc.). | State/Province | State or Province |
| POSTAL_CODE | ZIP/postal/CEP/etc. | ZIP Code/Postcode | Postal Code |
| CONTINENT | Continental region (e.g., Europe, Asia). | — | Continent |
| LATITUDE | Numeric latitude. | Latitude | Latitude |
| LONGITUDE | Numeric longitude. | Longitude | Longitude |
| ADDRESS | Full street address. Not a Tableau geographic role by default. | — | Address |
| PLACE_GENERIC | Power BI’s generic location bucket (ambiguous granularity). | — | Place |
| WEB_URL | Non-geographic. Used for clickable links. | — | Web URL |
| IMAGE_URL | Non-geographic. Used for image rendering. | — | Image URL |
| BARCODE | Non-geographic. Used for bar/QR codes, product IDs, etc. | — | Barcode |
Metric
A metric is a numeric value representing an aggregated database column expression, such as the sum of sales or average order quantity. Metrics always result from an aggregate calculation applied to one or more columns of fact database tables.
MSL supports the following types of metrics:
- Additive: Metrics whose values can be summarized for any dimension attribute of the model and then combined consistently.
- Non-additive: Metrics whose values cannot be summed across any dimensional groupings using basic addition, since this would typically produce an inaccurate result. A distinct count of an attribute value is a typical example of a non-additive metric.
- Semi-additive: Metrics whose values can be summarized for some dimensions in a model, but not all. Ratios such as averages are also considered semi-additive metrics. MSL implements them in the BI semantic layer.
MSL supports basic Fact or calculated Metric (see isFact property). A fact metric originates from a specific column and may have a default aggregation. A calculated metric defines dialect-specific expressions. They can combine, evaluate, or manipulate other metrics defined in the table. For example, you can do simple math operations to combine metrics or simple comparison operations to return a given metric value when certain conditions are met. Or you can use database-specific functions to perform time-series analysis.
TBD - FUTURE: The separation of calculated metrics from other types enables you to easily create boilerplate calculations that can be used across multiple metrics (e.g., macros).
Metrics are implemented in the database. Non-additive metrics are pushed to BI clients.
Details
Metrics are quantifiable measures of business performance calculated by aggregating facts or other columns from the same table (using functions like SUM, AVG, and COUNT) across multiple rows. They transform raw data into meaningful business indicators, often combining multiple calculations in complex formulas. Examples include Total Revenue or Profit Margin Percentage. Metrics represent the KPIs in reports and dashboards that drive business decision-making.
Facts are row-level attributes in the data model that represent specific business events or transactions. While facts can be defined using aggregates from more detailed levels of data (such as SUM(t.x) where t represents data at a more detailed level), they are always presented as attributes at the individual row level of the logical table. Facts capture 'how much' or 'how many' at the most granular level, such as individual sales amounts, quantities purchased, or costs. It's important to note that facts typically function as 'helper' concepts within the semantic table to help construct dimensions and metrics.
expression
- Type: Array
- Required: Yes
Native expressions define metrics, including their aggregation functions, windows, and ratios. Expressions should reference base facts from this Fact semantic tables and other metrics as reusable components.
TBD: Define a calculated measure by specifying an analytic view expression. The expression may reference other measures in the analytic view, but may not reference fact columns. Calculated measures do not have an aggregate clause because they're computed over the aggregated base measures.
Example:
metrics:
- name: total_revenue
expression:
- dialect: Snowflake
text: SUM(o_totalprice)
- dialect: Oracle
text: SUM(o_totalprice)
nonAdditive
- Type: Boolean
-
Required: No
-
Default: false
BI tools aggregate all measures by default. To avoid re-aggregation of non-additives (like exact distinct count), set this attribute to true. It requests the application to push the metric to the client side (e.g., BI or SQL). The value is false by default.
Example: TBD
metrics:
- name: distinct_customers
nonAdditive: true
expression:
- dialect: Snowflake
text: COUNT(DISTINCT c_custkey)
- dialect: Power BI
text: COUNT(DISTINCT c_custkey)
Expression
Expressions define database-specific calculations and formulas.
dialect
- Type: string
- Required: Yes
The name of the relational database or business analytics tool (dialect) for which this expression is written.
Supported values:
DATABRICKSSNOWFLAKEORACLEHANAPOWER BITABLEAU
Example:
text
- Type: string
- Required: Yes
The actual SQL expression or formula.
Example:
Unique Key
columns
- Type: array< string>.
- Required: Yes Ordered list of database columns in the key