A context-based entity applies when:
You need to create context-based entities when a set of data may be delivered within the boundaries of a parent context. It may happen that all the Part Suppliers for just one single Part number are delivered. Since the delivery only focuses on the Part that you specify, all data of the other Parts remains unchanged. In this case, in order to allow implicit delete detection and set-based validations, the existing set of Part Suppliers must be filtered based on the context of its parent. If you change or remove an item, it is implicitly taken into account. Another term used for this is context-based enddating, since it enddates records that are not included in the delivery.
{info} When you create a dependent relationship, the context filter is automatically set to `True`. For independent relationships the context filter is automatically set to `False`. You can change this, but be very careful when enabling a context filter on an independent relationship.
Suppose we deliver Part as a delta delivery; we only deliver Part Nbr 1, 2 and 200. We deliver Part Supplier as a full load. This is what will happen: for Part Nbr 1, 2 and 200 we will do a full compare of the Part Supplier records between what is delivered and what is in the CFPL.
{danger} If you forget to set the context filter to `True` and you make a partial delivery for only Part numbers 1, 2 and 200, then all other Part numbers will be deleted.
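To make the mechanism concrete, the sketch below shows how context-based delete detection could look in SQL. All table and column names (`part_supplier`, `stg_part`, `stg_part_supplier`, `end_dt`) are illustrative assumptions, not the code generated by i-refactory.

```sql
-- Hypothetical sketch, not the generated i-refactory code.
-- With the context filter enabled, only Part Suppliers whose parent Part is in the
-- delivery are compared; suppliers of Parts outside the context remain untouched.
DECLARE @snapshot_dt DATETIME2 = '2024-06-01';

UPDATE ps
SET    ps.end_dt = @snapshot_dt                          -- enddate records missing from the delivery
FROM   part_supplier AS ps
WHERE  ps.end_dt = '9999-12-31'                          -- current (open) records only
  AND  ps.part_nbr IN (SELECT part_nbr FROM stg_part)    -- context filter on the parent Part
  AND  NOT EXISTS (SELECT 1
                   FROM   stg_part_supplier AS d
                   WHERE  d.part_nbr     = ps.part_nbr
                     AND  d.supplier_nbr = ps.supplier_nbr);  -- not (re)delivered
```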
In our Design Time process the Logical DataModel drives the set-up of most physical datamodels. Each physical model is a representation of the data that resides in one of the technical layers of our Runtime environment. Understanding of this linkage is a key element of our solution.
This layer is a separate database that represents the physical data delivery of the related tenant. The main purpose of this layer is to create a common starting point for further processing. It is expected that this layer holds data in columns and rows.
This layer contains data from the last delivery and must be truncated at the start of each new delivery. The status of the delivery is managed through i-refactory APIs. We therefore assume an external process orchestrates deliveries by means of our APIs.
You can use this Layer to:
{info} This all serves to prepare the step from Technical Staging towards the Logical Validation Layer (LVL). Preparing in this layer avoids problems in the LVL. In order to retain full lineage, all data fixes and enrichments must be captured in computed columns or derived entities.
The Logical Validation Layer (LVL) is the second layer in the chain and is initiated after the data has been fully transmitted to the Technical Staging In layer. The LVL is a separate database that serves the following functions:
A delivery gets “accepted with findings” when there are constraint violations present in the data without exceeding any constraint threshold. A constraint threshold determines the number of occasions in which a constraint can be violated before the delivery receives the “rejected” flag. Each constraint has a separately configurable threshold that describes the maximum allowed occurrences before a delivery is rejected, and an action that describes what to do when a violation occurs. In case the delivery is “accepted” or “accepted with findings”, the data is prepared for handover to the next layer.
The actions that can be selected vary per constraint type. For instance, a primary key constraint will not have the option to select ‘Empty column’ or ‘No action’, since it is impossible for a database to have an invalid or ‘NULL’ value as primary key. The possible actions are skip row, empty column or no action.
The data validation process determines the quality of the data by evaluating the constraints that are captured in the model. The constraints are grouped into constraint types and are separated over two stages. At the end of each stage an intermediate evaluation is performed to determine whether any threshold is exceeded.
The first stage covers the validation types that can be performed without requiring data from outside the delivered dataset itself. The validations that are performed during this stage are:
The output of the first phase of validations is initially stored in a temporary table. This temporary table is used to register the information on the witnessed violations in the ACME database and to determine whether the delivery should be rejected for exceeding a threshold. Only when none of the thresholds is exceeded is the data approved and transferred to the corresponding ‘UCLVL_BC’ tables. These tables have an additional column ‘ACM_SKIP_ROW’ that indicates for which records the ‘skip row’ action is performed. The ‘no action’ and ‘empty column’ actions are column specific, and their output is already processed to its final state.
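The sketch below illustrates the idea of the threshold evaluation and the ‘skip row’ flagging in SQL. All names (`tmp_violations`, `constraint_config`, `tsl_customer`, `uclvl_bc_customer`) are illustrative assumptions; the actual structures are generated from the metadata.

```sql
-- Illustrative sketch of the stage-1 threshold evaluation (all names are assumptions).
-- tmp_violations holds one row per witnessed violation: (constraint_id, row_id, action).
SELECT v.constraint_id,
       COUNT(*)    AS violation_count,
       c.threshold
FROM   tmp_violations    AS v
JOIN   constraint_config AS c ON c.constraint_id = v.constraint_id
GROUP  BY v.constraint_id, c.threshold
HAVING COUNT(*) > c.threshold;   -- any row returned means the delivery is rejected

-- If no threshold is exceeded, the rows are approved and copied to the UCLVL_BC table,
-- flagging the records for which the 'skip row' action applies.
INSERT INTO uclvl_bc_customer (customer_id, name, acm_skip_row)
SELECT s.customer_id,
       s.name,
       CASE WHEN EXISTS (SELECT 1 FROM tmp_violations AS v
                         WHERE v.row_id = s.row_id AND v.action = 'skip row')
            THEN 1 ELSE 0
       END AS acm_skip_row
FROM   tsl_customer AS s;
```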
The second stage contains the constraints that require additional data from outside the provided dataset. Missing data that is required to perform the validation is retrieved from the main dataset and appended to the validation dataset. For example, a delivery with data that references a table that is only set once during the initial delivery can only be validated when this data is appended to the validation dataset. After all the missing data points are appended to the validation dataset, the constraints are validated. The constraints that are validated during this phase are:
The result set of the second phase of validations is initially stored in a temporary table and used for violation registration and for determining whether the delivery should be rejected for exceeding a threshold. Only when none of the thresholds is exceeded is the data transferred to the corresponding ‘UCLVL_SC’ tables. These tables contain the final product of the two validation phases.
In case a delivery is accepted with or without findings, the final step of updating the LVL dataset and transferring the data to the next layer (CFPL) is initiated. This process is performed by comparing the dataset stored in the UCLVL_SC tables with the current dataset from the LVL, to construct a delta script that only contains the changes. This script is afterwards used to update the LVL itself and is placed in the queue for the next layer (CFPL). This allows the LVL and the CFPL to work concurrently and independently from each other. The LVL clears the data from the UCLVL_BC and UCLVL_SC tables and the whole process repeats for the next delivery.
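The following SQL sketch illustrates how such a delta could be derived for a hypothetical `customer` entity; the real script is generated from the metadata, and the names used here are assumptions.

```sql
-- Illustrative delta determination (names are assumptions, the real script is generated).
-- New or changed rows: present in the validated delivery but absent or different in the current LVL set.
SELECT sc.customer_id, sc.name
FROM   uclvl_sc_customer AS sc
LEFT   JOIN lvl_customer AS cur ON cur.customer_id = sc.customer_id
WHERE  cur.customer_id IS NULL      -- insert
   OR  cur.name <> sc.name;         -- update

-- Rows to (logically) delete for a full load: present in the current LVL set but missing from the delivery.
SELECT cur.customer_id
FROM   lvl_customer AS cur
WHERE  NOT EXISTS (SELECT 1
                   FROM   uclvl_sc_customer AS sc
                   WHERE  sc.customer_id = cur.customer_id);
```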
{warning} Deliveries with key violations, data-type conversion violations or with validation issues above threshold will not be further processed and require remediation.
For deliveries with active validation, validation results are stored in our metadata.
This layer is a separate database that serves the following functions:
The CFPL database is automatically populated by the i-refactory load processor, which generates, from the metadata, the SQL code that moves data from the Logical Validation Layer to the Central Facts Layer.
This is the access layer that makes the data as persisted in the CFL layer accessible for data consumption in various perspectives:
This layer is virtual by default and supports consuming data while it is being refreshed. It also manages CRUD updates to the CFL by means of generated CRUD triggers on the current-perspective database views.
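The sketch below illustrates the idea behind such a generated CRUD trigger, using an INSTEAD OF UPDATE trigger on a hypothetical current-perspective view. Names, columns and the exact logic are illustrative assumptions, not the generated i-refactory code.

```sql
-- Hypothetical sketch of a CRUD trigger on a current-perspective view (names are assumptions).
-- The view exposes the current facts; the trigger translates an UPDATE on the view
-- into the corresponding append on the underlying Central Facts Layer context table.
CREATE TRIGGER gdal.trg_customer_update
ON gdal.customer              -- current-perspective view over the CFL
INSTEAD OF UPDATE
AS
BEGIN
    -- Close the current context version ...
    UPDATE ctx
    SET    ctx.acm_end_dt = SYSUTCDATETIME()
    FROM   cfl.customer_context AS ctx
    JOIN   inserted AS i ON i.id = ctx.id
    WHERE  ctx.acm_end_dt = '9999-12-31';

    -- ... and append the new version with record indicator 'A' (appended update).
    INSERT INTO cfl.customer_context (id, acm_start_dt, acm_end_dt, acm_record_ind, name)
    SELECT i.id, SYSUTCDATETIME(), '9999-12-31', 'A', i.name
    FROM   inserted AS i;
END;
```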
This architecture layer groups together all tasks related to file-based exchange:
The i-refactory orchestrates and executes a critical part of your data logistics. We designed our process taking the characteristics of physical deliveries into account. For this reason we identify the following main building blocks to manage your data logistics:
{list}
- A **Delivery Agreement** is an agreement for data exchange between a Tenant and the i-refactory data platform, targeted to/from one Layer.
- A **Tenant** is a party for a dataset or data record as agreed in the Delivery Agreement.
- A **Delivery** is a container that holds the specification of what is actually exchanged to/from the i-refactory platform.
- A **Layer** holds the logical model relevant for the Delivery (Agreement). In practice the layer specifies whether you deliver datasets (the layer is identified as Logical Validation Layer), opt for OLTP integration (the layer is identified as Generic Data Access Model) or opt to consume data in a physical file (the layer is identified as File Export).
The i-refactory orchestrates the data logistics for three different scenarios:
The delivery of datasets is targeted towards i-refactory's Logical Validation Layer. The delivery of datasets is a combination of:
- Inclusion of all or some entities related to the Logical Data Model, also known as a **Complete** or **Partial** Delivery. A Complete Delivery is a delivery that includes all entities of the corresponding Logical Data Model. A Partial Delivery is a delivery that includes some entities of the corresponding Logical Data Model.
- An indication per entity whether this entity represents a **Full** or a **Delta** of the provided snapshot.
A **Full** snapshot of the corresponding entity includes all records known at the snapshot datetime. All inserts, updates and logical deletes can be determined automatically by comparing the available current facts with the content of the Delivery.
A **Delta** snapshot of the corresponding entity includes only the changes since the previous successfully delivered facts. It must therefore include and indicate deleted records, while new and changed records can be identified automatically.
With the options as described above a Tenant can:
Context Based Delivery
Context Based Delivery is a special type of delivery that allows **Full** datasets within a given context to be processed by i-refactory. The main advantage is that the implicit delete detection (or physical delete process) works within context.
Example: assume we have a Logical Datamodel that holds the following entities: Sales Organization, related to N Sales Orders, each related to N Sales Order Lines. There are 5 regional Sales Organizations that individually provide their sales order details to this Logical Model. If we treat Sales Organization as the context, one Sales Organization can provide all its Sales Orders and related Sales Order Lines. The processing of Sales Orders and Sales Order Lines is treated as **Full** within their context, and thus implicit delete detection is performed automatically for this scenario.
The architecture layer **File Export** is used to support file-based data consumption.
For this architecture layer you can define Delivery Agreements and deliveries to load data into parquet files on your local filesystem or on your Azure Storage Account.
The event that triggers the outbound delivery is based upon an external process that uses our delivery-API.
Outbound deliveries can be configured to balance system resources in a similar manner as inbound deliveries. Monitoring of outbound deliveries is also done in the same overview where you monitor inbound deliveries.
You need to specify the following parameters in our API to trigger an outbound delivery:
The delivery of OLTP integration is targeted towards i-refactory's Generic Data Access Layer. This means OLTP integration can Create, Read, Update and Delete facts through the Generic Data Access Layer. From a delivery perspective this means you can activate deliveries to a Data Access Model, which results in Create, Read, Update and Delete actions orchestrated by the i-refactory solution.
The default pattern supported is a record-based process. For applications that deliver sets of records to be processed, additional configuration is required. Specific Business Rules or Controls other than primary key and relationship constraints can be included as part of the execution process as so-called Data Application Programming Interface (DAPI) controls. CRUD transactions process the current time perspective on data in a relational structure.
As a result of the temporal consistency we use a timestamp, provided as part of the Delivery, for the (technical) timelines throughout the relevant layers. The snapshot datetime of the source system or the datetime chosen by the user determines the timestamp used. Per entity a 'high water mark' is defined: the latest (most recent) timestamp processed for that entity. The logistical process won't accept a delivery set to a timestamp older than the high water mark.
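A minimal SQL sketch of this high water mark check; the tables `entity_high_water_mark` and `delivery_entity` are illustrative assumptions.

```sql
-- Illustrative sketch of the high water mark check (names are assumptions).
DECLARE @snapshot_dt DATETIME2 = '2024-06-01';

SELECT hwm.entity_id, hwm.high_water_mark
FROM   entity_high_water_mark AS hwm
JOIN   delivery_entity        AS de ON de.entity_id = hwm.entity_id  -- entities in this delivery
WHERE  hwm.high_water_mark > @snapshot_dt;  -- any row returned: the snapshot is older than the
                                            -- high water mark, so the delivery is not accepted
```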
The logistic process fully manages the **Traveling Now** to ensure a GDAL view always delivers a time-consistent view of the registered facts. This also enables processing facts in the CFPL in parallel with time-consistent consumption of data through the GDAL. This avoids the typical scenario of separating loading into night batches from daily data consumption, assuming the database is technically capable of orchestrating the workloads.
The i-refactory solution allows the user to influence the date used as timestamp by defining the timestamp as part of the Delivery. This provides an option to migrate historical timelines to your i-refactory managed data platform by providing data per time slice in chronological order. Through this method you can recreate your historical data perspectives with the existing data warehouse as a source system.
When performing a delivery you can opt for logical or physical deletes.
The default scenario is logical delete, which means that:
- the `acm_record_ind` from the deleted records is changed to 'R'
- a new record with this `acm_record_ind` is added
- the `acm_exists_ind` changes from 1 to 0

With i-refactory you have the possibility to serve a mixture of OLTP and OLAP scenarios in a fully automated manner. In order to support these scenarios, the modelling concepts need to balance availability versus quality. The Unified Anchor Modelling is an enhanced set of modelling patterns, evolved from Anchor Modelling and Data Vault. The modelling is closely aligned with the concepts of Fact Based Modelling.
We created the Unified Anchor Modelling to specify the modelling constructs of the Central Facts Layer. In order to be able to maintain full history, we split the business keys into **Anchors** and their context into **Context**. Context is functionally dependent on (“belongs to”) its Anchor. This context can consist of attributes describing the Anchor and/or relationships. This split is often referred to as ‘6th Normal Form’.
We keep track of existence for every Anchor: for each **Business Key** stored in the Anchor we track if that key was delivered in the last delivery. We assume that a business key is denoted as the primary key in a Logical Model.
All Anchors are ‘append only’: only new rows are added, a record in an Anchor will never be removed. By default we maintain history for all Context. Context can be logically or physically deleted.
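A minimal DDL sketch of this Anchor/Context split for a hypothetical Customer entity; column names and types are illustrative and do not follow the exact i-refactory conventions.

```sql
-- Illustrative Anchor/Context pair (6NF split) for a hypothetical Customer entity.
CREATE TABLE cfl.customer_anchor (
    id             BIGINT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key (PK)
    customer_nbr   VARCHAR(20) NOT NULL UNIQUE,       -- business key (AK)
    acm_exists_ind BIT         NOT NULL,              -- existence in the most recent delivery
    acm_audit_id   BIGINT      NOT NULL               -- process that created the record
);

CREATE TABLE cfl.customer_context (
    id             BIGINT       NOT NULL REFERENCES cfl.customer_anchor (id),
    acm_start_dt   DATETIME2    NOT NULL,  -- transaction timeline start
    acm_end_dt     DATETIME2    NOT NULL,  -- end-dated on update or logical delete
    acm_record_ind CHAR(1)      NOT NULL,  -- N (new), A (appended update), R (logical delete)
    name           VARCHAR(100) NULL,      -- descriptive attribute, functionally dependent on the key
    PRIMARY KEY (id, acm_start_dt)
);
```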
An **Independent Entity** is an entity that implements an Anchor for a Business Key that ‘stands alone’, i.e. that does not contain a reference to another Entity.
An Independent Entity contains Business Key fields, which show up as alternate key (AK), and the primary key (PK) is its surrogate key (ID).
{info} With the Logical Model as your main reference, an independent entity can be identified by the following rule:
Examples in TPCH: Customer, Order and Clerk.
Metadata stored for an Independent Entity:
- **Audit ID** - the process that created the record
- **Entity ID** - reference to the entity in the Data Definition Register
- **Existence** - indicates whether or not the key value is found (exists) in the most recent load of the entity
- **Latest Start Datetime** - the datetime at which the related context was last modified
- **Last Created Datetime** (= Last Existence Datetime) - the datetime at which the anchor most recently existed (existence = true)

A **Dependent Entity** is an entity that implements an Anchor for a Business Key that ‘depends’ in its existence on another Entity.
A Dependent Entity contains Business Key fields of which at least one is a foreign key (FK). The Business Keys will show up as alternate key (AK), and the primary key (PK) is its surrogate key (ID).
{info} With the Logical Model as your main reference, a dependent entity can be identified by the following rule:
Examples in TPCH: Orderline Item and Part Supplier.
The metadata stored for a Dependent Entity is equal to the metadata of an Independent Entity.
**Context** provides Context to an Anchor Entity and contains:
Examples in TPCH: Part Supplier S Properties and Order Line Item S Properties.
{info} A Context Entity will be populated with the attributes and relationships that are not part of the primary key.
The attributes in an entity in the Logical Model can be split over multiple Context Entities. Reasons to split one logical model entity into multiple Context Entities:
Metadata stored for a Context Entity:
- **Start datetime** of the transaction timeline - the transaction datetime uniquely identifies each change in the context. For batch processing this date is provided as part of the metadata of a delivery. For transactional processing the time is set when the data is committed.
- **End datetime** of the transaction timeline - this field will be end dated if the record is logically deleted or updated.
- **Audit ID** - the process that created the record.
- **Entity ID** - reference to the entity in the Data Definition Register.
- **Record Indicator** - indicates whether a record is a new record (N), the result of an update (A) or a logical delete (R).

The example above shows that the first record is a new record (‘N’) for the Anchor with ID = 1. The functional value is ‘aap’. For the same Anchor (with ID = 1) two more updates are performed: the functional value is changed to 'noot', followed by 'mies'. Both updates are marked as appended update (‘A’). The last record is still open because the End datetime is dated in the future. Records are logically closed by physically updating the end date with the datetime of the next record.
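The timeline described above could be represented as follows, assuming a hypothetical context table with the metadata columns listed earlier; the dates are illustrative.

```sql
-- Hypothetical reconstruction of the example timeline for Anchor ID = 1 (names and dates are assumptions).
-- Each change is appended; the previous version is closed by setting its end datetime
-- to the start datetime of the next record.
INSERT INTO cfl.example_context (id, acm_start_dt, acm_end_dt, acm_record_ind, value)
VALUES
    (1, '2023-01-01', '2023-02-01', 'N', 'aap'),   -- new record
    (1, '2023-02-01', '2023-03-01', 'A', 'noot'),  -- appended update
    (1, '2023-03-01', '9999-12-31', 'A', 'mies');  -- current record, end datetime in the future
```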
We only INSERT records; we deviate from this rule to:
**Immutable Context** is a Context Entity for which all attributes and relationships are ‘insert once’: once the context is provided, it will never change. As a result, no history is maintained.
As soon as context is provided for an Anchor, its context won’t be updated even if (changed) content is delivered. No errors or warnings are shown. Only when an update is provided via a CRUD transaction processing call will an error be returned.
{warning} Be aware: Updates by sources are ignored when processing Immutable Context.
In a Logical Model it is not visible whether context is immutable or not. This information should be specified by a modeller in the Central Facts Layer.
Examples of Immutable Context could be meter readings, or strict bookkeeping systems that allow no updates on financial transactions and thus enforce a reverse transaction instead.
A Bi-Temporal context is a context extended with two additional columns to store a valid timeline: Valid Start Datetime and Valid End Datetime. The Valid Start Datetime is part of the primary key, hence the primary key of a Bi-Temporal context is a composite key consisting of three columns: id, ACM Start Datetime, Valid Start Datetime.
A Bi-Temporal context cannot be used to track the existence of a key, for this you need to add a standard context (possibly a "dummy" context without any context columns).
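A minimal DDL sketch of a Bi-Temporal context for a hypothetical contract rate; names and types are illustrative.

```sql
-- Illustrative Bi-Temporal context table (not the exact i-refactory naming).
CREATE TABLE cfl.contract_rate_context (
    id             BIGINT        NOT NULL,   -- reference to the anchor
    acm_start_dt   DATETIME2     NOT NULL,   -- transaction timeline start
    acm_end_dt     DATETIME2     NOT NULL,   -- transaction timeline end
    valid_start_dt DATETIME2     NOT NULL,   -- Valid Start Datetime (part of the key)
    valid_end_dt   DATETIME2     NOT NULL,   -- Valid End Datetime
    rate           DECIMAL(10,2) NULL,       -- descriptive attribute
    PRIMARY KEY (id, acm_start_dt, valid_start_dt)  -- composite key of three columns
);
```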
**Multi-Active Context** is a Context Entity that contains a Multi-Active (Versioning) Key extending the business key of the associated Anchor.
{info} Each Multi-Active Business Key has its own context, which is functionally dependent on the business key in the Anchor extended by the Multi-Active Business Key.
This implies that one business key can have multiple context records at the same time.
Example: a customer has multiple types of phone numbers: “Home”, “Work” and “Mobile”. We add a Multi-Active Key “Phone Nbr Type”.
Let's assume your logical model specified the relation between customer and customer phone numbers as shown above. How can this be reflected in the CFL?
- **Multi-Active Context**
- **Dependent Entity** for Customer Phone with its own context

What alternative should you use?
{info} Rule of thumb: if we know for sure that there is no, and never will be, context for the multi-active key, we can create a Multi-Active Entity, especially when performance and/or volume are important considerations. Choosing an Anchor is more flexible but less efficient.
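As an illustration of the first alternative, a minimal DDL sketch of a Multi-Active Context for the phone number example; names and types are illustrative.

```sql
-- Illustrative Multi-Active Context for customer phone numbers (names are assumptions).
-- The multi-active key phone_nbr_type extends the key of the Customer anchor,
-- so one customer can have several open context records at the same time.
CREATE TABLE cfl.customer_phone_context (
    id             BIGINT      NOT NULL,   -- reference to the Customer anchor
    phone_nbr_type VARCHAR(10) NOT NULL,   -- multi-active (versioning) key: Home / Work / Mobile
    acm_start_dt   DATETIME2   NOT NULL,
    acm_end_dt     DATETIME2   NOT NULL,
    phone_nbr      VARCHAR(30) NULL,
    PRIMARY KEY (id, phone_nbr_type, acm_start_dt)
);
```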
**Alternate Key Context** is a Context Entity with an additional uniqueness constraint over its alternate key columns.
{info} An Alternate Key Context only holds Alternate Keys that are treated as unique. If you combine Alternate Keys in one Context, the combination is treated as unique.
A **Foreign Key Link** relates Anchor Context to another Anchor. It's a formalization of a foreign key relation as defined in your Logical DataModel.
The main benefits of Foreign Key Links are:
A **Subtype** forms a subset of the business keys of its **Supertype**. A subtype is an (in)dependent entity on which existence is maintained. As an additional step, any insert to the subtype creates a record in the supertype.
Subtypes are designated in your Logical Datamodel by the Subtype symbol. The subtype construct can be **Mutually exclusive** and/or **Complete**. Checks on these constraints should be part of the Logical Validation Model checks. No constraints are implemented in the CFL.
It is not necessary that every subtype is also physically implemented as a CFL subtype. Attributes of subtypes can be stored at their supertype. Formally a subtype describes a relation only.
A **Generalization** pattern defines a Supertype of more than one Anchor. This pattern generalizes Anchors with different Business Keys.
Generalization Patterns are relevant for use cases when:
This pattern is mainly used in a separate **Concept Integration Model**.
A **Key Root** integrates entities of different Models that share the same Business Key. The Key Root is a technical construct that manages the integration within the Central Facts Layer. Since a Key Root only manages the integration based on keys, a Key Root cannot have any context.