
DatasetManager

DatasetManager is a global manager to materialize datasets (tables and persistent views) right after a pipeline update.


Materialization is the process of publishing tables and persistent views to the session's TableCatalog and SessionCatalog, respectively.

Scala object

DatasetManager is an object in Scala which means it is a class that has exactly one instance (itself). A Scala object is created lazily when it is referenced for the first time.

Learn more in Tour of Scala.
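
A minimal Scala object (unrelated to Spark) that demonstrates the single-instance, lazy-initialization semantics, e.g. in the Scala REPL:

object Greeter {
  println("Greeter initialized")   // runs once, on the first reference
  def greet(name: String): String = s"Hello, $name"
}

Greeter.greet("Spark")   // prints "Greeter initialized", then returns "Hello, Spark"
Greeter.greet("again")   // no re-initialization; the single instance is reused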

Materialize Datasets

materializeDatasets(
  resolvedDataflowGraph: DataflowGraph,
  context: PipelineUpdateContext): DataflowGraph

materializeDatasets constructFullRefreshSet for the tables in the given DataflowGraph (and the PipelineUpdateContext).

materializeDatasets marks the tables (in the given DataflowGraph) that are to be refreshed and those that are to be fully refreshed.

materializeDatasets...FIXME

For every table to be materialized, materializeDatasets materializeTable.

In the end, materializeDatasets materializeViews.
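
A high-level sketch of the flow described above, with simplified stand-in types (this is not the actual implementation; how the refresh sets are computed is covered in constructFullRefreshSet below):

// Sketch only: the orchestration of materializeDatasets with stand-in types.
case class Table(name: String)
case class DataflowGraph(tables: Seq[Table])

def materializeDatasets(graph: DataflowGraph, fullRefreshNames: Set[String]): DataflowGraph = {
  // 1. constructFullRefreshSet: decide which tables to refresh and which to fully refresh
  val tablesToMaterialize = graph.tables
  // 2. materializeTable: publish every table's metadata to the table catalog
  tablesToMaterialize.foreach { table =>
    val isFullRefresh = fullRefreshNames.contains(table.name)
    println(s"Materializing metadata for table ${table.name} (isFullRefresh=$isFullRefresh)")
  }
  // 3. materializeViews: publish the persisted views
  println("Materializing persisted views")
  graph
}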


materializeDatasets is used when:

Materialize Table

materializeTable(
  resolvedDataflowGraph: DataflowGraph,
  table: Table,
  isFullRefresh: Boolean,
  context: PipelineUpdateContext): Table

materializeTable prints out the following INFO message to the logs:

Materializing metadata for table [identifier].

materializeTable uses the given PipelineUpdateContext to find the CatalogManager (in the SparkSession).

materializeTable finds the TableCatalog for the table.

materializeTable requests the TableCatalog to load the table if it already exists.

For an existing table, materializeTable wipes the data out (TRUNCATE TABLE) if isFullRefresh is enabled or the table is not streaming.

For an existing table, materializeTable requests the TableCatalog to alter the table if there are any changes in the schema or table properties.

If the table does not exist yet, materializeTable requests the TableCatalog to create the table.

In the end, materializeTable requests the TableCatalog to load the materialized table and returns the given Table back (with the normalized table storage path updated to the location property of the materialized table).
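
The decision logic for an existing table can be summarized with the following self-contained sketch (the parameter names are assumptions; this is not the actual code):

// Sketch only: what happens to an existing table during materialization.
def actionsForExistingTable(
    isFullRefresh: Boolean,
    isStreamingTable: Boolean,
    schemaOrPropertiesChanged: Boolean): Seq[String] = {
  val truncate =
    if (isFullRefresh || !isStreamingTable) Seq("TRUNCATE TABLE") else Nil
  val alter =
    if (schemaOrPropertiesChanged) Seq("ALTER TABLE") else Nil
  truncate ++ alter
}

actionsForExistingTable(isFullRefresh = true,  isStreamingTable = true,  schemaOrPropertiesChanged = false)
// Seq("TRUNCATE TABLE")
actionsForExistingTable(isFullRefresh = false, isStreamingTable = true,  schemaOrPropertiesChanged = true)
// Seq("ALTER TABLE")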

Materialize Views

materializeViews(
  virtualizedConnectedGraphWithTables: DataflowGraph,
  context: PipelineUpdateContext): Unit

materializeViews requests the given DataflowGraph for the persisted views to materialize (publish or refresh).

Publish (Materialize) Views

To publish a view, all of its input sources must already exist in the metastore. If a Persisted View reads from another Persisted View, the source view must be published first.

materializeViews...FIXME

For each view to be persisted (with no pending inputs), materializeViews materializeView (see the sketch below).
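
A self-contained sketch of the publish ordering (not Spark's code; the types are stand-ins): views are published in waves, and a view is only published once all of its inputs exist (either already in the metastore or published in an earlier wave).

case class PersistedView(name: String, inputs: Set[String])

def publishInOrder(views: Seq[PersistedView], alreadyInMetastore: Set[String]): Seq[String] = {
  var published = alreadyInMetastore
  var remaining = views
  val order = Seq.newBuilder[String]
  while (remaining.nonEmpty) {
    val (ready, pending) = remaining.partition(_.inputs.subsetOf(published))
    require(ready.nonEmpty, s"Cannot publish views with unresolved inputs: ${pending.map(_.name)}")
    ready.foreach(v => order += v.name)
    published ++= ready.map(_.name)
    remaining = pending
  }
  order.result()
}

publishInOrder(
  Seq(PersistedView("v_top", Set("v_base")), PersistedView("v_base", Set("t"))),
  alreadyInMetastore = Set("t"))
// Seq("v_base", "v_top")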

Materialize View

materializeView(
  view: View,
  flow: ResolvedFlow,
  spark: SparkSession): Unit

materializeView executes a CreateViewCommand logical command.


materializeView creates a CreateViewCommand logical command (as a PersistedView with allowExisting and replace flags enabled).

materializeView requests the given ResolvedFlow for the QueryContext to set the current catalog and current database, if defined, in the session CatalogManager.

In the end, materializeView executes the CreateViewCommand.
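
For illustration, the observable effect of materializeView can be approximated with public APIs as follows (the catalog, database, view, and table names are hypothetical; the actual code executes CreateViewCommand directly):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Current catalog and database from the flow's QueryContext, if defined
spark.catalog.setCurrentCatalog("spark_catalog")
spark.catalog.setCurrentDatabase("default")

// A persisted (metastore) view, created with "replace" behaviour
spark.sql("CREATE OR REPLACE VIEW my_view AS SELECT * FROM my_table")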

constructFullRefreshSet

constructFullRefreshSet(
  graphTables: Seq[Table],
  context: PipelineUpdateContext): (Seq[Table], Seq[TableIdentifier], Seq[TableIdentifier])

constructFullRefreshSet gives the following collections:

  • Tables to be refreshed (incl. a full refresh)
  • TableIdentifiers of the tables to be refreshed (excl. fully refreshed)
  • TableIdentifiers of the tables to be fully refreshed only

If there are tables to be fully refreshed yet not allowed for a full refresh, constructFullRefreshSet prints out the following INFO message to the logs:

Skipping full refresh on some tables because pipelines.reset.allowed was set to false.
Tables: [fullRefreshNotAllowed]

constructFullRefreshSet...FIXME
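
A self-contained sketch of how the three collections could be derived (stand-in types and the fullRefreshRequested and isResetAllowed names are assumptions; how the refresh request is resolved from the PipelineUpdateContext is simplified):

case class TableIdentifier(name: String)
case class Table(identifier: TableIdentifier, isResetAllowed: Boolean)

def constructFullRefreshSet(
    graphTables: Seq[Table],
    fullRefreshRequested: TableIdentifier => Boolean
  ): (Seq[Table], Seq[TableIdentifier], Seq[TableIdentifier]) = {
  val (fullRefreshCandidates, refreshOnly) =
    graphTables.partition(t => fullRefreshRequested(t.identifier))
  // Tables with pipelines.reset.allowed disabled cannot be fully refreshed
  val (fullRefreshAllowed, fullRefreshNotAllowed) =
    fullRefreshCandidates.partition(_.isResetAllowed)
  if (fullRefreshNotAllowed.nonEmpty) {
    println(
      "Skipping full refresh on some tables because pipelines.reset.allowed was set to false. " +
        s"Tables: ${fullRefreshNotAllowed.map(_.identifier.name).mkString(", ")}")
  }
  val toRefresh = refreshOnly ++ fullRefreshNotAllowed
  (toRefresh ++ fullRefreshAllowed,     // tables to be refreshed (incl. full refresh)
    toRefresh.map(_.identifier),        // refreshed only (excl. fully refreshed)
    fullRefreshAllowed.map(_.identifier))  // fully refreshed only
}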


constructFullRefreshSet is used when:

  • DatasetManager is requested to materialize datasets