DatasetManager¶
DatasetManager is a global manager to materialize datasets (tables and persistent views) right after a pipeline update.
Materialization is the process of publishing datasets to the session catalogs: tables to the TableCatalog and persistent views to the SessionCatalog.
Scala object
DatasetManager is an object in Scala, which means it is a class that has exactly one instance (itself). A Scala object is created lazily, when it is referenced for the first time.
Learn more in Tour of Scala.
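A minimal sketch of the idea (the Registry name below is hypothetical, not part of the pipelines codebase): the body of a Scala object runs exactly once, on first reference.

```scala
// A Scala object is a lazily-initialized singleton: its body runs once,
// the first time the object is referenced.
object Registry {
  println("Registry initialized") // executed exactly once
  private var entries = List.empty[String]
  def register(name: String): Unit = entries = name :: entries
  def all: List[String] = entries.reverse
}

Registry.register("tableA") // first reference triggers initialization
Registry.register("tableB")
println(Registry.all) // List(tableA, tableB)
```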
Materialize Datasets¶
materializeDatasets(
resolvedDataflowGraph: DataflowGraph,
context: PipelineUpdateContext): DataflowGraph
materializeDatasets constructFullRefreshSet for the tables in the given DataflowGraph (and the PipelineUpdateContext), marking the tables to be refreshed and the tables to be fully refreshed.
materializeDatasets
...FIXME
For every table to be materialized, materializeDatasets executes materializeTable.
In the end, materializeDatasets executes materializeViews.
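The overall flow can be sketched as follows. This is a hypothetical outline with stub types; Table, DataflowGraph and the helper signatures below are stand-ins, not the actual Spark Declarative Pipelines types.

```scala
// Stub types standing in for the real DataflowGraph and Table
case class Table(name: String)
case class DataflowGraph(tables: Seq[Table])

def materializeTable(table: Table, isFullRefresh: Boolean): Table = {
  println(s"Materializing metadata for table ${table.name} (fullRefresh=$isFullRefresh)")
  table
}

def materializeViews(graph: DataflowGraph): Unit =
  println("Materializing persisted views")

def materializeDatasets(graph: DataflowGraph, fullRefresh: Set[String]): DataflowGraph = {
  // Step 1: the refresh scope would be decided here (constructFullRefreshSet)
  // Step 2: materialize every table to be materialized
  val tables = graph.tables.map(t => materializeTable(t, fullRefresh(t.name)))
  // Step 3: in the end, materialize the persisted views
  materializeViews(graph)
  DataflowGraph(tables)
}
```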
materializeDatasets is used when:

- PipelineExecution is requested to initialize the dataflow graph
Materialize Table¶
materializeTable(
resolvedDataflowGraph: DataflowGraph,
table: Table,
isFullRefresh: Boolean,
context: PipelineUpdateContext): Table
materializeTable
prints out the following INFO message to the logs:
Materializing metadata for table [identifier].
materializeTable
uses the given PipelineUpdateContext to find the CatalogManager (in the SparkSession).
materializeTable
finds the TableCatalog for the table.
materializeTable requests the TableCatalog to load the table if it already exists.
For an existing table, materializeTable wipes the data out (TRUNCATE TABLE) if isFullRefresh is enabled or the table is not streaming.
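The truncate decision can be sketched as a predicate (logic inferred from the description above; the function name is an assumption):

```scala
// An existing table is truncated on a full refresh,
// or when it is a batch (non-streaming) table.
def shouldTruncate(isFullRefresh: Boolean, isStreamingTable: Boolean): Boolean =
  isFullRefresh || !isStreamingTable

assert(shouldTruncate(isFullRefresh = true, isStreamingTable = true))   // full refresh wipes data
assert(shouldTruncate(isFullRefresh = false, isStreamingTable = false)) // batch tables are rewritten
assert(!shouldTruncate(isFullRefresh = false, isStreamingTable = true)) // streaming tables keep data
```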
For an existing table, materializeTable
requests the TableCatalog
to alter the table if there are any changes in the schema or table properties.
Unless created already, materializeTable
requests the TableCatalog
to create the table.
In the end, materializeTable
requests the TableCatalog
to load the materialized table and returns the given Table back (with the normalized table storage path updated to the location
property of the materialized table).
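The create-or-alter steps above amount to a three-way decision. A sketch with stub types (the names below are assumptions, not the actual internals):

```scala
// What materializeTable does to the catalog entry of a table (assumed logic)
sealed trait CatalogAction
case object CreateTable extends CatalogAction
case object AlterTable  extends CatalogAction
case object NoChange    extends CatalogAction

def catalogAction(tableExists: Boolean, schemaOrPropertiesChanged: Boolean): CatalogAction =
  if (!tableExists) CreateTable                  // unless created already, create the table
  else if (schemaOrPropertiesChanged) AlterTable // alter on schema or table-property changes
  else NoChange
```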
Materialize Views¶
materializeViews(
virtualizedConnectedGraphWithTables: DataflowGraph,
context: PipelineUpdateContext): Unit
materializeViews
requests the given DataflowGraph for the persisted views to materialize (publish or refresh).
Publish (Materialize) Views
To publish a view, all of its input sources must already exist in the metastore. If a persisted view reads another persisted view, the source view must be published first.
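The "source must be published first" constraint amounts to publishing views in dependency order. A hypothetical sketch (View and publishOrder are stand-ins, not the actual types):

```scala
// A view with the names of the persisted views it reads from
case class View(name: String, inputs: Set[String])

// Publish views so that a view is published only after all of its
// persisted-view inputs have been published (a simple topological order).
def publishOrder(views: Seq[View]): Seq[String] = {
  val viewNames = views.map(_.name).toSet
  var published = Set.empty[String]
  var order     = Vector.empty[String]
  var pending   = views
  while (pending.nonEmpty) {
    // A view is ready when none of its persisted-view inputs is still pending
    val (ready, blocked) = pending.partition(v => (v.inputs & viewNames).subsetOf(published))
    require(ready.nonEmpty, s"Cyclic dependency among views: ${blocked.map(_.name)}")
    published ++= ready.map(_.name)
    order     ++= ready.map(_.name)
    pending = blocked
  }
  order
}
```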
materializeViews
...FIXME
For each view to be persisted (with no pending inputs), materializeViews materializes the view.
Materialize View¶
materializeView(
view: View,
flow: ResolvedFlow,
spark: SparkSession): Unit
materializeView
executes a CreateViewCommand logical command.
materializeView creates a CreateViewCommand logical command (as a PersistedView with the allowExisting and replace flags enabled).
materializeView
requests the given ResolvedFlow for the QueryContext to set the current catalog and current database, if defined, in the session CatalogManager.
In the end, materializeView
executes the CreateViewCommand.
constructFullRefreshSet¶
constructFullRefreshSet(
graphTables: Seq[Table],
context: PipelineUpdateContext): (Seq[Table], Seq[TableIdentifier], Seq[TableIdentifier])
constructFullRefreshSet gives the following collections:

- Tables to be refreshed (incl. a full refresh)
- TableIdentifiers of the tables to be refreshed (excl. fully refreshed)
- TableIdentifiers of the tables to be fully refreshed only
If there are tables to be fully refreshed yet not allowed for a full refresh, constructFullRefreshSet
prints out the following INFO message to the logs:
Skipping full refresh on some tables because pipelines.reset.allowed was set to false.
Tables: [fullRefreshNotAllowed]
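The three collections and the "full refresh not allowed" case can be sketched as a double partition. This is assumed logic with stub types (Tbl and the boolean flags are stand-ins for the actual Table and pipeline settings):

```scala
// A table with its requested refresh mode and whether a reset (full refresh) is allowed
case class Tbl(name: String, fullRefreshRequested: Boolean, resetAllowed: Boolean)

def constructFullRefreshSet(tables: Seq[Tbl]): (Seq[Tbl], Seq[String], Seq[String]) = {
  // Split off the tables for which a full refresh was requested
  val (requested, regular)      = tables.partition(_.fullRefreshRequested)
  // Tables not allowed to reset are downgraded to a regular refresh
  val (fullRefresh, notAllowed) = requested.partition(_.resetAllowed)
  if (notAllowed.nonEmpty)
    println("Skipping full refresh on some tables because pipelines.reset.allowed " +
      s"was set to false. Tables: ${notAllowed.map(_.name).mkString(", ")}")
  val refreshOnly = regular ++ notAllowed      // refreshed, but not fully
  val all         = refreshOnly ++ fullRefresh // everything to be refreshed (incl. full)
  (all, refreshOnly.map(_.name), fullRefresh.map(_.name))
}
```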
constructFullRefreshSet
...FIXME
constructFullRefreshSet is used when:

- PipelineExecution is requested to initialize the dataflow graph