DatasetManager¶
Scala object
DatasetManager
is an object
in Scala which means it is a class that has exactly one instance (itself). A Scala object
is created lazily when it is referenced for the first time.
Learn more in Tour of Scala.
Materialize Datasets¶
materializeDatasets(
resolvedDataflowGraph: DataflowGraph,
context: PipelineUpdateContext): DataflowGraph
materializeDatasets
constructFullRefreshSet for the tables in the given DataflowGraph (and the PipelineUpdateContext).
materializeDatasets
marks the tables (in the given DataflowGraph) as to be refreshed and fully-refreshed.
materializeDatasets
...FIXME
For every table to be materialized, materializeDatasets
materializeTable.
materializeDatasets
is used when:
PipelineExecution
is requested to initialize the dataflow graph
Materialize Table¶
materializeTable(
resolvedDataflowGraph: DataflowGraph,
table: Table,
isFullRefresh: Boolean,
context: PipelineUpdateContext): Table
materializeTable
prints out the following INFO message to the logs:
Materializing metadata for table [identifier].
materializeTable
uses the given PipelineUpdateContext to find the CatalogManager (in the SparkSession).
materializeTable
finds the TableCatalog for the table.
materializeTable
requests the TableCatalog
to load the table if exists already.
For an existing table, materializeTable
wipes data out (TRUNCATE TABLE
) if it isisFullRefresh
or the table is not streaming.
For an existing table, materializeTable
requests the TableCatalog
to alterTable if there are any changes in the schema or table properties.
Unless created already, materializeTable
requests the TableCatalog
to create the table.
In the end, materializeTable
requests the TableCatalog
to load the materialized table and returns the given Table back (with the normalized table storage path updated to the location
property of the materialized table).
constructFullRefreshSet¶
constructFullRefreshSet(
graphTables: Seq[Table],
context: PipelineUpdateContext): (Seq[Table], Seq[TableIdentifier], Seq[TableIdentifier])
constructFullRefreshSet
gives the following collections:
- Tables to be refreshed (incl. a full refresh)
TableIdentifier
s of the tables to be refreshed (excl. fully refreshed)TableIdentifier
s of the tables to be fully refreshed only
If there are tables to be fully refreshed yet not allowed for a full refresh, constructFullRefreshSet
prints out the following INFO message to the logs:
Skipping full refresh on some tables because pipelines.reset.allowed was set to false.
Tables: [fullRefreshNotAllowed]
constructFullRefreshSet
...FIXME
constructFullRefreshSet
is used when:
PipelineExecution
is requested to initialize the dataflow graph