Skip to content

DatasetManager

Scala object

DatasetManager is an object in Scala which means it is a class that has exactly one instance (itself). A Scala object is created lazily when it is referenced for the first time.

Learn more in Tour of Scala.

Materialize Datasets

materializeDatasets(
  resolvedDataflowGraph: DataflowGraph,
  context: PipelineUpdateContext): DataflowGraph

materializeDatasets constructFullRefreshSet for the tables in the given DataflowGraph (and the PipelineUpdateContext).

materializeDatasets marks the tables (in the given DataflowGraph) as to be refreshed and fully-refreshed.

materializeDatasets...FIXME

For every table to be materialized, materializeDatasets materializeTable.


materializeDatasets is used when:

Materialize Table

materializeTable(
  resolvedDataflowGraph: DataflowGraph,
  table: Table,
  isFullRefresh: Boolean,
  context: PipelineUpdateContext): Table

materializeTable prints out the following INFO message to the logs:

Materializing metadata for table [identifier].

materializeTable uses the given PipelineUpdateContext to find the CatalogManager (in the SparkSession).

materializeTable finds the TableCatalog for the table.

materializeTable requests the TableCatalog to load the table if exists already.

For an existing table, materializeTable wipes data out (TRUNCATE TABLE) if it isisFullRefresh or the table is not streaming.

For an existing table, materializeTable requests the TableCatalog to alterTable if there are any changes in the schema or table properties.

Unless created already, materializeTable requests the TableCatalog to create the table.

In the end, materializeTable requests the TableCatalog to load the materialized table and returns the given Table back (with the normalized table storage path updated to the location property of the materialized table).

constructFullRefreshSet

constructFullRefreshSet(
  graphTables: Seq[Table],
  context: PipelineUpdateContext): (Seq[Table], Seq[TableIdentifier], Seq[TableIdentifier])

constructFullRefreshSet gives the following collections:

  • Tables to be refreshed (incl. a full refresh)
  • TableIdentifiers of the tables to be refreshed (excl. fully refreshed)
  • TableIdentifiers of the tables to be fully refreshed only

If there are tables to be fully refreshed yet not allowed for a full refresh, constructFullRefreshSet prints out the following INFO message to the logs:

Skipping full refresh on some tables because pipelines.reset.allowed was set to false.
Tables: [fullRefreshNotAllowed]

constructFullRefreshSet...FIXME


constructFullRefreshSet is used when: