InsertIntoHadoopFsRelationCommand Logical Command¶
InsertIntoHadoopFsRelationCommand takes the following to be created:
- Output Path (as a Hadoop Path)
- Static Partitions
- Partition Columns (as Seq[Attribute])
- BucketSpec if defined
- Options (as Map[String, String])
- CatalogTable if available
- FileIndex if defined
- Names of the output columns
InsertIntoHadoopFsRelationCommand is created when:
- OptimizedCreateHiveTableAsSelectCommand logical command is executed
- DataSource is requested to planForWritingFileFormat
- DataSourceAnalysis logical resolution rule is executed (for an InsertIntoStatement over a LogicalRelation with a HadoopFsRelation)
staticPartitions: TablePartitionSpec
type TablePartitionSpec = Map[String, String]
InsertIntoHadoopFsRelationCommand is given a specification of a table partition (as a mapping of column names to column values) when created.
There are no static partitions when InsertIntoHadoopFsRelationCommand is created for the following:
- DataSource when requested to planForWritingFileFormat
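As an illustration, a static partition specification is simply a column-name-to-value map. A minimal sketch in plain Scala (the INSERT statement in the comment and the StaticPartitionsDemo name are only for context, not Spark code):

```scala
object StaticPartitionsDemo {
  // Mirrors Spark SQL's alias: type TablePartitionSpec = Map[String, String]
  type TablePartitionSpec = Map[String, String]

  // e.g. INSERT INTO t PARTITION (year = 2023, month = 7) ... pins both
  // partition columns to literal values (an all-static partition spec).
  val staticPartitions: TablePartitionSpec = Map(
    "year"  -> "2023",
    "month" -> "7")

  def main(args: Array[String]): Unit = {
    // Every static partition maps a partition column name to a value.
    assert(staticPartitions("year") == "2023")
    assert(staticPartitions.size == 2)
    println("ok")
  }
}
```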
Dynamic Partition Inserts and dynamicPartitionOverwrite Flag¶
InsertIntoHadoopFsRelationCommand defines a dynamicPartitionOverwrite flag to indicate whether dynamic partition inserts are enabled or not.
dynamicPartitionOverwrite is based on the following (in the order of precedence):
- partitionOverwriteMode option (DYNAMIC) in the parameters if available
- spark.sql.sources.partitionOverwriteMode configuration property
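The option-over-configuration precedence can be sketched in plain Scala (resolveMode below is a made-up helper for illustration, not Spark's code):

```scala
object PartitionOverwriteModeDemo {
  // Hypothetical helper: a per-write "partitionOverwriteMode" option, if
  // present, takes precedence over the session-wide
  // spark.sql.sources.partitionOverwriteMode configuration property.
  def resolveMode(
      options: Map[String, String],
      sessionConf: String): String = {
    options
      .collectFirst {
        case (k, v) if k.equalsIgnoreCase("partitionOverwriteMode") => v
      }
      .getOrElse(sessionConf)
      .toUpperCase
  }

  def main(args: Array[String]): Unit = {
    // The write option wins over the configuration property.
    assert(resolveMode(Map("partitionOverwriteMode" -> "dynamic"), "STATIC") == "DYNAMIC")
    // Without the option, the configuration property is used.
    assert(resolveMode(Map.empty, "STATIC") == "STATIC")
    println("ok")
  }
}
```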
dynamicPartitionOverwrite is used when:
- DataSourceAnalysis logical resolution rule is executed (for dynamic partition overwrite)
Executing Command¶
run(
  sparkSession: SparkSession,
  child: SparkPlan): Seq[Row]
run is part of the DataWritingCommand abstraction.
run uses the following to determine whether or not partitions are tracked by a catalog:
- spark.sql.hive.manageFilesourcePartitions configuration property
- catalogTable is defined with the partitionColumnNames and the tracksPartitionsInCatalog flag enabled
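The two conditions above amount to a boolean conjunction. A simplified sketch (CatalogTableLite and partitionsTrackedByCatalog are stand-ins for illustration, not Spark's classes):

```scala
object PartitionTrackingDemo {
  // Simplified, hypothetical stand-in for Spark's CatalogTable.
  final case class CatalogTableLite(
      partitionColumnNames: Seq[String],
      tracksPartitionsInCatalog: Boolean)

  // Partitions are tracked by a catalog only when filesource partition
  // management is on AND the table is defined, partitioned, and flagged
  // as tracking its partitions in the catalog.
  def partitionsTrackedByCatalog(
      manageFilesourcePartitions: Boolean, // spark.sql.hive.manageFilesourcePartitions
      catalogTable: Option[CatalogTableLite]): Boolean = {
    manageFilesourcePartitions &&
      catalogTable.exists { t =>
        t.partitionColumnNames.nonEmpty && t.tracksPartitionsInCatalog
      }
  }

  def main(args: Array[String]): Unit = {
    val t = CatalogTableLite(Seq("year"), tracksPartitionsInCatalog = true)
    assert(partitionsTrackedByCatalog(manageFilesourcePartitions = true, Some(t)))
    assert(!partitionsTrackedByCatalog(manageFilesourcePartitions = false, Some(t)))
    assert(!partitionsTrackedByCatalog(manageFilesourcePartitions = true, None))
    println("ok")
  }
}
```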
When is the catalogTable defined?
When is tracksPartitionsInCatalog enabled?
With partitions tracked by a catalog, run requests the session catalog for the partitions that match the static partition specification.
run requests the FileCommitProtocol utility to instantiate a FileCommitProtocol based on the spark.sql.sources.commitProtocolClass configuration property (with a random job ID, the outputPath and the dynamicPartitionOverwrite flag).
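The committer is loaded by class name, so the idea can be sketched with plain JVM reflection (DemoProtocol and instantiate below are made-up illustrations, not Spark's FileCommitProtocol code):

```scala
// Made-up stand-in for a commit protocol implementation whose
// constructor takes a job ID and an output path.
class DemoProtocol(val jobId: String, val outputPath: String)

object CommitProtocolDemo {
  // Sketch of reflective instantiation from a configured class name, in
  // the spirit of spark.sql.sources.commitProtocolClass (NOT Spark's code).
  def instantiate(className: String, jobId: String, outputPath: String): AnyRef = {
    val ctor = Class.forName(className).getConstructor(classOf[String], classOf[String])
    ctor.newInstance(jobId, outputPath).asInstanceOf[AnyRef]
  }

  def main(args: Array[String]): Unit = {
    val committer = instantiate("DemoProtocol", "job-0", "/tmp/out")
    assert(committer.isInstanceOf[DemoProtocol])
    assert(committer.asInstanceOf[DemoProtocol].outputPath == "/tmp/out")
    println("ok")
  }
}
```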
For an insertion, run writes the result of the structured query out to the outputPath (using the FileFormatWriter utility).
Otherwise (for a non-insertion case), run does nothing but prints out the following INFO message to the logs and finishes:
Skipping insertion into a relation that already exists.
run makes sure that there are no duplicates in the outputColumnNames.
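That duplicate check can be sketched as follows (checkNoDuplicates is a hypothetical helper, not Spark's SchemaUtils):

```scala
object OutputColumnsDemo {
  // Hypothetical helper: fail when any output column name appears more
  // than once (compared case-insensitively here, mirroring Spark's
  // default case-insensitive resolution).
  def checkNoDuplicates(outputColumnNames: Seq[String]): Unit = {
    val dups = outputColumnNames
      .groupBy(_.toLowerCase)
      .collect { case (_, names) if names.size > 1 => names.head }
    require(dups.isEmpty, s"Found duplicate column(s): ${dups.mkString(", ")}")
  }

  def main(args: Array[String]): Unit = {
    checkNoDuplicates(Seq("id", "name")) // unique names pass
    val failed =
      try { checkNoDuplicates(Seq("id", "ID")); false }
      catch { case _: IllegalArgumentException => true }
    assert(failed) // "id" vs "ID" is a duplicate case-insensitively
    println("ok")
  }
}
```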