FileWrite¶

FileWrite is an extension of the Write abstraction for file writers.

Contract¶

Format Name¶

formatName: String

See:

ParquetWrite

Used when:

FileWrite is requested for the description and to validateInputs

LogicalWriteInfo¶

info: LogicalWriteInfo

See:

ParquetWrite

Used when:

FileWrite is requested for the schema, the queryId and the options

paths¶

paths: Seq[String]

See:

ParquetWrite

Used when:

FileWrite is requested for a BatchWrite and to validateInputs
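
The single-path expectation can be sketched in plain Scala. This is a model, not Spark code: `requireSinglePath` is a hypothetical helper, and the sketch assumes that validation rejects anything other than exactly one path (consistent with a BatchWrite being created for a single path only).

```scala
object SinglePathCheck {
  // Hypothetical helper (not part of Spark): returns the only path,
  // or fails when zero or several paths were given.
  def requireSinglePath(paths: Seq[String]): String = {
    if (paths.length != 1) {
      throw new IllegalArgumentException(
        s"Expected exactly one path to be specified, but got: ${paths.mkString(", ")}")
    }
    paths.head
  }
}
```

With such a check in place, `SinglePathCheck.requireSinglePath(Seq("/tmp/out"))` returns the path, while an empty or multi-element sequence fails fast before any job is set up.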

Preparing Write Job¶

prepareWrite(
  sqlConf: SQLConf,
  job: Job,
  options: Map[String, String],
  dataSchema: StructType): OutputWriterFactory

Prepares a write job and returns an OutputWriterFactory.

See:

ParquetWrite

Used when:

FileWrite is requested for a BatchWrite (and creates a WriteJobDescription)

supportsDataType¶

supportsDataType: DataType => Boolean

See:

ParquetWrite

Used when:

FileWrite is requested to validateInputs
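
A predicate of this shape might be used during validation roughly as follows. The sketch is plain Scala with stand-in types: the `DataType` hierarchy and `Field` below are simplified assumptions rather than Spark's classes, and the rule itself (no interval columns) belongs to a hypothetical format.

```scala
object SupportsDataTypeSketch {
  // Stand-ins for Spark's data types (assumption: heavily simplified).
  sealed trait DataType
  case object StringType extends DataType
  case object IntegerType extends DataType
  case object CalendarIntervalType extends DataType
  final case class ArrayType(elementType: DataType) extends DataType

  // Stand-in for a schema field.
  final case class Field(name: String, dataType: DataType)

  // Same shape as the contract member: DataType => Boolean.
  // The hypothetical format rejects intervals, also when nested in arrays.
  def supportsDataType: DataType => Boolean = {
    case CalendarIntervalType   => false
    case ArrayType(elementType) => supportsDataType(elementType)
    case _                      => true
  }

  // Names of the fields that the format cannot write.
  def unsupportedFields(schema: Seq[Field]): Seq[String] =
    schema.filterNot(field => supportsDataType(field.dataType)).map(_.name)
}
```

A validation step could then fail the write whenever `unsupportedFields` is non-empty and report the format name together with the offending fields.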

Implementations¶

AvroWrite
CSVWrite
JsonWrite
OrcWrite
ParquetWrite
TextWrite
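
How an implementation fills in the contract can be illustrated with a reduced model. Everything below is a sketch under assumptions: `FileWriteLike`, `DemoWrite` and the `"demo"` format are hypothetical names, `LogicalWriteInfo` is a three-field stand-in, data types are represented by their names, and `prepareWrite` is left out to keep the example free of Hadoop types.

```scala
object FileWriteContractSketch {
  // Stand-in for Spark's LogicalWriteInfo (assumption: reduced to three fields,
  // with the schema flattened to column name -> type name pairs).
  final case class LogicalWriteInfo(
      queryId: String,
      schema: Seq[(String, String)],
      options: Map[String, String])

  // Simplified model of the contract described on this page.
  trait FileWriteLike {
    def formatName: String
    def info: LogicalWriteInfo
    def paths: Seq[String]
    def supportsDataType: String => Boolean

    // Derived members, following the "Used when" notes.
    // Assumption: the description is simply the format name.
    def description: String = formatName
    def schema: Seq[(String, String)] = info.schema
    def queryId: String = info.queryId
    def options: Map[String, String] = info.options
  }

  // Hypothetical implementation; "demo" is not a real Spark format.
  final case class DemoWrite(paths: Seq[String], info: LogicalWriteInfo)
      extends FileWriteLike {
    val formatName: String = "demo"
    val supportsDataType: String => Boolean = _ != "interval"
  }
}
```

The point of the design is that a concrete writer only supplies format-specific pieces, while the shared members (description, schema, query ID, options) are derived once in the base abstraction.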

Creating BatchWrite¶

toBatch: BatchWrite

toBatch is part of the Write abstraction.

toBatch validates the inputs (validateInputs).

toBatch creates a new Hadoop Job for just the first of the paths.

toBatch creates a FileCommitProtocol (Spark Core) with the following:

The class name from the spark.sql.sources.commitProtocolClass configuration property
A random job ID
The first of the paths

toBatch creates a WriteJobDescription.

toBatch requests the FileCommitProtocol to setupJob (with the Hadoop Job instance).

In the end, toBatch creates a FileBatchWrite (for the Hadoop Job, the WriteJobDescription and the FileCommitProtocol).
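
The order of the steps above can be modelled in plain Scala. This is a sketch, not the Spark implementation: `Job`, `CommitProtocol`, `WriteJobDescription` and `FileBatchWrite` are simplified stand-ins for the Hadoop and Spark classes of similar names, input validation is reduced to a single-path check (an assumption), and the random job ID is assumed to be a UUID string.

```scala
import java.util.UUID

object ToBatchSketch {
  // Stand-ins for the Hadoop and Spark types (assumptions, not the real classes).
  final case class Job(outputPath: String)
  final case class WriteJobDescription(path: String)
  final class CommitProtocol(
      val className: String,
      val jobId: String,
      val outputPath: String) {
    var jobSetUp: Boolean = false
    def setupJob(job: Job): Unit = {
      jobSetUp = true
    }
  }
  final case class FileBatchWrite(
      job: Job,
      description: WriteJobDescription,
      committer: CommitProtocol)

  def toBatch(paths: Seq[String], commitProtocolClass: String): FileBatchWrite = {
    // 1. Validate the inputs (reduced here to the single-path check).
    require(paths.length == 1, s"Expected exactly one path, but got: ${paths.mkString(", ")}")
    val path = paths.head
    // 2. Create a job for the first path.
    val job = Job(path)
    // 3. Create a commit protocol from the configured class name,
    //    a random job ID and the first path.
    val committer = new CommitProtocol(commitProtocolClass, UUID.randomUUID().toString, path)
    // 4. Create the write job description.
    val description = WriteJobDescription(path)
    // 5. Let the commit protocol set up the job.
    committer.setupJob(job)
    // 6. Bundle everything into the batch write.
    FileBatchWrite(job, description, committer)
  }
}
```

In this model the commit protocol is set up before the batch write is handed out, so whoever receives the `FileBatchWrite` can assume the job has already been initialized.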

Creating WriteJobDescription¶

createWriteJobDescription(
  sparkSession: SparkSession,
  hadoopConf: Configuration,
  job: Job,
  pathName: String,
  options: Map[String, String]): WriteJobDescription

createWriteJobDescription...FIXME