SingleDirectoryDataWriter¶
SingleDirectoryDataWriter is a FileFormatDataWriter used by FileFormatWriter and FileWriterFactory.
Creating Instance¶
SingleDirectoryDataWriter takes the following to be created:

- WriteJobDescription
- Hadoop TaskAttemptContext
- FileCommitProtocol (Spark Core)
- Custom SQLMetrics by name (`Map[String, SQLMetric]`)
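Based on the Spark sources (the exact signature varies across versions), the constructor declaration looks roughly like:

```scala
class SingleDirectoryDataWriter(
    description: WriteJobDescription,
    taskAttemptContext: TaskAttemptContext,
    committer: FileCommitProtocol,
    customMetrics: Map[String, SQLMetric] = Map.empty)
  extends FileFormatDataWriter(
    description, taskAttemptContext, committer, customMetrics)
```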
While being created, SingleDirectoryDataWriter creates a new OutputWriter.
SingleDirectoryDataWriter is created when:
- FileFormatWriter is requested to write data out (in a single Spark task) (of a non-partitioned, non-bucketed write job)
- FileWriterFactory is requested for a DataWriter (of a non-partitioned write job)
recordsInFile Counter¶
SingleDirectoryDataWriter uses recordsInFile counter to track how many records have been written out.
The recordsInFile counter is reset to 0 every time SingleDirectoryDataWriter creates a new OutputWriter, and is incremented for every record written out (up to the maxRecordsPerFile threshold, if defined).
Writing Record Out¶
FileFormatDataWriter
write(
record: InternalRow): Unit
write is part of the FileFormatDataWriter abstraction.
write creates a new OutputWriter when maxRecordsPerFile (of the WriteJobDescription) is positive and the recordsInFile counter has reached that threshold.
write requests the current OutputWriter to write the record and informs the WriteTaskStatsTrackers that there was a new row.
write increments the recordsInFile.
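The steps above can be simulated in a self-contained sketch (stand-in types, not the actual Spark classes): once the recordsInFile counter reaches a positive maxRecordsPerFile, records roll over to a new file.

```scala
import scala.collection.mutable.ArrayBuffer

class DemoWriter(maxRecordsPerFile: Long) {
  private var recordsInFile = 0L
  // each inner buffer stands in for one output file
  val files = ArrayBuffer(ArrayBuffer.empty[String])

  def write(record: String): Unit = {
    if (maxRecordsPerFile > 0 && recordsInFile >= maxRecordsPerFile) {
      recordsInFile = 0                   // "newOutputWriter": reset the counter
      files += ArrayBuffer.empty[String]  // and start a new file
    }
    files.last += record  // the current OutputWriter writes the record
    recordsInFile += 1    // a WriteTaskStatsTracker would see a new row here
  }
}

val w = new DemoWriter(2)
(1 to 5).foreach(i => w.write(s"r$i"))
println(w.files.map(_.size))  // three files with 2 + 2 + 1 records
```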
Creating New OutputWriter¶
newOutputWriter(): Unit
newOutputWriter sets the recordsInFile counter to 0.
newOutputWriter releases resources (closing the current OutputWriter, if any).
newOutputWriter uses the given WriteJobDescription to access the OutputWriterFactory for a file extension (ext).
newOutputWriter requests the given FileCommitProtocol for the path of a new data file (with a -c[fileCounter][ext] suffix, the fileCounter padded to three digits).
newOutputWriter uses the given WriteJobDescription to access the OutputWriterFactory for a new OutputWriter.
newOutputWriter informs the WriteTaskStatsTrackers that a new file is about to be written.
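The suffix of the new data file follows the `f"-c$fileCounter%03d" + ext` pattern of the Spark sources; a minimal sketch (`dataFileSuffix` is a hypothetical helper, not a Spark method):

```scala
// Build the data-file suffix: fileCounter zero-padded to three digits,
// followed by the file format's extension.
def dataFileSuffix(fileCounter: Int, ext: String): String =
  f"-c$fileCounter%03d" + ext

println(dataFileSuffix(0, ".parquet"))  // -c000.parquet
println(dataFileSuffix(12, ".csv"))     // -c012.csv
```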
newOutputWriter is used when:

- SingleDirectoryDataWriter is created
- SingleDirectoryDataWriter is requested to write a record out (with the maxRecordsPerFile threshold reached)