# PartitionedFileUtil
When requested for the split files of a file (an Apache Hadoop `FileStatus`), `PartitionedFileUtil` uses the `isSplitable` property of a `FileFormat` to create one or more `PartitionedFile`s.

Only when splitable is a file divided into as many `PartitionedFile`s as there are parts of `maxSplitBytes` size; otherwise the whole file becomes a single `PartitionedFile`.
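As a quick, made-up illustration (the sizes below are assumptions, not from the source), the split count for a splitable file is the file length divided by `maxSplitBytes`, rounded up:

```scala
// Illustrative arithmetic only (hypothetical sizes, not Spark source code):
// a splitable file yields ceil(fileLength / maxSplitBytes) PartitionedFiles,
// while a non-splitable file always yields exactly one.
object SplitCountSketch extends App {
  val fileLength    = 350L * 1024 * 1024 // hypothetical 350 MB file
  val maxSplitBytes = 128L * 1024 * 1024 // hypothetical 128 MB split size

  val numSplits = (fileLength + maxSplitBytes - 1) / maxSplitBytes
  println(s"splitable:     $numSplits PartitionedFiles") // 3
  println(s"non-splitable: 1 PartitionedFile")
}
```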
## Split Files
```scala
splitFiles(
  sparkSession: SparkSession,
  file: FileStatus,
  filePath: Path,
  isSplitable: Boolean,
  maxSplitBytes: Long,
  partitionValues: InternalRow): Seq[PartitionedFile]
```
`splitFiles` branches off based on the given `isSplitable` flag.

If splitable, `splitFiles` uses the given `maxSplitBytes` to split the given `file` into `PartitionedFile`s, one for every part of at most `maxSplitBytes` bytes.

Otherwise, `splitFiles` creates a single `PartitionedFile` for the whole `file` (with the given `filePath` and `partitionValues`).
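Below is a minimal sketch of the splitting idea, not `PartitionedFileUtil`'s actual source: it walks a hypothetical file length in steps of `maxSplitBytes` and produces `(offset, length)` ranges like the ones a `PartitionedFile` describes (the last part carries whatever remains).

```scala
// A minimal sketch (not Spark's implementation): enumerate the (offset, length)
// ranges a splitable file would be cut into, stepping by maxSplitBytes.
object SplitRangesSketch extends App {
  val fileLength    = 350L * 1024 * 1024 // hypothetical 350 MB file
  val maxSplitBytes = 128L * 1024 * 1024 // hypothetical 128 MB split size

  val ranges: Seq[(Long, Long)] =
    (0L until fileLength by maxSplitBytes).map { offset =>
      (offset, math.min(maxSplitBytes, fileLength - offset))
    }

  ranges.foreach { case (offset, length) =>
    println(s"part: start=$offset length=$length")
  }
  // Three parts: two full 128 MB parts and a final 94 MB remainder.
}
```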
`splitFiles` is used when:

* `FileSourceScanExec` is requested to `createReadRDD`
* `FileScan` is requested for the partitions
## getPartitionedFile
```scala
getPartitionedFile(
  file: FileStatus,
  filePath: Path,
  partitionValues: InternalRow): PartitionedFile
```
`getPartitionedFile` finds the `BlockLocation`s of the given `FileStatus` (Apache Hadoop).

`getPartitionedFile` then finds the block hosts for those `BlockLocation`s.

In the end, `getPartitionedFile` creates a `PartitionedFile` with the following:
| Argument | Value |
|---|---|
| `partitionValues` | The given `partitionValues` |
| `filePath` | The URI of the given `filePath` |
| `start` | `0` |
| `length` | The length of the file |
| `locations` | Block hosts |
| `modificationTime` | The modification time of the file |
| `fileSize` | The size of the file |
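The following sketch is illustrative only (the file path and object name are assumptions): it shows the kind of Hadoop `FileSystem` calls involved in looking up a file's `BlockLocation`s and their hosts, which is the information recorded as the `locations` of the resulting `PartitionedFile`.

```scala
// Illustrative only (hypothetical path; not PartitionedFileUtil's source):
// look up the block locations of a file and collect the hosts they live on.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockHostsSketch extends App {
  val filePath = new Path("/tmp/some-file.parquet") // hypothetical file
  val fs: FileSystem = filePath.getFileSystem(new Configuration())
  val status = fs.getFileStatus(filePath)

  // Block locations covering the whole file (from offset 0 to its length)
  val blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
  val blockHosts = blockLocations.flatMap(_.getHosts).distinct

  println(s"length=${status.getLen} modificationTime=${status.getModificationTime}")
  println(s"block hosts: ${blockHosts.mkString(", ")}")
}
```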
`getPartitionedFile` is used when:

* `FileSourceScanExec` is requested to create a FileScanRDD with Bucketing Support
* `PartitionedFileUtil` is requested to `splitFiles`