When asked to split a file (Apache Hadoop), PartitionedFileUtil uses the isSplitable property of the FileFormat to create one or more PartitionedFiles.

Only a splitable file is divided: it gets one PartitionedFile for every part of up to maxSplitBytes bytes. A file that is not splitable always gets exactly one PartitionedFile.

Split Files

splitFiles(
  sparkSession: SparkSession,
  file: FileStatus,
  filePath: Path,
  isSplitable: Boolean,
  maxSplitBytes: Long,
  partitionValues: InternalRow): Seq[PartitionedFile]

splitFiles branches off based on the given isSplitable flag.

If splitable, splitFiles divides the given file into parts of at most maxSplitBytes bytes and creates a PartitionedFile for every part (the last part may be smaller).

Otherwise, splitFiles creates a single PartitionedFile for the given file (with the given filePath and partitionValues).
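The branching above can be sketched without any Spark or Hadoop dependency. This is a minimal illustration, not Spark's actual code: `SplitFilesSketch`, `FileSplit` and `splitRanges` are hypothetical names, and `FileSplit` is a simplified stand-in for PartitionedFile that keeps only the path, start offset and length.

```scala
object SplitFilesSketch {
  // Simplified, hypothetical stand-in for PartitionedFile (path, start and length only).
  final case class FileSplit(path: String, start: Long, length: Long)

  // Hypothetical helper that mirrors the isSplitable branching described above.
  def splitRanges(
      path: String,
      fileLength: Long,
      isSplitable: Boolean,
      maxSplitBytes: Long): Seq[FileSplit] = {
    require(maxSplitBytes > 0, "maxSplitBytes must be positive")
    if (isSplitable) {
      // One split per maxSplitBytes-sized part; the last part may be shorter.
      (0L until fileLength by maxSplitBytes).map { offset =>
        val size = math.min(maxSplitBytes, fileLength - offset)
        FileSplit(path, offset, size)
      }
    } else {
      // A file that is not splitable is always read as a whole.
      Seq(FileSplit(path, 0L, fileLength))
    }
  }
}
```

For example, `SplitFilesSketch.splitRanges("/data/f.parquet", 300L, isSplitable = true, maxSplitBytes = 128L)` gives three splits that start at offsets 0, 128 and 256 with lengths 128, 128 and 44. With `isSplitable = false`, the same call gives a single split that covers all 300 bytes.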

splitFiles is used when:


getPartitionedFile(
  file: FileStatus,
  filePath: Path,
  partitionValues: InternalRow): PartitionedFile

getPartitionedFile finds the BlockLocations of the given FileStatus (Apache Hadoop).

getPartitionedFile then uses the BlockLocations to find the block hosts.

In the end, getPartitionedFile creates a PartitionedFile with the following:

| Argument | Value |
|---|---|
| partitionValues | The given partitionValues |
| filePath | The URI of the given filePath |
| start | 0 |
| length | The length of the file |
| locations | Block hosts |
| modificationTime | The modification time of the file |
| fileSize | The size of the file |
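The argument mapping above can be illustrated with a small sketch that needs no Spark or Hadoop dependency. All names here are hypothetical: `FileInfo` stands in for Hadoop's FileStatus, `WholeFile` for PartitionedFile, and the block hosts are passed in directly instead of being looked up from BlockLocations.

```scala
object GetPartitionedFileSketch {
  // Hypothetical stand-in for the parts of FileStatus used here.
  final case class FileInfo(length: Long, modificationTime: Long)

  // Hypothetical stand-in for PartitionedFile, with the fields listed above.
  final case class WholeFile(
      partitionValues: Seq[Any],
      filePath: String,
      start: Long,
      length: Long,
      locations: Seq[String],
      modificationTime: Long,
      fileSize: Long)

  // Builds a single "partitioned file" that covers the whole file.
  def wholeFile(
      file: FileInfo,
      filePath: java.net.URI,
      partitionValues: Seq[Any],
      hosts: Seq[String]): WholeFile =
    WholeFile(
      partitionValues = partitionValues,
      filePath = filePath.toString, // the URI of the given path
      start = 0L,                   // always starts at the beginning of the file
      length = file.length,         // covers the whole file
      locations = hosts,
      modificationTime = file.modificationTime,
      fileSize = file.length)
}
```

Note that length and fileSize are equal in this case, since the single PartitionedFile spans the entire file.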

getPartitionedFile is used when: