# PartitionedFileUtil
When requested to split a file (a Hadoop FileStatus), PartitionedFileUtil uses the isSplitable property of a FileFormat to create one or more PartitionedFiles.
Only when the file is splitable does it produce multiple PartitionedFiles, one for every part of maxSplitBytes size.
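For example (with illustrative sizes), a splitable 350 MB file with maxSplitBytes of 128 MB yields three PartitionedFiles: two 128 MB parts and a final 94 MB part. A non-splitable file of the same size yields a single 350 MB PartitionedFile.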
## Split Files
```scala
splitFiles(
  sparkSession: SparkSession,
  file: FileStatus,
  filePath: Path,
  isSplitable: Boolean,
  maxSplitBytes: Long,
  partitionValues: InternalRow): Seq[PartitionedFile]
```
splitFiles branches off based on the given isSplitable flag.
If splitable, splitFiles uses the given maxSplitBytes to split the given file into one PartitionedFile per maxSplitBytes-sized part.
Otherwise, splitFiles creates a single PartitionedFile for the given file (with the given filePath and partitionValues).
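The offset arithmetic of the two branches can be pictured with the following minimal sketch. SimpleSplit and splitByMaxSplitBytes are hypothetical stand-ins for PartitionedFile and splitFiles; only the start/length bookkeeping is modelled, while the real splitFiles also carries the file path, partition values and block hosts.

```scala
// Hypothetical stand-in for PartitionedFile: just an offset range within a file
case class SimpleSplit(start: Long, length: Long)

// Sketch of the splitting logic (assumption, not Spark's actual implementation)
def splitByMaxSplitBytes(
    fileLength: Long,
    maxSplitBytes: Long,
    isSplitable: Boolean): Seq[SimpleSplit] = {
  if (isSplitable) {
    // One split per maxSplitBytes-sized part; the last part may be shorter
    (0L until fileLength by maxSplitBytes).map { offset =>
      SimpleSplit(offset, math.min(maxSplitBytes, fileLength - offset))
    }
  } else {
    // A non-splitable file becomes a single split covering the whole file
    Seq(SimpleSplit(0L, fileLength))
  }
}

// splitByMaxSplitBytes(350L << 20, 128L << 20, isSplitable = true)
// => splits starting at 0, 128 MB and 256 MB (the last one 94 MB long)
```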
splitFiles is used when:
* FileSourceScanExec is requested to createReadRDD
* FileScan is requested for the partitions
## getPartitionedFile
```scala
getPartitionedFile(
  file: FileStatus,
  filePath: Path,
  partitionValues: InternalRow): PartitionedFile
```
getPartitionedFile requests the block locations (Apache Hadoop BlockLocations) of the given FileStatus and then the block hosts of those locations.
In the end, getPartitionedFile creates a PartitionedFile with the following:
| Argument | Value |
|---|---|
| partitionValues | The given partitionValues |
| filePath | The URI of the given filePath |
| start | 0 |
| length | The length of the file |
| locations | Block hosts |
| modificationTime | The modification time of the file |
| fileSize | The size of the file |
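A minimal sketch of the block-location lookup, assuming only the Hadoop FileSystem API. HostedFileSlice and describeWholeFile are hypothetical stand-ins for PartitionedFile and getPartitionedFile, shown for a whole (non-split) file.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical stand-in for PartitionedFile with the fields from the table above
case class HostedFileSlice(
    path: String,
    start: Long,
    length: Long,
    hosts: Array[String],
    modificationTime: Long,
    fileSize: Long)

// Sketch: look up block locations and hosts, then describe the whole file
def describeWholeFile(path: Path, conf: Configuration): HostedFileSlice = {
  val fs = path.getFileSystem(conf)
  val status = fs.getFileStatus(path)
  // Block locations for the whole file (offset 0 up to the file length)
  val blockLocations = fs.getFileBlockLocations(status, 0, status.getLen)
  // Hosts of the blocks become the preferred locations of the read task
  val hosts = blockLocations.flatMap(_.getHosts).distinct
  HostedFileSlice(
    path = path.toUri.toString,
    start = 0,
    length = status.getLen,
    hosts = hosts,
    modificationTime = status.getModificationTime,
    fileSize = status.getLen)
}
```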
getPartitionedFile is used when:
* FileSourceScanExec is requested to create a FileScanRDD with Bucketing Support
* PartitionedFileUtil is requested to splitFiles