PartitionedFile is a part (block) of a file (similar to a Parquet block or an HDFS split).
PartitionedFile represents a chunk of a file to be read in a partition, along with the partition column values that are appended to each row.
PartitionedFile takes the following to be created:
- Partition Column Values
- Path of the file to read
- Beginning offset (in bytes)
- Length of this file part (number of bytes to read)
- Block Hosts
- Modification time
- File size
PartitionedFile is created when:
- PartitionedFileUtil is requested to split files and to getPartitionedFile
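The beginning offset and length listed above describe where a file part starts and how many bytes it covers. The following is a minimal sketch in plain Scala of how a file could be cut into such parts. `FilePart`, `SplitSketch` and `splitOffsets` are hypothetical names used for illustration only (they are not Spark's API), and a fixed maximum split size is assumed.

```scala
// Hypothetical stand-in for a file part; this is NOT Spark's PartitionedFile.
final case class FilePart(path: String, start: Long, length: Long)

object SplitSketch {
  // Cuts a file of `fileSize` bytes into consecutive parts
  // of at most `maxSplitBytes` bytes each (an assumed, fixed limit).
  def splitOffsets(path: String, fileSize: Long, maxSplitBytes: Long): Seq[FilePart] = {
    require(maxSplitBytes > 0, "maxSplitBytes must be positive")
    (0L until fileSize by maxSplitBytes).map { offset =>
      val remaining = fileSize - offset
      // The last part may be shorter than maxSplitBytes.
      FilePart(path, offset, math.min(remaining, maxSplitBytes))
    }
  }
}
```

For example, `SplitSketch.splitOffsets("fakePath0", 25L, 10L)` gives three parts with (start, length) of (0, 10), (10, 10) and (20, 5).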
Partition Column Values
PartitionedFile is given an InternalRow with the partition column values to be appended to each row.
The partition column values are the values of the partition columns and are therefore encoded in the directory structure, not stored in the partitioned files themselves (the directories and the files together make up the partitioned dataset).
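To make the idea of partition values living in the directory structure concrete, here is a small sketch in plain Scala that reads `column=value` directory names from a file path. `PartitionPathSketch` and the example path are hypothetical; this is not how Spark itself discovers partitions, and it skips details such as value unescaping and type inference.

```scala
object PartitionPathSketch {
  // Extracts `column=value` directory segments from a file path, in path order.
  def partitionValues(path: String): Seq[(String, String)] =
    path.split('/').toSeq
      .dropRight(1)              // the last segment is the file name, not a directory
      .filter(_.contains("="))   // keep only `column=value` directories
      .map { segment =>
        val idx = segment.indexOf('=')
        segment.substring(0, idx) -> segment.substring(idx + 1)
      }
}
```

With the assumed path `/data/year=2024/month=01/part-0000.parquet`, the sketch returns `year -> 2024` and `month -> 01`, neither of which is stored inside the data file.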
PartitionedFile is given a collection of block hosts (the host names of the nodes that hold the data blocks).
toString is part of the Object (Java) abstraction.
toString returns the following text:
path: [filePath], range: [start]-[start+length], partition values: [partitionValues]
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.catalyst.InternalRow
val partFile = PartitionedFile(InternalRow.empty, "fakePath0", 0, 10, Array("host0", "host1"))
println(partFile)
path: fakePath0, range: 0-10, partition values: [empty row]
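The range end in the text representation is the beginning offset plus the length, which is easier to see with a non-zero offset. The sketch below mimics the documented format in plain Scala. `FilePartText` is a hypothetical stand-in, not Spark's class, and the partition values are passed as a ready-made string for simplicity.

```scala
// Hypothetical stand-in that mimics the documented text format; NOT Spark's PartitionedFile.
final case class FilePartText(
    filePath: String,
    start: Long,
    length: Long,
    partitionValues: String) {
  // The range is printed as [start]-[start+length].
  override def toString: String =
    s"path: $filePath, range: $start-${start + length}, partition values: $partitionValues"
}
```

For a part that starts at byte 10 and is 5 bytes long, `FilePartText("fakePath0", 10L, 5L, "[empty row]").toString` is `path: fakePath0, range: 10-15, partition values: [empty row]`.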