PartitionedFile¶
PartitionedFile is a part (block) of a file (similar to a Parquet block or an HDFS split).
PartitionedFile represents a chunk of a file to be read in a partition, along with the partition column values that are appended to each row.
Creating Instance¶
PartitionedFile takes the following to be created:
- Partition Column Values
- Path of the file to read
- Beginning offset (in bytes)
- Length of this file part (number of bytes to read)
- Block Hosts
- Modification time
- File size
PartitionedFile is created when:
- PartitionedFileUtil is requested to split files and getPartitionedFile
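To illustrate how a single file can yield several parts, each described by a beginning offset and a length, here is a minimal, Spark-free sketch. `FilePart` and `splitFile` are hypothetical names introduced only for this illustration, and the fixed-size chunking is an assumption about the general idea rather than a copy of what PartitionedFileUtil does.

```scala
// Hypothetical stand-in for a file part: a path plus a byte range.
final case class FilePart(path: String, start: Long, length: Long)

// Cut a file of `fileSize` bytes into parts of at most `maxSplitBytes` bytes.
// (Assumed strategy for illustration; not Spark's actual implementation.)
def splitFile(path: String, fileSize: Long, maxSplitBytes: Long): Seq[FilePart] = {
  require(maxSplitBytes > 0, "maxSplitBytes must be positive")
  (0L until fileSize by maxSplitBytes).map { offset =>
    val remaining = fileSize - offset
    // The last part may be shorter than maxSplitBytes.
    FilePart(path, offset, math.min(remaining, maxSplitBytes))
  }
}

val parts = splitFile("fakePath0", fileSize = 25L, maxSplitBytes = 10L)
parts.foreach(println)
```

In this sketch a 25-byte file with a 10-byte limit gives three parts whose lengths add up to the file size, with only the last part being shorter than the limit.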
Partition Column Values¶
partitionValues: InternalRow
PartitionedFile is given an InternalRow with the partition column values to be appended to each row.
The partition column values are the values of the partition columns and are therefore encoded in the directory structure, not stored in the partitioned files themselves (the directories and files together make up the partitioned dataset).
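The following sketch shows why partition column values can be recovered from the directory structure alone. It assumes the common `column=value` directory naming; `partitionValuesOf` is a hypothetical helper written in plain Scala for this illustration and is not part of Spark.

```scala
// Hypothetical helper: collect `column=value` directory names from a file path.
def partitionValuesOf(path: String): List[(String, String)] =
  path.split('/').toList
    .dropRight(1)               // the last segment is the file name, not a directory
    .filter(_.contains("="))    // keep only partition directories
    .map { segment =>
      val i = segment.indexOf('=')
      segment.take(i) -> segment.drop(i + 1)
    }

val values = partitionValuesOf("/data/year=2024/month=01/part-00000.parquet")
println(values)
```

Note that the file name itself contributes nothing: every row read from that file would get the same `year` and `month` values, which is why they are passed along once per file part rather than stored per row.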
Block Hosts¶
locations: Array[String]
PartitionedFile is given a collection of nodes (host names) with data blocks.
Default: (empty)
String Representation¶
toString: String
toString is part of the Object (Java) abstraction.
toString is the following text:
path: [filePath], range: [start]-[start+length], partition values: [partitionValues]
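As a small worked example of the range arithmetic in this format, the sketch below uses a hypothetical `FilePartModel` case class (not Spark's PartitionedFile) that models only the fields appearing in the text. The partition values are passed in as an already-rendered placeholder string, since their exact rendering depends on InternalRow.

```scala
// Hypothetical model of the fields used by the text format above.
final case class FilePartModel(
    filePath: String,
    start: Long,
    length: Long,
    partitionValues: String) { // placeholder for a rendered InternalRow
  override def toString: String =
    s"path: $filePath, range: $start-${start + length}, partition values: $partitionValues"
}

val rendered = FilePartModel("fakePath0", 128L, 64L, "[2024,1]").toString
println(rendered)
```

With a non-zero start, the end of the range is the beginning offset plus the length (here 128 + 64 = 192), not the length itself.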
Demo¶
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.catalyst.InternalRow
val partFile = PartitionedFile(InternalRow.empty, "fakePath0", 0, 10, Array("host0", "host1"))
println(partFile)
path: fakePath0, range: 0-10, partition values: [empty row]