FileTable¶

FileTable is an extension of the Table abstraction for file-based tables with support for read and write.

Contract¶

Fallback FileFormat¶

fallbackFileFormat: Class[_ <: FileFormat]

Fallback V1 FileFormat

Used when the FallBackFileSourceV2 extended resolution rule is executed (to resolve an InsertIntoStatement over a DataSourceV2Relation backed by a FileTable)

Format Name¶

formatName: String

Name of the file table (format)

FileTable	Format Name
`AvroTable`	`AVRO`
`CSVTable`	`CSV`
`JsonTable`	`JSON`
`OrcTable`	`ORC`
`ParquetTable`	`Parquet`
`TextTable`	`Text`

Schema Inference¶

inferSchema(
    files: Seq[FileStatus]): Option[StructType]

Infers schema of the given files (as Hadoop FileStatuses)

See:

ParquetTable

Used when:

FileTable is requested for the data schema
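The way an optional user-defined schema and schema inference could combine into the data schema can be sketched in plain Scala. This is a minimal model, not the Spark code: the `DataSchemaSketch` object, the `Schema` alias, and the precedence of the user-defined schema over inference are assumptions made for illustration.

```scala
object DataSchemaSketch {
  // Simplified stand-in for StructType: just the column names (hypothetical)
  type Schema = Seq[String]

  // Assumption: a user-defined schema wins; inference is the fallback
  def dataSchema(
      userSpecifiedSchema: Option[Schema],
      inferSchema: () => Option[Schema]): Schema =
    userSpecifiedSchema
      .orElse(inferSchema()) // orElse is by-name, so inference runs only when needed
      .getOrElse(throw new IllegalStateException(
        "Unable to infer schema; it must be specified manually"))
}
```

Because `orElse` takes its argument by name, the (potentially expensive) inference is skipped whenever a schema was supplied up front.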

supportsDataType¶

supportsDataType(
    dataType: DataType): Boolean = true

Controls whether the given DataType is supported by the file-backed table

Default: All DataTypes are supported

See:

ParquetTable

Used when:

FileTable is requested for the schema

Implementations¶

AvroTable
CSVTable
JsonTable
OrcTable
ParquetTable
TextTable

Creating Instance¶

FileTable takes the following to be created:

SparkSession
Options
Paths
Optional user-defined schema (Option[StructType])

FileTable is an abstract class and cannot be created directly. It is created indirectly when one of the concrete FileTables is created.

Table Capabilities¶

capabilities: java.util.Set[TableCapability]

capabilities is part of the Table abstraction.

capabilities are the following TableCapabilities:

Data Schema¶

dataSchema: StructType

dataSchema is the schema of the data of the file-backed table

Lazy Value

dataSchema is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.

Learn more in the Scala Language Specification.
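The once-only initialization of a lazy value can be seen in a small standalone sketch. The `LazyValueDemo` object and its `initCount` counter are hypothetical names used only to make the behavior observable.

```scala
object LazyValueDemo {
  // Counts how many times the initializer below has run
  var initCount = 0

  // Hypothetical stand-in for an expensive computation such as schema inference
  lazy val dataSchema: Seq[String] = {
    initCount += 1
    Seq("id", "name")
  }
}
```

Reading `LazyValueDemo.dataSchema` twice leaves `initCount` at 1, and both reads return the same instance.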

dataSchema is used when:

FileTable is requested for the schema

Partitioning¶

partitioning: Array[Transform]

partitioning is part of the Table abstraction.

partitioning...FIXME

Properties¶

properties: util.Map[String, String]

properties is part of the Table abstraction.

properties returns the options.

Table Schema¶

Signature

schema: StructType

schema is part of the Table abstraction.

schema checks the dataSchema for column name duplication.

schema makes sure that all field types in the dataSchema are supported.

schema requests the PartitioningAwareFileIndex for the partitionSchema and checks it for column name duplication.

In the end, schema is the dataSchema followed by (the fields of) the partitionSchema.
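The steps above can be modeled in plain Scala as a rough sketch. It follows the description on this page only: `Field`, `checkColumnNameDuplication`, and `tableSchema` are simplified, hypothetical stand-ins, name comparison is case-sensitive, and the actual Spark implementation may treat case sensitivity and columns shared by both schemas differently.

```scala
object TableSchemaSketch {
  // Simplified stand-in for a schema field (hypothetical)
  final case class Field(name: String, dataType: String)

  // Fails when the same column name appears more than once
  def checkColumnNameDuplication(fields: Seq[Field], where: String): Unit = {
    val duplicates = fields
      .groupBy(_.name)
      .collect { case (name, dups) if dups.size > 1 => name }
      .toSeq
      .sorted
    require(duplicates.isEmpty,
      s"Found duplicate column(s) $where: ${duplicates.mkString(", ")}")
  }

  def tableSchema(
      dataSchema: Seq[Field],
      partitionSchema: Seq[Field],
      supportsDataType: String => Boolean = _ => true): Seq[Field] = {
    // 1. No duplicate column names in the data schema
    checkColumnNameDuplication(dataSchema, "in the data schema")
    // 2. Every field type in the data schema must be supported
    dataSchema.foreach { f =>
      require(supportsDataType(f.dataType),
        s"Unsupported data type ${f.dataType} for column ${f.name}")
    }
    // 3. No duplicate column names in the partition schema
    checkColumnNameDuplication(partitionSchema, "in the partition schema")
    // 4. Data fields first, partition fields last
    dataSchema ++ partitionSchema
  }
}
```

The default `supportsDataType` accepts every type, mirroring the default described in the contract; a format that rejects some types would pass a stricter predicate.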

PartitioningAwareFileIndex¶

fileIndex: PartitioningAwareFileIndex

Lazy Value

fileIndex is a Scala lazy value to guarantee that the code to initialize it is executed once only (when accessed for the first time) and the computed value never changes afterwards.

Learn more in the Scala Language Specification.

fileIndex creates one of the following PartitioningAwareFileIndexes:

MetadataLogFileIndex when reading from the results of a streaming query (and loading files from the metadata log instead of listing them using HDFS APIs)
InMemoryFileIndex
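The choice between the two indexes can be illustrated with a small sketch on the local file system. It is not the Spark logic: the `FileIndexChoiceSketch` object and `chooseIndex` are hypothetical, and the assumptions are that a streaming query's output is recognized by a `_spark_metadata` directory under a single input path, and that multiple paths always fall back to listing.

```scala
import java.nio.file.{Files, Path}

object FileIndexChoiceSketch {
  // Labels for the two kinds of index named above (not the Spark classes)
  sealed trait IndexKind
  case object MetadataLogFileIndex extends IndexKind
  case object InMemoryFileIndex extends IndexKind

  // Assumption: directory name of the metadata log kept next to streaming output
  val MetadataDir = "_spark_metadata"

  def chooseIndex(paths: Seq[Path]): IndexKind = paths match {
    // A single path with a metadata log: load files from the log
    case Seq(single) if Files.isDirectory(single.resolve(MetadataDir)) =>
      MetadataLogFileIndex
    // Anything else: list the files
    case _ => InMemoryFileIndex
  }
}
```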

fileIndex is used when:

FileTables are requested for FileScanBuilders
Dataset is requested for the inputFiles
CacheManager is requested to lookupAndRefresh
FallBackFileSourceV2 is created
FileTable is requested for the dataSchema, schema, and partitioning