MetadataLogFileIndex¶
MetadataLogFileIndex is a PartitioningAwareFileIndex of metadata log files (generated by FileStreamSink).
Tip
Learn more about PartitioningAwareFileIndex in The Internals of Spark SQL online book.
Creating Instance¶
MetadataLogFileIndex takes the following to be created:
- SparkSession
- Hadoop Path
- Parameters (Map[String, String])
- User-Defined Schema (Option[StructType])
MetadataLogFileIndex is created when:
- DataSource is requested to resolveRelation (for FileFormat streaming data sources)
- FileTable is requested for a PartitioningAwareFileIndex (for FileFormat streaming data sources)
- FileStreamSource is requested to allFilesUsingMetadataLogFileIndex
While being created, MetadataLogFileIndex prints out the following INFO message to the logs (with the metadataDirectory):
Reading streaming file log from [metadataDirectory]
Metadata Directory¶
metadataDirectory: Path
metadataDirectory is the Hadoop Path of the _spark_metadata directory in the given path.
metadataDirectory is used to create a FileStreamSinkLog.
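As a rough illustration of the path resolution, the following sketch derives the _spark_metadata location from an output path and formats the INFO message quoted earlier. It uses java.nio.file.Path as a stand-in for Hadoop's Path so that it runs without extra dependencies. The names MetadataDirectorySketch, MetadataDirName and infoMessage are hypothetical and are not part of Spark.

```scala
import java.nio.file.{Path, Paths}

object MetadataDirectorySketch {
  // Name of the directory that holds the sink's metadata log (as described above).
  val MetadataDirName = "_spark_metadata"

  // Stand-in for resolving the metadata directory under the given output path.
  // Assumption: java.nio Path replaces the Hadoop Path used by the real class.
  def metadataDirectory(path: Path): Path = path.resolve(MetadataDirName)

  // Formats the INFO message that is logged while the index is being created.
  def infoMessage(metadataDirectory: Path): String =
    s"Reading streaming file log from $metadataDirectory"
}
```

For an output path such as /tmp/out, metadataDirectory resolves to the _spark_metadata child of that directory.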
FileStreamSinkLog¶
metadataLog: FileStreamSinkLog
metadataLog is a FileStreamSinkLog created with the Metadata Directory.
metadataLog is used to load the Metadata Log Files.
Metadata Log Files¶
allFilesFromLog: Array[FileStatus]
allFilesFromLog requests the FileStreamSinkLog for all files and converts every file to its Hadoop FileStatus representation.
allFilesFromLog is used for leafFiles and leafDirToChildrenFiles.
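The declared types of leafFiles and leafDirToChildrenFiles suggest how a flat file listing could feed both structures. The sketch below is only one plausible shape and is not taken from the Spark sources. FileEntry is a hypothetical stand-in for Hadoop's FileStatus, and java.nio.file.Path replaces Hadoop's Path so that the code runs on its own.

```scala
import java.nio.file.{Path, Paths}
import scala.collection.mutable

object LeafFilesSketch {
  // Hypothetical stand-in for a Hadoop FileStatus: just a path and a length.
  final case class FileEntry(path: Path, length: Long)

  // Path-keyed view that preserves the order of the input listing.
  def leafFiles(all: Array[FileEntry]): mutable.LinkedHashMap[Path, FileEntry] = {
    val byPath = mutable.LinkedHashMap.empty[Path, FileEntry]
    all.foreach(f => byPath(f.path) = f)
    byPath
  }

  // Files grouped by their parent directory.
  def leafDirToChildrenFiles(all: Array[FileEntry]): Map[Path, Array[FileEntry]] =
    all.groupBy(_.path.getParent)
}
```

A LinkedHashMap keeps the entries in listing order, while grouping by parent directory gives one entry per leaf directory, which is the shape a partition-aware index would need to discover partitions.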
Leaf Files¶
leafFiles: mutable.LinkedHashMap[Path, FileStatus]
leafFiles...FIXME
leafFiles is part of the PartitioningAwareFileIndex abstraction (Spark SQL).
leafDirToChildrenFiles¶
leafDirToChildrenFiles: Map[Path, Array[FileStatus]]
leafDirToChildrenFiles...FIXME
leafDirToChildrenFiles is part of the PartitioningAwareFileIndex abstraction (Spark SQL).
Logging¶
Enable ALL logging level for org.apache.spark.sql.execution.streaming.MetadataLogFileIndex logger to see what happens inside.
Add the following line to conf/log4j.properties:
log4j.logger.org.apache.spark.sql.execution.streaming.MetadataLogFileIndex=ALL
Refer to Logging.