MetadataLogFileIndex¶
MetadataLogFileIndex is a PartitioningAwareFileIndex of metadata log files (generated by FileStreamSink).
Tip
Learn more about PartitioningAwareFileIndex in The Internals of Spark SQL online book.
Creating Instance¶
MetadataLogFileIndex takes the following to be created:
- SparkSession
- Hadoop Path
- Parameters (Map[String, String])
- User-Defined Schema (Option[StructType])
MetadataLogFileIndex is created when:
- DataSource is requested to resolveRelation (for FileFormat streaming data sources)
- FileTable is requested for a PartitioningAwareFileIndex (for FileFormat streaming data sources)
- FileStreamSource is requested to allFilesUsingMetadataLogFileIndex
While being created, MetadataLogFileIndex prints out the following INFO message to the logs (with the metadataDirectory):
Reading streaming file log from [metadataDirectory]
Metadata Directory¶
metadataDirectory: Path
metadataDirectory is the Hadoop Path of the metadata directory, i.e. the _spark_metadata directory in the given path.
metadataDirectory is used to create a FileStreamSinkLog.
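As a minimal sketch (not the actual Spark source), the metadata directory can be derived from the given path with the Hadoop Path(parent, child) constructor. The MetadataDirSketch object and its members are hypothetical names used for illustration only, and the sketch assumes hadoop-common is on the classpath.

```scala
import org.apache.hadoop.fs.Path

// Hypothetical helper, for illustration only (not part of Spark)
object MetadataDirSketch {
  // Name of the directory that holds the metadata log files
  val MetadataDirName = "_spark_metadata"

  // Resolves the metadata directory as a child of the given path
  def metadataDirectory(path: Path): Path = new Path(path, MetadataDirName)
}
```

With this sketch, MetadataDirSketch.metadataDirectory(new Path("/tmp/out")) gives the path /tmp/out/_spark_metadata.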
FileStreamSinkLog¶
metadataLog: FileStreamSinkLog
metadataLog is a FileStreamSinkLog with the Metadata Directory.
metadataLog is used to load the metadata log files.
Metadata Log Files¶
allFilesFromLog: Array[FileStatus]
allFilesFromLog requests the FileStreamSinkLog for all files and converts each of them to a Hadoop FileStatus.
allFilesFromLog is used for leafFiles and leafDirToChildrenFiles.
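The conversion described above can be sketched as follows. This is an illustration under stated assumptions, not the Spark implementation: LoggedFile is a hypothetical stand-in for an entry of the metadata log, AllFilesSketch is a hypothetical helper, and the block replication and block size values are placeholders. Only the Hadoop FileStatus and Path classes are real APIs here.

```scala
import org.apache.hadoop.fs.{FileStatus, Path}

// Hypothetical stand-in for an entry of the metadata log (not Spark's own class)
final case class LoggedFile(path: String, size: Long, modificationTime: Long)

// Hypothetical helper, for illustration only
object AllFilesSketch {
  // Converts one logged entry to a Hadoop FileStatus
  // Block replication (1) and block size (0) are placeholder values
  def toFileStatus(f: LoggedFile): FileStatus =
    new FileStatus(f.size, false, 1, 0L, f.modificationTime, new Path(f.path))

  // Takes all logged files and converts each one to a FileStatus
  def allFilesFromLog(logged: Seq[LoggedFile]): Array[FileStatus] =
    logged.map(toFileStatus).toArray
}
```

In this sketch the entries are passed in explicitly, whereas MetadataLogFileIndex obtains them from the FileStreamSinkLog.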
Leaf Files¶
leafFiles: mutable.LinkedHashMap[Path, FileStatus]
leafFiles ...FIXME
leafFiles is part of the PartitioningAwareFileIndex abstraction (Spark SQL).
leafDirToChildrenFiles¶
leafDirToChildrenFiles: Map[Path, Array[FileStatus]]
leafDirToChildrenFiles ...FIXME
leafDirToChildrenFiles is part of the PartitioningAwareFileIndex abstraction (Spark SQL).
Logging¶
Enable ALL logging level for org.apache.spark.sql.execution.streaming.MetadataLogFileIndex logger to see what happens inside.
Add the following line to conf/log4j.properties:
log4j.logger.org.apache.spark.sql.execution.streaming.MetadataLogFileIndex=ALL
Refer to Logging.