MetadataLogFileIndex

MetadataLogFileIndex is a PartitioningAwareFileIndex of metadata log files (generated by FileStreamSink).

Creating Instance

MetadataLogFileIndex takes the following to be created:

MetadataLogFileIndex is created when:

DataSource is requested to resolveRelation (for FileFormat streaming data sources)
FileTable is requested for a PartitioningAwareFileIndex (for FileFormat streaming data sources)
FileStreamSource is requested to allFilesUsingMetadataLogFileIndex

While being created, MetadataLogFileIndex prints out the following INFO message to the logs (with the metadataDirectory):

Reading streaming file log from [metadataDirectory]

metadataDirectory: Path

metadataDirectory is the Hadoop Path of the metadata directory, i.e. the _spark_metadata directory in the given path.

metadataDirectory is used to create a FileStreamSinkLog.
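
The relationship between the given path and metadataDirectory can be sketched as follows. This is a minimal, dependency-free illustration, not the Spark implementation: the `MetadataDirSketch` object and its members are hypothetical names, and `java.nio.file.Path` stands in for the Hadoop Path that the real class uses.

```scala
import java.nio.file.{Path, Paths}

// Hypothetical helper (not part of Spark). java.nio.file.Path is an
// assumption made only to keep the sketch free of Hadoop dependencies.
object MetadataDirSketch {
  // Directory name as stated in the text above.
  val MetadataDirName = "_spark_metadata"

  // Resolves the metadata directory inside the given output path.
  def metadataDirectory(outputPath: Path): Path =
    outputPath.resolve(MetadataDirName)
}
```

For an output path of /tmp/out, the helper resolves a child directory named _spark_metadata whose parent is /tmp/out.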

metadataLog: FileStreamSinkLog

metadataLog is the FileStreamSinkLog (created for the metadataDirectory) that is used to read the metadata log files.

allFilesFromLog: Array[FileStatus]

allFilesFromLog requests the FileStreamSinkLog for all the files and converts every file to a Hadoop FileStatus.

allFilesFromLog is used for leafFiles and leafDirToChildrenFiles.

leafFiles: mutable.LinkedHashMap[Path, FileStatus]

leafFiles...FIXME

leafFiles is part of the PartitioningAwareFileIndex abstraction (Spark SQL).

leafDirToChildrenFiles: Map[Path, Array[FileStatus]]

leafDirToChildrenFiles...FIXME

leafDirToChildrenFiles is part of the PartitioningAwareFileIndex abstraction (Spark SQL).
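
The two entries above do not yet say how the collections are built. Given their types and the fact that both are based on allFilesFromLog, one plausible derivation is sketched below. It is an illustration under stated assumptions, not the Spark source: `FileEntry` is a hypothetical stand-in for Hadoop's FileStatus, `LeafIndexSketch` is a hypothetical object, and `java.nio.file.Path` replaces the Hadoop Path.

```scala
import java.nio.file.{Path, Paths}
import scala.collection.mutable

// Hypothetical stand-in for Hadoop's FileStatus (only path and size are modeled).
final case class FileEntry(path: Path, sizeInBytes: Long)

// Hypothetical object; the method names mirror the properties described above.
object LeafIndexSketch {
  // Files keyed by their path. LinkedHashMap keeps the insertion order,
  // i.e. the order in which the log listed the files.
  def leafFiles(all: Array[FileEntry]): mutable.LinkedHashMap[Path, FileEntry] = {
    val byPath = mutable.LinkedHashMap.empty[Path, FileEntry]
    all.foreach(f => byPath.put(f.path, f))
    byPath
  }

  // Files grouped by the directory that directly contains them.
  def leafDirToChildrenFiles(all: Array[FileEntry]): Map[Path, Array[FileEntry]] =
    all.groupBy(_.path.getParent)
}
```

With files under /out/date=1 and /out/date=2, the first method yields one entry per file in listing order, while the second yields one entry per partition directory with its files as the value.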

Enable ALL logging level for org.apache.spark.sql.execution.streaming.MetadataLogFileIndex logger to see what happens inside.

Add the following line to conf/log4j.properties (Log4j 1.x syntax, as used by Spark releases before 3.3):

log4j.logger.org.apache.spark.sql.execution.streaming.MetadataLogFileIndex=ALL

Refer to Logging.