HDFSMetadataLog¶
HDFSMetadataLog is an extension of the MetadataLog abstraction for metadata storage that stores batch files in a metadata log directory on Hadoop DFS-compatible file systems (for fault tolerance and reliability).
Extensions¶

- CommitLog
- CompactibleFileStreamLog
- KafkaSourceInitialOffsetWriter (Kafka Data Source)
- OffsetSeqLog
Creating Instance¶
HDFSMetadataLog takes the following to be created:

- SparkSession (Spark SQL)
- Path of the metadata log directory

While being created, HDFSMetadataLog makes sure that the path exists (and creates it if not).
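The existence check at creation time can be sketched with plain java.nio as a stand-in for the actual CheckpointFileManager call (the helper name is illustrative, not Spark API):

```scala
import java.nio.file.{Files, Path, Paths}

// Illustrative sketch only: make sure a metadata log directory exists,
// creating it (and any missing parent directories) if it does not.
def ensureLogDir(path: String): Path = {
  val dir = Paths.get(path)
  if (!Files.exists(dir)) Files.createDirectories(dir)
  dir
}
```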
Metadata Log Directory¶
HDFSMetadataLog uses the given path as the metadata log directory with metadata logs (one per batch). The path is immediately converted to a Hadoop Path for file management.
CheckpointFileManager¶
HDFSMetadataLog creates a CheckpointFileManager (with the metadata log directory) when created.
Implicit Json4s Formats¶
HDFSMetadataLog uses Json4s with the Jackson binding for metadata serialization and deserialization (to and from JSON format).
Latest Committed Batch Id with Metadata (If Available)¶
getLatest(): Option[(Long, T)]
getLatest is a part of the MetadataLog abstraction.

getLatest requests the internal CheckpointFileManager to list the batch files in the metadata log directory.

getLatest takes the batch ids (that the batch files correspond to) and sorts the ids in reverse order.

getLatest returns the first batch id (with its metadata) whose metadata can be retrieved (deserialized) successfully.
Note
It is possible that the batch id could be in the metadata storage, but not available for retrieval.
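The selection logic above can be sketched in plain Scala, with the metadata lookup abstracted as a function (a simplified model, not the actual implementation):

```scala
// Illustrative sketch of the getLatest logic: given batch ids and a
// lookup that may fail to retrieve metadata, pick the highest batch id
// whose metadata is actually retrievable.
def getLatest[T](batchIds: Seq[Long], get: Long => Option[T]): Option[(Long, T)] = {
  val sortedDesc = batchIds.sorted(Ordering[Long].reverse)
  sortedDesc.iterator.map(id => id -> get(id)).collectFirst {
    case (id, Some(metadata)) => (id, metadata)
  }
}
```

Note how batch id 2 below is listed but not retrievable, so the next-highest retrievable id wins.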
Retrieving Metadata of Streaming Batch¶
get(
batchId: Long): Option[T]
get is part of the MetadataLog abstraction.

get ...FIXME
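The author left this section as FIXME. A minimal sketch of a plausible lookup, assuming get resolves the batch file for the given batchId and deserializes it when the file exists (java.nio stands in for the Hadoop file system; all names are illustrative):

```scala
import java.nio.file.{Files, Paths}

// Illustrative sketch only: look up a batch file by id under a log
// directory and deserialize its contents when it exists.
def getBatch[T](logDir: String, batchId: Long, deserialize: Array[Byte] => T): Option[T] = {
  val file = Paths.get(logDir, batchId.toString)
  if (Files.isRegularFile(file)) Some(deserialize(Files.readAllBytes(file)))
  else None
}
```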
Deserializing Metadata¶
deserialize(
in: InputStream): T
deserialize deserializes metadata (of type T) from a given InputStream.
Retrieving Metadata of Streaming Batches (If Available)¶
get(
startId: Option[Long],
endId: Option[Long]): Array[(Long, T)]
get is part of the MetadataLog abstraction.

get ...FIXME
Persisting Metadata of Streaming Micro-Batch¶
add(
batchId: Long,
metadata: T): Boolean
add is part of the MetadataLog abstraction.

add returns true when the metadata of the streaming batch was not available and was persisted successfully. Otherwise, add returns false.

Internally, add looks up the metadata of the given batch (batchId) and returns false when found.

Otherwise, when not found, add creates the metadata log file for the given batchId, writes the metadata to the file, and returns true if successful.
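The add contract can be modeled with an in-memory toy log (a sketch of the check-then-persist behavior, not the on-disk implementation):

```scala
import scala.collection.mutable

// Illustrative sketch of the add contract: persist metadata only when
// the batch id is not already present; report whether a write happened.
final class ToyMetadataLog[T] {
  private val store = mutable.Map.empty[Long, T]

  def add(batchId: Long, metadata: T): Boolean =
    if (store.contains(batchId)) false      // already persisted
    else { store(batchId) = metadata; true } // first write wins

  def get(batchId: Long): Option[T] = store.get(batchId)
}
```

The first-write-wins behavior is what makes re-running a batch after a failure safe: a second add for the same batch id is a no-op.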
Writing Batch Metadata to File (Metadata Log)¶
writeBatchToFile(
metadata: T,
path: Path): Unit
writeBatchToFile requests the CheckpointFileManager to createAtomic (for the given path and with the overwriteIfPossible flag disabled).

writeBatchToFile then serializes the metadata (to the CancellableFSDataOutputStream output stream) and closes the stream.

In case of an exception, writeBatchToFile simply requests the CancellableFSDataOutputStream output stream to cancel (so that the output file is not generated) and re-throws the exception.
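The write-or-cancel pattern can be sketched with a toy cancellable stream (a simplified model of CancellableFSDataOutputStream; the types and names are illustrative):

```scala
import java.io.ByteArrayOutputStream

// Toy model of a cancellable output stream: either the write completes
// and the stream is closed, or cancel() discards the pending output.
final class ToyCancellableStream {
  val out = new ByteArrayOutputStream()
  var cancelled = false
  var closed = false
  def cancel(): Unit = cancelled = true
  def close(): Unit = closed = true
}

// Sketch of the pattern used by writeBatchToFile: serialize into the
// stream, close on success, cancel and re-throw on failure.
def writeOrCancel(stream: ToyCancellableStream)(write: ToyCancellableStream => Unit): Unit =
  try {
    write(stream)
    stream.close()
  } catch {
    case e: Throwable =>
      stream.cancel() // the output file is not generated
      throw e
  }
```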
Serializing Metadata¶
serialize(
metadata: T,
out: OutputStream): Unit
serialize simply writes out the log data in a serialized format (using the Json4s library with the Jackson binding).
Purging Expired Metadata¶
purge(
thresholdBatchId: Long): Unit
purge is part of the MetadataLog abstraction.

purge ...FIXME
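The author left this section as FIXME. A minimal sketch, assuming purge removes the metadata of batches with ids below the given threshold (a toy in-memory stand-in, not the on-disk implementation):

```scala
import scala.collection.mutable

// Illustrative sketch only: drop all batches whose id is below the
// threshold, keeping the threshold batch itself.
def purge(store: mutable.Map[Long, String], thresholdBatchId: Long): Unit =
  store.keys.toList.filter(_ < thresholdBatchId).foreach(store.remove)
```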
Batch Files¶
HDFSMetadataLog considers a file a batch file when its name is simply a long number.

HDFSMetadataLog uses a Hadoop PathFilter to list only batch files.
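The name check behind that filter amounts to asking whether the file name parses as a Long (a sketch of the predicate only; the actual filter is a Hadoop PathFilter):

```scala
// Illustrative sketch: a file is a batch file when its name is a long number.
def isBatchFileName(name: String): Boolean =
  scala.util.Try(name.toLong).isSuccess
```

This is why, for example, compacted files of CompactibleFileStreamLog (with a non-numeric suffix) need their own handling.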
Verifying Batch Ids¶
verifyBatchIds(
batchIds: Seq[Long],
startId: Option[Long],
endId: Option[Long]): Unit
verifyBatchIds ...FIXME

verifyBatchIds is used when:

- HDFSMetadataLog is requested for the metadata of streaming batches
- FileStreamSourceLog is requested for the metadata of streaming batches
Path of Metadata File by Batch Id¶
batchIdToPath(
batchId: Long): Path
batchIdToPath simply creates a Hadoop Path for the file named after the given batchId under the metadata log directory.
Batch Id by Path of Metadata File¶
pathToBatchId(
path: Path): Long
pathToBatchId converts the name of the given metadata file (path) to a batch id (a Long).
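The two conversions are inverses of each other, which can be sketched with java.nio paths standing in for Hadoop Paths:

```scala
import java.nio.file.{Path, Paths}

// Illustrative sketch: a batch id maps to a file named after the id
// under the metadata log directory, and back.
def batchIdToPath(logDir: Path, batchId: Long): Path =
  logDir.resolve(batchId.toString)

def pathToBatchId(path: Path): Long =
  path.getFileName.toString.toLong
```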