HDFSMetadataLog¶
HDFSMetadataLog is an extension of the MetadataLog abstraction for metadata storage that stores metadata as batch files in a metadata log directory on a Hadoop DFS-compatible file system (for fault tolerance and reliability).
Extensions¶
- CommitLog
- CompactibleFileStreamLog
- OffsetSeqLog
Creating Instance¶
HDFSMetadataLog takes the following to be created:
- SparkSession (Spark SQL)
- Path of the metadata log directory
While being created, HDFSMetadataLog makes sure that the path exists (and creates it if not).
Metadata Log Directory¶
HDFSMetadataLog uses the given path as the metadata log directory with metadata logs (one per batch).
The path is immediately converted to a Hadoop Path for file management.
CheckpointFileManager¶
HDFSMetadataLog creates a CheckpointFileManager (with the metadata log directory) when created.
Implicit Json4s Formats¶
HDFSMetadataLog uses Json4s with the Jackson binding for metadata serialization and deserialization (to and from JSON format).
Latest Committed Batch Id with Metadata (If Available)¶
getLatest(): Option[(Long, T)]
getLatest is part of the MetadataLog abstraction.
getLatest requests the internal CheckpointFileManager to list the files in the metadata log directory (and keeps batch files only).
getLatest takes the batch ids (the batch files correspond to) and sorts the ids in reverse order.
getLatest returns the first batch id whose metadata could be retrieved from the metadata storage (if available).
Note
It is possible that the batch id could be in the metadata storage, but not available for retrieval.
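The lookup above can be sketched as follows — a minimal Python analogue of the Scala logic, where `store` is a hypothetical stand-in for the metadata storage:

```python
# Sketch of the getLatest logic: scan batch ids in descending order and
# return the first whose metadata can actually be retrieved.
# A batch id may be present while its metadata is unavailable (None here).
def get_latest(store):
    for batch_id in sorted(store.keys(), reverse=True):
        metadata = store.get(batch_id)
        if metadata is not None:
            return (batch_id, metadata)
    return None
```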
Retrieving Metadata of Streaming Batch¶
get(
  batchId: Long): Option[T]
get is part of the MetadataLog abstraction.
get opens the metadata log file for the given batchId (via the CheckpointFileManager) and deserializes the metadata. get returns None when the batch file is not available.
Deserializing Metadata¶
deserialize(
  in: InputStream): T
deserialize deserializes metadata (of type T) from the given InputStream.
Retrieving Metadata of Streaming Batches (if Available)¶
get(
  startId: Option[Long],
  endId: Option[Long]): Array[(Long, T)]
get is part of the MetadataLog abstraction.
get lists the batch files in the metadata log directory that fall in the given batch id range, verifies the batch ids (verifyBatchIds) and retrieves the metadata of every batch (by batch id).
Persisting Metadata of Streaming Micro-Batch¶
add(
  batchId: Long,
  metadata: T): Boolean
add is part of the MetadataLog abstraction.
add returns true when the metadata of the streaming batch was not available and was persisted successfully. Otherwise, add returns false.
Internally, add looks up the metadata of the given batchId and returns false when found.
Otherwise, when not found, add writes the metadata to the batch file for the batchId (batchIdToPath and writeBatchToFile) and returns true if successful.
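The check-then-write semantics can be sketched in Python (with a dict `log` as a hypothetical stand-in for the metadata log directory):

```python
# Sketch of add: return False when the batch is already persisted,
# otherwise persist the metadata and return True.
def add(log, batch_id, metadata):
    if batch_id in log:           # metadata already available
        return False
    log[batch_id] = metadata      # writeBatchToFile in the real code
    return True
```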
Writing Batch Metadata to File (Metadata Log)¶
writeBatchToFile(
  metadata: T,
  path: Path): Unit
writeBatchToFile requests the CheckpointFileManager to create the given path atomically (with the overwriteIfPossible flag disabled).
writeBatchToFile then serializes the metadata to the CancellableFSDataOutputStream output stream and closes the stream.
In case of an exception, writeBatchToFile simply requests the CancellableFSDataOutputStream output stream to cancel (so that the output file is not generated) and re-throws the exception.
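The cancel-on-failure behavior resembles the classic write-then-rename pattern. A minimal Python sketch (the helper name and temp-file approach are illustrative, not Spark's actual CheckpointFileManager implementation):

```python
import os
import tempfile

# Sketch of atomic batch-file writing: write to a temporary file and
# rename it into place only on success; on failure remove the temporary
# file so that no output file is generated, then re-raise.
def write_batch_to_file(metadata: bytes, path: str) -> None:
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(metadata)
        os.rename(tmp, path)      # commit
    except Exception:
        os.remove(tmp)            # cancel: leave no partial output behind
        raise
```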
Serializing Metadata¶
serialize(
  metadata: T,
  out: OutputStream): Unit
serialize simply writes out the metadata in serialized JSON format (using the Json4s library with the Jackson binding).
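A minimal Python analogue of the serialize/deserialize pair, using the standard `json` module in place of Json4s with Jackson:

```python
import json

# Metadata round-trips through a JSON string (the Scala code uses
# Json4s with the Jackson binding instead of Python's json module).
def serialize(metadata) -> str:
    return json.dumps(metadata)

def deserialize(text: str):
    return json.loads(text)
```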
Purging Expired Metadata¶
purge(
  thresholdBatchId: Long): Unit
purge is part of the MetadataLog abstraction.
purge deletes all batch files (metadata logs) with batch ids below the given thresholdBatchId.
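The purge behavior can be sketched in Python (with a dict `log` as a hypothetical stand-in for batch files keyed by batch id):

```python
# Sketch of purge: drop every batch whose id is below the threshold.
def purge(log, threshold_batch_id):
    for batch_id in [b for b in log if b < threshold_batch_id]:
        del log[batch_id]
```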
Batch Files¶
HDFSMetadataLog considers a file a batch file when the name is simply a long number.
HDFSMetadataLog uses a Hadoop PathFilter to list only batch files.
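The filter condition amounts to "the file name parses as a long number" — a Python sketch:

```python
# Sketch of the batch-file filter: a file is a batch file only when its
# name parses as a (long) integer.
def is_batch_file(name: str) -> bool:
    try:
        int(name)
        return True
    except ValueError:
        return False
```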
Verifying Batch Ids¶
verifyBatchIds(
  batchIds: Seq[Long],
  startId: Option[Long],
  endId: Option[Long]): Unit
verifyBatchIds throws an IllegalStateException when the given batchIds are not continuous, or do not start with the startId or end with the endId (when defined).
verifyBatchIds is used when HDFSMetadataLog (and FileStreamSourceLog) are requested for the metadata of streaming batches (get with a batch id range).
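The verification can be sketched in Python (raising ValueError where the Scala code throws IllegalStateException):

```python
# Sketch of verifyBatchIds: the ids must form a continuous range and,
# when defined, start at start_id and end at end_id.
def verify_batch_ids(batch_ids, start_id=None, end_id=None):
    if not batch_ids:
        if start_id is not None or end_id is not None:
            raise ValueError("batch ids are missing")
        return
    ids = sorted(batch_ids)
    if start_id is not None and ids[0] != start_id:
        raise ValueError(f"batch ids do not start at {start_id}: {ids}")
    if end_id is not None and ids[-1] != end_id:
        raise ValueError(f"batch ids do not end at {end_id}: {ids}")
    if ids != list(range(ids[0], ids[-1] + 1)):
        raise ValueError(f"batch ids are not continuous: {ids}")
```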
Path of Metadata File by Batch Id¶
batchIdToPath(
  batchId: Long): Path
batchIdToPath simply creates a Hadoop Path for the file by the given batchId under the metadata log directory.
Batch Id by Path of Metadata File¶
pathToBatchId(
  path: Path): Long
pathToBatchId simply converts the name of the given metadata file (path) to a batch id (a Long).
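The two conversions are inverses of each other — a Python sketch of both:

```python
import os

# Sketch of batchIdToPath / pathToBatchId: a batch id maps to a file
# named after the id under the metadata log directory, and back again.
def batch_id_to_path(log_dir: str, batch_id: int) -> str:
    return os.path.join(log_dir, str(batch_id))

def path_to_batch_id(path: str) -> int:
    return int(os.path.basename(path))
```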