Skip to content

HDFSMetadataLog

HDFSMetadataLog is an extension of the MetadataLog abstraction for metadata storage to store batch files in a metadata log directory on Hadoop DFS-compatible file systems (for fault-tolerance and reliability).

Extensions

Creating Instance

HDFSMetadataLog takes the following to be created:

While being created, HDFSMetadataLog makes sure that the path exists (and creates it if not).

Metadata Log Directory

HDFSMetadataLog uses the given path as the metadata log directory with metadata logs (one per batch).

The path is immediately converted to a Hadoop Path for file management.

CheckpointFileManager

HDFSMetadataLog creates a CheckpointFileManager (with the metadata log directory) when created.

Implicit Json4s Formats

HDFSMetadataLog uses Json4s with the Jackson binding for metadata serialization and deserialization (to and from JSON format).

Latest Committed Batch Id with Metadata (If Available)

getLatest(): Option[(Long, T)]

getLatest is a part of MetadataLog abstraction.

getLatest requests the internal <> for the files in <> that match <>.

getLatest takes the batch ids (the batch files correspond to) and sorts the ids in reverse order.

getLatest gives the first batch id with the metadata which <>.

Note

It is possible that the batch id could be in the metadata storage, but not available for retrieval.

Retrieving Metadata of Streaming Batch

get(
  batchId: Long): Option[T]

get is part of the MetadataLog abstraction.


get...FIXME

Deserializing Metadata

deserialize(
  in: InputStream): T

deserialize deserializes a metadata (of type T) from a given InputStream.

Retrieving Metadata of Streaming Batches (if Available)

get(
  startId: Option[Long],
  endId: Option[Long]): Array[(Long, T)]

get is part of the MetadataLog abstraction.

get...FIXME

Persisting Metadata of Streaming Micro-Batch

add(
  batchId: Long,
  metadata: T): Boolean

add is part of the MetadataLog abstraction.

add return true when the metadata of the streaming batch was not available and persisted successfully. Otherwise, add returns false.

Internally, add <> (batchId) and returns false when found.

Otherwise, when not found, add <> for the given batchId and <>. add returns true if successful.

Writing Batch Metadata to File (Metadata Log)

writeBatchToFile(
  metadata: T,
  path: Path): Unit

writeBatchToFile requests the <> to createAtomic (for the specified path and the overwriteIfPossible flag disabled).

writeBatchToFile then <> (to the CancellableFSDataOutputStream output stream) and closes the stream.

In case of an exception, writeBatchToFile simply requests the CancellableFSDataOutputStream output stream to cancel (so that the output file is not generated) and re-throws the exception.

Serializing Metadata

serialize(
  metadata: T,
  out: OutputStream): Unit

serialize simply writes out the log data in a serialized format (using Json4s (with Jackson binding) library).

Purging Expired Metadata

purge(
  thresholdBatchId: Long): Unit

purge is part of the MetadataLog abstraction.

purge...FIXME

Batch Files

HDFSMetadataLog considers a file a batch file when the name is simply a long number.

HDFSMetadataLog uses a Hadoop PathFilter to list only batch files.

Verifying Batch Ids

verifyBatchIds(
  batchIds: Seq[Long],
  startId: Option[Long],
  endId: Option[Long]): Unit

verifyBatchIds...FIXME

verifyBatchIds is used when:

  • FileStreamSourceLog is requested to get
  • HDFSMetadataLog is requested to get

Path of Metadata File by Batch Id

batchIdToPath(
  batchId: Long): Path

batchIdToPath simply creates a Hadoop Path for the file by the given batchId under the metadata log directory.

Batch Id by Path of Metadata File

pathToBatchId(
  path: Path): Long

pathToBatchId...FIXME