Spark and Hadoop

Hadoop Storage Formats

The currently-supported Hadoop storage formats typically used with HDFS are:

FIXME What are the differences between the formats and how are they used in Spark.

Introduction to Hadoop

This page is the place to keep information more general about Hadoop and not related to Spark on YARN or files Using Input and Output (I/O) (HDFS). I don’t really know what it could be, though. Perhaps nothing at all. Just saying.

From Apache Hadoop's web site:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

  • Originally, Hadoop is an umbrella term for the following (core) modules:

    • HDFS (Hadoop Distributed File System) is a distributed file system designed to run on commodity hardware. It is a data storage with files split across a cluster.

    • MapReduce - the compute engine for batch processing

    • YARN (Yet Another Resource Negotiator) - the resource manager

  • Currently, it’s more about the ecosystem of solutions that all use Hadoop infrastructure for their work.

People reported to do wonders with the software with Yahoo! saying:

Yahoo has progressively invested in building and scaling Apache Hadoop clusters with a current footprint of more than 40,000 servers and 600 petabytes of storage spread across 19 clusters.

Beside numbers Yahoo! reported that:

Deep learning can be defined as first-class steps in Apache Oozie workflows with Hadoop for data processing and Spark pipelines for machine learning.

You can find some preliminary information about Spark pipelines for machine learning in the chapter ML Pipelines.

HDFS provides fast analytics – scanning over large amounts of data very quickly, but it was not built to handle updates. If data changed, it would need to be appended in bulk after a certain volume or time interval, preventing real-time visibility into this data.

  • HBase complements HDFS’ capabilities by providing fast and random reads and writes and supporting updating data, i.e. serving small queries extremely quickly, and allowing data to be updated in place.

When Spark reads a file from HDFS, it creates a single partition for a single input split. Input split is set by the Hadoop InputFormat used to read this file. For instance, if you use textFile() it would be TextInputFormat in Hadoop, which would return you a single partition for a single block of HDFS (but the split between partitions would be done on line split, not the exact block split), unless you have a compressed text file. In case of compressed file you would get a single partition for a single file (as compressed text files are not splittable).

If you have a 30GB uncompressed text file stored on HDFS, then with the default HDFS block size setting (128MB) it would be stored in 235 blocks, which means that the RDD you read from this file would have 235 partitions. When you call repartition(1000) your RDD would be marked as to be repartitioned, but in fact it would be shuffled to 1000 partitions only when you will execute an action on top of this RDD (lazy execution concept)

With HDFS you can store any data (regardless of format and size). It can easily handle unstructured data like video or other binary files as well as semi- or fully-structured data like CSV files or databases.

There is the concept of data lake that is a huge data repository to support analytics.

HDFS partition files into so called splits and distributes them across multiple nodes in a cluster to achieve fail-over and resiliency.

MapReduce happens in three phases: Map, Shuffle, and Reduce.