[CAUTION]¶

FIXME Offer a complete working solution in Scala
FIXME Load ORC files into dataframe ** val df = hiveContext.read.format("orc").load(to/path) ====

Solution was to use Hive in ORC format with partitions:

A table in Hive stored as an ORC file (using partitioning)
Using SQLContext.sql to insert data into the table
Using SQLContext.sql to periodically run ALTER TABLE...CONCATENATE to merge your many small files into larger files optimized for your HDFS block size ** Since the CONCATENATE command operates on files in place it is transparent to any downstream processing
Hive solution is just to concatenate the files ** it does not alter or change records. ** it's possible to update data in Hive using ORC format ** With transactional tables in Hive together with insert, update, delete, it does the "concatenate " for you automatically in regularly intervals. Currently this works only with tables in orc.format (stored as orc) ** Alternatively, use Hbase with Phoenix as the SQL layer on top ** Hive was originally not designed for updates, because it was.purely warehouse focused, the most recent one can do updates, deletes etc in a transactional way.

Criteria:

spark-streaming/spark-streaming.md[Spark Streaming] jobs are receiving a lot of small events (avg 10kb)
Events are stored to HDFS, e.g. for Pig jobs
There are a lot of small files in HDFS (several millions)