Performance Tuning

Goal: Improve Spark’s performance where feasible.

  • measure performance bottlenecks using new metrics, including block-time analysis

  • a live demo of a new performance analysis tool

  • CPU — not I/O (network) — is often a critical bottleneck

  • community dogma = network and disk I/O are major bottlenecks

  • a TPC-DS workload, of two sizes: a 20 machine cluster with 850GB of data, and a 60 machine cluster with 2.5TB of data.

    • network is almost irrelevant for performance of these workloads

    • network optimization could only reduce job completion time by, at most, 2%

    • 10Gbps networking hardware is likely not necessary

  • serialized compressed data

  • reduceByKey is better

  • mind serialization time

    • impacts CPU - time to serialize and network - time to send the data over the wire

  • Tungsten - recent initiative from Databrics - aims at reducing CPU time

    • jobs become more bottlenecked by IO