Murmur3Hash
Murmur3Hash is a HashExpression that calculates the hash code (an integer) of the given child expressions.
Creating Instance
Murmur3Hash takes the following to be created:
- Child Expressions
- Seed (default: 42)
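The two inputs can be illustrated with the Catalyst API directly. The following is a minimal sketch, assuming a Spark SQL version in which Murmur3Hash exposes a single-argument auxiliary constructor (children only) next to the two-argument one; the value names are hypothetical and used for illustration only.

```scala
import org.apache.spark.sql.catalyst.expressions.{Literal, Murmur3Hash}
import org.apache.spark.sql.types.IntegerType

// Children only: the seed is assumed to fall back to the default (42)
val withDefaultSeed = new Murmur3Hash(Seq(Literal(0)))

// Children and an explicit seed
val withCustomSeed = Murmur3Hash(Seq(Literal(0)), seed = 1)

// A null child does not change the running hash,
// so evaluating the expression gives back the seed
val nullChild = Literal.create(null, IntegerType)
val hashOfNull = new Murmur3Hash(Seq(nullChild)).eval()

println(s"default seed: ${withDefaultSeed.seed}")
println(s"custom seed:  ${withCustomSeed.seed}")
println(s"hash of null: $hashOfNull")
```

This is consistent with the demo on this page, where every null value hashes to 42.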
Murmur3Hash is created when:
- HashPartitioning is requested for the partitionId expression
- hash standard and SQL functions are used
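The first case can be checked without running a query. The sketch below assumes that HashPartitioning exposes a public partitionIdExpression method (as in Spark 3.x) and simply searches the returned expression tree for a Murmur3Hash; the literal child and the number of partitions are arbitrary choices for illustration.

```scala
import org.apache.spark.sql.catalyst.expressions.{Literal, Murmur3Hash}
import org.apache.spark.sql.catalyst.plans.physical.HashPartitioning

// Hypothetical partitioning over a single literal expression
val partitioning = HashPartitioning(Seq(Literal(0)), numPartitions = 4)

// The expression Spark evaluates to decide the target partition of a row
val idExpr = partitioning.partitionIdExpression

// Look for a Murmur3Hash anywhere in the expression tree
val murmur = idExpr.collectFirst { case m: Murmur3Hash => m }

println(idExpr)
println(s"uses Murmur3Hash: ${murmur.isDefined}")
```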
Demo
val data = Seq[Option[Int]](Some(0), None, None, None, Some(4), None)
  .toDF
  .withColumn("hash", hash('value))
scala> data.show
+-----+----------+
|value|      hash|
+-----+----------+
|    0| 933211791|
| null|        42|
| null|        42|
| null|        42|
|    4|-397064898|
| null|        42|
+-----+----------+
scala> data.printSchema
root
 |-- value: integer (nullable = true)
 |-- hash: integer (nullable = false)
import org.apache.spark.TaskContext
import org.apache.spark.sql.Row

// Reuse the default seed (42) as the number of partitions
val defaultSeed = 42
val nonEmptyPartitions = data
  .repartition(numPartitions = defaultSeed, partitionExprs = 'value)
  .mapPartitions { it: Iterator[Row] =>
    // Collect the values of this partition together with the partition ID
    val ns = it.map(_.get(0)).mkString(",")
    Iterator((TaskContext.getPartitionId, ns))
  }
  .as[(Long, String)]
  .collect
  .filterNot { case (_, ns) => ns.isEmpty }
nonEmptyPartitions.foreach { case (pid, ns) => printf("%2s: %s%n", pid, ns) }
0: null,null,null,null
25: 0
32: 4
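The relationship between the hash and the target partition can also be inspected per row. The sketch below rests on an assumption that is not stated on this page: that the partition ID is the positive modulus of the Murmur3 hash and the number of partitions. It places that prediction next to the ID reported by spark_partition_id so the two can be compared. The local session setup and the column names expected_pid and actual_pid are hypothetical; actual IDs may differ between Spark versions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hash, lit, pmod, spark_partition_id}

// Assumption: a local session; in spark-shell, `spark` already exists
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val numPartitions = 42
val check = Seq[Option[Int]](Some(0), None, Some(4))
  .toDF("value")
  .repartition(numPartitions, col("value"))
  .select(
    col("value"),
    // Partition predicted by pmod(hash(value), numPartitions)
    pmod(hash(col("value")), lit(numPartitions)).as("expected_pid"),
    // Partition the row actually ended up in after the shuffle
    spark_partition_id().as("actual_pid"))

check.show()
```

If the assumption holds, the two ID columns agree for every row.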