EvalPythonExec Unary Physical Operators
EvalPythonExec is an extension of the UnaryExecNode (Spark SQL) abstraction for unary physical operators that evaluate PythonUDFs (when executed).
Contract
Evaluating PythonUDFs
evaluate(
  funcs: Seq[ChainedPythonFunctions],
  argOffsets: Array[Array[Int]],
  iter: Iterator[InternalRow],
  schema: StructType,
  context: TaskContext): Iterator[InternalRow]
Used when:
- EvalPythonExec physical operator is requested to doExecute
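
How funcs, argOffsets and iter could fit together can be illustrated with a minimal sketch in plain Scala. It assumes (the text above does not state this) that argOffsets(i) lists the positions, within each input row, of the arguments of the i-th function. Row, SimpleUdf and ArgOffsetsSketch are hypothetical stand-ins rather than Spark classes, the schema and context parameters are left out, and the functions run in-process instead of in a Python worker.

```scala
object ArgOffsetsSketch {
  // Hypothetical stand-in for InternalRow: a row is just a vector of values.
  type Row = Vector[Any]

  // Hypothetical stand-in for a (chained) Python function.
  final case class SimpleUdf(name: String, fn: Seq[Any] => Any)

  // For every input row, apply each function to the columns selected by
  // its offsets and emit one result column per function.
  def evaluate(
      funcs: Seq[SimpleUdf],
      argOffsets: Array[Array[Int]],
      iter: Iterator[Row]): Iterator[Row] = {
    require(funcs.length == argOffsets.length, "one offset array per function")
    iter.map { row =>
      funcs.zip(argOffsets).map { case (udf, offsets) =>
        udf.fn(offsets.toSeq.map(row))
      }.toVector
    }
  }

  def main(args: Array[String]): Unit = {
    val plus = SimpleUdf("plus", xs => xs(0).asInstanceOf[Int] + xs(1).asInstanceOf[Int])
    val upper = SimpleUdf("upper", xs => xs(0).toString.toUpperCase)
    val rows = Iterator[Row](Vector(1, 2, "a"), Vector(3, 4, "b"))
    // plus reads columns 0 and 1; upper reads column 2.
    evaluate(Seq(plus, upper), Array(Array(0, 1), Array(2)), rows).foreach(println)
  }
}
```

In this model the result rows contain only the function results; combining them with the original input columns is left to the caller.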
Result Attributes
resultAttrs: Seq[Attribute]
Result Attributes (Spark SQL)
Used when:
- EvalPythonExec physical operator is requested for the output and producedAttributes
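
The role of resultAttrs in output and producedAttributes can be pictured with a small model. This is a sketch under two assumptions that are not spelled out above: the output is the child's output followed by resultAttrs, and resultAttrs are the only attributes the operator itself produces. Attr, EvalPythonLike and childOutput are hypothetical names, with attributes reduced to plain column names.

```scala
object ResultAttrsSketch {
  // Hypothetical stand-in for Attribute: just a column name.
  type Attr = String

  // Hypothetical, simplified model of the operator's attribute bookkeeping.
  trait EvalPythonLike {
    def childOutput: Seq[Attr] // attributes produced by the child operator
    def resultAttrs: Seq[Attr] // one attribute per evaluated Python UDF

    // Assumption: output is the child's output followed by resultAttrs.
    def output: Seq[Attr] = childOutput ++ resultAttrs

    // Assumption: only resultAttrs are newly produced by this operator.
    def producedAttributes: Set[Attr] = resultAttrs.toSet
  }

  def main(args: Array[String]): Unit = {
    val op = new EvalPythonLike {
      val childOutput: Seq[Attr] = Seq("id", "name")
      val resultAttrs: Seq[Attr] = Seq("pythonUDF0")
    }
    println(op.output)
    println(op.producedAttributes)
  }
}
```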
Python UDFs
udfs: Seq[PythonUDF]
Used when:
- EvalPythonExec physical operator is requested to doExecute
Implementations
- ArrowEvalPythonExec
- BatchEvalPythonExec
Executing Physical Operator
The gist of doExecute is to evaluate Python UDFs (for every InternalRow) with some pre- and post-processing.
doExecute requests the child physical operator to execute (to produce an input RDD[InternalRow]).
Note
EvalPythonExecs are UnaryExecNodes (Spark SQL).
doExecute uses the RDD.mapPartitions operator to execute a function over partitions of InternalRows.
For every partition, doExecute creates a MutableProjection for the inputs (and the child's output) and requests it to initialize.
doExecute evaluates Python UDFs (for every InternalRow).
In the end, doExecute creates an UnsafeProjection for the output to "map over" the rows (from evaluating Python UDFs).
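
The steps above can be sketched end to end with plain Scala collections. This is a simplified model and not the Spark implementation: partitions are in-memory sequences, the projections become plain column selection and concatenation, and the whole partition is buffered in memory so that each input row can be paired with its results. The buffering, the pairing, and the names Row, mapPartitions, inputOffsets and DoExecuteSketch are assumptions made for illustration.

```scala
object DoExecuteSketch {
  // Hypothetical stand-in for InternalRow.
  type Row = Vector[Any]

  // Stand-in for RDD.mapPartitions: apply f once per partition iterator.
  def mapPartitions(partitions: Seq[Seq[Row]])(
      f: Iterator[Row] => Iterator[Row]): Seq[Seq[Row]] =
    partitions.map(p => f(p.iterator).toVector)

  def doExecute(
      childPartitions: Seq[Seq[Row]],
      inputOffsets: Seq[Int], // child columns that the UDFs read
      evaluate: Iterator[Row] => Iterator[Row]): Seq[Seq[Row]] =
    mapPartitions(childPartitions) { iter =>
      // Pre-processing: keep the original rows (assumption: buffered in
      // memory here) and project only the UDF input columns.
      val buffered = iter.toVector
      val projected = buffered.iterator.map(row => inputOffsets.map(row).toVector)
      // Evaluate the UDFs for every projected row.
      val results = evaluate(projected)
      // Post-processing: append the UDF results to the original rows.
      buffered.iterator.zip(results).map { case (in, res) => in ++ res }
    }

  def main(args: Array[String]): Unit = {
    val partitions: Seq[Seq[Row]] =
      Seq(Seq(Vector(1, "a"), Vector(2, "b")), Seq(Vector(3, "c")))
    // A toy "UDF" that multiplies its single argument by 10.
    val timesTen: Iterator[Row] => Iterator[Row] =
      _.map(r => Vector[Any](r(0).asInstanceOf[Int] * 10))
    doExecute(partitions, Seq(0), timesTen).foreach(println)
  }
}
```

The sketch keeps the evaluation step as a function parameter, mirroring how the concrete operators supply their own evaluate.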