EvalPythonExec Unary Physical Operators¶
EvalPythonExec
is an extension of the UnaryExecNode
(Spark SQL) abstraction for unary physical operators that evaluate PythonUDFs (when executed).
Contract¶
Evaluating PythonUDFs¶
evaluate(
funcs: Seq[ChainedPythonFunctions],
argOffsets: Array[Array[Int]],
iter: Iterator[InternalRow],
schema: StructType,
context: TaskContext): Iterator[InternalRow]
See:
Used when:
EvalPythonExec
physical operator is requested to doExecute
Result Attributes¶
resultAttrs: Seq[Attribute]
Result Attribute
s (Spark SQL)
See:
Used when:
EvalPythonExec
physical operator is requested for the output and producedAttributes
Python UDFs¶
udfs: Seq[PythonUDF]
See:
Used when:
EvalPythonExec
physical operator is requested to doExecute
Implementations¶
- ArrowEvalPythonExec
BatchEvalPythonExec
Executing Physical Operator¶
The gist of doExecute
is to evaluate Python UDFs (for every InternalRow
) with some pre- and post-processing.
doExecute
requests the child physical operator to execute
(to produce an input RDD[InternalRow]
).
Note
EvalPythonExec
s are UnaryExecNode
s (Spark SQL).
doExecute
uses RDD.mapPartitions
operator to execute a function over partitions of InternalRow
s.
For every partition, doExecute
creates a MutableProjection
for the inputs (and the child's output) and requests it to initialize
.
doExecute
evaluates Python UDFs (for every InternalRow
).
In the end, doExecute
creates an UnsafeProjection
for the output to "map over" the rows (from evaluating Python UDFs).