AggregateInPandasExec Physical Operator

AggregateInPandasExec is a unary physical operator (Spark SQL) that executes pandas UDAFs using ArrowPythonRunner (one per partition).

Creating Instance

AggregateInPandasExec takes the following to be created:

Grouping Expressions (Spark SQL) (Seq[NamedExpression])
pandas UDAFs (PythonUDFs with the SQL_GROUPED_AGG_PANDAS_UDF eval type)
Result Named Expressions (Spark SQL) (Seq[NamedExpression])
Child Physical Operator (Spark SQL)

AggregateInPandasExec is created when the Aggregation execution planning strategy (Spark SQL) is executed for an Aggregate logical operator (Spark SQL) whose aggregate expressions are all PythonUDFs.
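As an illustration, the following minimal sketch builds a query whose only aggregate expression is a pandas UDAF, which is the kind of Aggregate described above. The names mean_amount and explain_grouped_agg, the column names, and the local[1] master are assumptions made for this example. The Spark imports sit inside the function so that the module loads without a Spark installation. Running the function requires pyspark, pandas and pyarrow, and the printed physical plan is expected to include an AggregateInPandas node.

```python
from statistics import fmean


def mean_amount(v):
    # Body of the pandas UDAF: receives all values of one group (a pandas
    # Series when run by Spark) and returns a single scalar.
    # fmean accepts any sized iterable of numbers, so it also works on a list.
    return fmean(v)


def explain_grouped_agg():
    # Hypothetical helper: imports are local so Spark is only needed when called.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import PandasUDFType, pandas_udf

    spark = SparkSession.builder.master("local[1]").appName("demo").getOrCreate()
    try:
        # Wrap the plain function as a grouped-aggregate pandas UDF
        # (the SQL_GROUPED_AGG_PANDAS_UDF eval type).
        mean_udf = pandas_udf(mean_amount, "double", PandasUDFType.GROUPED_AGG)
        df = spark.createDataFrame([(1, 1.0), (1, 3.0), (2, 5.0)], ["id", "amount"])
        # Every aggregate expression is a PythonUDF, so the planner can pick
        # the pandas-based aggregate operator; explain() prints the plan.
        df.groupBy("id").agg(mean_udf(df["amount"]).alias("mean_amount")).explain()
    finally:
        spark.stop()
```

Call explain_grouped_agg() in an environment with Spark available to inspect the plan. Passing the function type explicitly avoids relying on type-hint inference, at the cost of a deprecation warning in recent Spark versions.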

Executing Operator

SparkPlan

doExecute(): RDD[InternalRow]

doExecute is part of the SparkPlan (Spark SQL) abstraction.

doExecute uses ArrowPythonRunner (one per partition) to execute PythonUDFs.
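The per-partition evaluation can be pictured with a conceptual model in plain Python. This is not Spark code and does not use ArrowPythonRunner: it only mimics the idea of taking one partition's rows, collecting the values of each group, and producing one output row per group. The function name aggregate_partition and its parameters are hypothetical, and the sketch assumes the rows of a partition arrive already clustered by the grouping key.

```python
from itertools import groupby
from operator import itemgetter
from statistics import fmean


def aggregate_partition(rows, key_index, value_index, agg):
    """Model one partition: yield (key, aggregate) for each group of rows.

    Assumes rows with the same key are adjacent (clustered by key).
    """
    for key, group in groupby(rows, key=itemgetter(key_index)):
        # Collect the UDF input column for this group, as a batch would be.
        values = [row[value_index] for row in group]
        # The aggregate function plays the role of the pandas UDAF.
        yield (key, agg(values))


# One partition with two groups, already clustered by the first column.
partition = [(1, 1.0), (1, 3.0), (2, 5.0)]
result = list(aggregate_partition(partition, 0, 1, fmean))
print(result)
```

In this model each partition is processed independently, which mirrors the one-runner-per-partition arrangement described above.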