DataFrame¶

DataFrame is a Python class with PandasMapOpsMixin and PandasConversionMixin mixins.

DataFrame is defined in pyspark.sql.dataframe module.

from pyspark.sql.dataframe import DataFrame

Creating Instance¶

DataFrame takes the following to be created:

jdf
SQLContext

groupBy¶

groupBy(self, *cols)

groupBy requests the _jdf to groupBy and creates a GroupedData with it.

observe¶

observe(
  self,
  observation: Union["Observation", str],
  *exprs: Column,
) -> "DataFrame"

observe accepts an Observation or a name as the observation:

For an Observation, observe requests it to _on (with this DataFrame and the exprs columns).
For a name, observe creates a new DataFrame after requesting _jdf to observe (with the name).

Demo¶

QueryExecutionListener

You should install QueryExecutionListener (Spark SQL) to intercept QueryExecution on a successful query execution (to access observedMetrics).

import pandas as pd

pandas_df = pd.DataFrame({
  'name': ['jacek', 'agata', 'iweta', 'patryk', 'maksym'],
  'age': [50, 49, 29, 26, 11]
  })
df = spark.createDataFrame(pandas_df)

from pyspark.sql.functions import *
row_count_metric = count(lit(1)).alias("count")
observed_df = df.observe("observe_demo", row_count_metric)

observed_df.count()