DataFrame¶
DataFrame is a Python class with PandasMapOpsMixin and PandasConversionMixin mixins.
DataFrame is defined in pyspark.sql.dataframe module.
from pyspark.sql.dataframe import DataFrame
Creating Instance¶
DataFrame takes the following to be created:
- jdf
- SQLContext
groupBy¶
groupBy(self, *cols)
groupBy requests the _jdf to groupBy and creates a GroupedData with it.
observe¶
observe(
self,
observation: Union["Observation", str],
*exprs: Column,
) -> "DataFrame"
observe accepts an Observation or a name as the observation:
-
For an Observation,
observerequests it to _on (with thisDataFrameand theexprscolumns). -
For a name,
observecreates a newDataFrameafter requesting _jdf toobserve(with the name).
Demo¶
QueryExecutionListener
You should install QueryExecutionListener (Spark SQL) to intercept QueryExecution on a successful query execution (to access observedMetrics).
import pandas as pd
pandas_df = pd.DataFrame({
'name': ['jacek', 'agata', 'iweta', 'patryk', 'maksym'],
'age': [50, 49, 29, 26, 11]
})
df = spark.createDataFrame(pandas_df)
from pyspark.sql.functions import *
row_count_metric = count(lit(1)).alias("count")
observed_df = df.observe("observe_demo", row_count_metric)
observed_df.count()