pyspark.pandas.DataFrame.pandas_on_spark.transform_batch¶
-
pandas_on_spark.
transform_batch
(func: Callable[[…], Union[pandas.core.frame.DataFrame, pandas.core.series.Series]], *args: Any, **kwargs: Any) → Union[DataFrame, Series]¶ Transform chunks with a function that takes pandas DataFrame and outputs pandas DataFrame. The pandas DataFrame given to the function is of a batch used internally. The length of each input and output should be the same.
See also Transform and apply a function.
Note
the func is unable to access to the whole input frame. pandas-on-Spark internally splits the input series into multiple batches and calls func with each batch multiple times. Therefore, operations such as global aggregations are impossible. See the example below.
>>> # This case does not return the length of whole frame but of the batch internally ... # used. ... def length(pdf) -> ps.DataFrame[int]: ... return pd.DataFrame([len(pdf)] * len(pdf)) ... >>> df = ps.DataFrame({'A': range(1000)}) >>> df.pandas_on_spark.transform_batch(length) c0 0 83 1 83 2 83 ...
Note
this API executes the function once to infer the type which is potentially expensive, for instance, when the dataset is created after aggregations or sorting.
To avoid this, specify return type in
func
, for instance, as below:>>> def plus_one(x) -> ps.DataFrame[float, float]: ... return x + 1
If the return type is specified, the output column names become c0, c1, c2 … cn. These names are positionally mapped to the returned DataFrame in
func
.To specify the column names, you can assign them in a pandas friendly style as below:
>>> def plus_one(x) -> ps.DataFrame['a': float, 'b': float]: ... return x + 1
>>> pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]}) >>> def plus_one(x) -> ps.DataFrame[zip(pdf.dtypes, pdf.columns)]: ... return x + 1
When the given function returns DataFrame and has the return type annotated, the original index of the DataFrame will be lost and then a default index will be attached to the result. Please be careful about configuring the default index. See also Default Index Type.
- Parameters
- funcfunction
Function to transform each pandas frame.
- *args
Positional arguments to pass to func.
- **kwargs
Keyword arguments to pass to func.
- Returns
- DataFrame or Series
See also
DataFrame.pandas_on_spark.apply_batch
For row/columnwise operations.
Series.pandas_on_spark.transform_batch
transform the search as each pandas chunks.
Examples
>>> df = ps.DataFrame([(1, 2), (3, 4), (5, 6)], columns=['A', 'B']) >>> df A B 0 1 2 1 3 4 2 5 6
>>> def plus_one_func(pdf) -> ps.DataFrame[int, int]: ... return pdf + 1 >>> df.pandas_on_spark.transform_batch(plus_one_func) c0 c1 0 2 3 1 4 5 2 6 7
>>> def plus_one_func(pdf) -> ps.DataFrame['A': int, 'B': int]: ... return pdf + 1 >>> df.pandas_on_spark.transform_batch(plus_one_func) A B 0 2 3 1 4 5 2 6 7
>>> def plus_one_func(pdf) -> ps.Series[int]: ... return pdf.B + 1 >>> df.pandas_on_spark.transform_batch(plus_one_func) 0 3 1 5 2 7 dtype: int64
You can also omit the type hints so pandas-on-Spark infers the return schema as below:
>>> df.pandas_on_spark.transform_batch(lambda pdf: pdf + 1) A B 0 2 3 1 4 5 2 6 7
>>> (df * -1).pandas_on_spark.transform_batch(abs) A B 0 1 2 1 3 4 2 5 6
Note that you should not transform the index. The index information will not change.
>>> df.pandas_on_spark.transform_batch(lambda pdf: pdf.B + 1) 0 3 1 5 2 7 Name: B, dtype: int64
You can also specify extra arguments as below.
>>> df.pandas_on_spark.transform_batch(lambda pdf, a, b, c: pdf.B + a + b + c, 1, 2, c=3) 0 8 1 10 2 12 Name: B, dtype: int64