Aggregations in Apache Spark

Spark offers tremendous value when it comes to aggregation, thanks to its flexible DataFrame API, which lets you run performant jobs in a distributed fashion. A DataFrame is simply a Dataset organized into named columns, which makes the concept similar to data frames in Python (pandas) and R. However, the DataFrame API has richer and more sophisticated optimization under the hood.

When you are dealing with an ever-growing flood of zillions of data points, Spark aggregation functions come to the rescue. First of all, they enable you to meet the demands of your analyst and data science teams. Second, Spark's optimization engine gives you the peace of mind you need for a more productive and happy life.

In this post, we are going to cover three of them: groupBy (with pivot and window), rollup, and cube. The diagrams below briefly describe the intuition behind these functions. Please also get The Notebook to see how they work in action.

groupBy: Just like in any other framework, we can group our rows by category and aggregate each group.
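To make the idea concrete, here is a minimal plain-Python sketch of what a grouped sum computes. It is not Spark code, and the sample rows and the column names (category, region, amount) are made up for illustration. In PySpark, the corresponding call on a DataFrame with these columns would be along the lines of df.groupBy("category").sum("amount").

```python
from collections import defaultdict

# Hypothetical sample rows: (category, region, amount).
rows = [
    ("fruit", "north", 10),
    ("fruit", "south", 5),
    ("veg", "north", 7),
    ("veg", "south", 3),
    ("fruit", "north", 2),
]


def group_sum(rows, key_index, value_index):
    """Sum the value column for each distinct key (a grouped sum)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[value_index]
    return dict(totals)


# Roughly what df.groupBy("category").sum("amount") would return.
by_category = group_sum(rows, key_index=0, value_index=2)
print(by_category)
```

Spark does the same thing, only distributed: each partition aggregates its own rows and the partial results are then combined.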

groupBy with pivot: We can add a new dimension to our grouped categories using pivot, which turns the distinct values of a second column into columns of the result.
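A small plain-Python sketch of the pivot idea, again with made-up sample rows and column names. It is only meant to show the shape of the result; in PySpark the analogous call would be something like df.groupBy("category").pivot("region").sum("amount").

```python
from collections import defaultdict

# Hypothetical sample rows: (category, region, amount).
rows = [
    ("fruit", "north", 10),
    ("fruit", "south", 5),
    ("veg", "north", 7),
    ("veg", "south", 3),
    ("fruit", "north", 2),
]


def pivot_sum(rows):
    """One entry per category, one inner key per region, cells hold summed amounts."""
    table = defaultdict(lambda: defaultdict(int))
    for category, region, amount in rows:
        table[category][region] += amount
    # Convert the nested defaultdicts to plain dicts for readability.
    return {category: dict(cells) for category, cells in table.items()}


pivoted = pivot_sum(rows)
for category, cells in pivoted.items():
    print(category, cells)
```

Each category stays a row, while the regions that used to be row values become columns. In Spark, a category with no rows for a given region would get a null in that cell; in this sketch the key would simply be missing.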

rollup: It is an extension of groupBy that calculates subtotals and a grand total across multiple dimensions. You get the subtotal for each group and the grand total in one go.
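One way to think about rollup is as a set of groupings run together: every prefix of the column list, down to the empty list for the grand total. The sketch below mimics that in plain Python under made-up data and column names; it is an illustration of the semantics, not how Spark implements it. The PySpark counterpart would be roughly df.rollup("category", "region").sum("amount").

```python
# Hypothetical sample data; the column names are illustrative only.
rows = [
    {"category": "fruit", "region": "north", "amount": 10},
    {"category": "fruit", "region": "south", "amount": 5},
    {"category": "veg", "region": "north", "amount": 7},
    {"category": "veg", "region": "south", "amount": 3},
    {"category": "fruit", "region": "north", "amount": 2},
]


def rollup_grouping_sets(columns):
    """Return every prefix of the column list, longest first."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


def sum_by_grouping_sets(rows, columns, grouping_sets, value):
    """Sum `value` once per grouping set; columns outside the set become None."""
    totals = {}
    for grouping_set in grouping_sets:
        for row in rows:
            key = tuple(row[c] if c in grouping_set else None for c in columns)
            totals[key] = totals.get(key, 0) + row[value]
    return totals


columns = ["category", "region"]
rolled = sum_by_grouping_sets(rows, columns, rollup_grouping_sets(columns), "amount")
for key, total in rolled.items():
    print(key, total)
```

Keys with None in the region slot are the per-category subtotals, and the key with None in both slots is the grand total. Note that there is no per-region subtotal here, because ("region",) is not a prefix of the column list.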

cube: When you use cube, on the other hand, you get subtotals across all combinations of the specified columns, not just the hierarchy that rollup follows. Plus, you still get your grand total.
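The same plain-Python sketch extends to cube by grouping on every subset of the columns instead of only the prefixes. As before, the data and column names are made up and this only illustrates the semantics; the PySpark counterpart would be roughly df.cube("category", "region").sum("amount").

```python
from itertools import combinations

# Hypothetical sample data; the column names are illustrative only.
rows = [
    {"category": "fruit", "region": "north", "amount": 10},
    {"category": "fruit", "region": "south", "amount": 5},
    {"category": "veg", "region": "north", "amount": 7},
    {"category": "veg", "region": "south", "amount": 3},
    {"category": "fruit", "region": "north", "amount": 2},
]


def cube_grouping_sets(columns):
    """Return every subset of the column list, largest first."""
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]


def sum_by_grouping_sets(rows, columns, grouping_sets, value):
    """Sum `value` once per grouping set; columns outside the set become None."""
    totals = {}
    for grouping_set in grouping_sets:
        for row in rows:
            key = tuple(row[c] if c in grouping_set else None for c in columns)
            totals[key] = totals.get(key, 0) + row[value]
    return totals


columns = ["category", "region"]
cubed = sum_by_grouping_sets(rows, columns, cube_grouping_sets(columns), "amount")
for key, total in cubed.items():
    print(key, total)
```

Compared with the rollup case, the result gains the per-region subtotals (the keys with None in the category slot), so two columns yield four grouping sets instead of three.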


It follows that if you group by only one column, rollup and cube give you the same result. For detailed explanations, please refer to the book Spark: The Definitive Guide.
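The single-column case is easy to check by comparing the grouping sets each function would produce. This is a plain-Python sketch with hypothetical helper names, not a Spark API: rollup is modelled as all prefixes of the column list and cube as all subsets.

```python
from itertools import combinations


def rollup_grouping_sets(columns):
    """Every prefix of the column list, longest first."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


def cube_grouping_sets(columns):
    """Every subset of the column list, largest first."""
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]


one_column = ["category"]
two_columns = ["category", "region"]

# With one column, the prefixes and the subsets are the same two sets.
same_for_one = set(rollup_grouping_sets(one_column)) == set(cube_grouping_sets(one_column))
# With two columns, cube adds the ("region",) grouping that rollup lacks.
same_for_two = set(rollup_grouping_sets(two_columns)) == set(cube_grouping_sets(two_columns))
print(same_for_one, same_for_two)
```

With a single column, both produce the per-group totals plus the grand total, which is why the outputs coincide.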

You can download The Notebook and use it in a Databricks environment. If you don't have a Databricks Community Edition account yet, you can sign up for free and test it.