Apache Spark Programming course – 3-day open training

22 – 24 October 2018



This three-day course will be delivered by Datapao, our Databricks Authorized Training Partner from Monday, October 22, 2018 to Wednesday, October 24, 2018 from 9:00am to 5:00pm CET.

Our experienced instructor provides a hands-on introduction that is easy to follow and covers the parts of Spark most useful for day-to-day work. Whatever questions you have about Spark, you’ll get them answered.


This 3-day course is equally applicable to data engineers, data scientists, analysts, architects, software engineers, and technical managers interested in a thorough, hands-on overview of Apache Spark.

The course covers the fundamentals of Apache Spark including Spark’s architecture and internals, the core APIs for using Spark, SQL and other high-level data access tools, as well as Spark’s streaming capabilities and machine learning APIs. The class is a mixture of lecture and hands-on labs.

Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering after the class ends; all examples are guaranteed to run in that environment.

Learning Objectives

After taking this class, students will be able to:

  • Use the core Spark APIs to operate on data
  • Articulate and implement typical use cases for Spark
  • Build data pipelines and query large data sets using Spark SQL and DataFrames
  • Analyze Spark jobs using the administration UIs inside Databricks
  • Create Structured Streaming jobs
  • Work with relational data using the GraphFrames APIs
  • Understand how a Machine Learning pipeline works
  • Understand the basics of Spark’s internals

Who should apply?

The course is aimed at data engineers and data scientists interested in the most current technologies, analysts and BI professionals with basic coding skills, and developers looking for a specialization in big data. The course material is written in Python, but you don’t have to be an expert to follow and understand it. The course is also a good way for IT managers to get a better understanding of Apache Spark and the capabilities it can deliver.

What do I need?

  • Laptop
  • Browser


  • Spark Overview
  • In-depth discussion of Spark SQL and DataFrames, including:
    • The DataFrames/Datasets API
    • Spark SQL
    • Data Aggregation
    • Column Operations
    • The Functions API: date/time, string manipulation, aggregation
    • Joins & Broadcasting
    • User Defined Functions
    • Caching and caching storage levels
    • Use of the Spark UI to analyze behavior and performance
  • In-depth discussion of Spark internals
    • Cluster Architecture
    • The Catalyst query optimizer
    • The Tungsten in-memory data format
    • How Spark schedules and executes jobs and tasks
    • Shuffling, shuffle files, and performance
    • How various data sources are partitioned
    • How Spark handles data reads and writes
  • Spark Structured Streaming
    • Sources and sinks
    • Structured Streaming APIs
    • Windowing & Aggregation
    • Checkpointing & Watermarking
    • Reliability and Fault Tolerance
    • Kafka Integration
  • Overview of Spark’s MLlib Pipeline API for Machine Learning
    • Transformer/Estimator/Pipeline API
    • Perform feature preprocessing
    • Evaluate and apply ML models
  • Graph processing with GraphFrames
    • Transforming DataFrames into a graph
    • Perform graph analysis, including Label Propagation, PageRank, and ShortestPaths

About Databricks

Databricks’ vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache® Spark™, a powerful open source data processing engine built for sophisticated analytics, ease of use, and speed. Databricks is the largest contributor to the open source Apache Spark project providing 10x more code than any other company. The company has also trained over 20,000 users on Apache Spark, and has the largest number of customers deploying Spark to date. Databricks provides a just-in-time data platform, to simplify data integration, real-time experimentation, and robust deployment of production applications.

How much does it cost?

Course fee


The course fee includes the tuition fee for 3 days.

Maximum number of participants:


Have a question?

Write us an email: courses@datapao.com.


If you have questions about the course or would like to register, feel free to contact us here:

Timing (Day 1-3)

9:00-10:30 Morning session 1
10:30-10:45 Break (Coffee)
10:45-12:00 Morning session 2
12:00-13:00 Lunch
13:00-14:15 Afternoon session 1
14:15-14:30 Break (Coffee)
14:30-16:00 Afternoon session 2

Who is the instructor?


Zoltán Tóth is Principal Instructor at Databricks, the company founded by the original creators of Apache Spark. He has delivered dozens of Spark courses for companies and at major conferences worldwide, such as Strata and Spark Summit. He is also a contributor to Databricks’s Official Spark Courseware, with a special focus on Machine Learning topics. Prior to teaching Apache Spark, Zoltán worked with big data architectures and distributed systems as a Senior Engineer at RapidMiner and an Engineering Manager at Prezi.


T-Mobile building, Rennweg 97-99, 1030, Wien
