Apache Spark Programming course – 3-day open training

26 – 28 November 2018

Berlin

This three-day course will be delivered by Datapao, our Databricks Authorized Training Partner, from Monday, November 26, 2018 to Wednesday, November 28, 2018, from 9:00am to 5:00pm CET.

Our experienced instructor provides a hands-on introduction, which is easy to follow and incorporates the most useful parts necessary for day-to-day work. Whatever questions you may have about Spark, you’ll have your answers delivered.

Overview

This 3-day course is equally applicable to data engineers, data scientists, analysts, architects, software engineers, and technical managers interested in a thorough, hands-on overview of Apache Spark.

The course covers the fundamentals of Apache Spark including Spark’s architecture and internals, the core APIs for using Spark, SQL and other high-level data access tools, as well as Spark’s streaming capabilities and machine learning APIs. The class is a mixture of lecture and hands-on labs.

Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering after the class ends; all examples are guaranteed to run in that environment.

Learning Objectives

After taking this class, students will be able to:

  • Use the core Spark APIs to operate on data
  • Articulate and implement typical use cases for Spark
  • Build data pipelines and query large data sets using Spark SQL and DataFrames
  • Analyze Spark jobs using the administration UIs inside Databricks
  • Create Structured Streaming jobs
  • Work with relational data using the GraphFrames APIs
  • Understand how a Machine Learning pipeline works
  • Understand the basics of Spark’s internals

Who should apply?

This course is aimed at data engineers and data scientists interested in the most current technologies, analysts and BI professionals with basic coding skills, and developers looking for a specialization in big data. Course material is written in Python, but you don’t have to be an expert to follow and understand it. The course is also a good way for IT managers to get a better understanding of Apache Spark and the capabilities it can deliver.

What do I need?

  • Laptop
  • Browser

Topics

  • Spark Overview
  • In-depth discussion of Spark SQL and DataFrames, including:
    • The DataFrames/Datasets API
    • Spark SQL
    • Data Aggregation
    • Column Operations
    • The Functions API: date/time, string manipulation, aggregation
    • Joins & Broadcasting
    • User Defined Functions
    • Caching and caching storage levels
    • Use of the Spark UI to analyze behavior and performance
  • In-depth discussion of Spark internals
    • Cluster Architecture
    • The Catalyst query optimizer
    • The Tungsten in-memory data format
    • How Spark schedules and executes jobs and tasks
    • Shuffling, shuffle files, and performance
    • How various data sources are partitioned
    • How Spark handles data reads and writes
  • Spark Structured Streaming
    • Sources and sinks
    • Structured Streaming APIs
    • Windowing & Aggregation
    • Checkpointing & Watermarking
    • Reliability and Fault Tolerance
    • Kafka Integration
  • Overview of Spark’s MLlib Pipeline API for Machine Learning
    • Transformer/Estimator/Pipeline API
    • Perform feature preprocessing
    • Evaluate and apply ML models
  • Graph processing with GraphFrames
    • Transforming DataFrames into a graph
    • Perform graph analysis, including Label Propagation, PageRank, and ShortestPaths

About Databricks

Databricks’ vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache® Spark™, a powerful open source data processing engine built for sophisticated analytics, ease of use, and speed. Databricks is the largest contributor to the open source Apache Spark project, providing 10x more code than any other company. The company has also trained over 20,000 users on Apache Spark, and has the largest number of customers deploying Spark to date. Databricks provides a just-in-time data platform to simplify data integration, real-time experimentation, and robust deployment of production applications.

How much does it cost?

Course fee

2,500

The course fee includes the tuition fee for 3 days.

Maximum number of participants:

20

Have a question?

Write us an email: courses@datapao.com.

Register

If you have questions about the course or would like to register, feel free to contact us here:



Timing (Day 1-3)

9:00-10:30 Morning session 1
10:30-10:45 Break (Coffee)
10:45-12:00 Morning session 2
12:00-13:00 Lunch
13:00-14:15 Afternoon session 1
14:15-14:30 Break (Coffee)
14:30-16:00 Afternoon session 2

Who is the instructor?

Miklós Tóth is a senior instructor at Datapao. Besides his work at Datapao, Miklós works as a data scientist on various machine learning projects and runs Java Spring Framework classes. Prior to teaching Apache Spark, Miklós worked for AUDI Academy as an IT trainer.

Where?

Berlin – downtown

What should I do if I have a question?

We are happy to answer any of your questions. Write us an email: courses@datapao.com
Our super enthusiastic team will answer shortly!