Apache Spark

Apache Spark is a cluster-computing framework originally developed at UC Berkeley’s AMPLab in 2009 and later released as an open-source project hosted by the Apache Software Foundation. For some in-memory workloads it can be as much as 100 times faster than Hadoop MapReduce, and it offers a much simpler programming model, which can translate into significant productivity gains for developers.

Apache Spark lets you build and deploy Big Data applications for batch or real-time processing, with built-in support for machine learning and graph analysis.

This course is a comprehensive introduction to the Spark ecosystem and will help you master its key features.

Training plan

1 - General

A Hadoop refresher, followed by an introduction to Apache Spark, its ecosystem, and its internal architecture.

2 - Apache Spark and its APIs

General principles and use of RDDs (Resilient Distributed Datasets), Apache Spark SQL, DataFrames and Datasets, and the Catalyst optimizer and Tungsten execution engine.

3 - Moving to real time with Apache Spark Streaming and Structured Streaming

4 - Apache Spark in production

Unit testing, tuning, monitoring, debugging, and running Apache Spark on Mesos.

5 - Building a pipeline with Apache Spark Streaming, Apache Kafka and Elasticsearch

Training goals
  • Mastering the different concepts of Apache Spark
  • Using the Apache Spark SQL, Streaming and Structured Streaming extensions
  • Tuning Spark applications and processing data in real time with Kafka
  • Manipulating several Spark APIs in Scala
  • Deploying on Mesos
Duration: 3 days
Prerequisites
  • Prior knowledge of Java or Scala