Apache Spark is a cluster-computing framework created at UC Berkeley’s AMPLab in 2009 and later released as an open-source project hosted by the Apache Software Foundation. It can be up to 100 times faster than Hadoop MapReduce for in-memory workloads and offers a much simpler programming model, which can bring significant productivity gains for developers.
Apache Spark lets you build and deploy Big Data applications in batch or real-time mode, with the ability to integrate machine learning and graph analysis.
This course is a comprehensive introduction to the Spark ecosystem and will allow you to master its key features.
1 - General
Hadoop refresher, introduction to Apache Spark, its ecosystem and its internal architecture.
2 - Apache Spark and its APIs
General principles and use of RDDs (Resilient Distributed Datasets), Apache Spark SQL, DataFrames and Datasets, Catalyst and Tungsten.
3 - Moving to real time with Apache Spark Streaming and Structured Streaming
4 - Apache Spark in production
Unit testing, tuning, monitoring, debugging, Apache Spark on Mesos.
5 - Pipeline with Apache Spark Streaming, Apache Kafka and Elasticsearch
- Mastering the different concepts of Apache Spark
- Using the Apache Spark SQL, Streaming and Structured Streaming extensions
- Performing tuning and real-time processing with Kafka
- Manipulating several Spark APIs in Scala
- Deploying on Mesos
- Prior knowledge of Java or Scala