Spark for Java Developers

Big Data with Java Lambdas!

The course is 6 hours long, equivalent to a 3-day training course.

  • Get started with the amazing Apache Spark parallel computing framework - this course is designed especially for Java Developers
  • All of the fundamentals you need to understand the main operations you can perform in Spark Core.
  • Deploy to a live EMR hardware cluster.
  • Understand the internals of Spark and how it optimizes your execution plans.
  • Get some great practice with Java 8 Lambdas - our most "functional" course to date!
  • There will be a follow-on module covering SparkSQL later in the year.
You'll need to be familiar with Java. We'll be using Lambdas throughout, but this course is a good introduction to them if you're not familiar with them already.

Contents

Having problems? Check the errata.

Introduction 16m 56s

A brief overview of Spark and some of the jargon terms you'll be encountering.

Getting Started 21m 35s

Let's get Spark "installed" - it's just a Maven dependency.

Reduces 14m 19s

Reduces are fundamental operations. Here we'll do a very basic reduce to establish the idea.
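
As a rough illustration of the idea, the sketch below uses plain Java streams rather than the Spark API, so it runs without any dependency. A reduce collapses a whole collection to a single value by repeatedly applying a two-argument function to pairs of values. The class and method names are hypothetical, and the input numbers are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a reduce expressed with Java streams, not with Spark.
class ReduceSketch {

    // Folds every value into one result using a two-argument function.
    static int sum(List<Integer> values) {
        return values.stream().reduce(0, (a, b) -> a + b);
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(35, 12, 90, 20);
        System.out.println(sum(input)); // prints 157
    }
}
```

The function passed to a reduce should give the same answer regardless of how the values are grouped (addition does), because a parallel framework is free to combine partial results in any order.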

Update - problems with NotSerializableExceptions? 6m 28s

If you experience a NotSerializableException in the next chapter on "Mapping" (or in any later chapter), it is because your CPU architecture is sophisticated enough for Spark to treat each CPU as a node in a cluster, and this causes a crash with System.out.println. See this video for a simple workaround.
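
One well-known way to hit this kind of exception in plain Java is sketched below. It is an assumption that this matches the case in the video, and the workaround shown there may differ. A function that has to be shipped to another worker must be serializable. The method reference System.out::println captures the System.out PrintStream, which is not serializable, while an equivalent lambda captures nothing. The SerializableConsumer interface and the class name are hypothetical stand-ins, not Spark types.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Consumer;

// Hypothetical sketch using only the JDK; no Spark classes are involved.
class SerializationSketch {

    // Stand-in for a function type that must be serializable to be sent to a worker.
    interface SerializableConsumer<T> extends Consumer<T>, Serializable {}

    // Returns true if Java serialization accepts the object.
    static boolean canSerialize(Object candidate) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Bound method reference: captures the System.out PrintStream instance.
    static SerializableConsumer<String> methodReference() {
        return System.out::println;
    }

    // Lambda: looks up System.out only when it runs, so nothing is captured.
    static SerializableConsumer<String> lambda() {
        return value -> System.out.println(value);
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(methodReference())); // prints false
        System.out.println(canSerialize(lambda()));          // prints true
    }
}
```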

Mapping 17m 45s

Mapping allows you to transform the RDD from one form to another.
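
To make that concrete without a Spark dependency, here is a minimal sketch with Java streams, whose map operation follows the same idea: each input element is passed through a function and the results form a new collection, possibly of a different type. The class name, method name, and sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: a map expressed with Java streams, not with Spark.
class MapSketch {

    // Transforms a list of integers into a list of their square roots.
    static List<Double> squareRoots(List<Integer> values) {
        return values.stream()
                .map(value -> Math.sqrt(value))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(squareRoots(Arrays.asList(4, 9, 16))); // prints [2.0, 3.0, 4.0]
    }
}
```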

Tuples 18m 12s

Commonly used in Scala, Tuples appear everywhere in the Spark Core API. We can use them in Java, but they are a bit awkward.

PairRDDs 41m 30s

A PairRDD is a key/value representation of a dataset.
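
The key/value idea can be sketched in plain Java, again with streams instead of Spark so that the example is self-contained. Each input line is turned into a (key, 1) pair, and values that share a key are then merged with a two-argument function, which is the general shape of a by-key reduce. The log-line format, class name, and method names are assumptions made for the illustration.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: key/value pairs merged by key, using only the JDK.
class PairSketch {

    // Turns "LEVEL: message" into the pair (LEVEL, 1).
    static Map.Entry<String, Integer> toPair(String line) {
        String level = line.split(":", 2)[0];
        return new AbstractMap.SimpleEntry<>(level, 1);
    }

    // Counts lines per level by summing the 1s that share the same key.
    static Map<String, Integer> countByLevel(List<String> lines) {
        return lines.stream()
                .map(PairSketch::toPair)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Integer::sum));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "WARN: disk nearly full",
                "ERROR: login failed",
                "WARN: slow response");
        System.out.println(countByLevel(lines));
    }
}
```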

FlatMap and Filtering 14m 46s

FlatMap looks complicated, but it's a simple transformation. We'll also see how to filter.
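
A minimal sketch of both ideas, using Java streams as a dependency-free stand-in for Spark: a flat map lets one input element produce zero or more output elements (here, one sentence becomes several words), and a filter keeps only the elements that pass a test. The class name, method name, and the "longer than one character" rule are hypothetical choices for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: flatMap and filter with Java streams, not with Spark.
class FlatMapFilterSketch {

    // Splits each sentence into words, then drops single-character words.
    static List<String> words(List<String> sentences) {
        return sentences.stream()
                .flatMap(sentence -> Arrays.stream(sentence.split(" ")))
                .filter(word -> word.length() > 1)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("a big data course", "spark is fast");
        System.out.println(words(sentences)); // prints [big, data, course, spark, is, fast]
    }
}
```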

Reading Files 13m 26s

We can read local files, or read from big data file systems such as S3 or HDFS.

Keyword Ranking 41m 47s

A major exercise: we'll automatically generate keywords for training courses based on their subtitle files.

Sorts and Coalesces 28m 44s

There are some common misunderstandings about sorts, and we'll address them here. We'll also look at what Coalesce is used for (and when it shouldn't be used).

Deploying to EMR 40m 42s

We'll now deploy to a live cluster. Spark can deploy to Hadoop YARN clusters, or you can build a standalone cluster. Here we use Amazon EMR. Even if you're not using EMR, do watch this chapter, as there is a lot to learn from running on real hardware.

Joins 27m 27s

One last transformation type on the course - how to do Inner, Outer, Full and Cartesian Joins.

Big Data Big Exercise 51m 35s

A chance for you to practice everything - a real "course ranking" process we run here at VirtualPairProgrammers.

Performance 80m 8s

A deeper look into the internals of Spark.

Copyright ©2021 VirtualPairProgrammers.com