
Spark for Java Developers

Big Data with Java Lambdas!
  • Get started with the amazing Apache Spark parallel computing framework - this course is designed especially for Java Developers.
  • All of the fundamentals you need to understand the main operations you can perform in Spark Core.
  • Deploy to a live EMR hardware cluster.
  • Understand the internals of Spark and how it optimizes your execution plans.
  • Get some great practice with Java 8 Lambdas - our most "functional" course to date!
  • There will be a follow-on module covering SparkSQL later in the year.

Prerequisites

You'll need to be familiar with Java. We'll be using lambdas throughout, but if you're not already familiar with them, this course is a good introduction.

Contents - The course is 6 hours long and is equivalent to a 3-day training course.

 

Having problems? Check the errata for this course.

1. Introduction (16m 56s)
A brief overview of Spark and some of the jargon terms you'll be encountering.

2. Getting Started (21m 35s)
Let's get Spark "installed" - it's just a Maven dependency.
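
To give a flavour of how little setup is involved (a rough sketch - the Spark version and class name below are illustrative assumptions, not taken from the course), you add the spark-core dependency to your pom.xml and start a local context in a few lines of Java:

    // pom.xml (the version shown is an assumption - use whichever Spark release you're following):
    // <dependency>
    //     <groupId>org.apache.spark</groupId>
    //     <artifactId>spark-core_2.11</artifactId>
    //     <version>2.0.0</version>
    // </dependency>

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class GettingStarted {
        public static void main(String[] args) {
            // "local[*]" runs Spark inside this JVM, with one worker thread per CPU core
            SparkConf conf = new SparkConf().setAppName("startingSpark").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);
            System.out.println("Spark context started");
            sc.close();
        }
    }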

3. Reduces (14m 19s)
Reduces are a fundamental operation. Here we'll do a very basic reduce to establish the idea.
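
As a taster (a minimal sketch with made-up data, not the course's worked example), a reduce collapses an RDD down to a single value by combining elements two at a time:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ReduceSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("reduce").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<Double> rdd = sc.parallelize(Arrays.asList(35.5, 12.4, 90.3));

                // reduce keeps combining pairs of elements until a single value remains
                Double total = rdd.reduce((a, b) -> a + b);
                System.out.println("Total: " + total);
            }
        }
    }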

Update - problems with NotSerializableExceptions? (6m 28s)
If you hit a NotSerializableException in the next chapter on "Mapping" (or in any later chapter), it's because on modern multi-core machines Spark treats each CPU core as if it were a node in a cluster - which means the functions you pass to Spark must be serializable, and this causes a crash when printing with System.out.println. See this video for a simple workaround.
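
The video has the full explanation; as an illustration only (this may not be the exact fix shown in the video), one common workaround is to collect the results back to the driver and print them there, rather than printing inside the Spark operation:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class PrintingWorkaround {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("printing").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> rdd = sc.parallelize(Arrays.asList("a", "b", "c"));

                // rdd.foreach(System.out::println) ships the function to the workers, so it
                // must be serializable - System.out is a PrintStream, which is not, and on
                // multi-core machines this can throw a NotSerializableException.

                // Workaround: collect the (small!) RDD back to the driver and print locally.
                rdd.collect().forEach(System.out::println);
            }
        }
    }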

4. Mapping (17m 45s)
Mapping allows you to transform an RDD from one form to another.
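
For example (a minimal sketch with made-up data), map builds a new RDD by applying a function to every element:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class MapSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("mapping").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(9, 16, 25));

                // map transforms a JavaRDD<Integer> into a JavaRDD<Double>
                JavaRDD<Double> squareRoots = numbers.map(n -> Math.sqrt(n));

                squareRoots.collect().forEach(System.out::println);
            }
        }
    }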

5. Tuples (18m 12s)
Commonly used in Scala, Tuples appear everywhere in the Spark Core API. We can use them in Java, but they are a bit awkward.
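
The awkwardness is mostly cosmetic - in Java a Spark tuple is just a scala.Tuple2 object (a small sketch; the values are made up for illustration):

    import scala.Tuple2;

    public class TupleSketch {
        public static void main(String[] args) {
            // scala.Tuple2 is already on the classpath once you depend on spark-core
            Tuple2<String, Integer> courseAndHours = new Tuple2<>("Spark for Java Developers", 6);

            // The slightly awkward bit: the accessors are called _1() and _2()
            System.out.println(courseAndHours._1() + " runs for " + courseAndHours._2() + " hours");
        }
    }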

6. PairRDDs (41m 30s)
A PairRDD is a key/value representation of a dataset.
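
A sketch of the idea (the log-message data and class name are invented for illustration): mapToPair turns each element into a key/value Tuple2, and the resulting JavaPairRDD gains per-key operations such as reduceByKey:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class PairRddSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("pairRdd").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> logLines = sc.parallelize(Arrays.asList(
                        "WARN: disk space low", "ERROR: timeout", "WARN: retrying"));

                // key = log level, value = 1, then add up the values per key
                JavaPairRDD<String, Long> pairs =
                        logLines.mapToPair(line -> new Tuple2<>(line.split(":")[0], 1L));
                JavaPairRDD<String, Long> countsByLevel = pairs.reduceByKey((a, b) -> a + b);

                countsByLevel.collect()
                        .forEach(t -> System.out.println(t._1() + " -> " + t._2()));
            }
        }
    }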

7. FlatMap and Filtering (14m 46s)
FlatMap looks complicated, but it's a simple transformation. We'll also see how to filter.
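
A quick sketch (invented data; note the iterator - this is the Spark 2.x flatMap signature):

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class FlatMapFilterSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("flatMap").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> sentences = sc.parallelize(Arrays.asList(
                        "WARN: Tuesday 4 September", "ERROR: Tuesday 4 September"));

                // flatMap: each input element can produce zero, one or many output elements
                JavaRDD<String> words = sentences.flatMap(
                        sentence -> Arrays.asList(sentence.split(" ")).iterator());

                // filter: keep only the elements that match a predicate
                JavaRDD<String> longWords = words.filter(word -> word.length() > 1);

                longWords.collect().forEach(System.out::println);
            }
        }
    }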

8. Reading Files (13m 26s)
We can read local files, or read from big data file systems such as S3 and HDFS.
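
The same textFile call covers all of these - only the URL scheme changes (a sketch; the paths and bucket name below are placeholders, not real course files):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ReadingFilesSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("readFiles").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // textFile gives an RDD with one element per line of the file
                JavaRDD<String> lines = sc.textFile("src/main/resources/subtitles/input.txt");
                // JavaRDD<String> fromS3   = sc.textFile("s3n://my-bucket/input.txt");
                // JavaRDD<String> fromHdfs = sc.textFile("hdfs:///data/input.txt");

                System.out.println("Line count: " + lines.count());
            }
        }
    }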

9. Keyword Ranking (41m 47s)
A major exercise: we'll automatically generate keywords for training courses based on their subtitle files.

10. Sorts and Coalesces (28m 44s)
There are some common misunderstandings around sorting, and we'll address them here. We'll also look at what coalesce is used for (and when it shouldn't be used).
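
As a rough sketch of both operations (invented data; the comments summarise the usual caveats rather than the course's exact wording):

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SortCoalesceSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("sorts").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                List<Tuple2<Long, String>> data = Arrays.asList(
                        new Tuple2<>(10L, "java"), new Tuple2<>(25L, "spark"), new Tuple2<>(7L, "emr"));
                JavaPairRDD<Long, String> scores = sc.parallelizePairs(data);

                // sortByKey(false) = descending; but partitions are processed in parallel,
                // so a plain foreach can still print the results out of order
                JavaPairRDD<Long, String> sorted = scores.sortByKey(false);

                // coalesce reduces the number of partitions (here to 1) without a full
                // shuffle - useful for small final output, not a general-purpose fix
                sorted.coalesce(1).collect()
                        .forEach(t -> System.out.println(t._1() + " " + t._2()));
            }
        }
    }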

11. Deploying to EMR (40m 42s)
We'll now deploy to a live cluster. Spark can deploy to Hadoop Yarn clusters or you can build a standalone cluster. Here we use Amazon EMR. Even if you're not using EMR, do watch this chapter as there is a lot to learn from running on real hardware.

12. Joins (27m 27s)
One last transformation type on the course - how to do Inner, Outer, Full and Cartesian Joins.
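
A small sketch of the inner-join case (the users/visits data is invented; the other variants are listed in the comment):

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class JoinSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("joins").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                List<Tuple2<Integer, String>> userData = Arrays.asList(
                        new Tuple2<>(1, "alice"), new Tuple2<>(2, "bob"));
                List<Tuple2<Integer, Integer>> visitData = Arrays.asList(
                        new Tuple2<>(1, 6), new Tuple2<>(3, 4));

                JavaPairRDD<Integer, String> users = sc.parallelizePairs(userData);
                JavaPairRDD<Integer, Integer> visits = sc.parallelizePairs(visitData);

                // inner join: only keys present in both RDDs survive
                users.join(visits).collect()
                        .forEach(t -> System.out.println(t._1() + " -> " + t._2()));

                // leftOuterJoin, rightOuterJoin, fullOuterJoin and cartesian also exist
            }
        }
    }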

13. Big Data Big Exercise (51m 35s)
A chance for you to practice everything - a real "course ranking" process we run here at VirtualPairProgrammers.

14. Performance (80m 8s)
A deeper look into the internals of Spark.

