Spark for Java Developers

Big Data with Java Lambdas!

The course is 6 hours long, equivalent to a 3-day training course.

Get started with the amazing Apache Spark parallel computing framework - this course is designed especially for Java developers.
Learn all of the fundamentals you need to understand the main operations you can perform in Spark Core.
Deploy to a live EMR hardware cluster.
Understand the internals of Spark and how it optimizes your execution plans.
Get some great practice with Java 8 Lambdas - our most "functional" course to date!
There will be a follow-on module covering SparkSQL later in the year.

You'll need to be familiar with Java. We'll be using lambdas throughout, but this course is a good introduction to them if you're not familiar with them already.
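
For readers who have not met lambdas yet, here is a minimal sketch of the two shapes that come up most often: a one-argument lambda that maps each element, and a two-argument lambda that reduces many elements to one. It uses plain java.util.stream rather than the Spark API so that it runs without any Spark dependency; the class and method names are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LambdaSketch {

    // Map: transform each element with a one-argument lambda (here, square root).
    static List<Double> squareRoots(List<Integer> values) {
        return values.stream()
                .map(v -> Math.sqrt(v))
                .collect(Collectors.toList());
    }

    // Reduce: combine all elements into one result with a two-argument lambda.
    static double sum(List<Double> values) {
        return values.stream().reduce(0.0, (a, b) -> a + b);
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(4, 9, 16);
        List<Double> roots = squareRoots(input);
        System.out.println(roots);
        System.out.println(sum(roots));
    }
}
```

The lambdas passed to Spark's own map and reduce operations have the same general shape, although there they run against an RDD rather than a local stream.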

Having problems? Check the errata.

Introduction (16m 56s): A brief overview of Spark and some of the jargon you'll be encountering.
Getting Started (21m 35s): Let's get Spark "installed" - it's just a Maven dependency.
Reduces (14m 19s): Reduces are fundamental transformations. Here we'll do a very basic reduce to establish the idea.
Update - problems with NotSerializableExceptions? (6m 28s): If, in the next chapter on "Mapping" (or any later chapter), you get a NotSerializableException, it is because your machine has enough CPUs for Spark to treat each one as a node in a cluster, and this causes a crash when System.out.println is used. See this video for a simple workaround.
Mapping (17m 45s): Mapping allows you to transform the RDD from one form to another.
Tuples (18m 12s): Commonly used in Scala, Tuples appear everywhere in the Spark Core API. We can use them in Java, but they are a bit awkward.
PairRDDs (41m 30s): A PairRDD is a key/value representation of a dataset.
FlatMap and Filtering (14m 46s): FlatMap looks complicated, but it's a simple transformation. We'll also see how to filter.
Reading Files (13m 26s): We can read local files, or read from big data file systems such as S3 or HDFS.
Keyword Ranking (41m 47s): A major exercise: we'll automatically generate keywords for training courses based on their subtitle files.
Sorts and Coalesces (28m 44s): There are some common misunderstandings about sorts, and we'll address them here. We'll also look at what Coalesce is used for (and when it shouldn't be used).
Deploying to EMR (40m 42s): We'll now deploy to a live cluster. Spark can deploy to Hadoop YARN clusters, or you can build a standalone cluster. Here we use Amazon EMR. Even if you're not using EMR, do watch this chapter, as there is a lot to learn from running on real hardware.
Joins (27m 27s): One last transformation type on the course - how to do Inner, Outer, Full and Cartesian Joins.
Big Data Big Exercise (51m 35s): A chance for you to practice everything - a real "course ranking" process we run here at VirtualPairProgrammers.
Performance (80m 8s): A deeper look into the internals of Spark.
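
Several of the chapters above (FlatMap and Filtering, PairRDDs, Sorts) describe transformations that are easier to picture with a small example. The sketch below is not the course's code and does not use Spark: it uses plain java.util.stream to split lines into words (a flat map), drop uninteresting words (a filter), count occurrences per word (a key/value view of the data), and sort by count. The class name, method name, and stop list are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class KeywordSketch {

    // Hypothetical stop list, for illustration only.
    static final List<String> BORING = Arrays.asList("the", "a", "is", "to");

    static List<Map.Entry<String, Long>> rankWords(List<String> lines) {
        Map<String, Long> counts = lines.stream()
                // flatMap: each line becomes zero or more words
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                // filter: keep only non-empty words that are not on the stop list
                .filter(word -> !word.isEmpty() && !BORING.contains(word))
                // key/value view: word -> number of occurrences
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        // Sort by count, highest first; ties are broken alphabetically by word.
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "the spark is a spark",
                "java to spark java docker");
        rankWords(lines).forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
    }
}
```

In Spark's Java API the rough counterparts would be flatMap and filter on a JavaRDD, then mapToPair, reduceByKey and sortByKey on a JavaPairRDD. There the work is distributed across partitions instead of running in a single local stream.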
