Spark Module 2 SparkSQL and DataFrames

featuring SQL and DataFrames.

This course is around 5 hours long.

  • The second module in the Spark series explores the SparkSQL and DataFrames APIs. These let us concentrate on the data science and work at a much higher level of abstraction, using SQL-style syntax instead of working directly with RDDs.
  • This course is designed for all Java developers who want to explore Spark. No previous data science experience is assumed, so every concept is explained in detail.
  • Previous knowledge of RDDs in Spark is assumed; Module 1 in the series covers this.

Contents

Introduction 6m 29s

What do DataFrames and SparkSQL offer compared to SparkCore (RDDs)?

Getting Started 20m 10s

We'll read in a Dataset (DataFrame) to get started.

Working with DataSets 29m 3s

For our first real task with SparkSQL, we'll see how to do filters.

Full SQL Syntax 13m 45s

How to query Spark using the full SQL syntax

In Memory Data 15m 4s

In Module 1 we used parallelize to work with in-memory data, which is useful for unit tests. This is how to do the same using DataFrames.

Grouping and Aggregating 12m 59s

Understanding the Group By clause in SparkSQL

Date Formatting 6m 30s

How to use the date_format function in SparkSQL

Multiple Groupings 13m 59s

More than one group by column?

Ordering 16m 36s

How to use the order by clause

DataFrames API 28m 4s

We've concentrated on the SQL syntax so far, but we can also use a Java API to do everything (and more) that SQL can.

Pivot Tables 21m 21s

In DataFrames, we can produce Pivot Tables as with spreadsheets and databases. But for Big Data!

General Aggregations 18m 49s

The agg method is the most flexible aggregating function, so we'll see how to use it.

Practical Session 8m 12s

A short exercise

User Defined Functions 23m 55s

How to use lambdas to add your own functions to the SQL syntax and DataFrame API

Performance 25m 56s

Using the SparkUI to analyse tasks. We ask the question: is the SQL syntax slower than the DataFrame API? Answers will follow in the next video...

HashAggregation 39m 21s

Spark has two strategies for grouping - HashAggregation is extremely efficient but can only be used in restricted circumstances. Find out how to make sure HashAggregation is used instead of the (usually) slower SortAggregate routine.

SparkSQL vs SparkRDD 6m 55s

Which performs "better"?

Update - Tuning the spark.sql.shuffle.partitions Property 8m 18s

An update: by default Spark SQL uses a large number of partitions (200) when shuffling, such as when grouping, and this can kill performance on small jobs. This is how to fix the problem.

Module Summary 2m 24s

Coming up later in 2018 is a module on SparkML.

Copyright ©2021 VirtualPairProgrammers.com