Course Overview
This four-day hands-on training course delivers the key concepts and skills developers need to use Apache Spark to build high-performance, parallel applications on the Cloudera Data Platform (CDP).
Hands-on exercises allow students to practice writing Spark applications that integrate with CDP core components, such as Hive and Kafka. Participants will learn how to use Spark SQL to query structured data, how to use Spark Streaming to perform real-time processing on streaming data, and how to work with big data stored in a distributed file system.
After taking this course, participants will be prepared to face real-world challenges and build applications that enable faster, better decisions and interactive analysis, applied to a wide variety of use cases, architectures, and industries.
What You'll Learn
- Distribute, store, and process data in a CDP cluster
- Write, configure, and deploy Apache Spark applications
- Use the Spark interpreters and Spark applications to explore, process, and analyze distributed data
- Query data using Spark SQL, DataFrames, and Hive tables
- Use Spark Streaming together with Kafka to process a data stream
Outline
- Shuffle
- Skew
- Order
- DataFrame and Dataset Persistence
- Persistence Storage Levels
- Viewing Persisted RDDs
- Exercise: Persisting DataFrames
- Writing a Spark Application
- Building and Running an Application
- Application Deployment Mode
- The Spark Application Web UI
- Configuring Application Properties
- Exercise: Writing, Configuring, and Running a Spark Application
- Introduction to Structured Streaming
- Exercise: Processing Streaming Data
- What is Apache Kafka?
- Apache Kafka Overview
- Scaling Apache Kafka
- Apache Kafka Cluster Architecture
- Apache Kafka Command Line Tools
- Receiving Kafka Messages
- Sending Kafka Messages
- Exercise: Working with Kafka Streaming Messages
- Streaming Aggregation
- Joining Streaming DataFrames
- Exercise: Aggregating and Joining Streaming DataFrames
- Working with Datasets in Scala
- Exercise: Using Datasets in Scala
- Why Notebooks?
- Zeppelin Notes
- Demo: Apache Spark In 5 Minutes
- HDFS Overview
- HDFS Components and Interactions
- Additional HDFS Interactions
- Ozone Overview
- Exercise: Working with HDFS
- YARN Overview
- YARN Components and Interaction
- Working with YARN
- Exercise: Working with YARN
- The Disk Years: 2000 -> 2010
- The Memory Years: 2010 -> 2020
- The GPU Years: 2020 ->
- Introduction to DataFrames
- Exercise: Introducing DataFrames
- Exercise: Reading and Writing DataFrames
- Exercise: Working with Columns
- Exercise: Working with Complex Types
- Exercise: Combining and Splitting DataFrames
- Exercise: Summarizing and Grouping DataFrames
- Exercise: Working with UDFs
- Exercise: Working with Windows
- About Hive
- Hive and Spark Integration
- Exercise: Spark Integration with Hive
- Introduction to Data Visualization with Zeppelin
- Zeppelin Analytics
- Zeppelin Collaboration
- Exercise: AdventureWorks
Prerequisites
Who Should Attend
This course is designed for developers and data engineers. All students are expected to have basic Linux experience and basic proficiency in either the Python or Scala programming language. Basic knowledge of SQL is helpful. Prior knowledge of Spark and Hadoop is not required.