What types of data processing tasks will I be able to perform after completing the course?

You will be able to perform a variety of tasks, including data cleaning, transformation, aggregation, and analysis of large datasets using PySpark’s RDDs and DataFrames.

What technologies and frameworks are covered in the course?

You’ll learn PySpark in detail, along with its integration with Hadoop, RDDs, DataFrames, and SQL-based data processing.

Is prior knowledge in data engineering required?

No, prior experience is not required; the course introduces PySpark basics before moving to advanced use cases.

Does the course cover workflow automation and ETL?

Yes, you’ll learn how to design ETL workflows and automate big data processing with PySpark.

Can I preview a course before enrolling?

Yes, you can preview the first video and view the syllabus before you enroll. You must purchase the course to access content not included in the preview.

When will I have access to the lectures and assignments?

If you decide to enroll in the course before the session start date, you will have access to all of the lecture videos and readings for the course. You’ll be able to submit assignments once the session starts.

What will I get when I enroll?

Once you enroll and your session begins, you will have access to all videos and other resources, including reading items and the course discussion forum. You’ll be able to view and submit practice assessments, and complete required graded assignments to earn a grade and a Course Certificate.

When will I receive my Course Certificate?

If you complete the course successfully, your electronic Course Certificate will be added to your Accomplishments page. From there, you can print your Course Certificate or add it to your LinkedIn profile.

Why can’t I audit this course?

This course is currently available only to learners who have paid or received financial aid, when available.

Is financial aid available?

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.

PySpark in Action: Hands-On Data Processing

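This course is part of the PySpark for Data Science Specialization

Instructor: Edureka

5 modules

Gain insight into a topic and learn the fundamentals.

Intermediate level

2 weeks to complete at 10 hours a week

Flexible schedule

Learn at your own pace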

What you'll learn

Explore the fundamental concepts of Big Data and the components of the Hadoop ecosystem.
Explain the architecture and key principles of Apache Spark and its role in big data processing.
Utilize RDD transformations and actions to effectively process large-scale datasets with PySpark.
Execute advanced DataFrame operations, including data manipulation and aggregation techniques.

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

17 assignments

Taught in English

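Build your subject-matter expertise

This course is part of the PySpark for Data Science Specialization.

When you enroll in this course, you'll also be enrolled in this Specialization.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate

There are 5 modules in this course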

PySpark in Action: Hands-on Data Processing is a practical course that equips you to work confidently with large-scale data using PySpark and distributed data processing frameworks. You’ll discover the fundamentals of Big Data, Apache Hadoop, and Apache Spark, then build on this knowledge through real-world exercises where you’ll process and analyze massive datasets.

During the course, you’ll gain hands-on experience with:

- Foundational concepts of Big Data and components of the Hadoop ecosystem such as HDFS, enabling you to understand modern data storage and processing.
- Spark architecture and critical design principles for scalable, fault-tolerant data workflows.
- RDD transformations and actions, helping you handle large-scale datasets using PySpark’s distributed processing engine.
- Advanced DataFrame techniques: manage complex data types, perform aggregations, and solve business data challenges efficiently.
- PySpark SQL for applying advanced queries, optimizing processing workflows, and enabling rapid, reliable analysis at scale.

This course is ideal for those new to data engineering or distributed computing who want a hands-on introduction to PySpark for large-scale data tasks. If you have basic Python skills but no prior experience in data engineering, you’ll find accessible explanations and step-by-step projects throughout.

By course completion, you’ll be prepared to use PySpark in real-world projects, build and monitor data pipelines, automate processing, clean and integrate diverse datasets, and confidently tackle core challenges in distributed data analytics.

Big Data Processing with PySpark

This module introduces you to the fundamental concepts of Big Data and Hadoop. You will explore the Hadoop ecosystem, its components, and the Hadoop Distributed File System (HDFS), setting the foundation for understanding big data processing and storage solutions.

What's included

15 videos, 5 readings, 4 assignments, 1 discussion prompt

15 videos (74 minutes total)

Course Introduction (4 minutes)
What is Big Data? (4 minutes)
Applications of Big Data (5 minutes)
What is Hadoop? (5 minutes)
Hadoop Ecosystem (2 minutes)
Working of HDFS (5 minutes)
Introduction to Apache Spark (7 minutes)
Master-slave Architecture (7 minutes)
Spark Architecture (2 minutes)
Data Processing with Apache Spark (6 minutes)
Directed Acyclic Graph (DAG) (5 minutes)
Introduction to Spark Ecosystem (5 minutes)
What is PySpark? (5 minutes)
Key Features of PySpark (7 minutes)
Basics of Python (6 minutes)

5 readings (50 minutes total)

Welcome to PySpark in Action: Hands-On Data Processing (10 minutes)
What is Big Data? – A Beginner’s Guide to the World of Big Data (10 minutes)
Spark SQL (10 minutes)
Features of PySpark (10 minutes)
Module Summary: Big Data Processing with PySpark (10 minutes)

4 assignments (38 minutes total)

Knowledge Check: Big Data Processing with PySpark (20 minutes)
Practice Quiz: Big Data Essentials (6 minutes)
Practice Quiz: Apache Spark Fundamentals (6 minutes)
Practice Quiz: PySpark (6 minutes)

1 discussion prompt (10 minutes total)

Introduce Yourself (10 minutes)

Working with RDD

Dive into the core of PySpark by learning about Resilient Distributed Datasets (RDDs). This module covers the fundamentals of RDDs, how they work, and their key transformations and actions, enabling efficient distributed data processing in PySpark.

What's included

25 videos, 4 readings, 4 assignments, 3 discussion prompts

25 videos (121 minutes total)

Introduction to RDDs (6 minutes)
Working of RDDs (5 minutes)
Creating RDDs (7 minutes)
Essentials of RDD (6 minutes)
Key Concepts of RDD (6 minutes)
Understanding Lazy Evaluations (5 minutes)
Advantages of Lazy Evaluation (3 minutes)
Introduction to Transformations (5 minutes)
Narrow and Wide Transformations (6 minutes)
Transformations: Map (6 minutes)
Transformations: Filter, Reduce and groupByKey (4 minutes)
Transformations: Distinct, Sample and Join (5 minutes)
Transformations: Union and Subtract (3 minutes)
Introduction to Repartition (6 minutes)
Significance of Repartition (1 minute)
Introduction to Actions (5 minutes)
Actions: collect, reduce and reduceByKey (5 minutes)
Implementing Actions: collect, reduce and reduceByKey (3 minutes)
Actions: count, foreach and aggregate (6 minutes)
Implementing Actions: count, foreach and aggregate (3 minutes)
Actions: Coalesce, histogram and sortBy (4 minutes)
Implementing Actions: Coalesce, histogram and sortBy (3 minutes)
Working with RDD Transformations (6 minutes)
Applying Distinct, sample and join Transformations (3 minutes)
Grocery Store Data Analysis with PySpark RDDs (7 minutes)

4 readings (40 minutes total)

PySpark RDDs in Organization (10 minutes)
Managing RDD Transformations in PySpark (10 minutes)
Optimizing RDD operations in PySpark (10 minutes)
Module Summary: Working with RDD (10 minutes)

4 assignments (38 minutes total)

Knowledge Check: Working with RDD (20 minutes)
Introduction to RDD (6 minutes)
RDD Transformations (6 minutes)
RDD Actions (6 minutes)

3 discussion prompts (30 minutes total)

Introduction to RDDs (10 minutes)
Transformations: Map (10 minutes)
Actions: Coalesce, histogram, and sortBy (10 minutes)
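
The module's central idea, lazy transformations versus actions that trigger computation, can be illustrated without a cluster. What follows is a minimal plain-Python sketch, not PySpark itself. `LazyDataset` is a hypothetical stand-in class whose `map`, `filter`, `collect`, `count`, and `reduce` methods only borrow the names of the corresponding RDD operations, and the grocery rows are made up for illustration. A real RDD would run the same kind of chain in parallel across partitions.

```python
import functools
import operator


class LazyDataset:
    """Hypothetical single-machine stand-in for an RDD (illustration only)."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)  # recorded transformations, not yet executed

    # Transformations: record the step and return a new dataset; no work happens.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        # Replay every recorded step over the source data.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: these force the recorded steps to execute.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def reduce(self, fn):
        return functools.reduce(fn, self._run())


calls = []  # records each time the mapping function actually runs


def line_total(row):
    calls.append(row[0])
    name, unit_price_cents, quantity = row
    return (name, unit_price_cents * quantity)


# Made-up rows: (item, unit price in cents, quantity sold)
sales = LazyDataset([("apple", 50, 4), ("bread", 300, 1), ("milk", 120, 2)])

# Building the pipeline only records the steps.
pipeline = sales.map(line_total).filter(lambda kv: kv[1] > 200)
calls_before_action = len(calls)

# Each action replays the recorded steps from the source.
result = pipeline.collect()
revenue = pipeline.map(lambda kv: kv[1]).reduce(operator.add)
```

Here `calls_before_action` is 0 because nothing runs until an action is called, and `collect()` returns `[("bread", 300), ("milk", 240)]`. Each action replays the recorded steps from scratch, which mirrors why Spark offers caching for datasets that are reused across several actions.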

PySpark DataFrames

This module covers the creation and manipulation of DataFrames in PySpark. You will learn how to perform basic and advanced operations, including aggregation, grouping, and handling missing data, with a focus on optimizing large-scale data processing tasks.

What's included

22 videos, 4 readings, 4 assignments, 1 discussion prompt

22 videos (116 minutes total)

Overview of Data frames (7 minutes)
Introduction to DataFrames API (4 minutes)
Creating Data Frames from Different Sources (7 minutes)
Data Frames from RDD (6 minutes)
Basic DataFrame Operations (6 minutes)
Implementation of DataFrame Operations (4 minutes)
Performing Aggregations and Groupings - GroupBy and Window (6 minutes)
Performing Aggregations and Groupings - Cube and Rollup (4 minutes)
Handling Missing Data - Managing Null Values (7 minutes)
Demonstration for Handling Missing Data (4 minutes)
Working with Complex Data Types - Arrays and Structs (7 minutes)
Demonstration for Working with Complex Data Types (3 minutes)
Advanced DataFrame Transformations and Actions (7 minutes)
Demonstration: Working with DataFrames (7 minutes)
Introduction to Data Visualization and Key Aspects (5 minutes)
Introduction to Data Visualization - General Visuals (4 minutes)
Libraries for Data Visualization - Matplotlib and Seaborn (4 minutes)
Libraries for Data Visualization - Plotly (4 minutes)
Implementing Data Visualization (6 minutes)
Implementing Data Visualization - Plotting Charts (6 minutes)
Customizing the Visualizations (4 minutes)
Customizing Charts and Visuals (6 minutes)

4 readings (40 minutes total)

Importance of PySpark DataFrames (10 minutes)
Window Functions in PySpark (10 minutes)
Data Visualization Libraries in PySpark (10 minutes)
Module Summary: PySpark DataFrames (10 minutes)

4 assignments (38 minutes total)

Knowledge Check: PySpark Dataframes (20 minutes)
Introduction to PySpark DataFrames (6 minutes)
Advanced DataFrame Operations (6 minutes)
Data Visualizations with PySpark DataFrames (6 minutes)

1 discussion prompt (5 minutes total)

PySpark DataFrames and Traditional Pandas DataFrames (5 minutes)

PySpark SQL

In this module, you will explore the SQL capabilities of PySpark. Learn how to perform CRUD operations, execute SQL commands, and merge and aggregate data using PySpark SQL. You'll also discover best practices for using SQL with PySpark to enhance data workflows.

What's included

28 videos, 4 readings, 4 assignments, 2 discussion prompts

28 videos (135 minutes total)

Structured Data vs. Unstructured Data (5 minutes)
Characteristic of Structured Data (5 minutes)
Relational Database and its Components (7 minutes)
SQL in Relation with Relational Database (6 minutes)
Normalization and its Types (6 minutes)
Exploring Different Types of Normalization (4 minutes)
Data Querying and Filtering Logic (6 minutes)
DDL Commands - Creating Tables (5 minutes)
DDL Commands - Altering and Truncating Tables (4 minutes)
DQL Commands - Select Statement and Where Clause (4 minutes)
DQL Commands - Practical Implementation (4 minutes)
DML Commands - Insert, Update, and Delete (4 minutes)
DML Commands - Lock (4 minutes)
DCL Commands (7 minutes)
TCL Commands (6 minutes)
Alter - Altering a Table and Constraints (5 minutes)
Alter - Altering Indexes and Views (3 minutes)
Performing CRUD Operations (6 minutes)
Operations on PySpark SQL DataFrames (4 minutes)
Performing Operations on PySpark SQL DataFrames (7 minutes)
Data Merging and Aggregation using PySpark SQL (5 minutes)
Implementing Data Merging and Aggregation using PySpark SQL (4 minutes)
SQL Best Practices (6 minutes)
Data Integrity and Error Handling with PySpark (3 minutes)
Problem Statement: Ecommerce Organization (4 minutes)
Data Analysis of an E-commerce Organization (4 minutes)
Demonstration: Spark SQL - Retail Organization (4 minutes)
Demonstration: Analyzing the Data (4 minutes)

4 readings (34 minutes total)

Best Practices for Data Querying: Optimizing SQL Performance (8 minutes)
User-Defined Functions (UDFs) in PySpark (8 minutes)
Best Practices for Using SQL with PySpark (8 minutes)
Module Summary: PySpark SQL (10 minutes)

4 assignments (38 minutes total)

Knowledge Check: PySpark SQL (20 minutes)
Introduction to SQL (6 minutes)
SQL Commands (6 minutes)
Working with PySpark SQL (6 minutes)

2 discussion prompts (10 minutes total)

Why Normalization is Crucial for Database Design? (5 minutes)
Importance of Aggregate Functions (5 minutes)
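
The SQL command families listed in this module (DDL, DML, TCL, DQL) and the merge-then-aggregate pattern can be tried on a single machine before moving to Spark. The sketch below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in SQL engine rather than PySpark SQL, and the `customers` and `orders` tables and their rows are hypothetical. In PySpark, comparable queries are typically passed to `spark.sql(...)` against registered views, and the Spark SQL dialect differs from SQLite in places.

```python
import sqlite3

# In-memory database used as a stand-in SQL engine (not Spark).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the tables.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
cur.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customers(id), "
    "amount INTEGER NOT NULL)"
)

# DML: insert rows, then update one of them.
cur.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Asha"), (2, "Ben")],
)
cur.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 40), (1, 60), (2, 25)],
)
cur.execute("UPDATE orders SET amount = 30 WHERE customer_id = 2")

# TCL: make the changes permanent.
conn.commit()

# DQL: merge the two tables with a join, then aggregate per customer.
rows = cur.execute(
    """
    SELECT c.name, COUNT(o.id) AS order_count, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
    """
).fetchall()

conn.close()
```

After the query runs, `rows` holds one tuple per customer: `("Asha", 2, 100)` followed by `("Ben", 1, 30)`.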

Course Wrap Up and Assessment

This module assesses how well you understand the concepts covered in this course. You will complete a project based on these PySpark concepts and a comprehensive quiz that measures your proficiency in data processing with PySpark.

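What's included

1 video, 1 reading, 1 assignment, 1 discussion prompt

Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructor

Instructor ratings

(5 ratings)

Edureka

176 courses, 157,508 learners

Offered by

Edureka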

Frequently asked questions

What tools or software do I need to complete the course?

You will need access to a computer with Python and Apache Spark installed. Detailed setup instructions will be provided at the beginning of the course.

Is this course suitable for beginners in data processing?

This course is designed for individuals new to big data and PySpark, providing a solid foundation to start working with distributed data processing.

Is knowledge of SQL required for this course?

While prior SQL knowledge is beneficial, it is not mandatory. The course will introduce SQL concepts as they relate to PySpark and provide practice with SQL queries.