Big Data Processing with Hadoop and Spark

This course is part of the Cloud Computing for Data Science Specialization

Instructor: Dmitriy Babichenko

Access provided by New York State Department of Labor

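3 modules

Gain insight into a topic and learn the fundamentals.

Intermediate level

Recommended experience

9 hours to complete

Flexible schedule

Learn at your own pace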

What you'll learn

Explain how Hadoop and Spark enable large-scale data processing.
Build and manage distributed data pipelines using Hadoop frameworks.
Implement in-memory analytics and real-time processing with Spark.
Apply big data tools to design scalable, data-driven applications.

Details to know

Shareable certificate

Assessments

8 assignments

Taught in English

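Build your subject-matter expertise

This course is part of the Cloud Computing for Data Science Specialization

When you enroll in this course, you'll also be enrolled in this Specialization.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate

There are 3 modules in this course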

Master the tools and techniques that power large-scale data processing and analytics. This course introduces the principles and frameworks of Big Data Processing with Hadoop and Spark, enabling learners to manage, process, and analyze massive datasets efficiently.

You’ll start by understanding the Hadoop ecosystem, including HDFS and MapReduce, and how distributed storage and computation work together to handle data at scale. Then, you’ll explore Apache Spark, a powerful framework for fast, in-memory data processing and real-time analytics. Through guided exercises and case studies, you’ll learn how to build scalable data pipelines, optimize performance, and apply transformations for business insights. By the end of this course, you’ll be equipped to handle complex data workloads using industry-standard big data tools.

Ideal for aspiring data engineers, analysts, and developers, this course bridges data management and cloud computing, preparing you to design, implement, and manage big data solutions that drive intelligent decision-making in modern organizations.

This module guides you through the core components of the Hadoop ecosystem, starting with its architecture and distributed file system. You’ll explore how Hadoop processes data, gain insight into its broader ecosystem, and apply your knowledge in hands-on activities using both Docker and a Linux virtual machine.

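What's included

6 videos, 1 reading, 3 assignments

6 videos (41 minutes total)

Overview: Hadoop (2 minutes)
Lecture 1: Introduction to Hadoop (7 minutes)
Lecture 2: HDFS Architecture (7 minutes)
Lecture 3: Yarn Architecture (7 minutes)
Lecture 4: Hadoop Ecosystem (9 minutes)
Lecture 5: Hadoop Data Processing (9 minutes)

1 reading (10 minutes total)

Course Overview (10 minutes)

3 assignments (90 minutes total)

HDFS Architecture (30 minutes)
Test Yourself: Hadoop (30 minutes)
Let's Practice: Hadoop (30 minutes)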

This module introduces you to key programming models for distributed data processing, with a focus on MapReduce and its practical applications. You'll explore core concepts and terminology and work through guided Python code walkthroughs that implement word count and server log analysis tasks. You'll also gain hands-on experience writing data transformation scripts in Apache Pig, culminating in an assignment that applies these skills to web log analysis.
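For readers who want a feel for the word count task before enrolling, the MapReduce model can be sketched in plain Python. This is an illustration only, not the course's own code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and everything runs in a single process instead of on a Hadoop cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group emitted values by key (a real framework does this step)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on in-memory input."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, chosen only for illustration.
    sample = ["big data big ideas", "data pipelines at scale"]
    print(word_count(sample))
```

On a cluster, the mapper and reducer would run on different machines and the framework would perform the shuffle between them. The same three-step shape applies to server log analysis, with a field such as a status code taking the place of the word as the key.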

What's included

6 videos, 6 readings, 3 assignments

6 videos (34 minutes total)

Overview: Parallel Programming Models (2 minutes)
Lecture 1: Programming Models (4 minutes)
Lecture 2: Programming Models Concepts and Terminology (11 minutes)
Lecture 3: MapReduce (8 minutes)
Lecture 4: MapReduce Deeper Dive (6 minutes)
Lecture 5: Apache Pig (4 minutes)

6 readings (60 minutes total)

Code Review: Introduction to MapReduce With Python (10 minutes)
Code Review: Word Count Example with MapReduce + Python (10 minutes)
Code Review: Server Log Analysis with MapReduce + Python (10 minutes)
Code Review: Server Log Analysis (Reading from File) with MapReduce + Python (10 minutes)
Activity & Code Review: Word Count with Apache Pig (10 minutes)
Activity: Working with Apache Pig (10 minutes)

3 assignments (90 minutes total)

MapReduce (30 minutes)
Test Yourself: Programming Models (30 minutes)
Let's Practice: Programming Models (30 minutes)

This module introduces you to Apache Spark, covering its core concepts, architecture, and machine learning capabilities through MLlib. You’ll learn how to set up Spark using Docker and a Linux VM, explore how PySpark operates within the Spark framework, and compare Spark MLlib with scikit-learn through hands-on code walkthroughs. By the end of the module, you'll apply what you've learned in graded activities and an assignment focused on building a predictive model with PySpark and MLlib.