Data Processing, Exploratory Analysis and Visualization

This course is part of the Microsoft Big Data Management and Analytics Professional Certificate

Instructor: Microsoft

5 modules

Gain insight into a topic and learn the fundamentals.

3 weeks to complete, at 10 hours a week

Flexible schedule

Learn at your own pace

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

47 assignments¹

AI graded; see disclaimer

Taught in English

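Build your Data Analysis expertise

This course is part of the Microsoft Big Data Management and Analytics Professional Certificate

When you enroll in this course, you'll also be enrolled in this Professional Certificate.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate from Microsoft

There are 5 modules in this course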

This course introduces distributed computing frameworks and big data visualization techniques. Learners will explore MapReduce, work with Apache Spark, implement transformations with PySpark, and use Spark SQL for large-scale analysis. The course concludes with building compelling dashboards and reports using Power BI for actionable business insights.

By the end of this course, you will be able to:

Explain distributed computing and MapReduce concepts
Process large datasets using Apache Spark and PySpark
Apply Spark SQL for advanced queries and transformations
Create dashboards and visualizations using Power BI

Tools & Software: Apache Spark, PySpark, Azure Databricks, Power BI

Skills: Distributed computing, Data analysis, PySpark, Spark SQL, Data visualization

Distributed Computing and MapReduce Concepts explores the foundational principles that enable modern organizations to process massive datasets that have outgrown the limits of single-machine computing. Through real-world examples, visual walkthroughs, hands-on labs, and guided design activities, you'll examine how data is broken into parallel tasks and executed across clusters of machines, how the Map, shuffle, and Reduce phases work together, and how common MapReduce patterns—such as counting, filtering, joining, and aggregation—solve practical big data problems efficiently and at scale.
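
As a rough illustration of how the Map, shuffle, and Reduce phases fit together, the counting pattern can be sketched in plain Python on a single machine. This is not course material: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are assumptions made for this sketch, and a real framework would run the map and reduce steps in parallel across a cluster rather than in one process.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    # Each line is mapped independently, which is what makes the map
    # step easy to spread across machines in a real cluster.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, used only for illustration.
sample_lines = ["big data big clusters", "data moves to clusters"]
counts = word_count(sample_lines)
print(counts)
```

Filtering and aggregation follow the same shape: only the map and reduce functions change, while the shuffle step stays a group-by-key.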

What's included

6 videos, 3 readings, 8 assignments

6 videos, 36 minutes total

The Scale Challenge in Modern Computing (4 minutes)
Visualizing Distributed Processing Workflows (7 minutes)
Simplifying Complex Problems with MapReduce (5 minutes)
Tracing MapReduce Execution Flow (8 minutes)
MapReduce Patterns in Production Systems (5 minutes)
Implementing MapReduce Patterns (8 minutes)

3 readings, 90 minutes total

Distributed Computing Principles for Big Data (30 minutes)
MapReduce Programming Model Deep Dive (30 minutes)
Essential MapReduce Patterns and Algorithms (30 minutes)

8 assignments, 240 minutes total

Distributed Computing and MapReduce Mastery (30 minutes)
Distributed Computing Analysis (30 minutes)
Distributed Computing Concepts (30 minutes)
MapReduce Algorithm Design (30 minutes)
MapReduce Execution Tracing (30 minutes)
MapReduce Programming Model (30 minutes)
MapReduce Solution Design (30 minutes)
MapReduce Patterns and Applications (30 minutes)

Apache Spark Architecture and Fundamentals provides a comprehensive introduction to the distributed processing engine that revolutionized big data analytics by overcoming traditional MapReduce limitations. Through real-world examples, visual walkthroughs, hands-on labs, and guided design activities, you'll examine Spark's core components, including the driver, executors, and cluster manager, explore how in-memory processing delivers dramatic performance improvements, and learn to configure and manage Spark clusters and applications for efficient large-scale data processing.

What's included

7 videos, 3 readings, 9 assignments

7 videos, 40 minutes total

Spark's Revolution in Big Data Processing (4 minutes)
Spark Cluster Setup and Configuration - Part 1 (6 minutes)
Spark Cluster Setup and Configuration - Part 2 (5 minutes)
Inside Spark's Intelligent Execution Engine (5 minutes)
Analyzing Spark Execution Plans (8 minutes)
RDDs: The Foundation of Spark's Power (5 minutes)
Hands-on RDD Programming (7 minutes)

3 readings, 90 minutes total

Apache Spark Architecture Deep Dive (30 minutes)
Spark Execution Model and Optimization (30 minutes)
RDD Programming Model and Operations (30 minutes)

9 assignments, 270 minutes total

Apache Spark Fundamentals Mastery (30 minutes)
Spark Environment Setup (30 minutes)
Spark Architecture Components (30 minutes)
Spark Job Analysis (30 minutes)
Spark Execution Analysis (30 minutes)
Spark Execution Optimization (30 minutes)
RDD Operations and Lineage (30 minutes)
RDD Programming Practice (30 minutes)
RDD Programming Concepts (30 minutes)

Data Processing with PySpark RDDs and DataFrames focuses on practical data processing using PySpark's Python API for Apache Spark. Through real-world examples, visual walkthroughs, hands-on labs, and guided design activities, you'll implement data processing operations using both RDDs and DataFrames, develop transformation pipelines, apply common data cleaning and preparation techniques, and optimize PySpark code for better performance across enterprise-scale big data scenarios.

What's included

6 videos, 3 readings, 10 assignments

6 videos, 37 minutes total

Python Meets Big Data with PySpark (4 minutes)
PySpark Development Workflow (9 minutes)
DataFrames: Structured Big Data Made Simple (4 minutes)
DataFrame Operations and Schema Management (8 minutes)
Advanced Analytics with PySpark Transformations (5 minutes)
Building Complex Transformation Pipelines (7 minutes)

3 readings, 90 minutes total

PySpark Development Environment and Best Practices (30 minutes)
PySpark DataFrame Programming Guide (30 minutes)
Advanced PySpark DataFrame Operations (30 minutes)

10 assignments, 300 minutes total

PySpark Data Processing Mastery (30 minutes)
PySpark Environment Setup (30 minutes)
PySpark Development Environment (30 minutes)
PySpark Development Fundamentals (30 minutes)
DataFrame Schema and Operations (30 minutes)
DataFrame Data Cleaning Pipeline (30 minutes)
DataFrame Operations and Schema (30 minutes)
Advanced Transformation Patterns (30 minutes)
Complex Analytics Pipeline (30 minutes)
Advanced Transformation Techniques (30 minutes)

Advanced Data Processing with Spark SQL introduces Spark SQL as a powerful interface for structured data processing in distributed environments. Through real-world examples, visual walkthroughs, hands-on labs, and guided design activities, you'll master SQL operations at scale, from basic queries to complex analytical operations, learn to create and manage temporary views and tables, and optimize query performance for production workloads that would overwhelm traditional database systems.

What's included

6 videos, 3 readings, 10 assignments

6 videos, 35 minutes total

SQL at Scale with Spark SQL (4 minutes)
Spark SQL Environment and Basic Queries (7 minutes)
Enterprise Analytics with Advanced Spark SQL (5 minutes)
Implementing Complex Analytical Queries (7 minutes)
Optimizing Spark SQL for Production Performance (5 minutes)
Query Performance Analysis and Tuning (7 minutes)

3 readings, 90 minutes total

Spark SQL Architecture and Programming Model (30 minutes)
Advanced Spark SQL Operations and Optimization (30 minutes)
Spark SQL Performance Tuning and Optimization (30 minutes)

10 assignments, 300 minutes total

Spark SQL Advanced Processing Mastery (30 minutes)
Spark SQL Views and Queries (30 minutes)
Spark SQL Environment Setup (30 minutes)
Spark SQL Fundamentals (30 minutes)
Complex SQL Query Development (30 minutes)
Advanced Analytics Pipeline (30 minutes)
Advanced SQL Operations (30 minutes)
Query Optimization Practice (30 minutes)
Comprehensive Performance Optimization (30 minutes)
Query Optimization Techniques (30 minutes)

Data Visualization for Big Data with Power BI introduces comprehensive visualization techniques specifically designed for big data environments using Microsoft Power BI. Through real-world examples, visual walkthroughs, hands-on labs, and guided design activities, you'll learn to connect Power BI to various big data sources, create effective visualizations for large datasets, build interactive dashboards that enable self-service analytics, and implement best practices for handling performance challenges when visualizing massive datasets.