Spark, Hadoop, and Snowflake for Data Engineering

This course is part of the Applied Python Data Engineering Specialization

Instructor: Noah Gift

14,039 already enrolled

4 modules

Gain insight into a topic and learn the fundamentals.

64 reviews

Advanced level

3 weeks to complete

at 10 hours a week

Flexible schedule

Learn at your own pace

What you'll learn

Create scalable data pipelines (Hadoop, Spark, Snowflake, Databricks) for efficient data handling.
Optimize data engineering with clustering and scaling to boost performance and resource use.
Build ML solutions (PySpark, MLflow) on Databricks for seamless model development and deployment.
Implement DataOps and DevOps practices for continuous integration and deployment (CI/CD) of data-driven applications, including automating processes.

Details to know

Shareable certificate

Add to your LinkedIn profile

Assignments

21 assignments

Taught in English

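Build your subject-matter expertise

This course is part of the Applied Python Data Engineering Specialization

When you enroll in this course, you'll also be enrolled in this Specialization.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate

There are 4 modules in this course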

Gain the skills for building efficient and scalable data pipelines. Explore essential data engineering platforms (Hadoop, Spark, and Snowflake) and learn how to optimize and manage them. Delve into Databricks, a powerful platform for executing data analytics and machine learning tasks, while honing your Python data science skills with PySpark. Finally, discover the key concepts of MLflow, an open-source platform for managing the end-to-end machine learning lifecycle, and learn how to integrate it with Databricks.

This course is designed for learners who want to pursue or advance a career in data science or data engineering, and for software developers or engineers who want to grow their data management skill set. In addition to the technologies, you will learn methodologies that sharpen your project management and workflow skills for data engineering, including Kaizen, DevOps, and DataOps methodologies and best practices. With quizzes to test your knowledge throughout, this course will guide you toward becoming a proficient data engineer, ready to tackle the challenges of today's data-driven world.

In this module, you will learn how to work with different data engineering platforms, such as Hadoop and Spark, and apply their concepts to real-world scenarios. First, you will explore the fundamentals of Hadoop to store and process big data. Next, you will delve into Spark concepts, distributed computing, deferred execution, and Spark SQL. By the end of the week, you will gain hands-on experience with PySpark DataFrames, DataFrame methods, and deferred execution strategies.
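To make the idea of deferred execution more concrete, here is a minimal plain-Python sketch. It is an analogy only: the `LazyPipeline` class and its method names are hypothetical and are not the PySpark API. It shows transformations being recorded without running, and an action (`collect`) triggering the actual work.

```python
from typing import Callable, Iterable, Iterator, List


class LazyPipeline:
    """Hypothetical stand-in for a Spark-like dataset (not the PySpark API)."""

    def __init__(self, source: Iterable) -> None:
        self._source = source
        self._steps: List[Callable[[Iterator], Iterator]] = []

    def map(self, fn: Callable) -> "LazyPipeline":
        # Transformation: only recorded, nothing is computed yet.
        new = LazyPipeline(self._source)
        new._steps = self._steps + [lambda it: (fn(x) for x in it)]
        return new

    def filter(self, pred: Callable) -> "LazyPipeline":
        # Transformation: also only recorded.
        new = LazyPipeline(self._source)
        new._steps = self._steps + [lambda it: (x for x in it if pred(x))]
        return new

    def collect(self) -> list:
        # Action: apply every recorded step to the source now.
        it = iter(self._source)
        for step in self._steps:
            it = step(it)
        return list(it)


calls = []


def traced_square(x: int) -> int:
    # Record each call so we can see when the work actually happens.
    calls.append(x)
    return x * x


pipeline = LazyPipeline(range(6)).map(traced_square).filter(lambda v: v % 2 == 0)
work_before_action = len(calls)  # no squares computed so far
result = pipeline.collect()      # the action runs the whole chain
print(work_before_action, result)
```

In Spark the same split exists between transformations, which build a plan, and actions, which execute it; the sketch only mirrors that split and leaves out distribution and query optimization.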

What's included

10 videos, 10 readings, 7 assignments, 1 discussion prompt, 2 ungraded labs

10 videos - 25 minutes total

Meet your Co-Instructor: Kennedy Behrman - 1 minute
Meet your Co-Instructor: Noah Gift - 1 minute
Overview of Big Data Platforms - 2 minutes
Getting Started with Hadoop - 1 minute
Getting Started with Spark - 2 minutes
Introduction to Resilient Distributed Datasets (RDD) - 2 minutes
Resilient Distributed Datasets (RDD) Demo - 4 minutes
Introduction to Spark SQL - 2 minutes
PySpark Dataframe Demo: Part 1 - 3 minutes
PySpark Dataframe Demo: Part 2 - 7 minutes

10 readings - 100 minutes total

Welcome to Data Engineering Platforms with Python! - 10 minutes
Report a problem with the course - 10 minutes
What is Apache Hadoop? - 10 minutes
What is Apache Spark? - 10 minutes
Use Apache Spark in Azure Databricks (optional) - 10 minutes
Choosing between Hadoop and Spark - 10 minutes
What are RDDs? - 10 minutes
Getting Started: Creating RDD's with PySpark - 10 minutes
Spark SQL, Dataframes and Datasets - 10 minutes
PySpark and Spark SQL - 10 minutes

7 assignments - 210 minutes total

PySpark - 30 minutes
Big Data Platforms - 30 minutes
Apache Hadoop Concepts - 30 minutes
Apache Spark Concepts - 30 minutes
RDD Concepts - 30 minutes
Spark SQL Concepts - 30 minutes
PySpark Dataframe Concepts - 30 minutes

1 discussion prompt - 10 minutes total

Meet and Greet (optional) - 10 minutes

2 ungraded labs - 120 minutes total

Practice: Creating RDD's with PySpark - 60 minutes
Practice: Reading Data into Dataframes - 60 minutes

In this module, you will explore the Snowflake platform, gaining insights into its architecture and key concepts. Through hands-on practice in the Snowflake Web UI, you'll learn to create tables, manage warehouses, and use the Snowflake Python Connector to interact with tables. By the end of this week, you'll solidify your understanding of Snowflake's architecture and practical applications, emerging with the ability to effectively navigate and leverage the platform for data management and analysis.
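The Snowflake Python Connector follows the Python DB-API 2.0 pattern of connection, cursor, execute, and fetch. Running against Snowflake needs account credentials, so the sketch below uses the standard library's `sqlite3` as a stand-in to show that pattern. The `trips` table and its rows are hypothetical, and this is an illustration of the workflow, not the course's own lab code.

```python
import sqlite3

# Stand-in connection. With Snowflake, the connection would instead come from
# snowflake.connector.connect(user=..., password=..., account=...).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table (hypothetical schema).
cur.execute("CREATE TABLE trips (city TEXT, rides INTEGER)")

# Write rows. sqlite3 uses "?" placeholders; the Snowflake connector
# defaults to "%s"-style placeholders, so the SQL text would differ slightly.
rows = [("Durham", 120), ("Raleigh", 95), ("Cary", 40)]
cur.executemany("INSERT INTO trips (city, rides) VALUES (?, ?)", rows)
conn.commit()

# Read rows back with a filtered query.
cur.execute(
    "SELECT city, rides FROM trips WHERE rides > ? ORDER BY rides DESC",
    (50,),
)
busy_cities = cur.fetchall()
print(busy_cities)

conn.close()
```

Against Snowflake, a warehouse, database, and schema would also need to be selected before the queries run; those details are left out here.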

What's included

8 videos, 5 readings, 6 assignments

8 videos - 27 minutes total

What is Snowflake? - 2 minutes
Snowflake Layers - 2 minutes
Snowflake Web UI - 4 minutes
Navigating Snowflake - 4 minutes
Creating a Table in Snowflake - 5 minutes
Snowflake Warehouses - 4 minutes
Writing to Snowflake - 3 minutes
Reading from Snowflake - 3 minutes

5 readings - 50 minutes total

Accessing Snowflake - 10 minutes
Detailed View Inside Snowflake - 10 minutes
Snowsight: The Snowflake Web Interface - 10 minutes
Working with Warehouses - 10 minutes
Python Connector Documentation - 10 minutes

6 assignments - 180 minutes total

Snowflake - 30 minutes
Snowflake Architecture - 30 minutes
Snowflake Layers - 30 minutes
Navigating Snowflake - 30 minutes
Creating a Table - 30 minutes
Writing to Snowflake - 30 minutes

In this module, you will practice the essential skills for managing machine learning workflows with Databricks and MLflow. First, you will create a Databricks workspace and configure a cluster. Next, you will load a sample dataset into the workspace with PySpark so you can manipulate and explore the data. Finally, you will install MLflow, either locally or within the Databricks environment, to manage the machine learning lifecycle. By the end of this week, you will be able to create, track, and manage machine learning experiments within Databricks in a reproducible way.

What's included

16 videos, 7 readings, 4 assignments, 1 ungraded lab

16 videos - 72 minutes total

Accessing Databricks - 1 minute
Spark Notebooks with Databricks - 5 minutes
Using Data with Databricks - 5 minutes
Working with Workspaces in Databricks - 3 minutes
Advanced Capabilities of Databricks - 2 minutes
PySpark Introduction on Databricks - 7 minutes
Exploring Databricks Azure Features - 4 minutes
Using the DBFS to AutoML Workflow - 4 minutes
Load, Register and Deploy ML Models - 3 minutes
Databricks Model Registry - 3 minutes
Model Serving on Databricks - 2 minutes
What is MLOps? - 13 minutes
Exploring Open-Source MLFlow Frameworks - 6 minutes
Running MLFlow with Databricks - 6 minutes
End to End Databricks MLFlow - 4 minutes
Databricks Autologging with MLFlow - 4 minutes

7 readings - 70 minutes total

What is Azure Databricks? - 10 minutes
Introduction to Databricks Machine Learning - 10 minutes
What is the Databricks File System (DBFS)? - 10 minutes
Serverless Compute with Databricks - 10 minutes
MLOps Workflow on Azure Databricks - 10 minutes
Run MLFlow Projects on Azure Databricks - 10 minutes
Databricks Autologging - 10 minutes

4 assignments - 120 minutes total

DataBricks - 30 minutes
PySpark SQL - 30 minutes
PySpark DataFrames - 30 minutes
MLFlow with Databricks - 30 minutes

1 ungraded lab - 60 minutes total

ETL-Part-1: Keyword Extractor Tool to HashTag Tool - 60 minutes

In this module, you will explore Kaizen, DevOps, and DataOps and how these methodologies work together to support efficient data engineering workflows. Through practical examples, you will learn how Kaizen's continuous improvement philosophy, DevOps' collaborative practices, and DataOps' focus on data quality and integration combine to improve the development, deployment, and management of data engineering platforms. By the end of this week, you will have the knowledge and perspective needed to optimize data engineering processes and deliver scalable, reliable, high-quality solutions.
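One way to picture the DataOps focus on data quality is an automated check that a batch must pass before a pipeline continues, for example as a step in a CI job. The sketch below is a plain-Python illustration under assumed rules: the function names, the `id` and `city` fields, and the zero-tolerance threshold are all hypothetical and are not taken from the course materials.

```python
from typing import Dict, List

Record = Dict[str, object]


def quality_report(records: List[Record], required: List[str]) -> Dict[str, int]:
    """Count rows with missing required fields and rows with repeated ids."""
    missing = 0
    duplicates = 0
    seen_ids = set()
    for rec in records:
        # A required field counts as missing if absent, None, or empty.
        if any(rec.get(field) in (None, "") for field in required):
            missing += 1
        rec_id = rec.get("id")
        if rec_id in seen_ids:
            duplicates += 1
        seen_ids.add(rec_id)
    return {
        "rows": len(records),
        "missing_required": missing,
        "duplicate_ids": duplicates,
    }


def passes_gate(report: Dict[str, int], max_bad_ratio: float = 0.0) -> bool:
    """Return True when the share of problem rows is within the threshold."""
    bad = report["missing_required"] + report["duplicate_ids"]
    return report["rows"] > 0 and bad / report["rows"] <= max_bad_ratio


dirty_batch = [
    {"id": 1, "city": "Durham"},
    {"id": 2, "city": ""},      # missing required value
    {"id": 2, "city": "Cary"},  # duplicate id
]
clean_batch = [{"id": 1, "city": "Durham"}, {"id": 2, "city": "Cary"}]

dirty_report = quality_report(dirty_batch, required=["id", "city"])
clean_report = quality_report(clean_batch, required=["id", "city"])
print(dirty_report, passes_gate(dirty_report))
print(clean_report, passes_gate(clean_report))
```

Returning a report separately from the pass/fail decision keeps the counts available for logging, which fits the continuous improvement idea of tracking quality over time.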

What's included

21 videos, 7 readings, 4 assignments, 1 ungraded lab

21 videos - 502 minutes total

Kaizen Methodology for Data - 4 minutes
Introducing GitHub CodeSpaces - 9 minutes
Compiling Python in GitHub Codespaces - 18 minutes
Walking through Sagemaker Studio Lab - 29 minutes
Pytest Master Class (Optional) - 166 minutes
What is DevOps? - 2 minutes
DevOps Key Concepts - 36 minutes
Continuous Integration Overview - 32 minutes
Build an NLP in Cloud9 with Python - 43 minutes
Build a Continuously Deployed Containerized FastAPI Microservice - 44 minutes
Hugo Continuous Deploy on AWS - 19 minutes
Container Based Continuous Delivery - 9 minutes
What is DataOps? - 1 minute
DataOps and MLOps with Snowflake - 62 minutes
Building Cloud Pipelines with Step Functions and Lambda - 17 minutes
What is a Data Lake? - 2 minutes
Data Warehouse vs. Feature Store - 2 minutes
Big Data Challenges - 1 minute
Types of Big Data Processing - 1 minute
Real-World Data Engineering Pipeline - 2 minutes
Data Feedback Loop - 1 minute

7 readings - 70 minutes total

GitHub Codespaces Overview - 10 minutes
Getting Started with Amazon SageMaker Studio Lab - 10 minutes
Teaching MLOps at Scale with GitHub (Optional) - 10 minutes
Getting Started with DevOps and Cloud Computing - 10 minutes
Benefits of Serverless ETL Technologies - 10 minutes
Next Steps - 10 minutes
Share your learning experience - 10 minutes

4 assignments - 120 minutes total

DataOps and Operations Methodologies - 30 minutes
Kaizen Methodology - 30 minutes
DevOps - 30 minutes
DataOps - 30 minutes

1 ungraded lab - 60 minutes total

ETL-Part2: SQLite ETL Destination - 60 minutes

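Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Instructor

Instructor ratings

(17 ratings)

Noah Gift

Duke University

40 courses, 265,705 learners

Offered by

Duke University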

Learner reviews

5 stars: 53.12%
4 stars: 17.18%
3 stars: 9.37%
2 stars: 9.37%
1 star: 10.93%

Reviewed on Aug 6, 2024

Great course, detailed steps by step walkthrough that really simplifies understanding

Reviewed on Jan 15, 2024

A course that cover all aspects basic of data engineer, i love it

Frequently asked questions

To access the course materials and assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.