Big Data Analytics

Instructor: Dr. Mohit Bhatnagar

Access provided by New York State Department of Labor

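11 modules

Gain insight into a topic and learn the fundamentals.

Beginner level

Recommended experience

3 weeks to complete

at 10 hours a week

Flexible schedule

Learn at your own pace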

What you'll learn

Gain a deep understanding of Hadoop and Spark ecosystems for managing big data. Become familiar with tools like Hive and Pig to query large datasets.

Details to know

Shareable certificate

Add to your LinkedIn profile

Assignments

16 assignments

Taught in English

There are 11 modules in this course

The Big Data Analytics course offers a deep dive into the technologies, tools, and techniques used to process and analyze large-scale data. Learners will explore the Hadoop and Spark ecosystems, gaining hands-on experience with essential components such as Hadoop Distributed File System (HDFS), MapReduce, Pig, and Hive. The course also covers both relational (SQL) and nonrelational (NoSQL) databases, helping learners understand the appropriate contexts for each type of data storage.

A significant focus is placed on Apache Spark, known for its high-speed, in-memory data processing capabilities, which is vital for handling big data applications. Learners will also work through real-world exercises, including implementing and deploying a machine learning application that processes streaming data on the cloud. Designed for professionals with a background in predictive analytics, basic SQL, and Python programming, this course equips learners with the practical skills to manage data characterized by high volume, velocity, and variety. By the end of the course, participants will be able to derive actionable insights from big data and apply them in business contexts, contributing to improved decision-making and competitive advantage in data-driven environments.

Welcome to the Big Data Analytics course! By the end of this course, you will understand the technologies that make up the Hadoop and Spark ecosystems. You will get hands-on experience with core Hadoop components such as MapReduce and the Hadoop Distributed File System (HDFS). You will learn to write Pig scripts and Hive queries to extract data stored across Hadoop clusters. You will also learn about relational (SQL) and nonrelational (NoSQL) databases and discuss the scenarios in which one is preferred over the other for data storage. You will gain insight into the Spark ecosystem, which runs jobs across clusters very quickly and therefore has several emerging applications. You will also work through a hands-on example of implementing and deploying a machine learning application that handles streaming data on the cloud.

This is an advanced-level course, intended for learners with a background in predictive tools and techniques, experience writing basic Structured Query Language (SQL) queries, and an understanding of Python programming. The knowledge you gain from this course will help you build a career as a business analyst. You will gain the skills to draw insights from data characterized by high velocity, volume, and variety. Data with these characteristics is called big data, and organizations increasingly use it for competitive advantage and decision-making.

In this module, you will learn about big data applications and the components of the Hadoop ecosystem. The module also covers the MapReduce paradigm, which facilitates distributed processing of data. You will gain insight into HDFS and use it to store files. Hands-on examples use the Hortonworks Data Platform Sandbox, which can be installed on a Windows or Mac computer with at least 8 GB of available RAM.
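As a rough illustration of the MapReduce paradigm mentioned above, the sketch below runs a word count in plain Python on a single machine. It is not course material and does not use Hadoop or mrjob; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample sentences are hypothetical, chosen only to show how the map, shuffle, and reduce steps fit together.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group emitted values by key (a real framework does this step)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, used only for illustration.
    sample = ["big data needs big clusters", "data moves fast"]
    print(word_count(sample))
```

On a cluster, the mapper and reducer would run in parallel on different nodes and the framework would handle the shuffle; here all three steps run in one process so the data flow is easy to follow.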

What's included

13 videos, 4 readings, 2 assignments, 1 discussion prompt

13 videos, 96 minutes total

Course Introduction 2 minutes
Introduction to Big Data 7 minutes
Data Types and Applications 4 minutes
The Need and Evolution of Hadoop 5 minutes
The Hadoop Ecosystem 7 minutes
Hortonworks Data Platform Sandbox Installation (Desktop/Laptop) 9 minutes
Hortonworks Data Platform Sandbox Installation (Google Cloud) 15 minutes
The HDFS File System 6 minutes
Hands-On with HDFS on HDP Sandbox (Desktop/Laptop) 10 minutes
Hands-On with HDFS on HDP Sandbox (Google Cloud) 14 minutes
Distributed Computing Using YARN 5 minutes
Introduction to MapReduce 6 minutes
Hands-On with MapReduce Using Python 7 minutes

4 readings, 180 minutes total

Essential Reading: Introduction to Big Data 60 minutes
Recommended Reading: Introduction to Hadoop Ecosystem 30 minutes
Essential Reading: Hands-On with Hadoop 60 minutes
Recommended Reading: mrjob Python Library 30 minutes

2 assignments, 39 minutes total

Introduction to Big Data and Hadoop Ecosystem 24 minutes
Hands-On with Hadoop 15 minutes

1 discussion prompt, 20 minutes total

Applications of Big Data Analytics 20 minutes

This assessment is a graded quiz based on the module covered this week.

What's included

1 assignment

In this module, you will learn about the Hive scripting language and how to use it to mine data from Hadoop clusters. Hive provides an SQL dialect called Hive Query Language (abbreviated HiveQL or just HQL) for querying data stored in a Hadoop cluster. Hive is best suited to data warehouse applications, where relatively static data is analyzed and fast response times are not required. Hive makes it easier for developers to port SQL-based applications to Hadoop, compared with other Hadoop languages and tools. Like all SQL dialects in widespread use, HiveQL does not fully conform to any particular revision of the ANSI SQL standard. It is perhaps closest to MySQL's dialect, but with significant differences. Hive supports several sizes of integer and floating-point types, a boolean type, and character strings of arbitrary length. Finally, you will take a real-world data set and load it in the Ambari environment for analysis using HDFS and HQL, going through the process of creating tables, loading data, and analyzing the data with HiveQL.
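The create, load, and analyze workflow described above can be sketched in miniature with Python's built-in `sqlite3` module. SQLite stands in for Hive here purely so the example runs anywhere; it is not Hive, and HiveQL differs in its table-creation and loading syntax (in Hive, data is typically loaded from files in HDFS with a `LOAD DATA` statement rather than inserted row by row). The `trips` table, its columns, and the `average_fare_by_city` helper are hypothetical.

```python
import sqlite3
from typing import Iterable, List, Tuple


def average_fare_by_city(rows: Iterable[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Create a table, load rows, and run an aggregate query (SQLite stand-in)."""
    conn = sqlite3.connect(":memory:")
    try:
        # Step 1: define the table (hypothetical schema).
        conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
        # Step 2: load the data; Hive would usually load a file from HDFS here.
        conn.executemany("INSERT INTO trips (city, fare) VALUES (?, ?)", rows)
        # Step 3: analyze with a GROUP BY query, the kind of SQL HiveQL also supports.
        cursor = conn.execute(
            "SELECT city, AVG(fare) FROM trips GROUP BY city ORDER BY city"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical sample rows, used only for illustration.
    sample = [("Delhi", 10.0), ("Delhi", 20.0), ("Pune", 8.0)]
    print(average_fare_by_city(sample))
```

The three numbered steps mirror the workflow in the module description; only the engine and the loading mechanism are swapped out.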

What's included

9 videos, 2 readings, 2 assignments, 1 discussion prompt

9 videos, 67 minutes total

Recap of Basic Concepts 6 minutes
Introduction to Hive 6 minutes
Hive Data Types 6 minutes
HQL Commands and Uses 7 minutes
HiveQL Data Definition and Manipulation 6 minutes
Getting Started with Hive 11 minutes
Using the Hive View on Ambari 8 minutes
Practice Example on Hive 8 minutes
Challenge: Hands-On 9 minutes

2 readings, 105 minutes total

Essential Reading: Introduction to Hive 15 minutes
Essential Reading: Hands-On with Hive 90 minutes

2 assignments, 30 minutes total

Introduction to Hive 18 minutes
Hands-On with Hive 12 minutes

1 discussion prompt, 15 minutes total

Introduction to Hive 15 minutes

This assessment is a graded quiz based on the modules covered this week. 

What's included

1 assignment

In this module, you will learn about the Pig Latin scripting language and how you can use it to query big data on Hadoop clusters. You will also learn about the data types and commands available in Pig Latin and how they can be used to define and manipulate data in the Hadoop ecosystem. Furthermore, you will work on a practical example that runs Pig Latin scripts against a publicly available data set for data analysis.

What's included

7 videos, 2 readings, 2 assignments

7 videos, 57 minutes total

Introduction to Pig Latin 8 minutes
Pig Data Types 7 minutes
Pig Latin Commands and Uses 7 minutes
Pig Data Definition and Manipulation 9 minutes
Running Pig View on Ambari 6 minutes
Example on Pig View 10 minutes
Practice Problem as a Challenge 11 minutes

2 readings, 105 minutes total

Essential Reading: Introduction to Pig Language 15 minutes
Recommended Reading: Hands-On with Pig 90 minutes

2 assignments, 30 minutes total

Introduction to Pig Language 24 minutes
Hands-On with Pig 6 minutes

In this module, you will be introduced to the need for NoSQL databases. You will also be introduced to HBase, a NoSQL database, and its role in the Hadoop ecosystem. You will learn about the CAP theorem and how it shapes the trade-offs among the NoSQL database options available on Hadoop. You will also study the three CAP properties (consistency, availability, and partition tolerance) in detail and see how they affect the choice of technology for accessing and manipulating data on Hadoop. Lastly, you will get insights into other emerging cloud-based NoSQL solutions.

What's included

8 videos, 2 readings, 2 assignments, 1 discussion prompt

8 videos, 59 minutes total

Introduction to Data Warehouses 8 minutes
Need for NoSQL Databases 8 minutes
CAP Theorem 8 minutes
Making a Choice of a Database 8 minutes
Introduction to HBase 7 minutes
Architecture of HBase 8 minutes
HBase Data Model 6 minutes
Running and Setting Up HBase on Ambari and Hands-On with HBase 7 minutes

2 readings, 135 minutes total

Essential Reading: Introduction to NoSQL Databases 45 minutes
Recommended Reading: Hands-On with HBase 90 minutes

2 assignments, 30 minutes total

Introduction to NoSQL Databases 15 minutes
Hands-On with HBase 15 minutes

1 discussion prompt, 15 minutes total

Architecture of HBase 15 minutes

This assessment is a graded quiz based on the modules covered this week.

What's included

1 assignment

In this module, you will be introduced to the popular Apache Spark platform for big data processing. You will explore the key components of Apache Spark that provide significant benefits in distributed computing. You will also be introduced to Resilient Distributed Datasets (RDDs) and Spark DataFrames. Furthermore, you will be introduced to Spark SQL and Spark Streaming.

What's included

11 videos, 4 readings, 2 assignments, 1 discussion prompt

11 videos, 70 minutes total

The Need for Spark 5 minutes
Spark Background and Applications 6 minutes
The Resilient Distributed Dataset (RDD) 7 minutes
Hands-On with the PySpark Library in Python 8 minutes
Working with Spark DataFrames and Spark SQL 5 minutes
Hands-On with Structured Queries on Spark 7 minutes
Need for Processing Streaming Data 5 minutes
Introduction to Spark Streaming 6 minutes
Hands-On with DStream API 7 minutes
Structured Streaming 6 minutes
Hands-On with Structured Streaming 6 minutes

4 readings, 360 minutes total

Essential Reading: Introduction to Spark 180 minutes
Recommended Reading: Quick Start on Spark 60 minutes
Essential Reading: Introduction to Spark Streaming 90 minutes
Recommended Reading: Spark Structured Streaming 30 minutes

2 assignments, 30 minutes total

Introduction to the Building Blocks of Spark 15 minutes
Introduction to Spark Streaming 15 minutes

1 discussion prompt, 20 minutes total

Windowing in Structured Streaming 20 minutes

This assessment is a graded quiz based on the module covered this week.

What's included

1 assignment

In this module, you will learn about MLlib, Spark's machine learning library, which is used for making predictions on large datasets that need distributed processing. You will work on regression and classification tasks for large datasets. You will then work through a hands-on exercise that uses streaming data from the Twitter API. This predictive streaming application shows participants an end-to-end big data scenario.