Building a data pipeline is easy. Building one that automatically recovers from failures, maintains data integrity during outages, and runs reliably in production is what separates junior engineers from platform architects.
This course teaches you to design self-healing pipelines with automated recovery, fault tolerance, and disaster recovery built in from day one. You'll learn to build and schedule streaming workflows using modern orchestrators like Airflow and Prefect, implement reliability patterns including idempotence, checkpointing, and dead-letter queues for exactly-once-ish processing, and design multi-region recovery strategies that keep data flowing during regional failures.
Through hands-on labs and real-world examples from Airbnb, LinkedIn, Netflix, and Uber, you'll master the orchestration and recovery techniques that turn fragile scripts into production-grade infrastructure. Learn to handle automated retries, run safe backfills, implement checkpoint-based recovery, and execute disaster recovery playbooks that restore pipelines after outages.
This course is for engineers who build or maintain real-time data pipelines and need stronger orchestration, reliability, and recovery skills.
Prerequisites: basics of Python and SQL, the Linux CLI, and Kafka fundamentals. A cloud account is helpful but optional.
By the end of the course, learners will be able to design, orchestrate, and recover real-time data pipelines that run reliably at production scale.
Learners set up a modern orchestrator and build a first DAG/flow that runs reliably. We cover scheduling, retries, task dependencies, and lightweight observability. By the end, learners will ship a minimal but production-aware pipeline.
What's included
4 videos, 2 readings, 1 peer review
4 videos • Total 31 minutes
Why Orchestration Matters: From Cron to DAGs • 3 minutes
Build Your First DAG (Airflow) • 9 minutes
Flows the Pythonic Way (Prefect) • 9 minutes
Demo: Scheduling, Retries, and Alerting End-to-End • 10 minutes
2 readings • Total 10 minutes
Welcome to the Course: Course Overview • 5 minutes
Choosing an Orchestrator: Airflow vs. Prefect • 5 minutes
1 peer review • Total 20 minutes
Hands-On-Learning: Ship a Minimal Reliable DAG/Flow • 20 minutes
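The retries and task dependencies covered in this module can be illustrated without any orchestrator installed. The following is a minimal sketch in plain Python, not Airflow or Prefect code: `run_with_retries` and `run_pipeline` are hypothetical helper names, and the three tasks are made up for illustration. It assumes Python 3.9 or later for `graphlib`.

```python
import time
from graphlib import TopologicalSorter


def run_with_retries(task, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call task() until it succeeds or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))


def run_pipeline(tasks, dependencies, **retry_kwargs):
    """Run tasks in dependency order, retrying each one on failure.

    tasks:        name -> zero-argument callable
    dependencies: name -> set of task names that must finish first
    """
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        results[name] = run_with_retries(tasks[name], **retry_kwargs)
    return results


# Hypothetical three-step pipeline; transform fails twice, then succeeds.
attempts = {"transform": 0}


def extract():
    return [1, 2, 3]


def transform():
    attempts["transform"] += 1
    if attempts["transform"] < 3:
        raise RuntimeError("transient failure")
    return "transformed"


def load():
    return "loaded"


results = run_pipeline(
    tasks={"extract": extract, "transform": transform, "load": load},
    dependencies={"extract": set(), "transform": {"extract"}, "load": {"transform"}},
    max_attempts=3,
    base_delay=0.01,
)
print(list(results), attempts["transform"])
```

In this sketch `transform` succeeds on its third attempt, and `load` only starts after `transform` has returned. A real orchestrator adds scheduling, persistence of task state, and alerting on top of the same two ideas.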
Reliability Patterns for Streaming: Idempotence, Checkpoints, and DLQs
Module 2
We move from “works on my machine” to “recovers on its own.” Learners add exactly-once-ish processing, checkpointing, schema controls, and dead-letter queues. The module emphasizes designing for replay and safe backfills.
What's included
3 videos, 1 reading, 1 peer review
3 videos • Total 32 minutes
Exactly-Once with Kafka: What You Really Get • 14 minutes
Checkpointing & State: Replaying Without Duplicates • 8 minutes
DLQs in Practice: From Error Handling to Triaging • 10 minutes
1 reading • Total 5 minutes
Checkpoints & WAL in Structured Streaming • 5 minutes
1 peer review • Total 20 minutes
Hands-On-Learning: Make a Stream Bulletproof: Checkpoints, DLQ, Idempotence • 20 minutes
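The three patterns named in this module fit together in a small way, sketched below in plain Python. This is an illustration under stated assumptions, not Kafka or Structured Streaming code: `process_stream` is a hypothetical function, the checkpoint, sink, and dead-letter queue are in-memory dictionaries, and records are assumed to be `(offset, raw_json)` pairs carrying an `id` and a `value`.

```python
import json


def process_stream(records, state, sink, dlq):
    """Process (offset, raw_json) pairs that come after the last checkpoint.

    state: dict with key "offset", the last offset fully handled
    sink:  dict keyed by event id, so writing the same event twice is a no-op
    dlq:   dict keyed by offset, holding records that could not be processed
    """
    for offset, raw in records:
        if offset <= state["offset"]:
            continue  # already handled before the last checkpoint
        try:
            event = json.loads(raw)
            sink[event["id"]] = event["value"]  # idempotent upsert by key
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            # Park the bad record instead of blocking the stream.
            dlq[offset] = {"raw": raw, "error": repr(exc)}
        state["offset"] = offset  # advance the checkpoint after handling


records = [
    (0, '{"id": "a", "value": 1}'),
    (1, "not json"),
    (2, '{"id": "b", "value": 2}'),
]
state, sink, dlq = {"offset": -1}, {}, {}

process_stream(records, state, sink, dlq)

# Simulate a restart from an older checkpoint and replay the same records.
state["offset"] = 0
process_stream(records, state, sink, dlq)
print(sink, sorted(dlq), state)
```

Replaying from the older checkpoint leaves the sink and the dead-letter queue unchanged because both writes are keyed. In a real system the checkpoint and the output would need to be persisted durably, which this sketch does not attempt.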
Recovery & DR: Backfills, Time Travel, and Cross-Region Replication
Module 3
Learners design for four failure domains: task, job, cluster, and region. We cover backfills vs. reprocessing, Delta time travel for safe fixes, and Kafka replication patterns (MirrorMaker 2, uReplicator) for DR.
What's included
4 videos, 2 readings, 1 assignment, 2 peer reviews
4 videos • Total 34 minutes
Backfills & Reprocessing Without Breaking SLAs • 10 minutes
What is pipeline orchestration and recovery in this course?
It means designing a real-time data pipeline as a coordinated workflow that can schedule work, manage dependencies, and recover cleanly when something fails. The course focuses on making pipelines reliable over time, not just getting a script or job to run once.
When would you use this kind of workflow orchestration?
You would use it when a pipeline needs to run repeatedly, stay observable, and keep data moving even when tasks fail, records are bad, or a dependency becomes unstable. In this course, it is used for real-time and batch-adjacent workflows that need safe retries, replays, and recovery paths.
How does orchestration and recovery fit into a broader workflow?
It sits between writing the logic for individual pipeline steps and running the whole system reliably over time. In this course, that layer turns separate tasks into a repeatable process you can schedule, monitor, backfill, and restore.
How is an orchestrated, recoverable pipeline different from running separate jobs manually?
Manual jobs mainly rely on separate reruns and human judgment, while an orchestrated, recoverable pipeline has defined dependencies, retries, and recovery paths. The course emphasizes coordinated execution and controlled recovery rather than ad hoc fixes after something breaks.
Do you need any prerequisites before learning pipeline orchestration and recovery?
A basic understanding of Python, SQL, the Linux command line, and Kafka fundamentals is helpful before starting this course. Because the course is intermediate-level, it assumes you can follow how tasks, state, and data movement behave in a real pipeline.
What tools, platforms, or methods are used in this course?
The course uses modern workflow orchestrators such as Airflow and Prefect, along with recovery methods like checkpointing and dead-letter queues.
What specific tasks will you practice or complete in this course?
You practice building scheduled workflows with dependencies and retries, and using logs or alerts to investigate failures. You also work on recovery tasks such as restarting from checkpoints, handling bad records safely, and running controlled backfills or failover steps.
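As a rough illustration of what a controlled backfill can look like, the sketch below recomputes daily partitions in plain Python. Everything here is an assumption made for the example: `daily_partitions` and `backfill` are hypothetical helpers, a dictionary stands in for a partitioned table, and `max_partitions` is one simple way to cap the work done per run.

```python
from datetime import date, timedelta


def daily_partitions(start, end):
    """Yield each date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def backfill(start, end, compute, store, completed, max_partitions=None):
    """Recompute daily partitions, overwriting instead of appending.

    compute:   callable(date) -> rows for that day
    store:     dict mapping date -> rows (stands in for a partitioned table)
    completed: set of dates already backfilled, so a rerun can resume
    """
    done = []
    for day in daily_partitions(start, end):
        if day in completed:
            continue  # resume: skip partitions finished by an earlier run
        if max_partitions is not None and len(done) >= max_partitions:
            break  # cap the work per run to protect live workloads
        store[day] = compute(day)  # overwrite, so rerunning is safe
        completed.add(day)
        done.append(day)
    return done


store, completed = {}, set()
start, end = date(2024, 1, 1), date(2024, 1, 5)


def compute(day):
    return [f"row-for-{day.isoformat()}"]


first = backfill(start, end, compute, store, completed, max_partitions=2)
second = backfill(start, end, compute, store, completed, max_partitions=2)
third = backfill(start, end, compute, store, completed, max_partitions=2)
print(len(first), len(second), len(third), len(store))
```

Each run picks up where the previous one stopped, and because every partition is overwritten rather than appended to, repeating a run does not duplicate rows.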