This course explores the foundations and evolution of modern transformer architectures, taking you from early sequence models to the advanced multimodal systems behind today's AI breakthroughs. Combining conceptual depth with practical demonstrations, it offers a structured journey through attention mechanisms, transformer design, efficiency innovations, and large-scale training strategies.
You will begin by understanding Recurrent Neural Networks (RNNs), LSTMs, and GRUs, examining their strengths and limitations in modeling sequential data. From there, you'll transition into attention mechanisms and multi-head attention, uncovering how transformers overcame long-standing challenges such as vanishing gradients and the difficulty of modeling long-range dependencies. As the course progresses, you'll build a deep understanding of encoder-decoder architectures, positional encoding techniques such as sinusoidal embeddings and Rotary Position Embedding (RoPE), and efficiency innovations like Flash Attention, Grouped-Query Attention (GQA), and Mixture of Experts (MoE).
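To make the attention discussion concrete, here is a minimal NumPy sketch of scaled dot-product attention, the building block behind the multi-head attention covered in the course. The function name and toy shapes are illustrative assumptions, not the course's reference implementation.

```python
# Minimal sketch of scaled dot-product attention (illustrative, NumPy only).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)      # (..., seq_q, seq_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # (..., seq_q, d_v)

# Toy usage: 4 query positions attending over 4 key/value positions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Multi-head attention simply runs several such attention operations in parallel over learned projections of Q, K, and V, then concatenates the results.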
The course then expands into multimodal learning and similarity-based systems. You'll explore Vision Transformers (ViTs), embedding alignment techniques, contrastive learning, and large-scale distributed training strategies. Through demonstrations and analysis, you'll see how modern transformer systems scale to massive datasets while remaining performant and memory-efficient.
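As one concrete illustration of the embedding-alignment idea, here is a minimal NumPy sketch of the cosine-similarity matrix at the heart of contrastive, CLIP-style setups. The variable names and shapes are hypothetical, chosen only to show the computation.

```python
# Illustrative sketch: cosine similarity between L2-normalized embeddings,
# the core operation behind contrastive and similarity-based systems.
import numpy as np

def cosine_similarity(a, b):
    """Pairwise cosine similarity between two batches of embeddings."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T  # (n_a, n_b) similarity matrix

rng = np.random.default_rng(1)
image_emb = rng.normal(size=(3, 16))  # hypothetical vision-encoder outputs
text_emb = rng.normal(size=(3, 16))   # hypothetical text-encoder outputs
sims = cosine_similarity(image_emb, text_emb)
print(sims.round(2))  # higher values indicate better-aligned pairs
```

Contrastive training pushes the diagonal of this matrix (matched pairs) up and the off-diagonal entries (mismatched pairs) down.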
By the end of this course, you will be able to:
• Explain the limitations of traditional RNN-based sequence models and how attention mechanisms address them.
• Implement and analyze multi-head attention and transformer encoder-decoder architectures.
• Compare positional encoding strategies and assess their impact on model generalization (see the sketch after this list).
• Evaluate efficiency techniques such as Flash Attention, GQA, and MoE for scaling transformers.
• Understand Vision Transformers and multimodal representation learning.
• Apply similarity learning concepts using embeddings and distance metrics.
• Design scalable transformer training systems using distributed and memory-optimized strategies.
• Architect transformer-based systems for real-world NLP and multimodal applications.
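For the positional-encoding objective above, here is a minimal sketch of the sinusoidal encoding from the original Transformer paper (Vaswani et al., 2017); the function name is illustrative, and an even model dimension is assumed.

```python
# Minimal sketch of sinusoidal positional encoding (assumes even d_model).
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same)."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, d_model=64)
print(pe.shape)  # (128, 64)
```

Because each position maps to a unique pattern of wavelengths, the model can attend by relative offset; RoPE achieves a related effect by rotating query and key vectors instead of adding an encoding.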
This course is ideal for AI engineers, machine learning practitioners, researchers, and advanced students who want a rigorous understanding of transformer systems beyond surface-level usage. A foundational understanding of Python and basic neural networks will be helpful.
Join us to master transformer architectures, explore multimodal intelligence, and build the technical depth required to understand and scale the models shaping modern AI.