Benchmark & Optimize LLM App Performance

本课程是 Build Next-Gen LLM Apps with LangChain & LangGraph 专项课程的一部分

位教师：Starweaver

访问权限由 New York State Department of Labor 提供

3个模块

深入了解一个主题并学习基础知识。

中级等级

推荐体验

4 小时完成

灵活的计划

自行安排学习进度

3个模块

深入了解一个主题并学习基础知识。

中级等级

推荐体验

4 小时完成

灵活的计划

自行安排学习进度

您将学到什么

Optimize LLM behavior using structured prompting and self-checks to reduce variance and errors.
Design scalable middleware to manage API requests, retries, caching, and token budgets for performance targets.
Build user-centered interfaces that collect feedback and improve LLM accuracy and user trust.

您将获得的技能

您将学习的工具

要了解的详细信息

可分享的证书

添加到您的领英档案

作业

1 项作业

授课语言：英语（English）

了解顶级公司的员工如何掌握热门技能

了解关于 Coursera for Business 的更多信息

Petrobras, TATA, Danone, Capgemini, P&G 和 L'Oreal 的徽标

积累特定领域的专业知识

本课程是 Build Next-Gen LLM Apps with LangChain & LangGraph 专项课程专项课程的一部分

在注册此课程时，您还会同时注册此专项课程。

向行业专家学习新概念
获得对主题或工具的基础理解
通过实践项目培养工作相关技能
获得可共享的职业证书

该课程共有3个模块

Benchmark & Optimize LLM App Performance is a hands-on journey from “it works” to “it flies.” You’ll start by treating speed and cost as product features-defining a baseline with the right metrics (p50/p95 latency, tokens/sec, throughput, determinism, cost per task) and building a lightweight benchmarking harness you can rerun on every change. Next, you’ll learn to hunt bottlenecks across the stack-network, model, prompt, and post-processing-using practical patterns that cut tokens without cutting quality, plus caching strategies for embeddings, RAG, and tool calls. Then you’ll run A/B/C experiments to compare models and prompts on the same dataset, interpret results with simple stats, and choose a winner confidently. Finally, you’ll harden for production with concurrency limits, queues, timeouts, fallbacks, and a 30-day optimization playbook. Expect reusable templates, clear checklists, and realistic demos designed for busy developers and product builders who want measurable gains-not hype.

This course is designed for machine learning engineers, AI developers, data scientists, and product engineers who want to optimize and scale LLM-based applications for production environments. It’s also ideal for backend engineers and DevOps professionals aiming to enhance system performance, reduce latency, and improve cost-efficiency in AI deployments. Additionally, product managers and technical leads overseeing AI-powered systems will benefit from the practical insights provided, helping them to drive improvements in app performance and ensure that their LLM models deliver reliable, high-quality results at scale. This course requires basic knowledge of Python or JavaScript, familiarity with REST APIs, and a high-level understanding of how Large Language Models (LLMs) function. These skills will help you effectively engage with the course content, optimize performance, and implement solutions. By the end of this course, you'll have the skills to optimize LLM performance, tackle real-world bottlenecks, and implement efficient, scalable AI systems. You'll be ready to apply these techniques confidently, making your AI solutions faster, more reliable, and production-ready!

This module establishes why performance is a product feature, not a backend afterthought. We connect latency, cost, and answer quality to user-perceived speed (p50 vs p95, jitter) and trust. You’ll define a minimal metric set-latency, throughput, tokens/sec, determinism, and win-rate-then build a lightweight benchmarking harness that runs a small eval set, logs prompts/outputs, and exports clean CSVs. By the end, you’ll have a reproducible baseline you can rerun on every change.

涵盖的内容

4个视频2篇阅读材料1次同伴评审

4个视频总计26分钟

Welcome to Benchmarking LLM Apps2分钟
Metrics That Matter: Latency, Throughput & Token Efficiency7分钟
Building a Minimal Benchmark Harness (Design Walkthrough)9分钟
Run Your First Baseline & Export the Data8分钟

2篇阅读材料总计10分钟

Welcome to the Course: Course Overview5分钟
Evaluation Best Practices (OpenAI Docs)5分钟

1次同伴评审总计25分钟

Hands-On-Learning: Baseline or Bust: Your First Reproducible Benchmark25分钟

In this module, you'll trace where time actually goes: network hops, model inference, prompt bloat, and post-processing. You’ll learn practical prompt patterns that cut tokens without cutting quality, plus schema-first I/O that improves stability and parsing. We’ll add caching strategies for embeddings, RAG retrievals, and tool calls, including cache keys and invalidation rules to avoid stale answers. Expect clear heuristics for cold vs warm paths and a simple checklist to shave seconds-not just milliseconds.

涵盖的内容

3个视频1篇阅读材料1次同伴评审

The final module turns tuning into a disciplined workflow. You’ll run A/B/C tests across model tiers and prompt variants on the same dataset to compare latency, cost per task, and quality with simple stats - then pick a winner. We’ll cover safe scaling: concurrency limits, queues, backpressure, retries, timeouts, and graceful degradation/fallbacks. You’ll leave with a 30-day optimization plan and a production playbook that keeps your app fast, affordable, and reliable after launch.