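Can I download my work from the project after I complete it?

Yes, you can download and keep any files you create in the course. To do so, make sure you have saved all files and work to your device before exiting the product environment.

How much experience do I need to do this project?

At the top of the page, you can see the experience level recommended for this course.

Can I complete this project directly in a web browser without installing special software?

Yes, everything you need to complete the course is available in your browser.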

Large Multimodal Model Prompting with Gemini

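Instructor: Erwin Huizenga

2,679 already enrolled

Project

Learn in-demand job skills with step-by-step instructions

4.9

(28 reviews)

Beginner level

Recommended experience

2 hours

Learn at your own pace

Hands-on learning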

What you'll learn

Learn state-of-the-art techniques for getting the most out of multimodal AI with Google’s Gemini model family.
Leverage the power of Gemini’s cross-modal attention to fuse information from text, images, and video for complex reasoning tasks.
Extend Gemini’s capabilities with external knowledge and live data via function calling and API integration.

Details to know

Taught in English

No downloads or installation required

Available on desktop only

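Learn, practice, and apply job-ready skills in under 2 hours

Receive training from industry experts
Gain hands-on experience solving real-world job tasks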

About this project

Multimodal models like Gemini are pushing the boundaries of what’s possible by unifying traditionally siloed data modalities. With Gemini, you can build applications that understand and reason across text, images, and videos, enabling a new class of intelligent systems. For example, you could build a virtual interior designer that analyzes a user’s room images, understands their style preferences from a text description, and generates personalized design recommendations. Or you could create a smart document processing pipeline that extracts structured data from complex PDFs, answers questions based on the content, and generates human-like summaries.

You’ll learn prompt engineering techniques to guide Gemini’s behavior and optimize its performance for diverse use cases, from creative story generation to analytical report writing. And you’ll discover how to integrate Gemini with external APIs and databases using function calling, with the ability to infuse your applications with real-time data and dynamic content.

What you’ll learn, in detail:

1. Introduction to Gemini Models: Explore the Gemini model family, and understand the key differences and use cases for Gemini Nano, Pro, Flash, and Ultra. Understand how to select optimal models based on capability, latency, and cost considerations.
2. Multimodal Prompting and Parameter Control: Learn advanced techniques for structuring effective text-image-video prompts to elicit desired model behavior. Fine-tune key parameters like temperature, top_p, top_k to control model creativity vs determinism.
3. Best Practices for Multimodal Prompting: Get experience with prompt engineering for Gemini multimodal models, and best practices around role assignment, task decomposition, and formatting. Analyze the impact of prompt-image ordering on model performance for different objectives.
4. Creating Use Cases with Images: Build engaging multimodal applications like interior design assistants and receipt itemization tools. Leverage Gemini’s cross-modal reasoning capabilities to analyze relationships between entities across multiple images.
5. Developing Use Cases with Videos: Implement “needle in the haystack” semantic video search powered by Gemini’s large context window. Explore techniques for long-form video QA and content summarization.
6. Integrating Real-Time Data with Function Calling: Extend Gemini with external knowledge and live data via function calling and API integration. Combine Gemini’s Natural Language Understanding (NLU) capabilities with APIs for up-to-date facts and interactive services.
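
The function calling topic above can be illustrated with a small, library-free sketch. It is not the Gemini SDK and not the course's notebook code: the tool `get_exchange_rate`, the declaration format, the `{"name", "args"}` message shape, and the placeholder rate are all assumptions made for illustration. The idea it shows is the general loop: the application declares tools, the model replies with a structured request naming a tool and its arguments, and the application runs the tool and hands the result back.

```python
import json


def get_exchange_rate(base: str, target: str) -> dict:
    """Hypothetical tool. A real application would query a live API here."""
    placeholder_rates = {("USD", "EUR"): 0.9}  # made-up value, not real data
    return {"base": base, "target": target,
            "rate": placeholder_rates.get((base, target))}


# What the model would be told about the tool (assumed JSON-schema-style shape).
TOOL_DECLARATIONS = [{
    "name": "get_exchange_rate",
    "description": "Look up the exchange rate between two currencies.",
    "parameters": {
        "type": "object",
        "properties": {"base": {"type": "string"},
                       "target": {"type": "string"}},
        "required": ["base", "target"],
    },
}]

# Maps declared tool names to the Python callables that implement them.
TOOL_REGISTRY = {"get_exchange_rate": get_exchange_rate}


def dispatch(function_call: dict) -> dict:
    """Run the tool the model asked for and wrap the result for the next turn."""
    name = function_call["name"]
    if name not in TOOL_REGISTRY:
        return {"name": name, "response": {"error": f"unknown function: {name}"}}
    result = TOOL_REGISTRY[name](**function_call.get("args", {}))
    return {"name": name, "response": result}


# Simulated model output; a real model would produce this from a user prompt.
simulated_call = {"name": "get_exchange_rate",
                  "args": {"base": "USD", "target": "EUR"}}
print(json.dumps(dispatch(simulated_call), indent=2))
```

In a real integration, the dictionary returned by `dispatch` would be sent back to the model so it can phrase a final answer; that round trip is omitted here.
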
Through this course, you’ll become well-versed in Gemini’s capabilities and how to maximize them in different use cases, and you’ll build a portfolio of practical techniques for architecting advanced multimodal AI applications. Note that due to technical requirements, this course features downloadable-only notebooks on the learning platform. You are free to download, review, and run these notebooks on your own.
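
For readers curious about the temperature, top_p, and top_k parameters listed under Multimodal Prompting and Parameter Control, here is a minimal sketch of how such settings typically shape token sampling. It uses toy logits and one common ordering (temperature, then top_k, then top_p); the page does not describe how Gemini applies them internally, so the ordering, the function names, and the toy vocabulary are assumptions.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature, top_k, and top_p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda pair: pair[1], reverse=True)
    # top_k: keep only the k most probable tokens.
    if top_k is not None:
        ranked = ranked[:top_k]
    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(p for _, p in ranked)
    return {tok: p / norm for tok, p in ranked}


def sample_token(logits, rng, **params):
    """Draw one token from the filtered distribution using the given RNG."""
    dist = filter_distribution(logits, **params)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


toy_logits = {"sofa": 2.0, "chair": 1.5, "lamp": 0.5, "rug": -1.0}
print(filter_distribution(toy_logits, temperature=0.5, top_k=2))
print(sample_token(toy_logits, random.Random(0), temperature=1.0, top_p=0.9))
```

Lower temperature and tighter top_k or top_p push output toward the most likely tokens (more deterministic); higher values leave more candidates in play (more varied).
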