Modern AI Models for Vision and Multimodal Understanding

This course is part of the Computer Vision Specialization

Instructor: Tom Yeh

3,395 already enrolled

4 modules

Gain insight into a topic and learn the fundamentals.

30 reviews

Advanced level

Flexible schedule

1 week at 10 hours a week

Learn at your own pace

Build toward a degree

What you'll learn

Apply Nonlinear Support Vector Machines (NSVMs) and Fourier transforms to analyze and process visual data.
Use probabilistic reasoning and implement Recurrent Neural Networks (RNNs) to model temporal sequences and contextual dependencies in visual data.
Explain the principles of transformer architectures and how Vision Transformers (ViT) perform image classification and visual understanding tasks.
Implement CLIP for multimodal learning, and utilize diffusion models to generate high-fidelity images.

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

19 assignments

Taught in English

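Build your subject-matter expertise

This course is part of the Computer Vision Specialization

When you enroll in this course, you'll also be enrolled in this Specialization.

Learn new concepts from industry experts
Gain a foundational understanding of a subject or tool
Develop job-relevant skills with hands-on projects
Earn a shareable career certificate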

There are 4 modules in this course

Step into the frontier of artificial intelligence with this advanced course designed to explore the latest models powering visual and multimodal intelligence. From foundational mathematical tools to state-of-the-art architectures, you'll gain the skills to understand and build systems that interpret images, text, and more—just like today’s leading AI models.

You'll begin by discovering how Nonlinear Support Vector Machines (NSVMs) and Fourier transforms lay the groundwork for signal processing and pattern recognition in visual data. You'll then build a strong foundation in probabilistic reasoning and temporal modeling with RNNs, enabling AI systems to understand sequences and context. Next, you'll learn how transformer architectures revolutionize both language and vision tasks. Finally, you'll dive into multimodal learning with CLIP, which connects images and text, and explore diffusion models that generate high-fidelity images through iterative refinement.

This course is ideal for learners who want to go beyond traditional deep learning and explore the models shaping the future of AI. With a blend of theory, code, and real-world applications, you'll be equipped to tackle cutting-edge challenges in computer vision and multimodal AI.

This course can be taken for academic credit as part of CU Boulder's MS in Data Science or MS in Computer Science degrees offered on the Coursera platform. These fully accredited graduate degrees offer targeted courses, short 8-week sessions, and pay-as-you-go tuition. Admission is based on performance in three preliminary courses, not academic history. CU degrees on Coursera are ideal for recent graduates or working professionals.

Learn more:
MS in Data Science: https://hua.dididi.sbs/degrees/master-of-science-data-science-boulder
MS in Computer Science: https://coursera.org/degrees/ms-computer-science-boulder

Welcome to Modern AI Models for Vision and Multimodal Understanding, the third course in the Computer Vision specialization. In this first module, you’ll explore foundational mathematical tools used in modern AI models for vision and multimodal understanding. You’ll begin with Support Vector Machines (SVMs), learning how linear and radial basis function (RBF) kernels define decision boundaries and how support vectors influence classification. Then, you’ll dive into the Fourier Transform, starting with 1D signals and progressing to 2D applications. You’ll learn how to move between time/spatial and frequency domains using the Discrete Fourier Transform (DFT) and its inverse, and how these transformations reveal patterns and structures in data. By the end of this module, you’ll understand how SVMs and Fourier analysis contribute to feature extraction, signal decomposition, and model interpretability in AI systems.
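
As a rough illustration of the two kernels named above (a sketch written for this page, not taken from the course workbook), the linear kernel is a dot product and the RBF kernel is exp(-gamma * ||x - z||^2). The function names, the gamma value, and the two hand-picked support vectors are assumptions made only for the example.

```python
import math

def linear_kernel(x, z):
    """Linear kernel: the dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, dual_coefs, bias, kernel):
    """SVM decision function: sum_i coef_i * K(sv_i, x) + bias.

    Only the support vectors appear in the sum, which is why they
    alone shape the decision boundary. The sign gives the class.
    """
    total = sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs))
    return total + bias

# Hypothetical toy model: one support vector per class, chosen by hand.
svs = [(0.0, 0.0), (2.0, 2.0)]
coefs = [-1.0, 1.0]  # each entry plays the role of alpha_i * y_i

near_positive = decision_value((1.9, 2.1), svs, coefs, 0.0, rbf_kernel)
near_negative = decision_value((0.1, 0.0), svs, coefs, 0.0, rbf_kernel)
print(near_positive > 0, near_negative < 0)
```

With the RBF kernel, each support vector's influence decays with distance, so a query point takes the sign of whichever support vector it sits closest to in this toy setup.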

What's included

14 videos, 8 readings, 5 assignments

14 videos, 85 minutes total

Meet Your Instructor (3 minutes)
Linear SVM (11 minutes)
Visualize Linear (8 minutes)
Radial Basis Function (RBF) (6 minutes)
RBF Kernel (4 minutes)
Visualize a RBF SVM (10 minutes)
1D DFT (6 minutes)
1D Inverse DFT (7 minutes)
1D Basic Functions (5 minutes)
Frequency and Time (6 minutes)
2D DFT (7 minutes)
2D Inverse DFT (3 minutes)
2D Basic Functions (5 minutes)
Frequency and Spatial (4 minutes)

8 readings, 50 minutes total

Course Updates and Accessibility Support (1 minute)
Earn Academic Credit for your Work! (10 minutes)
Course Support (10 minutes)
Inside the Course (5 minutes)
Assessment Expectations (10 minutes)
AI Citation and Acknowledgement (10 minutes)
Get the Workbook: SVM (2 minutes)
Get the Workbook: Fourier 1D & 2D (2 minutes)

5 assignments, 80 minutes total

AI Policy Quiz (5 minutes)
SVM and Fourier (30 minutes)
Support Vector Machine (SVM) (15 minutes)
Fourier 1D (15 minutes)
Fourier 2D (15 minutes)
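
The Fourier topics in this module (1D DFT, its inverse, and the 2D case) can be sketched directly from the textbook definitions. The code below is a minimal, unoptimized illustration under that assumption; it is not the course workbook, and the helper names `dft`, `idft`, and `dft2` are made up for the example.

```python
import cmath

def dft(x):
    """Forward DFT: X[k] = sum_n x[n] * exp(-2j*pi*k*n/N)."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]

def idft(spectrum):
    """Inverse DFT: x[n] = (1/N) * sum_k X[k] * exp(+2j*pi*k*n/N)."""
    n_pts = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * n / n_pts) for k in range(n_pts)) / n_pts
        for n in range(n_pts)
    ]

def dft2(image):
    """2D DFT via separability: 1D DFT of every row, then of every column."""
    rows = [dft(row) for row in image]
    cols = [dft(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major

# One cosine cycle sampled at 4 points: energy lands in bins 1 and 3.
signal = [1, 0, -1, 0]
spectrum = dft(signal)
restored = idft(spectrum)
print([round(abs(c), 6) for c in spectrum])
```

Moving to the frequency domain and back returns the original samples up to floating-point error, which is the round trip the 1D and 2D inverse DFT videos refer to.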

This module invites you to explore how probability theory and sequential modeling power modern AI systems. You’ll begin by examining how conditional and joint probabilities shape predictions in language and image models, and how the chain rule enables structured generative processes. Then, you’ll transition to recurrent neural networks (RNNs), learning how they handle sequential data through hidden states and feedback loops. You’ll compare RNNs to feedforward models, explore architectures like one-to-many and sequence-to-sequence, and address challenges like vanishing gradients. By the end, you’ll understand how probabilistic reasoning and temporal modeling combine to support tasks ranging from text generation to autoregressive image synthesis.
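
To make the chain rule concrete: a joint probability factors as P(x1, ..., xn) = P(x1) * P(x2 | x1) * ... * P(xn | x1, ..., xn-1). The sketch below is a toy under a loud simplifying assumption: each token depends only on the previous one (a bigram model), and the probability table is invented for illustration.

```python
import math

# Hypothetical conditional table P(next | previous); "<s>" marks the start.
COND = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
}

def joint_log_prob(tokens):
    """Sum of log conditionals, i.e. the log of the chain-rule product."""
    prev = "<s>"
    total = 0.0
    for tok in tokens:
        total += math.log(COND[prev][tok])
        prev = tok
    return total

# P(the, cat, sat) = P(the | <s>) * P(cat | the) * P(sat | cat)
p = math.exp(joint_log_prob(["the", "cat", "sat"]))
print(round(p, 6))
```

An autoregressive image model applies the same factorization to pixels visited in a fixed order, with a learned network in place of the lookup table.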

What's included

15 videos, 2 readings, 5 assignments

15 videos, 123 minutes total

Probability in Language Models (10 minutes)
Conditional Probabilities (9 minutes)
The Chain Rule of Probabilities (11 minutes)
Calculating Joint Probabilities (12 minutes)
Pixel-Base Image Models (13 minutes)
Autoregressive Image Model (16 minutes)
Attention Mechanisms in Transformer Models (14 minutes)
Batch vs Recurrent (4 minutes)
MLP vs RNN (12 minutes)
Many to One (4 minutes)
One to Many (2 minutes)
One to One (6 minutes)
Sequence to Sequence (2 minutes)
Deep RNN (5 minutes)
Autoregressive RNN (3 minutes)

2 readings, 4 minutes total

Get the Workbook: Probability (2 minutes)
Get the Workbook: RNN (2 minutes)

5 assignments, 90 minutes total

Probability and RNN (30 minutes)
Probability Part One (15 minutes)
Probability Part Two (15 minutes)
RNN Part One (15 minutes)
RNN Part Two (15 minutes)
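
The hidden-state feedback loop covered in this module's RNN videos can be sketched with the common vanilla recurrence h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h). This is an assumed textbook form, not necessarily the exact cell used in the course, and the weights below are hand-picked so the effect is easy to trace.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def rnn_step(x_t, h_prev, w_xh, w_hh, b_h):
    """One recurrent step: h_t = tanh(W_xh x_t + W_hh h_prev + b_h)."""
    from_input = matvec(w_xh, x_t)
    from_state = matvec(w_hh, h_prev)
    return [math.tanh(a + r + b) for a, r, b in zip(from_input, from_state, b_h)]

def rnn_forward(xs, h0, w_xh, w_hh, b_h):
    """Apply the same cell at every time step and return all hidden states."""
    h, states = h0, []
    for x_t in xs:
        h = rnn_step(x_t, h, w_xh, w_hh, b_h)
        states.append(h)
    return states

# Illustrative weights: 1 input feature, 2 hidden units.
W_XH = [[1.0], [0.5]]
W_HH = [[0.0, 0.0], [1.0, 0.0]]  # unit 1 reads unit 0's previous value
B_H = [0.0, 0.0]

states = rnn_forward([[1.0], [0.0]], [0.0, 0.0], W_XH, W_HH, B_H)
print(states[1])  # second unit is nonzero although the second input is 0
```

A many-to-one model would read only `states[-1]`, while a many-to-many model would attach an output layer to every entry of `states`. A feedforward MLP has no `h_prev` term, so it could not carry the first input forward to the second step.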

This module explores how attention-based architectures have reshaped the landscape of deep learning for both language and vision. You’ll begin by unpacking the mechanics of the Transformer, including self-attention, multi-head attention, and the encoder-decoder structure that enables parallel sequence modeling. Then, you’ll transition to Vision Transformers (ViTs), where images are tokenized and processed using the same principles that revolutionized NLP. Along the way, you’ll examine how normalization, positional encoding, and projection layers contribute to model performance. By the end, you’ll understand how Transformers and ViTs unify sequence and spatial reasoning in modern AI systems.
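
For orientation, self-attention can be sketched as softmax(Q K^T / sqrt(d)) V, the scaled dot-product form from the original Transformer. The sketch below assumes that form and, to stay short, skips the learned query, key, and value projections by using the token vectors directly; the course's exact variant may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of the value vectors.
        outputs.append(
            [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]
        )
    return outputs, weights

# Self-attention: queries, keys, and values all come from the same tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(tokens, tokens, tokens)
print([round(w, 3) for w in weights[0]])
```

Every row of `weights` is computed independently of the others, which is what lets attention process a whole sequence in parallel rather than step by step as an RNN does.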

What's included

15 videos, 2 readings, 5 assignments

15 videos, 81 minutes total

Batch vs Recurrent vs Attention (7 minutes)
Attention + MLP (5 minutes)
Dot-Product Self-Attention (4 minutes)
QKV Self-Attention (4 minutes)
Transformer Encoder (4 minutes)
Self vs Cross Attention (5 minutes)
Encoder and Decoder for Transformer (7 minutes)
Decoder Output Layer (3 minutes)
Image to Tokens (11 minutes)
Normalization for ViT (4 minutes)
Self-Attention for ViT (6 minutes)
Multi-Head Attention (9 minutes)
MLP Forward Feed (4 minutes)
ViT Output Layer (5 minutes)
Loss Gradient for ViT (4 minutes)

2 readings, 4 minutes total

Get the Workbook: Transformer (2 minutes)
Get the Workbook: ViT (2 minutes)

5 assignments, 90 minutes total

Transformer and ViT (30 minutes)
Transformer Part One (15 minutes)
Transformer Part Two (15 minutes)
ViT Part One (15 minutes)
ViT Part Two (15 minutes)
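
The "Image to Tokens" step in a Vision Transformer is usually described as cutting the image into fixed-size patches and flattening each one into a vector. The sketch below shows only that step for a single-channel image, under the assumption of non-overlapping square patches in raster order; `image_to_patches` is a name invented for the example.

```python
def image_to_patches(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch tokens.

    Patches are non-overlapping and emitted left to right, top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append(
                [image[top + r][left + c] for r in range(patch) for c in range(patch)]
            )
    return tokens

# A 4x4 image with pixel values 0..15 becomes four 2x2 patches.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = image_to_patches(image, 2)
print(tokens)
```

In a full ViT, each flattened patch would then pass through a learned linear projection and receive a positional encoding before entering the attention layers.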

In this module, you’ll explore two transformative approaches in multimodal and generative AI. First, you’ll dive into CLIP, a model that learns a shared embedding space for images and text using contrastive pre-training. You’ll see how CLIP enables zero-shot classification by comparing image embeddings to textual descriptions, without needing labeled training data for the target classes. Then, you’ll shift to diffusion models, which generate images through a gradual denoising process. You’ll learn how noise prediction, time conditioning, and reverse diffusion combine to produce high-quality samples. This module highlights how foundational models can bridge modalities and synthesize data with remarkable flexibility.
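
Zero-shot classification with a shared embedding space reduces to a similarity comparison: embed the image, embed one text prompt per candidate class, and pick the prompt with the highest cosine similarity. The sketch below assumes the embeddings already exist; the three-dimensional vectors are invented stand-ins for real encoder outputs, and the temperature value is an illustrative choice.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def zero_shot_probs(image_emb, text_embs, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities."""
    img = normalize(image_emb)
    logits = [
        sum(a * b for a, b in zip(img, normalize(t))) / temperature for t in text_embs
    ]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical embeddings; a real system would get these from the two encoders.
labels = ["a photo of a cat", "a photo of a dog"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
image_emb = [0.8, 0.2, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best)
```

New classes can be added by writing new prompts, with no retraining, because the classifier is just the list of text embeddings.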

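What's included

11 videos, 2 readings, 4 assignments

11 videos, 75 minutes total

Batch of Pairs (6 minutes)
Image Encoder (Batch) (6 minutes)
Text Encoder (Batch) (10 minutes)
Joint Embedding (5 minutes)
Contrastive Pre-Training (13 minutes)
Zero-Shot Image Classifier (6 minutes)
Zero-Shot Image Prediction (7 minutes)
Diffusion Introduction (5 minutes)
Noise Prediction (6 minutes)
Time Conditioning and Parallel Training (5 minutes)
Reverse Diffusion (6 minutes)

2 readings, 4 minutes total

Get the Workbook: CLIP (2 minutes)
Get the Workbook: Diffusion (2 minutes)

4 assignments, 75 minutes total

CLIP and Diffusion (30 minutes)
CLIP Part One (15 minutes)
CLIP Part Two (15 minutes)
Diffusion (15 minutes)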
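
The noise-prediction idea behind this module's diffusion videos can be illustrated with the widely used DDPM-style forward process, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, where abar_t is the cumulative product of (1 - beta_s). That formulation, the linear beta schedule, and the step count are assumptions for this sketch and may not match the course workbook.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t, linear beta schedule."""
    product = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        product *= 1.0 - beta
    return product

def add_noise(x0, t, rng):
    """Forward process: jump straight to step t; returns (x_t, noise used)."""
    ab = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return xt, eps

def estimate_x0(xt, eps_pred, t):
    """Invert the forward equation given a prediction of the noise."""
    ab = alpha_bar(t)
    return [(x - math.sqrt(1.0 - ab) * e) / math.sqrt(ab) for x, e in zip(xt, eps_pred)]

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0]  # stand-in for image pixels
xt, eps = add_noise(x0, 500, rng)

# The true noise stands in for a perfect noise-prediction network here.
recovered = estimate_x0(xt, eps, 500)
print([round(v, 6) for v in recovered])
```

Because any step t can be sampled directly from x_0, training examples for different time steps can be generated in parallel, and the network only needs t as an extra input (time conditioning) to know how much noise to expect.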

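Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV. Share it on social media and in your performance review.

Build toward a degree

This course is part of degree programs offered by University of Colorado Boulder. If you are admitted and enroll, your completed coursework may count toward your degree and your progress can transfer with you.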

Instructor

Instructor ratings

(7 ratings)

Tom Yeh

University of Colorado Boulder

4 courses, 21,067 learners

Offered by

University of Colorado Boulder

Frequently asked questions

To access the course materials and assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

When you enroll in the course, you get access to all of the courses in the Specialization, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.