When will I have access to the lectures and assignments?

To access the course materials and assignments and to earn a Certificate, you will need to purchase the Certificate experience when you enroll in a course. You can try a Free Trial instead, or apply for Financial Aid. The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. It also means that you will not be able to purchase a Certificate experience.

What will I get if I purchase the Certificate?

When you purchase a Certificate you get access to all course materials, including graded assignments. Upon completing the course, your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile.

Is financial aid available?

Yes. In select learning programs, you can apply for financial aid or a scholarship if you can’t afford the enrollment fee. If financial aid or a scholarship is available for your learning program selection, you’ll find a link to apply on the description page.

Multicore and GPGPU Programming

Instructor: Kunal Kishore Korgaonkar

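12 modules

Gain insight into a topic and learn the fundamentals.

Intermediate level

Recommended experience

8 weeks to complete

at 10 hours a week

Flexible schedule

Learn at your own pace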

What you'll learn

Understand the fundamentals of multi-threaded programming and its applications in multicore systems.
Develop shared memory programs in OpenMP and distributed programming using MPI.
Gain a foundational understanding of GPGPU architecture and the CUDA programming model.

Skills you'll gain

Tools you'll learn

C (Programming Language)

Details to know

Shareable certificate

Add to your LinkedIn profile

Assessments

124 assignments

Taught in English

91% of learners achieved a positive career outcome

There are 12 modules in this course

The course "Multicore and GPGPU Programming" provides a foundational understanding of parallel programming, focusing on developing high-performance, multi-threaded applications in both CPU and GPU environments. Beginning with a review of multicore processor architectures, caching mechanisms, and Non-Uniform Memory Access (NUMA) systems, students will learn the essentials of shared memory programming, synchronisation techniques, and the use of locks to ensure data integrity across threads.

The course delves into designing shared memory data structures and introduces advanced synchronisation concepts, including lazy synchronisation, crucial for scalable and efficient concurrent applications. Additionally, students will explore the architecture and programming model of General-Purpose Graphics Processing Units (GPGPUs) and learn CUDA programming to leverage GPU parallelism for compute-intensive tasks. By the end of the course, students will be adept in optimising multi-threaded and many-core applications, balancing workload across CPUs and GPUs to achieve high throughput and efficient resource utilisation. This course is essential for those aiming to develop expertise in high-performance computing and parallel programming for modern multi-core and GPU-based systems.

Module details

In this module, the learners will be introduced to the course and its syllabus, setting the foundation for their learning journey. The course's introductory video will provide them with insights into the valuable skills and knowledge they can expect to gain throughout the duration of this course. Additionally, the syllabus reading will comprehensively outline essential course components, including course values, assessment criteria, grading system, schedule, details of live sessions, and a recommended reading list that will enhance the learner’s understanding of the course concepts. Moreover, this module offers the learners the opportunity to connect with fellow learners as they participate in a discussion prompt designed to facilitate introductions and exchanges within the course community.

What's included

4 videos, 1 reading, 1 discussion prompt

In this module, students will gain foundational knowledge of parallel and multi-threaded programming, exploring the core principles that underlie the efficient utilisation of modern multi-core and many-core processors. Beginning with an overview of parallel programming concepts, this module covers different types of parallelism, including data parallelism, task parallelism, and pipeline parallelism. Students will also examine critical performance metrics like speedup, efficiency, and scalability, which help in evaluating the benefits and trade-offs of parallel approaches.
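The performance metrics named above can be written as a few one-line functions. This is a minimal sketch using the standard textbook definitions, not the course's own code; the function names and parameters are hypothetical. Amdahl's Law appears in this module's lecture list and is included as the usual upper bound on speedup.

```c
/* Speedup S = T_serial / T_parallel. */
double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}

/* Efficiency E = S / p, where p is the number of cores used. */
double efficiency(double t_serial, double t_parallel, int p) {
    return speedup(t_serial, t_parallel) / p;
}

/* Amdahl's law: upper bound on speedup when only a fraction f
   (0 <= f <= 1) of the serial run time can be spread over p cores. */
double amdahl_speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}
```

For example, a program that takes 10 s serially and 2.5 s on 8 cores has a speedup of 4 and an efficiency of 0.5. With 90% of the work parallelisable, `amdahl_speedup` stays below 10 however many cores are added.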

What's included

12 videos, 2 readings, 12 assignments, 1 discussion prompt

12 videos, 73 minutes total

Need for Ever-Increasing Performance (8 minutes)
Parallel Systems and Parallel Programs (8 minutes)
Concurrent, Parallel, Distributed Systems (5 minutes)
Types of Parallelism: Data, Task and Pipeline Parallelism (8 minutes)
Speedup and Efficiency (5 minutes)
Amdahl’s Law (5 minutes)
Gustafson’s Law (5 minutes)
Scalability in Parallel Systems (5 minutes)
Cost of Parallelisation (7 minutes)
Sources of Overhead in Parallel Programs (5 minutes)
Timing Parallel Programs: Methods and Best Practices (7 minutes)
GPU Performance (5 minutes)

2 readings, 120 minutes total

Recommended Reading: Fundamentals of Parallel Computing (60 minutes)
Recommended Reading: Introduction to Performance Metrics in Parallel Computing (60 minutes)

12 assignments, 36 minutes total

Need for Ever-Increasing Performance (3 minutes)
Parallel Systems and Parallel Programs (3 minutes)
Concurrent, Parallel, Distributed Systems (3 minutes)
Types of Parallelism: Data, Task and Pipeline Parallelism (3 minutes)
Speedup and Efficiency (3 minutes)
Amdahl’s Law (3 minutes)
Gustafson’s Law (3 minutes)
Scalability in MIMD Systems (3 minutes)
Cost of Parallelisation (3 minutes)
Sources of Overhead in Parallel Programs (3 minutes)
Taking Timings of Parallel Programs (3 minutes)
GPU Performance (3 minutes)

1 discussion prompt, 30 minutes total

Why Parallelism? Revisiting the Roots of Multicore Programming (30 minutes)

This module provides an in-depth exploration of multicore processor architectures, examining the design principles, performance considerations, and challenges involved in building efficient multicore systems. Students will study how multiple cores interact within a processor, focusing on memory hierarchies, caching mechanisms, and the role of parallelism in improving computational performance.

What's included

15 videos, 2 readings, 15 assignments, 1 discussion prompt

15 videos, 160 minutes total

The Von Neumann Architecture (7 minutes)
Processes, Multitasking, and Threads (5 minutes)
The Basics of Caching (7 minutes)
Virtual Memory (7 minutes)
Instruction-Level Parallelism (9 minutes)
Hardware Multithreading (6 minutes)
Classifications of Parallel Computers (6 minutes)
SIMD and MIMD Systems (7 minutes)
Interconnection Networks: Shared Memory Systems (6 minutes)
Interconnection Networks: Distributed Memory Systems (8 minutes)
Cache Coherence (8 minutes)
Shared-Memory vs. Distributed-Memory (4 minutes)
Parallel Software: Coordinating Process and Threads (11 minutes)
Distributed Memory Software (7 minutes)
Recording of Multicore and GPGPU Programming: Week 2 - Live Session on 25-05-30 18:35:08 [02:05] (62 minutes)

2 readings, 100 minutes total

Recommended Reading: Architecture Background (40 minutes)
Recommended Reading: Parallel Hardware and Software (60 minutes)

15 assignments, 114 minutes total

Graded Quiz - Modules 1 and 2 (60 minutes)
The Von Neumann Architecture (3 minutes)
Processes, Multitasking, and Threads (3 minutes)
The Basics of Caching (3 minutes)
Virtual Memory (3 minutes)
Instruction-Level Parallelism (3 minutes)
Hardware Multithreading (3 minutes)
Classifications of Parallel Computers (3 minutes)
SIMD and MIMD Systems (3 minutes)
Interconnection Networks: Shared Memory Systems (3 minutes)
Interconnection Networks: Distributed Memory Systems (6 minutes)
Cache Coherence (3 minutes)
Shared-Memory vs. Distributed-Memory (3 minutes)
Parallel Software: Coordinating Process and Threads (12 minutes)
Distributed Memory Software (3 minutes)

1 discussion prompt, 30 minutes total

From Von Neumann to Multicore: Evolving Architectures and Memory Realities (30 minutes)

This module introduces students to the architectural principles of General-Purpose GPU (GPGPU) systems and the CUDA programming model. It explores the hardware components, including Streaming Multiprocessors (SMs), CUDA cores, and memory hierarchy, which form the foundation of GPU computing. The module also provides an overview of the CUDA programming model, emphasising its thread hierarchy, grid, and block organisation. By understanding these fundamental concepts, students will develop the ability to harness GPU architecture for high-performance parallel computing.

What's included

15 videos, 2 readings, 14 assignments, 1 discussion prompt

15 videos, 127 minutes total

GPUs and GPGPU (5 minutes)
GPU Architecture (5 minutes)
Heterogeneous Computing (4 minutes)
Paradigm of Heterogeneous Computing (5 minutes)
Introduction to CUDA (5 minutes)
Structure of a CUDA Program (8 minutes)
Threads, Blocks, and Grid (9 minutes)
Managing Memory (7 minutes)
Writing and Verifying Your Kernel (6 minutes)
Compiling and Running CUDA Program (4 minutes)
Nvidia Compute Capabilities and Device Architecture (6 minutes)
Timing Your Kernel (7 minutes)
Organising Parallel Threads (5 minutes)
Managing Devices (4 minutes)
Recording of Multicore and GPGPU Programming: Week 3 - Live Session on 25-06-06 18:31:21 [44:50] (45 minutes)

2 readings, 75 minutes total

Recommended Reading: GPGPU Architecture and CUDA (15 minutes)
Recommended Reading: Programming Model Overview (60 minutes)

14 assignments, 48 minutes total

GPUs and GPGPU (6 minutes)
GPU Architecture (3 minutes)
Heterogeneous Computing (3 minutes)
Paradigm of Heterogeneous Computing (3 minutes)
Introduction to CUDA (3 minutes)
Structure of a CUDA Program (3 minutes)
Threads, Blocks, and Grid (6 minutes)
Managing Memory (3 minutes)
Writing and Verifying Your Kernel (3 minutes)
Compiling and Running CUDA Program (3 minutes)
Nvidia Compute Capabilities and Device Architecture (3 minutes)
Timing Your Kernel (3 minutes)
Organising Parallel Threads (3 minutes)
Managing Devices (3 minutes)

1 discussion prompt, 30 minutes total

Harnessing GPU Power: Exploring CUDA and the Architecture of Parallelism (30 minutes)

This module provides a comprehensive understanding of how CUDA executes programs on GPUs. It covers key concepts such as warps, warp scheduling, and resource partitioning, which are critical for understanding GPU hardware behaviour. The module delves into branch divergence and its impact on performance, offering strategies to minimise its effects. It also emphasises exposing parallelism effectively by leveraging CUDA’s hierarchical execution model. Students will learn how to design and optimise GPU programs by aligning with the underlying execution model to maximise efficiency and throughput.

What's included

15 videos, 2 readings, 15 assignments, 1 discussion prompt

15 videos, 135 minutes total

Introduction to CUDA Execution Model (7 minutes)
Warps and Thread Blocks (4 minutes)
Warp Divergence (9 minutes)
Resource Partitioning (6 minutes)
Latency Hiding (10 minutes)
Occupancy (5 minutes)
Synchronization (4 minutes)
Scalability (5 minutes)
Exposing Parallelism (10 minutes)
Checking Active Warps with Nvprof (6 minutes)
Checking Memory Operations with Nvprof (7 minutes)
Avoiding Branch Divergence (3 minutes)
The Parallel Reduction Problem and Thread Divergence (7 minutes)
Improving Divergence in Parallel Reduction (6 minutes)
Recording of Multicore and GPGPU Programming: Week 4 - Live Session on 25-06-13 18:32:39 [49:37] (45 minutes)

2 readings, 120 minutes total

Recommended Reading: Structure of a CUDA Program (60 minutes)
Recommended Reading: Exposing Parallelism and Avoiding Branch Divergence (60 minutes)

15 assignments, 105 minutes total

Graded Quiz - Modules 3 and 4 (60 minutes)
Introduction to CUDA Execution Model (3 minutes)
Warps and Thread Blocks (3 minutes)
Warp Divergence (3 minutes)
Resource Partitioning (6 minutes)
Latency Hiding (3 minutes)
Occupancy (3 minutes)
Synchronization (3 minutes)
Scalability (3 minutes)
Exposing Parallelism (3 minutes)
Checking Active Warps with Nvprof (3 minutes)
Checking Memory Operations with Nvprof (3 minutes)
Avoiding Branch Divergence (3 minutes)
The Parallel Reduction Problem and Thread Divergence (3 minutes)
Improving Divergence in Parallel Reduction (3 minutes)

1 discussion prompt, 30 minutes total

Under the Hood: Warps, Divergence, and CUDA Execution Dynamics (30 minutes)

The CUDA Memory Model & Streams and Concurrency module introduces students to the intricacies of memory hierarchy in CUDA, including global, shared, and local memory. It emphasises the importance of memory coalescing and efficient memory access patterns to optimise performance on GPUs. The module also covers CUDA streams, explaining how concurrent kernel execution and memory operations can be managed to enhance parallelism. By understanding these concepts, students will gain the ability to design GPU programs that maximise throughput and minimise latency.

What's included

14 videos, 2 readings, 14 assignments, 1 discussion prompt, 1 ungraded lab

14 videos, 126 minutes total

Introduction to CUDA Memory Model (8 minutes)
Memory Allocation and Deallocation (6 minutes)
Zero Copy Memory (4 minutes)
Unified Virtual Addressing and Unified Memory (3 minutes)
Aligned and Coalesced Access (6 minutes)
CUDA Shared Memory (6 minutes)
Shared Memory Banks and Access Mode (7 minutes)
Configuring the Amount of Shared Memory (5 minutes)
Synchronisation (9 minutes)
CUDA Streams (7 minutes)
Stream Scheduling and Priorities (6 minutes)
CUDA Events (6 minutes)
Concurrent Kernel Execution (6 minutes)
Recording of Multicore and GPGPU Programming: Week 5 - Live Session on 25-06-20 18:31:59 [47:36] (48 minutes)

2 readings, 120 minutes total

Recommended Reading: CUDA Memory Model (60 minutes)
Recommended Reading: Streams and Concurrency (60 minutes)

14 assignments, 342 minutes total

SGA-1: CUDA Programming and Performance Optimisation (300 minutes)
Introduction to CUDA Memory Model (3 minutes)
Memory Allocation and Deallocation (3 minutes)
Zero Copy Memory (3 minutes)
Unified Virtual Addressing and Unified Memory (3 minutes)
Aligned and Coalesced Access (3 minutes)
CUDA Shared Memory (6 minutes)
Shared Memory Banks and Access Mode (3 minutes)
Configuring the Amount of Shared Memory (3 minutes)
Synchronisation (3 minutes)
CUDA Streams (3 minutes)
Stream Scheduling and Priorities (3 minutes)
CUDA Events (3 minutes)
Concurrent Kernel Execution (3 minutes)

1 discussion prompt, 30 minutes total

Smart Memory and Seamless Concurrency: CUDA Memory and Streams (30 minutes)

1 ungraded lab, 60 minutes total

Hands-on lab: Parallel Matrix Addition Using CUDA (60 minutes)

This module explains in depth the difference between processes and threads and introduces multithreaded programming using the Pthreads library. Students are expected to learn the library's main functions and apply them to solve real-world problems with a multithreaded approach. The module also discusses precautions to take when developing an algorithm that uses multithreading.
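As an illustration of the kind of precaution this module refers to, the sketch below protects a shared counter with a mutex so that concurrent increments are not lost. It is a minimal example under stated assumptions, not the course's own material: `NUM_THREADS`, `INCREMENTS`, `worker`, and `run_counter_demo` are hypothetical names, while the `pthread_*` calls are standard POSIX.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4        /* hypothetical thread count */
#define INCREMENTS  100000   /* hypothetical work per thread */

static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread adds INCREMENTS to the shared counter. The mutex turns
   the read-modify-write on `counter` into a critical section. */
static void *worker(void *arg) {
    (void)arg;  /* unused */
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Starts the threads, waits for all of them, and returns the final
   count, or -1 if a thread could not be created. */
long run_counter_demo(void) {
    pthread_t threads[NUM_THREADS];
    int created = 0;

    counter = 0;
    while (created < NUM_THREADS) {
        if (pthread_create(&threads[created], NULL, worker, NULL) != 0)
            break;
        created++;
    }
    /* Join only the threads that were actually started. */
    for (int t = 0; t < created; t++)
        pthread_join(threads[t], NULL);

    return created == NUM_THREADS ? counter : -1;
}
```

With the lock in place, `run_counter_demo()` returns `NUM_THREADS * INCREMENTS`. Removing the lock and unlock calls creates a data race, and the result typically comes out lower and varies between runs.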

What's included

10 videos, 11 readings, 10 assignments, 1 discussion prompt

10 videos, 116 minutes total

Processes, Threads and Pthreads (4 minutes)
Hello World!! (9 minutes)
Matrix-Vector Multiplication (13 minutes)
Critical Sections (5 minutes)
Busy Waiting (6 minutes)
Mutexes (5 minutes)
Semaphores (7 minutes)
Barriers and Condition Variables (13 minutes)
Caches, Cache-Coherence and False Sharing (9 minutes)
Recording of Multicore and GPGPU Programming: Week 6 - Live Session on 25-06-27 18:38:36 [43:53] (44 minutes)

11 readings, 295 minutes total

Recommended Reading: Processes, Threads and Pthreads (10 minutes)
Recommended Reading: Hello World!! (60 minutes)
Recommended Reading: Matrix-Vector Multiplication (15 minutes)
Recommended Reading: Critical Sections (30 minutes)
Recommended Reading: Busy Waiting (20 minutes)
Recommended Reading: Mutexes (15 minutes)
Recommended Reading: Semaphores (30 minutes)
Recommended Reading: Barriers and Condition Variables (30 minutes)
Recommended Reading: Read-Write Locks (60 minutes)
Recommended Reading: Caches, Cache-Coherence and False Sharing (15 minutes)
Lab Instruction Document (10 minutes)

10 assignments, 135 minutes total

Graded Quiz - Modules 5 and 6 (60 minutes)
Processes, Threads and Pthreads (9 minutes)
Hello World!! (9 minutes)
Matrix-Vector Multiplication (9 minutes)
Critical Sections (9 minutes)
Busy Waiting (9 minutes)
Mutexes (9 minutes)
Semaphores (6 minutes)
Barriers and Condition Variables (6 minutes)
Caches, Cache-Coherence and False Sharing (9 minutes)

1 discussion prompt, 10 minutes total

Thread Synchronization and Shared Memory: Building Reliable Parallel Programs with Pthreads (10 minutes)

This module introduces students to distributed-memory programming using the Message Passing Interface (MPI). Students will learn about the functions provided by the MPI library and what each one does. This will enable them to write parallel programs and to convert serial code into parallel code with the help of the MPI functions.
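This module's lectures include odd-even transposition sort as a worked example of converting serial code to parallel code. As a starting point, here is a minimal serial sketch of that algorithm in C. It is an illustration rather than the course's own code, and the function names are hypothetical. In a message-passing version, each process would typically hold a block of the array and exchange data with a neighbouring process in each phase.

```c
/* Swap two integers in place. */
static void swap_int(int *x, int *y) {
    int t = *x;
    *x = *y;
    *y = t;
}

/* Serial odd-even transposition sort of a[0..n-1].
   Even phases compare the pairs (0,1), (2,3), ...;
   odd phases compare the pairs (1,2), (3,4), ...
   After n phases the array is sorted in ascending order. */
void odd_even_sort(int *a, int n) {
    for (int phase = 0; phase < n; phase++) {
        int start = (phase % 2 == 0) ? 0 : 1;
        for (int i = start; i + 1 < n; i += 2) {
            if (a[i] > a[i + 1])
                swap_int(&a[i], &a[i + 1]);
        }
    }
}
```

The comparisons within a single phase touch disjoint pairs, so they are independent of one another. That independence is what makes the algorithm a natural candidate for parallelisation.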

What's included

7 videos, 9 readings, 7 assignments, 1 discussion prompt

7 videos, 70 minutes total

Introduction to MPI (4 minutes)
MPI Setup and Communicator Functions (6 minutes)
SPMD and Communication (10 minutes)
Potential Pitfalls (4 minutes)
Simple Serial Sorting Algorithm (20 minutes)
Parallel Odd-Even Transposition Sort (19 minutes)
Safety in MPI Programs (7 minutes)

9 readings, 125 minutes total

Recommended Reading: Introduction to MPI (15 minutes)
Recommended Reading: MPI Setup and Communicator Functions (15 minutes)
Recommended Reading: SPMD and Communication (15 minutes)
Recommended Reading: Potential Pitfalls (15 minutes)
Recommended Reading: Simple Serial Sorting Algorithm (15 minutes)
Recommended Reading: Parallel Odd-Even Transposition Sort (15 minutes)
Recommended Reading: Safety in MPI Programs (15 minutes)
Lab: Practice Code (10 minutes)
Lab: Practice Solution (10 minutes)

7 assignments, 63 minutes total

Introduction to MPI (9 minutes)
MPI Setup and Communicator Functions (9 minutes)
SPMD and Communication (9 minutes)
Potential Pitfalls (9 minutes)
Simple Serial Sorting Algorithm (9 minutes)
Parallel Odd-Even Transposition Sort (9 minutes)
Safety in MPI Programs (9 minutes)

1 discussion prompt, 30 minutes total

MPI in Action: Understanding Setup, Communication, and Parallel Sorting (30 minutes)

This module introduces the shared-memory programming model through OpenMP. Students will gain exposure to OpenMP's directives and library functions and learn how to use them in code to express shared-memory parallelism. Students will explore the foundational concepts of OpenMP through videos and readings, starting with the basics and progressing to more advanced topics such as reduction clauses, variable scoping, and mutual exclusion. Through worked examples like the Trapezoidal Rule and sorting functions, learners will understand how to parallelise loops, manage scheduling, and apply critical sections and locks for safe concurrent execution. The module also covers tasking in OpenMP and classic concurrency problems like producers and consumers.
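To make the reduction clause concrete, the sketch below applies it to the Trapezoidal Rule mentioned above. It is a minimal illustration under stated assumptions, not the course's code: the integrand `f(x) = x * x` and the function names are hypothetical choices. The `parallel for` directive and `reduction(+:sum)` clause are standard OpenMP.

```c
/* Integrand chosen only for illustration: f(x) = x * x. */
static double f(double x) {
    return x * x;
}

/* Composite trapezoidal rule on [a, b] using n trapezoids. */
double trapezoid(double a, double b, int n) {
    double h = (b - a) / n;
    double sum = (f(a) + f(b)) / 2.0;

    /* Each thread accumulates into a private copy of `sum`;
       OpenMP adds the copies together when the loop finishes,
       so no explicit critical section is needed. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i < n; i++)
        sum += f(a + i * h);

    return h * sum;
}
```

With GCC the pragma takes effect when the file is compiled with `-fopenmp`. Without that flag the pragma is ignored and the loop runs serially with the same result, up to floating-point rounding. For instance, `trapezoid(0.0, 1.0, 1000)` approximates the exact integral 1/3.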

What's included

12 videos, 12 readings, 13 assignments, 1 discussion prompt

12 videos, 94 minutes total

Introduction to OpenMP (5 minutes)
Programming in OpenMP (10 minutes)
Trapezoidal Rule (10 minutes)
Scope of Variables (4 minutes)
Reduction Clause (7 minutes)
Parallel-For Directive and Caveats in Them (8 minutes)
Sorting Functions (20 minutes)
Scheduling (6 minutes)
Producers and Consumers (6 minutes)
Termination, Startup and Atomic Directive (7 minutes)
Critical Sections and Locks (6 minutes)
Tasking (5 minutes)

12 readings, 152 minutes total

Recommended Reading: Introduction to OpenMP (15 minutes)
Recommended Reading: Programming in OpenMP (15 minutes)
Recommended Reading: Trapezoidal Rule (15 minutes)
Recommended Reading: Scope of Variables (15 minutes)
Recommended Reading: Reduction Clause (15 minutes)
Recommended Reading: Parallel-For Directive and Caveats in Them (15 minutes)
Recommended Reading: Sorting Functions (15 minutes)
Recommended Reading: Scheduling (15 minutes)
Recommended Reading: Producers and Consumers (15 minutes)
Recommended Reading: Termination, Startup and Atomic Directive (1 minute)
Recommended Reading: Critical Sections and Locks (1 minute)
Recommended Reading: Tasking (15 minutes)

13 assignments, 168 minutes total

Graded Quiz - Modules 7 and 8 (60 minutes)
Introduction to OpenMP (9 minutes)
Programming in OpenMP (9 minutes)
Trapezoidal Rule (9 minutes)
Scope of Variables (9 minutes)
Reduction Clause (9 minutes)
Parallel-For Directive and Caveats in Them (9 minutes)
Sorting Functions (9 minutes)
Scheduling (9 minutes)
Producers and Consumers (9 minutes)
Termination, Startup and Atomic Directive (9 minutes)
Critical Sections and Locks (9 minutes)
Tasking (9 minutes)

1 discussion prompt, 30 minutes total

Mastering OpenMP: From Parallel Patterns to Synchronisation (30 minutes)

This module will introduce the n-body problem in physics, examining its significance in simulating gravitational interactions among multiple particles. It will explore classical and modern algorithmic approaches to solving the n-body problem, followed by a discussion on their computational complexity. Emphasis will be placed on identifying opportunities for parallelisation, and students will analyse and implement efficient parallel solutions using the programming languages and parallel computing directives covered in the course.
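As a rough guide to where the computational cost comes from, the gravitational force on particle q can be written as below. The notation is an assumption made for illustration (positions s, masses m, gravitational constant G) and may differ from the course's.

```latex
\mathbf{F}_q \;=\; -\,G\, m_q \sum_{\substack{k = 0 \\ k \neq q}}^{n-1}
    m_k \, \frac{\mathbf{s}_q - \mathbf{s}_k}{\lVert \mathbf{s}_q - \mathbf{s}_k \rVert^{3}}
\qquad\text{with}\qquad
\mathbf{f}_{kq} = -\,\mathbf{f}_{qk}
```

Evaluating the sum directly for every particle takes n(n-1) pairwise terms per time step, and each particle's total can be computed independently of the others. Assuming the "basic" and "reduced" solvers in this module's lecture list follow the usual textbook distinction, the reduced solver uses the symmetry on the right to compute each pair once, roughly halving the work to n(n-1)/2 terms at the cost of more coordination between threads.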

What's included

13 videos, 13 readings, 13 assignments, 1 discussion prompt

13 videos, 107 minutes total

Introduction to N-body Problem (8 minutes)
Serial Solutions to the N-body Problem (16 minutes)
Parallelising Strategy (13 minutes)
Parallelising Basic Solver Using OpenMP (9 minutes)
Parallelising Reduced Solver Using OpenMP (11 minutes)
Evaluating OpenMP Performance (5 minutes)
Parallelising Basic Solver Using Pthreads (4 minutes)
Parallelising Basic Solver Using MPI (9 minutes)
Parallelising Reduced Solver Using MPI (9 minutes)
Evaluating MPI Performance (6 minutes)
Parallelising Basic Solver Using CUDA (7 minutes)
Evaluating CUDA Solver and Improving Performance (4 minutes)
Using Shared Memory for Solvers (7 minutes)

13 readings, 195 minutes total

Recommended Reading: Introduction to N-body Problem (15 minutes)
Recommended Reading: Serial Solutions to the N-body Problem (15 minutes)
Recommended Reading: Parallelising Strategy (15 minutes)
Recommended Reading: Parallelising Basic Solver Using OpenMP (15 minutes)
Recommended Reading: Parallelising Reduced Solver Using OpenMP (15 minutes)
Recommended Reading: Evaluating OpenMP Performance (15 minutes)
Recommended Reading: Parallelising Basic Solver Using Pthreads (15 minutes)
Recommended Reading: Parallelising Basic Solver Using MPI (15 minutes)
Recommended Reading: Parallelising Reduced Solver Using MPI (15 minutes)
Recommended Reading: Evaluating MPI Performance (15 minutes)
Recommended Reading: Parallelising Basic Solver Using CUDA (15 minutes)
Recommended Reading: Evaluating CUDA Solver and Improving Performance (15 minutes)
Recommended Reading: Using Shared Memory for Solvers (15 minutes)

13 assignments, 138 minutes total

Introduction to N-body Problem (9 minutes)
Serial Solutions to the N-body Problem (9 minutes)
Parallelising Strategy (9 minutes)
Parallelising Basic Solver Using OpenMP (9 minutes)
Parallelising Reduced Solver Using OpenMP (9 minutes)
Evaluating OpenMP Performance (9 minutes)
Parallelising Basic Solver Using Pthreads (9 minutes)
Parallelising Basic Solver Using MPI (30 minutes)
Parallelising Reduced Solver Using MPI (9 minutes)
Evaluating MPI Performance (9 minutes)
Parallelising Basic Solver Using CUDA (9 minutes)
Evaluating CUDA Solver and Improving Performance (9 minutes)
Using Shared Memory for Solvers (9 minutes)

1 discussion prompt, 30 minutes total

The N-Body Solver: Exploring Parallelism Across Models (30 minutes)

This module focuses on hands-on implementations of the Sample Sort algorithm using OpenMP, Pthreads, MPI, and CUDA. Students will explore the strengths and limitations of each parallel programming model through practical coding exercises. The module includes performance benchmarking and comparative analysis of the implementations to highlight trade-offs in scalability, efficiency, and suitability for different architectures. By the end of the module, students will have a strong grasp of each API and be equipped to make informed decisions about the most appropriate tool for a given parallel computing task.
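For orientation, here is a minimal serial sketch of the Sample Sort idea in C: pick splitters from a sorted sample, distribute the elements into buckets, sort each bucket, and concatenate. It is an illustration under stated assumptions, not the course's implementation. The names, the evenly spaced sampling, and the `OVERSAMPLE` factor are all hypothetical choices. In a parallel version, each bucket would typically be handled by a different thread or process.

```c
#include <stdlib.h>
#include <string.h>

#define OVERSAMPLE 4  /* hypothetical: sample elements taken per bucket */

static int cmp_int(const void *x, const void *y) {
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

/* Bucket index for v: the number of splitters that are <= v. */
static int bucket_of(int v, const int *splitters, int nsplit) {
    int b = 0;
    while (b < nsplit && v >= splitters[b])
        b++;
    return b;
}

/* Sorts a[0..n-1] in ascending order using `buckets` buckets.
   Returns 0 on success, -1 on bad arguments or allocation failure. */
int sample_sort(int *a, int n, int buckets) {
    if (n < 0 || buckets < 1) return -1;
    if (n < 2) return 0;
    if (buckets > n) buckets = n;

    int nsplit = buckets - 1;
    int m = buckets * OVERSAMPLE;      /* sample size, capped at n */
    if (m > n) m = n;

    int *sample    = malloc((size_t)m * sizeof *sample);
    int *splitters = malloc((size_t)buckets * sizeof *splitters);
    int *offset    = calloc((size_t)buckets + 1, sizeof *offset);
    int *cursor    = malloc((size_t)buckets * sizeof *cursor);
    int *tmp       = malloc((size_t)n * sizeof *tmp);
    int status = -1;

    if (sample && splitters && offset && cursor && tmp) {
        /* 1. Take an evenly spaced sample, sort it, pick splitters. */
        for (int i = 0; i < m; i++)
            sample[i] = a[(long)i * n / m];
        qsort(sample, (size_t)m, sizeof *sample, cmp_int);
        for (int i = 0; i < nsplit; i++)
            splitters[i] = sample[(i + 1) * m / buckets];

        /* 2. Count bucket sizes, then prefix-sum into start offsets. */
        for (int i = 0; i < n; i++)
            offset[bucket_of(a[i], splitters, nsplit) + 1]++;
        for (int b = 0; b < buckets; b++)
            offset[b + 1] += offset[b];

        /* 3. Scatter the elements into their buckets. */
        memcpy(cursor, offset, (size_t)buckets * sizeof *cursor);
        for (int i = 0; i < n; i++)
            tmp[cursor[bucket_of(a[i], splitters, nsplit)]++] = a[i];

        /* 4. Sort each bucket independently and copy the result back. */
        for (int b = 0; b < buckets; b++)
            qsort(tmp + offset[b], (size_t)(offset[b + 1] - offset[b]),
                  sizeof *tmp, cmp_int);
        memcpy(a, tmp, (size_t)n * sizeof *a);
        status = 0;
    }

    free(sample); free(splitters); free(offset); free(cursor); free(tmp);
    return status;
}
```

Because every element in bucket b is smaller than every element in bucket b + 1, step 4 needs no merging. How evenly the buckets fill depends on how well the sample represents the data, which is one of the trade-offs a performance comparison would expose.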

What's included

8 videos, 9 readings, 10 assignments, 1 discussion prompt

8 videos, 61 minutes total

Sample Sort and Bucket Sort (10 minutes)
Map (17 minutes)
Implementing Sample Sort Using OpenMP: First Implementation (5 minutes)
Implementing Sample Sort Using OpenMP: Second Implementation (7 minutes)
Implementing Sample Sort Using Pthreads (4 minutes)
Implementing Sample Sort Using MPI (6 minutes)
Implementing Sample Sort Using MPI: Example (5 minutes)
Implementing Sample Sort Using CUDA (7 minutes)

9 readings, 115 minutes total

Recommended Reading: Sample Sort and Bucket Sort (15 minutes)
Recommended Reading: Map (10 minutes)
Recommended Reading: Implementing Sample Sort Using OpenMP: First Implementation (15 minutes)
Recommended Reading: Implementing Sample Sort Using OpenMP: Second Implementation (15 minutes)
Recommended Reading: Implementing Sample Sort Using Pthreads (10 minutes)
Recommended Reading: Implementing Sample Sort Using MPI (15 minutes)
Recommended Reading: Implementing Sample Sort Using MPI: Example (15 minutes)
Recommended Reading: Implementing Sample Sort Using CUDA (10 minutes)
Recommended Reading: Which API? (10 minutes)