For those who want to understand what .to(“cuda”) does
Nowadays, when we talk about deep learning, it is very common to associate its implementation with utilizing GPUs in order to improve performance.
GPUs (Graphics Processing Units) were originally designed to accelerate the rendering of images and 2D/3D graphics. However, due to their capability of performing many operations in parallel, their utility extends beyond that to applications such as deep learning.
The use of GPUs for deep learning models started around the mid to late 2000s and became very popular around 2012 with the emergence of AlexNet. AlexNet, a convolutional neural network designed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. This victory marked a milestone, as it demonstrated the effectiveness of deep neural networks for image classification and the use of GPUs for training large models.
Following this breakthrough, the use of GPUs for deep learning models became increasingly popular, which contributed to the creation of frameworks like PyTorch and TensorFlow.
Nowadays, we just write .to("cuda") in PyTorch to send data to the GPU and expect the training to be accelerated. But how do deep learning algorithms take advantage of the GPU's computing power in practice? Let's find out!
Deep learning architectures like neural networks, CNNs, RNNs, and transformers are basically constructed using mathematical operations such as matrix addition, matrix multiplication, and applying a function to a matrix. Thus, if we find a way to optimize these operations, we can improve the performance of deep learning models.
So, let’s start simple. Imagine you want to add two vectors C = A + B.

A simple implementation of this in C would be:
// N is the number of elements in each vector (assumed to be defined elsewhere)
void AddTwoVectors(float A[], float B[], float C[]) {
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}
As you can see, the computer must iterate over the vectors, adding each pair of elements sequentially, one per iteration. But these operations are independent of each other: the addition of the ith pair of elements does not rely on any other pair. So, what if we could execute these operations concurrently, adding all of the pairs of elements in parallel?
A straightforward approach would be to use CPU multithreading in order to run all of the computation in parallel. However, when it comes to deep learning models, we are dealing with massive vectors, with millions of elements, while a common CPU can only handle around a dozen threads simultaneously. That's when GPUs come into action! Modern GPUs can run millions of threads simultaneously, enhancing the performance of these mathematical operations on massive vectors.
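As an illustration only (this helper is not part of the original example), a multithreaded CPU version could use OpenMP to split the loop iterations across the available CPU threads:
#include <omp.h>

// Hypothetical multithreaded CPU version; compile with -fopenmp.
// OpenMP distributes the loop iterations across the available CPU threads.
void AddTwoVectorsCPUParallel(float A[], float B[], float C[], int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
}
Even with such a sketch, the handful of hardware threads on a typical CPU limits how far this scales, which is exactly the limitation described above.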
GPU vs. CPU comparison
Although CPU computations can be faster than GPU computations for a single operation, the advantage of GPUs lies in their parallelization capabilities. The reason for this is that they are designed with different goals. While a CPU is designed to execute a sequence of operations (a thread) as fast as possible (and can only execute dozens of them simultaneously), a GPU is designed to execute millions of them in parallel (while sacrificing the speed of individual threads).
To illustrate, imagine that the CPU is a Ferrari and the GPU is a bus. If your task is to move one person, the Ferrari (CPU) is the better choice. However, if you are moving several people, even though the Ferrari (CPU) is faster per trip, the bus (GPU) can transport everyone in one go, which ends up faster than the Ferrari traveling the route several times. So CPUs are better suited for handling sequential operations and GPUs for parallel operations.

In order to provide higher parallel capabilities, GPU designs devote more transistors to data processing than to data caching and flow control. CPUs, in contrast, allocate a significant portion of their transistors to caching and flow control in order to optimize single-threaded performance and complex instruction execution.
The figure below illustrates the distribution of chip resources for CPU vs GPU.

CPUs have powerful cores and a more complex cache memory architecture (allocating a significant number of transistors for that). This design enables faster handling of sequential operations. On the other hand, GPUs prioritize having a large number of cores to achieve a higher level of parallelism.
Now that we understand these basic concepts, how can we take advantage of these parallel computation capabilities in practice?
Introduction to CUDA
When you are running a deep learning model, your choice is probably to use some popular Python library such as PyTorch or TensorFlow. However, it is well known that the core of these libraries runs C/C++ code underneath. Also, as we mentioned before, you might use GPUs to speed up processing. That's where CUDA comes in! CUDA stands for Compute Unified Device Architecture, and it is a platform developed by NVIDIA for general-purpose processing on their GPUs. Thus, while DirectX is used by game engines to handle graphical computation, CUDA enables developers to integrate NVIDIA's GPU computational power into their general-purpose software applications, extending beyond just graphics rendering.
In order to implement this, CUDA provides a simple C/C++ based interface (CUDA C/C++) that grants access to the GPU's virtual instruction set and to specific operations (such as moving data between the CPU and the GPU).
Before we go further, let’s understand some basic CUDA Programming concepts and terminology:
- host: refers to the CPU and its memory;
- device: refers to the GPU and its memory;
- kernel: refers to a function that is executed on the device (GPU);
So, in a basic code written using CUDA, the program runs on the host (CPU), sends data to the device (GPU), and launches kernels (functions) to be executed on the device (GPU). These kernels are executed by several threads in parallel. After the execution, the results are transferred back from the device (GPU) to the host (CPU).
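Schematically, and only as a minimal sketch (the array size and the commented-out kernel launch below are hypothetical placeholders), that host-side flow looks like this:
#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    int n = 1000;                          // hypothetical number of elements
    size_t size = n * sizeof(float);

    float *h_A = (float*)malloc(size);     // host (CPU) memory
    for (int i = 0; i < n; i++) h_A[i] = 1.0f;

    float *d_A;
    cudaMalloc(&d_A, size);                // device (GPU) memory

    // 1. send data from the host to the device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

    // 2. launch a kernel (hypothetical name) to be run by n parallel threads
    // SomeKernel<<<1, n>>>(d_A);

    // 3. copy the results back from the device to the host
    cudaMemcpy(h_A, d_A, size, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    free(h_A);
    return 0;
}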
So let’s go back to our problem of adding two vectors:
#include <stdio.h>
// N is the number of elements in each vector (assumed to be defined elsewhere)
void AddTwoVectors(float A[], float B[], float C[]) {
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    ...
    AddTwoVectors(A, B, C);
    ...
}
In CUDA C/C++, programmers can define C/C++ functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads.
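As a minimal sketch (assuming a one-dimensional launch of N threads, with N small enough to fit in a single block, and with A, B, and C already residing in device memory), such a kernel for our vector addition could look like this:
// __global__ marks a kernel: a function that runs on the device (GPU)
__global__ void AddTwoVectors(float A[], float B[], float C[]) {
    int i = threadIdx.x;    // each thread gets its own index
    C[i] = A[i] + B[i];     // no loop: N threads each add one pair of elements
}

// Hypothetical launch from the host: 1 block of N threads
// AddTwoVectors<<<1, N>>>(A, B, C);
Each thread uses its built-in threadIdx.x to pick the single pair of elements it is responsible for, which is how the sequential loop disappears.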