Personal blog written from scratch using Node.js, Bootstrap, and MySQL. https://jrtechs.net

High-performance parallel computing is all the buzz right now, and new technologies such as CUDA are making GPU computing more accessible. However, it is vital to know in which scenarios GPU processing is faster than CPU processing. This post explores several variables that affect CUDA vs. CPU performance.

The full [Jupyter notebook](https://github.com/jrtechs/RandomScripts/blob/master/notebooks/cuda-vs-cpu.ipynb) for this blog post is posted on my GitHub.

For reference, I am using an Nvidia GTX 1060 running CUDA version 10.2 on Linux.

```python
!nvidia-smi
```

```
Wed Jul  1 11:16:12 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1060..  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   49C    P2    26W / 120W |   2808MiB /  3016MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1972      G   /usr/libexec/Xorg                             59MiB |
|    0      2361      G   /usr/libexec/Xorg                            280MiB |
|    0      2485      G   /usr/bin/gnome-shell                         231MiB |
|    0      5777      G   /usr/lib64/firefox/firefox                     2MiB |
|    0     33033      G   /usr/lib64/firefox/firefox                     4MiB |
|    0     37575      G   /usr/lib64/firefox/firefox                   167MiB |
|    0     37626      G   /usr/lib64/firefox/firefox                     2MiB |
|    0     90844      C   /home/jeff/Documents/python/ml/bin/python   1881MiB |
+-----------------------------------------------------------------------------+
```

The first thing we can do is write a function that measures how fast we can compute the sine of every element in a matrix.

```python
import time  # times are in seconds

import torch


def time_torch(size):
    # Create the matrix directly on the GPU, then time the in-place sine.
    x = torch.rand(size, size, device=torch.device("cuda"))
    start = time.time()
    x.sin_()
    end = time.time()
    return end - start


def time_cpu(size):
    # Same measurement with the matrix kept in main memory.
    x = torch.rand(size, size, device=torch.device("cpu"))
    start = time.time()
    x.sin_()
    end = time.time()
    return end - start
```

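One caveat when timing GPU code with a wall clock: CUDA kernels are launched asynchronously, so a call can return before the GPU has finished the work, and a measurement taken right after it may understate the real cost. The sketch below shows one way to account for that. The names `time_call`, `sync`, and `repeats` are hypothetical and not part of the original notebook; `sync` is assumed to be any function that blocks until pending device work is done, such as `torch.cuda.synchronize`.

```python
import statistics
import time


def time_call(fn, sync=None, repeats=5):
    """Return the median wall-clock time (seconds) of fn() over several runs.

    sync is an optional zero-argument function that blocks until all
    pending device work has finished (hypothetical hook; with PyTorch one
    would pass torch.cuda.synchronize). It is called before the clock
    starts and again before the clock stops.
    """
    samples = []
    for _ in range(repeats):
        if sync is not None:
            sync()  # drain earlier work so it is not billed to fn
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by fn to finish
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# CPU-only demonstration: no sync function is needed here.
elapsed = time_call(lambda: sum(i * i for i in range(10_000)))
print(f"median time: {elapsed:.6f} s")
```

With PyTorch on a CUDA device, the GPU measurement could be written as `time_call(lambda: x.sin_(), sync=torch.cuda.synchronize)`. Taking the median of several runs also dampens one-off outliers.
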
To make this easier to graph, we will create a wrapper function that generates two lists: one for the CPU times and the other for the CUDA times.

```python
def get_cuda_cpu_times(sizes):
    # Time the same operation on both devices for every matrix width.
    cpuTimes = []
    cudaTimes = []
    for s in sizes:
        cpuTimes.append(time_cpu(s))
        cudaTimes.append(time_torch(s))
    return cpuTimes, cudaTimes
```

Some vanilla Matplotlib code can be used to plot the CUDA vs. CPU performance. Note: a lower execution time is better.

```python
import matplotlib.pyplot as plt


def plot_cuda_vs_cpu(cpuTimes, cudaTimes, sizes, xLab="Matrix Width"):
    # Draw both timing curves on the same axes.
    plt.title("CUDA vs CPU")
    plt.plot(sizes, cpuTimes, label="CPU")
    plt.plot(sizes, cudaTimes, label="CUDA")
    plt.legend(bbox_to_anchor=(0.8, 0.98), loc='upper left', borderaxespad=0.)
    plt.xlabel(xLab)
    plt.ylabel('Execution Time (Seconds)')
    plt.show()


sizes = range(1, 50, 1)
cpu_t, cuda_t = get_cuda_cpu_times(sizes)
plot_cuda_vs_cpu(cpu_t, cuda_t, sizes)
```

![png](media/cuda-performance/output_5_0.png)

```python
sizes = range(1, 5000, 100)
cpu_t, cuda_t = get_cuda_cpu_times(sizes)
plot_cuda_vs_cpu(cpu_t, cuda_t, sizes)
```

![png](media/cuda-performance/output_6_0.png)

It is interesting to note that the CPU is faster for small matrices, whereas for larger matrices, CUDA outperforms the CPU by a large margin.

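To pin down where the two curves cross, the timing lists can be scanned for the first matrix width from which CUDA stays ahead. The sketch below assumes lists shaped like those returned by `get_cuda_cpu_times`. The helper `find_crossover` is hypothetical, and the sample numbers are made up for illustration; they are not measurements.

```python
def find_crossover(sizes, cpu_times, cuda_times):
    """Return the first size from which CUDA stays faster than the CPU.

    Returns None if CUDA never takes a lasting lead.
    """
    crossover = None
    for size, cpu_t, cuda_t in zip(sizes, cpu_times, cuda_times):
        if cuda_t < cpu_t:
            if crossover is None:
                crossover = size  # CUDA just took the lead
        else:
            crossover = None  # the CPU won again, so reset
    return crossover


# Made-up timings purely for illustration (not measured values).
example_sizes = [1, 101, 201, 301, 401]
example_cpu = [0.00001, 0.0002, 0.0008, 0.0018, 0.0032]
example_cuda = [0.00005, 0.0003, 0.0004, 0.0005, 0.0006]
print(find_crossover(example_sizes, example_cpu, example_cuda))  # prints 201
```

On real data, the call would be `find_crossover(sizes, cpu_t, cuda_t)`. Requiring a lasting lead avoids reporting a crossover caused by a single noisy sample.
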
At this scale, it looks like the CUDA times are not increasing, but if we plot only the CUDA times, we can see that they also increase linearly.

```python
plt.plot(sizes, cuda_t, label="CUDA")
plt.legend(bbox_to_anchor=(0.8, 0.98), loc='upper left', borderaxespad=0.)
plt.xlabel('Matrix Size')
plt.ylabel('Execution Time (Seconds)')
plt.show()
```

![png](media/cuda-performance/output_7_0.png)

It is useful to know that on larger matrices, the GPU outperforms the CPU, but that doesn't tell the whole story. There are reasons why we don't run everything on the GPU.

It takes time to copy data between the GPU's memory and main memory (RAM).

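A rough sense of that copy cost comes from the size of the matrix itself. The sketch below estimates how many bytes a square matrix occupies and how long one copy would take at a given bandwidth. The function names are hypothetical, and the bandwidth figure is an assumed, illustrative value rather than a measurement of the GTX 1060 used here; real transfers also pay latency and driver overhead that this ignores.

```python
BYTES_PER_FLOAT32 = 4  # torch.rand creates float32 tensors by default


def matrix_bytes(size):
    """Memory footprint of a size x size float32 matrix, in bytes."""
    return size * size * BYTES_PER_FLOAT32


def estimated_copy_seconds(size, bandwidth_gb_per_s):
    """Idealised one-way copy time, ignoring latency and other overhead."""
    return matrix_bytes(size) / (bandwidth_gb_per_s * 1e9)


# ASSUMPTION: 12 GB/s is an illustrative host-to-device bandwidth,
# not a measured figure for the hardware used in this post.
assumed_bandwidth = 12.0
for size in (100, 1000, 5000):
    mb = matrix_bytes(size) / 1e6
    ms = estimated_copy_seconds(size, assumed_bandwidth) * 1e3
    print(f"{size:>5} x {size:<5} {mb:8.2f} MB  ~{ms:.3f} ms per copy")
```

The footprint grows with the square of the matrix width, so the copy cost climbs quickly as the matrices get larger.
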
The following code is similar to what we did previously, but this time, we initialize the matrix in main memory and then transfer it to the GPU to perform the computation.

```python
import time  # times are in seconds


def time_torch_copy(size):
    # Create the matrix in main memory; the timed region includes the
    # copy to the GPU as well as the sine computation.
    x = torch.rand(size, size, device=torch.device("cpu"))
    start = time.time()
    x = x.cuda()
    x.sin_()
    end = time.time()
    return end - start


def get_cuda_cpu_times_with_copy(sizes):
    cpuTimes = []
    cudaTimes = []
    for s in sizes:
        cpuTimes.append(time_cpu(s))
        cudaTimes.append(time_torch_copy(s))
    return cpuTimes, cudaTimes


sizes = range(1, 5000, 100)
cpu_t, cuda_t = get_cuda_cpu_times_with_copy(sizes)
plot_cuda_vs_cpu(cpu_t, cuda_t, sizes)
```

![png](media/cuda-performance/output_9_0.png)

Once the copy to the GPU is included in the measurement, the CUDA and CPU execution times scale almost identically with matrix size.

However, in real-world applications, we don't just leave the data sitting on the GPU: we also need to copy it back to main memory.

This test initializes the matrix in main memory, copies it to the GPU for the computation, and then copies the result back to main memory.

```python
import time  # times are in seconds


def time_torch_copy_and_back(size):
    # Timed region: copy to the GPU, compute, and copy the result back.
    x = torch.rand(size, size, device=torch.device("cpu"))
    start = time.time()
    x = x.cuda()
    x.sin_()
    x = x.cpu()
    end = time.time()
    return end - start


# Renamed so it no longer shadows the copy-only wrapper defined earlier.
def get_cuda_cpu_times_with_copy_and_back(sizes):
    cpuTimes = []
    cudaTimes = []
    for s in sizes:
        cpuTimes.append(time_cpu(s))
        cudaTimes.append(time_torch_copy_and_back(s))
    return cpuTimes, cudaTimes


sizes = range(1, 5000, 100)
cpu_t, cuda_t = get_cuda_cpu_times_with_copy_and_back(sizes)
plot_cuda_vs_cpu(cpu_t, cuda_t, sizes)
```

![png](media/cuda-performance/output_10_0.png)

In this trial, CUDA is slower than just running on the CPU by a significant margin.

In the previous trial, we copied the matrix to the GPU to perform a single operation. In the next trial, we vary the number of operations performed on the same matrix.

```python
def time_torch_operation_repetition(size, iterations):
    # Copy the matrix to the GPU once, then repeat the operation there.
    x = torch.rand(size, size, device=torch.device("cpu"))
    start = time.time()
    x = x.cuda()
    for _ in range(iterations):
        x.sin()  # not in-place: the result is computed and discarded
    end = time.time()
    return end - start


def time_cpu_operation_repetition(size, iterations):
    # Same repeated operation with the matrix kept in main memory.
    x = torch.rand(size, size, device=torch.device("cpu"))
    start = time.time()
    for _ in range(iterations):
        x.sin()
    end = time.time()
    return end - start


def get_cuda_cpu_times_with_iterations(iterations):
    # The matrix width is fixed at 300; only the operation count varies.
    cpuTimes = []
    cudaTimes = []
    for i in iterations:
        cpuTimes.append(time_cpu_operation_repetition(300, i))
        cudaTimes.append(time_torch_operation_repetition(300, i))
    return cpuTimes, cudaTimes


iterations = range(1, 500, 5)
cpu_t, cuda_t = get_cuda_cpu_times_with_iterations(iterations)
plot_cuda_vs_cpu(cpu_t, cuda_t, iterations, xLab="Number of Operations")
```

![](media/cuda-performance/output_14_0.png)

As this trial shows, performing more consecutive operations on the matrix without changing devices yields significant performance benefits for CUDA.

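This amortization can be captured with a simple cost model: the GPU pays a one-time transfer cost plus a small cost per operation, while the CPU pays only a larger cost per operation. The sketch below solves for the number of operations at which the GPU starts to win. The model, the function name `break_even_operations`, and the sample costs are assumptions for illustration, not values measured in this post.

```python
import math


def break_even_operations(transfer_cost, cpu_op_cost, gpu_op_cost):
    """Smallest operation count at which the GPU total is strictly lower.

    Assumed model, with all costs in the same time unit:
        cpu_total = n * cpu_op_cost
        gpu_total = transfer_cost + n * gpu_op_cost
    Returns None when the GPU is not faster per operation.
    """
    if gpu_op_cost >= cpu_op_cost:
        return None
    # Solve transfer_cost + n * gpu_op_cost < n * cpu_op_cost for integer n.
    return math.floor(transfer_cost / (cpu_op_cost - gpu_op_cost)) + 1


# Illustrative costs in milliseconds (made up, not measured).
n = break_even_operations(transfer_cost=2.0, cpu_op_cost=0.5, gpu_op_cost=0.25)
print(f"GPU becomes faster from {n} operations onward")
```

Under this model, the larger the transfer cost relative to the per-operation saving, the more work has to stay on the GPU before the copy pays for itself.
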
As we have seen, whether GPU or CPU computing will be faster isn't always clear-cut.

The CPU is very good at performing individual tasks quickly, but it is not as good at performing a large number of parallel computations, which is where GPU computing excels.

IO is another driving factor in whether GPU or CPU computing will be faster.

If the program has a lot of IO bottlenecks, then CPU computing may be faster.

When designing an application that leverages GPU processing, it is essential to limit the number of data transfers between the GPU and main memory.

![](media/cuda-performance/brother.jpg)