Personal blog written from scratch using Node.js, Bootstrap, and MySQL. https://jrtechs.net

379 lines
15 KiB

  1. More cores = better performance, simple maths.
  2. Although it makes intuitive sense that more processing power would be advantageous, that is not always the case.
  3. There are multiple implementation paradigms and libraries for making a program run parallel; you have parallel streams, threads, and thread pools.
  4. However, there is no clear cut winner because many factors affect parallel program performance: the number of tasks, IO, amount of work in each task.
  5. This post aims to review the significant paradigms and factors affecting multi-threaded performance by looking at empirical data generated in Java. Additionally, topics like NQ and Amdahl's law are examined within the context of the experiments presented.
  6. # Single Threaded
  7. To gain ground for comparison, we establish a baseline task to get performed. We want to execute a list of tasks and store the results in a collection. On a single thread, this is merely a traversal with a mapping.
  8. ```java
  9. @Override
  10. public List<E> runTasks(Vector<Work<E>> tasks)
  11. {
  12. return tasks.stream()
  13. .map(Work::runTask)
  14. .collect(Collectors.toList());
  15. }
  16. ```
  17. I am using the lambda streaming notation due to its conciseness for list processing; if you are not familiar with lambda statements, I recommend checking out my latest [blog post on functional Java](https://jrtechs.net/java/fun-with-functional-java) on it.
  18. # Threads
  19. To execute multiple tasks, spawning a thread for each job would appear reasonable.
  20. After all the threads have finished running, we aggregate and return the results. Note: we are heavily using vectors in this code because vectors are thread-safe.
  21. ```java
  22. @Override
  23. public List<E> runTasks(Vector<Work<E>> tasks)
  24. {
  25. List<E> results = new Vector<>();
  26. List<Thread> threads = tasks.stream()
  27. .map(task ->
  28. new Thread(() -> results.add(task.runTask())))
  29. .collect(Collectors.toList());
  30. threads.forEach(Thread::start);
  31. threads.forEach(t-> {
  32. try {
  33. t.join();
  34. } catch (InterruptedException e) {
  35. e.printStackTrace();
  36. }
  37. });
  38. return results;
  39. }
  40. ```
  41. # Thread Pooling
  42. This method extends the thread method but aims to cut down on the number of threads spawned.
  43. Instead, the manager spawns N threads to execute tasks from a queue. Ideally, the number of threads is equal to the number of processors on your computer; this enables you to fully utilize the CPU while avoiding overhead associated with swapping running processes.
  44. The thread pool code was inspired by my blog post [Multi Threaded File IO](https://jrtechs.net/programming/multi-threaded-file-io).
  45. We are using a thread-safe queue to track which tasks we still need to run.
  46. Each thread spawned will sequentially run tasks from the queue and finish once the task queue is empty.
  47. ```java
  48. public List<E> runTasks(Vector<Work<E>> tasks)
  49. {
  50. List<E> results = new Vector<>();
  51. Queue<Integer> taskQueue = new LinkedList<>();
  52. taskQueue.addAll(IntStream.range(0, tasks.size())
  53. .boxed().collect(Collectors.toList()));
  54. int desiredThreads = Math.min(threadCount, tasks.size());
  55. Thread[] runners = new Thread[desiredThreads];
  56. for(int i = 0; i < desiredThreads; i++)
  57. {
  58. runners[i] = new Thread(()->
  59. {
  60. Work<E> t;
  61. while(true)
  62. {
  63. Integer nextTask;
  64. synchronized (taskQueue)
  65. {
  66. nextTask = taskQueue.poll();
  67. }
  68. if(nextTask == null)
  69. return;
  70. results.add(tasks.get(nextTask).runTask());
  71. }
  72. });
  73. runners[i].start();
  74. }
  75. for(Thread t: runners)
  76. {
  77. try
  78. {
  79. t.join();
  80. } catch (InterruptedException e) {
  81. e.printStackTrace();
  82. }
  83. }
  84. return results;
  85. }
  86. ```
  87. Java has some method already written that manages running thread pools.
  88. The version that Java implemented returns Futures, which are nice because you can use them to handle exceptions.
  89. ```java
  90. @Override
  91. public List<E> runTasks(Vector<Work<E>> tasks)
  92. {
  93. ExecutorService executor = Executors.newCachedThreadPool();
  94. List<Callable<E>> callables = new ArrayList<Callable<E>>();
  95. for(Work<E> work: tasks)
  96. {
  97. Callable<E> c = new Callable<E>() {
  98. @Override
  99. public E call() throws Exception {
  100. return work.runTask();
  101. }
  102. };
  103. callables.add(c);
  104. }
  105. List<E> results = new ArrayList<>();
  106. try
  107. {
  108. List<Future<E>> futures = executor.invokeAll(callables);
  109. for(Future<E> future: futures)
  110. {
  111. try {
  112. results.add(future.get());
  113. } catch (ExecutionException e) {
  114. e.printStackTrace();
  115. }
  116. }
  117. }
  118. catch (InterruptedException e)
  119. {
  120. e.printStackTrace();
  121. }
  122. return results;
  123. }
  124. ```
  125. # Parallel Streams
  126. First implemented in Java 1.8, Streams is a part of Java's push towards more functional syntax operators.
  127. Parallel streams use an approach similar to a thread pool to execute tasks from a stream in parallel.
  128. ```java
  129. @Override
  130. public List<E> runTasks(Vector<Work<E>> tasks)
  131. {
  132. return tasks.parallelStream()
  133. .map(Work::runTask)
  134. .collect(Collectors.toList());
  135. }
  136. ```
  137. # Comparison
  138. To neatly test all the different implementations, I wrote a class that times each method's performance using the same tasks.
  139. Initially, this was simply a static method, but I refactored it to be an entire class because I wanted to work with any generic type that I define for my Work object.
  140. ```java
  141. public class GenericTester<E>
  142. {
  143. public long timeTrialMS(ParallelExecutor<E> executor, Vector<Work<E>> tasks)
  144. {
  145. long start = System.nanoTime();
  146. executor.runTasks(tasks);
  147. long finish = System.nanoTime();
  148. return (finish-start)/1000000;
  149. }
  150. public Result testAll(Vector<Work<E>> tasks)
  151. {
  152. ParallelExecutor<E> streams = new ParallelStreamsExecutor<>();
  153. ParallelExecutor<E> threads = new RunThreads<>();
  154. ParallelExecutor<E> manager = new Manager<>(8);
  155. ParallelExecutor<E> single = new SingleThread<>();
  156. ParallelExecutor<E> pool = new ThreadPoolExecutor<>();
  157. Result res = new Result();
  158. res.streams = timeTrialMS(streams, tasks);
  159. res.manager = timeTrialMS(manager, tasks);
  160. res.threads = timeTrialMS(threads, tasks);
  161. res.pool = timeTrialMS(pool, tasks);
  162. res.singleThread = timeTrialMS(single, tasks);
  163. return res;
  164. }
  165. }
  166. ```
  167. Since I'm not a big fan of Java plotting libraries, I graphed the results using Python and Matplotlib.
  168. ```python
  169. import matplotlib.pyplot as plt
  170. def plot_result(single, threads, manager, streams, sizes, xLab="Tasks", yLab="Execution Time (MS)", title="Execution Times"):
  171. plt.title(title)
  172. if len(sizes) == len(single):
  173. plt.plot(sizes, single, label="Single Threaded")
  174. if len(sizes) == len(threads):
  175. plt.plot(sizes, threads, label="Vanilla Threads")
  176. plt.plot(sizes, manager, label="Parallel Task Queue")
  177. plt.plot(sizes, streams, label="Parallel Stream")
  178. plt.legend(bbox_to_anchor=(0.6, 0.95), loc='upper left', borderaxespad=0.)
  179. plt.xlabel(xLab)
  180. plt.ylabel(yLab)
  181. plt.show()
  182. ```
  183. ## Overhead With Implementations
  184. This task simply returns the value passed into it-- in the "real world," you would never do this. This test gives us a gauge of how much overhead is associated with running a task in each method.
  185. ```java
  186. public class DoNothing<E> extends WorkGenerator<E>
  187. {
  188. @Override
  189. Work<E> generateWork(E param) {
  190. return new Work<E>() {
  191. @Override
  192. E runTask() {
  193. return param;
  194. }
  195. };
  196. }
  197. }
  198. ```
  199. ![Overhead large](media/parallel-java/overhead.png)
  200. Based on the results, we can see that spawning an entire thread for each task is not the most efficient method.
  201. The overhead associated with spawning a thread is much more arduous than the other methods -- including the single-threaded approach.
  202. After removing the threaded approach, the graph looks like this:
  203. ![Overhead small](media/parallel-java/operational_small.png)
  204. ## Tasks that Sleep
  205. This test just bites time by putting the thread to sleep.
  206. Similar to the initial task, this also returns the parameter that got passed into it.
  207. ```java
  208. public class SleepWork<E> extends WorkGenerator<E>
  209. {
  210. @Override
  211. Work<E> generateWork(E param) {
  212. return new Work<E>() {
  213. @Override
  214. E runTask() {
  215. try {
  216. Thread.sleep(500);
  217. } catch (InterruptedException e) {
  218. e.printStackTrace();
  219. }
  220. return param;
  221. }
  222. };
  223. }
  224. }
  225. ```
  226. ![Sleeping tasks performance](media/parallel-java/sleep.png)
  227. The single-threaded implementation ran the slowest since it only had one thread to put to sleep N times.
  228. However, an interesting side effect of this test is that the inefficient thread's method ran faster than the thread pool methods.
  229. In the threaded method, each thread could theoretically sleep simultaneously; however, there were very few threads to sleep on in the thread pooled approach.
  230. This experiment illustrates how IO-bound tasks may perform better on a thread rather than a thread pool and that we shouldn't abject the 1-1 thread approach for parallel task processing.
  231. ## Tasks Doing Arithmetic
  232. To test the performance of doing arithmetic (or "real work"), I generated an obnoxiously confusing math statement to execute.
  233. Note Math.random() takes a severe performance hit when running in a multi-threaded environment.
  234. For multi-threaded code, it is better to use the ThreadLocalRandom class.
  235. ```java
  236. public class DoMaths<E> extends WorkGenerator<Double>
  237. {
  238. @Override
  239. Work<Double> generateWork(Double param) {
  240. return new Work<Double>() {
  241. @Override
  242. Double runTask()
  243. {
  244. return IntStream.range(0, (int)Math.round(param))
  245. .boxed()
  246. .map(i -> Math.sin(i * ThreadLocalRandom.current().nextDouble()))
  247. .mapToDouble(java.lang.Double::doubleValue)
  248. .sum();
  249. }
  250. };
  251. }
  252. }
  253. ```
  254. ![Arithmatic test Parallel](media/parallel-java/math_small.png)
  255. The reason why the 1-1 thread method was slower than the single-threaded implementation is arduous to explain.
  256. Even if there was some price X to pay for thread maintenance on each task, if the job takes an ample amount of time, then the overhead price will be worth it-- this has gotten coined in the field as the NQ model.
  257. $$
  258. NQ > 10,000
  259. $$
  260. where:
  261. - N = number of data items
  262. - Q = amount of work per item.
  263. The NQ model is often used to determine whether parallelism will offer a speedup.
  264. N and Q work with each other: for problems with trivially small Q, you would need a much larger N to parallel approaches worth it.
  265. NQ is generally a useful heuristic because small tasks or tasks that don't divide nicely, doing parallel computing, just adds overhead that you may not make up for by utilizing multiple cores.
  266. In the latter arithmetic example, an N of 1000 was not enough to make parallelization using the threaded method advantageous.
  267. ![](media/parallel-java/math_big.png)
  268. Even the less optimal approach saw significant performance gains over the single-threaded program after our N and Q were sufficiently large.
  269. The results lead us into a discussion about Amdahl's Law:
  270. $$
  271. S_{\text{latency}}(s)={\frac {1}{(1-p)+{\frac {p}{s}}}}
  272. $$
  273. where
  274. - $S_{\text{latency}}$ is the theoretical speedup of the execution of the whole task.
  275. - s is the speedup of the part of the job that benefits from improved system resources;
  276. - p is the proportion of execution time that the segment was benefiting from improved resources initially occupied.
  277. Amdahl's law states that the sequential part of your program bounds the max increase in performance you can get using parallel computing. IE: even if you made 95% of your application parallelized, the max performance increase you can achieve is 20x. However, if the program were somehow 100% parallel, the speedup would be directly correlated to the number of processors you had.
  278. ![Amdahls Law](media/parallel-java/AmdahlsLaw.png)
  279. *Graph Courtesy of [Wikipedia](https://en.wikipedia.org/wiki/Amdahl%27s_law) [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0)*
  280. Looking at our program, we can figure out the max theoretical speed up we could achieve.
  281. First, we need to calculate the performance increase we saw initially. I'm taking the times at N=10,000.
  282. - single: 3598 ms
  283. - thread manager: 533 ms
  284. - threaded: 1358 ms
  285. After doing a simple ratio, we see that the thread pool implementations had a 6.75x performance increase, and the threaded version saw a 2.65x performance increase.
  286. Using the 6.75x as our $S_{\text{latency}}$, and eight as our S (I have eight processors in my CPU), we can calculate P in our equation. P will tell us the portion of the entire program that runs in parallel.
  287. $$
  288. S_{\text{latency}}(s)={\frac {1}{(1-p)+{\frac {p}{s}}}}\\
  289. 6.75=\frac{1}{1-p+\frac{p}{8}}\\
  290. 1 = 6.75 - 6.75p + \frac{6.75}{8}p\\
  291. p = 0.98
  292. $$
  293. After some algebra, we get a P-value of 0.98: 98% of the program is ran in parallel, 2% is sequential.
  294. To figure out the max theoretical speedup, if we had a gratuitous amount of threads, we take the limit of Amdahl's law as S (our processor count) approaches infinity.
  295. $$
  296. {\displaystyle {\begin{cases}S_{\text{latency}}(s)\leq {\dfrac {1}{1-p}}\\[8pt]\lim \limits _{s\to \infty }S_{\text{latency}}(s)={\dfrac {1}{1-p}}.\end{cases}}}
  297. $$
  298. Using our value of 0.98 for P, we can see that the max parallel performance increase we can accomplish is 50x. Neet. Why is this important, you might ask? Well, Amdahl's law illustrates the law of diminishing returns. The first few cores that we add will be more beneficial than the last few threads.
  299. The following graph graphs Amdahl's law with our P-value of 0.98:
  300. ![](media/parallel-java/notLog.png)
  301. In the original Amdahls graph, the X-axis was logarithmic.
  302. If we plot the entire graph in base 10, it is apparent how the law of diminishing returns plays a role in parallelization.
  303. If I was obtaining a computer to run this program, I might decide that it is best to stop around 50 cores. Although this would only yield a 25x performance increase, you would need thousands of more threads to get closer to the theoretical 50x performance increase.
  304. # Takeaways
  305. - Thread pools are almost always more efficient than spawning a thread for each task, unless the functions are heavily IO bound and have blocked calls.
  306. - Different thread pool implementations such as Java's Parallel Streams and ManagedThreadPool will yield similar big-O complexities.
  307. - Parallelization is only faster if you have a sufficient amount of sufficiently large tasks to complete in parallel-- the law of NQ.
  308. - Amdahl's law can compute your program's theoretical speed with more processors.