Personal blog written from scratch using Node.js, Bootstrap, and MySQL. https://jrtechs.net

# Background

Suppose that you have a large quantity of files that you want your program to ingest.
Is it faster to read each file sequentially, or to read the files in parallel using multiple threads?
Browsing this question online gives varied answers from confused people on discussion threads.
Reading recent research papers on multi-threaded IO did not quite provide a clear answer to my question.

A ton of people argue that multiple threads don't increase file throughput
because at the end of the day an HDD can only read
one file at a time. Adding more threads into the mix would
actually slow down the file ingest because the HDD would have to
take turns between reading fragments of different files. The seek time of the HDD would
heavily degrade the performance.

Other people on the internet claim that using the same number of
threads as your CPU has is the most efficient way to read in files.
I did not find any evidence backing these claims.
However, I did find plenty of dead links -- you gotta love old internet forums.
The argument typically goes that using more threads will decrease the idle time of
the HDD. Plus, if you are running any form of RAID or software-defined storage like
[CEPH](https://ceph.com/ceph-storage/) you have plenty to gain from using multiple threads.
If you use more threads than your CPU has, the argument continues, performance suffers because
the extra threads sit idle while they wait for a turn on the CPU.

I decided to set out and determine the fastest approach by taking empirical measurements.

# Results

## Small Files

I started the experiment by reading 500 randomly generated
10 KB text files. I tested using a few hard drive types and
configurations. All of the tests ran on a computer with
a Ryzen 1700x processor (8 cores, 16 threads).
Each data point is the average of 20 trials.

![slow HDD reading 500 10 KB files](media/fileIO/100ThreadsOldSpinningHHD.png)

![slow HDD reading 500 10 KB files](media/fileIO/slowDrive500.png)

![RAIDZ2 500 10 KB files](media/fileIO/FreenasServer.png)

![NVMe SSD](media/fileIO/nvmeDrive.png)

![NVMe SSD](media/fileIO/nvme500Threads.png)

## Big Files

After seeing the astonishing results of reading 10 KB files,
I wondered if anything changed when the files were larger.
Unlike the last set of trials, I did not average multiple trials
for each data point. Although it would have made smoother graphs,
that would have taken way too long to do on the HDD I was using.
I only tested with 1 MB and 3.6 MB as the "large" files because any larger
files *really* slowed down the tests. I initially wanted to test with 18 MB
files running 10 trials at each thread count. I backed away from doing that
because it would have required terabytes of file IO and Java *really* does
not like to read 18 MB files as strings.

![1MB files](media/fileIO/1mbFilesx124OldHHD.png)

![3.6MB files](media/fileIO/3.6MBFilex31.png)

# Conclusions

For small files it is clear that using more CPU threads gives you a
very substantial boost in performance. However, once you use the same
number of threads as your CPU has, you hit the optimal performance and any more
threads decrease performance.
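
In Java the number of hardware threads can be queried at run time instead of being hard-coded. The following is only a minimal sketch of that idea for the small-file case; the `ThreadCountChooser` class and the `pickThreadCount` method are made-up names for illustration and are not part of the benchmark code.

```java
/** Hypothetical helper: picks a thread count for reading many small files. */
class ThreadCountChooser
{
    /**
     * Returns one thread per hardware thread, but never more
     * threads than there are files and never fewer than one.
     */
    static int pickThreadCount(int fileCount)
    {
        // logical processors visible to the JVM
        // (16 on the Ryzen 1700x used here, if the JVM sees all of them)
        int hardwareThreads = Runtime.getRuntime().availableProcessors();
        return Math.max(1, Math.min(hardwareThreads, fileCount));
    }
}
```
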

For larger files, more threads do not equate to better performance. Although
there is a small bump in performance using two threads, any more threads heavily
degrade performance.

![CPU Usage During Big File Test](media/fileIO/bigFileCPU.png)

Reading large files like this renders your computer totally unusable.
While I was running the 3.6 MB file test my computer was completely unresponsive.
I had to fight just to get the screenshot that I included in this blog post.

# Code

All of the code I wrote for this can be found on my [GitHub](https://github.com/jrtechs/RandomScripts/tree/master/multiThreadedFileIO);
however, I also want to put it here so I can make a few remarks about it.

## Basic File IO

This first class represents a task which we want to run in parallel with other tasks. In this case it is just
reading a file from the disk. Nothing exciting about this file.
```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

/**
 * Simple task to be used by the task manager to do
 * file io.
 *
 * @author Jeffery Russell
 */
public class ReadTask
{
    /** Path of the file this task reads */
    private String filePath;

    public ReadTask(String filePath)
    {
        this.filePath = filePath;
    }

    /** Reads the whole file line by line; the content is discarded */
    public void runTask()
    {
        String fileContent = new String();
        // try-with-resources closes the reader even if a read fails
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(filePath))))
        {
            String line;
            while ((line = br.readLine()) != null)
                fileContent = fileContent.concat(line);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
```
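
The `concat` call in the loop copies the whole string on every line, so part of the measured time is CPU work rather than disk IO. As a point of comparison, here is a rough sketch of a task that only pulls the raw bytes into memory. It is not the code used for the measurements in this post, and `ByteReadTask` is a hypothetical name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical variant of ReadTask that reads raw bytes instead of building a String. */
class ByteReadTask
{
    private final String filePath;

    ByteReadTask(String filePath)
    {
        this.filePath = filePath;
    }

    /** Reads the whole file into memory; returns the byte count, or -1 on failure. */
    long runTask()
    {
        try
        {
            // Files.readAllBytes opens, reads and closes the file in one call
            byte[] content = Files.readAllBytes(Paths.get(filePath));
            return content.length;
        }
        catch (IOException e)
        {
            e.printStackTrace();
            return -1;
        }
    }
}
```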

## Multi Threaded Task Manager

This is where the exciting stuff happens. Essentially, this class lets you specify
a list of tasks which must be completed.
This class will then run all of those tasks in parallel using X threads
until they are all complete. The interesting part is how it avoids race conditions
while staying efficient: a thread holds the lock only while it takes the next task off the
shared list, not while it runs the task.
```java
import java.util.List;
import java.util.Vector;

/**
 * A class which enables user to run a large chunk of
 * tasks in parallel efficiently.
 *
 * @author Jeffery 1-29-19
 */
public class TaskManager
{
    /** Number of threads to use at once */
    private int threadCount;

    /** Meaningless tasks to run in parallel */
    private List<ReadTask> tasks;

    public TaskManager(int threadCount)
    {
        this.threadCount = threadCount;
        //using vectors because they are thread safe
        this.tasks = new Vector<>();
    }

    public void addTask(ReadTask t)
    {
        tasks.add(t);
    }

    /**
     * This is the fun method.
     *
     * This will run all of the tasks in parallel using the
     * desired amount of threads until all of the jobs are
     * complete.
     */
    public void runTasks()
    {
        int desiredThreads = threadCount > tasks.size() ?
                tasks.size() : threadCount;
        Thread[] runners = new Thread[desiredThreads];
        for(int i = 0; i < desiredThreads; i++)
        {
            runners[i] = new Thread(()->
            {
                ReadTask t = null;
                while(true)
                {
                    //need synchronized block to prevent
                    //race condition between isEmpty and remove
                    synchronized (tasks)
                    {
                        if(!tasks.isEmpty())
                            t = tasks.remove(0);
                    }
                    if(t == null)
                    {
                        break;
                    }
                    else
                    {
                        t.runTask();
                        t = null;
                    }
                }
            });
            runners[i].start();
        }
        //wait for every runner to finish before returning
        for(int i = 0; i < desiredThreads; i++)
        {
            try
            {
                runners[i].join();
            }
            catch (InterruptedException e)
            {
                e.printStackTrace();
            }
        }
    }
}
```
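
The standard library can also manage the work queue. Below is a rough sketch of the same idea built on a fixed thread pool from `java.util.concurrent`, assuming a pool is acceptable in place of the hand-rolled runner loop. `PooledTaskManager` is a hypothetical name, and it takes plain `Runnable` tasks so the block stands on its own; it was not used for the measurements above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical alternative to TaskManager built on a fixed thread pool. */
class PooledTaskManager
{
    private final int threadCount;
    private final List<Runnable> tasks = new ArrayList<>();

    PooledTaskManager(int threadCount)
    {
        this.threadCount = threadCount;
    }

    void addTask(Runnable t)
    {
        tasks.add(t);
    }

    /** Runs every queued task on the pool and blocks until all of them finish. */
    void runTasks() throws InterruptedException
    {
        if (tasks.isEmpty())
            return;
        // never start more threads than there are tasks
        int desiredThreads = Math.max(1, Math.min(threadCount, tasks.size()));
        ExecutorService pool = Executors.newFixedThreadPool(desiredThreads);
        for (Runnable t : tasks)
            pool.submit(t);
        tasks.clear();
        // stop accepting new work, then wait for the queued tasks to drain
        pool.shutdown();
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
    }
}
```

The pool's internal queue hands out tasks to idle threads, which plays the role of the `synchronized` block in `TaskManager`.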

## Random Data Generation

To prevent caching or anything funky, I wanted to use completely random files of equal size.
I initially generated a random character and then concatenated it onto a string.
After repeating that step a few thousand times, you have a random string to save
to your disk.
```java
private static char rndChar()
{
    // or use Random or whatever
    int rnd = (int) (Math.random() * 52);
    char base = (rnd < 26) ? 'A' : 'a';
    return (char) (base + rnd % 26);
}
```

Problem: string concatenation is terribly inefficient in Java.
Making a single 18 MB file this way took nearly 4 minutes, and I wanted
to create 500 files. Yikes.

Solution: create a random byte array so I don't need to do any string concatenations.
This turned out to be **very** fast.
```java
// needs java.util.Random and java.nio.charset.StandardCharsets;
// saveToDisk is a helper that is not shown in this post
for(int i = 0; i < 500; i++)
{
    byte[] array = new byte[2000000];
    new Random().nextBytes(array);
    String s = new String(array, StandardCharsets.UTF_8);
    saveToDisk(s, "./testData/" + i + ".txt");
    System.out.println("Saved " + i + ".txt");
}
```
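
One caveat: decoding random bytes as UTF-8 replaces invalid byte sequences with a replacement character, so the saved text is generally not the same size as the byte array (how `saveToDisk` encodes the string is not shown here). If exact file sizes matter, the bytes can be written straight to disk. This is only a sketch under that assumption; `RandomFileWriter` and its parameters are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Random;

/** Hypothetical generator that writes fixed-size files of random bytes. */
class RandomFileWriter
{
    /** Creates fileCount files named 0.txt, 1.txt, ... of exactly sizeBytes each inside dir. */
    static void generate(String dir, int fileCount, int sizeBytes) throws IOException
    {
        Path folder = Paths.get(dir);
        Files.createDirectories(folder);
        Random random = new Random();
        byte[] array = new byte[sizeBytes];
        for (int i = 0; i < fileCount; i++)
        {
            // refill the buffer so every file has different content
            random.nextBytes(array);
            Files.write(folder.resolve(i + ".txt"), array);
        }
    }
}
```

For example, `RandomFileWriter.generate("./testData", 500, 2000000);` would produce the same file names as the loop above.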

## Running the Experiments

I created an ugly main method to run all the experiments with the
task manager. To run trials with a different number of CPU threads or a different sample size,
I simply adjusted the loop variables.
```java
import java.util.*;

/**
 * File to test the performance of multi threaded file
 * io by reading a large quantity of files in parallel
 * using a different amount of threads.
 *
 * @author Jeffery Russell 1-31-19
 */
public class MultiThreadedFileReadTest
{
    public static void main(String[] args)
    {
        List<Integer> x = new ArrayList<>();
        List<Double> y = new ArrayList<>();
        for(int i = 1; i <= 64; i++) //thread count
        {
            long threadTotal = 0;
            for(int w = 0; w < 20; w++) // sample size
            {
                TaskManager boss = new TaskManager(i);
                for(int j = 0; j < 500; j++) // files to read
                {
                    //index by j so each task reads a different file
                    boss.addTask(new ReadTask("./testData/" + j + ".txt"));
                }
                long startTime = System.nanoTime();
                boss.runTasks();
                long endTime = System.nanoTime();
                long durationMS = (endTime - startTime)/1000000;
                threadTotal += durationMS;
            }
            x.add(i);
            y.add(threadTotal/20.0); //finds average
        }
        System.out.println(x);
        System.out.println(y);
    }
}
```

## Graphing the Results

I am not going to lie, most Java graphics libraries are terrible.
Simply pasting the output of a Java list into [matplotlib](https://matplotlib.org/)
is the easiest way to make presentable graphs. If I had more time, I would export the results to
JSON for a JavaScript graphing library like [D3](https://d3js.org/) which I could embed on my website.
```python
import matplotlib.pyplot as plt

# paste the two lists printed by the Java program here
xList = []
yList = []

plt.plot(xList, yList)
plt.xlabel('Number of Threads')
plt.ylabel('Execution Time (ms)')
plt.show()
```
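
The JSON export mentioned above would only take a few lines. Here is a minimal sketch without any JSON library, assuming the `x` and `y` lists from the main method; the `ResultExporter` class and the `threads` and `timesMs` key names are made up for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper that turns the benchmark lists into a JSON string. */
class ResultExporter
{
    /** Builds a JSON object such as {"threads":[1,2],"timesMs":[10.5,7.25]}. */
    static String toJson(List<Integer> threads, List<Double> timesMs)
    {
        return "{\"threads\":" + joinNumbers(threads)
                + ",\"timesMs\":" + joinNumbers(timesMs) + "}";
    }

    /** Finite numbers need no escaping, so their toString output is already valid JSON. */
    private static String joinNumbers(List<? extends Number> values)
    {
        return values.stream()
                .map(Object::toString)
                .collect(Collectors.joining(",", "[", "]"));
    }
}
```

Printing `ResultExporter.toJson(x, y)` at the end of `main` would give a string that a D3 script could load directly.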