# Background

Suppose that you have a large quantity of files that you want your program to ingest.
Is it faster to read each file sequentially, or to read the files in parallel using multiple threads?
Searching for this question online gives varied answers from confused people on discussion threads.
Reading recent research papers on multi-threaded IO did not quite provide a clear answer to my question either.

A ton of people argue that multiple threads don't increase file throughput
because, at the end of the day, an HDD can only read
one file at a time. Adding more threads into the mix would
actually slow down the file ingest because the HDD would have to
take turns reading fragments of different files, and the seek time of the HDD would
heavily degrade performance.

Other people on the internet claim that using the same number of
threads as your CPU has is the most efficient way to read in files.
I did not find any evidence backing these claims.
However, I did find plenty of dead links; you gotta love old internet forums.
The argument typically goes that using more threads will decrease the idle time of
the HDD. Plus, if you are running any form of RAID or software-defined storage like
[Ceph](https://ceph.com/ceph-storage/), you have plenty to gain from using multiple threads.
If you use more threads than your CPU has, the argument continues, you will suffer performance-wise because
the threads will sit idle while they wait for each other to finish.

I decided to set out and determine the fastest approach by taking empirical measurements.
# Results

## Small Files

I started the experiment by reading 500 randomly generated
10 KB text files. I tested using a few hard drive types and
configurations. The computer I ran all the tests on
has a Ryzen 1700X processor with 8 cores and 16 threads.
Each data point is the average of 20 trials.

![100 MB/s Hard Drive](media/fileIO/100mbs.png)

![150 MB/s Hard Drive](media/fileIO/150mbs.png)

![Samsung SSD](media/fileIO/samsungSSD.png)

![NVME SSD](media/fileIO/nvmeDrive.png)

![Results Table](media/fileIO/small.png)
## Big Files

After seeing the astonishing results of reading 10 KB files,
I wondered whether anything would change if the files were larger.
Unlike the last set of trials, I did not average multiple trials
for each data point. Although averaging would have made smoother graphs,
it would have taken way too long on the HDD I was using.

I only tested with 1 MB and 3.6 MB as the "large" files because any larger
files *really* slowed down the tests. I initially wanted to test with 18 MB
files, running 10 trials at each thread count. I backed away from doing that
because it would have required terabytes of file IO, and Java *really* does
not like to read 18 MB files as strings.

![1 MB Files](media/fileIO/1mbFiles.png)

![3.6 MB Files](media/fileIO/3.6mbFiles.png)
# Conclusions

For small files, it is clear that using more CPU threads gives you a
very substantial boost in performance. However, once you use the same
number of threads as your CPU has, you hit the optimal performance, and any
additional threads decrease performance.

For larger files, more threads do not equate to better performance. Although
there is a small bump in performance when using two threads, any additional threads heavily
degrade performance.

![CPU Usage During Big File Test](media/fileIO/bigFileCPU.png)

Reading large files like this renders your computer totally unusable.
While I was running the 3.6 MB file test, my computer was completely unresponsive.
I had to fight just to get the screenshot that I included in this blog post.
# Code

All of the code I wrote for this can be found on my [GitHub](https://github.com/jrtechs/RandomScripts/tree/master/multiThreadedFileIO);
however, I also want to put it here so I can make a few remarks about it.

## Basic File IO

This first class represents a task that we want to run in parallel with other tasks. In this case, it just
reads a file from the disk. Nothing exciting about this file.
```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

/**
 * Simple task to be used by the task manager to do
 * file io.
 *
 * @author Jeffery Russell
 */
public class ReadTask
{
    /** Path of the file this task reads */
    private String filePath;

    public ReadTask(String filePath)
    {
        this.filePath = filePath;
    }

    public void runTask()
    {
        String fileContent = "";
        // try-with-resources closes the reader even if a read fails
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(filePath))))
        {
            String line;
            // readLine() strips line terminators, so newlines are dropped
            while ((line = br.readLine()) != null)
                fileContent = fileContent.concat(line);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
```
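
The `concat` call inside the read loop copies the whole accumulated string for every line, which may be part of why large files are so painful to read as strings. As a hedged alternative, and not the code that produced the measurements above, a task could read the file in one call with `Files.readAllBytes` and decode it once. The `BulkReadTask` name is hypothetical, and UTF-8 is an assumed encoding.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical variant of a read task that avoids per-line concatenation. */
final class BulkReadTask
{
    private final String filePath;

    BulkReadTask(String filePath)
    {
        this.filePath = filePath;
    }

    /** Reads the whole file in one call and decodes it once (UTF-8 assumed). */
    String readAll()
    {
        try
        {
            byte[] bytes = Files.readAllBytes(Paths.get(filePath));
            return new String(bytes, StandardCharsets.UTF_8);
        }
        catch (IOException e)
        {
            // rethrow unchecked so callers can run this inside a Runnable
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args)
    {
        // pass the path of any text file as the first argument
        System.out.println(new BulkReadTask(args[0]).readAll().length());
    }
}
```

Unlike the `readLine` loop, this keeps the line terminators in the result, so the two versions do not return identical strings.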
## Multi Threaded Task Manager

This is where the exciting stuff happens. Essentially, this class lets you specify
a list of tasks which must be completed.
This class will then run all of those tasks in parallel using X threads
until they are all complete. What is interesting is how I managed the race conditions
to prevent multi-threading errors while keeping the execution of tasks efficient.
```java
import java.util.List;
import java.util.Vector;

/**
 * A class which enables user to run a large chunk of
 * tasks in parallel efficiently.
 *
 * @author Jeffery 1-29-19
 */
public class TaskManager
{
    /** Number of threads to use at once */
    private int threadCount;

    /** Meaningless tasks to run in parallel */
    private List<ReadTask> tasks;

    public TaskManager(int threadCount)
    {
        this.threadCount = threadCount;
        //using vectors because they are thread safe
        this.tasks = new Vector<>();
    }

    public void addTask(ReadTask t)
    {
        tasks.add(t);
    }

    /**
     * This is the fun method.
     *
     * This will run all of the tasks in parallel using the
     * desired amount of threads until all of the jobs are
     * complete.
     */
    public void runTasks()
    {
        //never start more threads than there are tasks
        int desiredThreads = Math.min(threadCount, tasks.size());
        Thread[] runners = new Thread[desiredThreads];
        for(int i = 0; i < desiredThreads; i++)
        {
            runners[i] = new Thread(()->
            {
                ReadTask t = null;
                while(true)
                {
                    //need synchronized block to prevent
                    //race condition between isEmpty and remove
                    synchronized (tasks)
                    {
                        if(!tasks.isEmpty())
                            t = tasks.remove(0);
                    }
                    if(t == null)
                    {
                        break;
                    }
                    else
                    {
                        t.runTask();
                        t = null;
                    }
                }
            });
            runners[i].start();
        }
        //wait for every worker thread to finish
        for(int i = 0; i < desiredThreads; i++)
        {
            try
            {
                runners[i].join();
            }
            catch (InterruptedException e)
            {
                e.printStackTrace();
            }
        }
    }
}
```
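
For comparison, the same "N workers drain a shared list of tasks" pattern can be sketched with the standard library's `ExecutorService`, which handles the queueing and synchronization internally. This is an alternative under stated assumptions, not the code used for the measurements: the `PoolTaskRunner` name is hypothetical, tasks are plain `Runnable`s, and `threadCount` is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical task runner built on a fixed-size thread pool. */
final class PoolTaskRunner
{
    /** Runs every task on at most threadCount threads and blocks until all finish. */
    static void runAll(List<Runnable> tasks, int threadCount)
    {
        if (tasks.isEmpty())
            return;
        // never create more threads than there are tasks
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.min(threadCount, tasks.size()));
        for (Runnable task : tasks)
            pool.execute(task);
        // stop accepting new work, then wait for the queued tasks to drain
        pool.shutdown();
        try
        {
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args)
    {
        AtomicInteger done = new AtomicInteger();
        List<Runnable> tasks = new ArrayList<>();
        for (int i = 0; i < 100; i++)
            tasks.add(done::incrementAndGet);
        runAll(tasks, 8);
        System.out.println(done.get() + " tasks completed");
    }
}
```

The hand-rolled version above makes the locking explicit, which is useful for seeing where the race condition lives. The pool version trades that visibility for less code.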
## Random Data Generation

To prevent caching or anything funky, I wanted to use completely random files of equal size.
I initially generated a random character and then concatenated it onto a string.
After repeating that step a few thousand times, you have a random string to save
to your disk.
```java
private static char rndChar()
{
    // or use Random or whatever
    int rnd = (int) (Math.random() * 52);
    char base = (rnd < 26) ? 'A' : 'a';
    return (char) (base + rnd % 26);
}
```
Problem: repeated string concatenation is terribly inefficient in Java.
Generating a single 18 MB file this way took nearly 4 minutes, and I wanted
to create 500 files. Yikes.

Solution: create a random byte array so I don't need to do any string concatenation.
This turned out to be **very** fast.
```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

// saveToDisk(String, String) is a helper that is not shown in this post
Random random = new Random();
for(int i = 0; i < 500; i++)
{
    byte[] array = new byte[2000000];
    random.nextBytes(array);
    String s = new String(array, StandardCharsets.UTF_8);
    saveToDisk(s, "./testData/" + i + ".txt");
    System.out.println("Saved " + i + ".txt");
}
```
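
One caveat with decoding random bytes as UTF-8: byte sequences that are not valid UTF-8 are replaced during decoding, so the saved files may not all end up exactly the same size. A minimal sketch that keeps the files as readable text and fixes their size is to fill the byte array with random ASCII letters, reusing the letter logic from `rndChar`, and write the bytes directly. The `RandomFileWriter` class and its method names are hypothetical, and the 10,000-byte size in `main` is only an example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Random;

/** Hypothetical generator for fixed-size files of random ASCII letters. */
final class RandomFileWriter
{
    /** Fills a byte array with random letters, one byte per character. */
    static byte[] randomLetters(int size, Random rng)
    {
        byte[] data = new byte[size];
        for (int i = 0; i < size; i++)
        {
            int rnd = rng.nextInt(52);
            char base = (rnd < 26) ? 'A' : 'a';
            data[i] = (byte) (base + rnd % 26);
        }
        return data;
    }

    /** Writes the bytes as-is, so the file is exactly data.length bytes long. */
    static void save(byte[] data, Path target)
    {
        try
        {
            if (target.getParent() != null)
                Files.createDirectories(target.getParent());
            Files.write(target, data);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args)
    {
        Random rng = new Random();
        for (int i = 0; i < 500; i++)
            save(randomLetters(10_000, rng), Paths.get("./testData/" + i + ".txt"));
    }
}
```

Because no `String` is built at all, this also sidesteps the concatenation cost described above.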
## Running the Experiments

I created an ugly main method to run all the experiments with the
task manager. To run trials with a different number of CPU threads or sample size,
I simply adjusted the loop variables.
```java
import java.util.*;

/**
 * File to test the performance of multi threaded file
 * io by reading a large quantity of files in parallel
 * using a different amount of threads.
 *
 * @author Jeffery Russell 1-31-19
 */
public class MultiThreadedFileReadTest
{
    public static void main(String[] args)
    {
        List<Integer> x = new ArrayList<>();
        List<Double> y = new ArrayList<>();
        for(int i = 1; i <= 64; i++) //thread count
        {
            long threadTotal = 0;
            for(int w = 0; w < 20; w++) // sample size
            {
                TaskManager boss = new TaskManager(i);
                for(int j = 0; j < 500; j++) // files to read
                {
                    //index by j so each task reads a different file
                    boss.addTask(new ReadTask("./testData/" + j + ".txt"));
                }
                long startTime = System.nanoTime();
                boss.runTasks();
                long endTime = System.nanoTime();
                long durationMS = (endTime - startTime)/1000000;
                threadTotal += durationMS;
            }
            x.add(i);
            y.add(threadTotal/20.0); //finds average
        }
        System.out.println(x);
        System.out.println(y);
    }
}
```
## Graphing the Results

I am not going to lie, most Java graphics libraries are terrible.
Simply pasting the output of a Java list into [matplotlib](https://matplotlib.org/)
is the easiest way to make presentable graphs. If I had more time, I would export the results to
JSON for a JavaScript graphing library like [D3](https://d3js.org/), which I could embed on my website.
```python
import matplotlib.pyplot as plt

# paste the two lists printed by the Java program here
xList = []  # thread counts
yList = []  # average execution time in ms

plt.plot(xList, yList)
plt.xlabel('Number of Threads')
plt.ylabel('Execution Time (ms)')
plt.show()
```
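
The JSON export mentioned above could be sketched without any extra library, since the `toString` output of a Java list of numbers is already a valid JSON array as long as the values are finite. The `ResultsJson` class and the `threads` and `millis` key names are assumptions made for illustration, not part of the original code.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that formats benchmark results as a small JSON object. */
final class ResultsJson
{
    /** Relies on List.toString() producing "[a, b, c]" for finite numbers. */
    static String toJson(List<Integer> threads, List<Double> millis)
    {
        return "{\"threads\": " + threads + ", \"millis\": " + millis + "}";
    }

    public static void main(String[] args)
    {
        // made-up placeholder values, not measurements from the experiment
        List<Integer> x = Arrays.asList(1, 2, 4);
        List<Double> y = Arrays.asList(30.0, 20.5, 12.25);
        System.out.println(toJson(x, y));
    }
}
```

Calling `toJson(x, y)` with the two lists built in the main method of the experiment would give a string that a D3 page could parse directly.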