diff --git a/blogContent/headerImages/maxCPU.png b/blogContent/headerImages/maxCPU.png
new file mode 100644
index 0000000..7c76e90
Binary files /dev/null and b/blogContent/headerImages/maxCPU.png differ
diff --git a/blogContent/posts/programming/media/fileIO/100ThreadsOldSpinningHHD.png b/blogContent/posts/programming/media/fileIO/100ThreadsOldSpinningHHD.png
index 43ba481..0cbd440 100644
Binary files a/blogContent/posts/programming/media/fileIO/100ThreadsOldSpinningHHD.png and b/blogContent/posts/programming/media/fileIO/100ThreadsOldSpinningHHD.png differ
diff --git a/blogContent/posts/programming/media/fileIO/1mbFilesx124OldHHD.png b/blogContent/posts/programming/media/fileIO/1mbFilesx124OldHHD.png
index ca87a76..c87d815 100644
Binary files a/blogContent/posts/programming/media/fileIO/1mbFilesx124OldHHD.png and b/blogContent/posts/programming/media/fileIO/1mbFilesx124OldHHD.png differ
diff --git a/blogContent/posts/programming/media/fileIO/3.6MBFilex31.png b/blogContent/posts/programming/media/fileIO/3.6MBFilex31.png
index 426e0fc..5107889 100644
Binary files a/blogContent/posts/programming/media/fileIO/3.6MBFilex31.png and b/blogContent/posts/programming/media/fileIO/3.6MBFilex31.png differ
diff --git a/blogContent/posts/programming/media/fileIO/FreenasServer.png b/blogContent/posts/programming/media/fileIO/FreenasServer.png
index c537b03..f185170 100644
Binary files a/blogContent/posts/programming/media/fileIO/FreenasServer.png and b/blogContent/posts/programming/media/fileIO/FreenasServer.png differ
diff --git a/blogContent/posts/programming/media/fileIO/bigFileCPU.png b/blogContent/posts/programming/media/fileIO/bigFileCPU.png
new file mode 100644
index 0000000..e4622b5
Binary files /dev/null and b/blogContent/posts/programming/media/fileIO/bigFileCPU.png differ
diff --git a/blogContent/posts/programming/media/fileIO/nvme500Threads.png b/blogContent/posts/programming/media/fileIO/nvme500Threads.png
index 8c85333..f970e5e 100644
Binary files a/blogContent/posts/programming/media/fileIO/nvme500Threads.png and b/blogContent/posts/programming/media/fileIO/nvme500Threads.png differ
diff --git a/blogContent/posts/programming/media/fileIO/nvmeDrive.png b/blogContent/posts/programming/media/fileIO/nvmeDrive.png
index fc984f2..b3d9018 100644
Binary files a/blogContent/posts/programming/media/fileIO/nvmeDrive.png and b/blogContent/posts/programming/media/fileIO/nvmeDrive.png differ
diff --git a/blogContent/posts/programming/media/fileIO/slowDrive500.png b/blogContent/posts/programming/media/fileIO/slowDrive500.png
index a43027a..aa74715 100644
Binary files a/blogContent/posts/programming/media/fileIO/slowDrive500.png and b/blogContent/posts/programming/media/fileIO/slowDrive500.png differ
diff --git a/blogContent/posts/programming/multi-threaded-file-io.md b/blogContent/posts/programming/multi-threaded-file-io.md
new file mode 100644
index 0000000..0cc81e2
--- /dev/null
+++ b/blogContent/posts/programming/multi-threaded-file-io.md
@@ -0,0 +1,316 @@
# Background

Suppose that you have a large quantity of files that you want your program to ingest.
Is it faster to read each file sequentially, or to read the files in parallel using multiple threads?
Searching for this question online turns up varied answers from confused people on discussion threads,
and reading recent research papers on multi-threaded IO did not give a clear answer to my question either.

A ton of people argue that multiple threads don't increase file throughput,
because at the end of the day an HDD can only read
one file at a time. Adding more threads into the mix would
actually slow down the file ingest because the HDD would have to
take turns reading fragments of different files, and the extra
seeking would heavily degrade performance.

Other people on the internet claim that using the same number of threads
as your CPU has hardware threads is the most efficient way to read in files.
I did not find any evidence backing these claims -- although I did find
plenty of dead links; you gotta love old internet forums.
The argument typically goes that using more threads decreases the idle time of
the HDD. Plus, if you are running any form of RAID or software-defined storage like
[CEPH](https://ceph.com/ceph-storage/), you have plenty to gain from using multiple threads.
If you use more threads than your CPU can run at once, you will obviously lose some performance
because the extra threads sit idle while they wait for a turn on a core.

I set out to determine the fastest approach by taking empirical measurements.

# Results

## Small Files

I started the experiment by reading 500 randomly generated
10 KB text files. I tested a few hard drive types and
configurations. The computer I ran all the tests on
has a Ryzen 1700X processor with 8 cores and 16 threads.
Each data point is the average of 20 trials.

![Slow HDD reading 500 10 KB files](media/fileIO/100ThreadsOldSpinningHHD.png)

![Slow HDD reading 500 10 KB files](media/fileIO/slowDrive500.png)

![RAID-Z2 reading 500 10 KB files](media/fileIO/FreenasServer.png)

![NVMe SSD](media/fileIO/nvmeDrive.png)

![NVMe SSD](media/fileIO/nvme500Threads.png)

## Big Files

After seeing the astonishing results of reading 10 KB files,
I wondered whether anything changed when the files were larger.
Unlike the last set of trials, I did not average multiple samples per data point.
Although averaging would have made smoother graphs,
it would have taken far too long on the HDD I was using.
I only tested 1 MB and 3.6 MB files as the "large" files because anything bigger
*really* slowed down the tests. I initially wanted to test 18 MB
files with 10 trials at each thread count, but I backed away from that
because it would have required terabytes of file IO, and Java *really* does
not like reading 18 MB files into strings.

![1 MB files](media/fileIO/1mbFilesx124OldHHD.png)

![3.6 MB files](media/fileIO/3.6MBFilex31.png)

# Conclusions

For small files, it is clear that using more threads gives you a
very substantial boost in performance. However, once you match the number
of hardware threads your CPU has, you hit peak performance, and any
additional threads decrease performance.

For larger files, more threads does not equate to better performance. Although
there is a small bump in performance with two threads, any more threads heavily
degrade performance.

![CPU usage during the big file test](media/fileIO/bigFileCPU.png)

Reading large files like this renders your computer nearly unusable.
While I was running the 3.6 MB file test, my computer was completely unresponsive;
I had to fight just to get the screenshot included in this blog post.
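
If you want to apply the small-file result in code, a reasonable default is to cap the worker
count at the number of hardware threads the JVM reports, and never use more workers than there
are files. The snippet below is only a sketch of that heuristic, not part of the benchmark code
that follows; `Runtime.getRuntime().availableProcessors()` is the standard JDK call for querying
the hardware thread count.

```java
/**
 * Minimal sketch: pick a thread count for a small-file workload
 * based on the measurements above. Not part of the original benchmark code.
 */
public class ThreadCountHeuristic
{
    public static int threadsFor(int fileCount)
    {
        //number of hardware threads the JVM can see (16 on the Ryzen 1700X used here)
        int hardwareThreads = Runtime.getRuntime().availableProcessors();

        //more workers than files would just leave threads idle
        return Math.min(hardwareThreads, fileCount);
    }

    public static void main(String[] args)
    {
        System.out.println("Threads for 500 small files: " + threadsFor(500));
    }
}
```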

# Code

All of the code I wrote for this can be found on my [github](https://github.com/jrtechs/RandomScripts/tree/master/multiThreadedFileIO);
however, I also want to include it here so I can make a few remarks about it.

## Basic File IO

This first class represents a task which we want to run in parallel with other tasks. In this case it just
reads a file from the disk. Nothing exciting about this file.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

/**
 * Simple task used by the task manager to do
 * file IO.
 *
 * @author Jeffery Russell
 */
public class ReadTask
{
    private String filePath;

    public ReadTask(String fileName)
    {
        this.filePath = fileName;
    }

    public void runTask()
    {
        String fileContent = "";
        try
        {
            BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(filePath)));
            String line;
            while ((line = br.readLine()) != null)
            {
                fileContent = fileContent.concat(line);
            }
            br.close();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
```

## Multi-Threaded Task Manager

This is where the exciting stuff happens. Essentially, this class lets you specify
a list of tasks which must be completed, and it then runs all of those tasks in parallel
using a configurable number of threads until the list is empty. The interesting part is how
the race condition between checking for and removing tasks is handled, so the workers stay
correct without giving up efficiency.

```java
import java.util.List;
import java.util.Vector;

/**
 * A class which enables the user to run a large chunk of
 * tasks in parallel efficiently.
 *
 * @author Jeffery 1-29-19
 */
public class TaskManager
{
    /** Number of threads to use at once */
    private int threadCount;

    /** File read tasks to run in parallel */
    private List<ReadTask> tasks;

    public TaskManager(int threadCount)
    {
        this.threadCount = threadCount;
        //using a vector because it is thread safe
        this.tasks = new Vector<>();
    }

    public void addTask(ReadTask t)
    {
        tasks.add(t);
    }

    /**
     * This is the fun method.
     *
     * This will run all of the tasks in parallel using the
     * desired number of threads until all of the jobs are
     * complete.
     */
    public void runTasks()
    {
        //never start more workers than there are tasks
        int desiredThreads = Math.min(threadCount, tasks.size());

        Thread[] runners = new Thread[desiredThreads];
        for(int i = 0; i < desiredThreads; i++)
        {
            runners[i] = new Thread(()->
            {
                ReadTask t = null;
                while(true)
                {
                    //need a synchronized block to prevent a
                    //race condition between isEmpty and remove
                    synchronized (tasks)
                    {
                        if(!tasks.isEmpty())
                            t = tasks.remove(0);
                    }
                    if(t == null)
                    {
                        break;
                    }
                    else
                    {
                        t.runTask();
                        t = null;
                    }
                }
            });
            runners[i].start();
        }
        for(int i = 0; i < desiredThreads; i++)
        {
            try
            {
                runners[i].join();
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
        }
    }
}
```

## Random Data Generation

To prevent caching or anything funky, I wanted to use completely random files of equal size.
I initially generated a random character and concatenated it onto a string.
After repeating that step a few thousand times, you have a random string to save
to your disk.

```java
private static char rndChar()
{
    // or use Random or whatever
    int rnd = (int) (Math.random() * 52);
    char base = (rnd < 26) ? 'A' : 'a';
    return (char) (base + rnd % 26);
}
```

Problem: string concatenation is terribly inefficient in Java.
When attempting to make an 18 MB file it took nearly 4 minutes, and I wanted
to create 500 files. Yikes.
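
As an aside -- this sketch is my addition rather than code from the repository linked above --
the slowness comes from `String.concat` copying the entire string on every append, so building
the text in a `StringBuilder` and converting it to a `String` once at the end avoids the
quadratic copying:

```java
import java.util.Random;

/**
 * Sketch (not from the original scripts): generate random text with a
 * StringBuilder so each append is cheap instead of copying the whole string.
 */
public class RandomTextSketch
{
    private static final Random RANDOM = new Random();

    private static char rndChar()
    {
        int rnd = RANDOM.nextInt(52);
        char base = (rnd < 26) ? 'A' : 'a';
        return (char) (base + rnd % 26);
    }

    public static String randomText(int length)
    {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++)
        {
            sb.append(rndChar());
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        //roughly 10 KB of random letters
        System.out.println(randomText(10_000).length());
    }
}
```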

Solution: create a random byte array so I don't need to do any string concatenation at all.
This turned out to be **very** fast.

```java
for(int i = 0; i < 500; i++)
{
    //2 MB of random bytes per file
    byte[] array = new byte[2000000];
    new Random().nextBytes(array);
    String s = new String(array, Charset.forName("UTF-8"));
    saveToDisk(s, "./testData/" + i + ".txt");
    System.out.println("Saved " + i + ".txt");
}
```

## Running the Experiments

I created an ugly main method to run all the experiments with the
task manager. To run trials with a different number of threads or a different sample size,
I simply adjusted the loop bounds.

```java
import java.util.*;

/**
 * Tests the performance of multi-threaded file
 * IO by reading a large quantity of files in parallel
 * using different numbers of threads.
 *
 * @author Jeffery Russell 1-31-19
 */
public class MultiThreadedFileReadTest
{
    public static void main(String[] args)
    {
        List<Integer> x = new ArrayList<>();
        List<Double> y = new ArrayList<>();
        for(int i = 1; i <= 64; i++) //thread count
        {
            long threadTotal = 0;
            for(int w = 0; w < 20; w++) // sample size
            {
                TaskManager boss = new TaskManager(i);

                for(int j = 0; j < 500; j++) // files to read
                {
                    //j indexes the 500 generated test files
                    boss.addTask(new ReadTask("./testData/" + j + ".txt"));
                }
                long startTime = System.nanoTime();
                boss.runTasks();
                long endTime = System.nanoTime();
                long durationMS = (endTime - startTime) / 1000000;
                threadTotal += durationMS;
            }
            x.add(i);
            y.add(threadTotal / 20.0); //average over the 20 samples
        }
        System.out.println(x);
        System.out.println(y);
    }
}
```

## Graphing the Results

I am not going to lie, most Java plotting libraries are terrible.
Simply pasting the output of a Java list into [matplotlib](https://matplotlib.org/)
is the easiest way to make presentable graphs. If I had more time, I would export the results to
JSON and use a JavaScript library like [D3](https://d3js.org/) to embed interactive graphs on my website.

```python
import matplotlib.pyplot as plt

# paste the two lists printed by the Java program here
xList = []
yList = []
plt.plot(xList, yList)
plt.xlabel('Number of Threads')
plt.ylabel('Execution Time (ms)')

plt.show()
```
\ No newline at end of file
diff --git a/docs/sqlConfig.md b/docs/sqlConfig.md
index 867fdb2..372a1c2 100644
--- a/docs/sqlConfig.md
+++ b/docs/sqlConfig.md
@@ -43,12 +43,6 @@ download_count mediumint not null,
 primary key(download_id)
 );
-create table popular_posts(
-popular_post_id mediumint unsigned not null AUTO_INCREMENT,
-post_id mediumint unsigned not null,
-primary key(popular_post_id)
-);
-
 create table traffic_log(
 log_id mediumint unsigned not null AUTO_INCREMENT,
 url varchar(60) not null,