Finished writing blog post on multi threaded file io performance.

6 years ago · 3bd3e8ff3d
--- a/blogContent/headerImages/maxCPU.png
+++ b/blogContent/headerImages/maxCPU.png
--- a/blogContent/posts/programming/media/fileIO/100ThreadsOldSpinningHHD.png
+++ b/blogContent/posts/programming/media/fileIO/100ThreadsOldSpinningHHD.png
--- a/blogContent/posts/programming/media/fileIO/1mbFilesx124OldHHD.png
+++ b/blogContent/posts/programming/media/fileIO/1mbFilesx124OldHHD.png
--- a/blogContent/posts/programming/media/fileIO/3.6MBFilex31.png
+++ b/blogContent/posts/programming/media/fileIO/3.6MBFilex31.png
--- a/blogContent/posts/programming/media/fileIO/FreenasServer.png
+++ b/blogContent/posts/programming/media/fileIO/FreenasServer.png
--- a/blogContent/posts/programming/media/fileIO/bigFileCPU.png
+++ b/blogContent/posts/programming/media/fileIO/bigFileCPU.png
--- a/blogContent/posts/programming/media/fileIO/nvme500Threads.png
+++ b/blogContent/posts/programming/media/fileIO/nvme500Threads.png
--- a/blogContent/posts/programming/media/fileIO/nvmeDrive.png
+++ b/blogContent/posts/programming/media/fileIO/nvmeDrive.png
--- a/blogContent/posts/programming/media/fileIO/slowDrive500.png
+++ b/blogContent/posts/programming/media/fileIO/slowDrive500.png
--- a/blogContent/posts/programming/multi-threaded-file-io.md
+++ b/blogContent/posts/programming/multi-threaded-file-io.md
@ -0,0 +1,316 @@
 # Background
 Suppose that you have a large number of files that you want your program to ingest.
 Is it faster to read each file sequentially, or to read the files in parallel using multiple threads?
 Searching for this question online turns up conflicting answers from confused people on discussion threads.
 Reading recent research papers on multi-threaded IO did not provide a clear answer to my question either.
 A ton of people argue that multiple threads don't increase file throughput
 because at the end of the day an HDD can only read
 one file at a time. Adding more CPU cores into the mix would
 actually slow down the file ingest because the HDD would have to
 take turns reading fragments of different files. The seek time of the HDD would
 heavily degrade the performance.
 Other people on the internet claim that using the same number of
 threads as your CPU has is the most efficient way to read in files.
 I did not find any evidence backing these claims.
 However, I did find plenty of dead links -- you gotta love old internet forums.
 The argument typically goes that using more threads will decrease the idle time of
 the HDD. Plus, if you are running any form of RAID or software-defined storage like
 [Ceph](https://ceph.com/ceph-storage/) you have plenty to gain from using multiple threads.
 If you use more threads than your CPU has, you will obviously suffer performance-wise because
 the threads will sit idle while they wait for each other to finish.
 I decided to set out and determine the fastest approach by taking empirical measurements.
 # Results
 ## Small Files
 I started the experiment by reading 500 randomly generated 
 10 KB text files. I tested using a few hard drive types and 
 configurations. The computer which I ran all the tests on
 has a Ryzen 1700x processor with 16 threads and 8 cores. 
 Each data point is the average of 20 trials. 
 ![slow HDD reading 500 10 KB files](media/fileIO/100ThreadsOldSpinningHHD.png) 
 ![slow HDD reading 500 10 KB files](media/fileIO/slowDrive500.png)   
 ![RAID-Z2 reading 500 10 KB files](media/fileIO/FreenasServer.png)   
 ![NVMe SSD](media/fileIO/nvmeDrive.png)
 ![NVMe SSD](media/fileIO/nvme500Threads.png)   
 ## Big Files
 After seeing the astonishing results of reading 10 KB files,
 I wondered if anything would change if the files were larger.
 Unlike the last set of trials, I did not run multiple trials per data point
 to calculate an average. Although averaging would have made smoother graphs,
 it would have taken way too long on the HDD I was using. 
 I only tested with 1 MB and 3.6 MB as the "large" files because any larger
 files *really* slowed down the tests. I initially wanted to test with 18 MB
 files running 10 trials at each thread count. I backed away from doing that
 because it would have required terabytes of file IO, and Java *really* does
 not like to read 18 MB files as strings. 
 ![1MB files](media/fileIO/1mbFilesx124OldHHD.png)   
 ![3.6MB files](media/fileIO/3.6MBFilex31.png)
 # Conclusions
 For small files it is clear that using more CPU threads gives you a 
 very substantial boost in performance. However, once you use the same
 number of threads as your CPU has, you hit the optimal performance and any additional
 threads decrease performance.
 For larger files, more threads do not equate to better performance. Although
 there is a small bump in performance using two threads, any more threads heavily 
 degrade performance. 
 ![CPU Usage During Big File Test](media/fileIO/bigFileCPU.png)
 Reading large files like this totally renders your computer unusable.
 While I was running the 3.6 MB file test my computer was completely unresponsive.
 I had to fight just to get the screenshot that I included in this blog post. 
 # Code
 All of the code I wrote for this can be found on my [GitHub](https://github.com/jrtechs/RandomScripts/tree/master/multiThreadedFileIO);
 however, I also want to put it here so I can make a few remarks about it. 
 ## Basic File IO
 This first class is used to represent a task which we want to run in parallel with other tasks. In this case it is just
 reading a file from the disk. Nothing exciting about this file. 
 ```java
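 import java.io.BufferedReader;
 import java.io.FileInputStream;
 import java.io.IOException;
 import java.io.InputStreamReader;
 /**
 * Simple task to be used by the task manager to do
 * file io.
 *
 * @author Jeffery Russell
 */
 public class ReadTask
 {
    /** Path of the file this task reads */
    private String filePath;
    public ReadTask(String filePath)
    {
        this.filePath = filePath;
    }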
    public void runTask()
    {
        String fileContent = new String();
        try
        {
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(new FileInputStream(filePath)));
            String line;
            while ((line = br.readLine()) != null)
                fileContent = fileContent.concat(line);
            br.close();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
 }
 ```
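The `concat` call in the loop above copies the whole accumulated string on every line, which is one plausible reason large files are slow to read as strings. The following is a minimal sketch of a variant that appends to a `StringBuilder` instead. The class name `BufferedReadTask`, the explicit UTF-8 charset, and the returned string are assumptions made for illustration; this is not the code that produced the graphs above.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical variant of ReadTask that avoids repeated string copies. */
class BufferedReadTask
{
    private final String filePath;

    BufferedReadTask(String filePath)
    {
        this.filePath = filePath;
    }

    /** Reads the file line by line and returns its content without line breaks. */
    String runTask()
    {
        StringBuilder fileContent = new StringBuilder();
        // try-with-resources closes the reader even if a read fails
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(filePath), StandardCharsets.UTF_8)))
        {
            String line;
            while ((line = br.readLine()) != null)
                fileContent.append(line); // appends in place instead of copying the whole string
        }
        catch (IOException e)
        {
            // assumption: surface the failure to the caller instead of printing it
            throw new UncheckedIOException(e);
        }
        return fileContent.toString();
    }
}
```

Swapping this in would change what is being timed, so results measured with it are not directly comparable to the graphs in this post.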
 ## Multi Threaded Task Manager
 This is where the exciting stuff happens. Essentially, this class lets you specify
 a list of tasks which must be completed.
 This class will then run all of those tasks in parallel using X threads
 until they are all complete. What is interesting is how I managed the race conditions
 to prevent multi-threaded errors while keeping the execution of tasks efficient. 
 ```java
 import java.util.List;
 import java.util.Vector;
 /**
 * A class which enables user to run a large chunk of
 * tasks in parallel efficiently.
 *
 * @author Jeffery 1-29-19
 */
 public class TaskManager
 {
    /** Number of threads to use at once */
    private int threadCount;
    /** Meaningless tasks to run in parallel */
    private List<ReadTask> tasks;
    public TaskManager(int threadCount)
    {
        this.threadCount = threadCount;
        //using vectors because they are thread safe
        this.tasks = new Vector<>();
    }
    public void addTask(ReadTask t)
    {
        tasks.add(t);
    }
    /**
     * This is the fun method.
     *
     * This will run all of the tasks in parallel using the
     * desired number of threads until all of the jobs are
     * complete.
     */
    public void runTasks()
    {
        int desiredThreads = threadCount > tasks.size() ?
                tasks.size() : threadCount;
        Thread[] runners = new Thread[desiredThreads];
        for(int i = 0; i < desiredThreads; i++)
        {
            runners[i] = new Thread(()->
            {
                ReadTask t = null;
                while(true)
                {
                    //need synchronized block to prevent
                    //race condition between isEmpty and remove
                    synchronized (tasks)
                    {
                        if(!tasks.isEmpty())
                            t = tasks.remove(0);
                    }
                    if(t == null)
                    {
                        break;
                    }
                    else
                    {
                        t.runTask();
                        t = null;
                    }
                }
            });
            runners[i].start();
        }
        for(int i = 0; i < desiredThreads; i++)
        {
            try
            {
                //wait for this runner to finish its share of the tasks
                runners[i].join();
            }
            catch (InterruptedException e)
            {
                e.printStackTrace();
            }
        }
    }
 }
 ```
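The race between `isEmpty` and `remove` described above can also be avoided without an explicit `synchronized` block. The following is a minimal sketch, assuming the tasks are plain `Runnable`s, that uses `ConcurrentLinkedQueue.poll()`, which removes the head atomically and returns `null` once the queue is empty. The class name `QueueTaskManager` is hypothetical and this is an alternative, not the implementation used for the measurements.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical variant of TaskManager built on a concurrent queue. */
class QueueTaskManager
{
    private final int threadCount;
    private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<>();

    QueueTaskManager(int threadCount)
    {
        this.threadCount = threadCount;
    }

    void addTask(Runnable t)
    {
        tasks.add(t);
    }

    /** Runs every queued task on at most threadCount threads and waits for them. */
    void runTasks()
    {
        int desiredThreads = Math.min(threadCount, tasks.size());
        Thread[] runners = new Thread[desiredThreads];
        for (int i = 0; i < desiredThreads; i++)
        {
            runners[i] = new Thread(() ->
            {
                Runnable t;
                // poll() checks for emptiness and removes the head in one atomic step
                while ((t = tasks.poll()) != null)
                    t.run();
            });
            runners[i].start();
        }
        for (Thread runner : runners)
        {
            try
            {
                runner.join();
            }
            catch (InterruptedException e)
            {
                // restore the interrupt flag and stop waiting
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

A task would be queued with something like `boss.addTask(() -> new ReadTask(path).runTask())`. Polling from the head also avoids `Vector.remove(0)`, which shifts every remaining element on each call.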
 ## Random Data Generation
 To prevent caching or anything funky, I wanted to use completely random files of equal size.
 I initially generated a random character and then concatenated that onto a string.
 After repeating that step a few thousand times, you have a random string to save
 to your disk.
 ```java
 private static char rndChar()
 {
    // or use Random or whatever
    int rnd = (int) (Math.random() * 52);
    char base = (rnd < 26) ? 'A' : 'a';
    return (char) (base + rnd % 26);
 }
 ```
 Problem: repeated string concatenation is terribly inefficient in Java because each concatenation copies the whole string.
 When attempting to make an 18 MB file it took nearly 4 minutes, and I wanted
 to create 500 files. Yikes.
 Solution: create a random byte array so I don't need to do any string concatenations.
 This turned out to be **very** fast.
 ```java
 for(int i = 0; i < 500; i++)
 {
    byte[] array = new byte[2000000];
    new Random().nextBytes(array);
    String s = new String(array, Charset.forName("UTF-8"));
    saveToDisk(s, "./testData/" + i + ".txt");
    System.out.println("Saved " + i + ".txt");
 }
 ```
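Decoding random bytes as UTF-8 replaces invalid sequences, so the saved files may not end up exactly the size of the byte array. If equal file sizes matter, one option is to skip the `String` step and write the bytes directly. This is a minimal sketch under that assumption; `RandomFileWriter` and `writeRandomFiles` are hypothetical names, and it stands in for the `saveToDisk` helper, whose implementation is not shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

/** Hypothetical helper that writes equally sized files of random bytes. */
class RandomFileWriter
{
    /** Writes fileCount files named 0.txt, 1.txt, ... of exactly fileSize bytes into dir. */
    static void writeRandomFiles(Path dir, int fileCount, int fileSize)
    {
        Random random = new Random();
        byte[] buffer = new byte[fileSize]; // reused for every file
        try
        {
            Files.createDirectories(dir);
            for (int i = 0; i < fileCount; i++)
            {
                random.nextBytes(buffer);
                // raw bytes go straight to disk, so every file is exactly fileSize bytes
                Files.write(dir.resolve(i + ".txt"), buffer);
            }
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, `RandomFileWriter.writeRandomFiles(Paths.get("./testData"), 500, 2000000)` would mirror the loop above. The resulting files are binary rather than text, so a reader that splits on line breaks will see lines of arbitrary length.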
 ## Running the Experiments
 I created an ugly main method to run all the experiments with the
 task manager. To run trials with a different number of CPU threads or sample size,
 I simply adjusted the loop variables. 
 ```java
 import java.util.*;
 /**
 * File to test the performance of multi threaded file
 * io by reading a large quantity of files in parallel
 * using a different amount of threads.
 *
 * @author Jeffery Russell 1-31-19
 */
 public class MultiThreadedFileReadTest
 {
    public static void main(String[] args)
    {
        List<Integer> x = new ArrayList<>();
        List<Double> y = new ArrayList<>();
        for(int i = 1; i <= 64; i++) //thread count
        {
            long threadTotal = 0;
            for(int w = 0; w < 20; w++) // sample size
            {
                TaskManager boss = new TaskManager(i);
                for(int j = 0; j < 500; j++) // files to read
                {
                    boss.addTask(new ReadTask("./testData/" + j + ".txt"));
                }
                long startTime = System.nanoTime();
                boss.runTasks();
                long endTime = System.nanoTime();
                long durationMS = (endTime - startTime)/1000000;
                threadTotal+= durationMS;
            }
            x.add(i);
            y.add(threadTotal/20.0); //finds average
        }
        System.out.println(x);
        System.out.println(y);
    }
 }
 ```
 ## Graphing the Results
 I am not going to lie, most Java graphics libraries are terrible.
 Simply pasting the output of a Java list into [matplotlib](https://matplotlib.org/)
 is the easiest way to make presentable graphs. If I had more time, I would export the results to
 JSON for a JavaScript graphing library like [D3](https://d3js.org/), which I could embed on my website. 
 ```python
 import matplotlib.pyplot as plt
 xList = []
 yList = []
 plt.plot(xList, yList)
 plt.xlabel('Number of Threads')
 plt.ylabel('Execution Time (MS)')
 plt.show()
 ```
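The JSON export mentioned above was not implemented, but it could be sketched in plain Java without any library. The class name `ResultsJson` and the key names `threads` and `timesMs` are assumptions chosen for illustration, and the sketch assumes the averages are finite numbers.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper that turns the two result lists into a JSON string. */
class ResultsJson
{
    /** Formats the results as a JSON object with "threads" and "timesMs" arrays. */
    static String toJson(List<Integer> threads, List<Double> timesMs)
    {
        // join each list into a comma separated run of JSON numbers
        String x = threads.stream().map(String::valueOf).collect(Collectors.joining(","));
        String y = timesMs.stream().map(String::valueOf).collect(Collectors.joining(","));
        return "{\"threads\":[" + x + "],\"timesMs\":[" + y + "]}";
    }
}
```

Calling `System.out.println(ResultsJson.toJson(x, y))` at the end of the main method would print a single object that a charting script could load.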
--- a/docs/sqlConfig.md
+++ b/docs/sqlConfig.md
@ -43,12 +43,6 @@ download_count mediumint not null,
 primary key(download_id)
 );
 create table popular_posts(
 popular_post_id mediumint unsigned not null AUTO_INCREMENT,
 post_id mediumint unsigned not null,
 primary key(popular_post_id)
 );
 create table traffic_log(
 log_id mediumint unsigned not null AUTO_INCREMENT,
 url varchar(60) not null,