Finished writing blog post on multi threaded file io performance.

pull/60/head
jrtechs 5 years ago
parent
commit
3bd3e8ff3d
11 changed files with 316 additions and 6 deletions
  1. BIN
      blogContent/headerImages/maxCPU.png
  2. BIN
      blogContent/posts/programming/media/fileIO/100ThreadsOldSpinningHHD.png
  3. BIN
      blogContent/posts/programming/media/fileIO/1mbFilesx124OldHHD.png
  4. BIN
      blogContent/posts/programming/media/fileIO/3.6MBFilex31.png
  5. BIN
      blogContent/posts/programming/media/fileIO/FreenasServer.png
  6. BIN
      blogContent/posts/programming/media/fileIO/bigFileCPU.png
  7. BIN
      blogContent/posts/programming/media/fileIO/nvme500Threads.png
  8. BIN
      blogContent/posts/programming/media/fileIO/nvmeDrive.png
  9. BIN
      blogContent/posts/programming/media/fileIO/slowDrive500.png
  10. +316
    -0
      blogContent/posts/programming/multi-threaded-file-io.md
  11. +0
    -6
      docs/sqlConfig.md

+ 316
- 0
blogContent/posts/programming/multi-threaded-file-io.md View File

@ -0,0 +1,316 @@
# Background
Suppose that you have a large quantity of files that you want your program to ingest.
Is it faster to read each file sequentially, or to read the files in parallel using multiple threads?
Searching for this question online gives varied answers from confused people on discussion threads.
Reading recent research papers on multi threaded IO did not quite provide a clear answer to my question.

A ton of people argue that multiple threads don't increase file throughput
because at the end of the day an HDD can only read
one file at a time. Adding more CPU cores into the mix would
actually slow down the file ingest because the HDD would have to
take turns reading fragments of different files. The seek time of the HDD would
heavily degrade the performance.

Other people on the internet claim that using the same number of
threads as your computer has is the most efficient way to read in files.
I did not find any evidence backing these claims.
However, I did find plenty of dead links. You gotta love old internet forums.
The argument typically goes that using more threads will decrease the idle time of
the HDD. Plus, if you are running any form of RAID or software-defined storage like
[CEPH](https://ceph.com/ceph-storage/) you have plenty to gain from using multiple threads.
If you use more threads than your CPU has, you will obviously suffer performance wise because
the threads will be idle while they wait for each other to finish.

I decided to set out and determine the fastest approach by taking empirical measurements.
# Results
## Small Files
I started the experiment by reading 500 randomly generated
10 KB text files. I tested using a few hard drive types and
configurations. The computer which I ran all the tests on
has a Ryzen 1700x processor with 16 threads and 8 cores.
Each data point is the average of 20 trials.
![slow HDD reading 500 10kb files](media/fileIO/100ThreadsOldSpinningHHD.png)
![slow HDD reading 500 10kb files](media/fileIO/slowDrive500.png)
![Raidz 2 500 10kb files](media/fileIO/FreenasServer.png)
![nvme SSD drive](media/fileIO/nvmeDrive.png)
![nvme SSD drive](media/fileIO/nvme500Threads.png)
## Big Files
After seeing the astonishing results of reading 10 KB files,
I wondered if anything would change if the files were larger.
Unlike the last set of trials, I did not average multiple trials
for each data point. Although averaging would have made smoother graphs,
it would have taken way too long on the HDD I was using.
I only tested with 1 MB and 3.6 MB as the "large" files because any larger
files *really* slowed down the tests. I initially wanted to test with 18 MB
files running 10 trials at each thread interval. I backed away from doing that
because it would have required terabytes of file IO, and Java *really* does
not like to read 18 MB files as strings.
![1MB files](media/fileIO/1mbFilesx124OldHHD.png)
![3.6MB files](media/fileIO/3.6MBFilex31.png)
# Conclusions
For small files it is clear that using more CPU threads gives you a
very substantial boost in performance. However, once you use the same
number of threads as your CPU has, you hit the optimal performance and any additional
threads decrease performance.
For larger files, more threads do not equate to better performance. Although
there is a small bump in performance using two threads, any more threads heavily
degrade performance.
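If you want to act on the small-file result, the thread count can be taken from the JVM instead of being hard coded. The following is a rough sketch, not the benchmark code: it assumes `Runtime.availableProcessors()` reports the number of logical threads, and it uses a standard `ExecutorService` rather than the task manager shown later. `PoolSizing` and `runAll` are names I made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PoolSizing
{
    /** Runs all jobs on a pool with one thread per logical CPU and returns their results in order. */
    static <T> List<T> runAll(List<Callable<T>> jobs) throws Exception
    {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try
        {
            List<T> results = new ArrayList<>();
            // invokeAll blocks until every job has finished
            for (Future<T> f : pool.invokeAll(jobs))
                results.add(f.get());
            return results;
        }
        finally
        {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception
    {
        // stand-in jobs; in the experiment each job would read one file
        List<Callable<Integer>> jobs = new ArrayList<>();
        for (int i = 0; i < 100; i++)
        {
            final int n = i;
            jobs.add(() -> n * 2);
        }
        System.out.println(runAll(jobs).size() + " jobs finished");
    }
}
```

For large files the measurements above suggest a much smaller pool, so the pool size would probably need to depend on the file size as well.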
![CPU Usage During Big File Test](media/fileIO/bigFileCPU.png)
Reading large files like this totally renders your computer unusable.
While I was running the 3.6 MB file test my computer was completely unresponsive.
I had to fight just to get the screenshot that I included in this blog post.
# Code
All of the code I wrote for this can be found on my [GitHub](https://github.com/jrtechs/RandomScripts/tree/master/multiThreadedFileIO);
however, I also want to put it here so I can make a few remarks about it.
## Basic File IO
This first class is used to represent a task which we want to run in parallel with other tasks. In this case it is just
reading a file from the disk. Nothing exciting about this file.
```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

/**
 * Simple task to be used by the task manager to do
 * file io.
 *
 * @author Jeffery Russell
 */
public class ReadTask
{
    /** Path of the file this task reads */
    private String filePath;

    public ReadTask(String filePath)
    {
        this.filePath = filePath;
    }

    /** Reads the whole file into a string, which is then discarded. */
    public void runTask()
    {
        String fileContent = new String();
        try
        {
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(new FileInputStream(filePath)));
            String line;
            while ((line = br.readLine()) != null)
                fileContent = fileContent.concat(line);
            br.close();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
```
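As an aside, calling `String.concat` in a loop copies the accumulated string on every line, so the cost grows roughly quadratically with file size. That may be part of why the larger files were so painful. Here is a minimal sketch of two alternatives, assuming Java 8 or newer and UTF-8 text. `FastRead`, `readWithBuilder`, and `readAllAtOnce` are illustrative names and are not part of the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FastRead
{
    /** Reads a file line by line, dropping line breaks like the loop in ReadTask does. */
    static String readWithBuilder(Path path) throws IOException
    {
        StringBuilder content = new StringBuilder();
        try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8))
        {
            String line;
            while ((line = br.readLine()) != null)
                content.append(line); // appends in place instead of copying the whole string
        }
        return content.toString();
    }

    /** Reads the whole file in one call, keeping line breaks. */
    static String readAllAtOnce(Path path) throws IOException
    {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException
    {
        Path tmp = Files.createTempFile("fastRead", ".txt");
        Files.write(tmp, "hello\nworld\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readWithBuilder(tmp).length());
        System.out.println(readAllAtOnce(tmp).length());
        Files.delete(tmp);
    }
}
```

Swapping either method into the task would change what is being measured, so the timings in this post only apply to the `concat` version.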
## Multi Threaded Task Manager
This is where the exciting stuff happens. Essentially, this class lets you specify
a list of tasks which must be completed.
This class will then run all of those tasks in parallel using X threads
until they are all complete. What is interesting is how I managed the race conditions
to prevent multi threaded errors while keeping the execution of tasks efficient.
```java
import java.util.List;
import java.util.Vector;

/**
 * A class which enables the user to run a large chunk of
 * tasks in parallel efficiently.
 *
 * @author Jeffery 1-29-19
 */
public class TaskManager
{
    /** Number of threads to use at once */
    private int threadCount;

    /** Meaningless tasks to run in parallel */
    private List<ReadTask> tasks;

    public TaskManager(int threadCount)
    {
        this.threadCount = threadCount;
        //using vectors because they are thread safe
        this.tasks = new Vector<>();
    }

    public void addTask(ReadTask t)
    {
        tasks.add(t);
    }

    /**
     * This is the fun method.
     *
     * This will run all of the tasks in parallel using the
     * desired amount of threads until all of the jobs are
     * complete.
     */
    public void runTasks()
    {
        int desiredThreads = threadCount > tasks.size() ?
                tasks.size() : threadCount;
        Thread[] runners = new Thread[desiredThreads];
        for(int i = 0; i < desiredThreads; i++)
        {
            runners[i] = new Thread(()->
            {
                ReadTask t = null;
                while(true)
                {
                    //need synchronized block to prevent
                    //race condition between isEmpty and remove
                    synchronized (tasks)
                    {
                        if(!tasks.isEmpty())
                            t = tasks.remove(0);
                    }
                    if(t == null)
                    {
                        break;
                    }
                    else
                    {
                        t.runTask();
                        t = null;
                    }
                }
            });
            runners[i].start();
        }
        //wait for every worker to finish before returning
        for(int i = 0; i < desiredThreads; i++)
        {
            try
            {
                runners[i].join();
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
        }
    }
}
```
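The `synchronized` block is needed because `isEmpty()` and `remove(0)` are two separate calls, and another thread could take the last task in between them. One alternative, sketched here and not what the code above does, is a concurrent queue whose `poll()` checks and removes in a single atomic step. `QueueDrain` and `drain` are hypothetical names, and plain `Runnable`s stand in for `ReadTask`.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

class QueueDrain
{
    /**
     * Runs every queued task using the given number of worker threads.
     * The queue must be thread safe, e.g. a ConcurrentLinkedQueue.
     */
    static void drain(Queue<Runnable> tasks, int threadCount) throws InterruptedException
    {
        Thread[] runners = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            runners[i] = new Thread(() ->
            {
                Runnable t;
                // poll() returns null once the queue is empty, so no
                // separate isEmpty() check and no explicit lock is needed
                while ((t = tasks.poll()) != null)
                    t.run();
            });
            runners[i].start();
        }
        for (Thread runner : runners)
            runner.join();
    }

    public static void main(String[] args) throws InterruptedException
    {
        Queue<Runnable> tasks = new ConcurrentLinkedQueue<>();
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < 500; i++)
            tasks.add(done::incrementAndGet);
        drain(tasks, 8);
        System.out.println(done.get() + " tasks completed");
    }
}
```

I have not benchmarked this variant, so whether it changes any of the graphs above is an open question.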
## Random Data Generation
To prevent caching or anything funky, I wanted to use completely random files of equal size.
I initially generated a random character and then concatenated that onto a string.
After repeating that step a few thousand times, you have a random string to save
to your disk.
```java
private static char rndChar()
{
    // or use Random or whatever
    int rnd = (int) (Math.random() * 52);
    char base = (rnd < 26) ? 'A' : 'a';
    return (char) (base + rnd % 26);
}
```
Problem: string concatenation is terribly inefficient in Java.
When attempting to make an 18 MB file it took nearly 4 minutes and I wanted
to create 500 files. Yikes.
Solution: create a random byte array so I don't need to do any string concatenations.
This turned out to be **very** fast.
```java
for(int i = 0; i < 500; i++)
{
    byte[] array = new byte[2000000];
    new Random().nextBytes(array);
    String s = new String(array, Charset.forName("UTF-8"));
    saveToDisk(s, "./testData/" + i + ".txt");
    System.out.println("Saved " + i + ".txt");
}
```
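The `saveToDisk` helper is not shown in this post, so here is a self-contained sketch of the same idea that writes the random bytes straight to disk. It assumes exact file sizes matter: decoding random bytes as UTF-8 replaces invalid sequences, so a file saved from the resulting string will typically not be exactly as long as the byte array. `RandomFiles` and `generate` are names invented for this example, and the files it produces are raw bytes rather than text.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

class RandomFiles
{
    /** Writes fileCount files of exactly sizeBytes random bytes into dir. */
    static void generate(Path dir, int fileCount, int sizeBytes) throws IOException
    {
        Files.createDirectories(dir);
        Random rnd = new Random();
        byte[] buffer = new byte[sizeBytes];
        for (int i = 0; i < fileCount; i++)
        {
            rnd.nextBytes(buffer); // refill the same buffer for each file
            Files.write(dir.resolve(i + ".txt"), buffer);
        }
    }

    public static void main(String[] args) throws IOException
    {
        // small numbers and a temp directory for a quick demo
        Path dir = Files.createTempDirectory("testData");
        generate(dir, 5, 10_000);
        System.out.println("Wrote 5 files to " + dir);
    }
}
```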
## Running the Experiments
I created an ugly main method to run all the experiments with the
task manager. To run trials with a different number of CPU threads or sample size,
I simply adjusted the loop variables.
```java
import java.util.*;

/**
 * File to test the performance of multi threaded file
 * io by reading a large quantity of files in parallel
 * using a different amount of threads.
 *
 * @author Jeffery Russell 1-31-19
 */
public class MultiThreadedFileReadTest
{
    public static void main(String[] args)
    {
        List<Integer> x = new ArrayList<>();
        List<Double> y = new ArrayList<>();
        for(int i = 1; i <= 64; i++) //thread count
        {
            long threadTotal = 0;
            for(int w = 0; w < 20; w++) // sample size
            {
                TaskManager boss = new TaskManager(i);
                for(int j = 0; j < 500; j++) // files to read
                {
                    //index by j so each task reads a different file
                    boss.addTask(new ReadTask("./testData/" + j + ".txt"));
                }
                long startTime = System.nanoTime();
                boss.runTasks();
                long endTime = System.nanoTime();
                long durationMS = (endTime - startTime)/1000000;
                threadTotal+= durationMS;
            }
            x.add(i);
            y.add(threadTotal/20.0); //finds average
        }
        System.out.println(x);
        System.out.println(y);
    }
}
```
## Graphing the Results
I am not going to lie, most Java graphics libraries are terrible.
Simply pasting the output of a Java list into [matplotlib](https://matplotlib.org/)
is the easiest way to make presentable graphs. If I had more time, I would export the results to
JSON for a JavaScript graph like [D3](https://d3js.org/) which I could embed on my website.
```python
import matplotlib.pyplot as plt
xList = []
yList = []
plt.plot(xList, yList)
plt.xlabel('Number of Threads')
plt.ylabel('Execution Time (MS)')
plt.show()
```
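As a starting point for the JSON export mentioned above, the two result lists could be serialized by hand, since they only contain numbers. This is a sketch with a made-up class name and a made-up output shape (`threads` and `avgMs` keys). It is not in the repository, and it assumes every average is a finite number.

```java
import java.util.Arrays;
import java.util.List;

class ResultsJson
{
    /** Builds a JSON object holding two numeric arrays. */
    static String toJson(List<Integer> threads, List<Double> avgMs)
    {
        // List.toString() on numbers yields "[1, 2, 3]", which is
        // already a valid JSON array as long as the values are finite
        return "{\"threads\": " + threads + ", \"avgMs\": " + avgMs + "}";
    }

    public static void main(String[] args)
    {
        // placeholder values, not measurements from this post
        System.out.println(toJson(Arrays.asList(1, 2), Arrays.asList(10.5, 7.25)));
    }
}
```

In the test program this would replace the two `System.out.println` calls on `x` and `y`.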

+ 0
- 6
docs/sqlConfig.md View File

@ -43,12 +43,6 @@ download_count mediumint not null,
primary key(download_id)
);
create table popular_posts(
popular_post_id mediumint unsigned not null AUTO_INCREMENT,
post_id mediumint unsigned not null,
primary key(popular_post_id)
);
create table traffic_log(
log_id mediumint unsigned not null AUTO_INCREMENT,
url varchar(60) not null,
