Final changes

5 years ago · 0084d663b8
--- a/draft.tex
+++ b/draft.tex
@@ -8,14 +8,14 @@

 \usepackage{hyperref}

 \title{K-means Clustering Over Time}
 \title{K-means Clustering Over Time\\CSCI-471-02}
 \shorttitle{Russell}
 %
 \author{Author: Jeffery B. Russell\\ Reviewer: Daniel Moore}
 \author{Author: Jeffery B. Russell\\ Reviewer: Daniel Moore\\Submitted: April 23, 2020}
 \affiliation{%
 Computer Science at RIT\\
 }%

 \date{\today}
 \begin{document}

 \maketitle
@@ -54,11 +54,11 @@ Although you can trace the idea of k-means clustering back to 1967 with a paper

 % overview of the article

 A great deal of this article discusses optimally for the k-means algorithm, which is an important area to discuss, especially when considering the time at which the article got published. Back in 1967, computers were very slow and expensive. Although we had proofs that can guarantee that we could find an optimal solution, they were a NP-Hard problem\cite{np-hard}. 
 A great deal of this article discusses optimality for the k-means algorithm, which is an important topic, especially considering when the article was published. Back in 1967, computers were very slow and expensive. Although an optimal solution is guaranteed to exist, finding it is an NP-Hard problem\cite{np-hard}. This is critical because no known algorithm solves NP-Hard problems in polynomial time, so exact solutions quickly become impractical as the data grows.

 Although the k-means algorithm did not guarantee the optimal solution, there was a subset of problems for which it did guarantee an optimal solution. The specifics of these problems are discussed later in the article. Nevertheless, since this algorithm wasn't computationally expensive and generally gave good results, it was a huge breakthrough at the time.
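
 To make the notion of optimality concrete, the objective usually associated with k-means can be sketched as follows. The notation here is our own assumption rather than MacQueen's: $x_1, \dots, x_n$ are the data points, $S_1, \dots, S_k$ are the clusters, and $\mu_j$ is the mean of cluster $S_j$.
 \[
 \min_{S_1, \dots, S_k} \sum_{j=1}^{k} \sum_{x_i \in S_j} \| x_i - \mu_j \|^2,
 \qquad
 \mu_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i
 \]
 An optimal partition is one that attains this minimum; the heuristic may instead stop at a partition that is only locally optimal.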

 In section three, the paper examines specific applications of the k-means algorithm. The paper ran these experiments with an IBM 7094. Section 3.1 looked at clustering student documents data in high dimensions to find syntactical differences. Section 3.2 looked at a more theoretical way of testing the document. They created four-dimensional data and clustered them into two groups. After running the algorithm, it was able to identify the correct class 87 percent of the times correctly. 
 In section three, the paper examines specific applications of the k-means algorithm. The paper ran these experiments on an IBM 7094. Section 3.1 looked at clustering student document data in high dimensions to find syntactical differences. Section 3.2 looked at a more theoretical and mathematical way to test the k-means algorithm. They created four-dimensional data and clustered it into two groups. After running the algorithm, it was able to identify the correct class 87 percent of the time.
 Section 3.5 poses a unique approach that looks at the lexicographical analysis of papers written. 

 \subsection{K-Shape}
@@ -73,15 +73,15 @@ Due to its immediate applications in fields of finance for stock prediction and

 % overview of the article

 The K-shape clustering algorithm was introduced in 2015 by Paparrizos J and Gravano L in their paper title "k-Shape: Efficient and Accurate Clustering of Time Series"\cite{k-shape}. 
 In 2015, Paparrizos J and Gravano L introduced the K-shape clustering algorithm in their paper titled "k-Shape: Efficient and Accurate Clustering of Time Series"\cite{k-shape}.

 Cyclical patterns that repeat over time is unique to time series analysis and requires special treatment. Unlike a typical k-means clustering algorithm, k-shape tries to find these shapes in the time series data that repeat itself. One cluster, for example, could be a gradual increase in values and then a sudden drop. Identification and categorization of these shapes is no trivial task. Compared to standard clustering, clustering on shape is far more time-consuming. Miss alignments and magnitude differences have to get accounted for in any time-series clustering algorithm. Aligning two sequences using dynamic time warping is much more time consuming than a typical comparison of two sequences.
 Cyclical patterns that repeat over time are unique to time series analysis and require special treatment. Unlike a typical k-means clustering algorithm, k-shape tries to find the shapes in the time series data that repeat themselves. One cluster, for example, could be a gradual increase in values and then a sudden drop. Identification and categorization of these shapes is no trivial task. Compared to standard clustering, clustering on shape is far more time-consuming. Misalignments and magnitude differences have to be accounted for in any time-series clustering algorithm. Aligning two sequences using dynamic time warping is much more time-consuming than a typical comparison of two sequences.


 The three major things discussed in this paper are the distance measure used, the clustering algorithm, and performance experiments. The k-shape algorithm used a centroid based clustering technique (different from other methods like hierarchically and bottom-up/topdown) that uses a shaped based distance measure. 
 The three major things discussed in this paper are the distance measure used, the clustering algorithm, and performance experiments. The k-shape algorithm uses a centroid-based clustering technique (different from other methods like hierarchical and bottom-up/top-down clustering) that relies on a shape-based distance measure.
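
 As a rough illustration of the shape-based distance (SBD), and assuming our reading of the definition in the k-shape paper is accurate, two z-normalized sequences $x$ and $y$ are compared through their cross-correlation $CC_w(x, y)$ at every shift $w$:
 \[
 SBD(x, y) = 1 - \max_{w} \frac{CC_w(x, y)}{\sqrt{R_0(x, x) \, R_0(y, y)}}
 \]
 Here $R_0(x, x)$ denotes the cross-correlation of $x$ with itself at zero shift, so the fraction is a normalized correlation. The distance ranges from 0 for sequences with identical shape up to 2.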


 This algorithm got tested on the ECGFiveDays dataset. Similar to the MNIST dataset for neural networks, the ECGFiveDays dataset commonly gets used in time-series analysis. This dataset looks at an electrocardiograph, which demonstrates the electrical activity of the heart over time. This dataset and implications have immediate usage since the rapid detection of anomalies in the data could indicate a fatal health issue. The experiment was able to identify the two different classes in the dataset correctly. Additional datasets were used in the experiments section later on. 
 This algorithm was tested on the ECGFiveDays dataset. Similar to the MNIST dataset for neural networks, the ECGFiveDays dataset is commonly used in time-series analysis. This dataset contains electrocardiogram recordings, which show the electrical activity of the heart over time. The dataset has immediate practical implications since the rapid detection of anomalies in an electrocardiogram can help prevent fatal health issues. The experiment was able to identify the two different classes in the dataset correctly. Additional datasets were used in the experiments section later on.



@@ -99,14 +99,14 @@ This section critiques each article discussed in the prior section.
 %writing quality and delivery
 % importance of the article in the field

 The paper was well constructed and contained little to no grammatical errors. Most of the paper focused on the algorithm itself and all of the math that was associated with it. Rather than having separate sections for experiments, results, and applications, Macqueen lumped them all into a single application section that looked at different distance measures you could use with the algorithm. Compared to newer papers on the subject, this paper had relatively few computer simulations reflecting the cost and scarcity of computing at the time. 
 The paper was well constructed and contained few, if any, grammatical errors. Most of the paper focused on the algorithm itself and all of the math that was associated with it. Rather than having separate sections for experiments, results, and applications, MacQueen lumped them all into a single application section that looked at different distance measures you could use with the algorithm. Compared to newer papers on the subject, this paper had relatively few computer simulations, reflecting the cost and scarcity of computing at the time.

 Other than the math provided for proofs, the paper provides no figures, tables, or other diagrams. This was common for papers at the time, since computer graphics were relatively new and very expensive to create.

 Because the k-means algorithm does not always converge on an optimal answer and is profoundly affected by outliers, it is rarely used by itself. However, it is frequently used to bootstrap and influence other algorithms, as we saw in the K-shape algorithm.
 As the author stated initially in his article, "there is no feasible, general method which always yield an optimal partition." Due to the nature of NP-Hard problems, this fact is unlikely to change anytime soon. More recently, people have rebooted k-means to include a beam search to avoid converging on local maxima-- this process is called "k-means++" and Vassilvitskii outlines in his  2007 paper titled "K-means++: the advantages of careful seeding"\cite{kplus}.
 As the author stated initially in his article, "there is no feasible, general method which always yield an optimal partition." Due to the nature of NP-Hard problems, this fact is unlikely to change anytime soon. More recently, researchers have improved k-means by choosing the initial cluster centers with a careful, randomized seeding step, which reduces the risk of converging on a poor local optimum. This process is called "k-means++", and Arthur and Vassilvitskii outline it in their 2007 paper titled "K-means++: the advantages of careful seeding"\cite{kplus}.

 Another major question posed after this research is: how do we choose k? Although this paper outlines how we can cluster data into k clusters, it did not mention how we should select k. Newer papers are extending the work with k-means and investigating how we can automatically select k using the elbow technique or the goodness value fit GVF in Jenks clustering.
 Another major question posed after this research is: how do we choose k? Although this paper outlines how we can cluster data into k clusters, it does not mention how we should select k. Newer papers are extending the work with k-means and investigating how we can automatically select k using the elbow technique or the goodness of variance fit (GVF) in Jenks clustering.


 This paper has had a lasting effect on the field of machine learning. Most textbooks and AI classes cover k-means clustering as a starting point when teaching people about unsupervised learning. Moreover, algorithms to this day are still using k-means as a tool behind the scenes to pre-process data before it gets fed to the next step in the data pipeline. 
@@ -130,7 +130,7 @@ This project released the source code -- something that many projects don't do.
 kshape
 .html}}. Disclosing source code is notable because it builds credibility for the paper by making it easier to reproduce the results of the experiments. However, it is worth noting that running all the experiments took two months on a 10-server cluster with Intel Xeon processors.

 Similar to many research articles, the verbiage of the k-shape paper makes it hard for people outside of this sphere of research to digest. Acronyms like Dynamic Time Warping(DTW) and Euclidean Distance(ED) were frequently used -- which is typical for research papers. Some terms, like ECG (Electrocardiograph), were even never defined in the paper. An appendix with acronyms would help less versed readers understand the research paper.  
 Similar to many research articles, the wording of the k-shape paper makes it hard for people outside of this sphere of research to digest. Acronyms like Dynamic Time Warping (DTW) and Euclidean Distance (ED) were frequently used, which is typical for research papers. Some terms, like ECG (electrocardiogram), were never even defined in the paper. An appendix with acronyms would help less versed readers understand the research paper.



@@ -139,8 +139,7 @@ Similar to many research articles, the verbiage of the k-shape paper makes it ha

 % in conclusions wrap up the paper

 Although being first introduced in 1967, k-means continues to be a flourishing facet of research in computer science. As we continue to gather and produce more data on the internet, state of the art for clustering has focused on time series and imagery to extract critical information. 

 Although first introduced in 1967, k-means continues to be a flourishing area of research in computer science. As we continue to gather and produce more data on the internet, state-of-the-art clustering research has focused on time series and imagery to extract critical information\cite{state-of-art}.


 \bibliographystyle{plain}
--- a/ref.bib
+++ b/ref.bib
@@ -66,4 +66,6 @@ doi = {10.1145/1283383.1283494}
  title        = {Sieve: Actionable Insights from Monitored Metrics in Distributed Systems}
  booktitle    = {Proceedings of Middleware Conference (Middleware)},
  year         = {2017},
 }
 }

@ARTICLE{state-of-art,  author={A. {Ahmad} and S. S. {Khan}},  journal={IEEE Access},  title={Survey of State-of-the-Art Mixed Data Clustering Algorithms},   year={2019},  volume={7},  number={},  pages={31883-31902},}