




% \documentclass[12pt]{extarticle} 



\documentclass[12pt]{apa6} 



\usepackage[utf8]{inputenc} 



\usepackage{cite} 







\usepackage{setspace} 



\doublespacing 







\usepackage{hyperref} 







\title{K-means Clustering Over Time}



\shorttitle{Russell} 



% 



\author{Author: Jeffery B. Russell\\ Reviewer: Daniel Moore} 



\affiliation{% 



Computer Science at RIT\\ 



}% 







\begin{document} 







\maketitle 











\section{Overview} 







% abstract y section 







With the ubiquity of data in today's age, machine learning has become a driving force in research and innovation in both the public and private sectors. Due to the vast quantity of data generated by the internet, unsupervised learning has been at the forefront of data science over the past few decades. Clustering has consistently been an essential way of extracting information from large sets of data.







This project breaks down and analyzes two papers that use k-means clustering. In 1967, J. MacQueen wrote the historical article that introduced the original k-means clustering algorithm and coined the term k-means\cite{kmeans}. We review MacQueen's paper because it provides useful historical insight into where this field of research started and what influenced it. The topics raised in the paper continue to be an exciting area of investigation, and reviewing it lets us compare a relatively old research paper on clustering with a new one. Project Euclid, which hosts 1.8 million pages of open-access content, has this paper on its website\footnote{\url{https://projecteuclid.org/download/pdf_1/euclid.bsmsp/1200512992}}.







In 2015, John Paparrizos and Luis Gravano wrote a paper on clustering time-series data using an algorithm they call k-Shape\cite{kshape}; the k-means algorithm inspired this work. Columbia University's website\footnote{\url{http://web2.cs.columbia.edu/~gravano/Papers/2015/sigmod2015.pdf}} hosts this paper.







For the scope of this project in CSCI471, we cover the k-Shape paper up to section 3, and we exclude section two of the original k-means paper.







\section{Summary} 







% two pages total 



% about one page per article 







This section provides a summary of each article in depth. 







\subsection{K-Means}







% background of clustering 







The general idea of clustering is to group data with similar traits. The main benefit is the ability to extract information from new data: knowing which cluster a new point is most similar to gives you valuable insight into it. In machine learning, clustering is considered unsupervised learning because it requires no labels on the data; the algorithm assigns clusters automatically, and you infer behavior from those clusters.







Clustering has many applications, such as image segmentation, preference prediction, compression, and model fitting.







% historical context of the article 







Although you can trace the idea of k-means clustering back to 1956 with a paper by Hugo Steinhaus\cite{ogClustering}, James MacQueen was the first to coin the term k-means in 1967\cite{kmeans}. MacQueen's paper, titled ``Some Methods for Classification and Analysis of Multivariate Observations,'' goes over the k-means process that segments an N-dimensional population into k sets. Note: when we refer to k in the algorithm, we mean the number of sets into which we are dividing the population.
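
As a minimal sketch of what ``segmenting into k sets'' optimizes, the usual textbook objective can be written as follows. The notation is ours and is an assumption for illustration, not MacQueen's own: $S_1, \ldots, S_k$ are the sets of the partition and $\mu_i$ is the mean of the points in $S_i$.

\[
W(S_1, \ldots, S_k) = \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2,
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x .
\]

A good partition is one for which the within-set sum of squares $W$ is small, that is, one in which every point lies close to the mean of its own set.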







% overview of the article 







A great deal of this article discusses optimality for the k-means algorithm, which is an important topic, especially considering when the article was published. Back in 1967, computers were very slow and expensive. Although methods existed that were guaranteed to find an optimal solution, finding one is an NP-hard problem\cite{nphard}, which made those methods impractical.







Although the k-means algorithm does not guarantee an optimal solution, there is a subset of problems for which it does; the article discusses the specifics of these problems later. Nevertheless, because the algorithm was not computationally expensive and generally gave good results, it was a huge breakthrough at the time.
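
One way to see why the procedure is cheap is to look at a single update step. The sketch below assumes the sequential formulation, in which each new point $x$ is assigned to the cluster with the nearest mean and that mean is then adjusted; the symbols $\mu_j$ (current mean of the chosen cluster) and $n_j$ (number of points already in it) are our own notation.

\[
j = \arg\min_{i \in \{1, \ldots, k\}} \| x - \mu_i \|,
\qquad
\mu_j \leftarrow \mu_j + \frac{x - \mu_j}{n_j + 1},
\qquad
n_j \leftarrow n_j + 1 .
\]

Under these assumptions, each point costs only k distance computations and one mean adjustment, with no need to revisit earlier points.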







In section three, the paper examines specific applications of the k-means algorithm. The experiments were run on an IBM 7094. Section 3.1 looked at clustering high-dimensional data from student documents to find syntactical differences. Section 3.2 took a more theoretical approach to testing the algorithm: the author created four-dimensional data and clustered it into two groups. The algorithm identified the correct class 87 percent of the time.



Section 3.5 presents a unique approach that looks at the lexicographical analysis of written papers.







\subsection{K-Shape}







% background of time series segmentation 







Time series data is a sequence of data points taken at time intervals. Each measured data point can have multiple dimensions. When doing k-means clustering on images, you would ignore the relative x-y position in the image and treat every pixel as a data point to feed into the algorithm. You could do the same thing with time-series data; however, you would be throwing away the time information. Time-series clustering is frequently used for anomaly detection and for pattern identification that later feeds into forecasting.







% historical context of an article 







Due to its immediate applications in finance, for stock prediction, and in medicine, time series analysis is as old as computing itself.







% overview of the article 







The k-Shape clustering algorithm was introduced in 2015 by John Paparrizos and Luis Gravano in their paper titled ``k-Shape: Efficient and Accurate Clustering of Time Series''\cite{kshape}.







Cyclical patterns that repeat over time are unique to time series analysis and require special treatment. Unlike a typical k-means clustering algorithm, k-Shape tries to find shapes in the time-series data that repeat themselves. One cluster, for example, could be a gradual increase in values followed by a sudden drop. Identifying and categorizing these shapes is no trivial task. Compared to standard clustering, clustering on shape is far more time-consuming. Misalignments and magnitude differences have to be accounted for in any time-series clustering algorithm. Aligning two sequences using dynamic time warping is much more time-consuming than a typical point-by-point comparison of two sequences.
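
To illustrate why alignment is costly, the standard textbook recurrence for dynamic time warping can be sketched as follows. It is not taken from the paper, and the notation is ours: $x$ and $y$ are sequences of lengths $m$ and $n$, and $D(i,j)$ is the cheapest cost of aligning the first $i$ points of $x$ with the first $j$ points of $y$.

\[
D(i,j) = (x_i - y_j)^2 + \min \{ D(i-1,j),\; D(i,j-1),\; D(i-1,j-1) \},
\]

with $D(0,0) = 0$ and $D(i,0) = D(0,j) = \infty$ for $i, j \geq 1$. The alignment cost is read from $D(m,n)$, so the whole $m \times n$ table has to be filled. That is on the order of $mn$ operations, whereas a point-by-point Euclidean comparison of two sequences of length $m$ needs only on the order of $m$.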











The three major things discussed in this paper are the distance measure used, the clustering algorithm, and the performance experiments. The k-Shape algorithm uses a centroid-based clustering technique (as opposed to other methods, such as hierarchical bottom-up or top-down clustering) together with a shape-based distance measure.
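
As we read the paper, the shape-based distance (SBD) is built on normalized cross-correlation. The sketch below uses simplified indexing of our own: $x$ and $y$ are two sequences of length $m$, and $R_s(x,y)$ denotes the cross-correlation of $x$ and $y$ when one is shifted by $s$ positions relative to the other, for $s = -(m-1), \ldots, m-1$.

\[
\mathrm{SBD}(x,y) = 1 - \max_{s} \frac{R_s(x,y)}{\sqrt{R_0(x,x) \, R_0(y,y)}} .
\]

The value ranges from 0 to 2, with 0 meaning the two sequences have the same shape at some shift. Taking the maximum over all shifts is what makes the measure tolerant of misalignment, and dividing by the geometric mean of the two autocorrelations removes the effect of overall magnitude.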











The algorithm was tested on the ECGFiveDays dataset. Much like the MNIST dataset for neural networks, the ECGFiveDays dataset is commonly used in time-series analysis. It contains electrocardiogram recordings, which show the electrical activity of the heart over time. Results on this dataset have immediate practical implications, since the rapid detection of anomalies in such data could indicate a fatal health issue. The experiment was able to identify the two classes in the dataset correctly. Additional datasets were used later in the experiments section.















% critique 



\section{Critique} 



% three pages total 



% about one and a half pages per article 







This section critiques each article discussed in the prior section. 







\subsection{K-Means}







% discussion 



% merits of its intellectual or scientific contributions 



%writing quality and delivery 



% importance of the article in the field 







The paper was well constructed and contained few, if any, grammatical errors. Most of the paper focused on the algorithm itself and the math associated with it. Rather than having separate sections for experiments, results, and applications, MacQueen lumped them all into a single application section that looked at different distance measures you could use with the algorithm. Compared to newer papers on the subject, this paper had relatively few computer simulations, reflecting the cost and scarcity of computing at the time.







Other than the math provided for proofs, the paper contained no figures, tables, or other diagrams. This was common for papers at the time, since computer graphics were relatively new and very expensive to create.







Because the k-means algorithm does not always converge on an optimal answer and is profoundly affected by outliers, it is rarely used by itself. However, it is frequently used to bootstrap and influence other algorithms, as we saw with the k-Shape algorithm.



As the author stated initially in his article, ``there is no feasible, general method which always yield an optimal partition.'' Due to the nature of NP-hard problems, this fact is unlikely to change anytime soon. More recently, researchers have revised k-means to choose its starting centers more carefully so that it is less likely to converge on a poor local optimum. This process is called ``k-means++,'' and Arthur and Vassilvitskii outline it in their 2007 paper titled ``k-means++: The Advantages of Careful Seeding''\cite{kplus}.
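
The seeding rule behind k-means++ can be sketched as follows; the notation is ours. The first center is drawn uniformly at random from the data set $X$. For each later center, let $D(x)$ be the distance from a point $x$ to the nearest center chosen so far. The next center is then drawn from $X$ with probability

\[
P(x) = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2} .
\]

Points far from every existing center are therefore more likely to be picked, which spreads the initial centers out before the usual k-means iterations begin.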







Another major question posed after this research is: how do we choose k? Although this paper outlines how we can cluster data into k clusters, it does not mention how we should select k. Newer papers extend the work on k-means and investigate how we can select k automatically using the elbow method or the goodness of variance fit (GVF) from Jenks clustering.
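
For reference, the goodness of variance fit is commonly defined as below. Here SDAM is the sum of squared deviations of all values from the overall mean, and SDCM is the sum of squared deviations of each value from the mean of its own class; this is the usual textbook form and is not taken from either reviewed paper.

\[
\mathrm{GVF} = \frac{\mathrm{SDAM} - \mathrm{SDCM}}{\mathrm{SDAM}} .
\]

GVF lies between 0 and 1 and never decreases as k grows, so in practice one looks for the smallest k at which it stops improving noticeably. The elbow method applies the same reasoning to a plot of the within-cluster sum of squares against k.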











This paper has had a lasting effect on the field of machine learning. Most textbooks and AI classes cover k-means clustering as a starting point when teaching unsupervised learning. Moreover, algorithms to this day still use k-means behind the scenes to preprocess data before it is fed to the next step in the data pipeline.















\subsection{K-Shape}







% discussion 



% merits of its intellectual or scientific contributions 



%writing quality and delivery 



% importance of the article in the field 



Although it is a newer paper, k-Shape has already been picked up by later work. In 2017, a group of researchers looked into how they could use k-Shape to monitor metrics on a distributed system\cite{kshapeuse}.



The k-Shape algorithm also has an open-source implementation in Python\footnote{\url{https://github.com/johnpaparrizos/kshape}} that has a considerable audience. The scalability and performance of the k-Shape algorithm, like those of the original k-means clustering algorithm, have made it very popular.







The writing style of the k-Shape paper was very formal and had no noticeable grammatical issues. The overall structure of the paper is fairly standard for the field: an abstract, a section going over the background, a section describing the algorithm, and then experiments, results, and conclusions sections. One thing that was unique about this paper is that it contained a whopping 90 references. This abundance makes it hard to check all of the references thoroughly. The authors tended to pile on references for the same piece of information. For example, to state in the introduction that time-series data is used in many fields, the authors cited 8 papers. Although this is not a bad thing, it is typically seen as excessive to include multiple references when citing something that is common knowledge in the field.







The experimental trials in the paper were very appropriate for the scope of the work. Rather than testing the algorithm on one or two datasets, the researchers tested it on 48 different datasets. The experiments looked at combinations of different clustering algorithms with different distance measures for time-series clustering and compared them against the k-Shape algorithm presented in the paper.







This project released its source code, something that many projects don't do. The MATLAB source code and the datasets are published on Columbia University's website\footnote{\url{http://www.cs.columbia.edu/~jopa/kshape.html}}. Disclosing source code is notable because it builds the paper's credibility by making it easier to reproduce the results of the experiments. However, it is worth noting that running all the experiments took two months on a 10-server cluster with Intel Xeon processors.







As with many research articles, the verbiage of the k-Shape paper makes it hard for people outside of this sphere of research to digest. Acronyms like Dynamic Time Warping (DTW) and Euclidean Distance (ED) were used frequently, which is typical for research papers. Some terms, like ECG (electrocardiogram), were never even defined in the paper. An appendix of acronyms would help less versed readers understand the research paper.















% summary 



\section{Conclusion}







% in conclusions wrap up the paper 







Although first introduced in 1967, k-means continues to be a flourishing area of research in computer science. As we continue to gather and produce more data on the internet, the state of the art in clustering has focused on time series and imagery to extract critical information.















\bibliographystyle{plain} 



\bibliography{ref} 







\end{document} 