PAMAE: PArallel k-Medoids clustering with high Accuracy and Efficiency

1. Overview

The k-medoids algorithm is one of the best-known clustering algorithms. However, it is not as widely used for big data analytics as the k-means algorithm, mainly because of its high computational complexity. Many studies have attempted to solve the efficiency problem of the k-medoids algorithm, but all such studies have improved efficiency at the expense of accuracy. In this paper, we propose a novel parallel k-medoids algorithm, called PAMAE, that achieves both high accuracy and high efficiency. We identify two factors, "global search" and "entire data", that are essential to achieving high accuracy but are very time-consuming if considered simultaneously. Our key idea is therefore to apply them individually through two phases, parallel seeding and parallel refinement, neither of which is costly. The first phase performs global search over sampled data, and the second phase performs local search over the entire data. Our theoretical analysis proves that this serial execution of the two phases leads to an accurate solution that would be achieved by global search over the entire data. To validate the merit of our approach, we implement PAMAE on Spark as well as Hadoop and conduct extensive experiments using various real-world data sets on 12 Microsoft Azure machines (48 cores). The results show that PAMAE significantly outperforms most of the recent parallel algorithms and, at the same time, produces clustering quality comparable to that of the previous most accurate algorithm.
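
The two-phase idea can be illustrated with a small single-machine sketch. This is not the code in this repository: the class name TwoPhaseSketch, the greedy swap search standing in for the global search, and the choice to pick the best sample by its error over the whole data set are all assumptions made for illustration, and nothing here is parallel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative single-machine sketch only; names and details are assumptions.
public class TwoPhaseSketch {
    // Euclidean distance between two points.
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Index of the medoid closest to p (the first one wins on ties).
    static int nearest(double[] p, double[][] medoids) {
        int best = 0;
        for (int j = 1; j < medoids.length; j++)
            if (dist(p, medoids[j]) < dist(p, medoids[best])) best = j;
        return best;
    }

    // Clustering error: sum of distances from each object to its nearest medoid.
    static double cost(double[][] data, double[][] medoids) {
        double total = 0;
        for (double[] p : data) total += dist(p, medoids[nearest(p, medoids)]);
        return total;
    }

    // PAM-style swap search on a small sample: keep any (medoid, candidate)
    // swap that lowers the sample's error, until no swap helps.
    static double[][] pam(double[][] sample, int k) {
        double[][] medoids = new double[k][];
        for (int j = 0; j < k; j++) medoids[j] = sample[j];
        double bestCost = cost(sample, medoids);
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int j = 0; j < k; j++) {
                for (double[] candidate : sample) {
                    double[] old = medoids[j];
                    medoids[j] = candidate;
                    double c = cost(sample, medoids);
                    if (c < bestCost - 1e-12) { bestCost = c; improved = true; }
                    else medoids[j] = old;
                }
            }
        }
        return medoids;
    }

    // Phase I (seeding): search several random samples (sampleSize >= k) and
    // keep the medoid set with the lowest error over the entire data (assumption).
    static double[][] seed(double[][] data, int k, int sampleSize, int numSamples, Random rnd) {
        double[][] best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int s = 0; s < numSamples; s++) {
            double[][] sample = new double[sampleSize][];
            for (int i = 0; i < sampleSize; i++) sample[i] = data[rnd.nextInt(data.length)];
            double[][] medoids = pam(sample, k);
            double c = cost(data, medoids);
            if (c < bestCost) { bestCost = c; best = medoids; }
        }
        return best;
    }

    // Phase II (refinement): assign every object to its nearest seed, then
    // move each medoid to the object that minimizes the error within its cluster.
    static double[][] refine(double[][] data, double[][] seeds) {
        int k = seeds.length;
        List<List<double[]>> clusters = new ArrayList<>();
        for (int j = 0; j < k; j++) clusters.add(new ArrayList<double[]>());
        for (double[] p : data) clusters.get(nearest(p, seeds)).add(p);
        double[][] refined = new double[k][];
        for (int j = 0; j < k; j++) {
            refined[j] = seeds[j];
            double bestSum = Double.POSITIVE_INFINITY;
            for (double[] candidate : clusters.get(j)) {
                double sum = 0;
                for (double[] p : clusters.get(j)) sum += dist(candidate, p);
                if (sum < bestSum) { bestSum = sum; refined[j] = candidate; }
            }
        }
        return refined;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        double[][] data = new double[1000][];
        for (int i = 0; i < data.length; i++)  // two synthetic 2D blobs
            data[i] = new double[]{(i % 2) * 10 + rnd.nextGaussian(), rnd.nextGaussian()};
        double[][] seeds = seed(data, 2, 80, 5, rnd);
        double[][] refined = refine(data, seeds);
        System.out.println("error after Phase I:  " + cost(data, seeds));
        System.out.println("error after Phase II: " + cost(data, refined));
    }
}
```

In this sketch, the expensive swap search only ever touches a sample, while the pass over the full data set is a cheap assign-and-update step. One refinement pass cannot raise the clustering error, because each seed stays a candidate for its own cluster.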

2. Algorithms

PAM-MR [1]
FAMES-MR [2]
CLARA-MR' and CLARA-MR [3]
GREEDI [4]
MR-KMEDIAN [5]
PAMAE-Hadoop : Hadoop implementation of our algorithm
PAMAE-Spark : Spark implementation of our algorithm

References
[1] Xianfeng Yang and Liming Lian. 2014. A New Data Mining Algorithm Based on MapReduce and Hadoop. Int’l J. of Signal Processing, Image Processing, and Pattern Recognition 7, 2 (2014), 131–142.
[2] Ying-ting Zhu, Fu-zhang Wang, Xing-hua Shan, and Xiao-yan Lv. 2014. KMedoids Clustering Based on MapReduce and Optimal Search of Medoids. In Proc. 9th Int’l Conf. on Computer Science and Education. 573–577.
[3] Pelle Jakovits and Satish Narayana Srirama. 2013. Clustering on the Cloud: Reducing CLARA To MapReduce. In Proc. 2nd Nordic Sympo. on Cloud Computing and Internet Technologies. 64–71.
[4] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. 2013. Distributed Submodular Maximization: Identifying Representative Elements in Massive Data. In Proc. 27th Annual Conf. on Neural Information Processing Systems. 2049–2057.
[5] Alina Ene, Sungjin Im, and Benjamin Moseley. 2011. Fast Clustering using MapReduce. In Proc. 17th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining. 681–689.

3. Data Sets

Name	# Objects	# Dimensions	Size	Type	Link
Covertype	581,102	55	71 MB	int	link
Census1990	2,458,285	68	324 MB	int	link
Cosmo50	315,086,245	3	13.6 GB	float	link
TeraClickLog	4,373,472,329	13	300 GB	float	link 1,2,3

4. Configuration

We conducted experiments on 12 Microsoft Azure D12v2 instances located in Japan.
Each instance has four cores, 28 GB of main memory, and 200 GB of disk (SSD).
All instances ran Ubuntu 14.04. We used Hadoop 2.7.1 and Spark 1.6.1 for distributed parallel processing.
Ten out of 12 instances were used as worker nodes, and the remaining two instances were used as master nodes.
All the algorithms were written in the Java programming language and run on JDK 1.7.0_101.

5. How to Run

Compile.
- Download the spark-assembly library from Apache Spark.
- Make a jar file using IDE tools. For example, in Eclipse you can create it through project name->export->jar file. Alternatively, download the prebuilt jar files from the jar folder.
Create Azure HDInsight instances.
- Refer to the HDInsight documentation.
Move the data sets into the HDFS folder.
- Download all data sets from the above links and move them to the Azure master node.
- Transfer your data sets from the Azure master node into HDFS.
  hadoop dfs -put [your master node path] [hdfs path]
Run the algorithms using the following commands:
- PAM-MR
  hadoop jar [jar file] [main class] [hdfs input path] [hdfs output path] [# of medoids]
- FAMES-MR
  hadoop jar [jar file] [main class] [hdfs input path] [hdfs output path] [# of medoids]
- CLARA-MR' (type : 0), CLARA-MR (type : 1)
  hadoop jar [jar file] [main class] [hdfs input path] [hdfs output path] [# of samples] [# of medoids] [type]
- GREEDI
  hadoop jar [jar file] [main class] [hdfs input path] [hdfs output path] [# of partitions] [# of medoids]
- MR-KMEDIAN
  Iterative Sampling
  hadoop jar [jar file] [main class] [hdfs input path] [hdfs output path] [# of medoids] [upsilon] [# of cores] [# of objects]
  Weighted K-Median
  hadoop jar [jar file] [main class] [hdfs input path] [hdfs sample path] [hdfs output path] [# of medoids] [# of cores]
- PAMAE-Hadoop
  hadoop jar [jar file] [main class] [hdfs input path] [hdfs output path] [# of sampled objects] [# of samples] [# of medoids] [# of partitions]
- PAMAE-Spark
  spark-submit --class [main class] [jar file] wasb://[hdfs address]/[hdfs input path] [# of medoids] [# of sampled objects] [# of samples] [# of partitions] [# of iterations for Phase II]
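
The positional arguments are easy to mix up, because PAMAE-Spark takes the number of medoids first while PAMAE-Hadoop takes it third. The helper below is a hypothetical sketch that assembles the PAMAE-Spark argument list in the order shown above. The class name PamaeSparkArgs, the sanity checks, and the example values for partitions and iterations are assumptions, not part of the repository.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds the PAMAE-Spark command in the documented order.
public class PamaeSparkArgs {
    static List<String> build(String mainClass, String jarFile, String inputUri,
                              int numMedoids, int sampledObjects, int numSamples,
                              int numPartitions, int phase2Iterations) {
        // Sanity checks are assumptions of this sketch, not rules enforced by PAMAE.
        if (numMedoids < 1 || numSamples < 1 || numPartitions < 1 || phase2Iterations < 1)
            throw new IllegalArgumentException("all counts must be positive");
        if (sampledObjects < numMedoids)
            throw new IllegalArgumentException("a sample needs at least k objects");

        List<String> cmd = new ArrayList<>(
                Arrays.asList("spark-submit", "--class", mainClass, jarFile, inputUri));
        // Order: medoids, sampled objects, samples, partitions, Phase II iterations.
        int[] positional = {numMedoids, sampledObjects, numSamples, numPartitions, phase2Iterations};
        for (int value : positional) cmd.add(Integer.toString(value));
        return cmd;
    }

    public static void main(String[] args) {
        // Bracketed strings are the README placeholders; the numbers are example values only.
        System.out.println(build("[main class]", "[jar file]",
                "wasb://[hdfs address]/[hdfs input path]", 50, 2000, 5, 48, 1));
    }
}
```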

6. Example

The example_dataset.csv file is used to visualize the differences among the six algorithms. The data set consists of 10,000 two-dimensional objects in 10 clusters.
The clustering results are shown in the figures below.

Here, we can easily observe that the other algorithms suffer from local optima or from insufficient samples. Only our algorithm succeeds in finding the exact medoids of the 10 clusters.

7. Experiment

The shell script (.sh) files for our experiments are included in the scripts folder.

Figures 5 and 8: Accuracy and efficiency experiment
- This experiment compares the accuracy and efficiency of eight parallel algorithms.
- All algorithms are run on the four real-world data sets with k set to 10, 25, and 50.
- The results are shown in Figures 5 and 8 of our paper.
- sh Hadoop-AccEffExp.sh and sh Spark-AccEffExp.sh
Figure 7: Convergence experiment
- This experiment verifies the convergence of Phase II.
- Three different seeding strategies are used for this experiment:
  (40+2k,5) by CLARA-MR', (100+5k) by CLARA-MR, and (40k,5) by Phase I of PAMAE.
- The number of iterations for Phase II is increased from 1 to 10 with k fixed at 50.
- The result is shown in Figure 7 of our paper.
- sh ConvergenceExp.sh
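
To make the seeding strategies concrete, the sketch below evaluates the three sample-size formulas for the k values used in our experiments. Reading the first number of each pair as the number of objects per sample is an assumption of this sketch, and the class and method names are hypothetical.

```java
// Hypothetical helper: evaluates the sample-size formulas of the seeding strategies.
public class SeedingSampleSizes {
    static int claraMrPrime(int k) { return 40 + 2 * k; }  // 40+2k
    static int claraMr(int k) { return 100 + 5 * k; }      // 100+5k
    static int pamaePhaseOne(int k) { return 40 * k; }     // 40k

    public static void main(String[] args) {
        for (int k : new int[]{10, 25, 50}) {
            System.out.printf("k=%d: CLARA-MR'=%d, CLARA-MR=%d, PAMAE Phase I=%d%n",
                    k, claraMrPrime(k), claraMr(k), pamaePhaseOne(k));
        }
        // Prints:
        // k=10: CLARA-MR'=60, CLARA-MR=150, PAMAE Phase I=400
        // k=25: CLARA-MR'=90, CLARA-MR=225, PAMAE Phase I=1000
        // k=50: CLARA-MR'=140, CLARA-MR=350, PAMAE Phase I=2000
    }
}
```

At k = 50, the value used in the convergence experiment, a Phase I sample under this reading holds 2,000 objects, compared with 140 and 350 for the two CLARA variants.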
Figure 9: Scalability experiment
- This experiment tests scalability using the TeraClickLog data set on Spark.
- Two storage-level options are used: MEMORY_ONLY and MEMORY_AND_DISK.
- The result is shown in Figure 9 of our paper.
- sh ScalabilityExp.sh
