Accelerating-Sub-Dataset-Processing

Sub-dataset processing, such as event reasoning or statistical learning, is fundamental to a wide range of contemporary data analyses. However, with the explosive growth of scientific and social information, a raw dataset usually contains millions or billions of sub-datasets. The content distribution of individual sub-datasets, referred to as sub-dataset locality, is lost when the dataset is divided into partitions and stored in a file system such as the Hadoop Distributed File System (HDFS). Because related content tends to cluster, a sub-dataset can be distributed unevenly across the data partitions. Without sub-dataset locality information, this imbalanced distribution degrades the performance of many big data applications.
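To make the idea of sub-dataset locality concrete, here is a minimal sketch that counts how many records of each sub-dataset land in each partition. The record layout (`(sub_dataset_key, payload)` pairs), the function name `sub_dataset_distribution`, and the toy data are all assumptions made for illustration; they are not taken from this repository.

```python
from collections import Counter, defaultdict


def sub_dataset_distribution(partitions):
    """Return {sub_dataset_key: Counter({partition_id: record_count})}."""
    distribution = defaultdict(Counter)
    for partition_id, records in enumerate(partitions):
        for key, _payload in records:
            distribution[key][partition_id] += 1
    return distribution


# Hypothetical toy dataset split into three partitions.
# Each record is a (sub_dataset_key, payload) pair.
partitions = [
    [("movie_a", 1), ("movie_a", 2), ("movie_a", 3), ("movie_b", 4)],
    [("movie_b", 5), ("movie_c", 6), ("movie_c", 7), ("movie_c", 8)],
    [("movie_a", 9), ("movie_b", 10), ("movie_c", 11), ("movie_c", 12)],
]

dist = sub_dataset_distribution(partitions)
for key in sorted(dist):
    # Show how unevenly each sub-dataset is spread over the partitions.
    print(key, dict(dist[key]))
```

In this toy data, most of `movie_a` sits in the first partition, so a job that filters for `movie_a` would do most of its work on whichever node processes that partition.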

We propose a proactive method, called DataNet, that discovers sub-dataset locality before the actual analyses are performed. DataNet uses efficient algorithms to detect clusters of related data and to distinguish dominant from non-dominant sub-datasets within partitions. For fast metadata storage and lookup, DataNet employs a new data structure, called ElasticMap, to store the distribution information of sub-datasets, and implements efficient methods to quickly access the distribution of a sub-dataset’s content. This can maximize the efficiency of approximate computations such as static analysis, which involve only the majority of a sub-dataset, and enforce load-balanced computing for MapReduce applications.
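The sketch below illustrates the general idea of keeping per-partition metadata that separates dominant from non-dominant sub-datasets. It is not the ElasticMap implementation in this repository: the class `PartitionMeta`, the `dominant_fraction` threshold, the use of a plain `set` for non-dominant keys, and the helper `dominant_partitions` are all hypothetical choices made for illustration.

```python
from collections import Counter


class PartitionMeta:
    """Hypothetical per-partition metadata (not the real ElasticMap).

    Dominant sub-datasets keep an exact record count; non-dominant ones
    are only remembered as present.
    """

    def __init__(self, records, dominant_fraction=0.5):
        counts = Counter(key for key, _payload in records)
        threshold = dominant_fraction * sum(counts.values())
        # Assumption: a sub-dataset is "dominant" in a partition when it
        # holds at least `dominant_fraction` of that partition's records.
        self.dominant = {k: n for k, n in counts.items() if n >= threshold}
        # A compact membership structure could replace this set; a plain
        # set keeps the sketch simple.
        self.non_dominant = {k for k in counts if k not in self.dominant}

    def lookup(self, key):
        """Return (status, count); count is known only for dominant keys."""
        if key in self.dominant:
            return ("dominant", self.dominant[key])
        if key in self.non_dominant:
            return ("non-dominant", None)
        return ("absent", None)


def dominant_partitions(metas, key):
    """Partition ids where `key` is dominant, e.g. for approximate jobs."""
    return [i for i, meta in enumerate(metas) if key in meta.dominant]


# Hypothetical toy partitions of (sub_dataset_key, payload) records.
partitions = [
    [("movie_a", 1), ("movie_a", 2), ("movie_a", 3), ("movie_b", 4)],
    [("movie_b", 5), ("movie_c", 6), ("movie_c", 7), ("movie_c", 8)],
    [("movie_a", 9), ("movie_b", 10), ("movie_c", 11), ("movie_c", 12)],
]
metas = [PartitionMeta(records) for records in partitions]

print(dominant_partitions(metas, "movie_c"))
print(metas[2].lookup("movie_a"))
```

An approximate analysis of one sub-dataset could read only the partitions returned by `dominant_partitions`, while a scheduler could use the exact counts to spread the heavier partitions across workers.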

Folders and files: approximationmeta, client, communication, hdfs, mapreducer, scheduler, server, LICENSE, README.md

yinjiangling/Accelerating-Sub-Dataset-Processing
