Sub-dataset processing, such as event reasoning or statistical learning, is fundamental to a wide range of contemporary data analyses. However, with the explosive growth of scientific and social information, a raw dataset usually contains millions or billions of sub-datasets. The content distribution of individual sub-datasets, referred to as sub-dataset locality, is lost when the dataset is divided into partitions and stored in file systems such as the Hadoop File System. Because of content clustering, a sub-dataset's content can be unevenly distributed across the data partitions. Without sub-dataset locality information, this imbalance degrades the performance of many big data applications.
We propose a proactive method, called DataNet, that discovers sub-dataset locality before the actual analyses are performed. DataNet provides efficient algorithms to detect clusters of relevant data and to distinguish dominant from non-dominant sub-datasets within each partition. To achieve fast metadata storage and lookup, DataNet employs a new data structure, called ElasticMap, to store the distribution information of sub-datasets, and implements efficient methods to quickly access the distribution of a sub-dataset’s content. This locality information can maximize the efficiency of approximation computations such as static analysis, which involve only the majority of a sub-dataset, and can enforce load-balanced computing for MapReduce applications.
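To make the idea of per-partition distribution metadata concrete, the following is a minimal Python sketch. It is not ElasticMap's actual design, which this abstract does not specify. It assumes that a sub-dataset counts as dominant in a partition when it holds at least a fixed share of that partition's records, that exact sizes are kept only for dominant sub-datasets, and that only membership is kept for the rest. The names `build_partition_index`, `locate`, and `dominant_share` are hypothetical.

```python
from collections import Counter


def build_partition_index(records, dominant_share=0.2):
    """Summarize one partition (illustrative, not the paper's ElasticMap).

    records: iterable of sub-dataset ids, one per record in the partition.
    dominant_share: assumed threshold; a sub-dataset holding at least this
    fraction of the partition's records is treated as dominant.
    """
    counts = Counter(records)
    total = sum(counts.values())
    # Exact record counts are kept only for dominant sub-datasets.
    dominant = {k: n for k, n in counts.items() if n / total >= dominant_share}
    # For the rest, only presence in the partition is remembered.
    non_dominant = {k for k in counts if k not in dominant}
    return {"dominant": dominant, "non_dominant": non_dominant}


def locate(index_by_partition, sub_dataset):
    """Return (exact sizes per partition, partitions with presence only)."""
    sized = {}
    present = []
    for pid, idx in index_by_partition.items():
        if sub_dataset in idx["dominant"]:
            sized[pid] = idx["dominant"][sub_dataset]
        elif sub_dataset in idx["non_dominant"]:
            present.append(pid)
    return sized, sorted(present)


# Toy partitions: each list element is the sub-dataset id of one record.
partitions = {
    "p0": ["a"] * 8 + ["b"] * 1 + ["c"] * 1,
    "p1": ["a"] * 1 + ["b"] * 5 + ["c"] * 4,
    "p2": ["b"] * 9 + ["d"] * 1,
}
index = {pid: build_partition_index(recs) for pid, recs in partitions.items()}
sized, present = locate(index, "a")
```

Under these assumptions, an approximation job over sub-dataset `a` could read only the partitions in `sized`, where most of its records sit, while a load-balancing scheduler could use the recorded sizes to spread work across tasks.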