In the contemporary landscape of online social networks, preserving users' privacy while applying clustering techniques is a pivotal concern. This project integrates differential privacy into social network clustering to balance user privacy and clustering effectiveness. Through a detailed exploration of differential privacy parameters, this work provides insights into how privacy levels influence clustering accuracy and offers a comprehensive understanding of the relationship between privacy, data utility, and clustering in social networks.
This research was published in the 2024 International Conference on Smart Systems for Electrical, Electronics, Communication, and Computer Engineering (ICSSEECC).
- Integration of differential privacy with social network clustering.
- K-means clustering on a noisy feature matrix generated by Laplace noise.
- Evaluation of epsilon parameter impacts on privacy and clustering performance.
- Detailed graphical analysis and evaluation metrics.
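The noisy feature matrix described in the list above could be produced roughly as follows. This is a minimal sketch rather than the project's actual code: the helper name `add_laplace_noise`, the toy adjacency matrix, and the default `sensitivity=1.0` are all assumptions made for illustration, and a real privacy guarantee depends on using the true sensitivity of the chosen features.

```python
import numpy as np


def add_laplace_noise(features, epsilon, sensitivity=1.0, seed=None):
    """Return `features` plus i.i.d. Laplace noise with scale sensitivity / epsilon.

    `sensitivity=1.0` is an illustrative assumption, not a value from the project.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = np.random.default_rng(seed)
    scale = sensitivity / epsilon  # smaller epsilon -> larger noise -> more privacy
    return features + rng.laplace(loc=0.0, scale=scale, size=features.shape)


# Toy 4-node adjacency matrix standing in for the real feature matrix (hypothetical data).
adjacency = np.array(
    [
        [0, 1, 1, 0],
        [1, 0, 1, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 0],
    ],
    dtype=float,
)

noisy = add_laplace_noise(adjacency, epsilon=2.2, seed=0)
print(noisy.round(2))
```

The resulting `noisy` matrix can then be passed to a clustering algorithm such as `KMeans` in place of the raw features.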
- Python (3.8+)
- numpy - Numerical computations
- networkx - Graph and network analysis
- matplotlib - Data visualization
- sklearn - Machine learning tools
  - KMeans - Clustering algorithm
  - adjusted_rand_score - Evaluation metric
  - silhouette_score - Evaluation metric
  - davies_bouldin_score - Evaluation metric
- Jupyter Notebook
- Privacy Parameter (epsilon): Varied from 0.1 to 3 in increments of 0.1.
- Optimal Epsilon Value: Clustering accuracy peaks at 80.53% at an epsilon of 2.2, the best balance between privacy and effectiveness among the tested values.
- Detailed visualizations of the epsilon vs. metric relationship, highlighting the privacy-utility trade-offs.
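An epsilon sweep like the one summarized above might look like the sketch below. It runs on synthetic 2-D points rather than the Twitter data, assumes a sensitivity of 1, and scores each noisy clustering against the non-private K-means labels with `adjusted_rand_score`; those choices are assumptions for illustration and may differ from how the notebooks compute the reported accuracy.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    adjusted_rand_score,
    davies_bouldin_score,
    silhouette_score,
)

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: three groups of 50 points each.
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])


def cluster(data, k=3):
    """Run K-means with a fixed seed and return the cluster labels."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)


baseline = cluster(X)  # labels from the non-private data

# Epsilon from 0.1 to 3.0 in steps of 0.1 (30 values).
epsilons = np.round(np.arange(0.1, 3.01, 0.1), 1)

results = []
for eps in epsilons:
    # Laplace mechanism with an assumed sensitivity of 1.
    noisy = X + rng.laplace(loc=0.0, scale=1.0 / eps, size=X.shape)
    labels = cluster(noisy)
    results.append(
        {
            "epsilon": float(eps),
            "ari": adjusted_rand_score(baseline, labels),
            "silhouette": silhouette_score(noisy, labels),
            "davies_bouldin": davies_bouldin_score(noisy, labels),
        }
    )

best = max(results, key=lambda r: r["ari"])
print("epsilon with highest ARI on this toy data:", best["epsilon"])
```

Plotting each metric in `results` against `epsilon` with matplotlib gives the kind of privacy-utility curve the visualizations describe. The numbers from this toy data are not comparable to the results reported for the Twitter dataset.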
- Source: Twitter Social Network
- Description: Contains user profiles, follower/friend lists, and interaction subgraphs.
- Twitter Dataset
- Clone the repository:
git clone https://github.com/KavinAravindhan/privacy-preserving-clustering.git
- Install the necessary libraries:
pip3 install -r requirements.txt
- Run Jupyter notebooks as per the analysis workflow.
- data_preprocessing.ipynb - Data preprocessing and cleaning.
- k-means_clustering.ipynb - Initial clustering on raw data.
- differential_privacy.ipynb - Adding differential privacy to data and clustering again.
- accuracy_metrics.ipynb - Evaluation of clustering results using various metrics.
- graphical_analysis.ipynb - Graphical analysis of metrics across different privacy parameter values.
- privacy_preserving_clustering.ipynb - Comprehensive notebook with all steps combined.
This project is licensed under the MIT License - see the LICENSE file for details.
A special thanks to our amazing team for their dedication and hard work. Despite the challenges, their commitment to learning new technologies and collaborating effectively made this project a success.