The final model for this project is Spectral Clustering, which achieved an accuracy of 97.9681% in distinguishing the STAR, GALAXY, and QSO classes.
In an earlier discovery (pre-workflow) stage, the first K-Means model, built on purely physics-informed features in extra folder/V6.ipynb, already reached 97.84% overall accuracy.
That is 0.1281 percentage points below the final Spectral Clustering model, and it helped validate that the physics-based feature direction was already effective before the final model refinement.
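For readers unfamiliar with how an accuracy figure can be attached to a clustering model, here is a minimal sketch. It assumes a majority-vote mapping from cluster IDs to classes, which is a common convention and not necessarily the exact procedure used in the notebooks. The function name `majority_vote_accuracy` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote_accuracy(true_labels, cluster_ids):
    """Map each cluster to its most common true class, then score that mapping."""
    members = defaultdict(list)
    for label, cluster in zip(true_labels, cluster_ids):
        members[cluster].append(label)
    # Assumption: every cluster is assigned the class that occurs most often in it.
    cluster_to_class = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in members.items()
    }
    correct = sum(
        cluster_to_class[cluster] == label
        for label, cluster in zip(true_labels, cluster_ids)
    )
    return correct / len(true_labels)


# Toy labels for illustration only (not project data).
true_labels = ["STAR", "STAR", "GALAXY", "GALAXY", "QSO", "QSO"]
cluster_ids = [0, 0, 1, 1, 2, 1]
accuracy = majority_vote_accuracy(true_labels, cluster_ids)
print(f"Accuracy: {accuracy:.2%}")
```

In the toy data, one QSO lands in the cluster dominated by GALAXY, so five of six objects are counted as correct.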
This README explains exactly how to open and run this project in a clean environment.
- code.ipynb is the main, complete workflow notebook. It includes the introduction, physics/background context, EDA, feature engineering/analysis, and clustering model building/evaluation.
- model.ipynb is a separate, modeling-focused version for convenience, mainly to run and review model-related steps.
- For full context and the complete end-to-end work, graders should prioritize reviewing and running code.ipynb.
Required files:
- code.ipynb
- model.ipynb
- star-galaxy-quasar.csv
Strongly recommended files (these let the grader run the modeling notebook immediately without rerunning the full preprocessing):
- star-galaxy-quasar_processed.csv
- star-galaxy-quasar_featured.csv
Recommended:
- Python 3.10-3.13
- VS Code with Jupyter extension
Create and activate a virtual environment from the project root:

    python -m venv .venv
    .\.venv\Scripts\Activate.ps1

If PowerShell execution policy blocks activation, run in Command Prompt:

    .venv\Scripts\activate.bat

Install required packages:

    pip install pandas numpy matplotlib seaborn scikit-learn umap-learn hdbscan jupyter ipykernel

Note: hdbscan is optional in the code logic (the notebook skips HDBSCAN sections if it is unavailable), but installing it is recommended for full output parity.
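The optional-dependency behaviour described in the note can be sketched as follows. This is an illustration of the general pattern, not the notebook's actual code. The flag `HDBSCAN_AVAILABLE`, the helper `run_hdbscan_section`, and the `min_cluster_size=5` value are assumptions chosen for the example.

```python
import importlib.util

# True only when the optional hdbscan package can be imported in this kernel.
HDBSCAN_AVAILABLE = importlib.util.find_spec("hdbscan") is not None


def run_hdbscan_section(features):
    """Hypothetical helper: cluster with HDBSCAN, or return None if it is missing."""
    if not HDBSCAN_AVAILABLE:
        print("hdbscan is not installed; skipping the HDBSCAN section.")
        return None
    import hdbscan  # imported lazily so the rest of the notebook still runs

    # min_cluster_size=5 is an illustrative value, not the project's setting.
    return hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(features)
```

Checking availability once and guarding each HDBSCAN cell keeps "Run All" from stopping on an ImportError.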
- Open code.ipynb
- Set the notebook kernel to the created .venv
- Run all cells from top to bottom
- Confirm generated files exist:
  - star-galaxy-quasar_processed.csv
  - star-galaxy-quasar_featured.csv
- Open model.ipynb
- Run all cells from top to bottom
If star-galaxy-quasar_featured.csv already exists, you can run only:
- model.ipynb (Run All)
The notebook is set to load:
- star-galaxy-quasar_featured.csv first
- falls back to star-galaxy-quasar_processed.csv if needed
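The load order above could look roughly like the sketch below. The file names come from this README; the function `pick_input_csv` and the constants are hypothetical, and the notebook's real loading code may differ.

```python
from pathlib import Path

FEATURED = "star-galaxy-quasar_featured.csv"
PROCESSED = "star-galaxy-quasar_processed.csv"


def pick_input_csv(folder="."):
    """Return the featured CSV if present, otherwise the processed CSV."""
    for name in (FEATURED, PROCESSED):
        candidate = Path(folder) / name
        if candidate.is_file():
            return candidate
    # Neither file exists: fail with a message that names both candidates.
    raise FileNotFoundError(
        f"Expected {FEATURED} or {PROCESSED} in {Path(folder).resolve()}"
    )
```

The returned path can then be passed to `pandas.read_csv`, for example `pd.read_csv(pick_input_csv())`.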
When the notebooks run successfully, the grader should see:
- Feature engineering and EDA plots from code.ipynb
- Clustering metrics/tables (Silhouette, AMI, ARI, NMI) in model.ipynb
- Confusion matrices and post-hoc model comparison visualizations
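As a reference for reading those tables, the four metrics can be computed with scikit-learn as in the sketch below. It uses synthetic blobs rather than the project data, and the "predicted" labels are simply a relabelled copy of the true ones, so this is an assumption-laden illustration and not the notebook's evaluation code.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    normalized_mutual_info_score,
    silhouette_score,
)

# Synthetic stand-in data with three groups (not the star/galaxy/quasar table).
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

# Pretend clustering output: same grouping as the truth, but with renamed cluster IDs.
y_pred = (y_true + 1) % 3

scores = {
    "Silhouette": silhouette_score(X, y_pred),  # internal: uses features only
    "AMI": adjusted_mutual_info_score(y_true, y_pred),  # external: needs true labels
    "ARI": adjusted_rand_score(y_true, y_pred),
    "NMI": normalized_mutual_info_score(y_true, y_pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

AMI, ARI, and NMI compare groupings rather than label names, so renaming cluster IDs does not change them. Silhouette needs no true labels at all.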
Fix (notebook cannot find a CSV file):
- Ensure notebook working directory is the project root.
- Ensure required CSV files are in the same folder as notebooks.
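A quick way to check both points from a notebook cell is sketched below. The file list mirrors the required files named in this README; the helper `missing_files` is a hypothetical name introduced for illustration.

```python
from pathlib import Path

REQUIRED_FILES = ["code.ipynb", "model.ipynb", "star-galaxy-quasar.csv"]


def missing_files(folder="."):
    """Return the required files that are not present in the given folder."""
    return [name for name in REQUIRED_FILES if not (Path(folder) / name).is_file()]


# The working directory should be the project root that holds the notebooks.
print("Working directory:", Path.cwd())
print("Missing files:", missing_files() or "none")
```

If the printed working directory is not the project root, reopen the project folder itself in VS Code rather than a parent folder.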
Fix (missing umap-learn or hdbscan package):

    pip install umap-learn hdbscan

Fix (notebook is not using the .venv kernel):
- In the VS Code notebook toolbar, click Kernel.
- Select the Python interpreter from .venv.
- Restart the kernel and run all cells again.
The notebooks set fixed random seeds (for example, random_state=42) in major modeling steps to keep results stable across runs.
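The effect of a fixed seed can be illustrated with a small sketch. It uses K-Means on synthetic blobs as a stand-in, and `n_clusters=3` and `n_init=10` are illustrative values, not the notebooks' actual configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (not the project dataset).
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

# Two independent fits that share the same random_state.
labels_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

same_assignment = bool(np.array_equal(labels_a, labels_b))
print("Identical cluster assignments:", same_assignment)
```

Small numeric differences can still appear across library versions or platforms, so treat the seed as a stabiliser, not a guarantee of identical digits.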
- Open the project folder in VS Code.
- Create/activate .venv.
- Install packages.
- Run code.ipynb (Run All).
- Run model.ipynb (Run All).
- Verify tables/plots render without errors.