University of Pittsburgh Data Mining Course Spring 2020
- Linear Regression and evaluation techniques
- Logistic Regression and evaluation techniques
- KNN and evaluation techniques
- Naive Bayes
- Random Forests
- Multinomial Logistic Regression
- t-SNE
- K-means clustering
- Uniform Manifold Approximation and Projection (UMAP)
In this Python lab, four models were evaluated on data from surgical procedures performed between June 2017 and June 2018. The goal was to develop a model that accurately predicts a patient’s risk of a post-surgery length of stay (LOS) greater than five days. See the final essay, essay_Python_lab.pdf, for lab details.
- XGBoost
- LightGBM
- CatBoost
- Grid search techniques
- LIME and Shapley Additive Explanations (SHAP)
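As a rough illustration of the grid search step listed above, the sketch below tunes a gradient-boosted classifier for a binary "LOS greater than five days" label. Everything here is an assumption made for illustration: the data is synthetic (`make_classification`), scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, LightGBM, and CatBoost, and the parameter grid and ROC AUC scoring are not taken from the lab itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the surgical data (assumption): label 1 means LOS > 5 days.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid; the values actually searched in the lab are not given here.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

# Exhaustively try every combination with 3-fold cross-validation.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

# The refit best model outputs a probability that can be read as a risk score.
risk = search.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, risk)
print(search.best_params_, round(test_auc, 3))
```

Scoring on held-out data after the search, rather than reporting `search.best_score_` alone, keeps the final estimate separate from the folds used to pick the hyperparameters.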
Terry Kapral, the Humanities Data Librarian for the University Library at the University of Pittsburgh, provided three data sets for an analysis of the digital collections in the Humanities department. This was an exploratory, unsupervised learning project. The high-level goal was to investigate which topics are present within the humanities digital collection and how those topics vary over time. Specifically, Mrs. Kapral was interested in answers to the following questions about the data:
- What are the latent topics across the digital items?
- What items are related by topic?
- How do topics change over time with respect to the time period covered by the items within each topic?
- Are there any problems with the data?
These questions were answered through data exploration, including word embeddings and t-SNE plots, and through topic modeling with the unsupervised learning algorithm Latent Dirichlet Allocation (LDA). Data exploration revealed problems in the data, some of which were mitigated. The final LDA model revealed 19 latent topics from the titles and abstracts in the metadata for the 124,517 digitized items that had a title.