Shrikant-Sharma/README.md

Hi, I'm Shrikant 👋

Data Scientist with ~8 years across pharma (Amgen) and financial services (American Express). Building production ML, agentic AI, and grounded RAG systems for regulated industries. Open to Senior Data Scientist, Decision Scientist, Applied Scientist, ML Engineer, and AI Engineer roles.

🛠️ Tech Stack

Python SciPy Scikit-learn XGBoost PyTorch PySpark Snowflake SHAP FAISS RAGAS LangGraph Sentence Transformers Cross-Encoder SQL AWS Databricks Docker FastAPI MLflow Tableau Streamlit AWS Lambda DoWhy lifelines scikit-survival Git

🚀 Featured Projects

Agentic RAG pipeline over 484 ClinicalTrials.gov oncology protocols (3,264 chunks). Phase 2 upgraded the baseline to a LangGraph-orchestrated Corrective RAG (CRAG) pattern with LLM-as-judge document relevance grading, query rewriting on poor retrieval, and bounded retries. Three independent refusal gates were validated empirically: gibberish input scored 0.91 cosine similarity but was still correctly refused by the grader. Phase 3 added a cross-encoder reranker (MS-MARCO MiniLM-L-6-v2) for two-stage retrieval before deduplication, fixing a documented bi-encoder failure in which generic trastuzumab studies ranked above the relevant trastuzumab-deruxtecan (T-DXd) chunks on HER2-positive antibody-drug conjugate queries. The stack uses PubMedBERT embeddings, FAISS retrieval, and Groq Llama 3.3 70B with source-cited responses. A two-mode UI lets users A/B the agentic flow against the baseline pipeline, and the live demo handles rate limits gracefully. Retrieval and generation were evaluated with RAGAS-equivalent metrics across a 25-query stress test and 3 chunking strategies.

Built with: Python, LangGraph, Sentence Transformers, PubMedBERT, MS-MARCO Cross-Encoder, Groq Llama 3.3 70B, FAISS, RAGAS, Streamlit, Git
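
The corrective loop described above (grade retrieved chunks, rewrite the query on poor retrieval, stop after a bounded number of retries, refuse if nothing relevant survives) can be sketched in plain Python. This is a minimal illustration, not the project's LangGraph graph: the function name `corrective_retrieve` and its `retrieve`, `grade`, and `rewrite` callables are hypothetical stand-ins for the FAISS retriever, the LLM-as-judge grader, and the query rewriter.

```python
from typing import Callable, List, Optional


def corrective_retrieve(
    query: str,
    retrieve: Callable[[str], List[str]],   # hypothetical: wraps the vector search
    grade: Callable[[str, str], bool],      # hypothetical: LLM-as-judge relevance check
    rewrite: Callable[[str], str],          # hypothetical: LLM query rewriter
    max_retries: int = 2,                   # assumed bound; the README gives no number
) -> Optional[List[str]]:
    """Return chunks graded as relevant, or None to signal a refusal."""
    current = query
    for attempt in range(max_retries + 1):
        chunks = retrieve(current)
        # Grade against the ORIGINAL question so rewrites cannot drift the intent.
        relevant = [c for c in chunks if grade(query, c)]
        if relevant:
            return relevant
        if attempt < max_retries:
            current = rewrite(current)
    # Retries exhausted: the caller should refuse rather than generate an answer.
    return None
```

Returning `None` instead of an empty answer keeps the refusal decision explicit: a high similarity score alone never reaches the generator unless the grader also accepts at least one chunk.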


End-to-end ML pipeline on the UCI Heart Disease dataset (303 patients, 13 clinical features). Trained 5 classification models (Logistic Regression, Random Forest, XGBoost, SVM, PyTorch NN) with MLflow experiment tracking, then tuned the top 3 via RandomizedSearchCV with 5-fold cross-validation. The tuned XGBoost model reached 0.95 AUC-ROC with 0.93 recall at the clinical operating threshold. Added SHAP TreeExplainer for global and per-patient explanations, calibration analysis (Brier score 0.092, reliability curves), and threshold tuning for clinical screening. Phase 2 added causal inference and survival analysis on the NHEFS cohort (Hernán-Robins canon, 1,629 subjects): estimated the ATT of smoking cessation on weight via propensity score matching + G-computation, with three-way convergence within 0.2 kg of the IPW reference. Cox PH and Random Survival Forest agreed at 0.80 test concordance for 10-year mortality, and both showed that the significant unadjusted Kaplan-Meier difference was pure age confounding. Deployed as a REST API via FastAPI.

Built with: Python, Scikit-learn, XGBoost, PyTorch, SHAP, DoWhy, lifelines, scikit-survival, MLflow, FastAPI, Pydantic, Docker, AWS Lambda, ECR, Mangum, Git
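
For readers unfamiliar with the calibration and threshold-tuning steps mentioned above, here is a small standard-library sketch. It computes the Brier score (mean squared error between predicted probability and outcome) and picks the highest threshold that still meets a target recall. The selection rule is an assumption for illustration; the README does not say how the clinical operating threshold was chosen, and the function names are hypothetical.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def recall_at(y_true: Sequence[int], y_prob: Sequence[float], threshold: float) -> float:
    """Fraction of true positives scored at or above the threshold."""
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    return sum(p >= threshold for p in positives) / len(positives)


def highest_threshold_for_recall(
    y_true: Sequence[int], y_prob: Sequence[float], target_recall: float
) -> float:
    """Assumed rule: largest observed score whose recall meets the target."""
    for t in sorted(set(y_prob), reverse=True):
        if recall_at(y_true, y_prob, t) >= target_recall:
            return t
    raise ValueError("no threshold reaches the target recall")
```

Scanning thresholds from high to low favors precision among all thresholds that satisfy the recall floor, which is one reasonable choice for a screening setting where missed cases are costlier than false alarms.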


Compliance spend analytics on real CMS Open Payments data: 16M+ federal records ($13B). Sampled 989K transactions across 289K unique HCPs and engineered 5 HCP-level features. Applied within-specialty z-scores + global IQR with a $500 monetary floor and concentration logic, flagging the top 1.67% as a HIGH-tier triage queue. Layered Isolation Forest ML detection on top; reconciliation against the rule-based flags surfaced three structurally distinct compliance archetypes that no single method catches alone: captured specialists (both methods agree), captured generalists (rules-only), and industry consultants (ML-only). Snowflake SQL quantified that the top 4 device manufacturers (Arthrex, Stryker, Zimmer Biomet, Smith+Nephew) capture 59.7% of orthopedic surgery payments. The same feature pipeline is implemented across three execution backends (Pandas, Snowflake, PySpark), with cross-platform row-level equivalence verified. Two interactive Tableau Public dashboards.

Built with: Python (Pandas, Scikit-learn, PySpark), Snowflake, Tableau, Git
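
The rule-based layer described above can be illustrated with the standard library alone. This sketch assumes a record must clear all three checks (within-specialty z-score, global IQR fence, and the $500 floor) to be flagged; the README does not state how the rules are combined, the cutoffs `z_cut` and `iqr_k` are assumed values, the field names are hypothetical, and the concentration logic is omitted.

```python
from collections import defaultdict
from statistics import mean, pstdev, quantiles
from typing import Dict, List


def flag_high_tier(
    records: List[Dict],
    z_cut: float = 3.0,    # assumed z-score cutoff
    iqr_k: float = 1.5,    # assumed IQR multiplier (Tukey fence)
    floor: float = 500.0,  # $500 monetary floor from the README
) -> List[str]:
    """Return hcp_id values that pass the z-score, IQR, and floor checks."""
    # Per-specialty mean and population standard deviation of total payments.
    by_specialty = defaultdict(list)
    for r in records:
        by_specialty[r["specialty"]].append(r["total"])
    stats = {s: (mean(v), pstdev(v)) for s, v in by_specialty.items()}

    # Global upper fence: Q3 + k * IQR over all HCPs.
    q1, _, q3 = quantiles([r["total"] for r in records], n=4)
    upper_fence = q3 + iqr_k * (q3 - q1)

    flagged = []
    for r in records:
        mu, sigma = stats[r["specialty"]]
        z = (r["total"] - mu) / sigma if sigma > 0 else 0.0
        if r["total"] >= floor and z >= z_cut and r["total"] > upper_fence:
            flagged.append(r["hcp_id"])
    return flagged
```

The monetary floor matters in low-spend specialties, where a modest payment can produce a large z-score without being material for compliance review.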


🌱 Currently Deepening

Agentic AI production patterns • RAG evaluation • Causal inference at scale

📫 Let's Connect

LinkedIn Email

Pinned

  1. clinical-trial-rag (Public)

    Agentic Corrective RAG over 484 ClinicalTrials.gov oncology protocols. LangGraph CRAG with LLM-as-judge grading + cross-encoder reranker for two-stage retrieval. Three independent refusal gates. Pu…

    Jupyter Notebook

  2. patient-risk-stratification (Public)

    End-to-end heart disease risk pipeline: XGBoost + SHAP, causal inference (DoWhy) + survival analysis (lifelines), deployed live on AWS Lambda.

    Jupyter Notebook

  3. pharma-compliance-spend-analytics (Public)

    Rule-based + ML anomaly detection across 16M+ CMS Open Payments records ($13B). Same pipeline runs in Pandas, Snowflake, and PySpark; surfaces three compliance archetypes and the finding that top 4…

    Jupyter Notebook