Paper: "Integrating Fuzzy MCDM Priors into Deep Reinforcement Learning for Sustainable and Resilient Supplier Selection and Dynamic Order Allocation"
Authors: Ali Vaezi · Erfan Rabbani · Giulia Bruno
Affiliation: Politecnico di Torino, Italy
- Overview
- Architecture
- Quick Start
- Repository Structure
- Core Features
- Experimental Design
- Results Preview
- Reproducibility
- Citation
- License
This repository provides the complete implementation for a two-phase hybrid framework that integrates Multi-Criteria Decision Making (MCDM) with Deep Reinforcement Learning (DRL) for sustainable supplier order allocation under demand uncertainty.
Phase I applies Fuzzy Best-Worst Method (FBWM) and Fuzzy TOPSIS to evaluate suppliers across economic, environmental, and resilience dimensions. Phase II embeds these sustainability priors into a custom PPO agent (SUPRA-PPO) operating within a stochastic supply chain simulation.
The key research question: When and how do sustainability priors improve RL-based order allocation?
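As a rough illustration of how Phase I can turn fuzzy expert ratings into one crisp score per supplier, the sketch below implements a classic vertex-distance fuzzy TOPSIS (Chen-style linear normalisation, fuzzy ideals (1, 1, 1) and (0, 0, 0)). It assumes triangular fuzzy ratings and triangular fuzzy criteria weights, for example the fuzzy weights FBWM produces. The function names and the example values are hypothetical, and `src/mcdm_evaluation.py` may use a different normalisation or distance.

```python
import math
from typing import List, Sequence, Tuple

TFN = Tuple[float, float, float]  # triangular fuzzy number (a, b, c) with a <= b <= c


def vertex_distance(x: TFN, y: TFN) -> float:
    """Vertex distance between two TFNs: sqrt of the mean squared gap of the three vertices."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)) / 3.0)


def normalise(column: Sequence[TFN], benefit: bool) -> List[TFN]:
    """Linear normalisation of one criterion across all suppliers (all values must be > 0)."""
    if benefit:
        c_max = max(c for _, _, c in column)
        return [(a / c_max, b / c_max, c / c_max) for a, b, c in column]
    # Cost criterion: smaller is better, so invert against the smallest lower bound.
    a_min = min(a for a, _, _ in column)
    return [(a_min / c, a_min / b, a_min / a) for a, b, c in column]


def fuzzy_topsis(ratings: Sequence[Sequence[TFN]], weights: Sequence[TFN],
                 benefit: Sequence[bool]) -> List[float]:
    """Closeness coefficient CC_i = d_i^- / (d_i^+ + d_i^-) for each supplier (row)."""
    n_sup, n_crit = len(ratings), len(weights)
    cols = [normalise([ratings[i][j] for i in range(n_sup)], benefit[j]) for j in range(n_crit)]
    fpis, fnis = (1.0, 1.0, 1.0), (0.0, 0.0, 0.0)  # fuzzy positive / negative ideal solutions
    scores = []
    for i in range(n_sup):
        d_plus = d_minus = 0.0
        for j in range(n_crit):
            (a, b, c), (wa, wb, wc) = cols[j][i], weights[j]
            v = (a * wa, b * wb, c * wc)  # weighted normalised rating
            d_plus += vertex_distance(v, fpis)
            d_minus += vertex_distance(v, fnis)
        scores.append(d_minus / (d_plus + d_minus))
    return scores
```

With one benefit criterion (e.g. quality) and one cost criterion (e.g. price), a supplier that is better on both ends up with the highest closeness coefficient; higher scores are what Phase II would treat as stronger sustainability priors.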
┌─────────────────────────────────────────────────────────┐
│ Phase I: MCDM │
│ ┌──────────┐ ┌───────────┐ ┌──────────────────┐ │
│ │ FBWM │───▸│ Criteria │───▸│ FTOPSIS │ │
│ │ (weights) │ │ Weights │ │ (supplier ranks) │ │
│ └──────────┘ └───────────┘ └────────┬─────────┘ │
└────────────────────────────────────────────┼────────────┘
│
Sustainability Priors (scores)
│
┌────────────────────────────────────────────▼────────────┐
│ Phase II: RL Training │
│ ┌──────────────────┐ ┌───────────────────────────┐ │
│ │ SupplyChainEnv │◂──▸│ SUPRA-PPO │ │
│ │ (gymnasium) │ │ - Adaptive Entropy (ES) │ │
│ │ - 5 suppliers │ │ - FTOPSIS reward shaping │ │
│ │ - 2 products │ │ - CV-based scheduling │ │
│ │ - 4 scenarios │ └───────────────────────────┘ │
│ └──────────────────┘ │
└─────────────────────────────────────────────────────────┘
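The diagram lists Adaptive Entropy (ES) and CV-based scheduling inside SUPRA-PPO without giving the rule. One plausible reading, sketched below purely as an assumption, is an entropy coefficient that anneals linearly over training but is inflated when the coefficient of variation (CV) of recent demand is high, so the agent keeps exploring in volatile regimes. The function names, default values, and the linear-times-CV form are hypothetical and are not taken from the repository.

```python
import statistics
from typing import Sequence


def demand_cv(window: Sequence[float]) -> float:
    """Coefficient of variation (population std / mean) of a recent demand window."""
    mean = statistics.fmean(window)
    return statistics.pstdev(window) / mean if mean > 0 else 0.0


def entropy_coef(progress: float, cv: float, start: float = 0.02, end: float = 0.002,
                 cv_gain: float = 0.5, cap: float = 0.05) -> float:
    """Hypothetical ES rule: linear decay over training, inflated when demand is volatile."""
    progress = min(max(progress, 0.0), 1.0)  # fraction of training completed, clipped to [0, 1]
    base = start + (end - start) * progress  # plain linear annealing from start to end
    # Scale up exploration in proportion to demand CV, but never above the cap.
    return min(base * (1.0 + cv_gain * max(cv, 0.0)), cap)
```

A PPO update loop would recompute `entropy_coef(progress, demand_cv(recent_demand))` before each policy update; the cap keeps a single demand shock from flooding the policy with entropy.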
- Python: 3.9 or later
- Key packages: see `requirements.txt` for the full pinned list

To install the dependencies into an existing environment, run `pip install -r requirements.txt`. For a clean setup:

- Create and activate a Python virtual environment: `python -m venv venv`, then `source venv/bin/activate`.
- Install dependencies: `pip install -r requirements.txt`.
- (Optional) Recompute the MCDM supplier scores with `python src/mcdm_evaluation.py`. Pre-computed scores are already provided in `results/mcdm/`.
- Run the full experiment suite, 54 runs in total (36 main runs covering 4 scenarios × 3 models × 3 seeds, plus 18 ablation runs): `bash run_all_experiments.sh`. Alternatively, invoke the Python runner directly with `python src/run_experiment.py`.
- Run the Misaligned Shock adversarial experiment (Section 5.4): `python src/run_misaligned_shock.py`, or pass an explicit seed list with `python src/run_misaligned_shock.py --seeds 42,123,456`.
- Generate publication-quality figures: `python src/visualize_results.py`.

Build and run the experiments in a Docker container:
- Build the image: `docker build -t supra-ppo .`
- Run the environment validation: `docker run --rm supra-ppo`
- Run the experiments; the bind mount writes the results to `./results` on the host: `docker run --rm -v "$(pwd)/results:/app/results" supra-ppo python src/run_experiment.py`
- Generate the figures from those results: `docker run --rm -v "$(pwd)/results:/app/results" supra-ppo python src/visualize_results.py`

├── src/
│ ├── run_experiment.py # Main experiment runner (4 scenarios × 3 models × 3 seeds)
│ ├── run_misaligned_shock.py # Misaligned Shock adversarial experiment (Section 5.4)
│ ├── mcdm_evaluation.py # FBWM + FTOPSIS supplier scoring
│ ├── visualize_results.py # Publication-quality figure generator
│ └── generate_supplier_table.py # Supplier evaluation LaTeX table
├── scripts/
│ ├── check_status.py # Experiment progress monitor
│ └── evaluate_trained_models.py # Post-training model evaluation
├── test/
│ ├── test_environment.py # Supply chain environment unit tests
│ └── test_reward_fix.py # Reward function validation tests
├── results/
│ └── mcdm/ # Pre-computed MCDM scores (tracked in Git)
│ ├── fbwm_weights.csv # Fuzzy BWM criteria weights
│ ├── ftopsis_rankings.csv # Fuzzy TOPSIS supplier rankings
│ ├── dimensional_scores.csv # Per-dimension supplier scores
│ └── supplier_scores_for_rl.npy # NumPy array used by RL agent
├── requirements.txt # Pinned Python dependencies
├── Dockerfile # Container-based reproducibility
├── .dockerignore # Docker build exclusions
├── run_all_experiments.sh # Shell script to run full experiment suite
├── LICENSE # CC BY-NC-ND 4.0 (pre-publication)
└── README.md
Note: Trained model checkpoints (~160 MB) and generated figures are excluded from Git. Run the experiment scripts to reproduce them locally.
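The tree lists `results/mcdm/supplier_scores_for_rl.npy` as the NumPy array consumed by the RL agent. The sketch below shows one defensive way to load such priors; it assumes the file holds one non-negative score per supplier (5 suppliers, per the architecture diagram) and that rescaling them to sum to 1 is acceptable. Both assumptions, and the helper names, are hypothetical; check the actual array shape before relying on it.

```python
import numpy as np


def validate_priors(scores, n_suppliers: int = 5) -> np.ndarray:
    """Check one finite, non-negative score per supplier and rescale them to sum to 1."""
    arr = np.asarray(scores, dtype=float).ravel()
    if arr.shape != (n_suppliers,):
        raise ValueError(f"expected {n_suppliers} supplier scores, got shape {arr.shape}")
    if not np.all(np.isfinite(arr)) or np.any(arr < 0):
        raise ValueError("supplier scores must be finite and non-negative")
    total = arr.sum()
    if total <= 0:
        raise ValueError("supplier scores must not all be zero")
    return arr / total  # normalisation is an assumption, not the repository's documented format


def load_priors(path: str = "results/mcdm/supplier_scores_for_rl.npy",
                n_suppliers: int = 5) -> np.ndarray:
    """Load the pre-computed FTOPSIS priors from disk and validate them."""
    return validate_priors(np.load(path), n_suppliers)
```

Failing fast on a shape mismatch is cheaper than discovering, millions of timesteps later, that the reward was shaped with the wrong supplier ordering.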
- SUPRA-PPO Algorithm: A PPO variant featuring:
  - Adaptive Entropy Scheduling (ES) for a dynamic exploration–exploitation balance
  - Uncertainty-Driven Regularisation (UDR) for promoting robust policies
- Supply Chain Environment: A `gymnasium`-compatible simulation with:
  - Non-stationary demand (trend, seasonality, stochastic shocks)
  - Lognormal lead times and systemic supplier disruptions
  - Four market scenarios: Stable Operations, High Volatility, Systemic Shock, Misaligned Shock
- MCDM Integration: Fuzzy Best-Worst Method (FBWM) + Fuzzy TOPSIS (FTOPSIS) to generate sustainability–resilience priors
- Ablation Study: Three models (M1 = Base-Stock, M2 = Vanilla PPO, M3 = PPO + FTOPSIS priors) across 4 scenarios × 3 seeds (36 main runs; 54 in total with the 18 ablation runs)
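The environment's stochastic ingredients (trending, seasonal demand with random shocks; lognormal lead times; systemic disruptions) can be sketched with NumPy's `Generator` API as below. This is an illustrative model under stated assumptions, not the repository's `SupplyChainEnv`: the multiplicative demand form, every default parameter value, and the "one event takes all suppliers offline" reading of *systemic* are hypothetical.

```python
import numpy as np


def sample_demand(n_periods: int, rng: np.random.Generator, base: float = 100.0,
                  trend: float = 0.5, season_amp: float = 0.2, season_period: int = 12,
                  noise_sd: float = 0.1, shock_prob: float = 0.05,
                  shock_mult: float = 1.5) -> np.ndarray:
    """Non-stationary demand: linear trend x seasonality x multiplicative noise x random shocks."""
    t = np.arange(n_periods)
    level = base + trend * t  # linear trend
    season = 1.0 + season_amp * np.sin(2.0 * np.pi * t / season_period)  # periodic seasonality
    noise = rng.normal(1.0, noise_sd, size=n_periods)  # multiplicative Gaussian noise
    shocks = np.where(rng.random(n_periods) < shock_prob, shock_mult, 1.0)  # demand spikes
    return np.maximum(level * season * noise * shocks, 0.0)  # demand cannot be negative


def sample_lead_times(mean: float, sigma: float, size: int, rng: np.random.Generator) -> np.ndarray:
    """Lognormal lead times whose arithmetic mean equals `mean`."""
    mu = np.log(mean) - 0.5 * sigma ** 2  # because E[exp(N(mu, sigma^2))] = exp(mu + sigma^2 / 2)
    return rng.lognormal(mean=mu, sigma=sigma, size=size)


def sample_systemic_disruptions(n_periods: int, n_suppliers: int, p_event: float,
                                rng: np.random.Generator) -> np.ndarray:
    """Availability matrix (True = available); a systemic event takes every supplier offline."""
    event = rng.random(n_periods) < p_event
    return np.repeat(~event[:, None], n_suppliers, axis=1)
```

Parameterising the lognormal by its arithmetic mean keeps scenarios comparable: raising `sigma` makes lead times more volatile without shifting their average.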
| Factor | Levels | Details |
|---|---|---|
| Models | 3 | M1: Base-Stock heuristic, M2: Vanilla PPO, M3: SUPRA-PPO (PPO + FTOPSIS priors) |
| Scenarios | 4 | Stable Operations, High Volatility, Systemic Shock, Misaligned Shock |
| Seeds | 3 | 42, 123, 456 (for statistical robustness) |
| Timesteps | 3M | Per model-scenario combination |
| Sensitivity | 5 | λ_sust ∈ {0.0, 0.1, 0.2, 0.3, 0.5} |
| Total runs | 54 | 36 main (3 models × 4 scenarios × 3 seeds) + 18 ablation |
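The sensitivity factor λ_sust controls how strongly the FTOPSIS priors shape the reward. A minimal sketch of one possible shaping term follows: the scaled negative period cost plus λ_sust times the order-share-weighted prior score of the allocation. This functional form, the `cost_scale` normaliser, and the function name are assumptions for illustration; the paper's actual reward may combine the terms differently.

```python
from typing import Sequence

LAMBDA_GRID = (0.0, 0.1, 0.2, 0.3, 0.5)  # λ_sust sensitivity levels listed in the table above


def shaped_reward(cost: float, orders: Sequence[float], priors: Sequence[float],
                  lam_sust: float, cost_scale: float = 1000.0) -> float:
    """Hypothetical shaping: scaled negative cost plus a lam_sust-weighted sustainability bonus."""
    total = sum(orders)
    # Order-share-weighted prior score of this period's allocation (0 if nothing is ordered).
    sust = sum(q * s for q, s in zip(orders, priors)) / total if total > 0 else 0.0
    return -cost / cost_scale + lam_sust * sust
```

At λ_sust = 0 the agent reduces to cost minimisation (as in M2); as λ_sust grows, the reward gap between sourcing from a high-prior and a low-prior supplier grows linearly, which is one way priors could help under aligned shocks and hurt when they are misaligned.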
| Scenario / Metric | M1 (Base-Stock) | M2 (Vanilla PPO) | M3 (SUPRA-PPO) |
|---|---|---|---|
| Stable / cost | Baseline | Improved | Best |
| High Volatility / cost | Baseline | Best | Penalised by priors |
| Systemic Shock / sustainability | Low | Moderate | Highest |
| Misaligned Shock / cost | Baseline | Moderate | Survives |
| Misaligned Shock / sustainability | n/a | Best | Collapses |
Full quantitative results are available after running the experiments.
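Since each model-scenario cell is run with three seeds, results are naturally reported as a mean and spread over seeds. The sketch below groups per-run records by (scenario, model) and returns the mean, sample standard deviation, and run count using only the standard library. The `(scenario, model, seed, value)` record layout is a hypothetical stand-in; the repository's result files may be organised differently.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, int, float]  # (scenario, model, seed, metric value); hypothetical layout


def summarise(records: Iterable[Record]) -> Dict[Tuple[str, str], Tuple[float, float, int]]:
    """Group runs by (scenario, model) and return (mean, sample std, n) over seeds."""
    groups: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for scenario, model, _seed, value in records:
        groups[(scenario, model)].append(value)
    return {
        # A single run has no spread, so report 0.0 instead of raising.
        key: (statistics.fmean(vals), statistics.stdev(vals) if len(vals) > 1 else 0.0, len(vals))
        for key, vals in groups.items()
    }
```

With only three seeds per cell, the sample standard deviation is a coarse estimate, so the spread is best read alongside the individual per-seed values.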
- All configuration parameters are embedded in `run_experiment.py`; no external config files are needed.
- All four scenarios (Stable, High Volatility, Systemic Shock, Misaligned Shock), hyperparameters, and random seeds are explicitly declared in configuration dictionaries.
- The Misaligned Shock adversarial scenario is reproduced via `src/run_misaligned_shock.py`.
- Pre-computed MCDM scores are provided in `results/mcdm/`, so the RL experiments can run without recomputing them.
- Large outputs (trained models, generated figures, TensorBoard logs) are excluded from Git via `.gitignore`.
- To fully reproduce all 54 runs: install the dependencies, run `bash run_all_experiments.sh`, then run `python src/visualize_results.py`.
If you use this code in your research, please cite using the "Cite this repository" button on GitHub, or:
@article{vaezi2026suprappo,
author = {Vaezi, Ali and Rabbani, Erfan and Bruno, Giulia},
title = {Integrating Fuzzy {MCDM} Priors into Deep Reinforcement Learning
for Sustainable and Resilient Supplier Selection and Dynamic Order Allocation},
journal = {International Journal of Production Economics},
year = {2026},
note = {Under review},
url = {https://github.com/aliivaezii/FBWM-FTOPSIS-PPO},
}

Contributions are welcome. Please see `CONTRIBUTING.md` for guidelines.
This project is licensed under CC BY-NC-ND 4.0 during the pre-publication period. You may view and share the code with attribution, but commercial use and derivative works are not permitted. After the associated paper is published, this repository will be re-licensed under the MIT License.