This quickstart tutorial demonstrates how to use Azure Machine Learning to train a simple linear regression model on the Boston House Prices dataset.
```
aml_quickstart_tutorial/
├── README.md                    # This file
├── requirements.txt             # Python dependencies
├── .env.template                # Environment variables template
├── data/
│   └── boston_house_prices.csv  # Sample dataset
├── code/
│   ├── train_model.py           # Model training script
│   └── upload_data_asset.py     # Script to upload data to Azure ML
└── jobs/
    └── submit_training_job.py   # Script to submit training job to Azure ML
```
You will need:

- An Azure subscription
- An Azure Machine Learning workspace
- Python 3.8 or later
- Azure CLI (optional, for additional commands)
To set up the environment automatically, run:

```bash
cd aml_quickstart_tutorial
./setup.sh
```
Or set it up step by step:

1. Install the Python dependencies:

   ```bash
   cd aml_quickstart_tutorial
   pip install -r requirements.txt
   ```

2. Copy the environment template:

   ```bash
   cp .env.template .env
   ```

3. Edit `.env` and fill in your Azure details:

   ```
   AZURE_SUBSCRIPTION_ID=your_subscription_id_here
   AZURE_RESOURCE_GROUP=your_resource_group_here
   AZURE_ML_WORKSPACE_NAME=your_workspace_name_here
   AZURE_ML_COMPUTE_NAME=cpu_cluster_name_here
   ```
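For reference, the `requirements.txt` for a tutorial like this might contain something like the following (the exact package list and versions are an assumption; match them to your workspace):

```text
azure-ai-ml
azure-identity
scikit-learn
pandas
joblib
python-dotenv
```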
Make sure you're authenticated with Azure. You can use one of these methods:
- Azure CLI: `az login`
- VS Code: use the Azure extension
- Environment variables: set `AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`, and `AZURE_TENANT_ID`
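Whichever method you choose, `DefaultAzureCredential` picks it up automatically. A minimal sketch of how the tutorial's scripts could build an `MLClient` from the `.env` values (the helper names `load_settings` and `get_ml_client` are illustrative, not part of the tutorial code):

```python
import os


def load_settings():
    """Read the Azure ML connection details from environment variables."""
    required = [
        "AZURE_SUBSCRIPTION_ID",
        "AZURE_RESOURCE_GROUP",
        "AZURE_ML_WORKSPACE_NAME",
    ]
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return {name: os.environ[name] for name in required}


def get_ml_client(settings):
    """Build an MLClient. DefaultAzureCredential tries az login, VS Code,
    and service-principal environment variables in turn."""
    # Imported inside the function so the helper above works without the SDK.
    from azure.identity import DefaultAzureCredential
    from azure.ai.ml import MLClient

    return MLClient(
        credential=DefaultAzureCredential(),
        subscription_id=settings["AZURE_SUBSCRIPTION_ID"],
        resource_group_name=settings["AZURE_RESOURCE_GROUP"],
        workspace_name=settings["AZURE_ML_WORKSPACE_NAME"],
    )
```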
Before running on Azure ML, you can test the training script locally:
```bash
cd code
python train_model.py --data_path ../data/boston_house_prices.csv --output_dir ./local_outputs
```

Upload the dataset as a data asset to your Azure ML workspace:
```bash
cd code
python upload_data_asset.py
```

This creates a data asset named "boston-house-prices" that can be reused across experiments.
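The core of a script like `upload_data_asset.py` is a single `create_or_update` call on a versioned `Data` entity. A sketch under the assumption that the tutorial uses the Azure ML SDK v2 (the `MLClient` construction is elided; the asset name follows the tutorial, the version is an example):

```python
def upload_data_asset(ml_client, csv_path="../data/boston_house_prices.csv"):
    """Register a local CSV as a versioned uri_file data asset."""
    # Imported inside the function so this sketch reads without the SDK installed.
    from azure.ai.ml.entities import Data
    from azure.ai.ml.constants import AssetTypes

    data_asset = Data(
        name="boston-house-prices",
        version="1",
        description="Boston House Prices sample dataset",
        path=csv_path,            # local path; uploaded to workspace storage
        type=AssetTypes.URI_FILE,
    )
    # create_or_update uploads the file and registers the asset in the workspace.
    return ml_client.data.create_or_update(data_asset)
```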
Submit the training job to run on Azure ML compute:
```bash
cd jobs
python submit_training_job.py
```

This will:
- Create a compute environment with the required dependencies
- Submit a training job using the uploaded data asset
- Train a linear regression model
- Save the trained model and metrics
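In SDK v2 terms, those steps amount to building a `command` job that references the registered data asset and submitting it. A hedged sketch (the environment name is a placeholder for whichever curated or custom environment the tutorial creates; the data-asset version is an example):

```python
def submit_training_job(ml_client, compute_name):
    """Submit train_model.py as a command job using the registered data asset."""
    # Imported inside the function so this sketch reads without the SDK installed.
    from azure.ai.ml import command, Input

    job = command(
        code="../code",  # directory containing train_model.py
        command=(
            "python train_model.py "
            "--data_path ${{inputs.data}} --output_dir ./outputs"
        ),
        inputs={
            # Reference the data asset registered by upload_data_asset.py.
            "data": Input(type="uri_file", path="azureml:boston-house-prices:1"),
        },
        environment="azureml:sklearn-training-env@latest",  # placeholder name
        compute=compute_name,
        experiment_name="boston-house-prices-quickstart",
    )
    # Returns the submitted job, including its Studio URL for monitoring.
    return ml_client.jobs.create_or_update(job)
```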
After submitting the job, you can monitor it:
- Azure ML Studio: Use the Studio URL provided in the output
- Azure CLI: `az ml job show --name <job_name>`
- Stream logs: `az ml job stream --name <job_name>`
The training script (`train_model.py`):
- Loads the Boston House Prices dataset
- Splits the data into training and test sets
- Trains a linear regression model using scikit-learn
- Evaluates the model and logs metrics to Azure ML
- Saves the trained model and feature names
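The steps above can be sketched as follows (the `MEDV` target column matches the dataset description below; the Azure ML metric logging and model saving are summarized in a comment, since the tutorial's actual script may do them differently):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def train_and_evaluate(df, target="MEDV", test_size=0.2, seed=42):
    """Train a linear regression on df and return the model and its metrics."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    preds = model.predict(X_test)
    mse = mean_squared_error(y_test, preds)
    metrics = {"r2": r2_score(y_test, preds), "mse": mse, "rmse": mse ** 0.5}
    # In the Azure ML job these metrics would also be logged (e.g. via MLflow),
    # and the fitted model saved (e.g. with joblib) for later registration.
    return model, metrics
```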
- Data Asset Management: The dataset is uploaded as a versioned data asset
- Environment Management: Automatic creation of conda environment with dependencies
- Experiment Tracking: Metrics are logged to Azure ML for comparison
- Model Artifacts: Trained models are saved and can be registered for deployment
- Scalable Compute: Jobs run on Azure ML compute clusters
The training script logs the following metrics:
- R² Score: Coefficient of determination (how well the model explains variance)
- RMSE: Root Mean Squared Error
- MSE: Mean Squared Error
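These three metrics are closely related: RMSE is simply √MSE, and R² compares the model's squared error against a mean-only baseline. A dependency-free sketch of the definitions:

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, and the R^2 score from paired observations."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mean = sum(y_true) / n
    ss_tot = sum((t - mean) ** 2 for t in y_true)  # error of a mean-only baseline
    ss_res = mse * n                               # the model's squared error
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}
```

An R² of 1.0 means the model explains all the variance; 0.0 means it does no better than always predicting the mean.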
After completing this quickstart, you can:
- Register the Model: Register the trained model for deployment
- Create Endpoints: Deploy the model as a real-time or batch endpoint
- Experiment with Different Models: Try other algorithms like Random Forest
- Add Data Drift Monitoring: Monitor your deployed model for data drift
- Set up MLOps Pipelines: Automate training and deployment with Azure DevOps or GitHub Actions
- Authentication Errors: Make sure you're logged in with `az login`
- Compute Not Found: Ensure your compute cluster exists or update the compute name in `.env`
- Environment Creation Fails: Check that all dependencies in `requirements.txt` are available
- Data Asset Not Found: Make sure you've run the upload script first
The Boston House Prices dataset contains:
- Features: 13 attributes including crime rate, property tax, pupil-teacher ratio, etc.
- Target: Median home value (MEDV)
- Samples: 506 housing records
- Task: Regression (predicting continuous values)
This is a classic dataset for learning regression techniques, though note that it's considered outdated for real-world applications due to ethical concerns with some features.