Machine Learning Classification in QGIS
March 1, 2024
GIS, ai, data science, machine-learning, python, qgis
Background
In this brief tutorial, we’ll examine machine learning through multi-class classification with the Dzetsaka classification
plugin in QGIS. Dzetsaka was written by Nicolas Karasiak. The plugin takes raster data (often satellite imagery) and
uses a set of training data to classify it into land cover types. The tool was originally developed to distinguish
different types of vegetation in the landscape, although it works well (with training and validation) on other types
of land cover.
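To make the role of training data concrete, here is a minimal sketch of supervised per-pixel classification in plain Python. It is not how Dzetsaka works internally; the nearest-centroid rule, the function names, and the RGB sample values are all hypothetical and chosen only to illustrate the idea of learning classes from labeled pixels.

```python
from math import dist

# Hypothetical training samples: class name -> list of (R, G, B) pixel values.
TRAINING = {
    "roads": [(120, 120, 125), (110, 112, 118)],
    "buildings": [(200, 190, 185), (210, 205, 200)],
    "fields": [(90, 140, 70), (100, 150, 80)],
}

def class_centroids(training):
    """Average the training pixels of each class, band by band."""
    centroids = {}
    for label, pixels in training.items():
        n = len(pixels)
        centroids[label] = tuple(sum(p[b] for p in pixels) / n for b in range(3))
    return centroids

def classify_pixel(pixel, centroids):
    """Return the class whose centroid is closest to the pixel in RGB space."""
    return min(centroids, key=lambda label: dist(pixel, centroids[label]))

centroids = class_centroids(TRAINING)
print(classify_pixel((95, 145, 75), centroids))
print(classify_pixel((205, 198, 190), centroids))
```

A real classifier learns far richer decision rules than a single average color per class, but the workflow is the same: labeled samples go in, and every pixel in the image gets a class.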
Land cover refers to visible categories of land use in a given area. These could include:
● Tree cover (high coverage >80%, medium coverage, low coverage)
● Water
● Built-up areas
● Grassland
● Settlements / homes
● Shrub
● Bare soil
This map shows land cover in the conterminous U.S. in 2016. Image credit: USGS
The types of land cover you choose will be up to you, depending on the area you’re classifying. In this example, we’ll be
using a high-resolution orthoimage from the USGS EROS Archive at USGS’s EarthExplorer.
This tutorial covers the following steps:
1. Background
2. Installation
1. Install QGIS
2. Install scikit-learn
3. Installing the dzetsaka plugin
3. Preparation
4. Building Region Classes
1. Modifying the Training Data Classes
2. Adding Regions of Interest
5. Classifying using Training
6. Smoothing and Vectorizing the Results
1. Smoothing the Results
2. Vectorizing the Results
7. Validating the Model
1. Confusion Matrix
2. Confidence Map
3. Cross-Validation
8. Next Steps
Installation
Install QGIS
If you haven’t installed QGIS, please download and install QGIS on your Windows, Mac, or Linux computer. The
Dzetsaka plugin should work with recent versions of QGIS.
Install scikit-learn
Before running QGIS, be sure to have the scikit-learn Python module installed in your QGIS path. To do this, follow one
of these methods:
☐ On Windows, open the OSGeo4W Shell and run
python3 -m pip install scikit-learn -U --user
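To confirm that the install worked, you can check whether the module is visible to the Python interpreter that QGIS uses, for example by pasting the sketch below into the QGIS Python Console (Plugins > Python Console). The helper name module_version is hypothetical; it only uses the standard importlib module.

```python
import importlib
import importlib.util

def module_version(name):
    """Return the installed version of a module, or None if it cannot be found."""
    if importlib.util.find_spec(name) is None:
        return None
    module = importlib.import_module(name)
    return getattr(module, "__version__", "unknown")

# scikit-learn is imported under the name "sklearn".
version = module_version("sklearn")
if version is None:
    print("scikit-learn is not visible to this Python interpreter")
else:
    print(f"scikit-learn {version} is available")
```

If the module is reported as missing, the install most likely went to a different Python installation than the one QGIS runs.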
Preparation
For this tutorial, I’ve prepared a data set ready to use. You can, of course, substitute your own files and the process will
largely be the same.
☐ Download this zip folder, and extract the data into a folder you can access.
machine-learning (Download)
☐ To open the project, double-click on the machine-learning.qgz file.
The data consists of the following files:
● oc6iO_37_000_10978801_20130220_0304r0.tif – a 3-band corrected orthoimage of Chapel Hill, NC from
February 20, 2013.
● A prepared roi.shp (Region of Interest) shapefile with three land cover types: roads, buildings, and fields. This file
will hold our training data.
● An example output raster (out.tif) from an earlier run.
● An example smoothed raster (sieve.tif) from an earlier run.
● And an example vectorized, classified shapefile (vectorized.shp) from an earlier run.
All of the data in the tutorial is in the EPSG:2264, NAD83 / North Carolina (ftUS) projection. The cell size of the raster is
0.5 feet. It’s important to keep your projection consistent across datasets and appropriate for the location that you’re
analyzing.
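Because this projection measures distance in US survey feet, it can help to translate the cell size into ground units you are used to. The arithmetic below is a small sketch; the variable and function names are hypothetical, and the only outside fact used is that one US survey foot is exactly 1200/3937 meters.

```python
# One US survey foot is exactly 1200/3937 meters.
US_SURVEY_FOOT_M = 1200 / 3937

cell_size_ft = 0.5                      # cell size stated for this raster
cell_size_m = cell_size_ft * US_SURVEY_FOOT_M
cell_area_sqft = cell_size_ft ** 2      # ground area covered by one pixel

def pixels_to_area_sqft(pixel_count, cell_ft=cell_size_ft):
    """Ground area, in square feet, covered by a number of pixels."""
    return pixel_count * cell_ft ** 2

print(f"Cell size: {cell_size_m:.4f} m")
print(f"One pixel covers {cell_area_sqft} sq ft")
print(f"100 pixels cover {pixels_to_area_sqft(100)} sq ft")
```

Knowing the area of a pixel is useful when a tool asks for a size in pixels: at this resolution, a group of 100 pixels covers only 25 square feet on the ground.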
NOTE: If you haven’t extracted the data from the zip folder, it may not appear in QGIS. Be sure to extract the data.
☐ Enable editing by clicking the Toggle Editing button in the Digitizing Toolbar. If the Digitizing Toolbar isn’t visible, go
to View > Toolbars > Digitizing Toolbar to check it.
☐ To remove or modify polygons, open the attribute table and delete or edit the rows. If you’ve made a mistake,
you can remove the polygon from the attribute table or toggle editing off and back on.
☐ Once you’re ready to save the layer, click the save button, and then click the Toggle Editing button.
☐ For the input layer, choose the output raster from your classification step.
☐ The Threshold sets the size, in pixels, of the patches to be removed: connected groups of pixels smaller than the
threshold are replaced with the value of a neighboring patch. For a noisy dataset, you will need to set this quite high,
perhaps 100 or more. Fixing the training data and rerunning the classification will also reduce the noise.
☐ Set the output file to a new raster.
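To see what the threshold does, here is a simplified sieve written in plain Python. It is a sketch, not the algorithm QGIS runs: the function name sieve is hypothetical, patches are found with 4-connectivity, and a small patch is filled with the most common class along its border rather than with the value of its largest neighboring patch.

```python
from collections import Counter, deque

def sieve(grid, threshold):
    """Replace 4-connected patches smaller than `threshold` pixels
    with the most common class along their border."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]          # leave the input untouched
    seen = [[False] * cols for _ in range(rows)]
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0]:
                continue
            # Flood-fill the patch containing (r0, c0).
            value = grid[r0][c0]
            patch, border = [], Counter()
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                patch.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if not (0 <= nr < rows and 0 <= nc < cols):
                        continue
                    if grid[nr][nc] != value:
                        border[grid[nr][nc]] += 1   # count neighboring classes
                    elif not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(patch) < threshold and border:
                fill = border.most_common(1)[0][0]
                for r, c in patch:
                    out[r][c] = fill
    return out

# Hypothetical classified raster: class 1 with a lone class-2 pixel and a 2 x 2 block of class 3.
classified = [
    [1, 1, 1, 1],
    [1, 2, 1, 1],
    [1, 1, 3, 3],
    [1, 1, 3, 3],
]
for row in sieve(classified, threshold=2):
    print(row)
```

With a threshold of 2, only the isolated class-2 pixel is absorbed into its surroundings. Raising the threshold to 5 would also remove the four-pixel block of class 3, which is why a very high threshold can erase real features along with the noise.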
If you compare the output from sieve to the output from the classifier, you should see a clear difference in noise.
Output from classifier
Confusion Matrix
To validate the model, we have several options. One is a confusion matrix, which compares regions of a known
class with the class the classifier assigned to them.
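As a rough illustration, the sketch below builds a confusion matrix in plain Python from two lists: the known class of each validation pixel and the class the classifier produced. The pixel labels are made up for the example, and the function name confusion_matrix is simply a local helper defined here.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are the known classes, columns are the classifier's output."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Hypothetical validation pixels: known class vs. classifier output.
labels = ["roads", "buildings", "fields"]
actual = ["roads", "roads", "roads", "buildings", "buildings",
          "fields", "fields", "fields"]
predicted = ["roads", "roads", "buildings", "buildings", "roads",
             "fields", "fields", "fields"]

matrix = confusion_matrix(actual, predicted, labels)

# Print the matrix with a header row and a label at the start of each row.
print(f"{'':>10}" + "".join(f"{name:>10}" for name in labels))
for name, row in zip(labels, matrix):
    print(f"{name:>10}" + "".join(f"{n:>10}" for n in row))
```

Values on the diagonal are pixels the classifier got right. Off-diagonal values show which classes are being confused with each other; in this made-up sample, roads and buildings are mixed up while fields are classified cleanly.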
Confidence Map
Dzetsaka will generate a confidence map, which shows how confident the classifier is for each pixel, on a range of 0 to
100%, based on the training data. This doesn’t mean that the results are valid, only that the output matches the
training data. The map helps to locate potential spatial errors in the data.
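One simple way to use a confidence map is to flag the pixels that fall below a cutoff and look at where they cluster. The sketch below assumes the confidence values have already been read into a small grid; the values, the cutoff of 50, and the function name low_confidence_cells are all hypothetical.

```python
# Hypothetical confidence values (0-100) for a small 3 x 4 block of pixels.
confidence = [
    [98, 95, 91, 40],
    [97, 88, 35, 30],
    [99, 96, 90, 85],
]

def low_confidence_cells(grid, cutoff):
    """Return (row, column) positions whose confidence is below the cutoff."""
    return [
        (r, c)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
        if value < cutoff
    ]

flagged = low_confidence_cells(confidence, cutoff=50)
share = len(flagged) / (len(confidence) * len(confidence[0]))
print(f"Low-confidence cells: {flagged}")
print(f"Share of pixels below 50%: {share:.0%}")
```

Clusters of flagged pixels are good candidates for additional training polygons.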
Cross-Validation
Another technique to test the classifier is cross-validation. This involves comparing the outputs to a known classification,
either from another dataset or from training data left out of the initial learning process.
Overall accuracy can be calculated from the omitted training data by dividing the number of correctly classified pixels
in the region by the total number of pixels in the region.
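The two ideas above, holding back part of the training data and computing overall accuracy from it, can be sketched in a few lines of plain Python. The polygon IDs, the 25% hold-out fraction, the class labels, and both function names are hypothetical.

```python
import random

def split_holdout(samples, holdout_fraction=0.25, seed=0):
    """Shuffle the samples and set aside a fraction for validation."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

def overall_accuracy(actual, predicted):
    """Correctly classified pixels divided by the total number of pixels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Hypothetical labeled polygons, identified by ID only.
polygon_ids = list(range(1, 13))
training, validation = split_holdout(polygon_ids)
print(f"Train on {len(training)} polygons, validate on {len(validation)}")

# Hypothetical known classes vs. classifier output for held-out pixels.
actual = ["roads", "roads", "buildings", "fields", "fields"]
predicted = ["roads", "buildings", "buildings", "fields", "fields"]
print(f"Overall accuracy: {overall_accuracy(actual, predicted):.0%}")
```

Fixing the random seed makes the split repeatable, so you can rerun the classification with different settings and compare accuracy on the same held-out polygons.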
Next Steps
Training and classification of land cover data requires patience and practice. There are plenty of other open source
tools for image classification, including the Semi-Automatic Classification Plugin, which can find and download
satellite imagery and process it.
Other software platforms, including ArcGIS, provide pre-made classifiers that can be used for specific purposes.
Once a classification has been verified, multiple images taken over time can be used to determine land cover change.
This technique has its limits as well: it requires skill in choosing training data, and, as with
all AI, Garbage In = Garbage Out. But even basic classification techniques can provide a stepping stone to understanding
the landscape and examining change in the land.