This is a detour in my research to find a lightweight method to extract semantically similar content from data of varying sizes. Although this demo is for image similarity, the same principles can be applied to other types of data, such as audio or text. The method used in this application is a traditional machine learning method called BoVW, or Bag of Visual Words.
Methods like this are still interesting and still have applications because in many scientific domains there is no pre-trained CNN and no model weights exist (think SEM / Scanning Electron Microscope images, microscope slides, defect images from proprietary processes, FFTs of audio segments of varying lengths, etc.), and as such it can be expensive in terms of time, money, and compute to use deep-learning-based methods in these domains. In NLP applications, one could use similar techniques to search for similar documents in a corpus of documents of variable lengths.
The main takeaway with this family of techniques is that by creating a histogram over a fixed-size codebook, it can accommodate input data of varying sizes or feature counts and does not require resizing or truncating the data to fit a certain model size. There are techniques in this family with better recall, but they were not used here since they involve heavy storage, which is a constraint when hosting on the free GitHub and Streamlit Community Cloud tiers. For my own research I came upon a different technique that is fairly unique (though built on similar principles), which I do not want to make public, at least not yet.
Generally, however, BoVW is a good basis for generating ideas for other, better techniques, and therefore it should be part of one's bag of tools.
If you just want to see the application, it is here.
This application demonstrates the use of BoVW for image semantic similarity search using the National Gallery of Art's collection of Open Access images. The NGA makes over 120,000 images of art available for download via their Open Data / Open Access program: https://www.nga.gov/artworks/free-images-and-open-access.
On their GitHub page, one can find a CSV of published images that contains direct links to the images (at lower resolutions): published_images.csv.
For this application, 127,883 images were processed with the technique described above, using data made available up to 2025-12-10.
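As a rough sketch of this collection step, the CSV can be read and filtered with polars. The column names below (`uuid`, `iiifthumburl`) are my assumptions about the file's schema, so verify them against the actual header:

```python
import polars as pl

# Read the NGA's published_images.csv. The column names used below
# (uuid, iiifthumburl) are assumptions -- check the actual file header.
df = pl.read_csv("published_images.csv")

# Keep only the unique key and the lower-resolution thumbnail URL,
# dropping rows with a missing URL.
links = (
    df.select(["uuid", "iiifthumburl"])
      .filter(pl.col("iiifthumburl").is_not_null())
)
print(links.head())
```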
The BoVW method used is composed of the following steps:
Unsupervised learning phase:
- Extract the visual words, in this case SIFT descriptors from images of varying sizes
- Cluster these words using k-means with a preset number of clusters k. This essentially creates a codebook for this unique set of data (sketched below).
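A minimal sketch of this learning phase, assuming OpenCV's SIFT and scikit-learn's MiniBatchKMeans. The codebook size K = 1024 and the batch threshold are illustrative, not the values behind the actual index:

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

K = 1024  # codebook size; illustrative, not the app's actual value
sift = cv2.SIFT_create()
kmeans = MiniBatchKMeans(n_clusters=K, batch_size=4096, random_state=0)

image_paths = ["img_0001.jpg", "img_0002.jpg"]  # replace with your own files

def sift_descriptors(path: str):
    """Extract 128-d SIFT descriptors from an image of any size."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return None
    _, descriptors = sift.detectAndCompute(img, None)
    return descriptors  # (n_keypoints, 128) float32, or None

# Stream descriptors through mini-batch k-means to learn the codebook.
# Descriptors are buffered so each partial_fit call sees a large batch
# (at least K samples for the first call).
buffer = []
for path in image_paths:
    des = sift_descriptors(path)
    if des is not None:
        buffer.append(des)
    if sum(len(d) for d in buffer) >= 4096:
        kmeans.partial_fit(np.vstack(buffer))
        buffer = []
if buffer:
    kmeans.partial_fit(np.vstack(buffer))
```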
Transformation phase:
- Given an input image, extract SIFT features
- Use the trained k-means model to predict the cluster for each feature
- Then create a count histogram of how many features land in each cluster
- Optionally (this step helps to improve recall), perform a TF-IDF transformation based on the album or collection of images that this image belongs to
- This results in essentially a unique barcode for this specific image
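A sketch of this transform phase, reusing `sift_descriptors`, `kmeans`, `image_paths`, and `K` from the snippet above. The sparse representation and scikit-learn's TfidfTransformer match the tools listed later, but the exact pipeline wiring is my assumption:

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack
from sklearn.feature_extraction.text import TfidfTransformer

def bovw_histogram(path: str) -> csr_matrix:
    """Turn one image into a sparse 1 x K count histogram (its 'barcode')."""
    des = sift_descriptors(path)
    if des is None:
        return csr_matrix((1, K), dtype=np.float32)
    labels = kmeans.predict(des)               # cluster id per descriptor
    counts = np.bincount(labels, minlength=K)  # counts per codebook word
    return csr_matrix(counts.reshape(1, -1), dtype=np.float32)

# Stack all histograms, then apply the optional TF-IDF re-weighting
# fitted over the whole collection.
counts = vstack([bovw_histogram(p) for p in image_paths])
tfidf = TfidfTransformer()
barcodes = tfidf.fit_transform(counts)  # sparse, one row per image
```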
Next steps, depending on the application:
- For an image search application (via semantic similarity), one can add all of these image barcodes to a vector database in order to search efficiently by visual similarity (see the sketch after this list)
- For image classification, these barcodes (or vectors) can be clustered, and a representative image from each cluster can be selected for human annotation; the label is then propagated to the rest of the cluster (a technique called label propagation), saving labor
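For the search use case, indexing and querying the barcodes with usearch might look like the sketch below. The cosine metric, the key scheme, and densifying everything in one shot are illustrative assumptions (a real pipeline would add vectors in batches):

```python
import numpy as np
from usearch.index import Index

# Build an ANN index over the barcode vectors from the previous sketch.
index = Index(ndim=K, metric="cos", dtype="f32")
keys = np.arange(barcodes.shape[0], dtype=np.uint64)
# Densifying all rows at once is fine for a sketch; batch it at ~130k rows.
index.add(keys, barcodes.toarray().astype(np.float32))
index.save("bovw.usearch")

# Query: transform a new image the same way, then take nearest neighbors.
query = tfidf.transform(bovw_histogram("query.jpg")).toarray()[0]
matches = index.search(query.astype(np.float32), 10)
print(matches.keys, matches.distances)
```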
Despite the simplicity seen from the front-end, steps were taken to enable efficient training on the nearly 130,000 images. Techniques used include:
- sparse array passing and storage at all stages (both learning and transform)
- mini-batch k-means for the learning phase
- batched processing and transformation
- efficient vector storage and recall with usearch
- memory savings at run time by memory-mapping the usearch index (see the sketch after this list)
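The memory-mapping in the last item is a one-line change in usearch: view the index file from disk instead of loading it into RAM. A minimal sketch:

```python
import numpy as np
from usearch.index import Index

K = 1024  # must match the dimensionality the index was built with
index = Index(ndim=K, metric="cos", dtype="f32")
index.view("bovw.usearch")  # memory-map from disk; index.load() would copy into RAM

# Searches read index pages lazily, keeping resident memory small.
query_vector = np.zeros(K, dtype=np.float32)  # stand-in; use a real TF-IDF barcode
matches = index.search(query_vector, 10)
```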
The end result for these 127,883 images is a 268MB usearch vector index, a 3.3MB pickled BoVW model, and a 48MB CSV file that relates the unique index keys to the image URLs for display.
The following tools were used to collect the data, build the model and the vector database, and create this application:
- polars to download and read the CSV, as well as to filter specific columns / rows
- OpenCV for reading the images, as well as for the SIFT (Scale-Invariant Feature Transform) model
- numpy and scipy for sparse arrays and other numerical computation
- scikit-learn for mini-batch k-means and the TF-IDF transformer
- usearch for storing the transformed vectors and for efficient ANN search
- streamlit for the front-end UI
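Tying the pieces together at query time, a minimal Streamlit front-end could look like the sketch below. The artifact filename, the CSV column names (`key`, `url`), and the reuse of `sift`, `kmeans`, `tfidf`, `index`, and `K` from the earlier sketches are all assumptions, not the app's actual code:

```python
import cv2
import numpy as np
import polars as pl
import streamlit as st
from scipy.sparse import csr_matrix

# sift, kmeans, tfidf, index (memory-mapped), and K come from the earlier
# sketches; "keys_to_urls.csv" and its columns (key, url) are assumed names.
urls = pl.read_csv("keys_to_urls.csv")

st.title("BoVW image similarity search")
uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    # Decode the upload, compute its barcode, and show the nearest images.
    buf = np.frombuffer(uploaded.read(), dtype=np.uint8)
    img = cv2.imdecode(buf, cv2.IMREAD_GRAYSCALE)
    _, des = sift.detectAndCompute(img, None)
    counts = np.bincount(kmeans.predict(des), minlength=K)
    vec = tfidf.transform(csr_matrix(counts.reshape(1, -1))).toarray()[0]
    for key in index.search(vec.astype(np.float32), 10).keys:
        row = urls.filter(pl.col("key") == int(key))
        st.image(row["url"][0])
```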