This is a detour in my research to find a lightweight method to extract semantically similar content from data of varying sizes. Although this demo is for image similarity, the same principles can be applied to other types of data, such as audio or text. The method used in this application is a traditional machine learning method called BoVW, or Bag of Visual Words.
Methods like this are still interesting and still have applications because in many scientific domains there is no pre-trained CNN and no model weights exist (think SEM / Scanning Electron Microscope images, microscope slides, defect images from proprietary processes, FFTs of audio segments of varying lengths, etc.), and as such it can be expensive in terms of time, money, and compute to use deep-learning-based methods in these domains. In NLP applications, one could use similar techniques to search for similar documents in a corpus of documents of variable lengths.
The main takeaway with this family of techniques is that by creating a histogram over a fixed-size codebook, it can accommodate input data of varying sizes or feature counts and does not require resizing or truncating the data to fit a certain model size. There are techniques in this family with better recall, but they were not used here since they involve heavy storage, which is a constraint when hosting on the free GitHub and Streamlit Community Cloud tiers. For my own research I came upon a different technique that is fairly unique (though built on similar principles), which I do not want to make public, at least not yet.
Generally, however, BoVW is a good basis for generating ideas for other, better techniques, and therefore it should be part of one's bag of tools.
If you just want to see the application, it is here.
This application demonstrates the use of BoVW for image semantic similarity search using the National Gallery of Art's collection of Open Access images. The NGA makes over 120,000 images of art available for download via their Open Data / Open Access program: https://www.nga.gov/artworks/free-images-and-open-access.
On their GitHub page, one can find a CSV of published images that contains direct links to the images (at lower resolutions): published_images.csv.
For this application, 127,883 images were processed with the technique described above, using data made available up to 2025-12-10.
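As a rough sketch of this collection step, the CSV can be read and filtered with polars. The column names below (`uuid`, `iiifthumburl`) are my assumptions about the file's schema, so verify them against the actual header:

```python
import polars as pl

# Read the NGA's published_images.csv. The column names used below
# (uuid, iiifthumburl) are assumptions -- check the actual file header.
df = pl.read_csv("published_images.csv")

# Keep only the unique key and the lower-resolution thumbnail URL,
# dropping rows with a missing URL.
links = (
    df.select(["uuid", "iiifthumburl"])
      .filter(pl.col("iiifthumburl").is_not_null())
)
print(links.head())
```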
The BoVW method used is composed of the following steps:
Unsupervised learning phase:
- Extract the visual words, in this case SIFT descriptors from images of varying sizes
- Cluster these words using k-means with a preset number of clusters k. This essentially creates a codebook for this unique set of data (sketched below).
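A minimal sketch of this learning phase, assuming OpenCV's SIFT and scikit-learn's MiniBatchKMeans. The codebook size K = 1024 and the batch threshold are illustrative, not the values behind the actual index:

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

K = 1024  # codebook size; illustrative, not the app's actual value
sift = cv2.SIFT_create()
kmeans = MiniBatchKMeans(n_clusters=K, batch_size=4096, random_state=0)

image_paths = ["img_0001.jpg", "img_0002.jpg"]  # replace with your own files

def sift_descriptors(path: str):
    """Extract 128-d SIFT descriptors from an image of any size."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return None
    _, descriptors = sift.detectAndCompute(img, None)
    return descriptors  # (n_keypoints, 128) float32, or None

# Stream descriptors through mini-batch k-means to learn the codebook.
# Descriptors are buffered so each partial_fit call sees a large batch
# (at least K samples for the first call).
buffer = []
for path in image_paths:
    des = sift_descriptors(path)
    if des is not None:
        buffer.append(des)
    if sum(len(d) for d in buffer) >= 4096:
        kmeans.partial_fit(np.vstack(buffer))
        buffer = []
if buffer:
    kmeans.partial_fit(np.vstack(buffer))
```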
Transformation phase:
- Given an input image, extract SIFT features
- Use the trained k-means model to predict the cluster for each feature
- Then create a count histogram of how many features land in each cluster
- Optionally (this step helps to improve recall), perform a TF-IDF transformation based on the album or collection of images that this image belongs to
- This results in essentially a unique barcode for this specific image
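A sketch of this transform phase, reusing `sift_descriptors`, `kmeans`, `image_paths`, and `K` from the snippet above. The sparse representation and scikit-learn's TfidfTransformer match the tools listed later, but the exact pipeline wiring is my assumption:

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack
from sklearn.feature_extraction.text import TfidfTransformer

def bovw_histogram(path: str) -> csr_matrix:
    """Turn one image into a sparse 1 x K count histogram (its 'barcode')."""
    des = sift_descriptors(path)
    if des is None:
        return csr_matrix((1, K), dtype=np.float32)
    labels = kmeans.predict(des)               # cluster id per descriptor
    counts = np.bincount(labels, minlength=K)  # counts per codebook word
    return csr_matrix(counts.reshape(1, -1), dtype=np.float32)

# Stack all histograms, then apply the optional TF-IDF re-weighting
# fitted over the whole collection.
counts = vstack([bovw_histogram(p) for p in image_paths])
tfidf = TfidfTransformer()
barcodes = tfidf.fit_transform(counts)  # sparse, one row per image
```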
Next steps, depending on the application:
- For an image search application (via semantic similarity), one can add all of these image barcodes to a vector database in order to search efficiently by visual similarity (see the sketch after this list)
- For image classification, these barcodes (or vectors) can be clustered, and a representative image from each cluster can be selected for human annotation; the label is then propagated to the rest of the cluster (a technique called label propagation), saving labor
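For the search use case, indexing and querying the barcodes with usearch might look like the sketch below. The cosine metric, the key scheme, and densifying everything in one shot are illustrative assumptions (a real pipeline would add vectors in batches):

```python
import numpy as np
from usearch.index import Index

# Build an ANN index over the barcode vectors from the previous sketch.
index = Index(ndim=K, metric="cos", dtype="f32")
keys = np.arange(barcodes.shape[0], dtype=np.uint64)
# Densifying all rows at once is fine for a sketch; batch it at ~130k rows.
index.add(keys, barcodes.toarray().astype(np.float32))
index.save("bovw.usearch")

# Query: transform a new image the same way, then take nearest neighbors.
query = tfidf.transform(bovw_histogram("query.jpg")).toarray()[0]
matches = index.search(query.astype(np.float32), 10)
print(matches.keys, matches.distances)
```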
Despite the simplicity seen from the front-end, steps were taken to enable efficient training on the nearly 130,000 images. Techniques used include:
- sparse array passing and storage at all stages (both learning and transform)
- mini-batch k-means for the learning phase
- batched processing and transformation
- efficient vector storage and recall with usearch
- memory savings at run time by memory-mapping the usearch index (see the sketch after this list)
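The memory-mapping in the last item is a one-line change in usearch: view the index file from disk instead of loading it into RAM. A minimal sketch:

```python
import numpy as np
from usearch.index import Index

K = 1024  # must match the dimensionality the index was built with
index = Index(ndim=K, metric="cos", dtype="f32")
index.view("bovw.usearch")  # memory-map from disk; index.load() would copy into RAM

# Searches read index pages lazily, keeping resident memory small.
query_vector = np.zeros(K, dtype=np.float32)  # stand-in; use a real TF-IDF barcode
matches = index.search(query_vector, 10)
```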
The end result for these 127,883 images is a 268MB usearch vector index, a 3.3MB pickled BoVW model, and a 48MB CSV file that relates the unique index keys to the image URLs for display.
The following tools were used to collect the data, build the model and the vector database, and create this application:
- polars to download and read the CSV, as well as to filter specific columns / rows
- OpenCV for reading the images, as well as for the SIFT (Scale-Invariant Feature Transform) model
- numpy and scipy for sparse arrays and other numerical computation
- scikit-learn for mini-batch k-means and the TF-IDF transformer
- usearch for storing the transformed vectors and for efficient ANN search
- streamlit for the front-end UI
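Tying the pieces together at query time, a minimal Streamlit front-end could look like the sketch below. The artifact filename, the CSV column names (`key`, `url`), and the reuse of `sift`, `kmeans`, `tfidf`, `index`, and `K` from the earlier sketches are all assumptions, not the app's actual code:

```python
import cv2
import numpy as np
import polars as pl
import streamlit as st
from scipy.sparse import csr_matrix

# sift, kmeans, tfidf, index (memory-mapped), and K come from the earlier
# sketches; "keys_to_urls.csv" and its columns (key, url) are assumed names.
urls = pl.read_csv("keys_to_urls.csv")

st.title("BoVW image similarity search")
uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    # Decode the upload, compute its barcode, and show the nearest images.
    buf = np.frombuffer(uploaded.read(), dtype=np.uint8)
    img = cv2.imdecode(buf, cv2.IMREAD_GRAYSCALE)
    _, des = sift.detectAndCompute(img, None)
    counts = np.bincount(kmeans.predict(des), minlength=K)
    vec = tfidf.transform(csr_matrix(counts.reshape(1, -1))).toarray()[0]
    for key in index.search(vec.astype(np.float32), 10).keys:
        row = urls.filter(pl.col("key") == int(key))
        st.image(row["url"][0])
```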