This project offers a user-friendly interface for working with the Llama-3.2-11B-Vision and Molmo-7B-D models.
Both 4-bit quantized variants, Llama-3.2-11B-Vision-bnb-4bit and Molmo-7B-D-bnb-4bit, require about 12 GB of VRAM to run.
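If you want to check the ~12 GB requirement before downloading a model, a minimal sketch (assuming PyTorch with CUDA support is installed, which the setup steps below cover):

```python
def has_enough_vram(required_gib: float = 12) -> bool:
    """Report whether the first CUDA device has at least `required_gib`
    GiB of total memory. Illustrative sketch; assumes PyTorch is installed."""
    import torch  # imported lazily so this sketch loads even without torch

    if not torch.cuda.is_available():
        return False
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return total_bytes >= required_gib * 1024**3
```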
The code is tested on Ubuntu 22.04.5 with Python 3.10.12.
Model selection is done via the command line.
To set up and run this project on your local machine, follow the steps below:
Clone the repository to a convenient location on your computer:

```bash
git clone <repository-url>
cd <repository-directory>
```

Inside the cloned repository, create a virtual environment:

```bash
python -m venv venv-ui
```

Activate the virtual environment:

```bash
.\venv-ui\Scripts\activate
```

With the virtual environment active, install the dependencies from requirements.txt:

```bash
pip install -r requirements.txt
```

Install Torch and TorchVision with separate commands:

```bash
pip install torch==2.4.1+cu121 --index-url https://download.pytorch.org/whl/cu121
pip install torchvision==0.19.1+cu121 --index-url https://download.pytorch.org/whl/cu121
```

To start the UI, you can either:
- Use the `run.bat` script (Windows only): simply double-click on `run.bat`.

or

- Activate the virtual environment:
  - Windows: `.\venv-ui\Scripts\activate`
  - Linux: `source venv-ui/bin/activate`

  Then run the Python script:

  ```bash
  python clean-ui.py
  ```
You can use the Gradio client to programmatically script prompts and retrieve JSON files with descriptions. For example, the following command retrieves descriptions for the two images in the `img/` subdirectory:

```bash
python3 client.py img/preview.png img/selection.png
```
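As a sketch of what such a scripted client looks like with the `gradio_client` library (the endpoint name `/predict`, the argument order, and the helper names here are assumptions — check `client.py` for the actual API):

```python
import json


def describe_images(image_paths, server_url="http://127.0.0.1:7860/",
                    prompt="Describe this image."):
    """Hypothetical helper mirroring client.py: send each image to a
    running Clean-UI Gradio server and collect the text descriptions."""
    # Imported lazily so this sketch loads without gradio_client installed.
    from gradio_client import Client, handle_file

    client = Client(server_url)
    results = {}
    for path in image_paths:
        # api_name="/predict" is an assumption; inspect the app's
        # "Use via API" page for the real endpoint name and signature.
        results[path] = client.predict(handle_file(path), prompt,
                                       api_name="/predict")
    return results


def save_descriptions(results, out_path="descriptions.json"):
    # Persist the prompt/response pairs as JSON, as client.py does.
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
```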
- Upload an image and enter a prompt to generate an image description.
- Adjustable parameters such as temperature, top-k, and top-p for more control over the generated text.
- Chatbot history to display prompt-response interactions.
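Temperature, top-k, and top-p are standard text-generation controls. As an illustrative sketch of how they shape the next-token distribution (this is not the project's actual sampling code):

```python
import math


def sample_filter(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature scaling, top-k, and top-p (nucleus) filtering
    to raw logits and return the resulting probability distribution."""
    # Temperature: <1 sharpens the distribution, >1 flattens it.
    scaled = [l / temperature for l in logits]
    # Top-k: keep only the k highest logits (0 disables the filter).
    if top_k > 0:
        kth = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [l if l >= kth else float("-inf") for l in scaled]
    # Softmax over the remaining logits.
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalize.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    masked = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    s = sum(masked)
    return [p / s for p in masked]
```

Lower top-k / top-p values make the output more deterministic; higher values allow more varied descriptions.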
This project is licensed under the MIT License. See the LICENSE file for more details.