This project demonstrates and analyzes CLIP's (Contrastive Language-Image Pre-Training) zero-shot classification capabilities, then explores various approaches to improve its performance. Through systematic testing with diverse object images and text prompts, we identify CLIP's weaknesses and develop effective solutions using prompt engineering and ensemble methods.
- Demonstrate CLIP's zero-shot capabilities with real object images
- Analyze strengths and weaknesses of open-vocabulary perception
- Test diverse text prompts including specific objects, attributes, and functional descriptions
- Attempt fine-tuning approaches to improve performance
- Develop effective prompt engineering strategies as alternative to fine-tuning
- Provide comprehensive insights into improving vision systems without retraining
CLIP/
โโโ clip_zero_shot_test.py # Main testing script
โโโ clip_analysis_results.md # Initial analysis report
โโโ clip_improvement_strategies.py # Prompt engineering strategies
โโโ clip_pretrained_demo.py # Pre-trained CLIP demonstration
โโโ clip_finetune.py # Experimental fine-tuning script (may be unstable)
โโโ clip_finetuning_summary.md # Fine-tuning journey summary
โโโ clip_improvement_analysis.md # Comprehensive improvement analysis
โโโ pretrained_clip_results.json # Results from pre-trained demonstrations
โโโ demo_images/ # Test images (apple, banana, cat, dog, hammer, screwdriver)
โโโ clip/ # CLIP model implementation
โโโ README.md # This file
- Python 3.9+
- PyTorch 1.13.1+ with CUDA support
- CLIP model (ViT-B/32)
-
Activate the CLIP conda environment:
conda activate clip
-
Run the zero-shot classification test:
python clip_zero_shot_test.py
-
Test improvement strategies:
python clip_improvement_strategies.py
-
Demonstrate pre-trained capabilities:
python clip_pretrained_demo.py
- 6 diverse objects: Apple, Banana, Cat, Dog, Hammer, Screwdriver
- Source: CLIP demo images (high-quality, real photographs)
- Specific objects: "a photo of a hammer", "a photo of a cat"
- Functional descriptions: "something with a handle", "something you can eat"
- Material properties: "something made of metal", "something with fur"
- Color attributes: "something red", "something orange"
- Size descriptions: "something small", "something large"
- Contextual categories: "something in the kitchen", "something in the garage"
- Confidence scores for each image-prompt pair
- Top matches for each image and prompt
- Strengths identification (high-confidence matches)
- Weaknesses identification (low-confidence matches)
- Ambiguous cases (multiple high-confidence matches)
-
Excellent Common Object Recognition
- Banana: 97.2% confidence for "a photo of a banana"
- Dog: 79.7% confidence for "a photo of a dog"
- Cat: 76.9% confidence for "a photo of a cat"
-
Hierarchical Understanding
- Apple โ "fruit" (32.0% confidence)
- Cat/Dog โ "pet" (12.2% and 11.1% respectively)
- Cat/Dog โ "animal" (3.4% and 3.2% respectively)
-
Semantic Relationships
- Apple/Banana โ "something you can eat" (5.3% and 0.7%)
- Hammer/Screwdriver โ "something you can use" (12.9% and 1.6%)
-
Tool Recognition Failures
- Hammer: Only 0.5% confidence for "a photo of a hammer"
- Screwdriver: Only 0.1% confidence for "a photo of a screwdriver"
-
Material Property Confusion
- Screwdriver classified as "something with fur" (28.5% confidence)
- Hammer classified as "something with fur" (15.5% confidence)
-
Attribute Recognition Issues
- Poor color recognition (Apple โ "something red" only 2.8%)
- Inconsistent size perception
- Limited material understanding
We attempted to fine-tune CLIP to address these weaknesses, but encountered technical challenges:
- Loss Function Issues: Contrastive loss implementation wasn't compatible
- Learning Rate Problems: Even very low learning rates caused instability
- Gradient Issues: Pre-trained weights made fine-tuning unstable
- Dataset Size: 6 images insufficient for effective fine-tuning
Instead of fine-tuning, we discovered that prompt engineering is much more effective:
- Hammer Recognition: 0.005 โ 0.982 (196x improvement)
- Screwdriver Recognition: 0.001 โ 0.920 (920x improvement)
- Color Recognition: 0.028 โ 0.966 (35x improvement)
-
Detailed Prompts
โ "a photo of a hammer" โ 0.005 confidence โ "a photo of a metal hammer with wooden handle" โ 0.982 confidence -
Contextual Prompts
โ "a photo of a tool" โ 0.628 confidence โ "a photo of a hammer in a toolbox" โ 0.321 confidence -
Functional Descriptions
โ "a photo of a screwdriver" โ 0.391 confidence โ "a photo of a screwdriver used for turning screws" โ 0.055 confidence -
Ensemble Methods
- Combine multiple prompts for the same concept
- Use maximum, average, or weighted scores
- Reduces variance and improves consistency
Prompt engineering is more effective than fine-tuning for addressing CLIP's weaknesses:
- No training required - immediate implementation
- Cost-effective - no computational resources needed
- Domain flexible - can be adapted for any domain
- Leverages pre-trained knowledge - uses CLIP's existing understanding
- Natural Language Integration: Successfully maps text descriptions to visual concepts
- Zero-Shot Generalization: Works without training on specific categories
- Semantic Understanding: Captures functional and categorical relationships
- Prompt Engineering Effectiveness: Dramatic improvements possible with better prompts
- Domain Bias: Better at common objects than specialized tools
- Attribute Recognition: Struggles with materials, colors, and sizes (but improvable)
- Fine-Grained Discrimination: Poor at distinguishing similar objects
- Context Sensitivity: Performance varies significantly with prompt wording
- Fine-tuning Challenges: Difficult to improve through traditional fine-tuning approaches
-
Image Loading (
load_images())- Loads images from demo_images directory
- Handles PIL Image processing
- Error handling for corrupted files
-
CLIP Classification (
run_clip_classification())- Encodes images and text using CLIP
- Computes cosine similarities
- Returns confidence scores for all image-prompt pairs
-
Results Analysis (
analyze_results())- Identifies top matches for each image/prompt
- Finds strengths and weaknesses
- Detects ambiguous cases
- Model: CLIP ViT-B/32
- Device: CUDA GPU (with CPU fallback)
- Preprocessing: CLIP's standard image transforms
python clip_zero_shot_test.pyUsing device: cuda
CLIP model loaded successfully!
Loaded: dog, cat, screwdriver, banana, apple, hammer
Testing with 29 text prompts
================================================================================
CLIP ZERO-SHOT CLASSIFICATION RESULTS
================================================================================
TOP MATCHES FOR EACH IMAGE:
--------------------------------------------------
DOG:
0.797: a photo of a dog
0.111: a photo of a pet
0.032: a photo of an animal
[Additional results...]
ANALYSIS OF CLIP'S STRENGTHS & WEAKNESSES:
--------------------------------------------------
STRENGTHS (High confidence matches):
โ banana โ 'a photo of a banana' (confidence: 0.972)
โ dog โ 'a photo of a dog' (confidence: 0.797)
โ cat โ 'a photo of a cat' (confidence: 0.769)
WEAKNESSES (Low confidence matches):
โ hammer โ 'a photo of a hammer' (confidence: 0.005)
โ screwdriver โ 'a photo of a screwdriver' (confidence: 0.001)
This project serves as an excellent learning resource for:
- Computer Vision Students: Understanding zero-shot learning and prompt engineering
- ML Researchers: Analyzing model limitations and effective improvement strategies
- Practitioners: Learning about prompt engineering for vision models
- Educators: Demonstrating the challenges and solutions for open-vocabulary perception
- Engineers: Understanding when to use fine-tuning vs. prompt engineering
Potential extensions and improvements:
- Prompt Optimization: Develop algorithms to automatically find optimal prompts
- Domain-Specific Templates: Create prompt templates for different domains (medical, industrial, etc.)
- Ensemble Systems: Build robust systems combining multiple prompt strategies
- Larger CLIP Models: Explore ViT-L/14 and ViT-H/14 for even better performance
- Hybrid Approaches: Combine prompt engineering with selective fine-tuning
- Real-World Applications: Apply to practical use cases in industry
- Prompt Generation: Use LLMs to automatically generate effective prompts
This is a demonstration project, but suggestions and improvements are welcome:
- Test with additional images
- Experiment with different prompts
- Analyze results with different CLIP models
- Extend the analysis framework
This project follows the same license as the original CLIP repository. See LICENSE for details.
Project Status: โ
Complete with comprehensive improvement analysis
Last Updated: August 2025
CLIP Version: ViT-B/32
Test Images: 6 objects, 29 prompts
Improvement Achieved: 52x-920x better performance through prompt engineering
Key Discovery: Prompt engineering > Fine-tuning for CLIP improvements
The repo includes an experimental fine-tuning script clip_finetune.py. It is provided for reference and may be unstable with very small datasets.
- Run (may require GPU and fp32 compute):
python clip_finetune.py --epochs 5 --batch_size 4 --lr 1e-6- Notes and caveats:
- Tries symmetric CLIP loss with imageโtext logits
- Use 1:1 imageโcaption pairs per batch
- Prefer fp32 (disable autocast) to avoid NaNs
- Freeze most backbone; only train projections/ln_post
- Small datasets (<100 images) are likely to be unstable
See clip_finetuning_summary.md for details on issues encountered and recommended alternatives (prompt-tuning, LoRA, ensemble prompts).