Description
I was extracting information from over 1000 images and wanted a summary for each image: whether it showed a unique person, plus age, gender, ethnicity, and emotional tone.
So with those 1000 images, the time for setting up each environment was:
WD-14 did well, 20 minutes,
BLIP did well, 5 minutes,
LLaVA, similar to CogVLM, takes about 1 hour to set up,
DeepFace, less than 25 minutes; add BLIP and it is 1 hour+.
So I have 1000 images, 20 of them with text.
That's 990 images sliced from a video frame by frame as an experiment. The frames were also flipped, with a slight lighting change over each pair of 100 images, excluding the 20 with text.
For retrieving information from a video, BLIP did exceptionally well. The caveat is that text was ignored; only BLIP 3 tale picked up the text, but its compute cost was higher, so there are pros and cons.
WD-14 was more verbose, tagging words as tokens; sadly, its overhead cost was not on par with BLIP's.
None of these found a match for the 20 images, placed between the two sets of 480, that showed two famous people. GPT-4 Vision did detect them, but I had to hand-feed it the images, and retrieving a face that way is computationally expensive compared with running live, so it failed regardless of its advanced performance.
We also used LLaVA. It has an overhead cost and is overly verbose. On vision understanding of the next frame from 100 frames it did exceptionally well, but it does not fit the criterion: extracting information from the scene rather than injecting it.
DeepFace could detect race, age, and so forth. Why is this important? With 960 images we can search frame by frame for a person matching given criteria, and if the failure rate is high we still know the person in the scene was at location X or Y, etc. But the overhead in manual time exceeds its usefulness: when you compare 100 images in a folder against the 1000 images, the failure rate is higher than 30%, so manual labour is required.
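The frame-by-frame search mentioned above could be sketched roughly like this. It is only an illustration, not what I ran: the record layout (`age`, `dominant_gender`) is an assumption loosely modelled on the kind of per-face attributes DeepFace reports, and `find_matching_frames` plus the sample frames are hypothetical.

```python
def find_matching_frames(records, gender=None, min_age=None, max_age=None):
    """Return the names of frames where at least one detected face
    satisfies every criterion that was given.

    `records` is a list of (frame_name, faces) pairs, where `faces` is a
    list of dicts holding per-face attributes (assumed layout).
    """
    matches = []
    for frame_name, faces in records:
        for face in faces:
            if gender is not None and face.get("dominant_gender") != gender:
                continue
            age = face.get("age")
            if min_age is not None and (age is None or age < min_age):
                continue
            if max_age is not None and (age is None or age > max_age):
                continue
            # One matching face is enough to keep the frame.
            matches.append(frame_name)
            break
    return matches


# Hypothetical attribute records for three frames.
records = [
    ("frame_0001.png", [{"age": 31, "dominant_gender": "Woman"}]),
    ("frame_0002.png", [{"age": 45, "dominant_gender": "Man"},
                        {"age": 29, "dominant_gender": "Woman"}]),
    ("frame_0003.png", []),  # no face detected
]

hits = find_matching_frames(records, gender="Woman", min_age=25, max_age=35)
print(hits)
```

Frames with no detected face simply drop out, which is where the manual checking comes in when the failure rate is high.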
As for the rest of the 1000 images: 480 had hue adjustments, the other 480 were flipped as well for better detection, and the last 20 were text. The 2 famous faces appear in 20 images; I randomly removed them, but they were placed back in as a test, hidden in the 960 images.
So BLIP becomes the option. Why? It's cheap, it's fast, and its output can be injected. Unfortunately it fails when there is more than one person: two people are both labelled "person", which could be male or female, and that is where DeepFace becomes the last resort.
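A minimal sketch of that "cheap captioner first, face analysis as last resort" idea, under assumptions: `captioner` and `face_analyzer` are hypothetical callables standing in for BLIP and DeepFace, and the keyword check for spotting a multi-person caption is a crude heuristic I made up for illustration.

```python
import re

# Words that hint the caption covers several people (hypothetical heuristic).
MULTI_PERSON = re.compile(r"\b(people|persons|two|three|group|couple)\b",
                          re.IGNORECASE)


def caption_is_ambiguous(caption):
    """True when the caption suggests more than one person in the frame."""
    return bool(MULTI_PERSON.search(caption))


def describe_frame(path, captioner, face_analyzer):
    """Run the cheap captioner first; call the slower face analyzer only
    when the caption looks like it covers several people."""
    caption = captioner(path)
    if not caption_is_ambiguous(caption):
        return {"caption": caption, "faces": None}
    return {"caption": caption, "faces": face_analyzer(path)}


# Stub callables so the sketch runs without any model installed.
def fake_captioner(path):
    return "two people standing in a street" if "0002" in path else "a man smiling"


def fake_face_analyzer(path):
    return [{"dominant_gender": "Man"}, {"dominant_gender": "Woman"}]


print(describe_frame("frame_0001.png", fake_captioner, fake_face_analyzer))
print(describe_frame("frame_0002.png", fake_captioner, fake_face_analyzer))
```

The point of the split is cost: the expensive step only runs on the frames where the cheap caption cannot tell people apart.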
So that leaves us with CogVLM.
Can this framework achieve anything remotely close? Let me and your community know; after all, they'll be reading this.