This is a crude demo project that mimics the supposed live video ingestion capabilities of Google's multimodal Gemini LLM, built with the GPT-4 Vision API instead.
Demo: https://youtu.be/UxQb88gENeg
$ pip install -r requirements.txt
$ export OPENAI_API_KEY=YOUR_OPENAI_API_KEY

To run the voice-commanded terminal version, run the voice.py script.
$ python3 voice.py VIDEO_STREAM_URL

The assistant only reacts to voice commands.
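For context, here is a minimal sketch of how a single captured frame can be sent to the GPT-4 Vision API with the openai Python client. The prompt, model name and helper function below are illustrative assumptions, not code taken from voice.py:

```python
import base64
import cv2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_frame(frame, question="What do you see?"):
    # Encode the raw OpenCV frame as JPEG, then as base64 for the API
    _, jpeg = cv2.imencode(".jpg", frame)
    b64 = base64.b64encode(jpeg.tobytes()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name, may need updating
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

cap = cv2.VideoCapture("VIDEO_STREAM_URL")
ok, frame = cap.read()
if ok:
    print(describe_frame(frame))
```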
To run the motion-detecting version, run the motion.py script.

$ python3 motion.py VIDEO_STREAM_URL

The assistant reacts every time motion is detected in the video. A tripod is recommended.
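Motion detection like this is usually done with simple frame differencing in OpenCV. The snippet below is a generic sketch under that assumption (the thresholds are made up), not code taken from motion.py:

```python
import cv2

cap = cv2.VideoCapture("VIDEO_STREAM_URL")
prev_gray = None

while True:
    ok, frame = cap.read()
    if not ok:
        break

    # Grayscale + blur to reduce noise before differencing
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)

    if prev_gray is not None:
        # Threshold the per-pixel difference between consecutive frames
        diff = cv2.absdiff(prev_gray, gray)
        _, thresh = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
        changed = cv2.countNonZero(thresh)

        if changed > 5000:  # tune for your camera and resolution
            print("Motion detected - this is where a frame would go to GPT-4V")

    prev_gray = gray
```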
To run the automatic version that detects both voice commands and motion in the video, run the auto.py script.
$ python3 auto.py VIDEO_STREAM_URL

The assistant reacts every time motion is detected in the video or a voice command is given. A tripod is recommended.
There is also a version with a "UI" made with CV2 (it sucks but kinda works). It both listens for voice commands and detects motion in the video, and automatically sends both to the GPT-4V API.
$ python3 auto_with_ui.py VIDEO_STREAM_URL

In my testing, I have used my phone camera as the video stream. For this, I used the IP Webcam app from the Play Store, with the camera set to 10 fps at 640x480 resolution.
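The stream URL shown in the IP Webcam app typically looks like http://PHONE_IP:8080/video, so the call ends up looking roughly like this (the IP address is just a placeholder):

$ python3 auto_with_ui.py http://192.168.1.123:8080/video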
The VIDEO_STREAM_URL is passed directly into cv2.VideoCapture(), so you should also be able to pass in a video file, or any other kind of video stream that OpenCV supports.
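If you want to sanity-check a source before pointing the scripts at it, a few lines of OpenCV are enough (this snippet is just for testing, it is not part of the repo):

```python
import cv2

# Anything cv2.VideoCapture accepts should work: a stream URL,
# a file path, or a local webcam index such as 0
cap = cv2.VideoCapture("video.mp4")
ok, frame = cap.read()
print("Opened:", cap.isOpened(), "- got a frame:", ok)
cap.release()
```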
There is a config.py file where you can tweak some settings if you are having trouble with the motion detection or speech detection.
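The exact setting names depend on config.py itself, but settings of this kind usually boil down to a handful of thresholds. The names and values below are purely illustrative, not necessarily what config.py contains:

```python
# Illustrative example only; check config.py for the real names and defaults.
MOTION_THRESHOLD = 5000   # how many changed pixels count as motion
SILENCE_THRESHOLD = 500   # audio energy below which input counts as silence
SILENCE_DURATION = 1.5    # seconds of silence that end a voice command
```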
- GPT-4V API is often slow
- Sometimes the assistant response is detected as a user message
- The CV2 UI sucks and should be built some other way
- The CV2 UI can only be closed by hitting Ctrl+C in the terminal