This repository contains code and models from the following papers:
- Vong, W. K., Wang, W., Orhan, A. E., & Lake, B. M. (2024). Grounded language acquisition through the eyes and ears of a single child. Science.
- Wang, W., Vong, W. K., Kim, N., & Lake, B. M. (2023). Finding Structure in One Child's Linguistic Experience. Cognitive Science, 47, e13305.
This project uses Python 3.8 and uv for dependency management. Follow these steps to set up the environment:
- Install uv:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

- Clone the repository:

```bash
git clone git@github.com:wkvong/multimodal-baby.git
cd multimodal-baby
```

- Create and set up the virtual environment:

```bash
# Create a virtual environment in this folder
uv venv
# Optional: If you need to create the environment in a custom location, run the following commands instead:
uv venv ${UV_CACHE_DIR}/multimodal_baby_env
export VIRTUAL_ENV=${UV_CACHE_DIR}/multimodal_baby_env
# Install dependencies
uv sync
```

- Install additional requirements:

```bash
# Install CLIP from source
uv pip install git+https://github.com/openai/CLIP.git
# Download spaCy language model
uv run -- spacy download en_core_web_sm
```

- Test the installation:

```bash
# Run the demo script to verify everything is working
uv run demo.py
```

Usage of CVCL follows the CLIP API. The following code downloads the pre-trained CVCL model (trained on the SAYCam-S dataset) from HuggingFace Hub, and then encodes images and utterances using the model:

```python
import torch
from multimodal.multimodal_lit import MultiModalLitModel
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
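# load the pre-trained CVCL checkpoint (downloaded from HuggingFace Hub) and prepare it for inference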
cvcl, preprocess = MultiModalLitModel.load_model(model_name="cvcl")
cvcl = cvcl.to(device)
cvcl.eval()
# create random image to encode
images = torch.rand(4, 3, 224, 224).to(device)
image_features = cvcl.encode_image(images)
# create texts to encode
texts = ["ball", "puzzle", "car"]
texts, texts_len = cvcl.tokenize(texts)
texts, texts_len = texts.to(device), texts_len.to(device)
texts_features = cvcl.encode_text(texts, texts_len)
# get logits from a batch of images and texts
logits_per_image, logits_per_text = cvcl(images, texts, texts_len)
print("Logits per image shape:", logits_per_image.shape)
print("Logits per text shape:", logits_per_text.shape)The code in analysis_cvcl/figures.R can be run to reproduce the main figures from the paper.
This project uses the SAYCam dataset described in the following paper:
Sullivan, J., Mei, M., Perfors, A., Wojcik, E. H., & Frank, M. C. (2021). SAYCam: A large, longitudinal audiovisual dataset recorded from the infant's perspective. Open Mind.
The original dataset is hosted on the Databrary repository for behavioral science, along with the SAYCam-S and Labeled-S subsets used in this project. Unfortunately, we are unable to publicly share the SAYCam dataset here due to the terms of use. Interested researchers can apply for access to the dataset with approval from their institution's IRB.
Thank you for checking out our work! If you use models or code from this repo, please cite either:
- Vong, W. K., Wang, W., Orhan, A. E., & Lake, B. M. (2024). Grounded language acquisition through the eyes and ears of a single child. Science.
or:
- Wang, W., Vong, W. K., Kim, N., & Lake, B. M. (2023). Finding Structure in One Child's Linguistic Experience. Cognitive Science, 47, e13305.
We are grateful to the authors of the SAYCam article, and to the volunteers who contributed to the dataset, for making our article possible. This work was supported by the DARPA Machine Common Sense program and NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science.