Language planning labeling for VLA-OS
First, create a virtual environment with Python 3.10 and install PyTorch.
```bash
# Create and activate conda environment
conda create -n plan_label python=3.10 -y
conda activate plan_label
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia -y
```

Then, install this repo.
```bash
pip install -e .
```

We will use an example dataset to illustrate how to convert a numpy dataset into a tfds dataset, and then label the language planning as described in our paper. We also provide the scripts for the libero_10 dataset as an example here.
First, generate a dummy example dataset.
```bash
python tfds_dataset_generation/example_dataset/create_example_data.py
```

The example numpy dataset will then be saved to `data/raw_datasets`. Convert it into a tfds dataset, specifying `data/tfds_datasets` as the saving path.
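The conversion is driven by a TFDS `DatasetBuilder` defined under `tfds_dataset_generation/example_dataset/`. The sketch below shows the general shape of such a builder; the feature names, shapes, and file layout are illustrative assumptions, not the exact schema used by this repo.

```python
from pathlib import Path

import numpy as np
import tensorflow_datasets as tfds


class ExampleDataset(tfds.core.GeneratorBasedBuilder):
    """Example robot dataset converted from raw numpy episodes (illustrative)."""

    VERSION = tfds.core.Version("1.0.0")

    def _info(self) -> tfds.core.DatasetInfo:
        # Assumed per-step schema: an RGB image, a proprioceptive state,
        # an action vector, and a language instruction.
        return tfds.core.DatasetInfo(
            builder=self,
            description="Example dataset for language planning labeling.",
            features=tfds.features.FeaturesDict({
                "steps": tfds.features.Dataset({
                    "observation": tfds.features.FeaturesDict({
                        "image": tfds.features.Image(shape=(224, 224, 3)),
                        "state": tfds.features.Tensor(shape=(8,), dtype=np.float32),
                    }),
                    "action": tfds.features.Tensor(shape=(7,), dtype=np.float32),
                    "language_instruction": tfds.features.Text(),
                }),
            }),
        )

    def _split_generators(self, dl_manager):
        return {"train": self._generate_examples("data/raw_datasets")}

    def _generate_examples(self, path):
        # Assume each .npy file stores one episode as a list of step dicts.
        for i, episode_file in enumerate(sorted(Path(path).glob("*.npy"))):
            steps = np.load(episode_file, allow_pickle=True)
            yield f"episode_{i}", {"steps": [dict(step) for step in steps]}
```

The `tfds build` command in the next step discovers this builder and writes the resulting dataset to the directory given by `--data_dir`.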
```bash
tfds build tfds_dataset_generation/example_dataset --data_dir data/tfds_datasets
```

We will first generate step-wise language reasoning and then add some visual features in language.
First, describe the scene with the Prismatic-7B VLM.
```bash
python scripts/example_scripts/describe_scene.py --dataset_dir data/tfds_datasets
```

Then, extract the primitive moves with a horizon of 10. Verify that the interpretation of the state and action values is correct: the first six elements of the state represent translation and rotation in the world coordinate frame, and in `gripper_action`, a value of +1 indicates closing the gripper, while -1 indicates opening it.
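As a rough illustration of what primitive extraction does, the sketch below classifies a window of steps into primitive moves from the change in the state and the gripper command. The axis-to-direction mapping and the thresholds are assumptions; the actual logic lives in `scripts/example_scripts/extract_primitives.py`.

```python
import numpy as np


def classify_primitives(states, gripper_actions, start, horizon=10):
    """Label one window of `horizon` steps with primitive moves (illustrative).

    `states[:, :3]` is end-effector translation and `states[:, 3:6]` rotation,
    both in the world frame; `gripper_actions` is +1 (close) / -1 (open).
    """
    window = states[start:start + horizon]
    delta = window[-1] - window[0]

    moves = []
    # Translation primitives along each world axis
    # (the axis-to-direction mapping here is an assumption).
    directions = [("move left", "move right"),
                  ("move backward", "move forward"),
                  ("move down", "move up")]
    for axis, (neg, pos) in enumerate(directions):
        if abs(delta[axis]) > 0.02:        # assumed threshold, meters
            moves.append(pos if delta[axis] > 0 else neg)

    # Rotation primitive if any rotation component changes noticeably.
    if np.abs(delta[3:6]).max() > 0.1:     # assumed threshold, radians
        moves.append("rotate gripper")

    # Gripper primitives from the commanded gripper action (+1 close, -1 open).
    g_start, g_end = gripper_actions[start], gripper_actions[start + horizon - 1]
    if g_end > 0 >= g_start:
        moves.append("close gripper")
    elif g_end < 0 <= g_start:
        moves.append("open gripper")

    return moves or ["stop"]
```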
```bash
python scripts/example_scripts/extract_primitives.py --dataset_dir data/tfds_datasets --action_horizon 10
```

Next, query Gemini-1.5-flash for the initial reasoning. Remember to set your own Gemini API key via `GOOGLE_API_KEY=xxx`.
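For reference, a minimal example of querying gemini-1.5-flash with the `google-generativeai` package, reading the key from `GOOGLE_API_KEY` as the scripts expect; the prompt shown is a placeholder, not the repo's actual prompt.

```python
import os

import google.generativeai as genai

# The scripts read the API key from the GOOGLE_API_KEY environment variable.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-1.5-flash")

# Placeholder prompt: the real prompts presumably combine the task instruction,
# the scene description, and the extracted primitive moves for each episode.
prompt = (
    "Task: put the bowl on the plate.\n"
    "Scene: a bowl and a plate on a wooden table.\n"
    "Primitive moves: move right, move down, close gripper, ...\n"
    "Break the task into ordered subtasks and assign each step to a subtask."
)

response = model.generate_content(prompt)
print(response.text)
```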
```bash
GOOGLE_API_KEY=xxx python scripts/example_scripts/batch_generate_plan_subtasks.py --batch_size 1 --action_horizon 10 --dataset_dir data/tfds_datasets --force_regenerate
```

Finally, query Gemini-1.5-flash again to filter the initial reasoning into the final reasoning.
```bash
GOOGLE_API_KEY=xxx python scripts/example_scripts/batch_filter_plan_subtasks.py --batch_size 1 --action_horizon 10 --dataset_dir data/tfds_datasets --force_regenerate
```

Object bounding boxes and the gripper position are included as visual features in language planning, and there are different processes for simulation data and real-world data.
In simulation, we can easily obtain the segmentation mask and camera parameters.
We use the segmentation mask and the `INSTANCE_ID_TO_NAMES` dict in `env_configs.py` to get the bounding boxes for each step. The `--debug` option can be used to visualize the first episode and check that the labeling is correct.
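A plausible way to compute the boxes is to take the pixel extent of each instance id in the mask; a minimal sketch (the dict layout and output format are assumptions, the real code is in `scripts/example_scripts/generate_bboxes.py`):

```python
import numpy as np


def bboxes_from_mask(seg_mask, instance_id_to_names):
    """Derive per-object pixel bounding boxes from an instance segmentation mask.

    seg_mask: (H, W) integer array of instance ids.
    instance_id_to_names: mapping from instance id to object name, in the spirit
    of the INSTANCE_ID_TO_NAMES dict (its exact layout may differ).
    """
    bboxes = {}
    for instance_id, name in instance_id_to_names.items():
        ys, xs = np.nonzero(seg_mask == instance_id)
        if len(xs) == 0:
            continue  # object not visible in this frame
        # (x_min, y_min, x_max, y_max) in pixel coordinates
        bboxes[name] = [int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())]
    return bboxes
```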
```bash
python scripts/example_scripts/generate_bboxes.py --dataset_dir data/tfds_datasets
```

We use the `IMAGE_SIZE` and the `CAMERA_PARAS` dict in `env_configs.py` to get the gripper position for each step. The `--debug` option can be used to visualize the first episode and check that the labeling is correct.
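Obtaining the 2D gripper position amounts to projecting the 3D gripper position through the camera. Below is a sketch under the assumption that the camera parameters provide a 4x4 world-to-camera extrinsic and a 3x3 intrinsic matrix; the actual format of `CAMERA_PARAS` in `env_configs.py` may differ.

```python
import numpy as np


def project_gripper_to_pixels(gripper_pos_world, extrinsic, intrinsic, image_size):
    """Project the 3D gripper position into the camera image (illustrative).

    extrinsic: 4x4 world-to-camera transform; intrinsic: 3x3 camera matrix;
    image_size: (H, W) matching IMAGE_SIZE.
    """
    p_world = np.append(np.asarray(gripper_pos_world, dtype=np.float64), 1.0)
    p_cam = extrinsic @ p_world               # homogeneous world -> camera frame
    uvw = intrinsic @ p_cam[:3]               # camera frame -> image plane
    u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]   # perspective division
    h, w = image_size
    # Clamp to the image bounds so the label stays valid for --debug overlays.
    return int(np.clip(u, 0, w - 1)), int(np.clip(v, 0, h - 1))
```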
```bash
python scripts/example_scripts/detect_gripper_position.py --dataset_dir data/tfds_datasets
```

For real-world data, the code is coming soon.
Merge `filtered_reasoning_h10.json`, `bboxes.json`, and `gripper_positions.json` under `data/planning_datasets/example_dataset/cot` to get the complete language planning JSON file.
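Conceptually, the merge joins the three files on the episode key. A rough sketch follows; the key names inside each file and the output filename are assumptions, and the actual merge is done by `scripts/example_scripts/merge_plan_subtasks.py`.

```python
import json
from pathlib import Path

cot_dir = Path("data/planning_datasets/example_dataset/cot")

# Load the three annotation files produced by the previous steps.
reasoning = json.loads((cot_dir / "filtered_reasoning_h10.json").read_text())
bboxes = json.loads((cot_dir / "bboxes.json").read_text())
grippers = json.loads((cot_dir / "gripper_positions.json").read_text())

# Join them per episode (assumed to share episode ids as top-level keys).
merged = {}
for episode_id, episode_reasoning in reasoning.items():
    merged[episode_id] = {
        "reasoning": episode_reasoning,
        "bboxes": bboxes.get(episode_id, {}),
        "gripper_positions": grippers.get(episode_id, {}),
    }

# Hypothetical output name for the combined language planning file.
(cot_dir / "merged_planning.json").write_text(json.dumps(merged, indent=2))
```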
```bash
python scripts/example_scripts/merge_plan_subtasks.py --planning_dir data/planning_datasets/example_dataset
```

This repository is based on Embodied-CoT.