Precise failure reasoning and detection for robotic manipulation
[ICLR 2025] [Paper]
Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar*, Yijie Guo*
AHA is an open-source VLM specifically designed to detect and reason about failures in robotic manipulation through natural language. Through failure reasoning, AHA can improve the performance of robotic manipulation systems that rely on VLMs (such as Trust the PRoC3S, Manipulate-Anything, and Eureka).
git clone https://github.com/NVlabs/AHA.git
conda create -n aha python=3.10 -y
conda activate aha
pip install --upgrade pip # enable PEP 660 support
# Optional: skip this step if you prefer to use the system's built-in nvcc.
conda install -c nvidia cuda=12.1 -y
Download CoppeliaSim v4.1:
Extract it somewhere on your system and set the following environment variables (add them to your ~/.bashrc so the changes persist):
export COPPELIASIM_ROOT=/path/to/coppeliasim
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$COPPELIASIM_ROOT
export QT_QPA_PLATFORM_PLUGIN_PATH=$COPPELIASIM_ROOT
Remember to run source ~/.bashrc after adding these lines.
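As a quick sanity check (an illustrative snippet, not part of the official setup), you can confirm the variables are visible from Python:

```python
import os

# Check the CoppeliaSim environment variables exported above.
# Only the variable names from the instructions are assumed here.
root = os.environ.get("COPPELIASIM_ROOT", "")
assert root and os.path.isdir(root), "COPPELIASIM_ROOT is unset or not a directory"
assert root in os.environ.get("LD_LIBRARY_PATH", ""), "LD_LIBRARY_PATH does not include COPPELIASIM_ROOT"
assert os.environ.get("QT_QPA_PLATFORM_PLUGIN_PATH") == root, "QT_QPA_PLATFORM_PLUGIN_PATH should equal COPPELIASIM_ROOT"
print(f"CoppeliaSim found at {root}")
```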
⚠️ Warning: CoppeliaSim might cause conflicts with ROS workspaces.
Install PyRep:
git clone https://github.com/stepjam/PyRep.git
cd PyRep
pip install -r requirements.txt
pip install .
Install the RLBench fork:
git clone -b peract https://github.com/MohitShridhar/RLBench.git
python update.py
cd RLBench
pip install -r requirements.txt
pip install -e .
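Before moving on, you can verify that both PyRep and the RLBench fork resolve from Python (a minimal import check; launching an actual simulation additionally requires CoppeliaSim and a display or xvfb):

```python
# Minimal import check for the packages installed above.
from pyrep import PyRep                      # core PyRep entry point
from rlbench.environment import Environment  # RLBench environment wrapper
from rlbench.tasks import BasketballInHoop   # task used in the data generation examples below

print("Imports OK:", PyRep.__name__, Environment.__name__, BasketballInHoop.__name__)
```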
Install the rlbench-failgen package used for failure data generation:
cd aha/Data_Generation/rlbench-failgen
pip install -r requirements.txt
pip install -e .
After installing the packages, your directory structure should look like this:

- aha/
  - ...
- PyRep/
  - ...
- RLBench/
  - ...

Run the data generation commands below from the directory that contains aha/.
To generate failure data for a specific task:
python ./aha/Data_Generation/rlbench-failgen/examples/ex_custom_data_generator.py \
--task basketball_in_hoop \
--episodes 1 \
--max_tries 1 \
--savepath <Output Dir>
For headless servers:
xvfb-run -a -s "-screen 0 1400x900x24" \
python ./aha/Data_Generation/rlbench-failgen/examples/ex_custom_data_generator.py \
--task basketball_in_hoop \
--episodes 1 \
--max_tries 1 \
--savepath <Output Dir>
Generate all 79 tasks as in the paper:
bash ./aha/Data_Generation/rlbench-failgen/examples/ex_custom_data_generator.sh
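If you only need a subset of tasks, a small driver along these lines can loop over the per-task generator (a sketch: the task list beyond basketball_in_hoop and the output directory are placeholders, and the flags are the ones documented above):

```python
import subprocess

# Placeholder task list and output directory; substitute the tasks you need.
TASKS = ["basketball_in_hoop", "close_box", "put_money_in_safe"]
OUTPUT_DIR = "/path/to/output"

for task in TASKS:
    # Same flags as the single-task example above.
    subprocess.run(
        [
            "python",
            "./aha/Data_Generation/rlbench-failgen/examples/ex_custom_data_generator.py",
            "--task", task,
            "--episodes", "1",
            "--max_tries", "1",
            "--savepath", OUTPUT_DIR,
        ],
        check=True,
    )
```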
After generating all of the tasks, run the following to produce the JSON file used for instruction fine-tuning.
# Process the generated data into the required format
python ./aha/Data_Generation/rlbench-failgen/process_data.py /path/to/input_folder /path/to/output_folder
# Format the processed data into JSON for fine-tuning
python ./aha/Data_Generation/rlbench-failgen/make_json.py /path/to/processed_data --output ./aha_training.json
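To sanity-check the resulting file, you can load it and preview an example (a sketch; the key names assume the common LLaVA-style schema and may differ from what make_json.py actually emits):

```python
import json

# Load the instruction-tuning file produced by make_json.py and report basic stats.
with open("aha_training.json") as f:
    data = json.load(f)

print(f"{len(data)} training examples")

# Preview the first example; adjust the key names if the schema differs.
sample = data[0]
print("keys:", sorted(sample.keys()))
print(json.dumps(sample, indent=2)[:500])
```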
Training takes ~40 hours on 8 A100 (80 GB) GPUs.
AHA is instruction fine-tuned with the RoboPoint codebase, so set up the fine-tuning code from RoboPoint.
Setup:
git clone https://github.com/wentaoyuan/RoboPoint.git
cd RoboPoint
pip install -e .
pip install -e ".[train]" # Only needed for training
pip install flash-attn --no-build-isolation
Merge the previously generated AHA failure dataset with the co-training data.
We use pretrained projector weights from LLaVA. The projector is trained on image-text pairs from the 558K subset of the LAION-CC-SBU dataset with BLIP captions (see here). When using these projector weights, please make sure that the vision encoder and the projector type are set correctly.
For CLIP-L-336px vision encoder,
--vision_tower openai/clip-vit-large-patch14-336
For MLP-2x projector,
--mm_projector_type mlp2x_gelu
For Linear projector,
--mm_projector_type linear
| Base LLM | Vision Encoder | Projection | Pretrain Data | Download |
|---|---|---|---|---|
| Vicuna-13B-v1.5 | CLIP-L-336px | MLP-2x | LCS-558K | projector |
| Vicuna-7B-v1.5 | CLIP-L-336px | MLP-2x | LCS-558K | projector |
| LLaMA-2-13B-Chat | CLIP-L-336px | Linear | LCS-558K | projector |
| LLaMA-2-7B-Chat | CLIP-L-336px | Linear | LCS-558K | projector |
If you do not have enough GPU memory, you can reduce BATCH_PER_GPU and increase GRAD_ACC_STEPS accordingly. Always keep the global batch size the same: NUM_NODES x NUM_GPUS x BATCH_PER_GPU x GRAD_ACC_STEPS.
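For example, with the global batch size of 128 used below, halving BATCH_PER_GPU while doubling GRAD_ACC_STEPS leaves the product unchanged (the per-GPU values here are illustrative, not script defaults):

```python
# Global batch size = NUM_NODES x NUM_GPUS x BATCH_PER_GPU x GRAD_ACC_STEPS.
# Illustrative values only.
NUM_NODES, NUM_GPUS = 1, 8

assert NUM_NODES * NUM_GPUS * 16 * 1 == 128  # original setting
assert NUM_NODES * NUM_GPUS * 8 * 2 == 128   # reduced memory, same global batch size
```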
Hyperparameters used in instruction tuning are provided below.
| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
| RoboPoint-v1-13B | 128 | 2e-5 | 1 | 2048 | 0 |
# Full fine-tuning of RoboPoint with the AHA dataset using Vicuna-1.5
bash ./RoboPoint/scripts/finetune_vicuna.sh
We evaluated AHA on three test datasets:
- AHA (test)
- Maniskill FailGen data
- REFLECT
Below are the instructions to generate or obtain each dataset:
- ⚙️ AHA (test): Generate this dataset using the same dataset generation script, but with different tasks.
bash ./aha/Data_Generation/rlbench-failgen/examples/ex_data_generator_eval.sh
- 📖 Maniskill FailGen: Follow the instructions here to generate the dataset.
- 🔍 REFLECT: Sub-sample the REFLECT dataset from this source and use our annotated JSON file for evaluation.
After evaluating your trained model on the respective datasets, you can compute ROUGE-L, LLM Fuzzy, or Binary Success scores with the following commands:
python aha/evaluation/eval_metrics/LLM_fuzzy.py --gt_path /path/to/real_qa.json --res_path /path/to/your_results.json
python aha/evaluation/eval_metrics/check_answer_ROGUE.py --data_path /path/to/out_qa.json --answers_path /path/to/aha_arnold_out_final_qa_failgen_answers.json --indx_num 11291
python aha/evaluation/eval_metrics/check_answer_Yes_No.py --data_path /path/to/out_qa.json --answers_path /path/to/aha_fr_out_final_qa_failgen_answers.json --indx_num 11291
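For reference, ROUGE-L between a predicted explanation and a ground-truth answer can also be computed directly with the rouge-score package; this is only an illustration of the metric, not a substitute for check_answer_ROGUE.py:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

# Illustrative prediction/reference pair; the evaluation scripts above read
# these from the QA and answer JSON files instead.
prediction = "The gripper failed to grasp the ball before moving to the hoop."
reference = "The robot did not grasp the basketball, so the placement failed."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(reference, prediction)
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")
```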
We thank the following projects, from which parts of our code are derived:
@article{duan2024aha,
title={AHA: A vision-language-model for detecting and reasoning over failures in robotic manipulation},
author={Duan, Jiafei and Pumacay, Wilbert and Kumar, Nishanth and Wang, Yi Ru and Tian, Shulin and Yuan, Wentao and Krishna, Ranjay and Fox, Dieter and Mandlekar, Ajay and Guo, Yijie},
journal={arXiv preprint arXiv:2410.00371},
year={2024}
}