Precise failure reasoning and detection for robotic manipulation
[ICLR 2025] [Paper]
Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar*, Yijie Guo*
AHA is an open-source VLM specifically designed to detect and reason about failures in robotic manipulation through natural language. Through failure reasoning, AHA can improve the performance of robotic manipulation systems that rely on VLMs (such as Trust the PRoC3S, Manipulate-Anything, and Eureka).
git clone https://github.com/NVlabs/AHA.git
conda create -n aha python=3.10 -y
conda activate aha
pip install --upgrade pip # enable PEP 660 support
# Optional: skip this step if you prefer to use the system's built-in nvcc.
conda install -c nvidia cuda=12.1 -y
Download CoppeliaSim v4.1:
Extract it somewhere on your system and set the following environment variables (add them to your ~/.bashrc so the changes persist):
export COPPELIASIM_ROOT=/path/to/coppeliasim
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$COPPELIASIM_ROOT
export QT_QPA_PLATFORM_PLUGIN_PATH=$COPPELIASIM_ROOT
Remember to run source ~/.bashrc after adding these lines.
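As a quick sanity check (an illustrative snippet, not part of the official setup), you can confirm the variables are visible from Python:

```python
import os

# Check the CoppeliaSim environment variables exported above.
# Only the variable names from the instructions are assumed here.
root = os.environ.get("COPPELIASIM_ROOT", "")
assert root and os.path.isdir(root), "COPPELIASIM_ROOT is unset or not a directory"
assert root in os.environ.get("LD_LIBRARY_PATH", ""), "LD_LIBRARY_PATH does not include COPPELIASIM_ROOT"
assert os.environ.get("QT_QPA_PLATFORM_PLUGIN_PATH") == root, "QT_QPA_PLATFORM_PLUGIN_PATH should equal COPPELIASIM_ROOT"
print(f"CoppeliaSim found at {root}")
```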
⚠️ Warning: CoppeliaSim might cause conflicts with ROS workspaces.
Install PyRep:
git clone https://github.com/stepjam/PyRep.git
cd PyRep
pip install -r requirements.txt
pip install .
Install the RLBench fork:
git clone -b peract https://github.com/MohitShridhar/RLBench.git
python update.py
cd RLBench
pip install -r requirements.txt
pip install -e .
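Before moving on, you can verify that both PyRep and the RLBench fork resolve from Python (a minimal import check; launching an actual simulation additionally requires CoppeliaSim and a display or xvfb):

```python
# Minimal import check for the packages installed above.
from pyrep import PyRep                      # core PyRep entry point
from rlbench.environment import Environment  # RLBench environment wrapper
from rlbench.tasks import BasketballInHoop   # task used in the data generation examples below

print("Imports OK:", PyRep.__name__, Environment.__name__, BasketballInHoop.__name__)
```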
Install the rlbench-failgen package used for failure data generation:
cd aha/Data_Generation/rlbench-failgen
pip install -r requirements.txt
pip install -e .
After installing the packages, your directory structure should look like this:

- aha/
  - ...
- PyRep/
  - ...
- RLBench/
  - ...

Run the data generation commands below from the directory that contains aha/.
To generate failure data for a specific task:
python ./aha/Data_Generation/rlbench-failgen/examples/ex_custom_data_generator.py \
--task basketball_in_hoop \
--episodes 1 \
--max_tries 1 \
--savepath <Output Dir>
For headless servers:
xvfb-run -a -s "-screen 0 1400x900x24" \
python ./aha/Data_Generation/rlbench-failgen/examples/ex_custom_data_generator.py \
--task basketball_in_hoop \
--episodes 1 \
--max_tries 1 \
--savepath <Output Dir>
Generate all 79 tasks as in the paper:
bash ./aha/Data_Generation/rlbench-failgen/examples/ex_custom_data_generator.sh
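If you only need a subset of tasks, a small driver along these lines can loop over the per-task generator (a sketch: the task list beyond basketball_in_hoop and the output directory are placeholders, and the flags are the ones documented above):

```python
import subprocess

# Placeholder task list and output directory; substitute the tasks you need.
TASKS = ["basketball_in_hoop", "close_box", "put_money_in_safe"]
OUTPUT_DIR = "/path/to/output"

for task in TASKS:
    # Same flags as the single-task example above.
    subprocess.run(
        [
            "python",
            "./aha/Data_Generation/rlbench-failgen/examples/ex_custom_data_generator.py",
            "--task", task,
            "--episodes", "1",
            "--max_tries", "1",
            "--savepath", OUTPUT_DIR,
        ],
        check=True,
    )
```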
After generating all of the tasks, run the following to produce the JSON file used for instruction fine-tuning.
# Process the generated data into the required format
python ./aha/Data_Generation/rlbench-failgen/process_data.py /path/to/input_folder /path/to/output_folder
# Format the processed data into JSON for fine-tuning
python ./aha/Data_Generation/rlbench-failgen/make_json.py /path/to/processed_data --output ./aha_training.json
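To sanity-check the resulting file, you can load it and preview an example (a sketch; the key names assume the common LLaVA-style schema and may differ from what make_json.py actually emits):

```python
import json

# Load the instruction-tuning file produced by make_json.py and report basic stats.
with open("aha_training.json") as f:
    data = json.load(f)

print(f"{len(data)} training examples")

# Preview the first example; adjust the key names if the schema differs.
sample = data[0]
print("keys:", sorted(sample.keys()))
print(json.dumps(sample, indent=2)[:500])
```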
Training takes ~40 hours on 8 A100 (80 GB) GPUs.
AHA is instruction fine-tuned with the RoboPoint codebase, so set up the fine-tuning code from RoboPoint.
Setup:
git clone https://github.com/wentaoyuan/RoboPoint.git
cd RoboPoint
pip install -e .
pip install -e ".[train]" # Only needed for training
pip install flash-attn --no-build-isolation
Merge the previously generated AHA failure dataset with the co-training data.
We use pretrained projector weights from LLaVA. The projector is trained on image-text pairs from the 558K subset of the LAION-CC-SBU dataset with BLIP captions (see here). When using these projector weights, please make sure that the vision encoder and the projector type are set correctly.
For CLIP-L-336px vision encoder,
--vision_tower openai/clip-vit-large-patch14-336
For MLP-2x projector,
--mm_projector_type mlp2x_gelu
For Linear projector,
--mm_projector_type linear
| Base LLM | Vision Encoder | Projection | Pretrain Data | Download |
|---|---|---|---|---|
| Vicuna-13B-v1.5 | CLIP-L-336px | MLP-2x | LCS-558K | projector |
| Vicuna-7B-v1.5 | CLIP-L-336px | MLP-2x | LCS-558K | projector |
| LLaMA-2-13B-Chat | CLIP-L-336px | Linear | LCS-558K | projector |
| LLaMA-2-7B-Chat | CLIP-L-336px | Linear | LCS-558K | projector |
If you do not have enough GPU memory, you can reduce BATCH_PER_GPU and increase GRAD_ACC_STEPS accordingly. Always keep the global batch size the same: NUM_NODES x NUM_GPUS x BATCH_PER_GPU x GRAD_ACC_STEPS.
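For example, with the global batch size of 128 used below, halving BATCH_PER_GPU while doubling GRAD_ACC_STEPS leaves the product unchanged (the per-GPU values here are illustrative, not script defaults):

```python
# Global batch size = NUM_NODES x NUM_GPUS x BATCH_PER_GPU x GRAD_ACC_STEPS.
# Illustrative values only.
NUM_NODES, NUM_GPUS = 1, 8

assert NUM_NODES * NUM_GPUS * 16 * 1 == 128  # original setting
assert NUM_NODES * NUM_GPUS * 8 * 2 == 128   # reduced memory, same global batch size
```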
Hyperparameters used in instruction tuning are provided below.
| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
| RoboPoint-v1-13B | 128 | 2e-5 | 1 | 2048 | 0 |
# Full fine-tuning of RoboPoint with the AHA dataset using Vicuna-1.5
bash ./RoboPoint/scripts/finetune_vicuna.sh
We evaluated AHA on three test datasets:
- AHA (test)
- Maniskill FailGen data
- REFLECT
Below are the instructions to generate or obtain each dataset:
- ⚙️ AHA (test): Generate this dataset using the same dataset generation script, but with different tasks.
bash ./aha/Data_Generation/rlbench-failgen/examples/ex_data_generator_eval.sh
- 📖 Maniskill FailGen: Follow the instructions here to generate the dataset.
- 🔍 REFLECT: Sub-sample the REFLECT dataset from this source and use our annotated JSON file for evaluation.
After evaluating your trained model on the respective datasets, you can compute ROUGE-L, LLM Fuzzy, or Binary Success scores with the following commands:
python aha/evaluation/eval_metrics/LLM_fuzzy.py --gt_path /path/to/real_qa.json --res_path /path/to/your_results.json
python aha/evaluation/eval_metrics/check_answer_ROGUE.py --data_path /path/to/out_qa.json --answers_path /path/to/aha_arnold_out_final_qa_failgen_answers.json --indx_num 11291
python aha/evaluation/eval_metrics/check_answer_Yes_No.py --data_path /path/to/out_qa.json --answers_path /path/to/aha_fr_out_final_qa_failgen_answers.json --indx_num 11291
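For reference, ROUGE-L between a predicted explanation and a ground-truth answer can also be computed directly with the rouge-score package; this is only an illustration of the metric, not a substitute for check_answer_ROGUE.py:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

# Illustrative prediction/reference pair; the evaluation scripts above read
# these from the QA and answer JSON files instead.
prediction = "The gripper failed to grasp the ball before moving to the hoop."
reference = "The robot did not grasp the basketball, so the placement failed."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(reference, prediction)
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")
```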
We thank the following projects, from which parts of our code are derived:
@article{duan2024aha,
title={AHA: A vision-language-model for detecting and reasoning over failures in robotic manipulation},
author={Duan, Jiafei and Pumacay, Wilbert and Kumar, Nishanth and Wang, Yi Ru and Tian, Shulin and Yuan, Wentao and Krishna, Ranjay and Fox, Dieter and Mandlekar, Ajay and Guo, Yijie},
journal={arXiv preprint arXiv:2410.00371},
year={2024}
}