02 CMU · MMML
Fine-tuning LLaVA for Web Agents
2024
Lede
Group project at CMU. We took LLaVA, fine-tuned it on VisualWebBench, and pushed the open-model score up.
Detail
MMML capstone at CMU. The starting point was a baseline LLaVA checkpoint that performed poorly on web-grounded tasks. Most of the work was unglamorous: cleaning and augmenting the VisualWebBench training data, picking a LoRA configuration that fit on a single A100, and running enough ablations to know which screenshots actually moved the metric. The interesting finding was that visual cues mattered less than we expected, and bounding-box hints mattered more.
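A rough sketch of why a LoRA configuration fits on a single A100 when full fine-tuning would not: LoRA freezes each weight matrix W and trains only a low-rank update B·A, so the trainable parameter count scales with the rank r rather than the full matrix size. The dimensions and rank below are illustrative assumptions, not the project's actual LLaVA shapes or chosen configuration.

```python
# Minimal sketch of LoRA's parameter savings, assuming a square
# d x d weight matrix and a rank-r adapter (both values hypothetical).

def full_param_count(d_in: int, d_out: int) -> int:
    """Trainable parameters if the weight matrix W itself is updated."""
    return d_in * d_out

def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA adapter:
    A is (r x d_in) and B is (d_out x r), so r * (d_in + d_out)."""
    return r * (d_in + d_out)

if __name__ == "__main__":
    d = 4096  # assumed hidden size, typical for a 7B-class model
    r = 16    # assumed LoRA rank; the real value was picked by ablation
    full = full_param_count(d, d)
    lora = lora_param_count(d, d, r)
    # For one 4096x4096 matrix: 16,777,216 vs 131,072 trainable
    # parameters, i.e. the adapter is under 1% of the full update.
    print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

The same per-matrix saving applies across every attention and projection layer the adapter targets, which is what keeps optimizer state and gradients within a single GPU's memory.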
Stack
- LLaVA
- LoRA
- PyTorch
- VisualWebBench