instructionaugmentation.github.io - DIAL


In recent years, much progress has been made in learning robotic manipulation policies that follow natural language instructions. Such methods typically learn from corpora of robot-language data that were either collected with specific tasks in mind or expensively re-labelled by humans with rich language descriptions in hindsight. Recently, large-scale pretrained vision-language models (VLMs) like CLIP or ViLD have been applied to robotics for learning representations and scene descriptors. Can these pretrained models serve as automatic labelers for robot data?

DIAL consists of three stages: (1) finetuning a VLM’s vision and language representations on a small offline dataset of trajectories with crowdsourced episode-level natural-language descriptions, (2) generating alternative instructions for a larger offline dataset of trajectories with the VLM, and (3) learning a language-conditioned policy via behavior cloning on this augmented offline data.
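The three stages above can be sketched as a minimal pipeline. This is a toy illustration, not DIAL's implementation: `Episode`, `finetune_scorer`, `relabel`, and `build_training_set` are hypothetical names, frames are text strings standing in for camera images, and the word-overlap scorer stands in for a finetuned VLM's image-text similarity.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    frames: list          # placeholder observations; real data would be images
    instruction: str = "" # crowdsourced label, empty when unlabeled

def finetune_scorer(labeled):
    # Stage (1): in DIAL this is contrastive finetuning of CLIP on the
    # human-labeled episodes; this toy stand-in returns a word-overlap score.
    def score(frames, instruction):
        words = set(instruction.split())
        seen = {w for f in frames for w in f.split()}
        return len(words & seen) / max(len(words), 1)
    return score

def relabel(unlabeled, candidates, score):
    # Stage (2): attach the best-scoring candidate instruction to each
    # unlabeled episode.
    return [Episode(ep.frames,
                    max(candidates, key=lambda c: score(ep.frames, c)))
            for ep in unlabeled]

def build_training_set(labeled, relabeled):
    # Stage (3): behavior cloning would train a language-conditioned policy
    # on the merged data; here we only merge the two sets.
    return labeled + relabeled

# Hypothetical toy data.
labeled = [Episode(["robot grasps the red block"], "pick up the red block")]
unlabeled = [Episode(["robot pushes the green bowl across the table"])]
candidates = ["push the green bowl", "pick up the red block"]

score = finetune_scorer(labeled)
relabeled = relabel(unlabeled, candidates, score)
training_set = build_training_set(labeled, relabeled)
```

The point of the sketch is the data flow: a small human-labeled set trains the scorer, the scorer propagates labels to the unlabeled set, and the policy trains on the union.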

After finetuning CLIP on the portion of the training dataset that contains crowdsourced language instructions, we can automatically label the rest of the dataset without any additional human effort. In our setting, we finetune CLIP on crowdsourced human labels for 2,800 demonstrations out of a total training dataset of 80,000 teleoperated demonstrations (roughly 3.5%).
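The relabeling step reduces to a similarity ranking: CLIP embeds the episode's frames and each candidate instruction, and the highest-cosine candidates become the new labels. A minimal sketch, with hypothetical hand-made vectors standing in for the finetuned encoders' outputs:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pick_labels(image_emb, text_bank, k=1):
    # Rank every candidate instruction by its alignment with the episode's
    # image embedding and keep the top-k as new labels.
    ranked = sorted(text_bank,
                    key=lambda c: cosine(image_emb, text_bank[c]),
                    reverse=True)
    return ranked[:k]

# Hypothetical toy embeddings standing in for CLIP encoder outputs.
image_emb = [0.9, 0.1, 0.0]
text_bank = {
    "pick up the sponge": [1.0, 0.0, 0.0],
    "open the drawer":    [0.0, 1.0, 0.0],
}
```

Because scoring every (episode, candidate) pair is a batch of dot products over precomputed embeddings, labeling the remaining ~77,000 demonstrations costs only inference, not human time.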
