sharegpt4v.github.io - ShareGPT4V

Description: ShareGPT4V: Improving Large Multi-Modal Models with Better Captions


Example domain paragraphs

In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data. To address this bottleneck, we introduce the ShareGPT4V dataset, a pioneering large-scale resource featuring 1.2 million highly descriptive captions, which surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations. Specifically, ShareGPT4V originates from a curated set of 100K high-quality captions collected from GPT4-Vision and is expanded to 1.2 million captions with a caption model trained on that subset.
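As a concrete illustration of how such a caption resource might be consumed, here is a minimal sketch that loads and inspects one record. It assumes the captions ship as a JSON file in a LLaVA-style "conversations" format; the filename and field names are assumptions for illustration, not the project's confirmed schema.

```python
import json

# Hypothetical filename; verify the actual file layout against the ShareGPT4V release.
with open("sharegpt4v_captions.json", "r", encoding="utf-8") as f:
    records = json.load(f)

sample = records[0]
print(sample["image"])                    # assumed key: relative path of the source image
for turn in sample["conversations"]:      # assumed key: alternating human / gpt turns
    print(turn["from"], ":", turn["value"][:120])
```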

Comparison of widely used caption datasets and ShareGPT4V. 'LCS' abbreviates the LAION, CC, and SBU datasets. The 'Visible' column denotes whether the image is visible during captioning, and the 'Avg.' column shows the average number of characters per caption.
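For clarity on how the 'Avg.' column is computed, a small sketch is shown below; the helper name and the sample captions are illustrative only.

```python
def average_caption_chars(captions):
    """Average number of characters per caption, as reported in the 'Avg.' column."""
    return sum(len(c) for c in captions) / len(captions)

captions = [
    "A red double-decker bus crosses a rain-slicked bridge at dusk.",
    "Three hikers rest beside a turquoise alpine lake below jagged peaks.",
]
print(f"Avg. characters: {average_caption_chars(captions):.1f}")
```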

We illustrate the procedure for collecting highly descriptive captions from GPT4-Vision using various image sources and data-specific prompts, resulting in 100K high-quality captions that encapsulate a wide array of information conveyed by the images; a sketch of one such captioning call is shown below. We also compare captions from our proposed ShareGPT4V dataset with those utilized by recent large multi-modal models (LMMs).
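The following is a minimal sketch of a single captioning call, assuming the OpenAI chat-completions API with image inputs. The prompt wording, model name, and helper function are illustrative assumptions, not the authors' exact data-specific prompts or pipeline.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def caption_image(image_url: str, source_hint: str) -> str:
    # Illustrative prompt; the real pipeline tailors prompts to each image source.
    prompt = (
        "Describe this image in detail, covering world knowledge, object "
        f"properties, spatial relationships, and aesthetics. Source: {source_hint}."
    )
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=512,
    )
    return response.choices[0].message.content

# Example usage:
# print(caption_image("https://example.com/photo.jpg", "web photograph"))
```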
