codet-ovd.github.io - CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection

Description: Use co-occurrence to discover region-word pairs from image-text pairs for open-vocabulary object detection pre-training.


Example domain paragraphs

Previous studies typically rely on region-text similarity to discover pseudo region-text pairs from image-text pairs. The tricky part is that accurate similarity estimation needs a region-level aligned vision-language space, which in turn requires abundant region-text pairs to train. To break this chicken-and-egg problem, we introduce co-occurrence-based region-word alignment, which relies solely on region-region similarity to discover pseudo region-text pairs. A nice property of co-occurrence
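The idea of using only region-region similarity can be illustrated with a small sketch. This is not the CoDet implementation. It assumes a group of images whose captions share one word, with region features already extracted for each image, and it scores each region by how well it matches some region in every other image of the group. The function name `pick_cooccurring_regions`, the sum-of-max cosine scoring rule, and the input layout are all hypothetical choices made for illustration.

```python
import numpy as np


def pick_cooccurring_regions(region_feats):
    """Pick, per image, the region that best recurs across the group.

    region_feats: list of (n_i, d) arrays, one per image. All images are
    assumed to have captions that mention the same word (hypothetical setup).
    Returns a list with one region index per image.
    """
    # L2-normalize so that dot products are cosine similarities.
    normed = [f / np.linalg.norm(f, axis=1, keepdims=True) for f in region_feats]

    picks = []
    for i, query in enumerate(normed):
        score = np.zeros(len(query))
        for j, support in enumerate(normed):
            if i == j:
                continue
            # For each region of image i, take its best match in image j.
            score += (query @ support.T).max(axis=1)
        # The region with the highest total is the one that co-occurs most
        # consistently, so it becomes the pseudo match for the shared word.
        picks.append(int(score.argmax()))
    return picks
```

No text embeddings appear anywhere in the sketch: the shared word only decides which images are grouped together, and the selected regions are then paired with that word as pseudo region-word labels.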

Deriving reliable region-word alignment from image-text pairs is critical to learn object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which are prone to limitations in localization accuracy or generalization capabilities. In this paper, we propose CoDet, a novel approach that overcomes the reliance on pre-aligned vision-language space by reformulating region-word alignment as
