Description: Evaluation data has been compromised! A workshop on detecting, preventing, and addressing data contamination.
Workshop @ ACL 2024
Data contamination, where evaluation data is inadvertently included in the pre-training corpora of large-scale models, and language models (LMs) in particular, has recently become a concern (Sainz et al., 2023; Jacovi et al., 2023). The growing scale of both models and data, coupled with massive web crawling, has led to the inclusion of segments from evaluation benchmarks in the pre-training data of LMs (Dodge et al., 2021; OpenAI, 2023; Google, 2023; Elazar et al., 2023). The scale of internet dat