ds1000-code-gen.github.io - DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

Description: DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

data-science (1571) codex (122) code-generation (41) semantic-parsing (2) ds-1000 (1)

Example domain paragraphs

We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior works, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable) – across all Codex-002-predicted solutions that our evaluation accepts, only 1.8% of them is incorrect; we achieve this

DS-1000 contains 1000 problems originating from 451 unique StackOverflow problems. To defend against potential memoriza- tion, more than half of the DS-1000 problems are modified from the original StackOverflow problems; they include 152 surface perturbations, 235 semantic perturbations, and 162 difficult rewrites.

Below are more DS-1000 examples. For each example, The model needs to fill in the code into “[insert]” in the prompt on the left; the code will then be executed to pass the multi-criteria automatic evaluation, which includes the test cases and the surface form constraints; a reference solution is provided at the bottom left.

Links to ds1000-code-gen.github.io (9)