dataset-finder.netlify.app - Dataset finder

Example domain paragraphs

At TaskMaster.Info, Karl Craven is “obsessively documenting the international Taskmaster franchise,” which began as a British game show on which comedians compete to win challenges such as watermelon speed-eating and high-fiving strangers. Reddit user Alohamori has used the site and other sources to create a “ridiculously comprehensive” database of that information, enabling queries such as the fastest-completed tasks, tasks awarding zero points, and episodes ending in ties. Bonus link: Taskmaster’s officia

The website orcas.pt publishes monthly, downloadable maps indicating the date, time, and location of orca sightings and attacks off the coasts of Portugal and Spain. Run by Rui Alves as a personal project, the site gathers its data through a network of local sailors. Related: The Cruising Association, in collaboration with Grupo de Trabajo Orca Atlántica, publishes maps and detailed reports of orca interactions, including “uneventful passages.” [h/t Soph Warnes]

As part of its SafeDocs project, DARPA has compiled a corpus of “nearly 8 million PDFs gathered from across the web in July/August of 2021.” To create it, the authors began with the URLs of PDF files identified by Common Crawl (DIP 2021.04.21), fetched their complete contents, and recorded metadata about each file and where it was found. “At the time of its creation, this is the largest single corpus of real-world (extant) PDFs that is publicly available,” they write.
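The "record metadata about each file and where it was found" step can be pictured with a small sketch. This is not the SafeDocs pipeline: the `PdfRecord` fields and the `describe_pdf` helper are hypothetical names chosen for illustration, and the example assumes the file's bytes have already been fetched.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PdfRecord:
    """Hypothetical per-file metadata; the field names are assumptions."""
    url: str              # where the file was found
    size_bytes: int       # length of the fetched content
    sha256: str           # content hash, useful for spotting duplicates
    looks_like_pdf: bool  # True if a "%PDF-" marker appears near the start


def describe_pdf(url: str, content: bytes) -> PdfRecord:
    """Build a metadata record for already-fetched bytes."""
    return PdfRecord(
        url=url,
        size_bytes=len(content),
        sha256=hashlib.sha256(content).hexdigest(),
        # Real-world files sometimes carry junk before the header,
        # so search the first 1024 bytes rather than only byte 0.
        looks_like_pdf=b"%PDF-" in content[:1024],
    )


if __name__ == "__main__":
    record = describe_pdf("https://example.org/report.pdf", b"%PDF-1.7\n%%EOF\n")
    print(record)
```

Hashing the content is one plausible way to detect the same file served from several URLs; whether the actual corpus does this is not stated in the description above.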