internetarcheology.com - Internet Archaeology: Scraping time series data from Archive.org | sangaline.com

Description: A guide to scraping historical snapshots of webpages from the Archive.org Wayback Machine.


Skip to the Wayback Machine Scraper GitHub repo if you’re just looking for the completed command-line utility or the Scrapy middleware. This article focuses on how the middleware was developed and on an interesting use case: looking at time series data from Reddit.

The Archive.org Wayback Machine is pretty awe-inspiring. It’s been archiving web pages since 1996 and has amassed 284 billion page captures and over 15 petabytes of raw data. Many of these captures are of sites that are no longer online, whose content would otherwise have been lost to time. For sites that are still around, it can be absolutely fascinating to watch how they’ve evolved over the years.

Take Reddit, for instance. You can go back in time and watch it grow from this…