‘Internet archiving’ directory

Archiving the Web, because nothing lasts forever: statistics, online archive services, extracting URLs automatically from browsers, and creating a daemon to regularly back up URLs to multiple sources.

Links on the Internet last forever or a year, whichever inconveniences you more. This is a major problem for anyone serious about writing with good references, as link rot will cripple several percent of all links each year, and the losses compound.
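To see why the compounding matters: if a constant fraction r of links dies each year, the fraction still reachable after t years is (1 − r)^t, so even a modest rate halves a bibliography within a decade or two. (The 5%/year rate below is purely illustrative, not a measured figure.)

    % Link survival under a constant annual rot rate r (illustrative):
    \[ \text{alive}(t) = (1 - r)^{t} \]
    % Half-life of a link collection at an assumed r = 0.05:
    \[ t_{1/2} = \frac{\ln 0.5}{\ln(1 - r)} = \frac{\ln 0.5}{\ln 0.95} \approx 13.5 \text{ years} \]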

To deal with link rot, I present my multi-pronged archival strategy using a combination of scripts, daemons, and Internet archival services: URLs are regularly dumped from both my web browser’s daily browsing and my website pages into an archival daemon I wrote, which pre-emptively downloads copies locally and attempts to archive them in the Internet Archive. This ensures a copy will be available indefinitely from one of several sources. Link rot is then detected by regular runs of linkchecker, and any newly dead links can be immediately checked for alternative locations, or restored from one of the archive sources.
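A minimal sketch of that loop, assuming Firefox (whose history lives in the places.sqlite database) and the Internet Archive’s public “Save Page Now” endpoint; the profile path, cutoff, and throttle are illustrative, and this is not the actual daemon:

    #!/usr/bin/env bash
    # Sketch of the browser-dump -> archive loop (illustrative only).
    set -euo pipefail

    ARCHIVE_DIR=~/www-archive             # illustrative local mirror directory

    # Firefox locks places.sqlite while running, so query a copy of it.
    cp ~/.mozilla/firefox/*.default*/places.sqlite /tmp/places.sqlite

    # Select URLs visited in the last 24h (last_visit_date is in microseconds).
    CUTOFF=$(( ($(date +%s) - 86400) * 1000000 ))
    sqlite3 /tmp/places.sqlite \
      "SELECT url FROM moz_places WHERE last_visit_date > $CUTOFF;" |
    while read -r url; do
        # Ask the Internet Archive for a snapshot ("Save Page Now").
        curl --silent "https://web.archive.org/save/$url" > /dev/null || true
        # Keep a local copy too, with page requisites for offline viewing.
        wget --quiet --page-requisites --convert-links \
             --directory-prefix="$ARCHIVE_DIR" "$url" || true
        sleep 30                          # throttle to be polite to servers
    done

Rot in already-published pages is then caught by periodic linkchecker runs over the site, and any dead URL can be looked up in the local mirror or the Archive.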

As an additional flourish, my local archives are efficiently cryptographically timestamped using Bitcoin in case forgery is a concern, and I demonstrate a simple compression trick for substantially reducing the size of large web archives such as crawls (particularly useful for repeated crawls such as my DNM archives).
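The compression trick works by ordering files so that near-identical files (successive snapshots of the same page) sit adjacent in the tarball, letting the compressor’s window exploit the redundancy across crawls. A sketch with GNU find/sort/tar/xz, assuming an illustrative layout of crawl-DATE/path/to/page:

    # Sort on the path *after* the crawl-date directory (field 2 onward),
    # so successive snapshots of the same page end up adjacent, then let
    # xz squeeze out the cross-crawl redundancy.
    find crawl-*/ -type f \
      | sort --field-separator="/" --key=2 \
      | tar --create --no-recursion --files-from=- --file=- \
      | xz -9 > crawls.tar.xz

For the timestamping itself, the OpenTimestamps client is one option: its stamp command (ots stamp FILE) commits a hash of the file to the Bitcoin blockchain and leaves a small FILE.ots proof alongside it.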

See Also

Gwern

“Internet Search Case Studies”, Gwern 2019

“Design Graveyard”, Gwern 2010

“Research Bounties On Fulltexts”, Gwern 2018

“Internet Search Tips”, Gwern 2018

“Design Of This Website”, Gwern 2010

“Archiving URLs”, Gwern 2011

“The sort --key Trick”, Gwern 2014

“Darknet Market Archives (2013–2015)”, Gwern 2013

“Predicting Google Closures”, Gwern 2013

“Easy Cryptographic Timestamping of Files”, Gwern 2015

“Writing a Wikipedia Link Archive Bot”, Gwern 2008

“Archiving GitHub”, Gwern 2011

“Writing a Wikipedia RSS Link Archive Bot”, Gwern 2009

“Resilient Haskell Software”, Gwern 2008

Miscellaneous