‘Internet archiving’ directory

Archiving the Web, because nothing lasts forever: statistics, online archive services, extracting URLs automatically from browsers, and creating a daemon to regularly back up URLs to multiple sources.

Links on the Internet last forever or a year, whichever inconveniences you more. This is a major problem for anyone serious about writing with good references, as link rot will cripple several percent of all links each year, and the losses compound.
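To make the compounding concrete, here is a minimal sketch. It assumes a constant annual rot rate, and the 3% figure is an illustrative stand-in for "several percent", not a measured value; the function name `surviving_fraction` is likewise hypothetical.

```python
def surviving_fraction(annual_rot_rate: float, years: int) -> float:
    """Fraction of links still alive after `years`, assuming each year
    kills the same proportion of whatever links remain."""
    if not 0.0 <= annual_rot_rate <= 1.0:
        raise ValueError("annual_rot_rate must be between 0 and 1")
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 - annual_rot_rate) ** years


# Illustrative assumption: 3% of remaining links die each year.
for years in (1, 5, 10, 20):
    alive = surviving_fraction(0.03, years)
    print(f"after {years:2d} years: {alive:.1%} of links still work")
```

Under that assumed rate, roughly a quarter of all links are gone after a decade, which is why a page with many references decays faster than any single link suggests.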

To deal with link rot, I present my multi-pronged archival strategy using a combination of scripts, daemons, and Internet archival services: URLs are regularly dumped from both my web browser’s daily browsing and my website pages into an archival daemon I wrote, which pre-emptively downloads copies locally and attempts to archive them in the Internet Archive. This ensures a copy will be available indefinitely from one of several sources. Link rot is then detected by regular runs of linkchecker, and any newly dead links can be immediately checked for alternative locations, or restored from one of the archive sources.
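The submission step of such a pipeline might look roughly like the sketch below. It is not the daemon described above: the helper names (`normalize`, `save_request_url`, `archive`) and the User-Agent string are hypothetical, and it assumes the Wayback Machine's public `https://web.archive.org/save/` endpoint accepts a plain GET for the target URL.

```python
import urllib.request
from urllib.parse import urlsplit

# Assumption: the Wayback Machine's "Save Page Now" URL prefix.
SAVE_PREFIX = "https://web.archive.org/save/"


def normalize(urls):
    """Strip whitespace, drop non-HTTP(S) entries and duplicates,
    and keep the first-seen order."""
    seen = set()
    cleaned = []
    for raw in urls:
        url = raw.strip()
        if urlsplit(url).scheme not in ("http", "https"):
            continue  # skips blank lines, ftp://, file://, etc.
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


def save_request_url(url):
    """Build the URL that asks the Wayback Machine to capture `url`."""
    return SAVE_PREFIX + url


def archive(url, timeout=60):
    """Request a capture and return the HTTP status code.
    Performs network I/O; urllib raises HTTPError on 4xx/5xx responses."""
    request = urllib.request.Request(
        save_request_url(url),
        headers={"User-Agent": "archiver-sketch/0.1"},  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A caller would feed the dumped URL list through `normalize` and then call `archive` on each entry, ideally with a delay between requests and a retry queue for failures, since the remote service may rate-limit or be temporarily unavailable.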

As an additional flourish, my local archives are efficiently cryptographically timestamped using Bitcoin in case forgery is a concern, and I demonstrate a simple compression trick for substantially reducing sizes of large web archives such as crawls (particularly useful for repeated crawls such as my DNM archives).
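Hash-based timestamping commits to a digest of a file rather than to the file itself, so the first step is always computing that digest. The sketch below shows only that step, under the assumption that SHA-256 is the hash in use; the function name `file_digest` is hypothetical, and how the digest is then anchored in Bitcoin is outside its scope.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`,
    reading in chunks so large archives need not fit in memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Because any later change to the archived file changes the digest, a timestamp on the digest is evidence that exactly this content existed at that time.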

