‘Internet archiving’ directory
Archiving the Web, because nothing lasts forever: statistics, online archive services, extracting URLs automatically from browsers, and creating a daemon to regularly back up URLs to multiple sources.
Links on the Internet last forever or a year, whichever inconveniences you more. This is a major problem for anyone serious about writing with good references, as link rot cripples several percent of all links each year, and the losses compound.
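As a rough illustration of how quickly compounding losses add up, the sketch below computes the fraction of links expected to survive a given number of years. It assumes a constant, independent annual loss rate; the 3% figure is only an illustrative stand-in for "several percent", not a measured value, and the function names are hypothetical.

```python
import math


def surviving_fraction(annual_rot_rate: float, years: int) -> float:
    """Fraction of links still alive after `years`.

    Assumes every link dies independently at the same constant annual
    rate, which is a simplification of real link rot.
    """
    if not 0.0 <= annual_rot_rate < 1.0:
        raise ValueError("annual_rot_rate must be in [0, 1)")
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 - annual_rot_rate) ** years


def half_life_years(annual_rot_rate: float) -> float:
    """Years until half of the links are expected to be dead."""
    if not 0.0 < annual_rot_rate < 1.0:
        raise ValueError("annual_rot_rate must be in (0, 1)")
    return math.log(0.5) / math.log(1.0 - annual_rot_rate)


if __name__ == "__main__":
    rate = 0.03  # assumed illustrative rate, not a figure from the text
    for years in (1, 5, 10, 20):
        print(years, round(surviving_fraction(rate, years), 3))
    print("half-life (years):", round(half_life_years(rate), 1))
```

Under this assumed rate, roughly a quarter of references would be dead within a decade, which is why unarchived citations degrade noticeably over the lifetime of a long-lived page.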
To deal with link rot, I present my multi-pronged archival strategy using a combination of scripts, daemons, and Internet archival services: URLs are regularly dumped from both my web browser’s daily browsing and my website pages into an archival daemon I wrote, which pre-emptively downloads copies locally and attempts to archive them in the Internet Archive. This ensures a copy will be available indefinitely from one of several sources. Link rot is then detected by regular runs of linkchecker, and any newly dead links can be immediately checked for alternative locations, or restored from one of the archive sources.
As an additional flourish, my local archives are efficiently cryptographically timestamped using Bitcoin in case forgery is a concern, and I demonstrate a simple compression trick for substantially reducing sizes of large web archives such as crawls (particularly useful for repeated crawls such as my DNM archives).
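The first stage of the pipeline described above (collecting URLs and handing them to an archive service) might be sketched as follows. This is not the daemon itself: the function names are hypothetical, the filtering rules are assumptions, and it relies on the Wayback Machine's public `https://web.archive.org/save/<url>` endpoint, which may rate-limit or reject requests.

```python
import urllib.request
from urllib.parse import urlsplit

# Public "Save Page Now" prefix of the Internet Archive's Wayback Machine.
WAYBACK_SAVE_PREFIX = "https://web.archive.org/save/"


def normalize(urls):
    """Keep only http(s) URLs, drop duplicates, preserve first-seen order."""
    seen = set()
    kept = []
    for raw in urls:
        url = raw.strip()
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # skip mailto:, file:, empty lines, and similar
        if url in seen:
            continue
        seen.add(url)
        kept.append(url)
    return kept


def wayback_save_url(url):
    """Build the request URL that asks the Wayback Machine to capture `url`."""
    return WAYBACK_SAVE_PREFIX + url


def archive_plan(urls):
    """Pair each cleaned URL with the save request that would archive it."""
    return [(url, wayback_save_url(url)) for url in normalize(urls)]


def submit(url, timeout=30):
    """Send one save request and return the HTTP status code.

    Performs real network I/O; a long-running daemon would also need
    retries, delays between requests, and a local download step.
    """
    request = urllib.request.Request(
        wayback_save_url(url),
        headers={"User-Agent": "archive-sketch/0.1"},  # hypothetical agent string
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating `archive_plan` from `submit` keeps the URL handling testable offline; only `submit` touches the network.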
Gwern
- “Internet Search Case Studies”, Gwern 2019
- “Design Graveyard”, Gwern 2010
- “Research Bounties On Fulltexts”, Gwern 2018
- “Internet Search Tips”, Gwern 2018
- “Design Of This Website”, Gwern 2010
- “Archiving URLs”, Gwern 2011
- “The sort --key Trick”, Gwern 2014
- “Darknet Market Archives (2013–2015)”, Gwern 2013
- “Predicting Google Closures”, Gwern 2013
- “Easy Cryptographic Timestamping of Files”, Gwern 2015
- “Writing a Wikipedia Link Archive Bot”, Gwern 2008
- “Archiving GitHub”, Gwern 2011
- “Writing a Wikipedia RSS Link Archive Bot”, Gwern 2009
- “Resilient Haskell Software”, Gwern 2008
Links
- “Visualizing All Books of the World in ISBN-Space”
- “How Do Archivists Package Things? The Battle of the Boxes”
- “HUGE Google Search Document Leak Reveals Inner Workings of Ranking Algorithm: The Documents Reveal How Google Search Is Using, or Has Used, Clicks, Links, Content, Entities, Chrome Data and More for Ranking.”, Goodwin 2024
- “Insights from a Laboratory Fire”, Jones et al 2023
- “Introducing A Dark Web Archival Framework”, Brunelle et al 2021
- “Gscan2pdf: A GUI to Produce PDFs from Scanned Documents”, Ratcliffe 2019
- “When Nothing Ever Goes Out of Print: Maintaining Backlist Ebooks”, Elsey 2016
- “Memory and the Construction of Scientific Meaning: Michael Faraday’s Use of Notebooks and Records”, Tweney & Ayala 2015
- “Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot”, Klein et al 2014
- “Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations”, Zittrain & Albert 2013
- “The Prevalence and Inaccessibility of Internet References in the Biomedical Literature at the Time of Publication”, Aronsky et al 2007
- “More Product, Less Process: Revamping Traditional Archival Processing”, Greene & Meissner 2005
- “How Large Is the World Wide Web?”, Dobra & Fienberg 2004
- “The Little Engines That Could: Modeling the Performance of World Wide Web Search Engines”, Bradlow & Schmittlein 2000
- Unforgotten Dreams: Poems by the Zen Monk Shōtetsu, Shōtetsu & Carter 1997
- “Space Jam Homepage”
- “Faraday’s Notebooks: the Active Organization of Creative Science”, Tweney 1991
- “The Other Pínakes and Reference Works of Callimachus”, Witty 1973
- “The Pínakes of Callimachus”, Witty 1958
- “How Archives Can Make—Or Break—A Philosopher’s Reputation”
- “The Backrooms of the Internet Archive”
- “The Original WWW Proposal Is a Word for Macintosh 4.0 File from 1990, Can We Open It?”
- “The Old Family Photos Project: Lessons in Creating Family Photos That People Want to Keep”, Schindler 2025
- “SingleFile”, Lormeau 2025
- “Century-Scale Storage: If You Had to Store Something for 100 Years, How Would You Do It?”, Neely-Cohen 2025
- “A Lunar Library”
- “2024 Guide on Removing DRM from Kobo & Kindle Ebooks”
- “Internet Archive Hacked, Data Breach Impacts 31 Million Users”
- “The Forgotten Pixel Art Masterpieces of the PlayStation 1 Era by Richmond Lee”
- “Policymakers Don’t Have Access to Paywalled Articles”
- “To Preserve Their Work—And Drafts of History—Journalists Take Archiving into Their Own Hands”
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
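One plausible reading of this ordering is a greedy nearest-neighbor chain: start from the newest annotation and repeatedly step to the most similar annotation not yet visited. The sketch below assumes exactly that; the function names, the use of cosine similarity, and the convention that index 0 is the newest annotation are all assumptions for illustration, and the site's actual implementation (including its clustering and auto-labeling steps) may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbor_order(embeddings):
    """Return indices ordered as a greedy nearest-neighbor chain.

    `embeddings` is a list of vectors, one per annotation, with index 0
    assumed to be the newest annotation. Ties go to the lowest index.
    """
    if not embeddings:
        return []
    order = [0]
    remaining = set(range(1, len(embeddings)))
    while remaining:
        current = embeddings[order[-1]]
        # sorted() makes tie-breaking deterministic: max() keeps the first best.
        best = max(
            sorted(remaining),
            key=lambda i: cosine_similarity(current, embeddings[i]),
        )
        order.append(best)
        remaining.remove(best)
    return order


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real ones would have many more dimensions.
    toy = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
    print(nearest_neighbor_order(toy))
```

A greedy chain is quadratic in the number of annotations, which is acceptable for a single directory page but would call for an approximate nearest-neighbor index on much larger collections.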
Wikipedia
Miscellaneous
- /doc/cs/linkrot/archiving/2020-03-03-meganwarnock-picardfacepalmcartoon.jpg
- /doc/cs/linkrot/archiving/2019-gwern-internetarchive-domainsearch-screenshot.png
- /doc/cs/linkrot/archiving/2011-muflax-backup.pdf
- /doc/cs/linkrot/archiving/2006-alperin-webcitetechnicalguide.pdf
- /doc/cs/linkrot/archiving/gwern-googlescholar-search-highlightfulltextlink-thumbnail.jpg
- /doc/cs/linkrot/archiving/gildaslormeau-singlefile-archivingtutorialanimation.mp4
- https://annamancini.substack.com/p/how-the-apple-archive-ended-up-at
- https://blog.gingerbeardman.com/2023/05/24/ordering-photocopies-from-japans-national-library/
- https://file770.com/judge-decides-against-internet-archive/
- https://github.com/Kneesnap/onstream-data-recovery/blob/main/info/INTRO.MD
- https://michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/
- https://placesjournal.org/article/the-filing-cabinet-and-20th-century-information-infrastructure/
- https://www.atlasobscura.com/articles/bbc-missing-horror-show
- https://www.historytoday.com/archive/missing-pieces/lost-movies
- https://www.johndcook.com/blog/2024/03/03/archiving-data-on-paper/
Bibliography
- https://github.com/gildas-lormeau/SingleFile/: “SingleFile”