Archiving link posts?
post by Said Achmiz (SaidAchmiz)
score: 56 (19 votes) ·
Link rot is a huge problem. At the same time, many posts on Less Wrong—including some of the most important posts, which talk about important concepts or otherwise advance our collective knowledge and understanding—are link posts, which means that a non-trivial chunk of our content is hosted elsewhere—across a myriad other websites.
If Less Wrong means to be a repository of the rationality community’s canon, we must take seriously the fact that (as gwern’s research indicates) many or most of those externally-hosted pages will, in a few years, no longer be accessible.
I’ve taken the liberty of putting together a quick-and-dirty solution. This is a page that, when loaded, scrapes the external links (i.e., the link-post targets) from the front page of GreaterWrong, and automatically submits them to archive.is (after checking each link to see whether it’s already been submitted). A cronjob that loads the page daily ensures that as new link-posts are posted, they will automatically be captured and submitted to archive.is.
This solution does not currently have any way to scrape and submit links older than those which are on the front page today (2018-09-08). It is also not especially elegant.
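For anyone who wants to reproduce the scraping step described above, here is a rough sketch in Python. It is not the code behind the page mentioned in this post. It assumes that any anchor pointing to a different host than the front page counts as a link-post target (a real scraper would likely need a more specific selector), and that the "already submitted" check can be modelled as a lookup in a set of known URLs. The names `ExternalLinkCollector`, `external_links`, and `links_to_submit` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ExternalLinkCollector(HTMLParser):
    """Collects absolute http(s) links whose host differs from the page's host."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.base_host = urlparse(base_url).netloc
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL before comparing hosts.
        absolute = urljoin(self.base_url, href)
        parsed = urlparse(absolute)
        is_web_link = parsed.scheme in ("http", "https")
        if is_web_link and parsed.netloc != self.base_host:
            if absolute not in self.links:  # keep first occurrence only
                self.links.append(absolute)


def external_links(html, base_url):
    """Return the de-duplicated external links found in an HTML string."""
    collector = ExternalLinkCollector(base_url)
    collector.feed(html)
    return collector.links


def links_to_submit(candidates, already_archived):
    """Drop links that are already known to be archived (assumed: a set of URLs)."""
    return [url for url in candidates if url not in already_archived]
```

Fetching the front page itself and deciding how to remember which links were already submitted are left out, since the post does not specify either.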
It may be advisable to implement automatic link-post archiving as a feature of Less Wrong itself. (Programmatically submitting URLs to archive.is is extremely simple. You send a POST request to http://archive.is/submit/, with a single field, url, with the URL as its value. Archiving is not instantaneous, but after some time the archived content will be accessible via http://archive.is/timegate/[the complete original URL].)
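A minimal sketch of that submission in Python, using only the standard library. It follows the description above and nothing more: the endpoint, the single `url` field, and the timegate prefix come from this post, while the function names and the timeout value are illustrative. The live service may impose requirements not described here (rate limits, extra headers), so treat this as a starting point rather than a tested client.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SUBMIT_ENDPOINT = "http://archive.is/submit/"
TIMEGATE_PREFIX = "http://archive.is/timegate/"


def build_submit_request(url):
    """Build a POST request with a single form field, `url`."""
    data = urlencode({"url": url}).encode("ascii")
    return Request(SUBMIT_ENDPOINT, data=data, method="POST")


def timegate_url(url):
    """Address where the archived copy should eventually be reachable."""
    return TIMEGATE_PREFIX + url


def submit(url, timeout=30.0):
    """Send the submission and return the HTTP status code.

    Archiving is not instantaneous, so a successful response does not
    mean the snapshot is already available at timegate_url(url).
    """
    with urlopen(build_submit_request(url), timeout=timeout) as response:
        return response.status
```

Building the request separately from sending it makes it possible to inspect or log what would be submitted without touching the network.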
Comments sorted by top scores.
comment by Vladimir_Nesov
· score: 11 (5 votes) · LW
This also applies to externally hosted images in regular posts, so ideally archiving shouldn't be restricted to just linkposts. That said, I tried archiving some posts on LW2 on archive.is before (example), and the result is both ugly and missing the comments, probably because of script loading delays or script-related restrictions.
comment by habryka (habryka4)
· score: 7 (4 votes) · LW
Yeah, I am sorry that LessWrong currently doesn't play well with some archival engines. If anyone has any advice on how to fix this, I would be glad to hear it (we already do Server-Side-Rendering, but it seems like some archive sites still get confused).
We are working on adding comments back to Server-Side-Rendering, at which point I expect them to show up in the archive site. There is currently a bug in Node.js that makes comments load super slowly for us, but as soon as that's fixed, I think we should be able to just show you at least 100 comments immediately with the rest of the post.
comment by habryka (habryka4)
· score: 8 (2 votes) · LW
This seems like a great idea. I've wanted to do something in this space for a while, but haven't yet gotten around to figuring out how precisely to do it.
I expect that if we do this for all link posts in a batch, we will probably hit some kind of rate-limit, but I might still give it a try, and/or stagger them somehow. And then setting it up so that any new post gets archived this way, and has an immediate link to an archived version somewhere on the post-page should be quite straightforward.
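One simple way to stagger a batch, sketched in Python: pause for a fixed interval between consecutive submissions. The 60-second default is a guess, not a known limit of any archive service, and `submit_staggered` is a hypothetical helper that takes whatever submit function is in use as an argument.

```python
import time


def submit_staggered(urls, submit, delay_seconds=60.0, sleep=time.sleep):
    """Call submit(url) for each URL, pausing between consecutive calls.

    `sleep` is injectable so the pacing can be tested without waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no pause before the first submission
        results.append(submit(url))
    return results
```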
Thanks a lot for the idea! I don't know how soon we will get around to this, but I do want to make it happen. I expect sometime in the next few months, given current priorities, but I can be convinced we should prioritize it more, if people think it's particularly important. Making this happen would probably also be a very simple open-source contribution, so I welcome anyone to submit a PR implementing this.
comment by Three-Monkey Mind
· score: 13 (4 votes) · LW
Drive-by suggestion: I'd suggest doing the archiving maybe a week or month after posting. That way, most updates to the post are archived, too.
comment by Said Achmiz (SaidAchmiz)
· score: 2 (1 votes) · LW
Actually, a feature I have on my to-do list to add to the ArchiveURLs recipe is automatic periodic (with a configurable period) re-archiving (which archive.is does support). That way, updates will get captured indefinitely.