SL4 in more legible format

post by Richard_Ngo (ricraz) · 2021-08-12T06:36:03.371Z · LW · GW · No comments

This is a question post.

Contents

  Answers
    10 Zack_M_Davis
    5 PeterMcCluskey

Does anyone have a copy of the SL4 archives in a format that's easier to read - e.g. a single file?

If not, I'd be happy to pay someone to put this together; let me know if interested.

Answers

answer by Zack_M_Davis · 2021-08-12T07:48:10.529Z · LW(p) · GW(p)

How much are you offering? I downloaded the by-date archive page and wrote this Python code to download the individual message pages, with the thought of catting them together to get your "single file." (Which is an ugly hack and not even technically valid HTML, but it has worked when I've done it in the past.) But it's only halfway through downloading and I need to turn the computer and its noisy fan off and go to sleep now; this should be easy to finish tomorrow.

import re
import requests

# http://sl4.org/archive/date.html
with open("sl4_archive_date.html", 'rb') as toc:
    archive_page = str(toc.read())
    results = re.findall(r"http://sl4.org/archive/\d+/\d+.html", archive_page)

for i, result in enumerate(results):
    message = requests.get(result)
    with open("messages/{:07}.html".format(i), 'w') as f:
        f.write(str(message.content))

comment by Zack_M_Davis · 2021-08-13T05:03:27.453Z · LW(p) · GW(p)

Ugh, maybe this wasn't such a good strategy. After downloading all the message pages and catting them in a loop, I did "successfully" end up with a 128 MiB HTML file ...

full_sl4_archive.html

But, first of all, it somehow ended up with a lot of escape characters (\n, \') displayed literally in the page. (Not sure how that happe—oh. The str(message.content) in my script should have been a message.content.decode().) Second of all, this enormous file is not actually smooth to read on my machine (it just crashes in Firefox, and Chromium is very laggy). On net, this may not actually be an improved experience over just clicking through the messages on sl4.org. (Third, Mediafire's pop-up ad when you click the "Download" button feels kind of scummy, but it was the first convenient way that came to mind for sharing a large file.)
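For reference, here is a minimal sketch of the download script with the decoding fix from this comment applied. It keeps the same approach (a regex over the saved by-date page, one file per message). The escaped dots in the pattern, the UTF-8 guess with errors="replace", the timeout, and the helper names extract_message_urls, page_text, and download_all are assumptions made for illustration; none of them come from the original script.

```python
import os
import re

# Same pattern as the original script, with the dots escaped so that "."
# matches only a literal period. Assumes the saved page contains absolute URLs.
MESSAGE_URL = re.compile(r"http://sl4\.org/archive/\d+/\d+\.html")


def extract_message_urls(toc_bytes, encoding="utf-8"):
    """Decode the saved by-date index page and return the message URLs in it."""
    return MESSAGE_URL.findall(toc_bytes.decode(encoding, errors="replace"))


def page_text(content, encoding="utf-8"):
    """Turn a response body (bytes) into text.

    Calling str() on bytes returns its repr (the b'...' form), in which
    newlines and some quotes appear as backslash escapes. Decoding avoids that.
    The encoding is a guess; undecodable bytes are replaced, not fatal.
    """
    return content.decode(encoding, errors="replace")


def download_all(toc_path, out_dir="messages"):
    """Fetch every message linked from the saved index into out_dir."""
    import requests  # third-party; only needed for the actual download

    os.makedirs(out_dir, exist_ok=True)
    with open(toc_path, "rb") as toc:
        urls = extract_message_urls(toc.read())
    for i, url in enumerate(urls):
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        path = os.path.join(out_dir, "{:07}.html".format(i))
        with open(path, "w", encoding="utf-8") as f:
            f.write(page_text(response.content))
```

Writing message.content straight to a file opened in 'wb' mode would sidestep the encoding question entirely, since the bytes are then saved exactly as the server sent them.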

answer by PeterMcCluskey · 2021-08-12T18:52:37.595Z · LW(p) · GW(p)

I suggest looking at hypetombox.pl, which should convert the archive into a Unix mbox file. You probably want to use wget first.
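If the conversion works, the resulting mbox file can be read with Python's standard mailbox module as well as with a mail client. A minimal sketch, assuming the file is a well-formed mbox; the function name list_subjects is hypothetical, and how hypetombox.pl itself is invoked is not shown here because I have not checked it.

```python
import mailbox


def list_subjects(mbox_path):
    """Return (date, sender, subject) for each message in an mbox file."""
    # create=False makes a missing file an error instead of creating an empty one.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [(m["date"], m["from"], m["subject"]) for m in box]
    finally:
        box.close()
```

From there it is a short step to sorting by date or grouping by subject to reconstruct threads.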

comment by localdeity · 2021-08-13T08:30:59.767Z · LW(p) · GW(p)

I'd specifically suggest using something like

wget --mirror -np http://sl4.org/archive/date.html

to download everything into a local directory structure. After that you can experiment with how to format the resulting stuff.
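As one possible way to do that formatting step, here is a rough sketch that walks the mirrored directory, strips the HTML, and writes everything into one plain-text file, which should be much lighter to open than a concatenated HTML page. The names TextExtractor, html_to_text, and combine are made up for this example, the UTF-8 decoding is a guess, and sorted path order may or may not match chronological order. Index pages that wget saved alongside the messages will be included too unless you filter them out.

```python
import os
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


def combine(mirror_dir, out_path, encoding="utf-8"):
    """Write the text of every .html file under mirror_dir into out_path.

    Files are processed in sorted path order. Returns the number of files read.
    """
    paths = []
    for root, _dirs, files in os.walk(mirror_dir):
        paths.extend(os.path.join(root, n) for n in files if n.endswith(".html"))
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(paths):
            with open(path, "rb") as f:
                text = html_to_text(f.read().decode(encoding, errors="replace"))
            label = os.path.relpath(path, mirror_dir)
            out.write("=== {} ===\n{}\n\n".format(label, text))
    return len(paths)
```

With the wget command above, the mirror directory would typically be named after the host (something like sl4.org/archive), so a call might look like combine("sl4.org/archive", "sl4_archive.txt").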
