Why are all these domains called from Less Wrong?

post by Viliam · 2020-06-27T13:46:05.857Z · LW · GW · 4 comments

This is a question post.

Contents

  Answers
    33 gbear605
    30 jimrandomh
4 comments

When I visit a Less Wrong page, the browser also attempts to load content from the following domains:

* algolia.net
* algolianet.com
* cloudflare.com
* cloudinary.com
* dl.drop
* dropbox.com
* dropboxusercontent.com
* google.com
* googleapis.com
* googletagmanager.com
* intercom.io
* jsdelivr.net
* lr-ingest.io
* typekit.net

Why is that? I don't want to advertise to half of the internet (and specifically to Google) the fact that I read Less Wrong. What happens if I simply block all these domains? What service do they provide if I don't block them?

Answers

answer by gbear605 · 2020-06-27T16:09:14.631Z · LW(p) · GW(p)

I'm not sure, but here are some guesses.

Algolia provides site search, I believe, which seems reasonable. Cloudflare is generally used for DDoS protection, which is reasonable even if I personally think that Cloudflare is getting too monopolistic. Cloudinary is for image and video storage, probably for images or videos embedded in posts. I'm not sure about dl.drop, and I'm not sure why LW needs to use Dropbox. The Google connections are probably for analytics, although LW could and should definitely do that in-house. Intercom.io is for the messaging-with-developers feature that you can reach by clicking on the button in the bottom right corner of the screen. Jsdelivr.net is a caching service for JavaScript, which helps you save internet bandwidth, which is reasonable. lr-ingest.io is apparently for analytics on user interaction with the site, which seems like a stretch for "does LW need it." Typekit.net (presumably) provides the fonts used, which is useful for caching although LW could also do it locally.

So to summarize, dl.drop, Dropbox, and some of the Google usage are for unknown reasons. Google Analytics and lr-ingest.io make me uncomfortable personally. And Typekit and Jsdelivr provide marginal benefit at some cost, which isn't worth it in my opinion.

comment by habryka (habryka4) · 2020-06-27T17:43:36.338Z · LW(p) · GW(p)

Yep, this is basically right. 

We have recently experimented with LogRocket, but it's currently deactivated (though we might activate it again in the future). People should also feel free to block it, since the only benefit for us is getting more data on how users interact with the site on average. 

We don't use Dropbox, and indeed it isn't loaded on any page I could find. People often use Dropbox to host images for LessWrong posts, so my guess is that this is where that request came from. The same goes for dl.drop and the dropboxusercontent URL. 

Google Analytics is just really useful. We are building up internal analytics infrastructure, but I think we are still quite a bit away from being able to shut down Google Analytics.

The Googleapis are likely for Google's ReCaptcha which we use to identify bots.

Overall, feel free to deactivate basically all of them, and nothing horrible should break, with the exception of typekit, algolia, jsdelivr (blocking it would break LaTeX editing) and cloudflare. You can just deactivate Intercom in your user settings if you don't want it, but you can also just block requests to intercom.io. 
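For readers who filter requests with their own tooling, the rule above can be sketched as a small check. This is only an illustration: the function names and the suffix list are assumptions transcribed from this thread, not an official LessWrong allowlist, and the check is meant for third-party hosts only.

```python
# Hypothetical helper, not part of the LessWrong codebase.
# KEEP_SUFFIXES lists the third-party domains this comment says
# should stay unblocked; every other third-party host is treated
# as blockable ("nothing horrible should break").
KEEP_SUFFIXES = (
    "typekit.net",      # fonts
    "algolia.net",      # site search
    "algolianet.com",   # site search
    "jsdelivr.net",     # MathJax (LaTeX editing)
    "cloudflare.com",   # CDN
)


def matches_suffix(host: str, suffix: str) -> bool:
    """True if host is the suffix itself or a subdomain of it."""
    host = host.lower().rstrip(".")
    return host == suffix or host.endswith("." + suffix)


def can_block(host: str) -> bool:
    """True if a third-party host is outside the keep list above."""
    return not any(matches_suffix(host, s) for s in KEEP_SUFFIXES)


if __name__ == "__main__":
    for h in ("use.typekit.net", "widget.intercom.io", "r.lr-ingest.io"):
        print(h, "->", "block" if can_block(h) else "keep")
```

Matching on a leading dot avoids treating an unrelated domain that merely ends in the same letters as a subdomain.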

comment by habryka (habryka4) · 2020-06-27T19:28:00.360Z · LW(p) · GW(p)

Typekit.net (presumably) provides the fonts used, which is useful for caching although LW could also do it locally.

Small comment on this: We have a Typekit subscription, which sadly does not actually allow us to download the fonts and serve them locally. We have to serve them directly from Adobe's servers. It's slightly annoying, but I don't think it's bad enough that I would want to stop using Typekit (which has overall been pretty decent and gives us access to a really wide range of fonts).

comment by habryka (habryka4) · 2020-06-27T19:31:03.930Z · LW(p) · GW(p)

Jsdelivr provide marginal benefit at some cost, which aren't worth it in my opinion.

The only place where we use jsdelivr is for serving MathJax. It is the canonical source that MathJax links to in its documentation, which seems good because it lets browsers cache MathJax across multiple sites, so I think this is the best solution here. It seems worse for us to set up our own CDN, and worse to serve it from the LessWrong server, since that makes our job harder.

answer by jimrandomh · 2020-06-27T20:43:25.113Z · LW(p) · GW(p)

LessWrong developer here. Here's an overview of what all those domains are. The code is open source, so you should be able to verify these, with some effort.

Algolia (algolia.net, algolianet.com) is a service we use for site search (what you get when you click the magnifying glass icon on the top bar). They have a mirror of all searchable data (i.e. non-draft posts and comments, tag pages, user bios); they receive a copy of searches that are performed through the site search box, which they can associate with IP addresses but not with usernames.

Cloudflare is a CDN that is hosting components of MathJax, the Javascript library that renders LaTeX in posts and comments, and some libraries we use for integrating MathJax with the comment/post editors. The CDN URLs were defaults that came with libraries we're using; we could probably move them to our own domain with a little effort. JsDelivr is hosting some things that similarly came with library defaults, as parts of MathJax3 and Algolia.

Cloudinary is an image-hosting CDN that we use for images in some posts and images that are part of the site UI.

dropbox.com and dropboxusercontent.com are hosting images that were used in posts, and were presumably requested because those posts were visible in the Recent Discussion section when you loaded the front page. Currently, when users insert images into posts, depending on how they do it and which editor they're using, the image may point to its original domain. Also, for authors for whom we have set up automatic crossposting, the crossposts will use the original image URLs. We will hopefully switch this to always upload those images to Cloudinary and host them there instead, partially for privacy reasons but mostly to prevent link rot in archives of old posts.
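To see which outside hosts a given post's images would contact, one could scan the post's HTML for `img` tags. The sketch below uses only the Python standard library; the class and function names are made up for illustration, and it does not reflect how LessWrong itself processes posts.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ImageHostCollector(HTMLParser):
    """Collects the hostnames of all <img src="..."> URLs in a document."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if not src:
            return
        host = urlparse(src).hostname  # None for relative URLs
        if host:
            self.hosts.add(host)


def external_image_hosts(html: str, own_hosts: set) -> list:
    """Return image hosts found in html that are not in own_hosts."""
    collector = ImageHostCollector()
    collector.feed(html)
    return sorted(h for h in collector.hosts if h not in own_hosts)
```

Passing the site's own image hosts as `own_hosts` leaves only the domains that a reader's browser would contact because of how the author inserted the image.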

dl.drop is not a valid domain name; it's either a broken image link in some post that was in Recent Discussion, or a typo in this post.

The Google domains are from Google Analytics, Google Tag Manager, Google Fonts, and ReCaptcha. Google Analytics and Google Tag Manager measure site traffic and aggregate usage patterns.

intercom.io is for the chat icon in the bottom-right corner, used for messaging the admins about the site.

lr-ingest.io is LogRocket. We (the devs) use it to see how the site is being used; we can watch anonymized replays of sessions (anonymized in that the username in the corner is edited out). As a matter of policy, we don't read people's direct messages or unpublished drafts, or deanonymize votes, though in principle we have the capability to (either with this tool or with direct database access).

TypeKit, aka Adobe Fonts, is a font library and font hosting service. We could probably consolidate this with one of the other CDNs being used, but font-hosting involves some user-agent-string based compatibility polyfills, which would be somewhat annoying to reproduce ourselves.
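The overview above can be condensed into a lookup table for annotating a list of request URLs, for example one copied from a browser's network panel. This is a minimal sketch: the labels paraphrase this answer, and the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Domain suffix -> purpose, paraphrased from the overview above.
SERVICES = {
    "algolia.net": "Algolia (site search)",
    "algolianet.com": "Algolia (site search)",
    "cloudflare.com": "Cloudflare CDN (MathJax components)",
    "jsdelivr.net": "jsDelivr CDN (library defaults)",
    "cloudinary.com": "Cloudinary (image hosting)",
    "dropbox.com": "Dropbox (images linked from posts)",
    "dropboxusercontent.com": "Dropbox (images linked from posts)",
    "google.com": "Google (Analytics, Tag Manager, Fonts, ReCaptcha)",
    "googleapis.com": "Google (Analytics, Tag Manager, Fonts, ReCaptcha)",
    "googletagmanager.com": "Google (Analytics, Tag Manager, Fonts, ReCaptcha)",
    "intercom.io": "Intercom (chat with admins)",
    "lr-ingest.io": "LogRocket (session replay)",
    "typekit.net": "TypeKit / Adobe Fonts",
}


def service_for(host: str) -> str:
    """Return the purpose label for a host, or "unknown"."""
    host = host.lower().rstrip(".")
    for suffix, label in SERVICES.items():
        if host == suffix or host.endswith("." + suffix):
            return label
    return "unknown"


def group_by_service(urls) -> dict:
    """Group the hostnames of the given URLs under their purpose label."""
    groups = defaultdict(set)
    for url in urls:
        host = urlparse(url).hostname
        if host:
            groups[service_for(host)].add(host)
    return {label: sorted(hosts) for label, hosts in groups.items()}
```

Hosts that match no entry, such as the invalid dl.drop, fall under "unknown", which makes unexpected requests easy to spot.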

comment by Raemon · 2020-06-27T20:46:03.482Z · LW(p) · GW(p)

Quick note for transparency, re: LogRocket – previously, we used another service called FullStory which did indeed edit out the username. We're currently trying out LogRocket to make sure it's basically worthwhile, and haven't yet implemented various anonymization practices, but plan to.

comment by habryka (habryka4) · 2020-06-27T21:47:31.374Z · LW(p) · GW(p)

TypeKit, aka Adobe Fonts, is a font library and font hosting service. We could probably consolidate this with one of the other CDNs being used, but font-hosting involves some user-agent-string based compatibility polyfills, which would be somewhat annoying to reproduce ourselves.

Small correction to this. As I mentioned below, we don't actually have a license to host the fonts we serve on our own servers. We could buy one, but it would probably run to at least hundreds and possibly thousands of dollars per year, because fonts are expensive.

4 comments

Comments sorted by top scores.

comment by Rudi C (rudi-c) · 2020-06-27T14:58:06.293Z · LW(p) · GW(p)

What about greaterwrong.com?

Replies from: Viliam
comment by Viliam · 2020-06-27T22:37:07.206Z · LW(p) · GW(p)

It only calls google-analytics.com.

Replies from: SaidAchmiz
comment by Said Achmiz (SaidAchmiz) · 2020-06-27T23:42:38.860Z · LW(p) · GW(p)

Which you can block unproblematically; no site functionality depends on it. In fact, if you’ve got uBlock Origin, GA will be blocked automatically.

Replies from: Viliam
comment by Viliam · 2020-06-28T12:45:39.368Z · LW(p) · GW(p)

I use uMatrix (on Firefox), which blocks everything by default.