LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Toy Models of Feature Absorption in SAEs
chanind · 2024-10-07T09:56:53.609Z · comments (7)

Compelling Villains and Coherent Values
Cole Wyeth (Amyr) · 2024-10-06T19:53:47.891Z · comments (4)

Losing Faith In Contrarianism
omnizoid · 2024-04-25T20:53:34.842Z · comments (44)

[question] How would you navigate a severe financial emergency with no help or resources?
Tigerlily · 2024-05-02T18:27:51.329Z · answers+comments (22)

Medical Roundup #2
Zvi · 2024-04-09T13:40:05.908Z · comments (18)

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing
Connor Kissane (ckkissane) · 2024-10-27T18:46:21.316Z · comments (1)

A New Class of Glitch Tokens - BPE Subtoken Artifacts (BSA)
Lao Mein (derpherpize) · 2024-09-20T13:13:26.181Z · comments (7)

OODA your OODA Loop
Raemon · 2024-10-11T00:50:48.119Z · comments (3)

[link] A High Decoupling Failure
Maxwell Tabarrok (maxwell-tabarrok) · 2024-04-14T19:46:09.552Z · comments (5)

[link] An X-Ray is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation
hugofry · 2024-10-07T08:53:14.658Z · comments (0)

The murderous shortcut: a toy model of instrumental convergence
Thomas Kwa (thomas-kwa) · 2024-10-02T06:48:06.787Z · comments (0)

Distinguish worst-case analysis from instrumental training-gaming
Olli Järviniemi (jarviniemi) · 2024-09-05T19:13:34.443Z · comments (0)

Thousands of malicious actors on the future of AI misuse
Zershaaneh Qureshi (zershaaneh-qureshi) · 2024-04-01T10:08:42.357Z · comments (0)

Striking Implications for Learning Theory, Interpretability — and Safety?
RogerDearnaley (roger-d-1) · 2024-01-05T08:46:58.915Z · comments (4)

[question] Is a random box of gas predictable after 20 seconds?
Thomas Kwa (thomas-kwa) · 2024-01-24T23:00:53.184Z · answers+comments (35)

The Defence production act and AI policy
[deleted] · 2024-03-01T14:26:09.064Z · comments (0)

Possible OpenAI's Q* breakthrough and DeepMind's AlphaGo-type systems plus LLMs
Burny · 2023-11-23T03:16:09.358Z · comments (25)

Otherness and control in the age of AGI
Joe Carlsmith (joekc) · 2024-01-02T18:15:54.168Z · comments (0)

Your LLM Judge may be biased
Henry Papadatos (henry) · 2024-03-29T16:39:22.534Z · comments (9)

Enhancing intelligence by banging your head on the wall
Bezzi · 2023-12-12T21:00:48.584Z · comments (26)

I’m confused about innate smell neuroanatomy
Steven Byrnes (steve2152) · 2023-11-28T20:49:13.042Z · comments (1)

AI #49: Bioweapon Testing Begins
Zvi · 2024-02-01T15:30:04.690Z · comments (11)

UDT1.01: The Story So Far (1/10)
Diffractor · 2024-03-27T23:22:35.170Z · comments (6)

What is wisdom?
TsviBT · 2023-11-14T02:13:49.681Z · comments (3)

Principles For Product Liability (With Application To AI)
johnswentworth · 2023-12-10T21:27:41.403Z · comments (55)

Deconfusing In-Context Learning
Arjun Panickssery (arjun-panickssery) · 2024-02-25T09:48:17.690Z · comments (1)

Interview with Vanessa Kosoy on the Value of Theoretical Research for AI
WillPetillo · 2023-12-04T22:58:40.005Z · comments (0)

Review Report of Davidson on Takeoff Speeds (2023)
Trent Kannegieter · 2023-12-22T18:48:55.983Z · comments (11)

[link] Dark Skies Book Review
PeterMcCluskey · 2023-12-29T18:28:59.352Z · comments (3)

[link] The Hippie Rabbit Hole -Nuggets of Gold in Rivers of Bullshit
Jonathan Moregård (JonathanMoregard) · 2024-01-05T18:27:01.769Z · comments (20)

[link] [Fiction] A Confession
Arjun Panickssery (arjun-panickssery) · 2024-04-18T16:28:48.194Z · comments (2)

[link] A Percentage Model of a Person
Sable · 2024-10-12T17:55:07.560Z · comments (3)

Exploring SAE features in LLMs with definition trees and token lists
mwatkins · 2024-10-04T22:15:28.108Z · comments (5)

Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition
cmathw · 2024-04-08T11:14:43.268Z · comments (4)

AI #66: Oh to Be Less Online
Zvi · 2024-05-30T14:20:03.334Z · comments (6)

[link] I didn't have to avoid you; I was just insecure
Chipmonk · 2024-08-17T16:41:50.237Z · comments (7)

[link] WSJ: Inside Amazon’s Secret Operation to Gather Intel on Rivals
trevor (TrevorWiesinger) · 2024-04-23T21:33:08.049Z · comments (5)

On DeepMind’s Frontier Safety Framework
Zvi · 2024-06-18T13:30:21.154Z · comments (4)

Free Will and Dodging Anvils: AIXI Off-Policy
Cole Wyeth (Amyr) · 2024-08-29T22:42:24.485Z · comments (12)

[link] Twitter thread on AI takeover scenarios
Richard_Ngo (ricraz) · 2024-07-31T00:24:33.866Z · comments (0)

[question] Is there software to practice reading expressions?
lsusr · 2024-04-23T21:53:00.679Z · answers+comments (10)

Games for AI Control
charlie_griffin (cjgriffin) · 2024-07-11T18:40:50.607Z · comments (0)

Turning Your Back On Traffic
jefftk (jkaufman) · 2024-07-17T01:00:08.627Z · comments (7)

COT Scaling implies slower takeoff speeds
Logan Zoellner (logan-zoellner) · 2024-09-28T16:20:00.320Z · comments (56)

Glitch Token Catalog - (Almost) a Full Clear
Lao Mein (derpherpize) · 2024-09-21T12:22:16.403Z · comments (3)

[link] Turning 22 in the Pre-Apocalypse
testingthewaters · 2024-08-22T20:28:25.794Z · comments (14)

LASR Labs Spring 2025 applications are open!
Erin Robertson · 2024-10-04T13:44:20.524Z · comments (0)

[link] UC Berkeley course on LLMs and ML Safety
Dan H (dan-hendrycks) · 2024-07-09T15:40:00.920Z · comments (1)

[link] Shifting Headspaces - Transitional Beast-Mode
Jonathan Moregård (JonathanMoregard) · 2024-08-12T13:02:06.120Z · comments (9)

But Where do the Variables of my Causal Model come from?
Dalcy (Darcy) · 2024-08-09T22:07:57.395Z · comments (1)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

christian-z-r on D&D Sci Coliseum: Arena of Data

A thanks a lot. I was actually working through the earlier scenarios, I just missed that I new one had popped up. Subscribed now, then I will hopefully notice the next one.

Also, my approach didn't work this time, I ended up trying with a way too complicated model. I really like how the actual answer to this one worked.

avturchin on avturchin's Shortform

Lifehack: If you're attacked by a group of stray dogs, pretend to throw a stone at them. Each dog will think you're throwing the stone at it and will run away. This has worked for me twice.

cousin_it on The Alignment Trap: AI Safety as Path to Power

Yeah, this is my main risk scenario. But I think it's more correct to talk about imbalance of power, not concentration of power. Maybe there will be one AI dictator, or one human+AI dictator, or many AIs, or many human+AI companies; but anyway most humans will end up at the bottom of a huge power differential. If history teaches us anything, this is a very dangerous prospect.

It seems the only good path is aligning AI to the interests of most people, not just its creators. But there's no commercial or military incentive to do that, so I don't see it happening, at least not by default.

james-chua on Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence

author on Binder et al. 2024 here. Thanks for reading our paper and suggesting the experiment!

To summarize the suggested experiment:

Train a model to be calibrated on whether it gets an answer correcct.
Modify the model (e.g. activation steering). This changes the model's performance on whether it gets an answer correct.
Check if the modified model is still well calibrated.

This could work and I'm excited about it.

One failure mode is that the modification makes the model very dumb in all instances. Then its easy to be well calibrated on all these instances -- just assume the model is dumb. An alternative is to make the model do better on some instances (by finetuning?), and check if the model is still calibrated on those too.

remmelt-ellen on Why Stop AI is barricading OpenAI

Noticing no response here after we addressed superficial critiques and moved to discussing the actual argument.

For those few interested in questions raised above, Forrest wrote some responses: http://69.27.64.19/ai_alignment_1/d_241016_recap_gen.html

The claims made will feel unfamiliar and the reasoning paths too. I suggest (again) taking the time to consider what is meant. If a conclusion looks intuitively wrong from some AI Safety perspective, it may be valuable to explicitly consider the argumentation and premises behind that.

tropicalfruit on Dating Roundup #1: This is Why You’re Single

Same. It would take incredible effort to find one person I reasonably connect with each year.

So much of this is just location. I've met 100s of people over the last few years. Nearly all either over 40 with kids, or those kids. I've connected with many, maybe 10%, on a pretty good level. That doesn't help with dating at all.

I just really, really don't want it to be the case that he only answer is: move to NY, SF, or Seattle, becuase I really like it here.

tailcalled on Three Notions of "Power"

However, though dominance is hard-coded, it seems like something of a simple evolved hack to avoid costly fights among relatively low-cognitive-capability agents; it does not seem like the sort of thing which more capable agents (like e.g. future AI, or even future more-intelligent humans) would rely on very heavily.

This seems exactly reversed to me. It seems to me that since dominance underlies defense, law, taxes and public expenditure, it will stay crucial even with more intelligent agents. Conversely, as intelligence becomes "too cheap to meter", "getting what you want" will become less bottlenecked on relevant insights, as those insights are always available.

green_leaf on Habryka's Shortform Feed

I use Google Chrome on Ubuntu Budgie and it does look to me like both the font and the font size changed.

saidachmiz on Habryka's Shortform Feed

Well, let’s see. Calibri is a humanist sans; Gill Sans is technically also humanist, but more more geometric in design. Geometric sans fonts tend to be less readable when used for body text.

Gill Sans has a lower x-height than Calibri. That (obviously) is the cause of all the “the new font looks smaller” comments.

(A side-by-side comparison of the fonts, for anyone curious, although note that this is Gill Sans MT Pro, not Gill Sans Nova, so the weight [i.e., stroke thickness] will be a bit different than the version that LW now uses.)

Now, as far as font rendering goes… I just looked at the site on my Windows box (adjusting the font stack CSS value to see Gill Sans Nova again, since I see you guys tweaked it to give Calibri priority)… yikes. Yeah, that’s not rendering well at all. Definitely more blurry than Calibri. Maybe something to do with the hinting, I don’t know. (Not really surprising, since Calibri was designed from the beginning to look good on Windows.) And I’ve got a hi-DPI monitor on my Windows machine…

Interestingly, the older version of Gill Sans (seen in the demo on my wiki, linked above) doesn’t have this problem; it renders crisply on Windows. (Note that this is not the flawed, broken-kerning version of the font that comes with Macs!)

I also notice that the comment font size is set to… 15.08px. Seems weird? Bumping it up to 16px improves things a bit, although it’s still not amazing.

If you can switch to the older (but not broken) version of Gill Sans, that’d be my recommendation.

If you can’t… then one option might be to check out one of the many similar fonts to see if perhaps one of them renders better on Windows while still having matching metrics.

habryka4 on Habryka's Shortform Feed

Sure, I was just responding to this literal quote:

Couldn't you please just set the comment font to the same as the post font?