LessWrong 2.0 Reader
I think it's still valid to ask in the abstract whether RL is a particularly dangerous approach to training an AI system.
elityre on My hour of memoryless lucidity
To test whether Drake’s circumvention of his short-term memory loss worked via the intended mechanism, I could ask my girlfriend in advance to prompt me once — and only once — to complete the long-term memory scene that I had been practicing. Then I could see if I have a memory of the scene after I fully regain my memory.
You should have her decide (and write down) what to encode in advance, so that you can check later not just whether you remembered something, but whether you successfully encoded it in such a way that it communicates what you intended to communicate to yourself.
(Since Drake managed to send a memory, but was only guessing about what it was intended to mean.)
david-james on Announcing the LessWrong Curated Podcast
I listened to part of “Processor clock speeds are not how fast AIs think”, but I was disappointed by the lack of a human narrator. I am not interested in machine readings; I would prefer to go read the article.
bhauth on introduction to cancer vaccines
That's true; I misremembered that part when I wrote it. I'll just remove that.
nina-rimsky on Refusal in LLMs is mediated by a single direction
The direction extracted using the same method will vary per layer, but this doesn’t mean the true feature direction varies that much; rather, it cannot be extracted as a linear function of the activations at layers that are too early or too late.
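A minimal sketch of the difference-of-means style of direction extraction this thread discusses, assuming residual-stream activations have already been collected; the array names, shapes, and helper functions here are illustrative assumptions, not the post's actual code:

```python
import numpy as np

def directions_per_layer(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction at each layer, unit-normalized.

    Both inputs are (hypothetical) residual-stream activations of shape
    (n_prompts, n_layers, d_model), collected beforehand.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)  # (n_layers, d_model)
    return diff / np.linalg.norm(diff, axis=-1, keepdims=True)

def cosine_to_reference(dirs: np.ndarray, ref_layer: int) -> np.ndarray:
    """Cosine similarity of each layer's extracted direction to a reference layer.

    Low similarity at very early or very late layers is consistent with the
    feature not being linearly recoverable there, per the comment above.
    """
    return dirs @ dirs[ref_layer]
```

Plotting `cosine_to_reference` against layer index would make the "varies per layer" observation concrete: the extracted directions agree over a middle band of layers and diverge at the extremes.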
ricki-heicklen on The Best Tacit Knowledge Videos on Every Subject
InDesign! Or anything on page layout / publishing / how to make a pretty and well-formatted book.
pat-myron on ChristianKl's Shortform
shut your phone off
Leave phones elsewhere, remove their batteries, or Faraday-cage them if you're concerned about state-level actors:
yonge on D&D.Sci Long War: Defender of Data-mocracy
There is some evidence that 2 artillery are sufficient to deal with 3 tyrants, but the amount of data is a bit small. I couldn't see any other change I could make that wouldn't lead to at least some measurable risk of losing. Risking being eaten to impress my superiors feels like a poor trade-off, especially as they should hopefully be at least somewhat impressed with winning a battle at 10:16 odds, so I will stick with my initial selection. (Though I'm pretty sure that someone who was willing to take some risk of being eaten for extra prestige would be well advised to take one fewer artillery.)
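A hypothetical sketch of the kind of check behind "the amount of data is a bit small": filter the scenario's battle records for the matchup in question and look at both the win rate and the sample size. The file and column names (`battles.csv`, `artillery`, `tyrants`, `won`) are assumptions, not the actual scenario schema:

```python
import pandas as pd

# Hypothetical: load the scenario's historical battle records.
battles = pd.read_csv("battles.csv")

# Battles where our side fielded 2 artillery against 3 tyrants.
subset = battles[(battles["artillery"] == 2) & (battles["tyrants"] == 3)]

# A high win rate on only a handful of rows is weak evidence.
print(f"{len(subset)} matching battles, win rate {subset['won'].mean():.0%}")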
Compared to what?
If you want an agentic system (and I think many humans do, because agents can get things done), you've got to give it goals somehow. RL is one way to do that. The question of whether that's less safe isn't meaningful without comparing it to another method of giving it goals.
The method I think is both safer and implementable is giving goals in natural language to a system that primarily "thinks" in natural language. This seems markedly safer than any RL proposal anyone has come up with so far. There are also other options for specifying goals without using RL, each of which seems safer to me:
Goals selected from learned knowledge: an alternative to RL alignment [AF · GW]
raemon on Introducing AI Lab Watch
I do think being one syllable is pretty valuable, although "AI Org Watch" might be fine (it kinda rolls off the tongue worse).