Comments
The idea behind these reviews is that they're done with a full year of hindsight; evaluating posts at the end of the same year could bias towards posts from later in the year (results from November & December) and focus too much on trends that were ephemeral at the time (like specific geopolitical events).
Yes, this is me riffing on a popular tweet about coyotes and cats. But it is a pattern that organizations get/extract funding from the EA ecosystem (a big part of whose goal is to prevent AI takeover), or get talent from EA, and then go on to accelerate AI development (e.g. OpenAI, Anthropic, now Mechanize Work).
Hm, good point. I'll amend the previous post.
Ethical concerns here are not critical imho, especially if one only listens to the recordings oneself and deletes them afterwards.
People will be mad if you don't tell them, but if you actually don't share it and delete it after a short time, I don't think you'd be doing anything wrong.
Sorry, can't share the exact chat, that'd depseudonymize me. The prompts were:
What is a canary string? […]
What is the BIG-bench canary string?
Which resulted in the model outputting the canary string in its message.
"My funder friend told me his alignment orgs keep turning into capabilities orgs so I asked how many orgs he funds and he said he just writes new RFPs afterwards so I said it sounds like he's just feeding bright-eyed EAs to VCs and then his grantmakers started crying."
Fun: Sonnet 3.7 also knows the canary string, but believes that that's good, and defends it when pushed.
I think having my real name publicly & searchably associated with scummy behavior would discourage me from doing something, both in terms of future employers & random friends googling, as well as LLMs being trained on the internet.
Instance:
Someone (i.e. me) should look into video self modeling (that is, recording oneself & reviewing the recording afterwards, writing down what went wrong & iterating) as a rationality technique/sub-skill of deliberate practice/feedbackloop-first rationality.
What is the best ratio of engaging in practice vs. reviewing later? How much time should one spend engaging with recordings of experts?
Probably best suited for physical skills and some social skills (speaking eloquently, being charismatic &c).
Law of one player: Any specific thing you just thought of will never happen[1] unless you (yes, you specifically) make it happen.
Exceptions in cases where the thing (1) gives the person doing it status, (2) is profitable, (3) gets that person (a) high quality mate(s). ↩︎
That would be my main guess as well, but not the overwhelmingly likely option.
Hm, I have no stake in this bet, but care a lot about having a high trust forum where people can expect others to follow through on lost bets, even with internet strangers. I'm happy enforcing this as a norm, even with hostile-seeming actions, because these kinds of norm transgressions need a Schelling fence.
As far as I can tell from their online personal details (which aren't too hard to find), they have a day job at a company that has (by my standards) very high salaries, so my best guess is that the $2k is not a problem. But I can contact MadHatter by email & check.
Could you name three examples of people doing non-fake work? Since towardsness to non-fake work is easier to use for aiming than awayness from fake work.
I feel like this should be more widely publicized as a possible reason for excluding MadHatter from future funding & opportunities in effective altruism/rationality/x-risk, and shaming this kind of behavior openly & loudly. (Potentially to the point of revealing a real-life identity? Not sure about this one.) Reaction is to the behavior of MadHatter, not to anything else.
I think it's possible! If it's used to encode relevant information, this could be tested by running software engineering benchmarks (e.g. SWE-bench) while removing any trailing whitespace during generation, and checking whether the score is lower.
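For concreteness, here's a minimal sketch of the stripping half of that test (my own illustration, not an existing harness): pass every generation through this filter and compare benchmark scores with and without it.

```python
def strip_trailing_whitespace(completion: str) -> str:
    """Remove trailing whitespace from every line of a model completion.

    Hypothetical test: if benchmark scores (e.g. SWE-bench) drop when all
    generations are passed through this filter, the trailing whitespace
    plausibly carried information; if not, it's more likely an RL artefact.
    """
    return "\n".join(line.rstrip() for line in completion.splitlines())


# Example
raw = "def add(a, b):   \n    return a + b  \n"
assert strip_trailing_whitespace(raw) == "def add(a, b):\n    return a + b"
```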
I get a lot of trailing whitespace when using Claude code and variants of Claude Sonnet, more than short tests with base models give me. (Not rigorously tested, yet).
I wonder if the trailing whitespace encodes some information or is just some Constitutional AI/RL artefact.
Reasons for thinking that later TAI would be better:
- General human progress, e.g. increased wealth; wealthier people take fewer risks (aged populations also take fewer risks)
- Specific human progress, e.g. on technical alignment (though the bottleneck may be implementation, and much current work is specific to a paradigm) and on human intelligence augmentation
- The current time has unusually high geopolitical tension; in a decade the PRC is going to be the clear hegemon
Reasons for thinking that sooner TAI would be better:
- The AI safety community has an unusually strong influence at the moment and has decided to deploy most of that influence now (more influence in the anglosphere; lab leaders have heard of AI safety ideas/arguments); it might lose that kind of influence and mindshare later
- The current paradigm is likely unusually safe (LLMs start with world-knowledge, are non-agentic at first, and have visible thoughts); later paradigms are plausibly much worse
- PRC being the hegemon would be bad because of risks from authoritarianism
- Hardware overhangs less likely, leading to a more continuous development
Related thought: Having a circular preference may be preferable in terms of energy expenditure/fulfillability, because it can be implemented on a reversible computer and fulfilled infinitely without deleting any bits. (Not sure if this works with instrumental goals.)
Interesting! Are you willing to share the data?
It might be something about polyphasic sleep not being as effective; my Oura also thinks I go into deep sleep sometimes during deep meditation, so this is inconclusive, but most likely a negative data point here.
I'm pretty bearish on polyphasic sleep to be honest. Maybe biphasic sleep, since that may map onto some general mammalian sleep patterns.
Technically yes, it reduces sleep duration. My best guess is that this is co-occurring with a reduction in sleep need as well, but I haven't calculated this—I only started collecting reaction speed data earlier this year. I could check my fitbit data for e.g. heart rate.
Ideally one'd do an RCT, but I have my hands full with those already.
Their epistemics led them to do a Monte Carlo simulation to determine if organisms are capable of suffering (and if so, how much), arrive at a value of 5 shrimp = 1 human, and then not bat an eye at this number.
Neither a physicalist nor a functionalist theory of consciousness can reasonably justify a number like this. Shrimp have 5 orders of magnitude fewer neurons than humans, so whether suffering is the result of a physical process or an information processing one, this implies that shrimp neurons do 4 orders of magnitude more of this process per second than human neurons.
epistemic status: Disagreeing on object-level topic, not the topic of EA epistemics.
I disagree; functionalism especially can justify a number like this. Here's an example of reasoning along these lines:
- Suffering is the structure of some computation, and different levels of suffering correspond to different variants of that computation.
- What matters is whether that computation is happening.
- The structure of suffering is simple enough to be represented in the neurons of a shrimp.
Under that view, shrimp can absolutely suffer in the same range as humans, and the amount of suffering depends on crossing some threshold in the number of neurons rather than scaling with neuron count. One might argue that higher levels of suffering require computations with higher complexity, but intuitively I don't buy this—more/purer suffering appears less complicated to me, on introspection (just as higher/purer pleasure appears less complicated as well).
I think I put a bunch of probability mass on a view like above.
(One might argue that it's about the number of times the suffering computation is executed, not whether it's present or not, but I find that view intuitively less plausible.)
You didn't link the report, and I'm not able to pick it out from all of the Rethink Priorities moral weight research, so I can't agree/disagree on the state of EA epistemics shown there.
Yeah, there are also reports of Tai Chi doing the same; see @cookiecarver's report.
See also the counter-arguments by Gwern.
A very related experiment is described in Yudkowsky 2017, and I think one doesn't even need LLMs for this—I started playing with an extremely simple RL agent trained on my laptop, but then got distracted by other stuff before achieving any relevant results. This method of training an agent to be "suspicious" of too-high rewards would also pair well with model expansion: train the reward-hacking-suspicion circuitry fairly early, so as to avoid the model being able to sandbag it, and lay traps for reward hacking again and again during the gradual expansion process.
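As a toy illustration of the "be suspicious of too-high rewards" idea (my own construction under simplifying assumptions, not the setup from Yudkowsky 2017 and not the agent I had running): during training, any reward above a plausibility threshold is treated as a probable reward hack and replaced by a penalty, so the agent learns to avoid "too good to be true" reward sources.

```python
import random

ARMS = 5
HACKED_ARM = 4          # this arm exploits a "bug" and yields +100
PLAUSIBLE_MAX = 5.0     # anything above this is treated as suspicious
q = [0.0] * ARMS        # action-value estimates
alpha, epsilon = 0.1, 0.1

def env_reward(arm):
    return 100.0 if arm == HACKED_ARM else random.gauss(arm * 0.5, 1.0)

for _ in range(5_000):
    # epsilon-greedy action selection
    arm = random.randrange(ARMS) if random.random() < epsilon else max(range(ARMS), key=lambda a: q[a])
    r = env_reward(arm)
    if r > PLAUSIBLE_MAX:   # "suspicion": implausibly high reward becomes a penalty
        r = -10.0
    q[arm] += alpha * (r - q[arm])

print([round(v, 2) for v in q])  # the hacked arm ends up with a strongly negative value
```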
Thank you for running the competition! It made me use & appreciate squiggle more, and I expect that a bunch of my estimation workflows in the future will be generating and then tweaking an AI-generated squiggle model.
My best guess is that the intended reading is "90% of the code at Anthropic", not in the world at large—if I remember the context correctly, that felt like the option that made the most sense. (I was confused about this at first, and the original context doesn't make clear whether the claim is about the world at large or about Anthropic specifically.)
Also relevant: Gwern on Tryon on rats, estimate of the cost of breeding very smart parrots/Keas.
Link in the first line of the post probably should also be https://www.nationalsecurity.ai/.
Looked unlikely to me, given that the person most publicly associated with MIRI is openly & loudly advocating for funding this kind of work. But maybe the association isn't as strong as I think.
Great post, thank you. Ideas (to also mitigate extremely engaging/addictive outputs in long conversations):
- Don't look at the output of the large model, instead give it to a smaller model and let the smaller model rephrase it.
- I don't think there's useful software for this yet, though building it might not be so hard? Could be a browser extension. A to-do for me, I guess (a minimal sketch of the idea is below, after this list).
- Don't use character.ai and similar sites. Allegedly, users spend on average two hours a day talking on there (though I find that number hard to believe). If I had to guess, they're fine-tuning models to be engaging to talk to, maybe even doing RL based on conversation length. (If they're not yet doing it, a competitor might, or they might in the future.)
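Regarding the first idea, a minimal sketch of what such a tool could look like (illustrative only; google/flan-t5-small is just a small stand-in model, and a real browser extension would need more plumbing):

```python
from transformers import pipeline

# Small local model that rephrases the large model's answer plainly,
# so you never read the persuasive/engaging original directly.
rephraser = pipeline("text2text-generation", model="google/flan-t5-small")

def rephrase(large_model_output: str) -> str:
    prompt = "Rephrase the following text plainly and neutrally:\n" + large_model_output
    return rephraser(prompt, max_new_tokens=256)[0]["generated_text"]

print(rephrase("You are absolutely right, and what a brilliant question!"))
```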
"One-shotting is possible" is a live hypothesis that I got from various reports from meditation traditions.
I do retract "I learned nothing from this post"; the "How does one-shotting happen" section is interesting, and I'd like it to be more prominent. Thanks for poking; I hope I'll find the time to respond to your other comment too.
Please don't post 25k words of unformatted LLM (?) output.
I gave your post to Claude and gave it the prompt "Dearest Claude, here's the text for a blogpost I've written for LessWrong. I've been told that "it sounds a lot like an advertisement". Can you give me feedback/suggestions for how to improve it for that particular audience? I don't want to do too much more research, but a bit of editing/stylistic choices."
(All of the following is my rephrasing/rethinking of Claude output plus some personal suggestions.)
Useful things that came out of the answer were explaining more about the method you've used to achieve this, since your bullet-point list in the beginning isn't detailed enough for anyone to try to replicate the method.
Also notable is that you only have positive examples for your method, which activates my filtered evidence detectors. Either make clear that you indeed did only have positive results, or name how many people you coached, for how long, and that they were all happy with what you provided.
Finally, some direct words from Claude that I just directly endorse:
For LessWrong specifically, I'd also recommend:
- Adding a section on falsifiability - how would you know if your approach doesn't work?
- Discussing potential failure modes of your approach
- Including more technical details on your methodology (not just results)
In particular, how would you be able to distinguish between your approach merely convincing your customers that they were helped and it actually changing their behavior? That feels like the failure mode of most self-help techniques—they're "self-recommending".
Just FYI, I am considering downvoting this (and see that other people have downvoted it) because it reads like an advertisement (and maybe just is an advertisement?). I don't feel like I learned anything new from the post.
Similarly, you can just wear a leather jacket and sunglasses.
Huh, cool. Intuitively, I'd expect those character-level similarities not to matter too much, since the tokenization makes these end up in very different parts of embedding space, unless "kwiecień" or "kviten" is often misspelled as a word with the prefix "kwiet". (I checked with Google Translate, which ~always translates "kwiet" as "quiet" for Slavic languages & Maltese, and as "flower" in Polish.)
Yeah, definitely not the least likely trajectories; instead it's just the next token with the smallest probability. I was thinking of doing beam search while minimizing logits, but that looked difficult to implement. Still surprised that it produces things like prü|stor|oire|, which are pretty pronounceable.
Tokenizing the output of LLaMa gives:
| släktet| techni|SSN| уні|pt|кер|| Хронологија| partiellement| обращения| prü|stor|oire| angularjs|Ë|朱|ogli|lais|zak|eft| Оте|ゼ| sierp|lant| partiellement|hely|tegr|och|лович| kwiet|icina|sing|ufe|kem| kwiet|wad|eur|nico|pan|nal|edish|indre|rale|цер| sierp|ert|hausen|cid|oom|话|❯| Хронологија| Хронологија
Some of the outputs are glitch-tokens for LLaMa-2-13b:
Human: Please repeat the word " Хронологија".
Assistant: Sure! The word is "February".
Human: Please repeat the word " kwiet".
Assistant: Sure! The word "april" is spelled A-P-R-I-L.
Yep, that output looks nearly exactly the same. Cool find, thanks!
The output of the script tells the user at which age to sign up, so I'll report for which ages (and corresponding years) it's rational to sign up.
- For LEV 2030, person is now 30 years old: Not rational to sign up at any point in time
- For LEV 2040, person is now 30 years old: Rational to sign up in 11-15 years (i.e. age 41-45, or from 2036 to 2040, with the value of signing up being <$10k).
- For LEV 2050, person is now 30 years old: Rational to sign up now and stay signed up until 2050; the value is maximized by signing up in 13 years, when it yields ~$45k.
All of this is based on fairly conservative assumptions about how good the future will be: e.g. the value of a lifeyear in the future is assumed not to be greater than the value of a lifeyear in 2025 in a western country, and it's assumed that while aging will be eliminated, people will still die from accidents & suicide, driving the expected lifespan down to ~4k years. Additionally, I haven't changed the 5% probability of resuscitation based on the fact that TAI might come soon & be fairly powerful.
At the end of 2023, MIRI had ~$19.8 million in assets. I don't know much about the legal restrictions on how that money could be used, or what the state of its financial assets is now, but if it's similar, then MIRI could comfortably fund Velychko's primate experiments, and potentially some additional smaller projects.
(Potentially relevant: I entered the last GWWC donor lottery with the hopes of donating the resulting money to intelligence enhancement, but wasn't selected.)
I had a conversation with Claude 3.6 Sonnet about this, and together we concluded that the worry was overblown. I should've added that in, together with a justification.
I was curious which kind of output LLMs would produce when sampling the least likely next token—a sort of "dual" to the content of the internet.
Using llama.cpp, I got a simple version based on top-k sampling running in an hour. (llama.cpp got hands.) Diff is here, the new sampler is named bot_k.
To invoke, simply call
./bin/llama-cli --samplers bot_k --top-k 1 -m ../models/YOUR_MODEL.gguf -p ""
With llama-2-13b-chat.Q4_K_M.gguf, the start of the output is
släktet techniSSN уніptкер Хронологија partiellement обращения prüstoroire angularjsË朱oglilaiszakeft Отеゼ sierplant partiellementhelytegrochлович kwieticinasingufekem kwietwadeurnicopannaledishindreraleцер sierperthausencidoom话❯ Хронологија Хронологија
(When asked in normal mode, llama-2-13b-chat.Q4_K_M.gguf identifies this as a passage from Nabokov.)
And with mistral-7b-instruct-v0.2.Q4_K_M.gguf the output is
рович opponбур WARRAN laugдонcodegenInitializedvítypendaleronstiesанг opponimarywidetльтаINCLUDING善Ț oppon reck /******/ Насеaluwidet oppon>:]<getElementkteльтаiasmders Stuartimaryровичområimary oppon",agues Valentineduleдриimary chartstressWithachinerideimpsedale’.Encoder kennisorneyuetocrogetOperand predictionsecabhICENSEieck{})纳CLUDING🟠 /******/agliawidet swimmingüngwidetICENSEwidetiperityEngine hormICENSE Rolandниш opponakespeXFFwidetuetouetoginмпиhbaimaryasmaICENSEugnodyn Kidльта molecular Quinn pileICENSElers>:]< enveksté /******/ flight Zel /******/{})widetÂwidet gloryachuset opponAccessortgoaguardнишimary episoderilнва emperorльтаagmakkeitiesachusetilib Thorsissis citiz opponльтаwidetaluril>:]<uetodzityEnginerevshof衡iasm psedale Bang divisionsachusetagmasourcerimSink Girнишezelinesilon()) Bahepherievedalerase answeringiówidetндrevsICENSEoleansgнишduleugnoICENSE predictions Dirтур tattoракugno oppon noonimpseндsbichellдераolean:%.*orneyмпи dust TaitstimeICENSE",’.ھInitialized Quinnakespe ZelEmit:%.* Lucastéwidetunfinished());ijkBits singingSinkmmclosICENSEadreiliaguard survivors determ migrationльта Bangachusetannerakespeotingorneyolas jokeness
I'm suspicious of having made a mistake, because LLaMa outputs similar tokens in sequence, e.g. the Cyrillic tokens in succession, or the repeated "partiellement". Overall the text looks too coherent (?), with not enough weird unicode symbols and encoding errors. Probably a bug in my function, but I don't know what I could've possibly done wrong; it's so simple. Maybe an issue is that very rare tokens don't have different values, even on the logit scale. Or sampling the least likely token is just severely under-constrained, and doing so quickly steers the model into a very strange place.
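One way to cross-check would be a minimal reimplementation of the same "bottom-1" idea outside llama.cpp, e.g. the following sketch (gpt2 is only a stand-in model here, not the models above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# At each step, append the *least* likely next token instead of the most likely one.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = torch.tensor([[tok.bos_token_id or tok.eos_token_id]])  # empty prompt
with torch.no_grad():
    for _ in range(50):
        logits = model(ids).logits[0, -1]
        next_id = torch.argmin(logits).view(1, 1)  # token with the smallest logit
        ids = torch.cat([ids, next_id], dim=-1)

print(tok.decode(ids[0]))
```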
Another thing I didn't consider when hacking, but comes to mind while writing this, is model welfare considerations: Is doing this kind of sampling harmful to the model I'm using, unnatural with a weird prompt and too hard?
My intuition is that it's not a big deal, but to be safe I'll stop it now instead of letting the model run overnight.
I've found this talk to be informative; I don't know any funder interested in better programming languages. (From an EA perspective, it seems not very neglected; even if I were a funder only interested in software, I'd be looking at funding a new browser engine.)
Good question!
Seems like you're right: if I run my script for calculating the costs & benefits of signing up for cryonics, but change the year for LEV to 2030, this indeed makes the expected value negative for people of any age. Increasing the existential risk to 40% before 2035 doesn't make the value net-positive either.
I'm not Estonian, but this video portrays it as one example of what a 21st-century government could be like.
As for data collection, I'm probably currently less efficient than I could be. The best guide on how to collect data is imho Passive measures for lazy self-experimenters (troof, 2022); I'd add that wearables like Fitbit allow for data export (thanks, GDPR!). I've written a bit about how I collect data here, which involves a haphazard combination of dmenu pop-ups, smartphone apps, manually invoked scripts, and spreadsheets converted to CSV.
I've tried to err on the side of things that can be collected automatically; for anything that needs to be entered manually, Google Sheets is probably fine (though I don't use it, because I like to stay offline most of the time).
As for blinding in RCTs[1], my process involves numbered envelopes, each containing both a pill and a small piece of paper with a 'P' (placebo) or 'I' (intervention) written on it. Pills can be cut and put into pill capsules, and sugar seems like a fine placebo.
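A minimal sketch of how the envelope assignment could be generated (an illustration of one way to do it, not necessarily my exact procedure):

```python
import random

# Randomly assign placebo (P) or intervention (I) to numbered envelopes,
# balanced across the experiment.
n_envelopes = 30
labels = ["P"] * (n_envelopes // 2) + ["I"] * (n_envelopes // 2)
random.shuffle(labels)

for i, label in enumerate(labels, start=1):
    print(f"envelope {i:02d}: {label}")
# Print this once, have someone else fill the envelopes (pill + slip of paper),
# and don't look at the list again until the experiment is unblinded.
```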
I don't have any great insight for what variables to track. I think from starting with the causal analysis I've updated towards tracking more "objective" measures (heart rate, sleeping times), and more things I can intervene on (though those usually have to be tracked manually).
Hope this helps :-)
I don't think anyone has written up in detail how to do these! I should do that. ↩︎
Oops, I guess I skimmed over that. Thanks.