Posts
Comments
Just checked which of the authors of the Weak-To-Strong Generalization paper are still at OpenAI:
- Collin Burns
- Jan Hendrik Kirchner
- Leo Gao
- Bowen Baker
- Yining Chen
- Adrien Ecoffet
- Manas Joglekar
- Jeff Wu
Gone are:
- Ilya Sutskever
- Pavel Izmailov[1]
- Jan Leike
- Leopold Aschenbrenner
Reason unknown ↩︎
The image is not showing.
This is a good question, especially since there've been some short form posts recently that are high quality and would've made good top-level posts—after all, posts can be short.
Back then I didn't try to get the hostel to sign the metaphorical assurance contract with me, maybe that'd work. A good dominant assurance contract website might work as well.
I guess if you go camping together then conferences are pretty scalable, and if I was to organize another event I'd probably try to first message a few people to get a minimal number of attendees together. After all, the spectrum between an extended party and a festival/conference is fluid.
Trying to organize a festival probably isn't risky. It doesn't seem like it'd involve too much time or money.
I don't think that's true. I've co-organized one weekend-long retreat in a small hostel for ~50 people, and the cost was ~$5k. My co-organizers & I probably spent ~50h in total on organizing the event, as volunteers.
See also Guide to rationalist interior decorating (mingyuan, 2023).
That's unfortunate that you are less likely to come, and I'm glad to get the feedback. I could primarily reply with reasons why I think it was the right call (e.g. helpful for getting the event off the ground, helpful for pinpointing the sort of ideas+writing the event is celebrating, I think it's prosocial for me to be open about info like this generally, etc) but I don't think that engages with the fact that it left you personally less likely to come. I still overall think if the event sounds like a good time to you (e.g. interesting conversations with people you'd like to talk to and/or exciting activities) and it's worth the cost to you then I hope you come :-)
Maybe to clarify my comment: I was merely describing my (non-endorsed[1]) observed emotional content wrt the festival, and my intention with the comment was not to wag my finger at you guys in the manner of "you didn't invite me".
I wonder whether other people have a similar emotional reaction.
I appreciate Lightcone being open with the information around free invitations though! I think I'd have bought a ticket anyway if I had time around that weekend, and I think I'd probably have a blast if I attended.
Btw: What's the chance of a 2nd LessOnline?
I think my reaction is super bound up in icky status-grabbing/status-desiring/inner-ring-infiltrating parts of my psyche which I'm not happy with. ↩︎
Oops, you're correct about the typo and also about how this doesn't restrict belief change to Brownian motion. Fixing the typo.
- Putting the festival at the same time as EAG London is unfortunate.
- Giving out "over 100 free tickets" induces (in me) a reaction of "If I'm not invited I'm not going to buy a ticket". This is perhaps because I hope/wish to slide into those 100 slots, even though it's unrealistic. I believe other events solve this by just giving a list of a bunch of confirmed attendees, and being silent about giving out free tickets to those.
Because[1] for a Bayesian reasoner, there is conservation of expected evidence.
Although I've seen it mentioned that technically the change in the belief on a Bayesian should follow a Martingale, and Brownian motion is a martingale.
I'm not super technically strong on this particular part of the math. Intuitively it could be that in a bounded reasoner which can only evaluate programs in some complexity class C, any pattern in its beliefs that can be described by an algorithm in C is detected and the predicted future belief from that pattern is incorporated into current beliefs. On the other hand, any pattern described by an algorithm outside C can't be in the class of hypotheses of the agent, including hypotheses about its own beliefs, so those patterns persist. ↩︎
Thank you a lot for this. I think this or @Thomas Kwa's comment would make an excellent original-sequences-style post—it doesn't need to be long, but just going through an example and talking about the assumptions would be really valuable for applied rationality.
After all, it's about how much one should expect one's beliefs to vary, which is pretty important.
Thank you a lot! Strong upvoted.
I was wondering a while ago whether Bayesianism says anything about how much my probabilities are "allowed" to oscillate around—I was noticing that my probability of doom was often moving by 5% in the span of 1-3 weeks, though I guess this was mainly due to logical uncertainty and not empirical uncertainty.
Since there are ten 5%-steps between 50% and 0 or 1, a symmetric random walk takes ~10² = 100 steps to be absorbed, so over ~10 years I should expect to make these kinds of updates ~100 times, or 10 times a year, or a little bit less than once a month, right? So I'm currently updating "too much".
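The back-of-the-envelope above matches the standard gambler's-ruin result: a symmetric ±5% walk starting at 50% with absorbing barriers at 0 and 1 takes on average 10·10 = 100 steps to settle. A quick sketch to check it (treating updates as a symmetric walk is a simplifying assumption; a real Bayesian's belief stream need only be a martingale):

```python
import random

def updates_until_settled(p0=0.5, step=0.05, rng=random):
    """Count +/-step belief updates until the walk is absorbed at 0 or 1."""
    p, n = p0, 0
    while 0 < p < 1:
        p += step if rng.random() < 0.5 else -step
        p = round(p, 10)  # keep float drift from stepping past the barriers
        n += 1
    return n

rng = random.Random(42)
trials = 20_000
mean = sum(updates_until_settled(rng=rng) for _ in range(trials)) / trials
# gambler's ruin: expected absorption time from the middle is 10 * 10 = 100 steps
```

The simulated mean comes out close to the theoretical 100 steps.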
- Transformers Represent Belief State Geometry in their Residual Stream: 6
- D&D.Sci: -5
- Open Thread Spring 2024: 3
- Introducing AI Lab Watch: -3
- An explanation of evil in an organized world: -3
- Mechanistically Eliciting Latent Behaviors in Language Models: 3
- Shane Legg's necessary properties for every AGI Safety plan: -1
- LessWrong Community Weekend 2024, open for applications: -6
- Ironing Out the Squiggles: 5
- ACX Covid Origins Post convinced readers: -7
- Why I'm doing PauseAI: -2
- Manifund Q1 Retro: Learnings from impact certs: -1
- Questions for labs: -3
- Refusal in LLMs is mediated by a single direction: 5
- Take SCIFs, it’s dangerous to go alone: 4
Thanks, that makes sense.
Someone strong-downvoted a post/question of mine with a downvote strength of 10, if I remember correctly.
I had initially just planned to keep silent about this, because that's their good right to do, if they think the post is bad or harmful.
But since the downvote, I can't shake off the curiosity of why that person disliked my post so strongly—I'm willing to pay $20 for two/three paragraphs of explanation by the person why they downvoted it.
The standard way of dealing with this:
Quantify how much worse the PRC getting AGI would be than OpenAI getting it, or the US government, and how much existential risk there is from not pausing/pausing, or from the PRC/OpenAI/the US government building AGI first, and then calculate whether pausing to do {alignment research, diplomacy, sabotage, espionage} is higher expected value than moving ahead.
(Is China getting AGI first half the value of the US getting it first, or 10%, or 90%?)
The discussion over pause or competition around AGI has been lacking this so far. Maybe I should write such an analysis.
Gentlemen, calculemus!
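A minimal sketch of what such a calculation could look like. Every number below is a made-up placeholder, not an estimate I endorse; the point is the structure of the comparison, not the inputs:

```python
# Toy expected-value comparison for "race" vs "pause" -- all inputs hypothetical.
v_us, v_prc = 1.0, 0.5                 # value of US-led vs PRC-led AGI (US first = 1)
xrisk_race, xrisk_pause = 0.30, 0.15   # extinction risk if racing vs pausing
p_prc_first_if_race = 0.10             # chance PRC gets AGI first, given racing
p_prc_first_if_pause = 0.40            # chance PRC gets AGI first, given a US pause

def ev(xrisk, p_prc_first):
    """Expected value: survive with prob (1 - xrisk), then split by who wins."""
    survive = 1 - xrisk
    return survive * (p_prc_first * v_prc + (1 - p_prc_first) * v_us)

ev_race = ev(xrisk_race, p_prc_first_if_race)    # 0.7 * 0.95 = 0.665
ev_pause = ev(xrisk_pause, p_prc_first_if_pause) # 0.85 * 0.8 = 0.68
```

Under these particular placeholders pausing edges out racing; the value of writing it down is that the disagreement gets localized into a handful of inspectable numbers.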
I realized I hadn't given feedback on the actual results of the recommendation algorithm. Rating the recommendations I've gotten (from -10 to 10, 10 is best):
- My experience using financial commitments to overcome akrasia: 3
- An Introduction to AI Sandbagging: 3
- Improving Dictionary Learning with Gated Sparse Autoencoders: 2
- [April Fools' Day] Introducing Open Asteroid Impact: -6
- LLMs seem (relatively) safe: -3
- The first future and the best future: -2
- Examples of Highly Counterfactual Discoveries?: 5
- "Why I Write" by George Orwell (1946): -3
- My Clients, The Liars: -4
- 'Empiricism!' as Anti-Epistemology: -2
- Toward a Broader Conception of Adverse Selection: 4
- Ambitious Altruistic Software Engineering Efforts: Opportunities and Benefits: 6
The obsessive autists who have spent 10,000 hours researching the topic and writing boring articles in support of the mainstream position are left ignored.
It seems like you're describing three different categories of thinkers: academics, public intellectuals, and "obsessive autists".
Notice that the examples you give overlap in those categories: Hanson and Caplan are academics (professors!), while Natália Mendonça is not an academic, but is approaching being a public intellectual by now(?). Similarly, Scott Alexander strikes me as being in the "public intellectual" bucket much more than any other bucket.
So your conclusion, as far as I read the article, should be "read obsessive autists" instead of "read obsessive autists that support the mainstream view". This is my current best guess—"obsessive autists" are usually not under much strong pressure to say politically palatable things, very unlike professors.
My best guess is that people in these categories were ones that were high in some other trait, e.g. patience, which allowed them to collect datasets or make careful experiments for quite a while, thus enabling others to make great discoveries.
I'm thinking for example of Tycho Brahe, who is best known for 15 years of careful astronomical observation & data collection, or Gregor Mendel's 7-year-long experiments on peas. Same for Dmitry Belayev and fox domestication. Of course I don't know their cognitive scores, but those don't seem like a bottleneck in their work.
So the recipe to me looks like "find an unexplored data source that requires long-term observation to bear fruit, but would yield a lot of insight if studied closely, then investigate".
I think the Diesel engine would've taken 10 years or 20 years longer to be invented: From the Wikipedia article it sounds like it was fairly unintuitive to the people at the time.
A core value of LessWrong is to be timeless and not news-driven.
I do really like the simplicity and predictability of the Hacker News algorithm. More karma means more visibility, older means less visibility.
Our current goal is to produce a recommendations feed that both makes people feel like they're keeping up to date with what's new (something many people care about) and also suggest great reads from across LessWrong's entire archive.
I hope that we can avoid getting swallowed by Shoggoth for now by putting a lot of thought into our optimization targets
(Emphasis mine.)
Here's an idea[1] for a straightforward(?) recommendation algorithm: Quantilize over all past LessWrong posts by using inflation-adjusted karma as a metric of quality.
The advantage is that this is dogfooding on some pretty robust theory. I think this isn't super compute-intensive, since the only thing one has to do is to compute the cumulative distribution function once a day (associating it with the post), and then inverse transform sampling from the CDF.
Recommending this way has the disadvantage of not being recency-favoring (which I personally like), and not personalized (which I also like).
By default, it also excludes posts below a certain karma threshold. That could be solved by exponentially tilting the distribution instead of cutting it off (with the tilt parameter to be determined (experimentally?)). Such a recommendation algorithm wouldn't be as robust against very strong optimizers, but since we have some idea what high-karma LessWrong posts look like (& we're not dealing with a superintelligent adversary… yet), that shouldn't be a problem.
If I was more virtuous, I'd write a pull request instead of a comment. ↩︎
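A sketch of what the sampling step could look like. Post titles and karma numbers are made up, and `random.choices` with weights does the inverse transform sampling that the CDF description amounts to:

```python
import math
import random

# Hypothetical posts with inflation-adjusted karma (made-up data).
posts = [("Post A", 310), ("Post B", 120), ("Post C", 95), ("Post D", 40),
         ("Post E", 22), ("Post F", 8), ("Post G", 3), ("Post H", 1)]

def quantilize(posts, q, rng):
    """Sample uniformly from the top q-fraction of posts by karma (hard cutoff)."""
    top = sorted(posts, key=lambda p: p[1], reverse=True)[:max(1, int(len(posts) * q))]
    return rng.choice(top)

def tilted_sample(posts, beta, rng):
    """Soften the cutoff by exponential tilting: P(post) proportional to exp(beta * karma)."""
    weights = [math.exp(beta * karma) for _, karma in posts]
    return rng.choices(posts, weights=weights, k=1)[0]
```

Here beta interpolates between recommending uniformly at random (beta = 0) and always recommending the top-karma post (beta large).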
There are several sequences which are visible on the profiles of their authors, but haven't yet been added to the library. Those are:
- «Boundaries» Sequence (Andrew Critch)
- Maximal Lottery-Lotteries (Scott Garrabrant)
- Geometric Rationality (Scott Garrabrant)
- UDT 1.01 (Diffractor)
- Unifying Bargaining (Diffractor)
- Why Everyone (Else) Is a Hypocrite: Evolution and the Modular Mind (Kaj Sotala)
- The Sense Of Physical Necessity: A Naturalism Demo (LoganStrohl)
- Scheming AIs: Will AIs fake alignment during training in order to get power? (Joe Carlsmith)
I think these are good enough to be moved into the library.
My reason for caring about internal computational states is: In the twin prisoners' dilemma[1], I cooperate because we're the same algorithm. If we modify the twin to have a slightly longer right index-finger-nail, I would still cooperate, even though we're now different algorithms, because little enough has been changed about the algorithm and its internal states that they're still similar enough.
But it could be that I'm in a prisoner's dilemma with some program p that, given some inputs, returns the same outputs as I do, but for completely different "reasons"—that is, the internal states are very different, and a slight change in input would cause its output to be radically different. My logical correlation with p is pretty small, because, even though it gives the same output, it gives that output for very different reasons, so I don't have much control over its outputs by controlling my own computations.
At least, that's how I understand it.
Is this actually ECL, or just acausal trade? ↩︎
I don't have a concrete usage for it yet.
The strongest logical correlation is -0.5, the lower the better.
For two programs p₁, p₂ with identical traces, the logical correlation would be −1/2, assuming that p₁ and p₂ have the same output. This is a pretty strong logical correlation.
This is because equal output guarantees a logical correlation of at most 0, and one can then improve the logical correlation by also having similar traces. If the outputs have string distance 1, then the smallest logical correlation can be only 0.5.
Whenever people have written/talked about ECL, a common thing I've read/heard was that "of course, this depends on us finding some way of saying that one decision algorithm is similar/dissimilar to another one, since we're not going to encounter the case of perfect copies very often". This was at least the case when I last asked Oesterheld about this, but I haven't read Treutlein 2023 closely enough yet to figure out whether he has a satisfying solution.
The fact we didn't have a characterization of logical correlation bugged me and was in the back of my mind, since it felt like a problem that one could make progress on. Today in the shower I was thinking about this, and the post above is what came of it.
(I also have the suspicion that having a notion of "these two programs produce the same/similar outputs in a similar way" might be handy in general.)
Let me propose the most naïve formula for logical correlation[1].
Let a program p be a tuple of the code for a Turing machine, the intermediate tape states after each command execution, and the output, all in binary.
That is p = (c, t, o), with t = (t(1), …, t(n)) and c, t(i), o ∈ {0,1}*.
Let nᵢ be the number of steps that pᵢ takes to halt.
Then a formula for the logical correlation 合[2] of two halting programs p₁, p₂, a tape-state discount factor γ[3], and a string-distance metric d: {0,1}* × {0,1}* → ℕ could be
合(p₁, p₂, γ) = d(o₁, o₂) − 1/(2 + Σₖ γ^k · d(t₁(n₁−k), t₂(n₂−k))), with k ranging over 0, …, min(n₁, n₂) − 1.
The lower 合, the higher the logical correlation between p₁ and p₂. The minimal value is −1/2.
If p₁ = p₂, then it's also the case that 合(p₁, p₂, γ) = −1/2.
One might also want to be able to deal with the fact that programs have different trace lengths, and penalize that, e.g. amending the formula:
合(p₁, p₂, γ) = d(o₁, o₂) − 1/(2 + |n₁ − n₂| + Σₖ γ^k · d(t₁(n₁−k), t₂(n₂−k)))
I'm a bit unhappy that the code c doesn't factor into the logical correlation, and ideally one would want to be able to compute the logical correlation without having to run the program.
How does this relate to data=code?
Actually not explained in detail anywhere, as far as I can tell. I'm going to leave out all motivation here. ↩︎
Suggested by GPT-4. Stands for joining, combining, uniting. Also "to suit; to fit", "to have sexual intercourse", "to fight, to have a confrontation with", or "to be equivalent to, to add up". ↩︎
Which is needed because tape states close to the output are more important than tape states early on. ↩︎
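A sketch of the formula in code, under my assumptions about the intended indexing (traces aligned from the halting step backwards, so late tape states weigh the most) and with Levenshtein distance as the string metric d:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two binary strings (the metric d)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def logcorr(p1, p2, gamma=0.9):
    """Naive logical correlation: a program here is (trace, output),
    with trace a list of tape states aligned from the final state backwards."""
    (t1, o1), (t2, o2) = p1, p2
    s = sum(gamma ** k * levenshtein(t1[-1 - k], t2[-1 - k])
            for k in range(min(len(t1), len(t2))))
    return levenshtein(o1, o2) - 1 / (2 + s)
```

Quick sanity checks: identical programs give −1/2, equal outputs give something in [−1/2, 0), and outputs at distance 1 give at least 1/2; the code c of the programs indeed never enters.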
There's this intro series by @Alex Lawsen.
That seems like an odd motte-and-bailey-style explanation (and likely belief; as you say, misgeneralized).
From my side or theirs?
Huh. Intuitively this doesn't feel like it rises to the quality needed for a post, but I'll consider it. (It's in the rats tail of all the thoughts I have about subagents :-))
(Also: Did you accidentally a word?)
What then prevents humans from being more terrible to each other? If the vast majority of people are like this, and they know that the vast majority of others are also like this, up to common knowledge, then I don't see how you'd get a stable society in which people aren't usually screwing each other over a giant amount.
Prompted by this post, I think that now is a very good time to check how easy it is for someone (with access to generative AI) impersonating you to get access to your bank account.
I'm on a Twitter lent at the moment, but I remember this thread. There's also a short section in an interview with David Deutsch:
So all hardware limitations on us boil down to speed and memory capacity. And both of those can be augmented to the level of any other entity that is in the universe. Because if somebody builds a computer that can think faster than the brain, then we can use that very computer or that very technology to make our thinking go just as fast as that. So that's the hardware.
[…]
So if we take the hardware, we know that our brains are Turing-complete bits of hardware, and therefore can exhibit the functionality of running any computable program and function.
and:
So the more memory and time you give it, the more closely it could simulate the whole universe. But it couldn't ever simulate the whole universe or anything near the whole universe because it is hard for it to simulate itself. Also, the sheer size of the universe is large.
I think this happens when people encounter Deutsch's claim that humans are universal explainers, and then misgeneralize the claim to Turing machines.
So the more interesting question is: Is there a computational class somewhere between FSAs and PDAs that is able to, given enough "resources", execute arbitrary programs? What physical systems do these correspond to?
Related: Are there cognitive realms? (Tsvi Benson-Tilsen, 2022)
Yes, I was interested in the first statement, and not thinking about the second statement.
Not "humans are a general turing-complete processing system", that's clearly false
Critical rationalists often argue that this (or something very related) is true. I was not talking about whether humans are fully implementable on a Turing machine—that seems true to me, but it was not the question I was interested in.
Could you explain more about being coercive towards subagents? I'm not sure I'm picking up exactly what you mean.
A (probably-fake-)framework I'm using is to imagine my mind being made up of subagents with cached heuristics about which actions are good and which aren't. They function in a sort-of-vetocracy—if any one subagent doesn't want to engage in an action, I don't do it. This can be overridden, but doing so carries the cost of the subagent "losing trust" in the rest of the system and next time putting up even more resistance (this is part of how ugh fields develop).
The "right" way to solve this is to find some representation of the problem-space in which the subagent can see how its concerns are adressed or not relevant to the situation at hand. But sometimes there's not enough time or mental energy to do this, so the best available solution is to override the concern.
This seems right. One thing I would say is that kind of surprisingly it hasn't been the most aversive tasks where the app has made the biggest difference, it's the larger number of moderately aversive tasks. It makes expensive commitments cheap and cheap commitments even cheaper, and for me it has turned out that cheap commitments have made up most of the value.
Maybe for me the transaction costs are still a bit too high to be using commitment mechanisms, which means I should take a look at making this smoother.
Trying to Disambiguate Different Questions about Whether Humans are Turing Machines
I often hear the sentiment that humans are Turing machines, and that this sets humans apart from other pieces of matter.
I've always found those statements a bit strange and confusing, so it seems worth it to tease apart what they could mean.
The question "is a human a Turing machine" is probably meant to convey "can a human mind execute arbitrary programs?", that is, "are the languages the human brain can emit at least recursively enumerable?", as opposed to e.g. context-free languages.
- My first reaction is that humans are definitely not Turing machines, because we lack the infinite amount of memory a Turing machine has in the form of an (idealized) tape. Indeed, in the Chomsky hierarchy humans aren't even at the level of pushdown automata; instead we are nothing more than finite state automata. (I remember a professor pointing out to us that all physical instantiations of computers are merely finite-state automata.)
- Depending on one's interpretation of quantum mechanics, one might instead argue that we're at least nondeterministic finite automata or even Markov chains. However, every nondeterministic finite automaton can be transformed into a deterministic finite automaton, albeit at an exponential increase in the number of states, and Markov chains aren't more computationally powerful (e.g. they can't recognize Dyck languages, just as DFAs can't).
- It might be that Quantum finite automata are of interest, but I don't know enough about quantum physics to make a judgment call.
- The above argument only applies if we regard humans as closed systems with clearly defined inputs and outputs. When probed, many proponents of the statement "humans are Turing machines" indeed fall back to a motte that in principle a human could execute every algorithm, given enough time, pen and paper.
- This seems true to me, assuming that the matter in the universe does not have a limited amount of computation it can perform.
- In a finite universe we are logically isolated from almost all computable strings, which seems pretty relevant.
- Another constraint is from computational complexity; should we treat things that are not polynomial-time computable as basically unknowable? Humans certainly can't solve NP-complete problems efficiently.
- I'm not sure this is a very useful notion.
- On the one hand, I'd argue that, by orchestrating the exactly right circumstances, a tulip could receive specific stimuli to grow in the right directions, knock the correct things over, lift other things up with its roots, create offspring that perform subcomputations &c to execute arbitrary programs. Conway's Game of Life certainly manages to! One might object that this is set up for the tulip to succeed, but we also put the human in a room with unlimited pens and papers.
- On the other hand, those circumstances would have to be very exact, much more so than with humans. But that again is a difference in degree, not in kind.
In the end, I come down with the following conclusion: Humans are certainly not Turing machines; however, there might be a (much weaker) notion of generality that humans fulfill and other physical systems don't (or don't as much). But this notion of generality is purported to be stronger than that of life.
I don't know of any formulation of such a criterion of generality, but would be interested in seeing it fleshed out.
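The pushdown-vs-finite-state gap above can be made concrete with the Dyck languages mentioned earlier: recognizing balanced parentheses needs a counter that can grow without bound, which no fixed finite-state automaton has (a DFA with k states is fooled by nesting deeper than k, by the usual pumping argument). A minimal one-counter recognizer:

```python
def balanced(s: str) -> bool:
    """Recognize the Dyck language over '()' using an unbounded counter
    (the one-counter special case of a pushdown automaton's stack)."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing paren with nothing to match
        else:
            return False  # symbol outside the alphabet
    return depth == 0
```

The `depth` counter is exactly the resource a finite-state machine lacks: for any bound on it, there's an input that exceeds the bound.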
Agree with the recommendation of using such websites.
I've found them to be very effective, especially for highly aversive medium-term tasks like applying for jobs, or finishing a project. Five times I wanted a thing to happen, but was really procrastinating on it. And five times pulling out the big guns (beeminder) got it done.
I haven't tried using them for long-term commitments, since my intuition is that using them is highly coercive towards subagents which then further entrench their opposition. So I've used these apps sparingly. Maybe I'll still give it a try.
I think these apps work best if really aversive tasks are set to an estimated-bare-minimum commitment, which in reality is probably the median amount of what one can get done.
There's also the Fluidity podcast narrating Meaningness, which I mistakenly believed was narrated by you.
Ideally I would not go with the extreme. I would instead choose a relatively light ‘this is not allowed’ where in practice we mostly look the other way.
I'm not going to argue about assisted suicide here, but I am going to remark that cryonicists are sometimes in a dilemma where they know they have a neurodegenerative disease, which is destroying the information content of their personality, but they can't (legally) go into cryopreservation before most of the damage has been done.
Obscure request:
Short story by Yudkowsky, on a reddit short fiction subreddit, about a time traveler coming back to the 19th century from the 21st. The time traveler is incredibly distraught about the red tape in the future, screaming about molasses and how it's illegal to sell food on the street.
Nevermind, found it.
An African grey parrot costs ~$2k/parrot. A small breeding population might be ~150 individuals (fox domestication started out "with 30 male foxes and 100 vixens"). Let's assume cages cost $1k/parrot, including perches, feeding- and water-bowls. The estimated price for an avian vet is $400/parrot-year.
This page also says that African greys produce feather dust, and one therefore needs airfilters (which are advisable anyway). Let's say we need one for every 10 parrots, costing $500 each.
Let's say the whole experiment takes 50 years, which is ~7 generations. I'll assume that the number of parrots is not fluctuating due to breeding them at a constant rate.
Let's say it takes $500/parrot for feed and water (just a guess, I haven't looked this up).
We also have to buy a building to house the parrots in. 2m²/parrot at $100/m² in rural areas, plus $200k for a building housing 50 parrots each (I've guessed those numbers). Four staff perhaps (working 8 hours/day), expense at $60/staff-hour, 360 days a year.
The total cost is then 150·$2k + 15·$500 + 150·$1k + 150·50·($400+$500) + 3·$200k + 2·150·$100 + 50·360·8·4·$60 ≈ $42.4 mio, dominated by the staff costs (~$34.6 mio).
I assume the number is going to be very similar for other (potentially more intelligent) birds like keas.
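Tallying the line items above (all inputs are the guesses stated in the text; labor at $60/staff-hour dominates everything else):

```python
parrots, years = 150, 50
costs = {
    "breeding stock": parrots * 2_000,            # $2k/parrot
    "air filters": (parrots // 10) * 500,         # one $500 filter per 10 parrots
    "cages": parrots * 1_000,                     # $1k/parrot
    "vet + feed": parrots * years * (400 + 500),  # $400 + $500 per parrot-year
    "buildings": (parrots // 50) * 200_000,       # $200k per 50-parrot building
    "land": 2 * parrots * 100,                    # 2 m^2/parrot at $100/m^2
    "staff": years * 360 * 8 * 4 * 60,            # 4 staff, 8 h/day, 360 d/y, $60/h
}
total = sum(costs.values())  # 42,397,500
```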
If some common variable C is causally upstream of both A and B, then I wouldn't say that A causes B, or B causes A—intervening on A can't possibly change B, and intervening on B can't change A (this is Pearl's notion of causation).
I don't think this is true, there's this paper and this post from November 2023, and these two papers from April and October 2022, respectively.
(Note that I've only read the single-turn paper.)
(my react is me rolling my eyes)
it felt very unnatural to tech giants to reward people trying to break their software;
Same for governments, afaik most still don't have bug bounty programs for their software.
Nevermind, a short google shows multiple such programs, although others have been hesitant to adopt them.
It's a bit like asking the question, what proportion of yes/no questions have the answer "yes"?
Modus ponens, modus tollens: I am interested in that question, and the answer (for questions considered worth asking to forecasters) is ~40%.
But having a better selection of causal graphs than just "uniformly" would be good. I don't know how to approach that, though—is the world denser or sparser than what I chose?
I keep all of my conversations. Additionally, I sometimes have the wish to search in all my conversations ("I've talked about this already")—but ChatGPT doesn't allow for this.
Thanks for taking the time to answer my question, as someone from the field!
The links you've given me are relevant to my question, and I can now rephrase my question as "in general, if we observe that two things aren't correlated, how likely is it that one influences the other", or, more simply: how good is absence of correlation as evidence for absence of causation?
People tend to give examples of cases in which the absence of correlation goes hand in hand with the presence of causation, but I wasn't able to find an estimate of how often this occurs, which is potentially useful for the purposes of practical epistemology.
I want to push back a little bit on this simulation not being valuable—taking simple linear models is a good first step, and I've often been surprised by how linear things in the real world often are. That said, I chose linear models because they were fairly easy to implement, and I wanted to find an answer quickly.
And, just to check: Your second and third example are both examples of correlation without causation, right?
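One way to make the linear-model caveat concrete: the textbook example of causation with (near-)zero linear correlation is y = x² for symmetrically distributed x, which a Pearson-correlation check misses entirely:

```python
import random

rng = random.Random(0)
n = 50_000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi * xi + rng.gauss(0, 0.1) for xi in x]  # x fully determines y (plus noise)

def pearson(a, b):
    """Sample Pearson correlation coefficient."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

r = pearson(x, y)  # near zero, despite direct causation
```

So any estimate of "how often does causation come without correlation" is sensitive to how many such nonlinear (or canceling-path) mechanisms the model class can express.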