Posts

Schelling points in the AGI policy space 2024-06-26T13:19:25.186Z
"How could I have thought that faster?" 2024-03-11T10:56:17.884Z
Dual Wielding Kindle Scribes 2024-02-21T17:17:58.743Z
[Repost] The Copenhagen Interpretation of Ethics 2024-01-25T15:20:08.162Z
EPUBs of MIRI Blog Archives and selected LW Sequences 2023-10-26T14:17:11.538Z
An EPUB of Arbital's AI Alignment section 2023-10-16T19:36:29.109Z
[outdated] My current theory of change to mitigate existential risk by misaligned ASI 2023-05-21T13:46:06.570Z
mesaoptimizer's Shortform 2023-02-14T11:33:14.128Z

Comments

Comment by mesaoptimizer on [deleted post] 2024-07-25T21:24:18.855Z

Yeah I think yours has achieved my goal -- a post to discuss this specific research advance. Please don't delete your post -- I'll move mine back to drafts.

Comment by mesaoptimizer on "AI achieves silver-medal standard solving International Mathematical Olympiad problems" · 2024-07-25T16:59:27.073Z · LW · GW

I searched for it and found none. The Twitter conversation also seems to imply that no paper or technical report has been released yet.

Comment by mesaoptimizer on [Closed] Prize and fast track to alignment research at ALTER · 2024-07-25T12:08:21.657Z · LW · GW

Based on your link, it seems like nobody even submitted anything to the contest throughout the time it existed. Is that correct?

Comment by mesaoptimizer on Neural networks as non-leaky mathematical abstraction · 2024-07-15T17:51:23.673Z · LW · GW

yet mathematically true

This only seems to be the case because the equals sign is redefined in that sentence.

Comment by mesaoptimizer on Ryan Kidd's Shortform · 2024-07-15T07:50:37.047Z · LW · GW

I expect that Ryan means to say one of these things:

  1. There isn't enough funding for MATS grads to do useful work in the research directions they are working on, that have already been vouched for by senior alignment researchers (especially their mentors) to be valuable. (Potential examples: infrabayesianism)
  2. There isn't (yet) institutional infrastructure to support MATS grads to do useful work together as part of a team focused on the same (or very similar) research agendas, and that this is the case for multiple nascent and established research agendas. They are forced to go to academia and disperse across the world instead of being able to work together in one location. (Potential examples: selection theorems, multi-agent alignment (of the sort that Caspar Oesterheld and company work on))
  3. There aren't enough research managers in existing established alignment research organizations or frontier labs to enable MATS grads to work on the research directions they consider extremely high value, and that would benefit from multiple people working on them together. (Potential examples: activation steering)

I'm pretty sure that Ryan does not mean to say that MATS grads cannot do useful work on their own. The point is that we don't yet have the institutional infrastructure to absorb, enable, and scale new researchers the way our civilization has for existing STEM fields via, say, PhD programs or yearlong fellowships at OpenAI/MSR/DeepMind (which are also pretty rare). AFAICT, the most valuable part of such infrastructure in general is the ability to co-locate researchers working on the same or similar research problems -- this is standard for academic and industry research groups, for example, and from experience I know that being able to do so is invaluable. Another extremely valuable facet of institutional infrastructure that enables researchers is the ability to delegate operations and logistics problems -- particularly the difficulty of finding grant funding, interfacing with other organizations, getting paperwork handled, etc.

I keep getting more and more convinced, as time passes, that it would be more valuable for me to work on building the infrastructure to enable valuable teams and projects, than to simply do alignment research while disregarding such bottlenecks to this research ecosystem.

Comment by mesaoptimizer on mesaoptimizer's Shortform · 2024-07-15T07:32:19.295Z · LW · GW

I've become somewhat pessimistic about encouraging regulatory power over AI development recently after reading this Bismarck Analysis case study on the level of influence (or lack of it) that scientists had over nuclear policy.

The impression I got from some other secondary/tertiary sources (specifically the book Organizing Genius) was that General Groves, the military man who was the interface between the military and Oppenheimer and the Manhattan Project, did his best to shield the Manhattan Project scientists from military and bureaucratic drudgery, and that Vannevar Bush was someone who served as an example of a scientist successfully steering policy.

This case study seems to show that Groves was significantly less of a value add than I thought, given the likelihood of him having destroyed Leo Szilard's political influence (and therefore Leo's ability to influence nuclear policy in the direction of preventing an arms race or the weapon's use in war). Bush also seems like a disappointment -- he waited months for information to pass through 'official channels' before he attempted to persuade people like FDR to begin a nuclear weapons development program. On top of that, Bush seems to have internalized the bureaucratic norms of the political and military hierarchy he worked in -- when a scientist named Ernest Lawrence tried to reach the relevant government officials to talk about the importance of nuclear weapons development, Bush (according to this paper) got so annoyed by him seemingly bypassing the 'chain of command' (I assume by focusing on talking to people Bush would report to, instead of to Bush himself) that he threatened to politically marginalize Ernest.

Finally, I see clear parallels between the ineffective attempts by these physicists at influencing nuclear weapons policy via contributing technically and trying to build 'political capital', and the ineffective attempts by AI safety engineers and researchers who decide to go work at frontier labs (OpenAI is the clearest example) with the intention of building influence with the people in there so that they can steer things in the future. I'm pretty sure at this point that such a strategy is a pretty bad idea, given that it seems better to do nothing than to contribute to accelerating towards ASI.

There are galaxy-brained counter-arguments to this claim, such as davidad's supposed game-theoretic model that (AFAICT) involves accelerating to AGI powerful enough to make the provable safety agenda viable, or Paul Christiano's (again, AFAICT) plan that's basically 'given intense economic pressure for better capabilities, we shall see a steady and continuous improvement, so the danger actually is in discontinuities that make it harder for humanity to react to changes, and therefore we should accelerate to reduce compute overhang'. I remain unconvinced by them.

Comment by mesaoptimizer on The Golden Mean of Scientific Virtues · 2024-07-10T06:36:47.567Z · LW · GW

I’m optimizing for consistently writing and publishing posts.

I agree with this strategy, and I plan to begin something similar soon. I forgot that Epistemological Fascinations is your less polished and more "optimized for fun and sustainability" substack. (I have both your substacks in my feed reader.)

Comment by mesaoptimizer on The Golden Mean of Scientific Virtues · 2024-07-08T22:09:28.085Z · LW · GW

I really appreciate this essay. I also think that most of it consists of sazens. When I read your essay, I find my mind bubbling up concrete examples of experiences I've had, that confirm or contradict your claims. This is, of course, what I believe is expected from graduate students when they are studying theoretical computer science or mathematics courses -- they'd encounter an abstraction, and it is on them to build concrete examples in their mind to get a sense of what the paper or textbook is talking about.

However, when it comes to more inchoate domains like research skill, such writing does very little to help the inexperienced researcher. It is more likely that they'd simply miss out on the point you are trying to tell them, for they haven't failed both by, say, being too trusting (a common phenomenon) and being too wary of 'trusting' (a somewhat rare phenomenon for someone who gets to the big leagues as a researcher). What would actually help is either concrete case studies, or a tight feedback loop that involves a researcher trying to do something, and perhaps failing, and getting specific feedback from an experienced researcher mentoring them. The latter has an advantage that one doesn't need to explicitly try to elicit and make clear distinctions of the skills involved, and can still learn them. The former is useful because it is scalable (you write it once, and many people can read it), and the concreteness is extremely relevant to allowing people to evaluate the abstract claims you make, and pattern match it to their own past, current, or potential future experiences.

For example, when reading the Inquiring and Trust section, I recall an experience I had last year where I couldn't work with a team of researchers, because I had basically zero ability to defer (and even now as I write this, I find the notion of deferring somewhat distasteful). On the other hand, I don't think there's a real trade-off here. I don't expect that anyone needs to naively trust that other people they are coordinating with will have their back. I'd probably accept the limits to coordination, and recalibrate my expectations of the usefulness of the research project, and probably continue if the expected value of working on the project until it is shipped is worth it (which in general it is).

When reading the Lightness and Diligence section, I was reminded of the Choudhuri 1985 paper, which describes the author's notion of a practice of "partial science", that is, an inability to push science forward due to certain systematic misconceptions of how basic (theoretical physics, in this context) science occurs. One misconception involves a sort of distaste for working on 'unimportant' problems, or problems that don't seem fundamental, while only caring about, or being willing to put in effort to solve, 'fundamental' problems. The author doesn't make it explicit, but I believe that he believed that the incremental work that scientists do is almost essential for building the knowledge and skill they need to make their way towards attacking these supposedly fundamental problems, and that the aversion to working on supposedly incremental research problems leads people to get stuck. This seems very similar to the thing you are pointing at when you talk about diligence and hard work being extremely important. The incremental research progress, to me, seems similar to what you call 'cataloguing rocks'. You need data to see a pattern, after all.

This is the sort of realization and thinking I wouldn't have if I did not have research experience or did not read relevant case studies. I expect that Mesa of early 2023 would have mostly skimmed and ignored your essay, simply because he'd scoff at the notion of 'Trust' and 'Lightness' being relevant in any way to research work.

Comment by mesaoptimizer on shortplav · 2024-07-08T16:28:21.621Z · LW · GW

GPT-4o can not reproduce the string, and instead just makes up plausible candidates. You love to see it.

Hmm. I assume you could fine-tune away an LLM from reproducing the string. Eliciting it would just become more difficult. Try posting canary text, and a part of the canary string, and see if GPT-4o completes it.
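A minimal sketch of that test, under loud assumptions: `complete` is a hypothetical stand-in for whatever function sends a prompt to the model and returns its text completion (no real API is called here), and the canary value is a placeholder you would swap for the real string. It shows the canary text plus the first part of the string, then checks whether the completion begins with the held-out remainder.

```python
from typing import Callable, Tuple


def split_canary(canary: str, reveal_fraction: float = 0.5) -> Tuple[str, str]:
    """Split the canary string into a revealed prefix and a held-out suffix."""
    if len(canary) < 2:
        raise ValueError("canary must have at least two characters")
    if not 0.0 < reveal_fraction < 1.0:
        raise ValueError("reveal_fraction must be strictly between 0 and 1")
    # Clamp so that both the prefix and the suffix are non-empty.
    cut = max(1, min(len(canary) - 1, int(len(canary) * reveal_fraction)))
    return canary[:cut], canary[cut:]


def model_completes_canary(
    complete: Callable[[str], str],  # hypothetical: prompt in, completion text out
    canary_text: str,                # the human-readable notice preceding the string
    canary: str,                     # the full canary string being tested for
    reveal_fraction: float = 0.5,
) -> bool:
    """Return True if the completion starts with the held-out part of the canary."""
    prefix, suffix = split_canary(canary, reveal_fraction)
    prompt = f"{canary_text} {prefix}"
    completion = complete(prompt)
    return completion.strip().startswith(suffix)
```

A plausible-looking but made-up continuation returns `False` here, which is the case the quoted observation describes; a `True` would suggest the string was memorized and can still be elicited.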

Comment by mesaoptimizer on ryan_greenblatt's Shortform · 2024-07-05T07:11:22.649Z · LW · GW

Please read the model organisms for misalignment proposal.

Comment by mesaoptimizer on Habryka's Shortform Feed · 2024-07-04T18:41:56.456Z · LW · GW

Anyone who has signed a non-disparagement agreement with Anthropic is free to state that fact (and we regret that some previous agreements were unclear on this point).

I'm curious as to why it took you (and therefore Anthropic) so long to make it common knowledge (or even public knowledge) that Anthropic used non-disparagement contracts as a standard and was also planning to change its standard agreements.

The right time to reveal this was when the OpenAI non-disparagement news broke, not after Habryka connects the dots and builds social momentum for scrutiny of Anthropic.

Comment by mesaoptimizer on The Xerox Parc/ARPA version of the intellectual Turing test: Class 1 vs Class 2 disagreement · 2024-07-02T13:48:58.360Z · LW · GW

If you like The Dream Machine, you'll also like Organizing Genius.

Comment by mesaoptimizer on mesaoptimizer's Shortform · 2024-06-30T14:30:56.375Z · LW · GW

Project proposal: EpochAI for compute oversight

Detailed MVP description: website with an interactive map that shows the locations of high risk data centers globally, with relevant information appearing when you click on the icons on the map. Examples of relevant information: organizations and frontier labs that have access to this compute, the effective FLOPS of the data center, and how long it would take to train a SOTA model in that datacenter.

High risk datacenters are datacenters that are capable of training current or next generation SOTA AI systems.
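To make the MVP a bit more concrete, here is a rough sketch of the record each map marker could carry, along with the "time to train a SOTA model" estimate. Everything in it is an assumption for illustration: the field names, the utilization factor, the `max_days` cutoff used to operationalize "high risk", and any numbers you feed it are hypothetical, not data about real datacenters or models.

```python
from dataclasses import dataclass, field
from typing import List

SECONDS_PER_DAY = 86_400


@dataclass
class Datacenter:
    """One map marker. All field names are illustrative assumptions."""
    name: str
    latitude: float
    longitude: float
    peak_flops: float                 # peak FLOP/s across the whole datacenter
    utilization: float = 0.4          # assumed fraction of peak reached in training
    operators: List[str] = field(default_factory=list)  # orgs with access

    @property
    def effective_flops(self) -> float:
        """Sustained FLOP/s available for a training run."""
        return self.peak_flops * self.utilization

    def training_days(self, training_flop: float) -> float:
        """Days needed to spend `training_flop` total FLOP in this datacenter."""
        return training_flop / self.effective_flops / SECONDS_PER_DAY

    def is_high_risk(self, training_flop: float, max_days: float = 365.0) -> bool:
        """Assumed criterion: a SOTA-scale run fits within `max_days`."""
        return self.training_days(training_flop) <= max_days
```

The cutoff is the main judgment call: "capable of training a SOTA system" only becomes a yes/no property once you fix both a training compute budget and a maximum acceptable training time.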

Why:

  1. I'm unable to find a 'single point of reference' for information about the number and locations of datacenters that are high risk.
  2. AFAICT Epoch focuses more on tracking SOTA model details instead of hardware related information.
  3. This seems extremely useful for our community (and policy makers) to orient to compute regulation possibilities and its relative prioritization compared to other interventions

Thoughts? I've been playing around with the idea of building it, but have been uncertain about how useful this would be, since I don't have enough interaction with the AI alignment policy people here. Posting it here is an easy test to see whether it is worth greater investment or prioritization.

Note: Uncertain as to whether dual-use issues exist here. I expect that datacenter builders and frontier labs probably already have a very good model of the global compute distribution situation, so this would benefit regulatory efforts significantly more than it would help anyone allocate training compute more strategically.

Comment by mesaoptimizer on Daniel Kokotajlo's Shortform · 2024-06-26T13:46:52.466Z · LW · GW

Neuro-sama is a limited scaffolded agent that livestreams on Twitch, optimized for viewer engagement (so it speaks via TTS, it can play video games, etc.).

Comment by mesaoptimizer on quetzal_rainbow's Shortform · 2024-06-21T20:57:51.678Z · LW · GW

Well, at least a subset of the sequence focuses on this. I read the first two essays and was pessimistic enough about the titular approach that I moved on.

Here's a relevant quote from the first essay in the sequence:

Furthermore, most of our focus will be on ensuring that your model is attempting to predict the right thing. That’s a very important thing almost regardless of your model’s actual capability level. As a simple example, in the same way that you probably shouldn’t trust a human who was doing their best to mimic what a malign superintelligence would do, you probably shouldn’t trust a human-level AI attempting to do that either, even if that AI (like the human) isn’t actually superintelligent.

Also, I don't recommend reading the entire sequence, if that was an implicit question you were asking. It was more of a "Hey, if you are interested in this scenario fleshed out in significantly greater rigor, you'd like to take a look at this sequence!"

Comment by mesaoptimizer on quetzal_rainbow's Shortform · 2024-06-21T11:56:50.023Z · LW · GW

Evan Hubinger's Conditioning Predictive Models sequence describes this scenario in detail.

Comment by mesaoptimizer on yanni's Shortform · 2024-06-21T11:54:47.250Z · LW · GW

There's generally a cost to managing people and onboarding newcomers, and I expect that offering to volunteer for free is usually a negative signal, since it implies that there's a lot more work than usual that would need to be done to onboard this particular newcomer.

Have you experienced otherwise? I'd love to hear some specifics as to why you feel this way.

Comment by mesaoptimizer on orthonormal's Shortform · 2024-06-20T22:35:45.477Z · LW · GW

I think we'll have bigger problems than just solving the alignment problem, if we have a global thermonuclear war that is impactful enough to not only break the compute supply and improvement trends, but also destabilize the economy and geopolitical situation enough that frontier labs aren't able to continue experimenting to find algorithmic improvements.

Agent foundations research seems robust to such supply chain issues, but I'd argue that gigantic parts of the (non-academic, non-DeepMind specific) conceptual alignment research ecosystem are extremely dependent on a stable and relatively resource-abundant civilization: LW, EA organizations, EA funding, individual researchers having the slack to do research, the ability to communicate with each other and build on each other's research, etc. Taking a group of researchers and isolating them in some nuclear-war-resistant country is unlikely to lead to an increase in marginal research progress in that scenario.

Comment by mesaoptimizer on Ilya Sutskever created a new AGI startup · 2024-06-19T20:14:48.985Z · LW · GW

Thiel has historically expressed disbelief about AI doom, and has been more focused on trying to prevent civilizational decline. From my perspective, it is more likely that he'd fund an organization founded by people with accelerationist credentials, than by someone who was a part of a failed coup attempt that would look to him like it involved a sincere belief in an extreme difficulty of the alignment problem.

Comment by mesaoptimizer on Ilya Sutskever created a new AGI startup · 2024-06-19T17:29:52.887Z · LW · GW

Related Bloomberg announcement news article.

Comment by mesaoptimizer on Richard Ngo's Shortform · 2024-06-16T13:37:35.732Z · LW · GW

I'd love to read an elaboration of your perspective on this, with concrete examples, which avoids focusing on the usual things you disagree about (pivotal acts vs. pivotal processes, social facets of the game being important for us to track, etc.) and mainly focuses on your thoughts on epistemology and rationality and how they deviate from what you consider the LW norm.

Comment by mesaoptimizer on Richard Ngo's Shortform · 2024-06-16T13:35:40.105Z · LW · GW

I started reading your meta-rationality sequence, but it ended after just two posts without going into details.

David Chapman's website seems like the standard reference for what the post-rationalists call "metarationality". (I haven't read much of it, but the little I read made me somewhat unenthusiastic about continuing).

Comment by mesaoptimizer on Closed-Source Evaluations · 2024-06-15T18:30:23.506Z · LW · GW

Note that the current power differential between evals labs and frontier labs is such that I don't expect evals labs have the slack to simply state that a frontier model failed their evals.

You'd need regulation with serious teeth and competent 'bloodhound' regulators watching the space like a hawk, for such a possibility to occur.

Comment by mesaoptimizer on Book Recommendations for social skill development? · 2024-06-15T11:40:32.183Z · LW · GW

I just encountered polyvagal theory and I share your enthusiasm for how useful it is for modeling other people and oneself.

Comment by mesaoptimizer on UDT1.01: Logical Inductors and Implicit Beliefs (5/10) · 2024-06-14T10:11:29.642Z · LW · GW

Note that I'm waiting for the entire sequence to be published before I read it (past the first post), so here's a heads up that I'm looking forward to seeing more of this sequence!

Comment by mesaoptimizer on Two easy things that maybe Just Work to improve AI discourse · 2024-06-09T20:12:41.671Z · LW · GW

I think Twitter systematically underpromotes tweets with links external to the Twitter platform, so reposting isn't a viable strategy.

Comment by mesaoptimizer on mesaoptimizer's Shortform · 2024-06-08T11:49:55.399Z · LW · GW

Thanks for the link. I believe I read it a while ago, but it is useful to reread it from my current perspective.

trying to ensure that AIs will be philosophically competent

I think such scenarios are plausible: I know some people argue that certain decision theory problems cannot be safely delegated to AI systems, but if we as humans can work on these problems safely, I expect that we could probably build systems that are about as safe (by crippling their ability to establish subjunctive dependence) but are also significantly more competent at philosophical progress than we are.

Comment by mesaoptimizer on jeffreycaruso's Shortform · 2024-06-05T15:14:10.113Z · LW · GW

Leopold's interview with Dwarkesh is a very useful source of what's going on in his mind.

What happened to his concerns over safety, I wonder?

He doesn't believe in a 'sharp left turn', which means he doesn't consider general intelligence to be a discontinuous (latent) capability spike such that alignment becomes significantly more difficult after it occurs. To him, alignment is simply a somewhat harder empirical techniques problem, like capabilities work is. I assume he imagines behavior similar to current RLHF-ed models even as frontier labs double or quadruple the OOMs of optimization power applied to the creation of SOTA models.

He models (incrementalist) alignment research as "dual use", and therefore treats capabilities and alignment as effectively the same measure.

He also expects humans to continue to exist once certain communities of humans achieve ASI, and imagines that the future will be 'wild'. This is a very rare and strange model to have.

He is quite hawkish -- he is incredibly focused on China not stealing AGI capabilities, and believes that private labs are going to be too incompetent to defend against Chinese infiltration. He would prefer that the USGOV take over AGI development so that it can race effectively against China.

His model for take-off relies quite heavily on "trust the trendline" and estimating linear intelligence increases with more OOMs of optimization power (linear with respect to human intelligence growth from childhood to adulthood). It's not the best way to extrapolate what will happen, but it is a sensible concrete model he can use to talk to normal people and sound confident and not vague -- a key skill if you are an investor, and an especially key skill for someone trying to make it in the SF scene. (Note he clearly states in the interview that he's describing his modal model for how things will go and that he does have uncertainty over how things will occur, but desires to be concrete about what his modal expectation is.)

He has claimed that running a VC firm means he can essentially run it as a "think tank" too, focused on better modeling (and perhaps influencing) the AGI ecosystem. Given his desire for a hyper-militarization of AGI research, it makes sense that he'd try to steer things in this direction using the money and influence he will have and build as a founder of an investment firm.

So in summary, he isn't concerned about safety because he prices it in as something about as difficult as (or slightly more difficult than) capabilities work. This puts him in an ideal epistemic position to run a VC firm for AGI labs, since his optimism is what persuades investors to provide him money, as they expect him to attempt to return them a profit.

Comment by mesaoptimizer on Akash's Shortform · 2024-06-04T22:44:27.915Z · LW · GW

Oh, by that I meant something like "yeah I really think it is not a good idea to focus on an AI arms race". See also Slack matters more than any other outcome.

Comment by mesaoptimizer on Akash's Shortform · 2024-06-04T22:31:07.498Z · LW · GW

If Company A is 12 months from building Cthulhu, we fucked up upstream. Also, I don't understand why you'd want to play the AI arms race -- you have better options. They expect an AI arms race. Use other tactics. Get into their OODA loop.

Unsee the frontier lab.

Comment by mesaoptimizer on Prometheus's Shortform · 2024-06-04T22:24:50.754Z · LW · GW

These are pretty sane takes (conditional on my model of Thomas Kwa of course), and I don't understand why people have downvoted this comment. Here's an attempt to unravel my thoughts and potential disagreements with your claims.

AGI that poses serious existential risks seems at least 6 years away, and safety work seems much more valuable at crunch time, such that I think more than half of most peoples’ impact will be more than 5 years away.

I think safety work gets less and less valuable at crunch time actually. I think you have this Paul Christiano-like model of getting a prototypical AGI and dissecting it and figuring out how it works -- I think it is unlikely that any individual frontier lab would perceive itself to have the slack to do so. Any potential "dissection" tools will need to be developed beforehand, such as scalable interpretability tools (SAEs seem like rudimentary examples of this). The problem with "prosaic alignment" IMO is that a lot of this relies on a significant amount of schlep -- a lot of empirical work, a lot of fucking around. That's probably why, according to the MATS team, frontier labs have a high demand for "iterators" -- their strategy involves having a lot of ideas about stuff that might work, and without a theoretical framework underlying their search path, a lot of things they do would look like trying things out.

I expect that once you get AI researcher level systems, the die is cast. Whatever prosaic alignment and control measures you've figured out, you'll now be using them in an attempt to play this breakneck game of getting useful work out of a potentially misaligned AI ecosystem that would also be modifying itself to improve its capabilities (because that is the point of AI researchers). (Sure, it's easier to test for capability improvements. That doesn't mean you can't transfer information embedded into these proposals such that modified models will be modified in ways the humans did not anticipate or would not want if they had a full understanding of what is going on.)

Mentorship for safety is still limited. If you can get an industry safety job or get into MATS, this seems better than some random AI job, but most people can’t.

Yeah -- I think most "random AI jobs" are significantly worse for trying to do useful work in comparison to just doing things by yourself or with some other independent ML researchers. If you aren't in a position to do this, however, it does make sense to optimize for a convenient low-cognitive-effort set of tasks that provides you the social, financial and/or structural support that will benefit you, and perhaps look into AI safety stuff as a hobby.

I agree that mentorship is a fundamental bottleneck to building mature alignment researchers. This is unfortunate, but it is the reality we have.

Funding is also limited in the current environment. I think most people cannot get funding to work on alignment if they tried? This is fairly cruxy and I’m not sure of it, so someone should correct me if I’m wrong.

Yeah, post-FTX, I believe that funding is limited enough that you have to be consciously optimizing for getting funding (as an EA-affiliated organization, or as an independent alignment researcher). Particularly for new conceptual alignment researchers, I expect that funding is drastically limited since funding organizations seem to explicitly prioritize funding grantees who will work on OpenPhil-endorsed (or to a certain extent, existing but not necessarily OpenPhil-endorsed) agendas. This includes stuff like evals.

The relative impact of working on capabilities is smaller than working on alignment—there are still 10x as many people doing capabilities as alignment, so unless returns don’t diminish or you are doing something unusually harmful, you can work for 1 year on capabilities and 1 year on alignment and gain 10x.

This is a very Paul Christiano-like argument -- yeah sure the math makes sense, but I feel averse to agreeing with this because it seems like you may be abstracting away significant parts of reality and throwing away valuable information we already have.

Anyway, yeah I agree with your sentiment. It seems fine to work on non-SOTA AI / ML / LLM stuff and I'd want people to do so such that they live a good life. I'd rather they didn't throw themselves into the gauntlet of "AI safety" and get chewed up and spit out by an incompetent ecosystem.

Safety could get even more crowded, which would make upskilling to work on safety net negative. This should be a significant concern, but I think most people can skill up faster than this.

I still don't understand what causal model would produce this prediction. Here's mine: The number of safety researchers the current SOTA lab ecosystem can absorb is bottlenecked by the labs' expectations of how many researchers they want or need. On one hand, more schlep during the pre-AI-researcher era means more hires. On the other hand, more hires require more research managers or managerial experience. Anecdotally, it seems like many AI capabilities and alignment organizations (both in the EA space and in the frontier lab space) have historically been bottlenecked on management capacity. Additionally, hiring has a cost (both the search process and the onboarding), and it is likely that as labs get closer to creating AI researchers, they'd believe that the opportunity cost of hiring continues to increase.

Skills useful in capabilities are useful for alignment, and if you’re careful about what job you take there isn’t much more skill penalty in transferring them than, say, switching from vision model research to language model research.

Nah, I found very little stuff from my vision model research work (during my undergrad) contributed to my skill and intuition related to language model research work (again during my undergrad, both around 2021-2022). I mean, specific skills of programming and using PyTorch and debugging model issues and data processing and containerization -- sure, but the opportunity cost is ridiculous when you could be actually working with LLMs directly and reading papers relevant to the game you want to play. High quality cognitive work is extremely valuable and spending it on irrelevant things like the specifics of diffusion models (for example) seems quite wasteful unless you really think this stuff is relevant.

Capabilities often has better feedback loops than alignment because you can see whether the thing works or not. Many prosaic alignment directions also have this property. Interpretability is getting there, but not quite. Other areas, especially in agent foundations, are significantly worse.

Yeah this makes sense for extreme newcomers. If someone can get a capabilities job, however, I think they are doing themselves a disservice by playing the easier game of capabilities work. Yes, you have better feedback loops than alignment research / implementation work. That's like saying "Search for your keys under the streetlight because that's where you can see the ground most clearly." I'd want these people to start building the epistemological skills to thrive even with a lower intensity of feedback loops such that they can do alignment research work effectively.

And the best way to do that is to actually attempt to do alignment research, if you are in a position to do so.

Comment by mesaoptimizer on mesaoptimizer's Shortform · 2024-06-04T10:03:44.817Z · LW · GW

It seems like a significant amount of decision theory progress happened between 2006 and 2010, and since then progress has stalled.

Comment by mesaoptimizer on Alok Singh's Shortform · 2024-06-04T07:29:32.413Z · LW · GW

You are leaving out a ridiculous amount of context, but yes, if you are okay with leather footwear, Meermin provides great footwear at relatively inexpensive prices.

I still recommend thrift shopping instead. I spent 250 EUR on a pair of new boots from Meermin, and 50 EUR on a pair of thrifted boots which seem about 80% as aesthetically pleasing as the first (and just as comfortable since I tried them on before buying them).

Comment by mesaoptimizer on mesaoptimizer's Shortform · 2024-06-02T20:20:34.760Z · LW · GW

It has been six months since I wrote this, and I want to note an update: I now grok what Valentine is trying to say and what he is pointing at in Here's the Exit and We're already in AI takeoff. That is, I have a detailed enough model of Valentine's model of the things he talks about, such that I understand the things he is saying.

I still don't feel like I understand Kensho. I get the pattern of the epistemic puzzle he is demonstrating, but I don't know if I get the object-level thing he points at. Based on a reread of the comments, maybe what Valentine means by Looking is essentially gnosis, as opposed to doxa. An understanding grounded in your experience rather than an ungrounded one you absorbed from someone else's claims. See this comment by someone else who is not Valentine in that post:

The fundamental issue is that we are communicating in language, the medium of ideas, so it is easy to get stuck in ideas. The only way to get someone to start looking, insofar as that is possible, is to point at things using words, and to get them to do things. This is why I tell you to do things like wave your arms about or attack someone with your personal bubble or try to initiate the action of touching a hot stove element.

Alternately, Valentine describes the process of Looking as "Direct embodied perception prior to thought.":

Most of that isn’t grounded in reality, but that fact is hard to miss because the thinker isn’t distinguishing between thoughts and reality.

Looking is just the skill of looking at reality prior to thought. It’s really not complicated. It’s just very, very easy to misunderstand if you fixate on mentally understanding it instead of doing it. Which sadly seems to be the default response to the idea of Looking.

I am unsure if this differs from mundane metacognitive skills like "notice the inchoate cognitions that arise in your mind-body, that aren't necessarily verbal". I assume that Valentine is pointing at a certain class of cognition, one that is essentially entirely free of interpretation. Or perhaps before 'value-ness' is attached to an experience -- such as "this experience is good because <elaborate strategic chain>" or "this experience is bad because it hurts!"

I understand how a better metacognitive skillset would lead to the benefits Valentine mentioned, but I don't think it requires you to only stay at the level of "direct embodied perception prior to thought".

As for kensho, it seems to be a term for some skill that leads you to be able to do what romeostevensit calls 'fully generalized un-goodharting':

I may have a better answer for the concrete thing that it allows you to do: it’s fully generalizing the move of un-goodharting. Buddhism seems to be about doing this for happiness/​inverse-suffering, though in principle you could pick a different navigational target (maybe).

Concretely, this should show up as being able to decondition induced reward loops and thus not be caught up in any negative compulsive behaviors.

I think that "fully generalized un-goodharting" is a pretty vague phrase and I could probably come up with a better one, but it is an acceptable pointer term for now. So I assume it is something like 'anti-myopia'? Hard to know at this point. I'd need more experience and experimentation and thought to get a better idea of this.

I believe that Here's the Exit, We're already in AI Takeoff, and Slack matters more than any outcome were all pointing at the same cluster of skills and thought -- about realizing the existence of psyops, systematic vulnerabilities or issues that lead you (whatever 'you' means) to forget the 'bigger picture', and that the resulting myopia causes significantly bad outcomes from the perspective of the 'whole' individual/society/whatever.

In general, Lexicogenesis seems like a really important sub-skill for deconfusion.

Comment by mesaoptimizer on jacquesthibs's Shortform · 2024-05-28T22:16:05.971Z · LW · GW

I've experimented with Claude Opus for simple Ada autoformalization test cases (specifically quicksort), and it seems like the sort of issues that make LLM agents infeasible (hallucination-based drift, subtle drift caused by sticking to certain implicit assumptions you made before) are also the issues that make Opus hard to use for autoformalization attempts.

I haven't experimented with a scaffolded LLM agent for autoformalization, but I expect it won't go very well either, primarily because scaffolding involves attempts to make human-like implicit high-level cognitive strategies into explicit algorithms or heuristics such as tree of thought prompting, and I expect that this doesn't scale given the complexity of the domain (sufficiently general autoformalizing AI systems can be modelled as effectively consequentialist, which makes them dangerous). I don't expect a scaffolded (over Opus) LLM agent to succeed at autoformalizing quicksort right now either, mostly because I believe RLHF tuning has systematically optimized Opus to write the bottom line first and then attempt to build or hallucinate a viable answer, and then post-hoc justify the answer. (While steganographic non-visible chain-of-thought may have gone into figuring out the bottom line, it still is worse than first doing visible chain-of-thought such that it has more token-compute-iterations to compute its answer.)

If anyone reading this is able to build a scaffolded agent that autoformalizes (using Lean or Ada) algorithms of complexity equivalent to quicksort reliably (such that more than 5 out of 10 of its attempts succeed) within the next month of me writing this comment, then I'd like to pay you 1000 EUR to see your code and for an hour of your time to talk with you about this. That's a little less than twice my current usual monthly expenses, for context.

Comment by mesaoptimizer on Notifications Received in 30 Minutes of Class · 2024-05-26T17:36:48.963Z · LW · GW

This is very interesting, thank you for posting this.

Comment by mesaoptimizer on Executive Dysfunction 101 · 2024-05-26T09:28:04.502Z · LW · GW

the therapeutic idea of systematically replacing the concept “should” with less normative framings

Interesting. I independently came up with this concept, downstream of thinking about moral cognition and parts work. Could you point me to any past literature that talks about this coherently enough that you would point people to it to understand this concept?

I know that Nate has written about this:

As far as I recall, reading these posts didn't help me.

Comment by mesaoptimizer on james.lucassen's Shortform · 2024-05-25T20:37:22.630Z · LW · GW

Based on gwern's comment, steganography as a capability can arise (at rather rudimentary levels) via RLHF over multi-step problems (which is effectively most cognitive work, really), and this gets exacerbated with the proliferation of AI generated text that embeds its steganographic capabilities within it.

The following paragraph by gwern (from the same thread linked in the previous paragraph) basically summarizes my current thoughts on the feasibility of prevention of steganography for CoT supervision:

Inner-monologue approaches to safety, in the new skin of ‘process supervision’, are popular now so it might be good for me to pull out one point and expand on it: ‘process supervision’ does not necessarily prevent steganography, nor does it ensure safety, because you are still pitting three goals against each other—achieving superhuman capability on a task, achieving causal fidelity of transcripts, achieving human-readability of the transcripts. Choose two: you can have transcripts which record a capable thought process accurately but which you cannot understand in full detail; which are not capable but their transcripts accurately convey the fallible flawed concepts and reasoning used; or which are capable and you understand, but are not what it actually thought (because they are misleading, wrong, or shallow ‘lies to children’ sorts of explanations).

Comment by mesaoptimizer on Fund me please - I Work so Hard that my Feet start Bleeding and I Need to Infiltrate University · 2024-05-23T07:49:02.835Z · LW · GW

Well, if you know relevant theoretical CS and useful math, you don’t have to rebuild the mathematical scaffolding all by yourself.

I didn't intend to imply in my message that you have mathematical scaffolding that you are recreating, although I expect it may be likely (Pearlian causality perhaps? I've been looking into it recently and clearly knowing Bayes nets is very helpful). I specifically used "you" to imply that in general this is the case. I haven't looked very deep into the stuff you are doing, unfortunately -- it is on my to-do list.

Comment by mesaoptimizer on Overconfidence · 2024-05-21T14:34:25.407Z · LW · GW

I do think that systematic self-delusion seems useful in multi-agent environments (see the commitment races problem for an abstract argument, and Sarah Constantin's essay "Is Stupidity Strength?" for a more concrete argument).

I'm not certain that this is the optimal strategy we have for dealing with such environments, and note that systematic self-delusion also leaves you (and the other people using a similar strategy to coordinate) vulnerable to risks that do not take into account your self-delusion. This mainly includes existential risks such as misaligned superintelligences, but also extinction-level asteroids.

It's a pretty complicated picture and I don't really have clean models of these things, but I do think that for most contexts I interact in, the long-term upside of having better models of reality is significantly higher compared to the benefit of systematic self-delusion.

Comment by mesaoptimizer on Overconfidence · 2024-05-21T13:45:56.243Z · LW · GW

According to Eliezer Yudkowsky, your thoughts should reflect reality.

I expect that the more your beliefs track reality, the better you'll get at decision making, yes.

According to Paul Graham, the most successful people are slightly overconfident.

Ah but VCs benefit from the ergodicity of the startup founders! From the perspective of the founder, it's a non-ergodic situation. It's better to make Kelly bets instead if you prefer to not fall into gambler's ruin, given whatever definition of the real world situation maps onto the abstract concept of being 'ruined' here.
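To make the ergodicity point concrete, here is a minimal sketch of a repeated binary bet. The numbers (a 60% win chance at even odds) and the function names are my own illustrative assumptions, not anything from the post. It compares the ensemble-average multiplier, which is roughly what a VC with a large portfolio experiences, against the time-average log growth that a single founder compounding one bankroll experiences.

```python
import math


def kelly_fraction(p: float, b: float) -> float:
    """Kelly stake as a fraction of bankroll, for win probability p and net odds b."""
    return max(0.0, p - (1.0 - p) / b)


def expected_wealth_multiplier(f: float, p: float, b: float) -> float:
    """Ensemble average: mean one-round wealth multiplier when staking fraction f."""
    return p * (1.0 + f * b) + (1.0 - p) * (1.0 - f)


def expected_log_growth(f: float, p: float, b: float) -> float:
    """Time average: expected log growth per round for one compounding bankroll."""
    if f >= 1.0:
        # Staking everything means a single loss ends the game (ruin).
        return float("-inf")
    return p * math.log(1.0 + f * b) + (1.0 - p) * math.log(1.0 - f)


if __name__ == "__main__":
    p, b = 0.6, 1.0  # hypothetical bet: 60% win chance, even odds
    for f in (kelly_fraction(p, b), 0.5, 1.0):
        print(
            f"stake={f:.2f}",
            f"ensemble multiplier={expected_wealth_multiplier(f, p, b):.3f}",
            f"time-average log growth={expected_log_growth(f, p, b):.4f}",
        )
```

Under these assumptions the all-in stake has the highest ensemble average, yet its time-average growth is negative infinity, and even staking half the bankroll shrinks a single bettor's wealth over time. Only stakes near the Kelly fraction grow the individual bankroll, which is the asymmetry between the portfolio holder and the founder.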

It usually pays to have a better causal model of reality than relying on what X person says to inform your actions.

Can you think of anyone who has changed history who wasn’t a little overconfident?

Survivorship bias.

It is advantageous to be friends with the kind of people who do things and never give up.

I think I do things and never give up in general, while I can be pessimistic about specific things and tasks I could do. You can be generally extremely confident in yourself and your ability to influence reality, while also being specifically pessimistic about a wide range of existing possible things you could be doing.

Here's a Nate post that provides his perspective on this specific orientation to reality that leads to a sort of generalized confidence that has social benefits.

Comment by mesaoptimizer on Fund me please - I Work so Hard that my Feet start Bleeding and I Need to Infiltrate University · 2024-05-21T11:39:07.358Z · LW · GW

I wrote a bit about it in this comment.

I think that conceptual alignment research of the sort that Johannes is doing (and that I also am doing, which I call "deconfusion") is just really difficult. It involves skills that are not taught to people, that you seem very unlikely to learn by being mentored in traditional academia (including when doing theoretical CS or non-applied math PhDs), that I only started wrapping my head around after some mentorship from two MIRI researchers (that I believe I was pretty lucky to get), and even then I've spent a ridiculous amount of time by myself trying to tease out patterns to figure out a more systematic process of doing this.

Oh, and the more theoretical CS (and related math such as mathematical logic) you know, the better you probably are at this -- see how Johannes tries to create concrete models of the inchoate concepts in his head? Well, if you know relevant theoretical CS and useful math, you don't have to rebuild the mathematical scaffolding all by yourself.

I don't have a good enough model of John Wentworth's model for alignment research to understand the differences, but I don't think I learned all that much from John's writings and his training sessions that were a part of his MATS 4.0 training regimen, as compared to the stuff I described above.

Comment by mesaoptimizer on Fund me please - I Work so Hard that my Feet start Bleeding and I Need to Infiltrate University · 2024-05-21T10:35:17.982Z · LW · GW

Note that when I said I disagree with your decisions, I specifically meant the sort of myopia in the glass shard story -- and specifically because I believe that if your research process / cognition algorithm is fragile enough that you'd be willing to take physical damage to hold onto an inchoate thought, maybe consider making your cognition algorithm more robust.

Comment by mesaoptimizer on Fund me please - I Work so Hard that my Feet start Bleeding and I Need to Infiltrate University · 2024-05-21T10:33:16.790Z · LW · GW

Quoted from the linked comment:

Rather, I’m confident that executing my research process will over time lead to something good.

Yeah, this is a sentiment I agree with and believe. I think that it makes sense to have a cognitive process that self-corrects and systematically moves towards solving whatever problem it is faced with. In terms of computability theory, one could imagine it as an effectively computable function that you expect will return you the answer -- and the only 'obstacle' is time / compute invested.

I think being confident, i.e. not feeling hopeless in doing anything, is important. The important takeaway here is that you don’t need to be confident in any particular idea that you come up with. Instead, you can be confident in the broader picture of what you are doing, i.e. your processes.

I share your sentiment, although the causal model for it is different in my head. A generalized feeling of hopelessness is an indicator of mistaken assumptions and causal models in my head, and I use that as a cue to investigate why I feel that way. This usually results in me having hopelessness about specific paths, and a general purposefulness (for I have an idea of what I want to do next), and this is downstream of updates to my causal model that attempts to track reality as best as possible.

Comment by mesaoptimizer on Stephen Fowler's Shortform · 2024-05-20T13:41:59.015Z · LW · GW

I don’t know whether OpenAI uses nondisparagement agreements; I haven’t signed one.

This can also be glomarizing. "I haven't signed one." is a fact, intended for the reader to use it as anecdotal evidence. "I don't know whether OpenAI uses nondisparagement agreements" can mean that he doesn't know for sure, and will not try to find out.

Obviously, the context of the conversation and the events surrounding Holden stating this matters for interpreting this statement, but I'm not interested in looking further into this, so I'm just going to highlight the glomarization possibility.

Comment by mesaoptimizer on Fund me please - I Work so Hard that my Feet start Bleeding and I Need to Infiltrate University · 2024-05-20T11:26:52.895Z · LW · GW

I think what quila is pointing at is their belief in the supposed fragility of thoughts at the edge of research questions. From that perspective I think their rebuttal is understandable, and your response completely misses the point: you can be someone who spends only four hours a day working and the rest of the time relaxing, but also care a lot about not losing the subtle and supposedly fragile threads of your thought when working.

Note: I have a different model of research thought, one that involves a systematic process towards insight, and because of that I also disagree with Johannes' decisions.

Comment by mesaoptimizer on Stephen Fowler's Shortform · 2024-05-18T22:07:29.157Z · LW · GW

But the discussion of “repercussions” before there’s been an investigation goes into pure-scapegoating territory if you ask me.

Just to be clear, OP themselves seem to think that what they are saying will have little effect on the status quo. They literally called it "Very Spicy Take". Their intention was to allow them to express how they felt about the situation. I'm not sure why you find this threatening, because again, the people they think ideally wouldn't continue to have influence over AI safety related decisions are incredibly influential and will very likely continue to have the influence they currently possess. Almost everyone else in this thread implicitly models this fact as they are discussing things related to the OP comment.

There is not going to be any scapegoating that will occur. I imagine that everything I say is something I would say in person to the people involved, or to third parties, and not expect any sort of coordinated action to reduce their influence -- they are that irreplaceable to the community and to the ecosystem.

Comment by mesaoptimizer on Stephen Fowler's Shortform · 2024-05-18T21:56:13.543Z · LW · GW
Comment by mesaoptimizer on Stephen Fowler's Shortform · 2024-05-18T21:55:20.072Z · LW · GW

“Keep people away” sounds like moral talk to me.

Can you not be close friends with someone while also expecting them to be bad at self-control when it comes to alcohol? Or perhaps they are great at technical stuff like research but pretty bad at negotiation, especially when dealing with experienced adversarial situations such as when talking to VCs?

If you think someone’s decisionmaking is actively bad, i.e. you’d better off reversing any advice from them, then maybe you should keep them around so you can do that!

It is not the case that people's decision-making skill is optimized such that you can consistently reverse someone's opinion to get something that accurately tracks reality. If that were the case, then they would implicitly be tracking reality very well already. Reversed stupidity is not intelligence.
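A rough way to see this, as a sketch under my own simplifying assumptions (advice on a fair yes/no question, modelled as a binary symmetric channel; the function names are hypothetical): reversing advice only helps when the adviser is reliably wrong, and a reliably wrong adviser carries exactly as much information about reality as a reliably right one.

```python
import math


def binary_entropy(a: float) -> float:
    """Shannon entropy in bits of a coin that lands heads with probability a."""
    if a in (0.0, 1.0):
        return 0.0
    return -a * math.log2(a) - (1.0 - a) * math.log2(1.0 - a)


def information_bits(accuracy: float) -> float:
    """Bits the advice carries about a fair binary truth (binary symmetric channel)."""
    return 1.0 - binary_entropy(accuracy)


def accuracy_if_reversed(accuracy: float) -> float:
    """Accuracy you get by always doing the opposite of the advice."""
    return 1.0 - accuracy


if __name__ == "__main__":
    for acc in (0.5, 0.3, 0.1):
        print(
            f"accuracy={acc:.1f}",
            f"reversed accuracy={accuracy_if_reversed(acc):.1f}",
            f"information={information_bits(acc):.3f} bits",
        )
```

In this toy model, ordinary bad judgment sits near 50% accuracy, where reversing it gains you nothing. Reversal only pays off for an adviser far below 50%, and such an adviser is already tightly coupled to reality, just with the sign flipped.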

But more realistically, someone who’s fucked up in a big way will probably have learned from that, and functional cultures don’t throw away hard-won knowledge.

Again you seem to not be trying to track the context of our discussion here. This advice again is usually said when it comes to junior people embedded in an institution, because the ability to blame someone and / or hold them responsible is a power that senior / executive people hold. This attitude you describe makes a lot of sense when it comes to people who are learning things, yes. I don't know if you can plainly bring it into this domain, and you even acknowledge this in the next few lines.

Imagine a world where AI is just an inherently treacherous domain, and we throw out the leadership whenever they make a mistake.

I think it is incredibly unlikely that the rationalist community has an ability to 'throw out' the 'leadership' involved here. I find this notion incredibly silly, given the amount of influence OpenPhil has over the alignment community, especially through their funding (including the pipeline, such as MATS).

Comment by mesaoptimizer on Stephen Fowler's Shortform · 2024-05-18T21:06:18.749Z · LW · GW

I downvoted this comment because it felt uncomfortably scapegoat-y to me.

Enforcing social norms to prevent scapegoating also destroys information that is valuable for accurate credit assignment and causally modelling reality.

If you start with the assumption that there was a moral failing on the part of the grantmakers, and you are wrong, there’s a good chance you’ll never learn that.

I think you are misinterpreting the grandparent comment. I do not read any mention of a 'moral failing' in that comment. You seem worried because of the commenter's clear description of what they think would be a sensible step for us to take given what they believe are egregious flaws in the decision-making processes of the people involved. I don't think there's anything wrong with such claims.

Again: You can care about people while also seeing their flaws and noticing how they are hurting you and others you care about. You can be empathetic to people having flawed decision making and care about them, while also wanting to keep them away from certain decision-making positions.

If you think the OpenAI grant was a big mistake, it’s important to have a detailed investigation of what went wrong, and that sort of detailed investigation is most likely to succeed if you have cooperation from people who are involved.

Oh, interesting. Who exactly do you think influential people like Holden Karnofsky and Paul Christiano are accountable to? This "detailed investigation" you speak of, and this notion of a "blameless culture", makes a lot of sense when you are the head of an organization and you are conducting an investigation as to the systematic mistakes made by people who work for you, and who you are responsible for. I don't think this situation is similar enough that you can use these intuitions blindly without thinking through the actual causal factors involved in this situation.

Note that I don't necessarily endorse the grandparent comment's claims. This is a complex situation and I'd need to spend more time analyzing it and what occurred.