Posts

The Alignment Mapping Program: Forging Independent Thinkers in AI Safety - A Pilot Retrospective 2025-01-10T16:22:16.905Z
Meditation insights as phase shifts in your self-model 2025-01-07T10:09:35.854Z
Model Integrity: MAI on Value Alignment 2024-12-05T17:11:31.707Z
Reprograming the Mind: Meditation as a Tool for Cognitive Optimization 2024-01-11T12:03:41.763Z
How well does your research adress the theory-practice gap? 2023-11-08T11:27:52.410Z
Jonas Hallgren's Shortform 2023-10-11T09:52:20.390Z
Advice for new alignment people: Info Max 2023-05-30T15:42:20.142Z
Respect for Boundaries as non-arbirtrary coordination norms 2023-05-09T19:42:13.194Z
Max Tegmark's new Time article on how we're in a Don't Look Up scenario [Linkpost] 2023-04-25T15:41:16.050Z
The Benefits of Distillation in Research 2023-03-04T17:45:22.547Z
Power-Seeking = Minimising free energy 2023-02-22T04:28:44.075Z
Black Box Investigation Research Hackathon 2022-09-12T07:20:34.966Z
Announcing the Distillation for Alignment Practicum (DAP) 2022-08-18T19:50:31.371Z
Does agent foundations cover all future ML systems? 2022-07-25T01:17:11.841Z
Is it worth making a database for moral predictions? 2021-08-16T14:51:54.609Z
Is there any serious attempt to create a system to figure out the CEV of humanity and if not, why haven't we started yet? 2021-02-25T22:06:04.695Z

Comments

Comment by Jonas Hallgren on Meditation insights as phase shifts in your self-model · 2025-01-19T13:29:47.395Z · LW · GW

So I still haven't really figured out how to talk about these things properly, as it is more of a vibe than an intellectual truth?

Let's say that you don't feel a strong sense of self but are instead identified with nothing: there is no self. If you see this, then you can see the "deathless".

It's pointing out a different metaphysical viewpoint that can be experienced. I agree with you that from a rational point of view this is strictly not true, yet it isn't to be understood, it is to be experienced? You can't, or at least I can't, think my way to it.

Comment by Jonas Hallgren on Elizabeth's Shortform · 2025-01-16T23:32:08.622Z · LW · GW

I've got a bunch of meditation under my belt so my metacognitive awareness is quite good imo.

Stimulants that are attention increasing, such as caffeine or modafinil, generally lead to more tunnel vision and less metacognitive awareness in my experience. This generally leads to less ability to update opinions quickly.

Nicotine, which activates acetylcholine receptors, allows for more curiosity, which allows me to update more quickly, so it depends on the stimulant as well as the general timing. (0.6mg in gum form; too high a spike just leads to a hit and not curiosity.) It is like being more sensitive to and interested in whatever appears around me.

If you're sensitive enough you can start recognizing when different mental modes are firing in your brain and adapt based on what you want, shit is pretty cool.

Comment by Jonas Hallgren on What Is The Alignment Problem? · 2025-01-16T14:38:45.158Z · LW · GW

One of the more common responses I hear at this point is some variation of “general intelligence isn’t A Thing, people just learn a giant pile of specialized heuristics via iteration and memetic spread”

I'm very uncertain about the validity of the below question but I shall ask it anyway, and since I don't trust my own way of expressing it, here's Claude on it:

The post argues that humans must have some general intelligence capability beyond just learning specialized heuristics, based on efficiency arguments in high-dimensional environments. However, research on cultural evolution (e.g., "The Secret of Our Success", "Cognitive Gadgets") suggests that much of human capability comes from distributed cultural learning and adaptation. Couldn't this cultural scaffolding, combined with domain-specific inductive biases (as suggested by work in Geometric Deep Learning), provide the efficiency gains you attribute to general intelligence? In other words, perhaps the efficiency comes not from individual general intelligence, but from the collective accumulation and transmission of specialized cognitive tools?

I do agree that there are specific generalised forms of intelligence. I guess this points me more towards the idea that the generating functions of these might not be optimally sub-divided in the usual way we think about it?

Now completely theoretically of course, say someone were to believe the above, why is the following really stupid?:

Specifically, consider the following proposal: Instead of trying to directly align individual agents' objectives, we could focus on creating environmental conditions and incentive structures that naturally promote collaborative behavior. The idea being that just as virtue ethics suggests developing good character through practiced habits and environmental shaping, we might achieve alignment through carefully designed collective dynamics that encourage beneficial emergent behaviors. (Since this seems to be the most agentic underlying process that we currently have, theoretically of course.)

Comment by Jonas Hallgren on Building AI Research Fleets · 2025-01-12T20:08:55.677Z · LW · GW

Well said. I think that research fleets will be a big thing going forward and you expressed why quite well. 

I think there's an extension that we also have to make with some of the safety work we have, especially for control and related agendas. It is to some extent about aligning research fleets and not individual agents.

I've been researching ways of going about aligning & setting up these sorts of systems for the last year but I find myself very bottlenecked by not being able to communicate the theories that exist in related fields that well.

It is quite likely that RSI happens in lab automation and distributed labs before anything else. So the question then becomes how one can extend the existing techniques and theory that we currently have to distributed systems of research agents? 

There's a bunch of fun and very interesting decentralised coordination schemes and technologies one can use from fields such as digital democracy and collective intelligence. It is just really hard to prune what will work and to think about what the alignment proposals should be for these things. You usually get emergence in Agent-Based Models, which research systems are a sub-part of, and often the best way to predict problems is to actually run the experiments in those systems.

So how in the hell are we supposed to predict the problems without this? What are the experiments we need to run? What types of organisation & control systems should be recommended to governance people when it comes to research fleets? 

Comment by Jonas Hallgren on Ethodynamics of Omelas · 2025-01-08T16:42:03.199Z · LW · GW

This delightful piece applies thermodynamic principles to ethics in a way I haven't seen before. By framing the classic "Ones Who Walk Away from Omelas" through free energy minimization, the author gives us a fresh mathematical lens for examining value trade-offs and population ethics.

What makes this post special isn't just its technical contribution - though modeling ethical temperature as a parameter for equality vs total wellbeing is quite clever. The phase diagram showing different "walk away" regions bridges the gap between mathematical precision and moral intuition in an elegant way.

While I don't think we'll be using ethodynamics to make real-world policy decisions anytime soon, this kind of playful-yet-rigorous exploration helps build better abstractions for thinking about ethics. It's the kind of creative modeling that could inspire novel approaches to value learning and population ethics.

Also, it's just a super fun read. A great quote from the conclusion is "I have striven to make this paper a pleasant read by enriching it with all manners of enjoyable things: wit, calculus, and a non indifferent amount of imaginary child abuse".

That is the type of writing I want to see more of! Very nice.

Comment by Jonas Hallgren on Building Big Science from the Bottom-Up: A Fractal Approach to AI Safety · 2025-01-07T12:01:56.151Z · LW · GW

I really like this! For me it somewhat also paints a vision for what could be which might inspire action.

Something that I've generally thought would be really nice to have over the last couple of years is a vision for what a decentralized AI Safety field could look like and what the specific levers to pull would be to get there.

What does the optimal form of a decentralized AI Safety science look like? 

How does this incorporate parts of meta science and potentially decentralized science? 

What does this look like with literature review from AI systems? How can we use AI systems themselves to create such infrastructure in the field? What do such communication pathways optimally look like?

I feel that there is so much low-hanging fruit here. There are so many algorithms that we could apply to make things better. Yes we've got some forums but holy smokes could the underlying distribution and optimisation systems be optimised. Maybe the Lightcone crew could cook something in this direction?

Comment by Jonas Hallgren on The Plan - 2024 Update · 2025-01-02T13:42:52.113Z · LW · GW

Let me drop some examples of "theory" or at least useful bits of information that I find interesting beyond the morphogenesis and free energy principle vibing. I agree with you that the basic form of FEP is just another formalization of Bayesian network passing formalised through KL-divergence, and whilst interesting it doesn't say that much about foundations. For Artificial Life, it is more a vibe check from having talked to people in the space: it seems to me they've got a bunch of thoughts about it but also some academic capture, so it might be useful to at least talk to the researchers there about your work?

Like a randomly insultingly simple suggestion: do a quick literature review through Elicit in ActInf and Computational Biology for your open questions and see if there are links; if there are, send those people a quick message. I think a bunch of the theory is in people's heads and if you nerdsnipe them they're usually happy to give you the time of day.
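For reference, the KL-divergence reading of the FEP mentioned above is usually written as the variational free energy decomposition. The notation is an assumption on my side and not taken from this thread: q(s) is an approximate posterior over hidden states s, and o is the observation.

```latex
F[q] \;=\; \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big]
     \;=\; D_{\mathrm{KL}}\big[\,q(s)\,\|\,p(s \mid o)\,\big] \;-\; \ln p(o)
```

Since the KL term is non-negative, F is an upper bound on the surprise -ln p(o), and minimising F over q is ordinary approximate Bayesian inference, which is roughly why the basic form doesn't add much on its own.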

Here's some stuff that I think is theoretically cool as a quick sampler:

For Levin's work:

  1. In the link I posted above he talks about morphogenesis, the thing I find the most interesting there from an agent foundations and information processing perspective is the anti-fragility of systems with respect to information loss (similar to some of the stuff in Uri's work if I've understood that correctly.) There are lots of variations of underlying genetics yet similar structures can be decoded through similar algorithms and it just shows a huge resilience there. It seems you probably know this from Uri's work already.

Active Inference stuff:

  1. Physics as information processing: https://youtu.be/RpOrRw4EhTo
    1. The reason why I find this very interesting is that it seems to me to be saying something fundamental about information processing systems from a limited observer perspective.
    2. I haven't gotten through the entire series yet but it is like a derivation of hierarchical agency or at least why a controller is needed from first principles.
  2. I think this ACS post explains it better than I do below but here's my attempt at it:
    1. I'm trying to find the stuff I've seen on <<Boundaries>> within Active Inference yet it is spread out and not really centered. There's this very interesting perspective of there only being model and modelled and that talking about agent foundations is a bit like taking the modeller as the foundational perspective whilst that is a model in itself. Some kind of computational intractability claims together with the above video series gets you to this place where we have a system of hierarchical agents and controllers in a system with each other. I have a hard time explaining it but it is like it points towards a fundamental symmetry perspective between an agent and its environment.

Other videos from Levin's channel:

  1. Agency at the very bottom - some category theory mathy stuff on agents and their fundamental properties: https://youtu.be/1tT0pFAE36c
  2. The Collective Intelligence of Morphogenesis - if I remember correctly it goes through some theories around cognition of cells, there's stuff about memory, cognitive lightcones etc. I at least found it interesting: https://youtu.be/JAQFO4g7UY8

(I've got that book from Uri on my reading list btw, reminded me of this book on Categorical systems theory, might be interesting: http://davidjaz.com/Papers/DynamicalBook.pdf)

Comment by Jonas Hallgren on The Plan - 2024 Update · 2025-01-01T10:26:37.010Z · LW · GW

In your MATS training program from two years ago, you talked about farming bits of information from real world examples before doing anything else as a fast way to get feedback. You then extended this to say that this is quicker than doing it with something like running experiments. 

My question is then why you haven't engaged your natural latents, or what in my head I think of as a "boundary formulation through a functional", with fields such as artificial life or computational biology where these are core questions to answer?

Trying to solve image generation or trying to solve something like fluid mechanics simulations seem a bit like doing the experiment before trying to integrate it with the theory in that field? Wouldn't it make more sense to try to engage in a deeper way with the existing agent foundations theory in the real world like Michael Levin's Morphogenesis stuff? Or something like an overview of Artificial Life?

Yes as you say real world feedback loops and working on real world problems, I fully agree but are you sure that you're done with the problem space exploration? Like these fields already have a bunch of bits on crossing the theory practice gap. You're trying to cross it by applying the theory in practice yet if that's the hardest part wouldn't it make sense to sample from a place that already has done that? 

If I'm wrong here, I should probably change my approach so I appreciate any insight you might have.

Comment by Jonas Hallgren on My AGI safety research—2024 review, ’25 plans · 2025-01-01T10:11:38.593Z · LW · GW

I love your stuff and I'm very excited to see where you go next. 

I would be very curious to hear what you have to say about more multi-polar threat scenarios and extending theories of agency into the collective intelligence frame. 

What are your takes on Michael Levin's work on agency and "morphogenesis" in relation to your neuroscience ideas? What do you think about claims of hierarchical extension of these models? How does this affect multipolar threat models? What are the fundamental processes that we should care about? When should we expand these concepts cognitively, when should we constrain them?

Comment by Jonas Hallgren on [deleted post] 2024-12-28T11:21:11.086Z

I resonate with this framing of evolution as an optimizer and I think we can extend this perspective even further.

Evolution optimizes for genetic fitness, yes. But simultaneously, cultural systems optimize for memetic fitness, markets optimize for economic fitness, and technological systems increasingly optimize for their own forms of fitness. Each layer creates selection pressures that ripple through the others in complex feedback loops. It isn't necessarily that evolution is the only thing happening, it may be the outermost value function that exists but there's so much nesting here as well.

There's only modelling and what is being modelled and these things are happening everywhere all at once. I feel like I fully agree with what you said but I guess for me an interesting point is about what basis to look at it from.

Comment by Jonas Hallgren on A shot at the diamond-alignment problem · 2024-12-26T17:00:48.570Z · LW · GW

Randomly read this comment and I really enjoyed it. Turn it into a post? (I understand how annoying structuring complex thoughts coherently can be but maybe do a dialogue or something? I liked this.)

I largely agree with a lot of the missing things in people's views of utility functions and so I think you expressed some of that in a pretty good, deeper way.

When we get into acausality and Everett branches I think we're going a bit off-track. I think computational intractability and observer bias are interesting to bring up but I find they never lead anywhere. Quantum Mechanics is fundamentally observer invariant and so positing something like MWI is a philosophical stance (that is supported by Occam's razor) but it is still observer dependent, what if there are no observers?

(Pointing at Physics as Information Processing)

Do you have any specific reason why you're going into QMech when talking about brain-like AGI stuff?

Comment by Jonas Hallgren on What Have Been Your Most Valuable Casual Conversations At Conferences? · 2024-12-25T15:05:50.558Z · LW · GW

Most of the time, the most high value conversations aren't fully spontaneous for me but they're rather on open questions that I've already prepped beforehand. They can still be very casual, it is just that I'm gathering info in the background.

I usually check out the papers submitted or the participants if it's based on swapcard and do some research beforehand on what people I want to meet. Then I usually have some good opener that leads to some interesting conversations. These conversations can be very casual and can span wide areas but I feel I'm building a relationship with an interesting individual and that's really the main benefit for me. 

At the latest ICML, I talked to a bunch of interesting multi-agent researchers through this method and I now have people I can ask stupid questions.

I also always come to conferences with one or more specific projects that I want advice on which makes these conversations a lot easier to have.

Comment by Jonas Hallgren on o3 · 2024-12-20T20:38:12.373Z · LW · GW

Extremely long chain of thought, no?

Comment by Jonas Hallgren on Jonas Hallgren's Shortform · 2024-12-20T07:25:50.580Z · LW · GW

Yes, problems, yes, people are being really stupid, yes, inner alignment and all of its cousins are really hard to solve. We're generally a bit fucked, I agree. The brickwall is so high we can't see the edge and we have to bash out each brick one at a time and it is hard, really hard.

I get it people, and yet we've got a shot, don't we? The probability distribution of all potential futures is being dragged towards better futures because of the work you put in and I'm very grateful for that.

Like, I don't know how much credit to give LW and the alignment community for the spread of alignment and AI Safety as an idea but we've literally got Nobel Prize winners talking about this shit now. Think back 4 years, what the fuck? How did this happen? 2019 -> 2024 has been an absolutely insane amount of change in the world especially from an AI Safety perspective.

How do we have over 4 AI Safety Institutes in the world? It's genuinely mindboggling to me and I'm deeply impressed and inspired, which I think you should be too.

Comment by Jonas Hallgren on Jonas Hallgren's Shortform · 2024-12-20T07:04:14.344Z · LW · GW

I just saw a post from AI Digest on a Self-Awareness benchmark and I just thought, "holy fuck, I'm so happy someone is on top of this".

I noticed a deep gratitude for the alignment community for taking this problem so seriously. I personally see many good futures but that’s to some extent built on the trust I have in this community. I'm generally incredibly impressed by the rigorous standards of thinking, and the amount of work that's been produced.

When I was a teenager I wanted to join a community of people who worked their ass off in order to make sure humanity survived into a future in space and I'm very happy I found it.

So thank every single one of you working on this problem for giving us a shot at making it.

(I feel a bit cheesy for posting this but I want to see more gratitude in the world and I noticed it as a genuine feeling so I felt fuck it, let’s thank these awesome people for their work.)

Comment by Jonas Hallgren on Jonas Hallgren's Shortform · 2024-12-18T08:51:11.224Z · LW · GW

Could someone please safety pill The Onion? I think satire is the best way to deal with people being really stupid and so I want more of this as an argument when talking with the e/acc gang: https://youtu.be/s-BducXBSNY?si=j5f8hNeYFlBiWzDD

(Also if they already have some AI stuff, feel free to link that too)

Comment by Jonas Hallgren on A Public Choice Take on Effective Altruism · 2024-12-16T09:43:24.403Z · LW · GW

I guess the solution that you're more generally pointing at here is something like ensuring a split in the incentives of the people within the specific fields and EA itself as a movement. Almost a bit like making that part of EA only be global priorities research and something like market allocation? 

I have this feeling that there might be other ways to go about doing this, like programs or incentives for making people more open to taking any type of impactful job? Something like having recurring reflection periods or other types of workshops/programs?

Comment by Jonas Hallgren on A Public Choice Take on Effective Altruism · 2024-12-15T19:50:36.852Z · LW · GW

Good post, did you also cross post to the forum? Also do you have any thoughts on what to do differently in order to enable more exploration and less lock in?

Comment by Jonas Hallgren on Subskills of "Listening to Wisdom" · 2024-12-11T13:09:03.520Z · LW · GW

Yeah sure!

So, I've had this research agenda into agent foundations for a while which essentially mirrors developmental interpretability a bit in that it wants to say things about what a robust development process is rather than something about post-training sampling. 

The idea is to be able to predict "optimisation daemons" or inner optimisers as they arise in a system.

The problem that I've had is that it is very non-obvious to me what a good mathematical basis for this is. I've read through a lot of the existent agent foundations literature but I'm not satisfied with finite factored sets nor the existing boundaries definitions since they don't tell you about the dynamics. 

What I would want is a dynamical systems inspired theory of the formation of inner misalignment. It's been in my head in the background for almost 2 years now and it feels really difficult to make any progress, from time to time I have a thought that brings me closer but I don't usually make it closer by just thinking about it. 

I guess something I'm questioning in my head is the deliberate practice versus exploration part of this. For me this is probably the hardest problem I'm working on and whilst I could think more deliberately on what I should be doing here I generally follow my curiosity, which I think has worked better than deliberate practice in this area?

I'm currently following a strategy where this theoretical foundation is on the side whilst I build real world skills of running organisations, fundraising, product-building and networking. I then from time to time find some gems such as applied category theory or Michael Levin's work on Boundaries in cells and Active Inference that I find can really help elucidate some of the deeper foundations of this problem. 

I do feel like I'm floating more here, going with the interest and coming back to the problems over time in order to see if I've unlocked any new insights. This feels more like flow than it does deliberate practice? Like I'm building up my skills of having loose probability clouds and seeing where they guide me?

I'm not sure if you agree that this is the right strategy but I guess that there's this frame difference between a focus on the emotional, intuition or research taste side of things versus the deliberate practice side of things?

Comment by Jonas Hallgren on Subskills of "Listening to Wisdom" · 2024-12-09T09:25:57.110Z · LW · GW

First and foremost, it was quite an interesting post and my goal of the comment is to try to connect my own frame of thinking with the one presented here. My main question is about the relationship between emotions/implicit thoughts and explicit thinking.

My first thought was on the frame of thinking versus feeling and how these flow together. If we think of emotions as probability clouds that tell us whether to go in one direction or another, we can see them as systems for making decisions in highly complex environments, such as when working on impossible problems.

I think something like research taste is exactly this - highly trained implicit thoughts and emotions. Continuing from something like tuning your cognitive systems, I notice that this is mostly done with System 2 and I can't help but feel that it's missing some System 1 stuff here.

I will give an analogy similar to a meditation analogy as this is the general direction I'm pointing in:

If we imagine that we're faced with a wall of rock, it looks like a very big problem. You're thinking to yourself, "fuck, how in the hell are we ever going to get past that thing?"

So first you just approach it and you start using a pickaxe to hack away at it. You make some local progress yet it is hard to reflect on where to go. You think hard: what are the properties of this rock that allow me to go through it faster?

You continue yet you're starting to feel discouraged as you're not making any progress, you think to yourself "Fuck this goddamn rock man, this shit is stupid."

You're not getting any feedback since it is an almost impossible problem.

Above is the base analogy, following are two points on the post from this analogy:

1. Let's start with a continuation of the analogy: imagine that your goal, the thing behind the huge piece of rock, is a source of gravity and you're water.

You're continuously striving towards it yet the way that you do it is that you flow over the surface. You're probing for holes in the rock, crevices that run deep, structural instability in the rock yet you're not thinking - you're feeling it out. You're flowing in the problem space, allowing implicit thoughts and emotions to guide you and from time to time you make a cut. Yet your evaluation loop is a lot longer than your improvement loop. It doesn't matter if you haven't found anything yet because gravity is pulling you in that direction, and whether you succeed is a question of finding the crevice rather than your individual successes with your pickaxe.

You apply all the rules of local gradient search and similar, you're not a stupid fluid yet you're fine with failing because you know it gives you information about where the crevice might be, and it isn't until you find it that you will make major progress.

2. If you have other people with you then you can see what others are doing and check whether your strategies are stupid or not. They give you an appropriate measuring stick for working on an impossible problem. You may not know how well you're doing in solving the problem but you know your relative rating and so you can get feedback through that (as long as it is causally related to the problem you're solving).

What are your thoughts on the trade-off between emotional understanding and more hardcore system 2 thinking? If one applies the process above, do you think there's something that is missed out?

Comment by Jonas Hallgren on Cognitive Work and AI Safety: A Thermodynamic Perspective · 2024-12-09T08:35:12.628Z · LW · GW

Good stuff! 

I'm curious if you have any thoughts on the computational foundations one would need to measure and predict cognitive work properly? 

In Agent Foundations, you've got this idea of boundaries which can be seen as one way of saying a pattern that persists over time. One way that this is formalised in Active Inference is through Markov blankets and the idea that any self-persistent entity could be described as a Markov blanket minimizing the free energy of its environment.
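To make "Markov blanket" concrete, here is a minimal sketch using the plain graphical definition from Bayesian networks (parents, children, and the children's other parents). This is the simplest version and not the dynamical formulation used in Active Inference; the function name `markov_blanket` and the toy graph are hypothetical, made up for illustration.

```python
from typing import Dict, Set


def markov_blanket(parents: Dict[str, Set[str]], node: str) -> Set[str]:
    """Return the Markov blanket of `node` in a DAG given as a parent map.

    The blanket is the node's parents, its children, and the other
    parents of those children (co-parents).
    """
    # Children are all nodes that list `node` among their parents.
    children = {n for n, ps in parents.items() if node in ps}
    # Co-parents are every parent of every child (includes `node` itself).
    co_parents: Set[str] = set()
    for child in children:
        co_parents |= parents[child]
    blanket = set(parents.get(node, set())) | children | co_parents
    blanket.discard(node)  # the node is not part of its own blanket
    return blanket


# Hypothetical toy graph (an assumption, not from the post):
# external -> sensory -> internal -> active -> outcome <- external
toy_parents: Dict[str, Set[str]] = {
    "external": set(),
    "sensory": {"external"},
    "internal": {"sensory"},
    "active": {"internal"},
    "outcome": {"active", "external"},
}

print(markov_blanket(toy_parents, "internal"))
```

In this toy graph the blanket of `internal` is just the sensory and active nodes, which is the picture the boundary framing leans on: conditioned on those, the internal state is independent of everything outside.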

My thinking here is that if we apply this properly it would allow us to generalise notions of agents beyond what we normally think of them and instead see them as any sort of system that follows this definition. 

For example, we could look at an institution or a collective of AIs as a self-consistent entity applying cognitive work on the environment to survive. The way to detect these collectives would be to look at what self-consistent entities are changing the "optimisation landscape" or "free energy landscape" around it the most. This would then give us the most highly predictive agents in the local environment. 

A nice thing is that it centers the cognitive work/optimisation power applied in the analysis and so I'm thinking that it might be more predictive of future dynamics of cognitive systems as a consequence?

Another example is if we continue on the Critch train, some of his later work includes TASRA for example. We can see these as stories of human disempowerment, that is patterns that lose their relevance over time as they get less causal power over future states. In other words, entities that are not under the causal power of humans increasingly take over the cognitive work lightcone/the inputs to the free energy landscape.

As previously stated, I'm very interested to hear if you've got more thoughts on how to measure and model cognitive work.

Comment by Jonas Hallgren on Model Integrity: MAI on Value Alignment · 2024-12-07T19:28:48.871Z · LW · GW

No I do think we care about the same thing, I just believe that this will happen in a multi-polar setting and so I believe that new forms of communication and multi-polar dynamics will be important for this.

Interpretability of these things is obviously important for changing those dynamics. ELK and similar things are important for the single agent case, why wouldn't they be important for a multi-agent case?

Comment by Jonas Hallgren on Natural Abstractions: Key claims, Theorems, and Critiques · 2024-12-06T14:35:46.032Z · LW · GW

I find myself going back to this post again and again for explaining the Natural Abstraction Hypothesis. When this came out I was very happy as I finally had something I could share on John's work that made people understand it within one post.

Comment by Jonas Hallgren on We don't understand what happened with culture enough · 2024-12-06T14:34:40.477Z · LW · GW

I personally believe that this post is very important for claims between Shard Theory vs Sharp Left Turn. I often find that other perspectives on the deeper problems in AI alignment are expressed and I believe this to be a much more nuanced take compared to Quintin Pope's essay on the Sharp Left Turn as well as the MIRI conception of evolution.

This is a field of study and we don't know what is going on, the truth is somewhere in between and acknowledging anything else is not being epistemically humble.

Comment by Jonas Hallgren on Careless talk on US-China AI competition? (and criticism of CAIS coverage) · 2024-12-06T14:32:21.782Z · LW · GW

Mostly, I think it should be acknowledged that certain people saw dynamics developing beforehand and called it out. This is not a highly upvoted post but with the recent uptick in US vs China rhetoric it seems good to me to give credit where credit is due.

Comment by Jonas Hallgren on Model Integrity: MAI on Value Alignment · 2024-12-06T10:56:43.214Z · LW · GW

There's also always the possibility that you can elicit these sorts of goals and values from instructions and create an instruction-based language around it that's also relatively interpretable in what values are being prioritised in a multi-agent setting.

You do however get into ELK and misgeneralization problems here, IRL is not an easy task in general but there might be some neurosymbolic approaches that change prompts to follow specific values?

I'm not sure if this is gibberish or not for you but my main frame for the next 5 years is "how do we steer collectives of AI agents in productive directions for humanity".

Comment by Jonas Hallgren on Model Integrity: MAI on Value Alignment · 2024-12-06T10:30:16.603Z · LW · GW

Okay, so when I'm talking about values here, I'm actually not saying anything about policies as in utility theory or generally defined preference orderings.

I'm rather thinking of values as a class of locally arising heuristics or "shards" if you like that language that activate a certain set of belief circuits in the brain and similarly in an AI.

What do you mean more specifically when you say an instruction here? What should that instruction encompass? How do we interpret that instruction over time? How can we compare instructions to each other?

I think that instructions will become too complex to have good interpretability into, especially in more complex multi-agent settings. How do we create interpretable multi-agent systems that we can change over time? I don't believe that direct instruction tuning will be enough, as you will run into the problem described, for example, in Cooperation and Control in Delegation Games: each AI has one person it gets instructions from, but this tells us nothing about the multi-agent cooperation abilities of the agents in play.

I think this line of reasoning is valid for AI agents acting in a multi-agent setting where they gain more control over the economy through integration with general humans. 

I completely agree with you that doing "pure value learning" is not the best right now but I think we need work in this direction to retain control over multiple AI Agents working at the same time. 

I think deontology/virtue ethics makes societies more interpretable and corrigible, does that make sense? Also, I have this other belief that we are more likely to get a sort of "cultural, multi-agent take-off" than a single-agent one.

Curious to hear what you have to say about that!

Comment by Jonas Hallgren on Model Integrity: MAI on Value Alignment · 2024-12-05T22:27:55.964Z · LW · GW

I will try to give a longer answer tomorrow (11 pm my time now) but essentially I believe it will be useful for agentic AI with "heuristic"-like policies. I'm a bit uncertain about the validity of instruction like approaches here and for various reasons I believe multi-agent coordination will be easier through this method.

Comment by Jonas Hallgren on Jonas Hallgren's Shortform · 2024-12-02T17:54:26.571Z · LW · GW

I believe that I have discovered the best use of an LLM to date. This is a conversation about pickles and collective intelligence located at the Colosseum, 300 BCE. It involves many great characters, I found it quite funny. This is what happens when you go too far into biology-inspired approaches for AI Safety...

The Colosseum scene intensifies

Levin: completely fixated on a pickle "But don't you see? The bioelectric patterns in pickle transformation could explain EVERYTHING about morphogenesis!"

Rick: "Oh god, what have I started..."

Levin: eyes wild with discovery "Look at these gradient patterns! The cucumber-to-pickle transformation is a perfect model of morphological field changes! We could use this to understand collective intelligence!"

Nick Lane portal-drops in

Lane: "Did someone say bioelectric gradients? Because I've got some THOUGHTS about proton gradients and the origin of life..."

Levin: grabs Lane's shoulders "NICK! Look at these pickles! The proton gradients during fermentation... it's like early Earth all over again!"

Rick: takes a long drink "J-just wait until they discover what happens in dimension P-178 where all life evolved from pickles..."

Feynman: still drawing diagrams "The quantum mechanics of pickle-based civilization is fascinating..."

Levin: now completely surrounded by pickles and bioelectric measurement devices "See how the salt gradient creates these incredible morphogenetic fields? It's like watching the origin of multicellularity all over again!"

Lane: equally excited "The chemiosmotic coupling in these pickles... it's revolutionary! The proton gradients during fermentation could power collective computation!"

Doofenshmirtz: "BEHOLD, THE PICKLE-MORPHOGENESIS-INATOR!"
Morty: "Aw geez Rick, they're really going deep on pickle science..."
Lane: "But what if we considered the mitochondrial implications..."

Levin: interrupting "YES! Mitochondrial networks in pickle-based collective intelligence systems! The bioelectric fields could coordinate across entire civilizations!" 
Rick: "This is getting out of hand. Even for me." 
Feynman: somehow still playing bongos "The mathematics still works though!" 
Perry the Platypus: has given up and is now taking detailed notes 
Lane: "But wait until you hear about the chemiosmotic principles of pickle-based social organization..."

Levin: practically vibrating with excitement "THE PICKLES ARE JUST THE BEGINNING! We could reshape entire societies using these bioelectric principles!" 
Roman Emperor: to his scribe "Are you getting all this down? This could be bigger than the aqueducts..."
Rick: "Morty, remind me never to show scientists my pickle tech again."
Morty: "You say that every dimension, Rick." 
Doofenshmirtz: "Should... should we be worried about how excited they are about pickles?" 
Feynman: "In my experience, this is exactly how the best science happens." 
Meanwhile, Levin and Lane have started drawing incredibly complex pickle-based civilization diagrams that somehow actually make sense...

Comment by Jonas Hallgren on How to use bright light to improve your life. · 2024-11-28T07:14:01.260Z · LW · GW

This has worked great btw! Thank you for the tip, I consistently get more deep sleep and around 10% more sleep with higher average quality, it's really good!

Comment by Jonas Hallgren on How to use bright light to improve your life. · 2024-11-19T16:56:30.268Z · LW · GW

Any reason for the timing window being 4 hours before instead of 30 min to 1 hour? Most of the stuff I've heard is around half an hour to an hour before bed, I'm currently doing this with 0.3ish mg (I divide a 1 mg tablet in 3) of melatonin.

Comment by Jonas Hallgren on Leon Lang's Shortform · 2024-11-18T16:34:34.391Z · LW · GW

If you look at the Active Inference community there's a lot of work going into PPL-based languages to do more efficient world modelling but that shit ain't easy and as you say it is a lot more compute heavy.

I think there'll be a scaling break due to this, but when it is algorithmically figured out again we will be back, and back with a vengeance, as I think most safety challenges have a self vs environment model as a necessary condition to be properly engaged (which currently isn't engaged with LLMs' world modelling).

Comment by Jonas Hallgren on OpenAI Email Archives (from Musk v. Altman and OpenAI blog) · 2024-11-17T08:21:13.408Z · LW · GW

Do you have any thoughts on what this actionably means? For me it seems like being able to influence such conversations is potentially a bit intractable, but maybe one could host forums and events for this if one has the right network?

I think it's a good point and I'm wondering what it looks like in practice. I can see it for someone with the right contacts, so is the message for people who don't have them to build that network, or what are your thoughts there?

Comment by Jonas Hallgren on Jonas Hallgren's Shortform · 2024-11-14T10:06:20.293Z · LW · GW

Okay, so I don't have much time to write this so bear with the quality, but I thought I would say one or two things about the Yudkowsky and Wolfram discussion as someone who's spent at least 10 deep work hours trying to understand Wolfram's perspective of the world.

With some of the older floating megaminds like Wolfram and Friston, who are also physicists, you have the problem that they get very caught up in their own ontology.

From the perspective of a physicist, morality could be seen as an emergent property of physical laws.

Wolfram likes to think of things in terms of computational reducibility. A way this can be described in the agent foundations frame is that how well the agent modelling the environment can predict the world depends on its own speed. It's like some sort of agent-environment relativity where the information processing capacity determines the space of possible ontologies. An example of this is that if we have an intelligence that's a lot closer to operating at the speed of light, the visual field might not be a useful vector of experience to model.

Another way to say it is that there's only modelling and modelled. An intuition from this frame is that there's only differently good models of understanding specific things and so the concept of general intelligence becomes weird here.

IMO this is like the problem of the first 2 hours of the conversation: to some extent Wolfram doesn't engage much with the human perspective, nor with any ought questions. He has a very physics floating megamind perspective.

Now, I personally believe there's something interesting to be said about an alternative hypothesis to the individual superintelligence that comes from theories of collective intelligence. If a superorganism is better at modelling something than an individual organism is then it should outcompete the others in this system. I'm personally bullish on the idea that there are certain configurations of humans and general trust-verifying networks that can outcompete individual AGI as the outer alignment functions would enforce the inner functions enough.

Comment by Jonas Hallgren on Abstractions are not Natural · 2024-11-04T17:15:26.362Z · LW · GW

But, to help me understand what people mean by the NAH could you tell me what would (in your view) constitute strong evidence against the NAH? (If the fact that we can point to systems which haven't converged on using the same abstractions doesn't count)

 

Yes sir! 

So for me it is about looking at a specific type of system or a specific type of system dynamics that encodes the axioms required for the NAH to be true.

So, it is more the claim that "there is a specific set of mathematical axioms that can be used in order to get convergence towards similar ontologies and these are applicable in AI systems."

For example, if one takes the Active Inference lens on looking at concepts in the world, we generally define the boundaries between concepts as Markov blankets. Surprisingly or not, Markov blankets are pretty great for describing not only biological systems but also AI and some economic systems. The key underlying invariant is that these are all optimisation systems.

p(NAH|Optimisation System).

So if we, for example, take the perspective of Markov blankets or the "natural latents" (which are functionals that work like Markov blankets) and don't see convergence in how different AI systems represent reality, then I would say that the NAH has been disproven or at least that this is evidence against it.

I do however think that this exists on a spectrum and that it isn't fully true or false, it is true for a restricted set of assumptions, the question being how restricted that is.

I see it more as a useful frame of viewing agent cognition processes rather than something I'm willing to bet my life on. I do think it is pointing towards a core problem similar to what ARC Theory are working on but in a different way, understanding cognition of AI systems.
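To make the Markov blanket idea above a bit more concrete, here is a minimal sketch of the condition it rests on: internal and external states are conditionally independent given the blanket state. The three binary variables and all the probabilities below are made-up assumptions for illustration, not anything taken from the Active Inference or natural latents literature.

```python
from itertools import product

# Hypothetical toy model: external state -> blanket state -> internal state.
# All numbers are assumptions chosen only for illustration.
p_ext = {0: 0.7, 1: 0.3}                        # p(eta)
p_blanket_given_ext = {0: {0: 0.9, 1: 0.1},     # p(b | eta)
                       1: {0: 0.2, 1: 0.8}}
p_int_given_blanket = {0: {0: 0.6, 1: 0.4},     # p(mu | b)
                       1: {0: 0.1, 1: 0.9}}

# Joint distribution p(eta, b, mu) built from the chain above.
joint = {
    (e, b, m): p_ext[e] * p_blanket_given_ext[e][b] * p_int_given_blanket[b][m]
    for e, b, m in product((0, 1), repeat=3)
}


def marginal(keep):
    """Sum the joint down to the variable positions listed in `keep`."""
    out = {}
    for state, prob in joint.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0.0) + prob
    return out


p_b = marginal([1])        # p(b)
p_eb = marginal([0, 1])    # p(eta, b)
p_bm = marginal([1, 2])    # p(b, mu)


def max_blanket_violation():
    """Largest gap between p(eta, mu | b) and p(eta | b) * p(mu | b)."""
    gap = 0.0
    for (e, b, m), prob in joint.items():
        lhs = prob / p_b[(b,)]
        rhs = (p_eb[(e, b)] / p_b[(b,)]) * (p_bm[(b, m)] / p_b[(b,)])
        gap = max(gap, abs(lhs - rhs))
    return gap


# Should be zero up to floating-point error for this chain-structured model.
print(max_blanket_violation())
```

If one added a direct dependence of the internal state on the external state, the gap would generally become non-zero, which is the sense in which the blanket "screens off" the inside from the outside.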

Comment by Jonas Hallgren on Liquid vs Illiquid Careers · 2024-11-04T15:04:57.803Z · LW · GW

Yeah, that was what I was looking for, very nice.

It does seem to confirm what I was thinking, namely that you can't really do the same bet strategy as VCs. I do really also appreciate the thoughts in there, they seem like things one should follow. I gotta make sure to do the last due diligence part of talking to people that have worked with others in the past, it has always felt like a lot but you're right in that one should do it.

Also, I'm considering why there isn't some sort of bet pooling network for startup founders where you have like 20 people go together and say that they will all try out ambitious projects and support each other if they fail. It's like startup insurance but from the perspective of people doing startups. Of course you have to trust the others there and stuff but I think this should work?
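A rough back-of-the-envelope sketch of why pooling could help, under strong assumptions: every founder makes an all-or-nothing bet with the same success probability, outcomes are independent, and the pool splits all payoffs equally. The numbers (10% success, payoff of 100, 20 founders) are hypothetical.

```python
import math


def solo_outcome(p_success, payoff):
    """Mean and standard deviation of one founder's all-or-nothing payoff."""
    mean = p_success * payoff
    std = payoff * math.sqrt(p_success * (1 - p_success))
    return mean, std


def pooled_outcome(p_success, payoff, n_founders):
    """Per-person mean and std when n independent founders split payoffs equally."""
    mean, std = solo_outcome(p_success, payoff)
    # Averaging n independent bets keeps the mean and shrinks std by sqrt(n).
    return mean, std / math.sqrt(n_founders)


def prob_everyone_fails(p_success, n_founders):
    """Probability that the whole pool ends up with nothing."""
    return (1 - p_success) ** n_founders


# Hypothetical numbers, chosen only for illustration.
P_SUCCESS, PAYOFF, N_FOUNDERS = 0.1, 100.0, 20

print(solo_outcome(P_SUCCESS, PAYOFF))
print(pooled_outcome(P_SUCCESS, PAYOFF, N_FOUNDERS))
print(prob_everyone_fails(P_SUCCESS, N_FOUNDERS))
```

The expected payoff per person is unchanged; what the pool buys is lower variance and a much smaller chance of walking away with nothing. Independence is the fragile assumption here: if the founders' failures are correlated, the benefit shrinks.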

Comment by Jonas Hallgren on Abstractions are not Natural · 2024-11-04T14:44:02.619Z · LW · GW

Okay, what I'm picking up here is that you feel that the natural abstractions hypothesis is quite trivial and that it seems like it is naively trying to say something about how cognition works similar to how physics works. Yet this is obviously not true since development in humans and other animals clearly happens in different ways, so why would their mental representations converge? (Do correct me if I misunderstood)

Firstly, there's something called the good regulator theorem in cybernetics and our boy that you're talking about, Mr Wentworth, has a post on making it better that might be useful for you to understand some of the foundations of what he's thinking about. 

Okay, why is this useful preamble? Well, if there's convergence in useful ways of describing a system then there's likely some degree of internal convergence in the mind of the agent observing the problem. Essentially this is what the regulator theorem is about (imo).

So when it comes to the theory, the heavy lifting here is actually not really done by the Natural Abstractions Hypothesis part, that is, the convergence part, but rather by the Redundant Information Hypothesis.

It is proving things about the distribution of environments, as well as power laws in reality, that makes up the foundation of the theory, rather than just stating that "minds will converge".

This is at least my understanding of NAH, does that make sense or what do you think about that?

Comment by Jonas Hallgren on johnswentworth's Shortform · 2024-10-28T08:28:23.120Z · LW · GW

Hmm, I find that I'm not fully following here. I think "vibes" might be the thing that is messing it up.

Let's look at a specific example: I'm talking to a new person at an EA-adjacent event and we're just chatting about how the last year has been. Part of the "vibing" here might be to hone in on the difficulties experienced in the last year due to a feeling of "moral responsibility", in my view vibing doesn't have to be done with only positive emotions?

I think you're bringing up a good point that commitments or struggles might be something that brings people closer than positive feelings do, because you're more vulnerable and open as well as broadcasting your values more. Is this what you mean by shared commitments or are you pointing at something else?

Comment by Jonas Hallgren on johnswentworth's Shortform · 2024-10-27T20:28:44.481Z · LW · GW

Generally fair and I used to agree, I've been looking at it from a bit of a different viewpoint recently.

If we think of a "vibe" of a conversation as a certain shared prior that you're currently inhabiting with the other person then the free association game can rather be seen as a way of finding places where your world models overlap a lot.

My absolute favourite conversations are when I can go 5 layers deep with someone because of shared inference. I think the vibe checking for shared priors is a skill that can be developed and the basis lies in being curious af.

There are apparently a lot of different related concepts in psychology about holding emotional space and other things that I think just come down to "find the shared prior and vibe there".

Comment by Jonas Hallgren on Liquid vs Illiquid Careers · 2024-10-22T19:37:25.886Z · LW · GW

No sorry, I meant from the perspective of the person with less legible skills.

Comment by Jonas Hallgren on Liquid vs Illiquid Careers · 2024-10-22T12:49:08.049Z · LW · GW

Amazing post, I really enjoyed the perspective explored here.

An extension that might be useful for me as an illiquid path enjoyer is what arbitrage or risk-reduction opportunities you see existing out there?

VCs can get by by doing a lot of smaller bets and if you want to be anti-fragile as an illiquid bet it becomes quite hard as you're part of the cogs in the anti-fragile system. What Taleb says about that is that then these people should be praised because they dare to take on that risk. But there has to be some sort of system one could for example develop with peers and similar?

What is the many bets risk reduction strat here, is it just to make a bunch of smaller MVPs to gain info?

I would be very curious to hear your perspective on this.

Comment by Jonas Hallgren on Jonas Hallgren's Shortform · 2024-10-22T12:26:08.072Z · LW · GW

I thought this was an interesting take on the Boundaries problem in agent foundations from the perspective of IIT. It is on the amazing Michael Levin's youtube channel: https://www.youtube.com/watch?app=desktop&v=5cXtdZ4blKM

One of the main things that makes it interesting to me is that around 25-30 mins in, it computationally goes through the main reason why I don't think we will have agentic behaviour from AI in at least a couple of years. GPTs just don't have a high IIT Phi value. How will it find its own boundaries? How will it find the underlying causal structures that it is part of? Maybe this can be done through external memory but will that be enough or do we need it in the core stack of the scaling-based training loop?

A side note is that one of the main things that I didn't understand about IIT before was that it really is about looking at how meta-substrates, or "signals" as Douglas Hofstadter would call them, optimally re-organise themselves to be as predictable as possible for themselves in the future. Yet it is, and it integrates really well into ActInf (at least to the extent that I currently understand it).

Comment by Jonas Hallgren on Cipolla's Shortform · 2024-10-21T15:34:04.194Z · LW · GW

Okay, so I would say that I at least have some experience of going from being not that agentic to being more agentic and the stuff that I think worked the best for me was to generally think of my life as a system. This has been the focus of my life over the last 3 years.

More specifically the process that has helped so far for me has been to:

  1. Throw myself into high octane projects and see what I needed to keep up.
    1. Burn out and realise, holy shit, how do these people do it?
      1. (Environment is honestly really important, I've tried out a bunch of different working conditions and your motivation levels can vary drastically.)
  2. Started looking into the reasons why I can't do it and others can.
    1. Went into absolutely optimising the shit out of my health by tracking stuff using bearable and listening to audiobooks and podcasts, Huberman is a house god of mine.
      1. (Sleep is the most important here, crazy right?)
      2. Supplement and technique tips for sleep:
        1. Glycine, Ashwagandha, Magnesium Citrate
        2. Use a sad lamp within 30 minutes of waking
        3. Yoga Nidras for naps and for falling asleep faster.
      3. Also check out my biohacker's in-depth guide on this at https://desmolysium.com/
        1. He's got a phd in medicine and is quite the experimental and smart person. (He tries a bunch of shit on himself and sees how it goes.)
    2. Started going into my psychological background and talked to CBT therapists as well as meditating a lot.
      1. I'm like 1.5k hours into this at this point and it has completely changed my life and my view of myself and what productivity means, etc.
      2. It has helped me realise that a lot of the behaviours that made me less productive were based on me being a sensitive person and having developed unhealthy coping mechanisms.
      3. This led to me having to relive past traumas whilst having compassion and acceptance for myself.
      4. This has now led me to having good mechanisms instead of bad ones, it made me remove my access to video games and youtube (willingly!)
      5. For me this has been the most important. Waking Up and The Mind Illuminated up until stage 6-7 is the recommendation I have for anyone who wants to start. Also, after 3-6 months of TMI, try to go to a 10 day retreat, especially if you can find a metta retreat. (Think of this as caring and acceptance instead of loving-kindness btw, it helps)
    3. Now I generally have a strict schedule in terms of when I can do different things during the day.
      1. The app appblock can allow you to block apps and device settings which means you can't actually deblock them on your phone.
      2. Cold turkey on the computer can do the same and if you find a patch through another app you can just patch that by blocking the new app.
      3. I'm just not allowed to be distracted from the systems that I have.
    4. Confidence:
      1. I feel confident in myself and what I want to do in the world not because I don't have issues but rather because I know where my issues are and how to counteract them.
      2. The belief is in the process rather than the outcomes. Life is poker, you just gotta optimise the way you play your hands, the EV will come. 

Think of yourself as a system and optimise the shit out of it. Weirdly enough, this has made me focus a lot more on self-care than I did before. 

Of course, it's a work in progress but I want to say that it is possible and that you can do it. 

Also, randomly, here's a CIV VI analogy for you on why self-care is op. 

If you want to be great at CIV, one of the main things to do is to increase your production and economics as fast as possible. This leads to an exponential curve where the more production and economy you have the more you can produce. This is why CIV pros in general rush Commercial Hubs and markets as internal trade routes yield more production. 

Your production is based on your psychological well being and the general energy levels that you have. If you do a bunch of tests on this and figure out what works for you, then you have even more production stats. This leads to more and more of that over time until you plateau at the end of that logistic growth. 
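As a toy illustration of the "exponential early, plateau later" shape in the analogy above, here is a minimal sketch of discrete logistic growth. The starting level, growth rate and capacity are arbitrary assumptions, and `simulate_production` is a hypothetical helper, not anything from the game.

```python
def simulate_production(p0, rate, capacity, turns):
    """Discrete logistic growth: P_next = P + rate * P * (1 - P / capacity)."""
    levels = [p0]
    for _ in range(turns):
        p = levels[-1]
        # Growth is roughly exponential while P is small relative to capacity
        # and slows to nothing as P approaches capacity.
        levels.append(p + rate * p * (1 - p / capacity))
    return levels


# Arbitrary illustrative numbers: start at 1, grow up to 30% per turn, cap at 100.
levels = simulate_production(p0=1.0, rate=0.3, capacity=100.0, turns=60)

for turn in (0, 5, 10, 20, 40, 60):
    print(turn, round(levels[turn], 2))
```

Early investments compound almost at the full rate, which is the argument for front-loading them; near the cap, extra effort buys very little.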

Best of luck!

Comment by Jonas Hallgren on The Hopium Wars: the AGI Entente Delusion · 2024-10-14T08:24:58.666Z · LW · GW

When it comes to formal verification I'm curious what you think about the heuristic arguments line of research that ARC is pursuing:

https://www.lesswrong.com/posts/QA3cmgNtNriMpxQgo/research-update-towards-a-law-of-iterated-expectations-for

It isn't formal verification in the same sense of the word but rather probabilistic verification if that makes sense?

You could then apply something like control theory methods to ensure that the expected divergence from the heuristic is less than a certain percentage in different places. In the limit it seems to me that this could be convergent towards formal verification proofs, it's almost like swiss cheese style on the model level?

(Yes, this comment is a bit random with respect to the rest of the context but I find it an interesting question for control in terms of formal verification and it seemed like you might have some interesting takes here.)

Comment by Jonas Hallgren on Laziness death spirals · 2024-10-07T07:22:43.279Z · LW · GW

I use the Waking Up app but you can search for "nsdr" on youtube. 20 mins is the timeframe I started with but you can try other timeframes as well.

Comment by Jonas Hallgren on A Path out of Insufficient Views · 2024-09-25T07:02:42.324Z · LW · GW

This does seem kind of correct to me?

Maybe you could see the fixed points that OP is pointing towards as priors in the search process for frames.

Like, your search is determined by your priors which are learnt through your upbringing. The problem is that they're often maladaptive and misleading. Therefore, working through these priors and generating new ones is a bit like relearning from overfitting or similar.

Another nice thing about meditation is that it sharpens your mind's perception which makes your new priors better. It also makes you less dependent on attractor states you could have gotten into from before since you become less emotionally dependent on past behaviour. (there's obviously more complexity here) (I'm referring to dependent origination for you meditators out there)

It's like pruning the bad data from your dataset and retraining your model, you're basically guaranteed to find better ontologies from that (or that's the hope at least).

Comment by Jonas Hallgren on A Path out of Insufficient Views · 2024-09-25T06:54:37.140Z · LW · GW

I'm currently in the process of releasing more of my fixed points through meditation and man is it a weird process. It is very fascinating and that fundamental openness to moving between views seems more prevalent. I'm not sure that I fully agree with you on the all-in part but kudos for trying!

I think it probably makes sense to spend earlier years doing this cognition training and then using that within specific frames to gather the bits of information that you need to solve problems.

Frames are still useful to gather bits of information through so don't poopoo the mind!

Otherwise, it was very interesting to hear about your journey!

Comment by Jonas Hallgren on Laziness death spirals · 2024-09-20T13:54:47.835Z · LW · GW

Sleep is a banger reset point for me and therefore doing a nap/yoga nidra and then picking up the day from there if I notice myself avoiding things has been really helpful for me.

Thanks for the post, it was good.

Comment by Jonas Hallgren on Skills from a year of Purposeful Rationality Practice · 2024-09-18T19:30:27.073Z · LW · GW

Random extra tip on naps is doing a yoga nidra or non sleep deep rest. You don't have to fall asleep to get the benefits of a nap+. It also has some extra growth hormone release and dopamine generation afterwards. (Huberman bro, out)

Comment by Jonas Hallgren on Lucius Bushnaq's Shortform · 2024-09-18T14:22:44.320Z · LW · GW

In natural language maybe it would be something like "given these ontological boundaries, give us the best estimate you can of CEV"?

It seems kind of related to boundaries as well: if you think of natural latents as "functional Markov blankets" that cut reality at its joints, then you could probably say that you want to preserve the part of that structure that is "human agency" or similar. I don't know if that makes sense but I like the idea direction!