Posts

Explaining the AI Alignment Problem to Tibetan Buddhist Monks 2024-03-07T09:00:32.271Z
Anomalous Concept Detection for Detecting Hidden Cognition 2024-03-04T16:52:52.568Z
Hidden Cognition Detection Methods and Benchmarks 2024-02-26T05:31:00.714Z
Notes on Internal Objectives in Toy Models of Agents 2024-02-22T08:02:39.556Z
Internal Target Information for AI Oversight 2023-10-20T14:53:00.284Z
Potential alignment targets for a sovereign superintelligent AI 2023-10-03T15:09:59.529Z
High-level interpretability: detecting an AI's objectives 2023-09-28T19:30:16.753Z
[Linkpost] Frontier AI Taskforce: first progress report 2023-09-07T19:06:26.126Z
Aligned AI via monitoring objectives in AutoGPT-like systems 2023-05-24T15:59:13.621Z
Towards a solution to the alignment problem via objective detection and evaluation 2023-04-12T15:39:31.662Z
Decision Transformer Interpretability 2023-02-06T07:29:01.917Z
Paul Colognese's Shortform 2023-02-02T19:15:11.205Z
Auditing games for high-level interpretability 2022-11-01T10:44:07.630Z
Deception?! I ain’t got time for that! 2022-07-18T00:06:15.274Z

Comments

Comment by Paul Colognese (paul-colognese) on Explaining the AI Alignment Problem to Tibetan Buddhist Monks · 2024-03-08T04:05:00.914Z · LW · GW

Interesting! I'm working on a project exploring something similar but from a different framing. I'll give this view some thought, thanks!

Comment by Paul Colognese (paul-colognese) on Anomalous Concept Detection for Detecting Hidden Cognition · 2024-03-05T06:08:05.288Z · LW · GW

Thanks, should be fixed now.

Comment by Paul Colognese (paul-colognese) on Charbel-Raphaël and Lucius discuss Interpretability · 2023-11-14T13:01:49.838Z · LW · GW

Thanks, that's the kind of answer I was looking for.

Comment by Paul Colognese (paul-colognese) on Charbel-Raphaël and Lucius discuss Interpretability · 2023-11-06T14:43:43.159Z · LW · GW

Interesting discussion; thanks for posting!

I'm curious about what elementary units in NNs could be.

“the elementary units are not the neurons, but some other thing.”

I tend to model NNs as computational graphs where activation spaces/layers are the nodes and weights/tensors are the edges of the graph. Under this framing, my initial intuition is that the elementary units will be contained in either the activation spaces or the weights.

There does seem to be empirical evidence that features of the dataset are represented as linear directions in activation space.
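To make the "linear direction" picture concrete, here is a minimal sketch of the kind of probing experiment I have in mind (my own toy example, not taken from any particular paper: it uses synthetic activations rather than a real model, and assumes numpy and scikit-learn are available). A binary feature is embedded along a single fixed direction in a hypothetical activation space, and a logistic-regression probe recovers that direction.

```python
# Toy sketch: a binary dataset feature encoded as a single linear direction
# in a (synthetic) activation space, recovered by a linear probe.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_act = 256        # dimension of the activation space (hypothetical)
n_samples = 2000

# Pretend the network encodes the feature along one fixed unit direction.
true_direction = rng.normal(size=d_act)
true_direction /= np.linalg.norm(true_direction)

labels = rng.integers(0, 2, size=n_samples)              # feature present / absent
noise = rng.normal(scale=1.0, size=(n_samples, d_act))   # stand-in for everything else the layer encodes
activations = noise + 3.0 * labels[:, None] * true_direction

# A linear probe recovers the feature direction from the activations.
probe = LogisticRegression(max_iter=1000).fit(activations, labels)
learned_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

print("probe accuracy:", probe.score(activations, labels))
print("cosine similarity with true direction:", float(learned_direction @ true_direction))
```

The same setup generalizes to the subspace question below: replace the single direction with a low-rank subspace and ask how many probe directions are needed to read the feature off.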

I'd be interested in any thoughts regarding what other forms elementary units in NNs could take. In particular, I'd be surprised if they aren't represented in subspaces of activation spaces.

Comment by Paul Colognese (paul-colognese) on High-level interpretability: detecting an AI's objectives · 2023-10-05T19:32:07.647Z · LW · GW

Thanks for pointing this out. I'll look into it and modify the post accordingly.

Comment by Paul Colognese (paul-colognese) on High-level interpretability: detecting an AI's objectives · 2023-09-29T11:11:17.946Z · LW · GW

With ideal objective detection methods, the inner alignment problem is solved (or partially solved in the case of non-ideal objective detection methods), and governance would be needed to regulate which objectives are allowed to be instilled in an AI (i.e., government does something like outer alignment regulation).

Ideal objective oversight essentially allows an overseer to instill whatever objectives it wants the AI to have. Therefore, if the overseer includes the government, the government can influence whatever target outcomes the AI pursues.

So, practically, this means that governance policies would require the government to have access to the objective detection results, either directly or indirectly through the AI labs.

Comment by Paul Colognese (paul-colognese) on Aligned AI via monitoring objectives in AutoGPT-like systems · 2023-05-24T22:01:05.691Z · LW · GW

Thanks for the response; it's useful to hear that we came to the same conclusions. I quoted your post in the first paragraph.

Thanks for bringing Fabien's post to my attention! I'll reference it. 

Looking forward to your upcoming post.

Comment by Paul Colognese (paul-colognese) on Towards a solution to the alignment problem via objective detection and evaluation · 2023-04-20T06:10:05.383Z · LW · GW

Interesting! Quick thought: I feel as though it over-compressed the post compared to the summary I used. Perhaps you could tweak things to generate multiple summaries of varying lengths.

Comment by Paul Colognese (paul-colognese) on Towards a solution to the alignment problem via objective detection and evaluation · 2023-04-16T06:06:16.208Z · LW · GW

Thanks for the feedback! I guess the intention of this post was to lay down the broad framing/motivation for upcoming work that will involve looking at the more concrete details.

I do resonate with the feeling that, as it stands, the post feels a bit empty and that the effort could have been better spent elsewhere.

Comment by Paul Colognese (paul-colognese) on Paul Colognese's Shortform · 2023-02-02T19:15:11.545Z · LW · GW

My current high-level research direction

It’s been about a year since I became involved in AI Alignment. Here is a super high-level overview of the research direction I intend to pursue over the next six or so months.

  • We’re concerned about the possibility of building AI systems that produce “bad behavior”, either during training or in deployment.
  • We define “irreversibly bad behavior” to include actions that inhibit an overseer’s ability to monitor and control the system. This includes removing an off-switch and deceptive behavior.
  • To prevent bad behavior from occurring, the overseer needs to make relevant predictive observations of systems in training (and in deployment), that is, observations that allow the overseer to predict the relevant future behavior of the systems produced by the training process.
  • Intuition: Certain types of bad behavior are caused by a system first having an “intention/objective” that points towards a bad outcome. Irreversibly bad behavior, such as deception and removing an off-switch, fits this description.
  • If an overseer can detect such intentions/objectives before they lead to bad behavior, they can shut down systems before this behavior is realized.
  • I want to investigate this conceptually, theoretically, and empirically.
  • One hope is that we can develop interpretability tools that allow an overseer to robustly detect and evaluate the objectives of any system before the objectives can be realized as outcomes (a toy sketch of this kind of oversight loop follows this list).
  • This seems like a big ask, but I believe that making progress in this direction will be valuable nonetheless. 
  • I (and others I’m working with) aim to study toy “optimizing systems” via interpretability.
  • It’s not clear to me at this stage what kind of conceptual/theory work will be helpful other than general de-confusion work.
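As a toy illustration of the oversight loop I have in mind (a sketch of my own, not an actual experimental setup: the internal states are synthetic, and names like internal_state, oversee, and DISALLOWED are hypothetical), an "overseer" fits an objective-detection probe on labelled past episodes and halts a new episode as soon as the probe predicts a disallowed objective, before that objective is realized in behavior. It assumes numpy and scikit-learn.

```python
# Toy sketch: probe a toy optimizing system's internal state early in an
# episode to predict which objective it is pursuing, and halt if that
# objective is disallowed -- before the objective is realized as behavior.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
d_state, n_objectives, n_train = 64, 4, 4000
DISALLOWED = {3}  # hypothetical "bad" objective, e.g. disabling the off-switch

# Stand-in for a learned representation: each objective gets a fixed embedding,
# and the system's internal state is that embedding plus noise.
objective_embeddings = rng.normal(size=(n_objectives, d_state))

def internal_state(objective: int) -> np.ndarray:
    return objective_embeddings[objective] + rng.normal(scale=0.5, size=d_state)

# The overseer fits an objective-detection probe on labelled past episodes.
train_objectives = rng.integers(0, n_objectives, size=n_train)
train_states = np.stack([internal_state(o) for o in train_objectives])
probe = LogisticRegression(max_iter=1000).fit(train_states, train_objectives)

def oversee(objective: int) -> str:
    """Inspect the system's internal state at step 1 and decide whether to halt."""
    predicted = int(probe.predict(internal_state(objective)[None, :])[0])
    return "HALT" if predicted in DISALLOWED else "ALLOW"

print([oversee(o) for o in range(n_objectives)])  # expect ALLOW, ALLOW, ALLOW, HALT
```

Real systems won't hand us labelled objectives or cleanly encoded internal states, of course; the sketch just pins down what "detect the objective, then shut down before it's realized" could mean operationally in a toy setting.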