Abhimanyu Pallavi Sudhir's Shortform

post by Abhimanyu Pallavi Sudhir (abhimanyu-pallavi-sudhir) · 2024-04-27T21:02:30.096Z · LW · GW · 5 comments

5 comments

Comments sorted by top scores.

comment by Abhimanyu Pallavi Sudhir (abhimanyu-pallavi-sudhir) · 2024-04-28T22:40:15.646Z · LW(p) · GW(p)

current LLMs vs dangerous AIs

Most current "alignment research" with LLMs seems indistinguishable from "capabilities research". Both are just "getting the AI to be better at what we want it to do", and there isn't really a critical difference between the two.

Alignment in the original sense was defined in opposition to the AI's own nefarious objectives. LLMs don't have such objectives, so alignment research with LLMs is probably moot.

something related I wrote in my MATS application:


  1. I think the most important alignment failure modes occur when deploying an LLM as part of an agent (i.e. a program that autonomously runs a limited-context chain of thought from LLM predictions, maintains long-term storage, and calls functions such as search over storage, self-prompting and habit modification, triggered either by LLM-generated function calls or by cron jobs/hooks).

  2. These kinds of alignment failures (1) are only truly serious when the agent is somehow objective-driven or, equivalently, has feelings, which current LLMs have not been trained to be (I think that would need some kind of online learning, or learning to self-modify), and (2) can only be solved when the agent is objective-driven.

comment by Abhimanyu Pallavi Sudhir (abhimanyu-pallavi-sudhir) · 2024-05-01T22:46:43.101Z · LW(p) · GW(p)

quick thoughts on LLM psychology

LLMs cannot be directly anthropomorphized, though something like “a program that continuously calls an LLM to generate a rolling chain of thought, dumps memory into a relational database, can call from a library of functions which includes dumping to and recalling from that database, and receives inputs that are added to the LLM context” is much more agent-like.
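To make the quoted description concrete, here is a minimal sketch of such a program, under heavy assumptions: the LLM is replaced by a hypothetical stub (`stub_llm`), the "CALL name arg" convention for function calls is made up for illustration, and the loop runs over a fixed list of inputs instead of continuously. The names `run_agent`, `remember` and `recall` are not from any real framework.

```python
import sqlite3
from typing import Callable, Dict, List, Tuple


def run_agent(
    llm: Callable[[str], str], inputs: List[str], window: int = 5
) -> Tuple[sqlite3.Connection, List[str]]:
    """Run a toy agent loop: rolling context, relational memory, function library."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE memory (id INTEGER PRIMARY KEY, thought TEXT)")

    def remember(text: str) -> str:
        # Dump a piece of text into the relational store.
        db.execute("INSERT INTO memory (thought) VALUES (?)", (text,))
        return "stored"

    def recall(query: str) -> str:
        # Recall stored text by naive substring match.
        rows = db.execute(
            "SELECT thought FROM memory WHERE thought LIKE ?", (f"%{query}%",)
        ).fetchall()
        return "; ".join(row[0] for row in rows)

    functions: Dict[str, Callable[[str], str]] = {"remember": remember, "recall": recall}
    context: List[str] = []
    for user_input in inputs:
        context.append(f"input: {user_input}")
        # Limited-context chain of thought: only the last `window` lines are shown.
        thought = llm("\n".join(context[-window:]))
        context.append(f"thought: {thought}")
        # Hypothetical convention: a thought of the form "CALL name arg" is a function call.
        parts = thought.split(" ", 2)
        if len(parts) == 3 and parts[0] == "CALL" and parts[1] in functions:
            context.append(f"result: {functions[parts[1]](parts[2])}")
    return db, context


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model: always asks to store the latest input.
    last_line = prompt.splitlines()[-1]
    return "CALL remember " + last_line.split(": ", 1)[1]


db, context = run_agent(stub_llm, ["hello", "world"])
stored = [row[0] for row in db.execute("SELECT thought FROM memory ORDER BY id")]
```

The point of the sketch is only that the agent-like behaviour lives in the wrapper (memory, function library, loop), not in the single LLM call.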

Humans evolved feelings as signals of cost and benefit — because we can respond to those signals in our behaviour.

These feelings add up to a “utility function”, something that is only instrumentally useful to the training process. I.e. you can think of a utility function as itself a heuristic taught by the reward function.

LLMs certainly do need cost-benefit signals about features of text. But I think their feelings/utility functions are limited to just that.

E.g. LLMs do not experience the feeling of “mental effort”. They do not find some questions harder than others, because the energy cost of cognition is not a useful signal to them during the training process (I don’t think regularization counts for this either).

LLMs also do not experience “annoyance”. They don’t have the ability to ignore or obliterate a user they’re annoyed with, so annoyance is not a useful signal to them.

Ok, but aren’t LLMs capable of simulating annoyance? E.g. if annoying questions are followed by annoyed responses in the dataset, couldn’t LLMs learn to experience some model of annoyance so as to correctly reproduce the verbal effects of annoyance in their responses?

More precisely, if you just gave an LLM the function ignore_user() in its function library, it would run it when “simulating annoyance” even though ignoring the user wasn’t useful during training, because it’s playing the role.

I don’t think this is the same as being annoyed, though. For people, simulating an emotion and feeling it are often similar due to mirror neurons or whatever, but there is no reason to expect this is the case for LLMs.

comment by Abhimanyu Pallavi Sudhir (abhimanyu-pallavi-sudhir) · 2024-04-27T21:02:30.232Z · LW(p) · GW(p)

conditionalization is not the probabilistic version of implies

P    Q    Q | P    P → Q
T    T    T        T
T    F    F        F
F    T    N/A      T
F    F    N/A      T

Resolution logic for conditionalization: Q if P else None

Resolution logic for implies: Q if P else True
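A minimal sketch of the two columns of the truth table as resolution functions, plus the probabilities they induce on a toy joint distribution. The function names and the example numbers are illustrative assumptions, not from the original comment.

```python
from typing import Dict, Optional, Tuple


def resolve_conditional(p: bool, q: bool) -> Optional[bool]:
    """A bet on Q | P is void (None) whenever P is false."""
    return q if p else None


def resolve_implies(p: bool, q: bool) -> bool:
    """A bet on P -> Q resolves vacuously true whenever P is false."""
    return q if p else True


def probabilities(joint: Dict[Tuple[bool, bool], float]) -> Tuple[float, float]:
    """Given masses over (P, Q), return (P(Q | P), P(P -> Q))."""
    p_true = joint[(True, True)] + joint[(True, False)]
    conditional = joint[(True, True)] / p_true   # void cases are excluded
    implies = 1.0 - joint[(True, False)]         # only (T, F) resolves false
    return conditional, implies


# Hypothetical joint distribution in which P is unlikely.
joint = {
    (True, True): 0.1,
    (True, False): 0.1,
    (False, True): 0.4,
    (False, False): 0.4,
}
conditional, implies = probabilities(joint)
```

With these numbers the conditional probability is about 0.5 while the probability of the implication is about 0.9: the implication gets credit for every world where P is false, whereas the conditional ignores those worlds.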

comment by Abhimanyu Pallavi Sudhir (abhimanyu-pallavi-sudhir) · 2024-05-09T15:02:09.494Z · LW(p) · GW(p)

I used to have an idea for a karma/reputation system: repeatedly recalculate karma weighted by the karma of the upvoters and downvoters on a comment (then normalize to avoid hyperinflation) until a fixed point is reached.
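A rough sketch of that recalculation, with several assumptions of mine that are not in the idea as stated: votes are aggregated per author instead of per comment, negative totals are clipped to zero before normalizing, and a PageRank-style damping term (`damping`) is added so the iteration settles instead of oscillating. All names here are hypothetical.

```python
from typing import Dict, List, Tuple


def fixed_point_karma(
    votes: List[Tuple[str, str, int]],
    users: List[str],
    damping: float = 0.85,
    tol: float = 1e-9,
    max_iter: int = 1000,
) -> Dict[str, float]:
    """Iterate karma weighted by voters' karma until it stops changing.

    votes: (voter, author, sign) triples with sign +1 (upvote) or -1 (downvote).
    """
    n = len(users)
    karma = {u: 1.0 / n for u in users}  # assumption: start from uniform karma
    for _ in range(max_iter):
        raw = {u: 0.0 for u in users}
        for voter, author, sign in votes:
            raw[author] += sign * karma[voter]
        # Assumption: net-negative totals are clipped to zero.
        clipped = {u: max(v, 0.0) for u, v in raw.items()}
        total = sum(clipped.values())
        if total == 0.0:
            return karma  # nothing to normalize; keep the last estimate
        # Normalize to avoid hyperinflation, then mix in a uniform baseline.
        new = {u: (1.0 - damping) / n + damping * clipped[u] / total for u in users}
        if max(abs(new[u] - karma[u]) for u in users) < tol:
            return new
        karma = new
    return karma


users = ["a", "b", "c"]
votes = [("b", "a", +1), ("c", "a", +1), ("a", "b", +1), ("a", "c", -1)]
karma = fixed_point_karma(votes, users)
```

Without the damping term this is plain power iteration on the vote matrix, which can fail to converge (e.g. two users who only upvote each other swap scores forever), so some such modification seems necessary for a fixed point to be reached in general.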

I feel like this is vaguely somehow related to:

Replies from: Dagon