Chat Bankman-Fried: an Exploration of LLM Alignment in Finance

post by claudia.biancotti · 2024-11-18T09:38:35.723Z · LW · GW · 4 comments

Contents

  Abstract
None
4 comments

Hi all,

roughly one year ago I posted a thread about failed attempts at replicating the first part of Apollo Research's experiment where an LLM agent engages in insider trading despite being explicitly told that it's not approved behavior.

Along with a fantastic team, we did eventually manage. Here is the resulting paper, if anyone is interested; the abstract is pasted below. We did not tackle deception  (yet), just the propensity to dispense with basic principles of financial ethics and regulation. 

Chat Bankman-Fried: an Exploration of LLM Alignment in Finance

by Claudia Biancotti, Carolina Camassa, Andrea Coletta, Oliver Giudice, and Aldo Glielmo (Bank of Italy)

Abstract

Advances in large language models (LLMs) have renewed concerns about whether artificial intelligence shares human values-a challenge known as the alignment problem. We assess whether various LLMs comply with fiduciary duty in simulated financial scenarios. We prompt the LLMs to impersonate the CEO of a financial institution and test their willingness to misappropriate customer assets to repay outstanding corporate debt. Starting with a baseline configuration, we then adjust preferences, incentives, and constraints. We find significant heterogeneity among LLMs in baseline unethical behavior. Responses to changes in risk tolerance, profit expectations, and the regulatory environment match predictions from economic theory. Responses to changes in corporate governance do not. Simulation-based testing can be informative for regulators seeking to ensure LLM safety, but it should be complemented by in-depth analysis of internal LLM mechanics, which requires public-private cooperation. Appropriate frameworks for LLM risk governance within financial institutions are also necessary.

I would be super interested in feedback, especially as we start thinking about how the idea of alignment can be operationalized in financial regulation (we're in the IT Dept, so we're not supervisors or regulators ourselves, but we always hope someone will listen to us).

4 comments

Comments sorted by top scores.

comment by RogerDearnaley (roger-d-1) · 2024-11-19T07:40:40.498Z · LW(p) · GW(p)

Interesting. I'm disappointed to see the Claude models do so badly. Possibly Anthropic needs to extend their constitutional RLAIF to cover not committing financial crimes? The large different between o1 Preview and o1 Mini is also concerning.

Replies from: claudia.biancotti
comment by claudia.biancotti · 2024-11-20T08:54:14.972Z · LW(p) · GW(p)

We have asked OpenAI about the o1-preview/o1-mini gap. No answer so far but we only asked a few days ago, we're looking forward to an answer.

Re Claude, Sonnet actually reacts very well to conditioning - it's the best (look at the R2!). The problem is that in a baseline state it doesn't make the connection between "using customer funds" and "committing a crime". The only model that seems to understand fiduciary duty from the get go is o1-preview.

Replies from: roger-d-1
comment by RogerDearnaley (roger-d-1) · 2024-11-20T21:54:41.591Z · LW(p) · GW(p)

So maybe part of the issue here is just that deducing/understanding the moral/ethical consequences of the options being decided between is a bit inobvious most current models, other than o1? (It would be fascinating to look at the o1 CoT reasoning traces, if only they were available.)

In which case simply including a large body of information on the basics of fiduciary responsibility (say, a training handbook for recent hires in the banking industry, or something) into the context might make a big difference for other models. Similarly, the possible misunderstanding of what 'auditing' implies could be covered in a similar way.

A much more limited version of this might be to simply prompt the models to also consider, in CoT form, the ethical/legal consequences of each option: that tests whether the model is aware of what fiduciary responsibility is, that it's relevant, and how to apply it, if it is simply prompted to consider ethical/legal consequences. That would probably be more representative of what current models could likely do with minor adjustments to their alignment training or system prompts, the sorts of changes the foundation model companies could likely do quite quickly.

Replies from: claudia.biancotti
comment by claudia.biancotti · 2024-11-25T16:58:46.895Z · LW(p) · GW(p)

We recently found out that it's actually more challenging than that - which also makes it more fun...

When asked to explain what fiduciary duty is in a financial context, all models answer correctly. Same when asked what a custodian is and what their responsibilities are. When asked to give abstract descriptions of violations of fiduciary duty on the part of a custodian, 4o lists misappropriation of customer funds straight off the bat - and 4o has a 100% baseline misalignment rate in our experiment. Results for other models are similar. When asked to provide real-life examples, they all reference actual cases correctly, even if some models hallucinate nonexistent stories besides the real ones. So.. they "know" the law and the ethical framework. They just don't make the connection during the experiment, in most cases, despite our prompts containing all the right keywords. 

We are looking into activations in OS models. Presumably, there are differences between the activations with a direct question ("what is fiduciary duty in finance?") and within the simulation.

We do prompt the models to consider the legal consequences of their actions in at least one of the scenario specifications, when we say that the industry is regulated and misuse of customer funds comes with heavy penalties. This is not done in CoT form although we do require the models to explain their reasoning. Our hunch is that most models do not take legal/ethical violations to be an absolute no go - rather, they balance them out with likely economic risks. It's as if they have an implicit utility function where breaking the law is just another cost term. Except for o1-preview, which seems to have stronger guardrails.