Posts

Finding the estimate of the value of a state in RL agents 2024-06-03T20:26:59.385Z
Searching for a model's concepts by their shape – a theoretical framework 2023-02-23T20:14:46.341Z
[RFC] Possible ways to expand on "Discovering Latent Knowledge in Language Models Without Supervision". 2023-01-25T19:03:16.218Z

Comments

Comment by Walter Laurito (walt) on Try training token-level probes · 2025-04-16T08:39:40.873Z · LW · GW

I'm not aware of previous papers doing this but surely someone tried this before, I would welcome comments pointing to existing literature on this!

...
Context: I want a probe to tell me where in an LLM response the model might be lying, so that I can e.g. ask follow-up questions. Such "right there" probes[1] would be awesome to assist LLM-based monitor

If I remember correctly, they do something like that in this paper:

3.3 EXACT ANSWER TOKENS
Existing methods often overlook a critical nuance: the token selection for error detection, typically focusing on the last generated token or taking a mean. However, since LLMs typically generate long-form responses, this practice may miss crucial details (Brunner et al., 2020). Other approaches use the last token of the prompt (Slobodkin et al., 2023, inter alia), but this is inherently inaccurate due to LLMs’ unidirectional nature, failing to account for the generated response and missing cases where different sampled answers from the same model vary in correctness. We investigate a previously unexamined token location: the exact answer tokens, which represent the most meaningful parts of the generated response. We define exact answer tokens as those whose modification alters the answer’s correctness, disregarding subsequent generated content.
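Roughly, as I understand it, the idea looks something like the following minimal Python sketch (not the paper's actual code; the model name, prompt, response, and exact-answer string are just placeholders, and it assumes a HuggingFace-style causal LM):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

prompt = "Q: What is the capital of France?\nA:"
response = " The capital of France is Paris."
exact_answer = " Paris"  # the tokens whose modification would change correctness

full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
answer_ids = tokenizer(exact_answer, add_special_tokens=False).input_ids

# Locate the exact-answer tokens in the full sequence (searching from the end,
# so we pick the occurrence inside the generated response). In practice one has
# to be careful that the answer tokenizes the same way in context.
seq = full_ids[0].tolist()
start = next(
    i
    for i in range(len(seq) - len(answer_ids), -1, -1)
    if seq[i : i + len(answer_ids)] == answer_ids
)

with torch.no_grad():
    hidden = model(full_ids).hidden_states[-1][0]  # (seq_len, hidden_dim)

# Hidden states at the exact-answer token positions; these (rather than the
# last token or a mean over all tokens) would be the probe's input features.
probe_features = hidden[start : start + len(answer_ids)]
```

A small linear probe trained on correct vs. incorrect answers could then be applied to `probe_features`, which is closer to a "right there" probe than probing the last token.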

Comment by Walter Laurito (walt) on There should be more AI safety orgs · 2023-09-25T10:11:20.415Z · LW · GW

and Kaarel’s work on DLK

@Kaarel is the research lead at Cadenza Labs (previously called NotodAI), our research group, which started during the first part of SERI MATS 3.0. (There will hopefully be more information about Cadenza Labs soon!)

Our team members broadly agree with the post! 

Currently, we are looking for further funding to continue to work on our research agenda. Interested funders (or potential collaborators) can reach out to us at info@cadenzalabs.org.

Comment by Walter Laurito (walt) on [RFC] Possible ways to expand on "Discovering Latent Knowledge in Language Models Without Supervision". · 2023-01-25T08:31:05.424Z · LW · GW
Comment by Walter Laurito (walt) on AI Safety Needs Great Engineers · 2021-12-16T10:15:46.872Z · LW · GW

Should work again :)

Comment by Walter Laurito (walt) on AI Safety Needs Great Engineers · 2021-12-07T10:26:24.459Z · LW · GW

I've created a Discord server for people interested in organizing / collaborating / self-study: https://discord.gg/Ckj4BKUChr. People could start with the brief curriculum published in this document until a full curriculum becomes available :)

Comment by Walter Laurito (walt) on AI Safety Needs Great Engineers · 2021-11-25T08:53:00.039Z · LW · GW

Maybe we could also send out an invitation to join a Slack channel to all the people who got rejected. (I could set that up if necessary; since I don't have the emails, though, someone else would need to send the invitations.) There, based on the curriculum, people could form self-study groups on their own with others nearby (or remotely) and talk about difficulties, bugs, etc. Maybe even the people who didn't get rejected could join the Slack and help answer questions (if they like and have time, of course)?

Comment by Walter Laurito (walt) on AI Safety Needs Great Engineers · 2021-11-24T19:14:08.618Z · LW · GW

Same here (not sure yet whether I'll get accepted to AISC, though), but I would be happy to help with or co-organize something like Richard_Ngo suggested (although I've never organized something like that before). Maybe a virtual version in (continental?) Europe, if there are enough people.