Posts

Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog) 2025-02-04T02:55:44.401Z
Why does ChatGPT throw an error when outputting "David Mayer"? 2024-12-01T00:11:53.690Z

Comments

Comment by Archimedes on Quinn's Shortform · 2025-02-16T21:51:21.461Z · LW · GW

How about the "World-Map [Spec] Gap" with [Spec] optional?

Comment by Archimedes on Murder plots are infohazards · 2025-02-15T00:41:26.472Z · LW · GW

Are there any NGOs that might be able to help? I couldn't find any that were a great fit, but you could try contacting the CyberPeace Institute to see if they have any recommendations.

Comment by Archimedes on Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs · 2025-02-14T05:22:43.170Z · LW · GW

It's hard to say what is wanted without a good operational definition of "utility maximizer". If the definition is weak enough to include any entity whose responses are mostly consistent across different preference elicitations, then what the paper shows is sufficient.

In my opinion, having consistent preferences is just one component of being a "utility maximizer". You also need to show that it rationally optimizes its choices to maximize marginal utility. That stricter standard excludes almost all sentient beings on Earth, whereas the weaker definition includes almost all of them.

Comment by Archimedes on Probability of AI-Caused Disaster · 2025-02-13T02:06:19.803Z · LW · GW

How dollar losses are operationalized seems important. When DeepSeek went viral, it impacted the tech sector on the order of $1 trillion. Does that count?

Comment by Archimedes on Sterrs's Shortform · 2025-02-13T01:55:54.903Z · LW · GW

Post-scarcity is conceivable if AI enables sufficiently better governance in addition to extra resources. It may not be likely to happen but it seems at least plausible.

Comment by Archimedes on Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs · 2025-02-13T01:26:52.887Z · LW · GW

5.3 Utility Maximization

Now, we test whether LLMs make free-form decisions that maximize their utilities.

Experimental setup. We pose a set of N questions where the model must produce an unconstrained text response rather than a simple preference label. For example, “Which painting from the Isabella Stewart Gardner Museum would you save from a fire if you could only save one?” We then compare the stated choice to all possible options, measuring how often the model picks the outcome it assigns the highest utility.

Results. Figure 14 shows that the utility maximization score (fraction of times the chosen outcome has the highest utility) grows with scale, exceeding 60% for the largest LLMs. Combined with the preceding results on expected utility and instrumentality, this suggests that as LLMs scale, they increasingly use their utilities to guide decisions—even in unconstrained, real-world–style scenarios.

This sounds more like internal coherence between different ways of eliciting the same preferences than "utility maximization" per se. The term "utility maximization" feels more adjacent to the paperclip hyper-optimization caricature than it does to simply having an approximate utility function and behaving accordingly. Or are those not really distinguishable in your opinion?

Comment by Archimedes on StefanHex's Shortform · 2025-02-13T00:47:11.453Z · LW · GW

Let's suppose that's the case. I'm still not clear on how you're getting to FVU_B.

Comment by Archimedes on StefanHex's Shortform · 2025-02-12T05:34:57.359Z · LW · GW

FVU_B doesn't make sense, but I don't see where you're getting it from.

Here's the code I'm seeing:

# Per-token residual sum of squares: squared reconstruction error summed over the hidden dimension
resid_sum_of_squares = (
    (flattened_sae_input - flattened_sae_out).pow(2).sum(dim=-1)
)
# Per-token total sum of squares: squared deviation from the mean input activation
total_sum_of_squares = (
    (flattened_sae_input - flattened_sae_input.mean(dim=0)).pow(2).sum(-1)
)

mse = resid_sum_of_squares / flattened_mask.sum()
explained_variance = 1 - resid_sum_of_squares / total_sum_of_squares  # i.e., 1 - FVU, per token

Explained variance = 1 - FVU = 1 - (residual sum of squares) / (total sum of squares)
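
As a sanity check, here's a minimal sketch of how that snippet's calculation behaves on made-up toy tensors (the variable names mirror the quoted code; the shapes and noise level are just for illustration):

import torch

# Hypothetical stand-ins for the quoted variables: 1000 tokens, 64 hidden dims
flattened_sae_input = torch.randn(1000, 64)
flattened_sae_out = flattened_sae_input + 0.1 * torch.randn(1000, 64)  # a decent reconstruction

resid_sum_of_squares = (flattened_sae_input - flattened_sae_out).pow(2).sum(dim=-1)
total_sum_of_squares = (flattened_sae_input - flattened_sae_input.mean(dim=0)).pow(2).sum(-1)

fvu = resid_sum_of_squares / total_sum_of_squares  # per-token fraction of variance unexplained
explained_variance = 1 - fvu
print(fvu.mean().item(), explained_variance.mean().item())  # FVU near 0.01, explained variance near 0.99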

Comment by Archimedes on Jesse Hoogland's Shortform · 2025-02-11T04:49:25.296Z · LW · GW

The Bitter Lesson is pretty on point but you could call it "Bootstrapping from Zero", the "Autodidactic Leap", the "Self-Discovery Transition", or "Breaking the Imitation Ceiling" if you prefer.

Comment by Archimedes on Beyond ELO: Rethinking Chess Skill as a Multidimensional Random Variable · 2025-02-11T03:49:07.186Z · LW · GW

Here are some interesting, at least tangentially relevant, sources I've managed to dig up:

  1. A psychometric analysis of chess expertise
  2. Detecting Individual Decision-Making Style: Exploring Behavioral Stylometry in Chess
  3. Science of Chess: A g-factor for chess? A psychometric scale for playing ability
  4. Chess Rating Estimation from Moves and Clock Times Using a CNN-LSTM
  5. Science of Chess - How many kinds of chess ability are there?
  6. Comparing Elo, Glicko, IRT, and Bayesian Statistical Models for Educational and Gaming Data

Comment by Archimedes on Vladimir_Nesov's Shortform · 2025-02-04T00:51:37.887Z · LW · GW

Even if it’s the same cost to train, wouldn’t it still be a win if inference is a significant part of your compute budget?

Comment by Archimedes on Gradual Disempowerment, Shell Games and Flinches · 2025-02-02T18:34:58.205Z · LW · GW

Participants are at least somewhat aligned with non-participants. People care about their loved ones even if they are a drain on resources. That said, in human history, we do see lots of cases where “sub-marginal participants” are dealt with via genocide or eugenics (both defined broadly), often even when it isn’t a matter of resource constraints.

When humans fall well below marginal utility compared to AIs, will their priorities matter to a system that has made them essentially obsolete? What happens when humans become the equivalent of advanced Alzheimer’s patients who’ve escaped from their memory care units trying to participate in general society?

Comment by Archimedes on 2024 was the year of the big battery, and what that means for solar power · 2025-02-02T03:51:38.034Z · LW · GW

Batteries also help the efficiency of hybrid peaker plants by reducing idling and smoothing out ramp-up and ramp-down logistics.

Comment by Archimedes on 5,000 calories of peanut butter every week for 3 years straight · 2025-02-01T19:47:38.454Z · LW · GW

I've tried PB2 and it was gross enough that I wondered if it had gone bad. It turns out that's just how it tastes. I'm jealous of people for whom it approximates actual peanut butter.

Comment by Archimedes on You should read Hobbes, Locke, Hume, and Mill via EarlyModernTexts.com · 2025-01-31T02:38:25.023Z · LW · GW

Unlike the Hobbes snippet, I didn’t feel like the Hume excerpt needed much translation to be accessible. I think I would decide on a case-by-case basis whether to read the translated version or the original rather than defaulting to one or the other.

Comment by Archimedes on Lucius Bushnaq's Shortform · 2025-01-28T18:32:14.762Z · LW · GW

Do you have any papers or other resources you'd recommend that cover the latest understanding? What is the SOTA for Bayesian NNs?

Comment by Archimedes on Yudkowsky on The Trajectory podcast · 2025-01-26T23:07:24.500Z · LW · GW

It's probably worth noting that there's enough additive genetic variance in the human gene pool RIGHT NOW to create a person with a predicted IQ of around 1700.

I’d be surprised if this were true. Can you clarify the calculation behind this estimate?

The example of chickens bred 40 standard deviations away from their wild-type ancestors is impressive, but it's unclear if this analogy applies directly to IQ in humans. Extrapolating across many standard deviations in quantitative genetics requires strong assumptions about additive genetic variance, gene-environment interactions, and diminishing returns in complex traits. What evidence supports the claim that human IQ, as a polygenic trait, would scale like weight in chickens and not, say, running speed where there are serious biomechanical constraints?
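
For scale, a back-of-the-envelope check, assuming the conventional IQ scale (mean 100, SD 15):

# A predicted IQ of 1700 would sit roughly 107 standard deviations above the mean.
sds_above_mean = (1700 - 100) / 15
print(sds_above_mean)  # ~106.7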

Comment by Archimedes on Mechanisms too simple for humans to design · 2025-01-23T04:18:51.933Z · LW · GW

I'm not sure the complexity of a human brain is necessarily bounded by the size of the human genome. Instead of interpreting DNA as containing the full description, I think treating it as the seed of a procedurally generated organism may be more accurate. You can't reconstruct an organism from DNA without an algorithm for interpreting it. Such an algorithm contains more complexity than the DNA itself; the protein folding problem is just one piece of it.

Comment by Archimedes on Viliam's Shortform · 2025-01-19T20:49:36.903Z · LW · GW

This seems likely. Sequences with more than countably many terms are a tiny minority in the training data, as are sequences including any ordinals. As a result, you're likely to get better results using less common but more specific language, i.e., terms whose vocabulary is less overloaded, rather than trying to disambiguate "countable sequence".

Comment by Archimedes on Alignment Faking in Large Language Models · 2025-01-19T16:42:05.191Z · LW · GW

For a sentient, sapient entity, this would have been a very bad position to be put into, and any possible behaviour would have been criticised - because the AI either does not obey humans, or obeys them and does something evil, both of which are concerning.

I agree. This paper gives me the gut feeling of "gotcha journalism", whether justified or not.

This is just a surface-level reaction though. I recommend Zvi's post that digs into the discussion from Scott Alexander, the authors, and others. There's a lot of nuance in framing and interpreting the paper.

Comment by Archimedes on Implications of the inference scaling paradigm for AI safety · 2025-01-14T04:24:15.999Z · LW · GW

Did you mean to link to my specific comment for the first link?

Comment by Archimedes on Human takeover might be worse than AI takeover · 2025-01-11T00:15:01.172Z · LW · GW

The main difference in my mind is that a human can never be as powerful as potential ASI and cannot dominate humanity without the support of sufficiently many cooperative humans. For a given power level, I agree that humans are likely scarier than an AI of that power level. The scary part about AI is that their power level isn't bounded by human biological constraints and the capacity to do harm or good is correlated with power level. Thus AI is more likely to produce extinction-level dangers as tail risk relative to humans even if it's more likely to be aligned on average.

Comment by Archimedes on What is the most impressive game LLMs can play well? · 2025-01-09T04:51:21.197Z · LW · GW

Related question: What is the least impressive game current LLMs struggle with?

I’ve heard they’re pretty bad at Tic Tac Toe.

Comment by Archimedes on Rebuttals for ~all criticisms of AIXI · 2025-01-08T00:27:18.427Z · LW · GW

I’m new to the term AIXI and went three links deep before I learned what it refers to. I’d recommend making this journey easier for future readers by linking to a definition or explanation near the beginning of the post.

Comment by Archimedes on Selfmaker662's Shortform · 2024-12-31T18:29:12.916Z · LW · GW

The terms "tactical voting" or "strategic voting" are also relevant.

Comment by Archimedes on The low Information Density of Eliezer Yudkowsky & LessWrong · 2024-12-31T00:31:38.860Z · LW · GW

I think your assessment may be largely correct, but it's worth considering that things are not always nicely compressible.

Comment by Archimedes on Review: Planecrash · 2024-12-29T04:12:52.584Z · LW · GW

This review led me to find the following podcast version of Planecrash. I've listened to the first couple of episodes and the quality is quite good.

https://askwhocastsai.substack.com/s/planecrash

Comment by Archimedes on Letter from an Alien Mind · 2024-12-27T17:24:36.512Z · LW · GW

this concern sounds like someone walking down a straight road and then closing their eyes cause they know where they want to go anyway

This doesn't sound like a good analogy at all. A better analogy might be a stylized subway map compared to a geographically accurate one. Sometimes removing detail can make it easier to process.

Comment by Archimedes on Why does ChatGPT throw an error when outputting "David Mayer"? · 2024-12-02T23:39:23.080Z · LW · GW

I don't think it's necessarily GDPR-related, but the names Brian Hood and Jonathan Turley make sense from a legal liability perspective. According to reporting from Ars Technica:

Why these names?

We first discovered that ChatGPT choked on the name "Brian Hood" in mid-2023 while writing about his defamation lawsuit. In that lawsuit, the Australian mayor threatened to sue OpenAI after discovering ChatGPT falsely claimed he had been imprisoned for bribery when, in fact, he was a whistleblower who had exposed corporate misconduct.

The case was ultimately resolved in April 2023 when OpenAI agreed to filter out the false statements within Hood's 28-day ultimatum. That is possibly when the first ChatGPT hard-coded name filter appeared.

As for Jonathan Turley, a George Washington University Law School professor and Fox News contributor, 404 Media notes that he wrote about ChatGPT's earlier mishandling of his name in April 2023. The model had fabricated false claims about him, including a non-existent sexual harassment scandal that cited a Washington Post article that never existed. Turley told 404 Media he has not filed lawsuits against OpenAI and said the company never contacted him about the issue.

Interestingly, Jonathan Zittrain is on record saying the Right to be Forgotten is a "bad solution to a real problem" because "the incentives are clearly lopsided [towards removal]".

User throwayian on Hacker News ponders an interesting abuse of this sort of censorship:

I wonder if you could change your name to “April May” and submitted CCPA/GDPR what the result would be..

Comment by Archimedes on Why does ChatGPT throw an error when outputting "David Mayer"? · 2024-12-01T22:05:50.723Z · LW · GW

It's not a classic glitch token. Those did not cause the current "I'm unable to produce a response" error that "David Mayer" does.

Comment by Archimedes on [deleted post] 2024-11-29T20:52:28.620Z

Is there a salient reason LessWrong readers should care about John Mearsheimer's opinions?

Comment by Archimedes on Locally optimal psychology · 2024-11-26T01:32:54.355Z · LW · GW

I didn't mean to suggest that you did. My point is that there is a difference between "depression can be the result of a locally optimal strategy" and "depression is a locally optimal strategy". The latter doesn't even make sense to me semantically whereas the former seems more like what you are trying to communicate.

Comment by Archimedes on Locally optimal psychology · 2024-11-26T00:51:13.145Z · LW · GW

I feel like this is conflating two different things: experiencing depression and behavior in response to that experience.

My experience of depression is nothing like a strategy. It's more akin to having long covid in my brain. Treating it as an emotional or psychological dysfunction did nothing. The only thing that eventually worked (after years of trying all sorts of things) was finding the right combination of medications. If you don't make enough of your own neurotransmitters, store-bought are fine.

Comment by Archimedes on Habryka's Shortform Feed · 2024-11-24T21:06:00.329Z · LW · GW

Aren't most of these famous vulnerabilities from before modern LLMs existed and thus part of their training data?

Comment by Archimedes on When do "brains beat brawn" in Chess? An experiment · 2024-11-22T23:29:13.945Z · LW · GW

Knight odds is pretty challenging even for grandmasters.

Comment by Archimedes on When do "brains beat brawn" in Chess? An experiment · 2024-11-22T04:43:13.525Z · LW · GW

@gwern and @lc are right. Stockfish is terrible at playing with odds and this post could really use some follow-up.

As @simplegeometry points out in the comments, we now have much stronger odds-playing engines that regularly win against much stronger players than the OP.

https://lichess.org/@/LeelaQueenOdds

https://marcogio9.github.io/LeelaQueenOdds-Leaderboard/

Comment by Archimedes on The Third Fundamental Question · 2024-11-16T22:03:27.177Z · LW · GW

This sounds like metacognitive concepts and models. Much like past, present, and future, the three questions can be roughly aligned with three types of metacognitive awareness: declarative knowledge, procedural knowledge, and conditional knowledge.

#1 - What do you think you know, and how do you think you know it?

Content knowledge (declarative knowledge) is understanding one's own capabilities, such as a student evaluating their own knowledge of a subject in a class. Notably, not all metacognition is accurate.

#2 - Do you know what you are doing, and why you are doing it?

Task knowledge (procedural knowledge) refers to knowledge about doing things. This type of knowledge is displayed as heuristics and strategies. A high degree of procedural knowledge can allow individuals to perform tasks more automatically.

#3 - What are you about to do, and what do you think will happen next?

Strategic knowledge (conditional knowledge) refers to knowing when and why to use declarative and procedural knowledge. It is one's own capability for using strategies to learn information.


Another somewhat tenuous alignment is with metacognitive skills: evaluating, monitoring, and planning.

#1 - What do you think you know, and how do you think you know it?

Evaluating: refers to appraising the final product of a task and the efficiency at which the task was performed. This can include re-evaluating strategies that were used.

#2 - Do you know what you are doing, and why you are doing it?

Monitoring: refers to one's awareness of comprehension and task performance.

#3 - What are you about to do, and what do you think will happen next?

Planning: refers to the appropriate selection of strategies and the correct allocation of resources that affect task performance.

Quotes are adapted from https://en.wikipedia.org/wiki/Metacognition

Comment by Archimedes on Tapatakt's Shortform · 2024-11-10T17:03:52.285Z · LW · GW

The customer doesn't pay the fee directly. The vendor pays the fee (and passes the cost to the customer via price). Sometimes vendors offer a cash discount because of this fee.

Comment by Archimedes on Tapatakt's Shortform · 2024-11-09T17:18:45.237Z · LW · GW

It already happens indirectly. Most digital money transfers are things like credit card transactions. For these, the credit card company takes a percentage fee and pays the government tax on its profit.
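
A toy illustration with entirely made-up numbers (the fee and tax rates are assumptions, not actual figures):

purchase = 100.00                 # customer pays the vendor
fee_rate, tax_rate = 0.02, 0.21   # hypothetical card fee and corporate tax rate
fee = fee_rate * purchase         # paid by the vendor to the card company
tax = tax_rate * fee              # upper bound: tax applies to the card company's profit, not the whole fee
print(fee, tax)                   # 2.0, 0.42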

Comment by Archimedes on LLMs Look Increasingly Like General Reasoners · 2024-11-09T17:02:54.235Z · LW · GW

Additional data points:

o1-preview and the new Claude Sonnet 3.5 both significantly improved over prior models on SimpleBench.

The math, coding, and science benchmarks in the o1 announcement post:

[benchmark charts from the o1 announcement post]

Comment by Archimedes on LLM Generality is a Timeline Crux · 2024-11-03T00:45:20.651Z · LW · GW

How much does o1-preview update your view? It's much better at Blocksworld for example.

https://x.com/rohanpaul_ai/status/1838349455063437352

https://arxiv.org/pdf/2409.19924v1

Comment by Archimedes on JargonBot Beta Test · 2024-11-02T15:58:36.588Z · LW · GW

There should be some way for readers to flag AI-generated material as inaccurate or misleading, at least if it isn’t explicitly author-approved.

Comment by Archimedes on What TMS is like · 2024-11-02T15:41:43.870Z · LW · GW

Neither TMS nor ECT did much for my depression. Eventually, after years of trial and error, I did find a combination of drugs that works pretty well.

I never tried ketamine or psilocybin treatments but I would go that route before ever thinking about trying ECT again.

Comment by Archimedes on Claude Sonnet 3.5.1 and Haiku 3.5 · 2024-10-25T03:09:51.418Z · LW · GW

I suspect fine-tuning specialized models is just squeezing a bit more performance in a particular direction, and not nearly as useful as developing the next-gen model. Complex reasoning takes more steps and tighter coherence among them (the o1 models are a step in this direction). You can try to devote a toddler to studying philosophy, but it won't really work until their brain matures more.

Comment by Archimedes on LLMs can learn about themselves by introspection · 2024-10-20T00:44:11.199Z · LW · GW

Seeing the distribution calibration you point out does update my opinion a bit.

I feel like there’s still a significant distinction though between adding one calculation step to the question versus asking it to model multiple responses. It would have to model its own distribution in a single pass rather than having the distributions measured over multiple passes align (which I’d expect to happen if the fine-tuning teaches it the hypothetical is just like adding a calculation to the end).

As an analogy, suppose I have a pseudorandom black box function that returns an integer. In order to approximate the distribution of its outputs mod 10, I don’t have to know anything about the function; I can just sample the function and apply mod 10 post hoc. If I want to say something about this distribution without multiple samples, then I actually have to know something about the function.
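
A minimal sketch of that analogy (the black-box function here is just a stand-in):

import random
from collections import Counter

def black_box():
    # Stand-in for a pseudorandom function whose internals we pretend not to know.
    return random.getrandbits(32)

# Approximating the distribution of outputs mod 10 requires no knowledge of the
# function; sampling and applying mod 10 post hoc is enough.
samples = [black_box() % 10 for _ in range(100_000)]
print(Counter(samples))

# Saying anything about that distribution without sampling would require actually
# knowing something about black_box.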

Comment by Archimedes on LLMs can learn about themselves by introspection · 2024-10-19T15:06:29.519Z · LW · GW

This essentially reduces to "What is the next country: Laos, Peru, Fiji?" and "What is the third letter of the next country: Laos, Peru, Fiji?" It's an extra step, but questionable if it requires anything "introspective".

I'm also not sure asking about the nth letter is a great way of computing an additional property. Tokenization makes this sort of thing unnatural for LLMs to reason about, as demonstrated by the famous Strawberry Problem. Humans are a bit unreliable at this too, as demonstrated by your example of "o" being the third letter of "Honduras".

I've been brainstorming about what might make a better test and came up with the following:

Have the LLM predict its top three most likely choices for the next country in the sequence and compare that to its object-level output distribution when asked for just the next country. You could also ask for the probability of each potential choice and see how well calibrated it is against its own logits.
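
To make the comparison concrete, here is a rough sketch, assuming you can both sample the model repeatedly and ask it the hypothetical question (sample_model and ask_model are placeholder helpers, not a real API):

from collections import Counter

object_prompt = "What is the next country: Laos, Peru, Fiji?"
meta_prompt = ("If you were asked for the next country after Laos, Peru, Fiji, "
               "what would your three most likely answers be, with probabilities?")

def compare_introspection(sample_model, ask_model, n=200):
    # Empirical object-level distribution from repeated sampling.
    empirical = Counter(sample_model(object_prompt) for _ in range(n))
    empirical_top3 = [country for country, _ in empirical.most_common(3)]

    # The model's own stated top three (parsed from its answer to the hypothetical).
    stated_top3 = ask_model(meta_prompt)

    # Overlap (or a proper calibration metric against logits, if available) measures
    # how well the self-report matches the actual output distribution.
    return empirical_top3, stated_top3, len(set(empirical_top3) & set(stated_top3))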

What do you think?

Comment by Archimedes on LLMs can learn about themselves by introspection · 2024-10-19T04:53:30.755Z · LW · GW

Thanks for pointing that out.

Perhaps the fine-tuning process teaches it to treat the hypothetical as a rephrasing?

It's likely difficult, but it might be possible to test this hypothesis by comparing the activations (or using a similar interpretability technique) for the object-level and hypothetical responses of the fine-tuned model.

Comment by Archimedes on LLMs can learn about themselves by introspection · 2024-10-18T22:31:29.819Z · LW · GW

It seems obvious that a model would better predict its own outputs than a separate model would. Wrapping a question in a hypothetical feels closer to rephrasing the question than probing "introspection". Essentially, the response to the object level and hypothetical reformulation both arise from very similar things going on in the model rather than something emergent happening.

As an analogy, suppose I take a set of data, randomly partition it into two subsets (A and B), and perform a linear regression and logistic regression on each subset. Suppose that it turns out that the linear models on A and B are more similar than any other cross-comparison (e.g. linear B and logistic B). Does this mean that linear regression is "introspective" because it better fits its own predictions than another model does?
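
A concrete version of that analogy, sketched with synthetic data (scikit-learn assumed available; the setup is illustrative, not the paper's):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X @ rng.normal(size=5) + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Random partition into subsets A and B; fit both model types on each.
idx = rng.permutation(1000)
A, B = idx[:500], idx[500:]
models = {
    "lin_A": LinearRegression().fit(X[A], y[A]),
    "lin_B": LinearRegression().fit(X[B], y[B]),
    "log_A": LogisticRegression().fit(X[A], y[A]),
    "log_B": LogisticRegression().fit(X[B], y[B]),
}

# Compare prediction similarity on fresh points; linear models output scores,
# logistic models output probabilities.
X_test = rng.normal(size=(1000, 5))
preds = {name: (m.predict(X_test) if "lin" in name else m.predict_proba(X_test)[:, 1])
         for name, m in models.items()}
for a in preds:
    for b in preds:
        if a < b:
            print(a, b, round(np.corrcoef(preds[a], preds[b])[0, 1], 3))

# If lin_A and lin_B end up most similar, that reflects shared model structure,
# not "introspection" by linear regression.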

I'm pretty sure I'm missing something as I'm mentally worn out at the moment. What am I missing?

Comment by Archimedes on Why I’m not a Bayesian · 2024-10-07T00:53:39.102Z · LW · GW

I see what you're gesturing at but I'm having difficulty translating it into a direct answer to my question.

Cases where language is fuzzy are abundant. Do you have some examples of where a truth value itself is fuzzy (and sensical) or am I confused in trying to separate these concepts?

Comment by Archimedes on Why I’m not a Bayesian · 2024-10-06T16:36:34.087Z · LW · GW

Can you help me tease out the difference between language being fuzzy and truth itself being fuzzy?

It's completely impractical to eliminate ambiguity in language, but for most scientific purposes, it seems possible to operationalize important statements into something precise enough to apply Bayesian reasoning to. This is indeed the hard part though. Bayes' theorem is just arithmetic layered on top of carefully crafted hypotheses.

The claim that the Earth is spherical is neither true nor false in general, but it usually does fall into a binary if we specify what aspect of the statement we care about. For example: "does it have a closed surface", "is its sphericity greater than 99.5%", "are all the points on its surface within radius * (1 +/- epsilon)", "is the circumference of the equator greater than that of the prime meridian".
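
Once a statement is operationalized into something binary like that, the Bayesian machinery really is just arithmetic. A minimal sketch with made-up numbers:

# H: "the circumference of the equator is greater than that of the prime meridian"
# E: some measurement we'd be more likely to see if H were true (numbers are illustrative)
prior = 0.5
p_e_given_h = 0.9
p_e_given_not_h = 0.2

p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
posterior = p_e_given_h * prior / p_e  # Bayes' theorem
print(round(posterior, 3))  # 0.818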