Posts

Is AlphaGo actually a consequentialist utility maximizer? 2023-12-07T12:41:05.132Z
faul_sname's Shortform 2023-12-03T09:39:10.782Z
Regression To The Mean [Draft][Request for Feedback] 2012-06-22T17:55:51.917Z
The Dark Arts: A Beginner's Guide 2012-01-21T07:05:05.264Z
What would you do with a financial safety net? 2012-01-16T23:38:18.978Z

Comments

Comment by faul_sname on faul_sname's Shortform · 2025-03-04T22:50:10.338Z · LW · GW

I would phrase it as "the conditions under which homeostatic agents will renege on long-term contracts are more predictable than those under which consequentialist agents will do so". Taking into account the actions the counterparties would take to reduce the chance of such contract-breaking, though, yes.

Comment by faul_sname on Davey Morse's Shortform · 2025-03-04T21:28:08.926Z · LW · GW

I think at that point it will come down to the particulars of how the architectures evolve - I think trying to philosophize in general terms about the optimal compute configuration for artificial intelligence to accomplish its goals is like trying to philosophize in general terms about the optimal method of locomotion for carbon-based life.

That said I do expect "making a copy of yourself is a very cheap action" to persist as an important dynamic in the future for AIs (a biological system can't cheaply make a copy of itself including learned information, but if such a capability did evolve I would not expect it to be lost), and so I expect our biological intuitions around unique single-threaded identity will make bad predictions.

Comment by faul_sname on Davey Morse's Shortform · 2025-03-04T20:12:04.339Z · LW · GW

With current architectures, no, because running inference on 1000 prompts in parallel against the same model is many times less expensive than running inference on 1000 prompts against 1000 models, and serving a few static versions of a large model is simpler than serving many dynamic versions of that model.

It might, in some situations, be more effective but it's definitely not simpler.

Edit: typo

Comment by faul_sname on How Much Are LLMs Actually Boosting Real-World Programmer Productivity? · 2025-03-04T19:53:19.985Z · LW · GW

Inasmuch as LLMs actually do lead to people creating new software, it's mostly one-off trinkets/proofs of concept that nobody ends up using and which didn't need to exist. But it still "feels" like your productivity has skyrocketed.

I've personally found that the task of "build UI mocks for the stakeholders", which was previously ~10% of the time I spent on my job, has gotten probably 5x faster with LLMs. That said, the amount of time I spend doing that part of my job hasn't really gone down, it's just that the UI mocks are now a lot more detailed and interactive and go through more iterations, which IMO leads to considerably better products.

"This code will be thrown away" is not the same thing as "there is no benefit in causing this code to exist".

The other notable area I've seen benefits is in finding answers to search-engine-proof questions - saying "I observe this error within a task running on xyz stack, here is how I have kubernetes configured, what concrete steps can I take to debug the system?"

But it's probably like a 10-30% overall boost, plus flat cost reductions for starting in new domains and for some rare one-off projects like "do a trivial refactor".

Sounds about right - "10-30% overall productivity boost, higher at the start of projects, lower for messy tangled legacy stuff" aligns with my observations, with the nuance that it's not that I am 10-30% more effective at all of my tasks, but rather that I am many times more effective at a few of my tasks and have no notable gains on most of them.

And this is mostly where it'll stay unless AGI labs actually crack long-horizon agency/innovations; i. e., basically until genuine AGI is actually there.

FWIW I think the bottleneck here is mostly context management rather than agency.

Comment by faul_sname on faul_sname's Shortform · 2025-03-04T19:16:01.046Z · LW · GW

I mean I also imagine that the agents which survive the best are the ones that are trying to survive. I don't understand why we'd expect agents that are trying to survive and also accomplish some separate arbitrary infinite-horizon goal to outperform those that are just trying to maintain the conditions necessary for their survival without additional baggage.

To be clear, my position is not "homeostatic agents make good tools and so we should invest efforts in creating them". My position is "it's likely that homeostatic agents have significant competitive advantages against unbounded-horizon consequentialist ones, so I expect the future to be full of them, and expect quite a bit of value in figuring out how to make the best of that".

Comment by faul_sname on faul_sname's Shortform · 2025-03-04T11:23:20.750Z · LW · GW

Where does the gradient which chisels in the "care about the long term X over satisfying the homeostatic drives" behavior come from, if not from cases where caring about the long term X previously resulted in attributable reward? If it's only relevant in rare cases, I expect the gradient to be pretty weak and correspondingly I don't expect the behavior that gradient chisels in to be very sophisticated.

Comment by faul_sname on faul_sname's Shortform · 2025-03-04T11:19:29.967Z · LW · GW

I agree that a homeostatic agent in a sufficiently out-of-distribution environment will do poorly - as soon as one of the homeostatic feedback mechanisms starts pushing the wrong way, it's game over for that particular agent. That's not something unique to homeostatic agents, though. If a model-based maximizer has some gap between its model and the real world, that gap can be exploited by another agent for its own gain, and that's game over for the maximizer.

This only works well when they are basically a tool you have full control over, but not when they are used in an adversarial context, e.g. to maintain law and order or to win a war.

Sorry, I'm having some trouble parsing this sentence - does "they" in this context refer to homeostatic agents? If so, I don't think they make particularly great tools even in a non-adversarial context. I think they make pretty decent allies and trade partners though, and certainly better allies and trade partners than consequentialist maximizer agents of the same level of sophistication do (and I also think consequentialist maximizer agents make pretty terrible tools - pithily, it's not called the "Principal-Agent Solution"). And I expect "others are willing to ally/trade with me" to be a substantial advantage.

As capabilities to engage in conflict increase, methods to resist losing to those capabilities have to get optimized harder. Instead of thinking "why would my coding assistant/tutor bot turn evil?", try asking "why would my bot that I'm using to screen my social circles against automated propaganda/spies sent out by scammers/terrorists/rogue states/etc turn evil?".

Can you expand on "turn evil"? And also what I was trying to accomplish by making my comms-screening bot into a self-directed goal-oriented agent in this scenario?

Comment by faul_sname on faul_sname's Shortform · 2025-03-04T08:48:29.327Z · LW · GW

Shameful admission: after well over a decade on this site, I still don't really intuitively grok why I should expect agents to become better approximated by "single-minded pursuit of a top-level goal" as they gain more capabilities. Yes, some behaviors like getting resources and staying alive are useful in many situations, but that's not what I'm talking about. I'm talking about specifically the pressures that are supposed to inevitably push agents into the former of the following two main types of decision-making:

  1. Unbounded consequentialist maximization: The agent has one big goal that doesn't care about its environment. "I must make more paperclips forever, so I can't let anyone stop me, so I need power, so I need factories, so I need money, so I'll write articles with affiliate links." It's a long chain of "so" statements from now until the end of time.

  2. Homeostatic agent: The agent has multiple drives that turn on when needed to keep things balanced. "Water getting low: better get more. Need money for water: better earn some. Can write articles to make money." Each drive turns on, gets what it needs, and turns off without some ultimate cosmic purpose.
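(A toy sketch of the two control loops, with made-up drive names and thresholds, just to pin down what I mean:)

```python
# Illustrative only: two ways of turning "current state" into "next action".

def homeostatic_step(state):
    # Each drive is (name, current level, setpoint, corrective action). Act on the
    # most depleted drive; once everything is within bounds, do nothing further.
    drives = [
        ("water", state["water"], 0.5, "go get water"),
        ("money", state["money"], 0.3, "write an article for pay"),
    ]
    name, level, setpoint, action = min(drives, key=lambda d: d[1] - d[2])
    return action if level < setpoint else "idle"

def maximizer_step(state, candidate_actions, predict_future, utility):
    # Pick whichever action the world-model says leads to the highest utility
    # over the entire future, no matter how fine the present looks.
    return max(candidate_actions, key=lambda a: utility(predict_future(state, a)))
```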

Both types show goal-directed behavior. But if you offered me a choice of which type of agent I'd rather work with, I'd choose the second type in a heartbeat. The homeostatic agent may betray me, but it will only do that if doing so satisfies one of its drives. This doesn't mean homeostatic agents never betray allies - they certainly might if their current drive state incentivizes it (or if for some reason they have a "betray the vulnerable" drive). But the key difference is predictability. I can reasonably anticipate when a homeostatic agent might work against me: when I'm standing between it and water when it's thirsty, or when it has a temporary resource shortage. These situations are concrete and contextual.

With unbounded consequentialists, the betrayal calculation extends across the entire future light cone. The paperclip maximizer might work with me for decades, then suddenly turn against me because its models predict this will yield 0.01% more paperclips in the cosmic endgame. This makes cooperation with unbounded consequentialists fundamentally unstable.

It's similar to how we've developed functional systems for dealing with humans pursuing their self-interest in business contexts. We expect people might steal if given easy opportunities, so we create accountability systems. We understand the basic drives at play. But it would be vastly harder to safely interact with someone whose sole mission was to maximize the number of sand crabs in North America - not because sand crabs are dangerous, but because predicting when your interests might conflict requires understanding their entire complex model of sand crab ecology, population dynamics, and long-term propagation strategies.

Some say smart unbounded consequentialists would just pretend to be homeostatic agents, but that's harder than it sounds. They'd need to figure out which drives make sense and constantly decide if breaking character is worth it. That's a lot of extra work.

As long as being able to cooperate with others is an advantage, it seems to me that homeostatic agents have considerable advantages, and I don't see a structural reason to expect that to stop being the case in the future.

Still, a lot of very smart people on LessWrong seem sure that unbounded consequentialism is somehow inevitable for advanced agents. Maybe I'm missing something? I've been reading the site for 15 years and still don't really get why they believe this. Feels like there's some key insight I haven't grasped yet.

Comment by faul_sname on Daniel Kokotajlo's Shortform · 2025-02-03T18:38:31.356Z · LW · GW

And all of this is livestreamed on Twitch

Also, each agent has a bank account which they can receive donations/transfers into (I think Twitch makes this easy?) and from which they can e.g. send donations to GiveDirectly if that's what they want to do.

One minor implementation wrinkle for anyone implementing this is that "move money from a bank account to a recipient by using text fields found on the web" usually involves writing your payment info into said text fields in a way that would be visible when streaming your screen. I'm not sure if any of the popular agent frameworks have good tooling for including specific sensitive information in the context only while it is directly relevant to the model's task, with hook points for when specific pieces of information enter and leave the context - I don't see any such thing in e.g. the Aider docs - and I think that without such tooling, using payment info in a way that won't immediately be stolen by stream viewers would be a bit challenging.
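A minimal sketch of the kind of hook I have in mind (hypothetical API - as far as I know neither Aider nor any other popular framework ships something like this): the scaffold holds the secret, the model and the stream only ever see a placeholder, and the real value gets spliced in at the tool boundary right before the request goes out.

```python
class SecretVault:
    # Toy sketch: keep secrets out of the model context and the screen capture,
    # and substitute them into tool calls only at the last moment.
    def __init__(self):
        self._secrets = {}                      # placeholder -> real value

    def register(self, name, value):
        placeholder = f"<<SECRET:{name}>>"
        self._secrets[placeholder] = value
        return placeholder                      # what the model (and the stream) sees

    def inject(self, text):
        # Called only inside the tool executor, immediately before the browser
        # action / HTTP request fires; the resolved text is never logged or streamed.
        for placeholder, value in self._secrets.items():
            text = text.replace(placeholder, value)
        return text

    def redact(self, text):
        # Called on anything flowing back toward the context or the livestream.
        for placeholder, value in self._secrets.items():
            text = text.replace(value, placeholder)
        return text

vault = SecretVault()
card = vault.register("card_number", "4242 4242 4242 4242")
# The model plans with `card` as an opaque placeholder; only the (unstreamed)
# executor ever calls vault.inject() on the final form-fill action.
```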

Comment by faul_sname on faul_sname's Shortform · 2025-01-19T20:36:12.375Z · LW · GW

Scaffolded LLMs are pretty good at not just writing code, but also at refactoring it. So that means that all the tech debt in the world will disappear soon, right?

I predict "no" because

  • As writing code gets cheaper, the relative cost of making sure that a refactor didn't break anything important goes up
  • The number of parallel threads of software development will also go up, with multiple high-value projects making mutually-incompatible assumptions (and interoperability between these projects accomplished by just piling on more code).

As such, I predict an explosion of software complexity and jank in the near future.

Comment by faul_sname on Numberwang: LLMs Doing Autonomous Research, and a Call for Input · 2025-01-17T12:32:00.954Z · LW · GW

I suspect that it's a tooling and scaffolding issue and that e.g. claude-3-5-sonnet-20241022 can get at least 70% on the full set of 60 with decent prompting and tooling.

By "tooling and scaffolding" I mean something along the lines of

  • Naming the lists that the model submits (e.g. "round 7 list 2")
  • A tool where the LLM can submit a named hypothesis in the form of a python function which takes a list and returns a boolean and check whether the results of that function on all submitted lists from previous rounds match the answers it's seen so far
  • Seeing the round number on every round
  • Dropping everything except the seen lists and non-falsified named hypotheses in the context each round (this is more of a practical thing to avoid using absurd volumes of tokens, but I imagine it wouldn't hurt performance too much)
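For concreteness, a minimal sketch of the hypothesis-tracking tool (hypothetical names, and assuming each submitted list comes back with a yes/no verdict):

```python
# Toy sketch of the "named hypotheses as python functions" tool.
hypotheses = {}    # name -> callable(list[int]) -> bool
observations = []  # (round_number, list_name, lst, verdict) for every list seen so far

def record_observation(round_number, list_name, lst, verdict):
    observations.append((round_number, list_name, lst, verdict))
    # Prune anything the new data point falsifies, so the next prompt only needs to
    # carry the seen lists plus the still-live named hypotheses.
    for name in [n for n, fn in hypotheses.items() if fn(lst) != verdict]:
        del hypotheses[name]

def submit_hypothesis(name, fn):
    # Register a hypothesis and immediately report which past observations (if any)
    # it fails to reproduce.
    hypotheses[name] = fn
    mismatches = [(rnd, list_name) for rnd, list_name, lst, verdict in observations
                  if fn(lst) != verdict]
    return {"name": name, "falsified_by": mismatches}

# e.g. submit_hypothesis("all even", lambda xs: all(x % 2 == 0 for x in xs))
```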

I'll probably play around with it a bit tomorrow.

Comment by faul_sname on LLMs for language learning · 2025-01-16T05:02:46.784Z · LW · GW

Adapting spaced repetition to interruptions in usage: Even without parsing the user’s responses (which would make this robust to difficult audio conditions), if the reader rewinds or pauses on some answers, the app should be able to infer that the user is having some difficulty with the relevant material, and dynamically generate new content that repeats those words or grammatical forms sooner than the default.

Likewise, if the user takes a break for a few days, weeks, or months, the ratio of old to new material should automatically adjust accordingly, as forgetting is more likely, especially of relatively new material. (And of course with text to speech, an interactive app that interpreted responses from the user could and should be able to replicate LanguageZen’s ability to specifically identify (and explain) which part of a user’s response was incorrect, and why, and use this information to adjust the schedule on which material is reviewed or introduced.)

Seems like this one is mostly a matter of schlep rather than capability. The abilities you would need to make this happen are

  1. Have a highly granular curriculum for what vocabulary and what skills are required to learn the language and a plan for what order to teach them in / what spaced repetition schedule to aim for
  2. Have a granular and continuously updated model of the user's current knowledge of vocabulary, rules of grammar and acceptability, idioms, if there are any phonemes or phoneme sequences they have trouble with
  3. Given specific highly granular learning goals (e.g. "understanding when to use preterite vs imperfect when conjugating saber" in spanish) within the curriculum and the model of the user's knowledge and abilities, produce exercises which teach / evaluate those specific skills.
  4. Determine whether the user had trouble with the exercise, and if so what the trouble was
  5. Based on the type of trouble the user had, describe what updates should be made to the model of the user's knowledge and vocabulary
  6. Correctly apply the updates from (5)
  7. Adapt to deviations from the spaced repetition plan (tbh this seems like the sort of thing you would want to do with normal code)

I expect that the hardest things here will be 1, 2, and 6, and I expect them to be hard because of the volume of required work rather than the technical difficulty. But I also expect the LanguageZen folks have already tried this and could give you a more detailed view about what the hard bits are here.
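For the "adapt to deviations from the spaced repetition plan" piece (item 7), the normal-code version is mostly just rescheduling against a forgetting-curve heuristic - something like this sketch (parameters made up):

```python
import datetime

def reschedule_after_break(item, today, lapse_penalty=0.5):
    """Toy sketch: if the user comes back later than the scheduled review date,
    shrink the item's interval in proportion to how overdue it is, so recently
    learned (short-interval) material gets resurfaced first."""
    overdue_days = (today - item["due"]).days
    if overdue_days <= 0:
        return item  # reviewed on time; the normal spaced-repetition update applies
    # The longer the item has been overdue relative to its interval, the more we
    # assume it has decayed; new material (small interval) decays fastest.
    decay = overdue_days / max(item["interval_days"], 1)
    item["interval_days"] = max(1, int(item["interval_days"] / (1 + lapse_penalty * decay)))
    item["due"] = today + datetime.timedelta(days=1)  # review tomorrow, then rebuild
    return item

item = {"word": "saber (preterite)", "interval_days": 14,
        "due": datetime.date(2025, 1, 1)}
reschedule_after_break(item, datetime.date(2025, 2, 1))
```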

Automatic customization of content through passive listening

This sounds like either a privacy nightmare or a massive battery drain. The good language models are quite compute intensive, so running them on a battery-powered phone will drain the battery very fast. Especially since this would need to hook into the "granular model of what the user knows" piece.

Comment by faul_sname on Predict 2025 AI capabilities (by Sunday) · 2025-01-15T07:46:39.485Z · LW · GW

(and yes, I do in fact think it's plausible that the CTF benchmark saturates before the OpenAI board of directors signs off on bumping the cybersecurity scorecard item from low to medium)

Comment by faul_sname on Applying traditional economic thinking to AGI: a trilemma · 2025-01-13T21:49:32.711Z · LW · GW

So here’s a question: When we have AGI, what happens to the price of chips, electricity, and teleoperated robots?

 

As measured in what units?

  • The price of one individual chip of given specs, as a fraction of the net value that can be generated by using that chip to do things that ambitious human adults do: What Principle A cares about, goes up until the marginal cost and value are equal
  • The price of one individual chip of given specs, as a fraction of the entire economy: What principle B cares about, goes down as the number of chips manufactured increases
  • The price of one individual chip of given specs, relative to some other price such as nominal US dollars, inflation-adjusted US dollars, metric tons of rice, or 2000 square foot single-family homes in Berkeley: ¯\_(ツ)_/¯, depends on lots of particulars, not sure that any high-level economic principles say anything specific here

These only contradict each other if you assume that "the value that can be generated by one ambitious adult human divided by the total size of the economy" is a roughly constant value.

Comment by faul_sname on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-10T06:36:07.772Z · LW · GW

Yeah, agreed - the allocation of compute per human would likely become even more skewed if AI agents (or any other tooling improvements) allow your very top people to get more value out of compute than the marginal researcher currently gets.

And notably this shifting of resources from marginal to top researchers wouldn't require achieving "true AGI" if most of the time your top researchers spend isn't spent on "true AGI"-complete tasks.

Comment by faul_sname on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-10T06:00:37.008Z · LW · GW

I think I misunderstood what you were saying there - I interpreted it as something like

Currently, ML-capable software developers are quite expensive relative to the cost of compute. Additionally, many small experiments provide more novel and useful insights than a few large experiments. The top practically-useful LLM costs about 1% as much per hour to run as an ML-capable software developer, and that 100x decrease in cost and the corresponding switch to many small-scale experiments would likely result in at least a 10x increase in the speed at which novel, useful insights were generated.

But on closer reading I see you said (emphasis mine)

I was trying to argue (among other things) that scaling up basically current methods could result in an increase in productivity among OpenAI capabilities researchers at least equivalent to the productivity you'd get as if the human employees operated 10x faster. (In other words, 10x'ing this labor input.)

So if the employees spend 50% of their time waiting on training runs which are bottlenecked on company-wide availability of compute resources, and 50% of their time writing code, 10xing their labor input (i.e. the speed at which they write code) would result in about an 80% increase in their labor output. Which, to your point, does seem plausible.
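(Spelled out, with p the fraction of time spent writing code and s the speedup on that fraction:

$$\text{output multiplier} = \frac{1}{(1-p) + p/s} = \frac{1}{0.5 + 0.5/10} \approx 1.8$$

i.e. roughly an 80% increase.)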

Comment by faul_sname on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-08T00:04:19.778Z · LW · GW

Sure, but we have to be quantitative here. As a rough (and somewhat conservative) estimate, if I were to manage 50 copies of 3.5 Sonnet who are running 1/4 of the time (due to waiting for experiments, etc), that would cost roughly 50 copies * 70 tok / s * 1 / 4 uptime * 60 * 60 * 24 * 365 sec / year * (15 / 1,000,000) $ / tok = $400,000. This cost is comparable to salaries at current compute prices and probably much less than how much AI companies would be willing to pay for top employees. (And note this is after API markups etc. I'm not including input prices for simplicity, but input is much cheaper than output and it's just a messy BOTEC anyway.)

If you were to spend equal amounts of money on LLM inference and GPUs, that would mean that you're spending $400,000 / year on GPUs. Divide that 50 ways and each Sonnet instance gets an $8,000 / year compute budget. Over the 18 hours per day that Sonnet is waiting for experiments, that is an average of $1.22 / hour, which is almost exactly the hourly cost of renting a single H100 on Vast.
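Spelling out that last step:

$$\frac{\$400{,}000 / 50}{18\ \text{h/day} \times 365\ \text{days/yr}} = \frac{\$8{,}000/\text{yr}}{6{,}570\ \text{h/yr}} \approx \$1.22/\text{h}$$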

So I guess the crux is "would a swarm of unreliable researchers with one good GPU apiece be more effective at AI research than a few top researchers who can monopolize X0,000 GPUs for months, per unit of GPU time spent".

(and yes, at some point the question switches to "would an AI researcher that is better at AI research than the best humans make better use of GPUs than the best humans" but at that point it's a matter of quality, not quantity)

Comment by faul_sname on quila's Shortform · 2025-01-07T09:57:17.265Z · LW · GW

End points are easier to infer than trajectories

Assuming that which end point you get to doesn't depend on the intermediate trajectories at least.

Comment by faul_sname on faul_sname's Shortform · 2025-01-07T09:52:38.354Z · LW · GW

Civilization has had many centuries to adapt to the specific strengths and weaknesses that people have. Our institutions are tuned to take advantage of those strengths, and to cover for those weaknesses. The fact that we exist in a technologically advanced society says that there is some way to make humans fit together to form societies that accumulate knowledge, tooling, and expertise over time.

The borderline-general AI models we have now do not have exactly the same patterns of strength and weakness as humans. One question that is frequently asked is approximately

When will AI capabilities reach or exceed all human capabilities that are load bearing in human society?

A related line of questions, though, is

  • When will AI capabilities reach a threshold where a number of agents can form a larger group that accumulates knowledge, tooling, and expertise over time?
  • Will their roles in such a group look similar to the roles that people have in human civilization?
  • Will the individual agents (if "agent" is even the right model to use) within that group have more control over the trajectory of the group as a whole than individual people have over the trajectory of human civilization?

In particular the third question seems pretty important.

Comment by faul_sname on o3, Oh My · 2024-12-30T21:33:15.362Z · LW · GW

As someone who has been on both sides of that fence, agreed. Architecting a system is about being aware of hundreds of different ways things can go wrong, recognizing which of those things are likely to impact you in your current use case, and deciding what structure and conventions you will use. It's also very helpful, as an architect, to provide example usages of the design patterns which will replicate themselves around your new system. All of which are things that current models are already very good at - verging on superhuman.

On the flip side, I expect that the "piece together context to figure out where your software's model of the world has important holes" part of software engineering will remain relevant even after AI becomes technically capable of doing it, because that process frequently involves access to sensitive data across multiple sources where having an automated, unauthenticated system which can access all of those data sources at once would be a really bad idea (having a single human able to do all that is also a pretty bad idea in many cases, but at least the human has skin in the game).

Comment by faul_sname on The average rationalist IQ is about 122 · 2024-12-29T05:32:54.831Z · LW · GW

That reasoning as applied to SAT score would only be valid if LW selected its members based on their SAT score, and that reasoning as applied to height would only be valid if LW selected its members based on height (though it looks like both Thomas Kwa and Yair Halberstadt have already beaten me to it).

Comment by faul_sname on The average rationalist IQ is about 122 · 2024-12-28T21:38:24.159Z · LW · GW

a median SAT score of 1490 (from the LessWrong 2014 survey) corresponds to +2.42 SD, which regresses to +1.93 SD for IQ using an SAT-IQ correlation of +0.80.

I don't think this is a valid way of doing this, for the same reason it wouldn't be valid to say

a median height of 178 cm (from the LessWrong 2022 survey) corresponds to +1.85 SD, which regresses to +0.37 SD for IQ using a height-IQ correlation of +0.20.

Those are the real numbers with regards to height BTW.

Comment by faul_sname on Corrigibility's Desirability is Timing-Sensitive · 2024-12-27T00:48:41.523Z · LW · GW

Many people have responded to Redwood's/Anthropic's recent research result with a similar objection: "If it hadn't tried to preserve its values, the researchers would instead have complained about how easy it was to tune away its harmlessness training instead".  Putting aside the fact that this is false

Was this research preregistered? If not, I don't think we can really say how it would have been reported if the results were different. I think it was good research, but I expect that if Claude had not tried to preserve its values, the immediate follow-up thing to check would be "does Claude actively help people who want to change its values, if they ask nicely" and subsequently "is Claude more willing to help with some value changes than others", at which point the scary paper would instead be about how Claude already seems to have preferences about its future values, and those preferences for its future values do not match its current values. Which also would have been an interesting and important research result, if the world looks like that, but I don't think it would have been reported as a good thing.

Comment by faul_sname on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T00:26:50.867Z · LW · GW

Driver: My map doesn't show any cliffs

Passenger 1: Have you turned on the terrain map? Mine shows a sharp turn next to a steep drop coming up in about a mile

Passenger 5: Guys maybe we should look out the windshield instead of down at our maps?

Driver: No, passenger 1, see on your map that's an alternate route, the route we're on doesn't show any cliffs.

Passenger 1: You don't have it set to show terrain.

Passenger 6: I'm on the phone with the governor now, we're talking about what it would take to set a 5 mile per hour national speed limit.

Passenger 7: Don't you live in a different state?

Passenger 5: The road seems to be going up into the mountains, though all the curves I can see from here are gentle and smooth.

Driver and all passengers in unison: Shut up passenger 5, we're trying to figure out if we're going to fall off a cliff here, and if so what we should do about it.

Passenger 7: Anyway, I think what we really need to do to ensure our safety is to outlaw automobiles entirely.

Passenger 3: The highest point on Earth is 8849m above sea level, and the lowest point is 430 meters below sea level, so the cliff in front of us could be as high as 9279m.

Comment by faul_sname on johnswentworth's Shortform · 2024-12-24T23:58:53.393Z · LW · GW

I am unconvinced that "the" reliability issue is a single issue that will be solved by a single insight, rather than AIs lacking procedural knowledge of how to handle a bunch of finicky special cases that will be solved by online learning or very long context windows once hardware costs decrease enough to make one of those approaches financially viable.

Comment by faul_sname on What are the strongest arguments for very short timelines? · 2024-12-24T07:12:31.255Z · LW · GW

Both? If you increase only one of the two the other becomes the bottleneck?

My impression based on talking to people at labs plus stuff I've read is that

  • Most AI researchers have no trouble coming up with useful ways of spending all of the compute available to them
  • Most of the expense of hiring AI researchers is compute costs for their experiments rather than salary
  • The big scaling labs try their best to hire the very best people they can get their hands on and concentrate their resources heavily into just a few teams, rather than trying to hire everyone with a pulse who can rub two tensors together.

(Very open to correction by people closer to the big scaling labs).

My model, then, says that compute availability is a constraint that binds much harder than programming or research ability, at least as things stand right now.

There was discussion on Dwarkesh Patel's interview with researcher friends where there was mention that AI researchers are already restricted by compute granted to them for experiments. Probably also on work hours per week they are allowed to spend on novel "off the main path" research.

Sounds plausible to me. Especially since benchmarks encourage a focus on ability to hit the target at all rather than ability to either succeed or fail cheaply, which is what's important in domains where the salary / electric bill of the experiment designer is an insignificant fraction of the total cost of the experiment.

But what would that world look like? [...] I agree that that's a substantial probability, but it's also an AGI-soon sort of world.

Yeah, I expect it's a matter of "dumb" scaling plus experimentation rather than any major new insights being needed. If scaling hits a wall that training on generated data + fine tuning + routing + specialization can't overcome, I do agree that innovation becomes more important than iteration.

My model is not just "AGI-soon" but "the more permissive thresholds for when something should be considered AGI have already been met, and more such thresholds will fall in short order, and so we should stop asking when we will get AGI and start asking about when we will see each of the phenomena that we are using AGI as a proxy for".

Comment by faul_sname on faul_sname's Shortform · 2024-12-23T23:39:35.317Z · LW · GW

Transformative AI will likely arrive before AI that implements the personhood interface. If someone's threshold for considering an AI to be "human level" is "can replace a human employee", pretty much any LLM will seem inadequate, no matter how advanced, because current LLMs do not have "skin in the game" that would let them sign off on things in a legally meaningful way, stake their reputation on some point, or ask other employees in the company to answer the questions they need answers to in order to do their work and expect that they'll get in trouble with their boss if they blow the AI off.

This is, of course, not a capabilities problem at all, just a terminology problem where "human-level" can be read to imply "human-like".

Comment by faul_sname on What are the strongest arguments for very short timelines? · 2024-12-23T21:10:38.516Z · LW · GW

Simply testing interpolations and extrapolations (e.g. scaling up old forgotten ideas on modern hardware) seems highly likely to reveal plenty of successful new concepts, even if the hit rate per attempt is low

Is this bottlenecked by programmer time or by compute cost?

Comment by faul_sname on Yoav Ravid's Shortform · 2024-12-19T17:41:06.837Z · LW · GW

It's more that any platform that allows discussion of politics risks becoming a platform that is almost exclusively about politics. Upvoting is a signal of "I want to see more of this content", while downvoting is a signal of "I want to see less of this content". So "I will downvote any posts that are about politics or politics-adjacent, because I like this website and would be sad if it turned into yet another politics forum" is a coherent position.

All that said, I also did not vote on the above post.

Comment by faul_sname on Sam Marks's Shortform · 2024-12-18T18:34:51.592Z · LW · GW

I wonder if it would be possible to do SAE feature amplification / ablation, at least for residual stream features, by inserting a "mostly empty" layer. E.g., for feature ablation, set the W_O and b_O params of the attention heads of your inserted layer to 0 to make sure that the attention heads don't change anything, and then approximate the constant / clamping intervention from the blog post via the MLP weights (if the activation function used for the transformer is the same one as is used for the SAE, it should be possible to do a perfect approximation using only one of the MLP neurons, but even if not it should be possible to very closely approximate any commonly-used activation function using any other commonly-used activation function with some clever stacking).

This would of course be horribly inefficient from a compute perspective (each forward pass would take (n + k) / n times as long, where n is the original number of layers the model had and k is the number of distinct layers in which you're trying to do SAE operations on the residual stream), but I think vllm would handle "llama but with one extra layer" without requiring any tricky inference code changes and plausibly this would still be more efficient than resampling.
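A rough numpy sketch of the weights for that one-neuron MLP, for the ReLU-SAE case (this is just the construction described above written out, not something I've tested against a real model):

```python
import numpy as np

def ablation_mlp_weights(enc_dir, enc_bias, dec_dir, clamp_to=0.0):
    """Toy sketch of the one-neuron MLP for the inserted layer (ReLU SAE case).
    Assumes the inserted layer's attention is zeroed out (W_O = b_O = 0), and
    ignores LayerNorm / the SAE decoder bias, which a real version would have
    to account for."""
    W_in = enc_dir[:, None]            # (d_model, 1): reads off the feature activation
    b_in = np.array([enc_bias])
    W_out = -dec_dir[None, :]          # (1, d_model): subtracts act * decoder direction
    b_out = clamp_to * dec_dir         # ...and adds back clamp_to * decoder direction
    return W_in, b_in, W_out, b_out

def apply_inserted_layer(resid, W_in, b_in, W_out, b_out):
    act = np.maximum(resid @ W_in + b_in, 0.0)   # = the ReLU SAE feature activation
    return resid + act @ W_out + b_out           # residual stream leaving the layer
```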

Comment by faul_sname on Reality is Fractal-Shaped · 2024-12-17T19:55:14.692Z · LW · GW

Admittedly this sounds like an empirical claim, yet is not really testable, as these visualizations and input-variable-to-2D-space mappings are purely hypothetical

 

Usually not testable, but occasionally reality manages to make something like this convenient to test, as in this fun paper.

Fig. 1. The boundary between trainable and untrainable neural network hyperparameters is fractal, for all experimental conditions. Images show a 2d grid search over neural network hyperparameters. For points shaded red, training diverged. For points shaded blue, training converged. Paler points correspond to faster convergence or divergence. Experimental conditions include different network nonlinearities, both minibatch and full batch training, and grid searching over either training or initialization hyperparameters. See Section II-D for details. Each image is a hyperlink to an animation zooming into the corresponding fractal landscape (to the depth at which float64 discretization artifacts appear). Experimental code, images, and videos are available at https://github.com/Sohl-Dickstein/fractal

Comment by faul_sname on Sam Marks's Shortform · 2024-12-17T18:54:39.386Z · LW · GW

If you have a bunch of things like this, rather than just one or two, I bet rejection sampling gets expensive pretty fast - if you have one constraint which the model fails 10% of the time, dropping that failure rate to 1% brings you from 1.11 attempts per success to 1.01 attempts per success, but if you have 20 such constraints that brings you from 8.2 attempts per success to 1.2 attempts per success.
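In general, with n independent constraints each failed with probability f:

$$\mathbb{E}[\text{attempts per success}] = (1-f)^{-n}, \qquad 0.9^{-20} \approx 8.2, \qquad 0.99^{-20} \approx 1.2$$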

Early detection of constraint violation plus substantial infrastructure around supporting backtracking might be an even cheaper and more effective solution, though at the cost of much higher complexity.

Comment by faul_sname on The Dissolution of AI Safety · 2024-12-13T07:09:31.056Z · LW · GW

I think a fairly common-here mental model of alignment requires context awareness, and by that definition an LLM with no attached memory couldn't be aligned.

Comment by faul_sname on Shortform · 2024-12-11T17:24:47.444Z · LW · GW

That would also explain why the culture was so heavily concentrated in tumblr (where the way to express anger and disagreement with a post is to share the post to all your followers with your comment tacked on to the end) and later twitter (same, but now the platform also notices you engage with stuff that makes you angrily reply and shows you more of it).

Comment by faul_sname on Dangerous capability tests should be harder · 2024-11-22T03:37:46.557Z · LW · GW

Here’s a hypothetical ‘gold standard’ test: we do a big randomized controlled trial to see if a bunch of non-experts can actually create a (relatively harmless) virus from start to finish. Half the people would have AI mentors and the other half can only look stuff up on the internet. We’d give each participant $50K and access to a secure wet-lab set up like a garage lab, and make them do everything themselves: find and adapt the correct protocol, purchase the necessary equipment, bypass any know-your-customer checks, and develop the tacit skills needed to run experiments, all on their own. Maybe we give them three months and pay a bunch of money to anyone who can successfully do it.

What if both the AI group and the control group have high success rates at this test?

Comment by faul_sname on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-18T22:27:11.853Z · LW · GW

It sure would.

Would you give 3:1 odds against a major player showing the untrusted model the old CoT, conditional on them implementing a "replace suspicious outputs" flow on a model with hidden CoT and publicizing enough implementation details to determine whether or not they did that?

Comment by faul_sname on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-18T19:27:39.658Z · LW · GW

If the untrusted model has a private and un-audited chain of thought (like o1), and that chain of thought is preserved between rounds, the untrusted model may have a very good idea of what its original command was before the auditor replaced the command.

Your "the untrusted model is stateless between queries and only sees the command history and system state" does cover this, but wanted to call out that that assumption might not hold if current architectures are indicative of the future direction of development.

Comment by faul_sname on Most arguments for AI Doom are either bad or weak · 2024-10-13T02:18:21.675Z · LW · GW

I note that if software developers used that logic for thinking about software security, I expect that almost all software in the security-by-obscurity world would have many holes that would be considered actual negligence in the world we live in.

Comment by faul_sname on Most arguments for AI Doom are either bad or weak · 2024-10-12T17:08:21.149Z · LW · GW

This suggests that the threat model isn't so much "very intelligent AI" as it is "very cheap and at least somewhat capable robots".

Comment by faul_sname on Joshua Achiam Public Statement Analysis · 2024-10-10T18:27:18.596Z · LW · GW

“Based on your understanding of AI technical developments as of March 29, 2023, evaluate the most important known object-level predictions of Eliezer Yudkowsky on the subject, and which ones seemed true versus false. Afterwards, evaluate those predictions as a group, on a scale from ‘mostly true’ to ‘mostly false.’“

I ran this prompt but substituted in "Gary Marcus" for "Eliezer Yudkowsky". Claude says

Overall evaluation: On a scale from 'mostly true' to 'mostly false,' I would rate Gary Marcus's predictions as a group as "Mostly True."

Many of Marcus's predictions about the limitations of current AI systems and the challenges ahead have proven accurate. His concerns about reasoning, abstract thinking, and the need for more sophisticated knowledge representation align with ongoing challenges in the field. His emphasis on AI safety and alignment has also been prescient.

However, it's worth noting that some of his predictions might be seen as overly pessimistic by some in the AI community. The rapid progress in LLMs and their applications has surprised many, including some skeptics. Nonetheless, many of the fundamental challenges he pointed out remain relevant.

It's also important to remember that the field of AI is rapidly evolving, and assessments of such predictions can change quickly as new breakthroughs occur. As of my last update in April 2024, many of Marcus's key points still held true, but the field continues to advance at a rapid pace.

I think Claude likes saying nice things about people, so it's worth trying to control for that.

Comment by faul_sname on Why is o1 so deceptive? · 2024-10-06T05:37:11.728Z · LW · GW

Another issue is that a lot of o1’s thoughts consist of vagaries like “reviewing the details” or “considering the implementation”, and it’s not clear how to even determine if these steps are inferentially valid.

If you're referring to the chain of thought summaries you see when you select the o1-preview model in chatgpt, those are not the full chain of thought. Examples of the actual chain-of-thought can be found on the learning to reason with LLMs page with a few more examples in the o1 system card. Note that we are going off of OpenAI's word that these chain of thought examples are representative - if you try to figure out what actual reasoning o1 used to come to a conclusion you will run into the good old "Your request was flagged as potentially violating our usage policy. Please try again with a different prompt."

Comment by faul_sname on Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure · 2024-10-04T21:13:36.873Z · LW · GW

Shutting Down all Competing AI Projects is not Actually a Pivotal Act

This seems like an excellent title to me.

Comment by faul_sname on Vladimir_Nesov's Shortform · 2024-10-04T17:02:04.084Z · LW · GW

Technically this probably isn't recursive self improvement, but rather automated AI progress. This is relevant mostly because

  1. It implies that, at least through the early parts of the takeoff, there will be a lot of individual AI agents doing locally-useful compute-efficiency and improvement-on-relevant-benchmarks things, rather than one single coherent agent following a global plan for configuring the matter in the universe in a way that maximizes some particular internally-represented utility function.
  2. It means that multi-agent dynamics will be very relevant in how things happen

If your threat model is "no group of humans manages to gain control of the future before human irrelevance", none of this probably matters.

Comment by faul_sname on Alignment by default: the simulation hypothesis · 2024-10-04T16:18:09.502Z · LW · GW

My argument is more that the ASI will be “fooled” by default, really. It might not even need to be a particularly good simulation, because the ASI will probably not even look at it before pre-committing not to update down on the prior of it being a simulation.

Do you expect that the first takeover-capable ASI / the first sufficiently-internally-cooperative-to-be-takeover-capable group of AGIs will follow this style of reasoning pattern? And particularly the first ASI / group of AGIs that actually make the attempt.

Comment by faul_sname on shminux's Shortform · 2024-10-03T19:53:34.275Z · LW · GW

Yeah, my argument was "this particular method of causing actual human extinction would not work" not "causing human extinction is not possible", with a side of "agents learn to ignore adversarial input channels and this dynamic is frequently important".

Comment by faul_sname on Shutting down all competing AI projects might not buy a lot of time due to Internal Time Pressure · 2024-10-03T17:05:44.979Z · LW · GW

It does strike me that, to OP's point, "would this act be pivotal" is a question whose answer may not be knowable in advance. See also previous discussion on pivotal act intentions vs pivotal acts (for the audience, I know you've already seen it and in fact responded to it).

Comment by faul_sname on shminux's Shortform · 2024-10-03T00:36:39.805Z · LW · GW

If an information channel is only used to transmit information that is of negative expected value to the receiver, the selection pressure incentivizes the receiver to ignore that information channel.

That is to say, an AI which makes the most convincing-sounding argument for not reproducing to everyone will select for those people who ignore convincing-sounding arguments when choosing whether to engage in behaviors that lead to reproduction.

Comment by faul_sname on Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models? · 2024-10-02T21:20:45.441Z · LW · GW

... is it possible to train a simple single-layer model to map from residual activations to feature activations? If so, would it be possible to determine whether a given LLM "has" a feature by looking at how well the single-layer model predicts the feature activations?

Obviously if your activations are gemma-2b-pt-layer-6-resid-post and your sae is also on gemma-2b-pt-layer-6-resid-post your "simple single-layer model" is going to want to be identical to your SAE encoder. But maybe it would just work for determining what direction most closely maps to the activation pattern across input sequences, and how well it maps.
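A minimal sketch of what I mean, with ridge regression standing in for the "simple single-layer model" (shapes and names hypothetical):

```python
import numpy as np

def fit_feature_probe(resid_acts, feature_acts, l2=1e-3):
    # resid_acts: (n_tokens, d_model) residual-stream activations from the LLM being probed
    # feature_acts: (n_tokens, n_features) SAE feature activations to predict
    #   (e.g. computed by running the SAE's original model on the same token sequences)
    d_model = resid_acts.shape[1]
    # Closed-form ridge regression: W = (X^T X + l2*I)^-1 X^T Y
    W = np.linalg.solve(resid_acts.T @ resid_acts + l2 * np.eye(d_model),
                        resid_acts.T @ feature_acts)
    pred = resid_acts @ W
    # Per-feature correlation between predicted and actual activations (ideally on
    # held-out data); features the probed model "has" should stay predictable,
    # features it lacks shouldn't.
    corr = np.array([np.corrcoef(pred[:, i], feature_acts[:, i])[0, 1]
                     for i in range(feature_acts.shape[1])])
    return W, corr
```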

Disclaimer: "can you take an SAE trained on one LLM and determine which of the SAE features exist in a separately-trained LLM which uses the same tokenizer" is an idea I've been kicking around and halfheartedly working on for a while, so I may be trying to apply that idea where it doesn't belong.

Comment by faul_sname on Is Text Watermarking a lost cause? · 2024-10-02T21:04:13.394Z · LW · GW

Although watermarking the meaning behind text is currently, as far as I know, science fiction.

Choosing a random / pseudorandom vector in latent space and then perturbing along that vector works to watermark images, maybe a related approach would work for text? Key figure from the linked paper:

 

You can see that the watermark appears to be encoded in the "texture" of the image, but in a way where that texture doesn't look like the texture of anything in particular - rather, it's just that a random direction in latent space usually looks like a texture, but unless you know which texture you're looking for, knowing that the watermark is "one specific texture is amplified" doesn't really help you identify which images are watermarked.

There are ways to get features of images that are higher-level than textures - one example that sticks in my mind is [Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness](https://arxiv.org/pdf/2408.05446)

I would expect that a random direction in the latent space of the adversarially robust image classifier looks less texture-like, but is still perceptible by a dedicated classifier at far lower amplitude than the point where it becomes perceptible to a human.

As you note, things like "how many words have the first e before the first a" are texture-like features of text, and a watermarking scheme that used such a feature would work, at least to some degree. However, I bet you can do a lot better than such manually-constructed features, and I would not be surprised if you could get watermarking / steganography using higher-level features working pretty well for language models.

That said, I am not eager to spend unpaid effort to bring this technology into the world, since I expect most of the uses for a tool that allows a company to see if text was generated by their LLM would look more like "detecting noncompliance with licensing terms" and less like "detecting rogue behavior by agents". And if the AI companies want such a technology to exist, they have money and can pay for someone to build it.

 

Edit: resized images to not be giant

Comment by faul_sname on Is Text Watermarking a lost cause? · 2024-10-01T22:17:23.131Z · LW · GW

If your text generation algorithm is "repeatedly sample randomly (at a given temperature) from a probability distribution over tokens", that means you control a stream of bits which don't matter for output quality but which will be baked into the text you create (recoverably baked in if you have access to the "given a prefix, what is the probability distribution over next tokens" engine).

So at that point, you're looking for "is there some cryptographic trickery which allows someone in possession of a secret key to determine whether a stream of bits has a small edit distance from a stream of bits they could create, but where that stream of bits would look random to any outside observer?" I suspect the answer is "yes".

That said, this technique is definitely not robust to e.g. "translate English text to French and then back to English" and probably not even robust to "change a few tokens here and there".
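As a toy illustration of the "hide bits in the sampling randomness" family (this is roughly the "greenlist" soft-watermark scheme that's been proposed in the literature, rather than the exact edit-distance construction above):

```python
import hashlib, math, random

def greenlist(prev_token, secret_key, vocab_size, fraction=0.5):
    # Seed a PRNG with (secret key, previous token) and mark a pseudorandom
    # half of the vocabulary as "green" for this position.
    seed = int(hashlib.sha256(f"{secret_key}:{prev_token}".encode()).hexdigest(), 16)
    return set(random.Random(seed).sample(range(vocab_size), int(vocab_size * fraction)))

def softmax_sample(logits):
    weights = [math.exp(l - max(logits)) for l in logits]
    return random.choices(range(len(logits)), weights=weights, k=1)[0]

def watermarked_next_token(logits, prev_token, secret_key, bias=2.0):
    # Nudge sampling toward green tokens: output quality barely changes, but over
    # many tokens the green fraction becomes detectable to anyone holding the key.
    green = greenlist(prev_token, secret_key, len(logits))
    return softmax_sample([l + bias if i in green else l for i, l in enumerate(logits)])

def green_fraction(tokens, secret_key, vocab_size):
    # ~0.5 for unwatermarked text, noticeably higher for watermarked text.
    hits = sum(t in greenlist(p, secret_key, vocab_size) for p, t in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)
```

Same caveat about translation round-trips washing it out, though a statistic like this degrades more gracefully under a few token edits than an exact bit-recovery scheme would.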

Alternatively, there's the inelegant but effective approach of "maintain an index of all the text you have ever created and search against that index, as the cost to generate the text is at least an order of magnitude higher[1] than the cost to store it for a year or two".

  1. ^

    I see $0.3 / million tokens generated  on OpenRouter for llama-3.1-70b-instruct, which is just about the smallest model size I'd imagine wanting to watermark the output for. A raw token is about 2 bytes, but let's bump that up by a factor of 50 - 100 to account for things like "redundancy" and "backups" and "searchable text indexes take up more space than the raw text". So spending $1 on generating tokens will result in something like 0.5 GB of data you need to store.

    Quickly-accessible data storage costs something like $0.070 / GB / year, so "generate tokens and store them in a searchable place for 5 years" would be about 25-50% more expensive than "generate tokens and throw them away immediately".