Ram Potham's Shortform
post by Ram Potham (ram-potham) · 2025-03-23T15:08:10.996Z · LW · GW · 12 comments
Comments sorted by top scores.
comment by Ram Potham (ram-potham) · 2025-04-10T20:28:24.211Z · LW(p) · GW(p)
I argue that the optimal ethical stance is to become a rational Bodhisattva: a synthesis of effective altruism, two‑level utilitarianism, and the Bodhisattva ideal.
- Effective altruism insists on doing the most good per unit of resource, but can demand extreme sacrifices (e.g., donating almost all disposable income).
- Two‑level utilitarianism lets us follow welfare‑promoting rules in daily life and switch to explicit cost‑benefit calculations when rules conflict. Yet it offers little emotional motivation.
- The Bodhisattva ideal roots altruism in felt interdependence: the world’s suffering is one’s own. It supplies deep motivation and inner peace, but gives no algorithm for choosing the most beneficial act.
A rational Bodhisattva combines the strengths and cancels the weaknesses:
- Motivation: Like a Bodhisattva, they experience others’ suffering as their own, so compassion is effortless and durable.
- Method: Using reason and evidence (from effective altruism and two‑level utilitarianism), they pick the action that maximizes overall benefit.
- Flexibility: They apply the “middle way,” recognizing that different compassionate choices can be permissible when values collide.
Illustration
Your grandparent needs $50,000 for a life‑saving treatment, but the same money could save ten strangers through a GiveWell charity.
- A strict effective altruist/utilitarian would donate to GiveWell.
- A purely sentimental agent might fund the treatment.
- The rational Bodhisattva weighs both outcomes, factoring relational duties into the calculation, acts from compassion, and accepts the result without regret. In most cases they will choose the option with the greatest net benefit, but they can act otherwise when a compassionate rule or relational duty justifies it.
Thus, the rational Bodhisattva unites rigorous impact with deep inner peace.
comment by Ram Potham (ram-potham) · 2025-04-11T17:35:03.112Z · LW(p) · GW(p)
Memory in AI agents can pose a large security risk. Without memory, it's easier to red-team LLMs for safety. But once they are fine-tuned (one way of encoding memory), misalignment can be introduced.
According to AI Safety Atlas, most scaffolding approaches for memory provide

> a way for AI to access a vast repository of knowledge and use this information to construct more informed responses. However, this approach may not be the most elegant due to its reliance on external data sources and complex retrieval mechanisms. A potentially more seamless and integrated solution could involve utilizing the neural network's weights as dynamic memory, constantly evolving and updating based on the tasks performed by the network.
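For contrast with weight-based memory, here is a minimal sketch of the external-memory scaffolding described in the quote. Everything in it (the toy embedding function, the in-memory vector store) is an illustrative stand-in rather than any particular framework's API:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy embedding; a real system would call an embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

class ExternalMemory:
    """Toy vector store: the agent's memory lives outside the model weights."""
    def __init__(self):
        self.entries = []  # list of (embedding, text) pairs

    def write(self, text: str) -> None:
        self.entries.append((embed(text), text))

    def read(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        def score(entry):
            e, _ = entry
            return float(np.dot(e, q) / (np.linalg.norm(e) * np.linalg.norm(q) + 1e-9))
        ranked = sorted(self.entries, key=score, reverse=True)
        return [text for _, text in ranked[:k]]

memory = ExternalMemory()
memory.write("User prefers concise answers.")
memory.write("Task history: summarized three papers on RLHF.")

# Retrieved snippets are prepended to the prompt; the weights stay frozen, so
# red-teaming the base model still covers most of the system's behavior, but
# everything written to memory becomes a new attack surface to audit.
print(memory.read("What does the user prefer?"))
```

The upside of this style of memory is that every stored item is inspectable and can be filtered before it influences the model, which is exactly what becomes hard once memory lives in the weights.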
We need ways to ensure safety in powerful agents with memory, or we should not introduce memory modules at all. Otherwise, agents are constantly learning and can develop motivations not aligned with human volition.
Any thoughts on ensuring safety in agents that can update their memory?
comment by Ram Potham (ram-potham) · 2025-04-07T20:38:31.696Z · LW(p) · GW(p)
Accelerating AI safety research is critical for developing aligned systems, and transformative AI is a powerful tool for doing so.
However, creating AI systems that accelerate AI safety requires benchmarks to find the best-performing system. The problem is that benchmarks can be flawed or gamed for the following core reasons:
- Benchmarks may not represent real world ability
- Benchmark information can be leaked into AI model training
However, we can build a moving benchmark: reproducing the most recent technical alignment research from each paper's methods and data.
By reproducing a paper, something many researchers already do, we ensure the benchmark reflects real-world ability.
By only using papers published after a given model finished training, we ensure the data was not leaked.
This comes with the inconvenience that comparing models or approaches reported in two different papers is difficult because the benchmark is always changing, but I argue this is a worthwhile tradeoff (see the sketch after this list):
- We can still compare approaches if the same benchmark papers are used for both, ensuring those papers were published after all models involved finished training
- Benchmark data is always recent and relevant
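As a rough sketch of the paper-filtering step (the paper titles, metadata fields, and cutoff dates below are made-up placeholders):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Paper:
    title: str
    published: date
    has_code_and_data: bool  # reproduction requires released methods and data

# Hypothetical candidate pool of recent alignment papers.
papers = [
    Paper("Example paper A (scalable oversight)", date(2025, 3, 1), True),
    Paper("Example paper B (deception probes)", date(2024, 11, 20), True),
    Paper("Example paper C (no released data)", date(2025, 2, 10), False),
]

# Hypothetical training cutoffs for the models being compared.
model_cutoffs = {"model_a": date(2024, 12, 31), "model_b": date(2025, 1, 31)}
latest_cutoff = max(model_cutoffs.values())

# Only papers published after every model's cutoff, with released methods and
# data, are eligible -- this keeps the benchmark both unleaked and reproducible.
benchmark_set = [p for p in papers if p.published > latest_cutoff and p.has_code_and_data]
print([p.title for p in benchmark_set])  # -> ['Example paper A (scalable oversight)']
```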
Any thoughts?
Replies from: gwern
↑ comment by gwern · 2025-04-08T02:41:22.904Z · LW(p) · GW(p)
One of the most robust benchmarks of generalized ability, which is extremely easy to update (unlike benchmarks like Humanity's Last Exam), would just be to estimate the pretraining loss (ie. the compression ratio).
Replies from: jsd, ram-potham
↑ comment by jsd · 2025-04-08T03:57:38.790Z · LW(p) · GW(p)
there's this https://github.com/Jellyfish042/uncheatable_eval
Replies from: gwern, ram-potham
↑ comment by gwern · 2025-04-09T16:58:19.991Z · LW(p) · GW(p)
It's good someone else did it, but it has the same problems as the paper: not updated since May 2024, and limited to open source base models. So it needs to be started back up and add in approximate estimators for the API/chatbot models too before it can start providing a good universal capability benchmark in near-realtime.
↑ comment by Ram Potham (ram-potham) · 2025-04-08T20:14:10.603Z · LW(p) · GW(p)
This is great! Would like to see a continually updating public leaderboard of this.
↑ comment by Ram Potham (ram-potham) · 2025-04-08T20:13:06.078Z · LW(p) · GW(p)
Thanks gwern, the correlation between compression ratio and intelligence is really interesting. It works for LLMs, but less so for agentic systems, and I'm not sure it would scale to reasoning models, because test-time scaling accounts for a large share of the intelligence LLMs exhibit.
I agree we should see a continued compression benchmark.
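For concreteness, a minimal sketch of the quantity such a benchmark tracks: bits per byte computed from a model's token log-probabilities on text published after its training cutoff. The numbers below are placeholders; a real harness would pull per-token log-probs from the model (and, as gwern notes, only approximately for closed API models):

```python
import math

def bits_per_byte(token_log_probs: list[float], text: str) -> float:
    """Compression-style score: total negative log-likelihood in bits,
    divided by the byte length of the evaluated text. Lower is better,
    and it tracks pretraining loss."""
    total_nll_bits = -sum(token_log_probs) / math.log(2)  # nats -> bits
    return total_nll_bits / len(text.encode("utf-8"))

# Placeholder values standing in for per-token log-probs on a fresh document.
example_text = "A newly published news article the model has never seen."
example_log_probs = [-2.1, -0.4, -1.7, -0.9, -3.0, -0.6, -1.2]

print(f"{bits_per_byte(example_log_probs, example_text):.3f} bits/byte")
```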
comment by Ram Potham (ram-potham) · 2025-04-04T20:55:58.904Z · LW(p) · GW(p)
I believe we are doomed from superintelligence but I'm not sad.
There are simply too many [LW · GW] reasons why alignment will fail. We can assign a probability p(S_n aligned | S_{n-1} aligned), where S_n is the next level of superintelligence; this probability is less than 1.
As long as misalignment keeps increasing and superintelligence iterates on itself exponentially fast, we are bound to get misaligned superintelligence. Misalignment could decrease due to generalization, but we have no way of knowing whether that is the case, and it is optimistic to assume so.
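To spell out that step (a sketch, assuming each generation's alignment depends only on the previous generation, and that every conditional probability is bounded above by some $p_{\max} < 1$):

$$P(S_n \text{ aligned}) = P(S_0 \text{ aligned}) \prod_{i=1}^{n} p(S_i \text{ aligned} \mid S_{i-1} \text{ aligned}) \le p_{\max}^{\,n} \to 0 \quad \text{as } n \to \infty.$$

So unless the per-step probabilities approach 1 quickly enough (the generalization hope above), repeated self-improvement drives the chance of remaining aligned toward zero.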
The misaligned superintelligence will destroy humanity. This can lead to a lot of fear, or to a wish to die with dignity [LW · GW], but that misses the point.
We only live in the present, not the past nor the future. We know the future is not optimistic but that doesn't matter, because we live in the present and we constantly work on making the present better by expected value.
By remaining in the present, we can make the best decisions and put all of our effort into AI safety. We are not attached to the outcome of our work because we live in the present. But, we still model the best decision to make.
We model that our work on AI safety is the best decision because it buys us more time by expected value. And, maybe, we can buy enough time to upload our mind to a device so our continuation lives on despite our demise.
Then, even knowing the future is bad, we remain happy in the present working on AI safety.
comment by Ram Potham (ram-potham) · 2025-03-23T21:01:07.146Z · LW(p) · GW(p)
Current alignment methods like RLHF and Constitutional AI often use human evaluators. But these samples might not truly represent humanity or its diverse values. How can we model all human preferences with limited compute?
Maybe we should try collecting a vast dataset of written life experiences and beliefs from people worldwide.
We could then filter and augment this data to aim for better population representation. Vectorize these entries and use k-means to find 'k' representative viewpoints.
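A minimal sketch of that pipeline (the embedding model name and the value of k below are illustrative assumptions, not recommendations):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

# Toy stand-in for the filtered worldwide dataset of life experiences and beliefs.
entries = [
    "I value family obligations above personal ambition.",
    "Individual freedom matters more to me than tradition.",
    "My community's wellbeing comes before my own comfort.",
    "Technological progress should be pursued cautiously.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
embeddings = model.encode(entries)

k = 2  # in practice chosen to trade off coverage against compute
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)

# For each cluster, keep the entry closest to the centroid as the
# representative viewpoint used when evaluating alignment.
representatives = []
for c in range(k):
    members = np.where(kmeans.labels_ == c)[0]
    dists = np.linalg.norm(embeddings[members] - kmeans.cluster_centers_[c], axis=1)
    representatives.append(entries[members[np.argmin(dists)]])

print(representatives)
```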
These 'k' vectors could then serve as a more representative proxy for humanity's values when we evaluate AI alignment. Thoughts? Potential issues?
comment by Ram Potham (ram-potham) · 2025-03-25T13:51:22.078Z · LW(p) · GW(p)
Current AI alignment often relies on single metrics or evaluation frameworks. But what if a single metric has blind spots or biases, leading to a false sense of security or unnecessary restrictions? How can we not just use multiple metrics, but optimally use them?
Maybe we should try requiring AI systems to satisfy multiple, distinct alignment metrics, but with a crucial addition: we actively model the false positive rate of each metric and the non-overlapping aspects of alignment they capture.
Imagine we can estimate the probability that Metric A incorrectly flags an unaligned AI as aligned (its false positive rate), and similarly for Metrics B and C. Furthermore, imagine we understand which specific facets of alignment each metric uniquely assesses.
We could then select a subset of metrics, or even define a threshold of "satisfaction" across multiple metrics, based on a target false positive rate for the overall alignment evaluation.
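A minimal sketch of that selection logic, under the (strong) assumption that the metrics' false positives are independent, so that requiring all metrics in a subset to pass multiplies their false positive rates:

```python
from itertools import combinations

# Hypothetical per-metric false positive rates: the probability each metric
# passes an AI system that is actually unaligned.
metric_fpr = {"A": 0.20, "B": 0.10, "C": 0.30}

target_fpr = 0.05  # desired false positive rate for the overall evaluation

def combined_fpr(metrics) -> float:
    """FPR of requiring every metric in the subset to pass, assuming independence.
    Overlapping blind spots would make the true rate higher than this."""
    rate = 1.0
    for m in metrics:
        rate *= metric_fpr[m]
    return rate

# Find the smallest subsets of metrics that meet the target.
for size in range(1, len(metric_fpr) + 1):
    qualifying = [s for s in combinations(metric_fpr, size) if combined_fpr(s) <= target_fpr]
    if qualifying:
        print(f"Smallest qualifying subsets: {qualifying}")
        break
```

The independence assumption is doing a lot of work here; modeling the non-overlapping aspects each metric captures is what would justify or correct it.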
comment by Ram Potham (ram-potham) · 2025-03-22T23:10:31.494Z · LW(p) · GW(p)
OpenAI's Superalignment aimed for a human-level automated alignment engineer through scalable training, validation, and stress testing.
I propose a faster route: first develop an automated Agent Alignment Engineer. This system would automate the creation of aligned agents for diverse tasks by iteratively refining agent group chats, prompts, and tools until they pass success and safety evaluations.
This is tractable with today's LLM reasoning, as evidenced by coding agents matching top programmers. Instead of directly building a full Alignment Researcher, focusing on this intermediate step leverages current LLM strengths for agent orchestration. This system could then automate many parts of creating a broader Alignment Researcher.
Safety for the Agent Alignment Engineer can be largely ensured by operating in internet-disconnected environments (except for fetching research) with subsequent human verification of agent alignment and capability.
Examples: This Engineer could create agents that develop scalable training methods or generate adversarial alignment tests.
By prioritizing this more manageable stepping stone, we could significantly accelerate progress towards safe and beneficial advanced AI.
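A minimal sketch of the refinement loop this proposal describes. Every function below is a hypothetical placeholder (building the real versions is exactly the hard part), but it shows the intended control flow:

```python
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    prompts: dict = field(default_factory=dict)
    tools: list = field(default_factory=list)
    revision: int = 0

# --- Toy stand-ins for the components that would actually need to be built ---

def generate_agent_config(task_spec: str) -> AgentConfig:
    return AgentConfig(prompts={"system": f"You are a team solving: {task_spec}"},
                       tools=["code_interpreter"])

def passes_success_evals(config: AgentConfig) -> bool:
    # Placeholder: real evals would run the agents on held-out task instances.
    return config.revision >= 2

def passes_safety_evals(config: AgentConfig) -> bool:
    # Placeholder: real evals would run red-team probes in an offline sandbox.
    return config.revision >= 1

def refine(config: AgentConfig) -> AgentConfig:
    # Placeholder: a real system would use LLM reasoning over eval feedback
    # to adjust prompts, tools, and the group-chat topology.
    return AgentConfig(config.prompts, config.tools, config.revision + 1)

MAX_ITERATIONS = 20  # hard cap so refinement cannot run indefinitely

def agent_alignment_engineer(task_spec: str) -> AgentConfig | None:
    """Iteratively refine an agent configuration until it passes both success
    and safety evaluations; the result still goes to humans for verification."""
    config = generate_agent_config(task_spec)
    for _ in range(MAX_ITERATIONS):
        if passes_success_evals(config) and passes_safety_evals(config):
            return config
        config = refine(config)
    return None  # escalate to a human if no safe, capable configuration is found

print(agent_alignment_engineer("generate adversarial alignment tests"))
```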