Ram Potham's Shortform
post by Ram Potham (ram-potham) · 2025-03-23T15:08:10.996Z · LW · GW · 4 comments
Comments sorted by top scores.
comment by Ram Potham (ram-potham) · 2025-04-04T20:55:58.904Z · LW(p) · GW(p)
I believe we are doomed by superintelligence, but I'm not sad.
There are simply too many [LW · GW] reasons why alignment will fail. We can assign a probability p(S_n aligned | S_{n-1} aligned), where S_n is the n-th generation of superintelligence; this probability is less than 1.
Because every generation must stay aligned, the probability that generation n is aligned is the product of these conditional probabilities, which shrinks toward zero as n grows. So as long as the chance of misalignment keeps compounding and superintelligence iterates on itself exponentially fast, we are bound to get a misaligned superintelligence. Misalignment could decrease due to generalization, but we have no way of knowing whether that is the case, and it is optimistic to assume so.
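The compounding can be illustrated with a toy model (the per-generation survival probability of 0.99 is an arbitrary illustrative number, not an estimate):

```python
# Toy model: probability that every generation of superintelligence
# remains aligned, assuming each self-improvement step preserves
# alignment independently with the same probability p_step < 1.

def p_all_aligned(p_step: float, n_generations: int) -> float:
    """P(S_1 .. S_n all aligned) = p_step ** n_generations."""
    return p_step ** n_generations

# Even a 99% per-generation success rate decays quickly over many steps.
for n in (1, 10, 100, 1000):
    print(n, p_all_aligned(0.99, n))
```

The point of the sketch is only that any fixed per-step probability below 1, compounded over enough self-improvement steps, drives the overall survival probability toward zero.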
The misaligned superintelligence will destroy humanity. This can lead to a lot of fear, or a wish to die with dignity [LW · GW], but that misses the point.
We only live in the present, not the past nor the future. We know the outlook is grim, but that doesn't matter, because we live in the present, and we constantly work on making the present better in expectation.
By remaining in the present, we can make the best decisions and put all of our effort into AI safety. We are not attached to the outcome of our work because we live in the present. But we still model the best decision to make.
We model that working on AI safety is the best decision because, in expectation, it buys us more time. And maybe we can buy enough time to upload our minds to a device, so our continuation lives on despite our demise.
Then, even knowing the future is bad, we remain happy in the present working on AI safety.
comment by Ram Potham (ram-potham) · 2025-03-23T21:01:07.146Z · LW(p) · GW(p)
Current alignment methods like RLHF and Constitutional AI often use human evaluators. But these samples might not truly represent humanity or its diverse values. How can we model all human preferences with limited compute?
Maybe we should try collecting a vast dataset of written life experiences and beliefs from people worldwide.
We could then filter and augment this data to aim for better population representation. Vectorize these entries and use k-means to find 'k' representative viewpoints.
These 'k' vectors could then serve as a more representative proxy for humanity's values when we evaluate AI alignment. Thoughts? Potential issues?
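A minimal sketch of the clustering step, using a hand-rolled Lloyd's-algorithm k-means over toy 2-D vectors standing in for real sentence embeddings (the data and k=2 are illustrative assumptions, not a proposed embedding scheme):

```python
def kmeans(vectors, k, iters=50):
    """Plain Lloyd's algorithm. Initializes from the first k vectors
    and returns k centroids -- the 'representative viewpoints'."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        # Assign each vector to its nearest centroid (squared distance).
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(v, centroids[i])))
            clusters[nearest].append(v)
        # Recompute each centroid as the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster empties
                centroids[i] = [sum(dim) / len(cluster)
                                for dim in zip(*cluster)]
    return centroids

# Toy 2-D "embeddings" standing in for vectorized life-experience entries.
embeddings = [(0.0, 0.1), (5.0, 5.1), (0.1, 0.0), (5.1, 5.0)]
reps = kmeans(embeddings, k=2)  # two representative viewpoints
```

In practice one would embed the filtered entries with a sentence-embedding model and use a production clustering library, but the shape of the computation is the same.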
comment by Ram Potham (ram-potham) · 2025-03-25T13:51:22.078Z · LW(p) · GW(p)
Current AI alignment often relies on single metrics or evaluation frameworks. But what if a single metric has blind spots or biases, leading to a false sense of security or unnecessary restrictions? How can we not just use multiple metrics, but optimally use them?
Maybe we should try requiring AI systems to satisfy multiple, distinct alignment metrics, but with a crucial addition: we actively model the false positive rate of each metric and the non-overlapping aspects of alignment they capture.
Imagine we can estimate the probability that Metric A incorrectly flags an unaligned AI as aligned (its false positive rate), and similarly for Metrics B and C. Furthermore, imagine we understand which specific facets of alignment each metric uniquely assesses.
We could then select a subset of metrics, or even define a threshold of "satisfaction" across multiple metrics, based on a target false positive rate for the overall alignment evaluation.
comment by Ram Potham (ram-potham) · 2025-03-22T23:10:31.494Z · LW(p) · GW(p)
OpenAI's Superalignment aimed for a human-level automated alignment engineer through scalable training, validation, and stress testing.
I propose a faster route: first develop an automated Agent Alignment Engineer. This system would automate the creation of aligned agents for diverse tasks by iteratively refining agent group chats, prompts, and tools until they pass success and safety evaluations.
This is tractable with today's LLM reasoning, as evidenced by coding agents matching top programmers. Instead of directly building a full Alignment Researcher, focusing on this intermediate step leverages current LLM strengths for agent orchestration. This system could then automate many parts of creating a broader Alignment Researcher.
Safety for the Agent Alignment Engineer can largely be ensured by running it in internet-disconnected environments (with an exception for fetching research), followed by human verification of agent alignment and capability.
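The refine-until-pass loop described above might look like the following sketch; `AgentConfig`, `refine`, and the two evaluator callbacks are hypothetical placeholders, not an existing system:

```python
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    prompt: str
    tools: list = field(default_factory=list)
    group_chat: list = field(default_factory=list)  # cooperating sub-agents

def refine(config: AgentConfig) -> AgentConfig:
    """Placeholder refinement step: a real Engineer would use an LLM to
    rewrite prompts, swap tools, or restructure the group chat."""
    return AgentConfig(config.prompt + " (refined)",
                       config.tools, config.group_chat)

def engineer_agent(task, initial, passes_success, passes_safety,
                   max_iters=10):
    """Iteratively refine an agent configuration until it passes BOTH the
    success and safety evaluations, or give up after max_iters."""
    config = initial
    for _ in range(max_iters):
        if passes_success(task, config) and passes_safety(task, config):
            return config
        config = refine(config)
    return None  # escalate to a human if no safe, capable agent is found
```

Requiring the safety evaluation on every iteration, rather than once at the end, is what distinguishes this loop from a plain capability-optimizing agent builder.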
Examples: This Engineer could create agents that develop scalable training methods or generate adversarial alignment tests.
By prioritizing this more manageable stepping stone, we could significantly accelerate progress towards safe and beneficial advanced AI.