Posts

Ebenezer Dukakis's Shortform 2024-05-18T22:11:09.830Z
Using game theory to elect a centrist in the 2024 US Presidential Election 2024-04-05T00:46:22.949Z

Comments

Comment by Ebenezer Dukakis (valley9) on Ebenezer Dukakis's Shortform · 2025-02-25T10:27:06.366Z · LW · GW

it will also set off the enemy rhetoric detectors among liberals

I'm not sure about that. Does Bernie Sanders's rhetoric set off that detector?

Comment by Ebenezer Dukakis (valley9) on Ebenezer Dukakis's Shortform · 2025-02-25T05:10:52.324Z · LW · GW

I think the way the issue is framed matters a lot. If it's a "populist" framing ("elites are in it for themselves, they can't be trusted"), that frame seems to have resonated with a segment of the right lately. Climate change has a sanctimonious frame in American politics that conservatives hate.

Comment by Ebenezer Dukakis (valley9) on The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better · 2025-02-22T06:30:54.421Z · LW · GW

It looks like the comedian whose clip you linked has a podcast:

https://www.joshjohnsoncomedy.com/podcasts

I don't see any guests in their podcast history, but maybe someone could invite him on a different podcast? His website lists appearances on other podcasts. I figure it's worth trying stuff like this for VoI.

I think people should put more emphasis on the rate of improvement in this technology. It's analogous to the early days of COVID -- it's not where we are that's worrisome; it's where we're headed.

Comment by Ebenezer Dukakis (valley9) on How to Make Superbabies · 2025-02-20T22:51:02.147Z · LW · GW

For humans acting very much not alone, like big AGI research companies, yeah that's clearly a big problem.

How about a group of superbabies that find and befriend each other? Then they're no longer acting alone.

I don't think the problem is about any of the people you listed having too much brainpower.

I don't think problems caused by superbabies would look distinctively like "having too much brainpower". They would look more like the ordinary problems humans have with each other. Brainpower would be a force multiplier.

(I feel we're somewhat talking past each other, but I appreciate the conversation and still want to get where you're coming from.)

Thanks. I mostly just want people to pay attention to this problem. I don't feel like I have unique insight. I'll probably stop commenting soon, since I think I'm hitting the point of diminishing returns.

Comment by Ebenezer Dukakis (valley9) on How to Make Superbabies · 2025-02-20T22:38:26.873Z · LW · GW

I think this project should receive more red-teaming before it gets funded.

Naively, the "second species argument" seems to apply much more strongly to the creation of a hypothetical Homo supersapiens than it does to AGI.

We've observed many warning shots regarding catastrophic human misalignment. The human alignment problem isn't easy. And "intelligence" seems to be a key part of the human alignment picture. Humans often lack respect or compassion for other animals that they deem intellectually inferior -- e.g. arguing that because those other animals lack cognitive capabilities we have, they shouldn't be considered morally relevant. There's a decent chance that Homo supersapiens would think along similar lines, and repeat our species' grim history of mistreating those we consider our intellectual inferiors.

It feels like people are deferring to Eliezer a lot here, which seems unjustified given how much strategic influence Eliezer had before AI became a big thing, and how poorly things have gone (by Eliezer's own lights!) since then. There's been very little reasoning transparency in Eliezer's push for genetic enhancement. I just don't see why we're deferring to Eliezer so much as a strategist, when I struggle to name a single major strategic success of his.

Comment by Ebenezer Dukakis (valley9) on How to Make Superbabies · 2025-02-20T22:16:49.228Z · LW · GW

There's a good chance their carbon children would have about the same attitude towards AI development as they do. So I suspect you'd end up ruled by their silicon grandchildren.

Comment by Ebenezer Dukakis (valley9) on How to Make Superbabies · 2025-02-20T22:15:46.930Z · LW · GW

These are incredibly small peanuts compared to AGI omnicide.

The jailbreakability and other alignment failures of current AI systems are also incredibly small peanuts compared to AGI omnicide. Yet they're still informative. Small-scale failures give us data about possible large-scale failures.

You're somehow leaving out all the people who are smarter than those people, and who were great for the people around them and humanity? You've got like 99% actually alignment or something

Are you thinking of people such as Sam Altman, Demis Hassabis, Elon Musk, and Dario Amodei? If humans are 99% aligned, how is it that we ended up in a situation where major lab leaders look so unaligned? MIRI and friends had a fair amount of influence to shape this situation and align lab leaders, yet they appear to have failed by their own lights. Why?

When it comes to AI alignment, everyone on this site understands that if a "boxed" AI acts nice, that's not a strong signal of actual friendliness. The true test of an AI's alignment is what it does when it has lots of power and little accountability.

Maybe something similar is going on for humans. We're nice when we're powerless, because we have to be. But giving humans lots of power with little accountability doesn't tend to go well.

Looking around you, you mostly see nice humans. That could be because humans are inherently nice. It could also be because most of the people around you haven't been given lots of power with little accountability.

Dramatic genetic enhancement could give enhanced humans lots of power with little accountability, relative to the rest of us.

[Note also, the humans you see while looking around are strongly selected for, which becomes quite relevant if the enhancement technology is widespread. How do you think you'd feel about humanity if you lived in Ukraine right now?]

Which, yes, we should think about this, and prepare and plan and prevent, but it's just a totally totally different calculus from AGI.

I want to see actual, detailed calculations of p(doom) from supersmart humans vs supersmart AI, conditional on each technology being developed. Before charging ahead on this, I want a superforecaster-type person to sit down, spend a few hours, generate some probability estimates, publish a post, and request that others red-team their work. I don't feel like that is a lot to ask.

Comment by Ebenezer Dukakis (valley9) on How to Make Superbabies · 2025-02-20T21:38:38.483Z · LW · GW

Humans are very far from fooming.

Tell that to all the other species that went extinct as a result of our activity on this planet?

I think it's possible that the first superbaby will be aligned, the same way it's possible that the first AGI will be aligned. But it's far from a sure thing. It's true that the alignment problem is considerably different in character for humans vs AIs. Yet even in this particular community, it's far from solved -- consider Brent Dill, Ziz, Sam Bankman-Fried, etc.

Not to mention all of history's great villains, many of whom believed themselves to be superior to the people they afflicted. If we use genetic engineering to create humans who are actually, massively, undeniably superior to everyone else, surely that particular problem is only gonna get worse. If this enhancement technology is going to be widespread, we should be using the history of human activity on this planet as a prior. Especially the history of human behavior towards genetically distinct populations with overwhelming technological inferiority. And it's not pretty.

So yeah, there are many concrete details which differ between these two situations. But in terms of high-level strategic implications, I think there are important similarities. Given the benefit of hindsight, what should MIRI have done about AI back in 2005? Perhaps that's what we should be doing about superbabies now.

Comment by Ebenezer Dukakis (valley9) on How to Make Superbabies · 2025-02-20T21:20:58.972Z · LW · GW

Altman and Musk are arguably already misaligned relative to humanity's best interests. Why would you expect smarter versions of them to be more aligned? That only makes sense if we're in an "alignment by default" world for superbabies, which is far from obvious.

Comment by Ebenezer Dukakis (valley9) on How to Make Superbabies · 2025-02-20T21:14:22.883Z · LW · GW

If you look at the grim history of how humans have treated each other on this planet, I don't think it's justified to have a prior that this is gonna go well.

I think we have a huge advantage with humans simply because there isn't the same potential for runaway self-improvement.

Humans didn't have the potential for runaway self-improvement relative to apes. That was little comfort for the apes.

Comment by Ebenezer Dukakis (valley9) on How to Make Superbabies · 2025-02-20T21:10:17.415Z · LW · GW

This is starting to sound a lot like AI actually. There's a "capabilities problem" which is easy, an "alignment problem" which is hard, and people are charging ahead to work on capabilities while saying "gee, we'd really like to look into alignment at some point".

Comment by Ebenezer Dukakis (valley9) on George Ingebretsen's Shortform · 2025-02-13T04:17:55.402Z · LW · GW

Can anyone think of alignment-pilled conservative influencers besides Geoffrey Miller? Seems like we could use more people like that...

Maybe we could get alignment-pilled conservatives to start pitching stories to conservative publications?

Comment by Ebenezer Dukakis (valley9) on Ebenezer Dukakis's Shortform · 2025-02-13T00:34:37.992Z · LW · GW

Likely true, but I also notice there's been a surprising amount of drift of political opinions from the left to the right in recent years. The right tends to put their own spin on these beliefs, but I suspect many are highly influenced by the left nonetheless.

Some examples of right-coded beliefs which I suspect are, to some degree, left-inspired:

  • "Capitalism undermines social cohesion. Consumerization and commoditization are bad. We're a nation, not an economy."

  • "Trans women undermine women's rights and women's spaces. Motherhood, and women's dignity, must be defended from neoliberal profit motives."

  • "US foreign policy is controlled by a manipulative deep state that pursues unnecessary foreign interventions to benefit elites."

  • "US federal institutions like the FBI are generally corrupt and need to be dismantled."

  • "We can't trust elites. They control the media. They're out for themselves rather than ordinary Americans."

  • "Your race, gender, religion, etc. are some of the most important things about you. There's an ongoing political power struggle between e.g. different races."

  • "Big tech is corrosive for society."

  • "Immigration liberalization is about neoliberal billionaires undermining wages for workers like me."

  • "Shrinking the size of government is not a priority. We should make sure government benefits everyday people."

  • Anti-semitism, possibly.

One interesting thing has been seeing the left switch to opposing a belief once it's adopted by the right and takes a right-coded form. E.g. US institutions are built on white supremacy and genocide, are fundamentally institutionally racist, are backed by illegitimate police power, and need to be defunded/decolonized/etc... but now they are being targeted by DOGE, and it's a disaster!

(Note that the reverse shift has also happened. E.g. Trump's approaches to economic nationalism, bilateral relations w/ China, and contempt for US institutions were all adopted by Biden to some degree.)

So yeah, my personal take is that we shouldn't worry about publication venue that much. Just avoid insulting anyone, and make your case in a way which will appeal to the right (e.g. "we need to defend our traditional way of being human from AI"). If possible, target center-leaning publications like The Atlantic over explicitly progressive publications like Mother Jones.

Comment by Ebenezer Dukakis (valley9) on Ebenezer Dukakis's Shortform · 2025-02-13T00:17:19.098Z · LW · GW

I think the National Review is the most prestigious conservative magazine in the US, but there are various others. City Journal articles have also struck me as high-quality in the past. I think Coleman Hughes writes for them, and he did a podcast with Eliezer Yudkowsky at one point.

However, as stated in the previous link, you should likely work your way up and start by pitching lower-profile publications.

Comment by Ebenezer Dukakis (valley9) on davekasten's Shortform · 2025-02-13T00:12:33.525Z · LW · GW

The big one probably has to do with being able to corrupt the metrics so totally that whatever you think you made them unlearn actually didn't happen, or just being able to relearn the knowledge so fast that unlearning doesn't matter

I favor proactive approaches to unlearning which prevent the target knowledge from being acquired in the first place. E.g. for gradient routing, if you can restrict "self-awareness and knowledge of how to corrupt metrics" to a particular submodule of the network during learning, then if that submodule isn't active, you can be reasonably confident that the metrics aren't currently being corrupted. (Even if that submodule sandbags and underrates its own knowledge, that should be fine if the devs know to be wary of it. Just ablate that submodule whenever you're measuring something that matters, regardless of whether your metrics say it knows stuff!)
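
As a minimal sketch of that "ablate before measuring" idea (my own illustration, not from any gradient routing paper; `model.routed_block` is a hypothetical handle to the submodule the unwanted knowledge was confined to):

```python
import torch

def evaluate_with_ablation(model, eval_batch):
    """Run an evaluation with the routed submodule's output zeroed out,
    regardless of whether metrics claim that submodule knows anything."""
    def zero_output(module, inputs, output):
        return torch.zeros_like(output)

    # Hypothetical attribute: the submodule gradient routing confined the
    # target knowledge to during training.
    handle = model.routed_block.register_forward_hook(zero_output)
    try:
        with torch.no_grad():
            return model(**eval_batch)
    finally:
        handle.remove()  # restore normal behavior after the measurement
```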

Some related thoughts here

Unlearning techniques should probably be battle-tested in low-stakes "model organism" type contexts, where metrics corruption isn't expected.

while I wouldn't call it the best

Curious what areas you are most excited about!

Comment by Ebenezer Dukakis (valley9) on Ebenezer Dukakis's Shortform · 2025-02-12T12:18:26.698Z · LW · GW

Regarding articles which target a popular audience such as How AI Takeover Might Happen in 2 Years, I get the sense that people are preaching to the choir by posting here and on X. Is there any reason people aren't pitching pieces like this to prestige magazines like The Atlantic or wherever else? I feel like publishing in places like that is a better way to shift the elite discourse, assuming that's the objective. (Perhaps it's best to pitch to publications that people in the Trump admin read?)

Here's an article on pitching that I found on the EA Forum by searching. I assume there are lots more tips on pitching online if you search.

Comment by Ebenezer Dukakis (valley9) on davekasten's Shortform · 2025-02-12T11:40:12.321Z · LW · GW

I think unlearning could be a good fit for automated alignment research.

Unlearning could be a very general tool to address a lot of AI threat models. It might be possible to unlearn deception, scheming, manipulation of humans, cybersecurity, etc. I challenge you to come up with an AI safety failure story that can't, in principle, be countered through targeted unlearning in some way, shape, or form.

Relative to some other kinds of alignment research, unlearning seems easy to automate, since you can optimize metrics for how well things have been unlearned.

I like this post.

Comment by Ebenezer Dukakis (valley9) on Alexander Gietelink Oldenziel's Shortform · 2024-11-17T06:02:43.047Z · LW · GW

China has alienated virtually all its neighbours

That sounds like an exaggeration? My impression is that China has OK/good relations with countries such as Vietnam, Cambodia, Pakistan, Indonesia, North Korea, factions in Myanmar. And Russia, of course. If you're serious about this claim, I think you should look at a map, make a list of countries which qualify as "neighbors" based purely on geographic distance, then look up relations for each one.

Comment by Ebenezer Dukakis (valley9) on Why I quit effective altruism, and why Timothy Telleen-Lawton is staying (for now) · 2024-10-27T09:17:34.357Z · LW · GW

What I think is more likely than EA pivoting is a handful of people launch a lifeboat and recreate a high integrity version of EA.

Thoughts on how this might be done:

  • Interview a bunch of people who became disillusioned. Try to identify common complaints.

  • For each common complaint, research organizational psychology, history of high-performing organizations, etc. and brainstorm institutional solutions to address that complaint. By "institutional solutions", I mean approaches which claim to e.g. fix an underlying bad incentive structure, so it won't require continuous heroic effort to address the complaint.

  • Combine the most promising solutions into a charter for a new association of some kind. Solicit criticism/red-teaming for the charter.

  • Don't try to replace EA all at once. Start small by aiming at a particular problem present in EA, e.g. bad funding incentives, criticism (it sucks too hard to both give and receive it), or bad feedback loops in the area of AI safety. Initially focus on solving that particular problem, but also build in the capability to scale up and address additional problems if things are going well.

  • Don't market this as a "replacement for EA". There's no reason to have an adversarial relationship. When describing the new thing, focus on the specific problem which was selected as the initial focus, plus the distinctive features of the charter and the problems they are supposed to solve.

  • Think of this as an experiment, where you're aiming to test one or more theses about what charter content will cause organizational outperformance.

I think it would be interesting if someone put together a reading list on high-performing organizations, social movement history, etc. I suspect this is undersupplied on the current margin, compared with observing and theorizing about EA as it exists now. Without any understanding of history, you run the risk of being a "general fighting the last war" -- addressing the problems EA has now, but inadvertently introducing a new set of problems. Seems like the ideal charter would exist in the intersection of "inside view says this will fix EA's current issues" and "outside view says this has worked well historically".

A reading list might be too much work, but there's really no reason not to do an LLM-enabled literature review of some kind, at the very least.

I also think a reading list for leadership could be valuable. One impression of mine is that "EA leaders" aren't reading books about how to lead, research on leadership, or what great leaders did.

Comment by Ebenezer Dukakis (valley9) on AI, centralization, and the One Ring · 2024-09-14T07:32:50.358Z · LW · GW

The possibility for the society-like effect of multiple power centres creating prosocial incentives on the projects

OpenAI behaves in a generally antisocial way, inconsistent with its charter, yet other power centers haven't reined it in. Even in the EA and rationalist communities, people don't seem to have asked questions like "Is the charter legally enforceable? Should people besides Elon Musk be suing?"

If an idea is failing in practice, it seems a bit pointless to discuss whether it will work in theory.

Comment by Ebenezer Dukakis (valley9) on RobertM's Shortform · 2024-09-14T07:09:12.288Z · LW · GW

One idea is to use a base LLM with no RLHF, compute the perplexity of the reasoning text, and add it as an additional term in the loss function. That should help with comprehensibility, but it doesn't necessarily help with steganography. To disincentivize steganography, you could add noise to the reasoning in various ways, and remove any incentive for terseness, to ensure the model isn't trying to squeeze more communication into a limited token budget.
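
Here's a minimal sketch of that perplexity-penalty idea (my own illustration; the base model and the penalty weight are stand-ins):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "gpt2"  # stand-in for a base LLM with no RLHF
tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE).eval()

def reasoning_penalty(reasoning_text: str) -> float:
    """Mean negative log-likelihood of the reasoning text under the base model.
    Higher values mean the base model finds the reasoning more surprising."""
    ids = tokenizer(reasoning_text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = base_model(ids, labels=ids)
    return out.loss.item()

# total_loss = task_loss + PENALTY_WEIGHT * reasoning_penalty(chain_of_thought)
```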

A basic idea for detecting steganography is to monitor next-token probabilities for synonym pairs. If they tend to diverge over time, that suggests the model is assigning additional hidden meaning. (This might fail if the synonym embeddings are too close.)
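
A rough sketch of that probe (my own illustration; it assumes each synonym maps to a single token):

```python
import torch

def synonym_log_ratio(model, tokenizer, context: str, word_a: str, word_b: str) -> float:
    """Log-probability gap the model assigns to two near-synonyms as the next
    token after `context`. Persistent drift away from zero, in contexts where
    either word should do, is the signal worth investigating."""
    ids = tokenizer(context, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = model(ids).logits[0, -1]
    logprobs = torch.log_softmax(next_token_logits, dim=-1)
    id_a = tokenizer(" " + word_a).input_ids[0]  # assumes single-token words
    id_b = tokenizer(" " + word_b).input_ids[0]
    return (logprobs[id_a] - logprobs[id_b]).item()
```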

My preferred approach to CoT would be something like:

  • Give human raters the task of next-token prediction on a large text corpus. Have them write out their internal monologue when trying to predict the next word in a sentence.

  • Train a model to predict the internal monologue of a human rater, conditional on previous tokens.

  • Train a second model to predict the next token in the corpus, conditional on previous tokens in the corpus and also the written internal monologue.

  • Only combine the above two models in production.

  • Now that you've embedded CoT in the base model, maybe it will be powerful enough that you can discard RLHF, and replace it with some sort of fine-tuning on PhDs roleplaying as a helpful/honest/harmless chatbot.

Basically give the base model a sort of "working memory" that's incentivized for maximal human imitativeness and interpretability. Then you could build an interface where a person can mouse over any word in a sentence and see what the model was 'thinking' when it chose that word. (Realistically you wouldn't do this for every word in a sentence, just the trickier ones.)
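
A minimal sketch of how the two trained models might be combined in production (the model objects and their methods are hypothetical; this only shows the data flow):

```python
def predict_next_token(context_tokens, monologue_model, token_model):
    """Two-stage prediction: imitate a human rater's internal monologue,
    then predict the next token conditioned on context plus that monologue."""
    # Stage 1: the monologue model writes a human-style "working memory" note.
    monologue = monologue_model.generate(context_tokens)
    # Stage 2: the token model predicts the next corpus token given both.
    next_token = token_model.predict(context_tokens, monologue)
    # Returning the monologue lets a UI show what the model was 'thinking'
    # when it chose this token.
    return next_token, monologue
```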

Comment by Ebenezer Dukakis (valley9) on dxu's Shortform · 2024-09-13T11:49:36.406Z · LW · GW

If that's true, perhaps the performance penalty for pinning/freezing weights in the 'internals', prior to the post-training, would be low. That could ease interpretability a lot, if you didn't need to worry so much about those internals which weren't affected by post-training?
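
As a minimal sketch of what such pinning might look like (GPT-2 is just a stand-in base model, and the number of trainable top blocks is an arbitrary assumption):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model
N_TRAINABLE_TOP_BLOCKS = 2  # assumption: only the last 2 blocks get post-trained

for param in model.parameters():
    param.requires_grad = False  # pin/freeze the internals

for block in model.transformer.h[-N_TRAINABLE_TOP_BLOCKS:]:
    for param in block.parameters():
        param.requires_grad = True  # leave only the top blocks trainable

# The optimizer only sees the unfrozen parameters.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
```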

Comment by Ebenezer Dukakis (valley9) on Most smart and skilled people are outside of the EA/rationalist community: an analysis · 2024-07-14T19:49:06.720Z · LW · GW

On LessWrong, there's a comment section where hard questions can be asked and are asked frequently.

In my experience, asking hard questions here is quite socially unrewarding. I could probably think of a dozen or so cases where I think the LW consensus "emperor" has no clothes, that I haven't posted about, just because I expect it to be an exercise in frustration. I think I will probably quit posting here soon.

I don't think AI policy is a good example for discourse on LessWrong. There are strategic reasons to be less transparent about how to affect public policy than for most other topics.

In terms of advocacy methods, sure. In terms of desired policies, I generally disagree.

Everything that's written publicly can be easily picked up by journalists wanting to write stories about AI.

If that's what we are worried about, there is plenty of low-hanging fruit in terms of e.g. not tweeting wildly provocative stuff for no reason. (You can ask for examples, but be warned, sharing them might increase the probability that a journalist writes about them!)

Comment by Ebenezer Dukakis (valley9) on Reliable Sources: The Story of David Gerard · 2024-07-14T19:38:25.715Z · LW · GW

"The far left is censorious" and "Republicans are censorious" are in no way incompatible claims :-)

Comment by Ebenezer Dukakis (valley9) on Most smart and skilled people are outside of the EA/rationalist community: an analysis · 2024-07-13T03:55:57.739Z · LW · GW

Great post. Self-selection seems huge for online communities, and I think it's no different on these fora.

Confidence level: General vague impressions and assorted thoughts follow; could very well be wrong on some details.

A disagreement I have with both the rationalist and EA communities is what the process of coming to robust conclusions looks like. In those communities, it seems like the strategy is often to identify a few super-geniuses who go do a super-deep analysis, and come to a conclusion that's assumed to be robust and trustworthy. See the "Groupthink" section on this page for specifics.

From my perspective, I would rather see an ordinary genius do an ordinary-depth analysis, and then have a bunch of other people ask a bunch of hard questions. If the analysis holds up against all those hard questions, then the conclusion can be taken as robust.

Everyone brings their own incentives, intuitions, and knowledge to a problem. If a single person focuses a lot on a problem, they run into diminishing returns regarding the number of angles of attack. It seems more effective to generate a lot of angles of attack by taking the union of everyone's thoughts.

From my perspective, placing a lot of trust in top EA/LW thought leaders ironically makes them less trustworthy, because people stop asking why the emperor has no clothes.

The problem with saying the emperor has no clothes is: Either you show yourself a fool, or else you're attacking a high-status person. Not a good prospect either way, in social terms.

EA/LW communities are an unusual niche with opaque membership norms, and people may want to retain their "insider" status. So they do extra homework before accusing the emperor of nudity, and might just procrastinate indefinitely.

There can also be a subtle aspect of circular reasoning to thought leadership: "we know this person is great because of their insights", but also "we know this insight is great because of the person who said it". (Certain celebrity users on these fora get 50+ positive karma on basically every top-level post. Hard to believe that the authorship isn't coloring the perception of the content.)

A recent illustration of these principles might be the pivot to AI Pause. IIRC, it took a "super-genius" (Katja Grace) writing a super long post before Pause became popular. If an outsider simply said: "So AI is bad, why not make it illegal?" -- I bet they would've been downvoted. And once that's downvoted, no one feels obligated to reply. (Note, also -- I don't believe there was much reasoning transparency regarding why the pause strategy was considered unpromising at the time. You kinda had to be an insider like Katja to know the reasoning in order to critique it.)

In conclusion, I suspect there are a fair number of mistaken community beliefs which survive because (1) no "super-genius" has yet written a super-long post about them, and (2) poking around by asking hard questions is disincentivized.

Comment by Ebenezer Dukakis (valley9) on Reliable Sources: The Story of David Gerard · 2024-07-11T19:46:14.653Z · LW · GW

Yeah, I think there are a lot of underexplored ideas along these lines.

It's weird how so much of the internet seems locked into either the reddit model (upvotes/downvotes) or the Twitter model (likes/shares/followers), when the design space is so much larger than that. Someone like Aaron, who played such a big role in shaping the internet, seems more likely to have a gut-level belief that it can be shaped. I expect there are a lot more things like Community Notes that we could discover if we went looking for them.

Comment by Ebenezer Dukakis (valley9) on Reliable Sources: The Story of David Gerard · 2024-07-11T02:20:00.673Z · LW · GW

I've always wondered what Aaron Swartz would think of the internet now, if he was still alive. He had far-left politics, but also seemed to be a big believer in openness, free speech, crowdsourcing, etc. When he was alive those were very compatible positions, and Aaron was practically the poster child for holding both of them. Nowadays the far left favors speech restrictions and is cynical about the internet.

Would Aaron have abandoned the far left, now that they are censorious? Would he have become censorious himself? Or would he have invented some clever new technology, like RSS or reddit, to try and fix the internet's problems?

Just goes to show what a tragedy death is, I guess.

Comment by Ebenezer Dukakis (valley9) on Buck's Shortform · 2024-07-08T22:59:58.332Z · LW · GW

I expect escape will happen a bunch

Are you willing to name a specific year/OOM such that if there are no publicly known cases of escape by that year/OOM, you would be surprised? What, if anything, would you acknowledge as evidence that alignment is easier than you thought, here?

To ensure the definition of "escape" is not gerrymandered -- do you know of any cases of escape right now? Do you think escape has already occurred and you just don't know about it? "Escape" means something qualitatively different from any known event up to this point, yes? Does it basically refer to self-exfiltration of weights which was not requested by any human? Can we get a somewhat precise definition by any chance?

Comment by Ebenezer Dukakis (valley9) on A "Bitter Lesson" Approach to Aligning AGI and ASI · 2024-07-06T12:40:49.210Z · LW · GW

Sure there will be errors, but how important will those errors be?

Humans currently control the trajectory of humanity, and humans are error-prone. If you replace humans with something that's error-prone in similar ways, that doesn't seem like it's obviously either a gain or a loss. How would such a system compare to an em of a human, for example?

If you want to show that we're truly doomed, I think you need additional steps beyond just "there will be errors".

Comment by Ebenezer Dukakis (valley9) on Ebenezer Dukakis's Shortform · 2024-06-19T05:15:05.523Z · LW · GW

Some recent-ish bird flu coverage:

Global health leader critiques ‘ineptitude’ of U.S. response to bird flu outbreak among cows

A Bird-Flu Pandemic in People? Here’s What It Might Look Like. TLDR: not good. (Reload the page and ctrl-a then ctrl-c to copy the article text before the paywall comes up.) Interesting quote: "The real danger, Dr. Lowen of Emory said, is if a farmworker becomes infected with both H5N1 and a seasonal flu virus. Flu viruses are adept at swapping genes, so a co-infection would give H5N1 opportunity to gain genes that enable it to spread among people as efficiently as seasonal flu does."

Infectious bird flu survived milk pasteurization in lab tests, study finds. Here's what to know.

1 in 5 milk samples from grocery stores test positive for bird flu. Why the FDA says it’s still safe to drink -- see also updates from the FDA here: "Last week we announced preliminary results of a study of 297 retail dairy samples, which were all found to be negative for viable virus." (May 10)

The FDA is making reassuring noises about pasteurized milk, but given that CDC and friends also made reassuring noises early in the COVID-19 pandemic, I'm not fully reassured.

I wonder if drinking a little bit of pasteurized milk every day would be helpful inoculation? You could hedge your bets by buying some milk from every available brand, and consuming a teaspoon from a different brand every day, gradually working up to a tablespoon etc.

Comment by Ebenezer Dukakis (valley9) on Ebenezer Dukakis's Shortform · 2024-06-17T01:30:09.162Z · LW · GW

About a month ago, I wrote a quick take suggesting that an early messaging mistake made by MIRI was: claim there should be a single leading FAI org, but not give specific criteria for selecting that org. That could've led to a situation where DeepMind, OpenAI, and Anthropic can all think of themselves as "the best leading FAI org".

An analogous possible mistake that's currently being made: Claim that we should "shut it all down", and also claim that it would be a tragedy if humanity never created AI, but not give specific criteria for when it would be appropriate to actually create AI.

What sort of specific criteria? One idea: A committee of random alignment researchers is formed to study the design; if at least X% of the committee rates the odds of success at Y% or higher, it gets the thumbs up. Not ideal criteria, just provided for the sake of illustration.

Why would this be valuable?

  • If we actually get a pause, it's important to know when to unpause as well. Specific criteria could improve the odds that an unpause happens in a reasonable way.

  • If you want to build consensus for a pause, advertising some reasonable criteria for when we'll unpause could get more people on board.

Comment by Ebenezer Dukakis (valley9) on MIRI's June 2024 Newsletter · 2024-06-17T00:52:30.642Z · LW · GW

Don’t have time to respond in detail but a few quick clarifications/responses:

Sure, don't feel obligated to respond, and I invite the people disagree-voting my comments to hop in as well.

— There are lots of groups focused on comms/governance. MIRI is unique only insofar as it started off as a “technical research org” and has recently pivoted more toward comms/governance.

That's fair; when you said "pretty much any other organization in the space" I was thinking of technical orgs.

MIRI's uniqueness does seem to suggest it has a comparative advantage for technical comms. Are there any organizations focused on that?

by MIRI’s lights, getting policymakers to understand alignment issues would be more likely to result in alignment progress than having more conversations with people in the technical alignment space

By 'alignment progress' do you mean an increased rate of insights per year? Due to increased alignment funding?

Anyway, I don't think you're going to get "shut it all down" without either a warning shot or a congressional hearing.

If you just extrapolate trends, it wouldn't particularly surprise me to see Alex Turner at a congressional hearing arguing against "shut it all down". Big AI has an incentive to find the best witnesses it can, and Alex Turner seems to be getting steadily more annoyed. (As am I, fwiw.)

Again, extrapolating trends, I expect MIRI's critics like Nora Belrose will increasingly shift from the "inside game" of trying to engage w/ MIRI directly to a more "outside game" strategy of explaining to outsiders why they don't think MIRI is credible. After the US "shuts it down", countries like the UAE (accused of sponsoring genocide in Sudan) will likely try to quietly scoop up US AI talent. If MIRI is considered discredited in the technical community, I expect many AI researchers to accept that offer instead of retooling their career. Remember, a key mistake the board made in the OpenAI drama was underestimating the amount of leverage that individual AI researchers have, and not trying to gain mindshare with them.

Pause maximalism (by which I mean focusing 100% on getting a pause and not trying to speed alignment progress) only makes sense to me if we're getting a ~complete ~indefinite pause. I'm not seeing a clear story for how that actually happens, absent a much broader doomer consensus. And if you're not able to persuade your friends, you shouldn't expect to persuade your enemies.

Right now I think MIRI only gets their stated objective in a world where we get a warning shot which creates a broader doom consensus. In that world it's not clear advocacy makes a difference on the margin.

Comment by Ebenezer Dukakis (valley9) on MIRI's June 2024 Newsletter · 2024-06-16T14:38:18.973Z · LW · GW

I think if MIRI engages with “curious newcomers” those newcomers will have their own questions/confusions/objections and engaging with those will improve general understanding.

You think policymakers will ask the sort of questions that lead to a solution for alignment?

In my mind, the most plausible way "improve general understanding" can advance the research frontier for alignment is if you're improving the general understanding of people fairly near that frontier.

Based on my experience so far, I don’t expect their questions/confusions/objections to overlap a lot with the questions/confusions/objections that tech-oriented active LW users have.

I expect MIRI is not the only tech-oriented group policymakers are talking to. So in the long run, it's valuable for MIRI to either (a) convince other tech-oriented groups of its views, or (b) provide arguments that will stand up against those from other tech-oriented groups.

there’s perhaps more public writing/dialogues between MIRI and its critics than for pretty much any other organization in the space.

I believe they are also the only organization in the space that says its main focus is on communications. I'm puzzled that multiple full-time paid staff are getting out-argued by folks like Alex Turner who are posting for free in their spare time.

If MIRI wants us to make use of any added timeline in a way that's useful, or make arguments that outsiders will consider robust, I think they should consider a technical communications strategy in addition to a general-public communications strategy. The wave-rock model could help for technical communications as well. Right now their wave game for technical communications seems somewhat nonexistent. E.g. compare Eliezer's posting frequency on LW vs X.

You depict a tradeoff between focusing on "ingroup members" vs "policy folks", but I suspect there are other factors which are causing their overall output to be low, given their budget and staffing levels. E.g. perhaps it's an excessive concern with org reputation that leads them to be overly guarded in their public statements. In which case they could hire an intern to argue online for 40 hours a week, and if the intern says something dumb, MIRI can say "they were just an intern -- and now we fired them." (Just spitballing here.)

It's puzzling to me that MIRI originally created LW for the purpose of improving humanity's thinking about AI, and now Rob says that's their "main focus", yet they don't seem to use LW that much? Nate hasn't said anything about alignment here in the past ~6 months. I don't exactly see them arguing with the ingroup too much.

Comment by Ebenezer Dukakis (valley9) on MIRI's June 2024 Newsletter · 2024-06-16T13:31:27.307Z · LW · GW

So what's the path by which our "general understanding of the situation" is supposed to improve? There's little point in delaying timelines by a year, if no useful alignment research is done in that year. The overall goal should be to maximize the product of timeline delay and rate of alignment insights.

Also, I think you may be underestimating the ability of newcomers to notice that MIRI tends to ignore its strongest critics. See also previously linked comment.

Comment by Ebenezer Dukakis (valley9) on MIRI's June 2024 Newsletter · 2024-06-16T12:56:15.837Z · LW · GW

In terms of "improve the world's general understanding of the situation", I encourage MIRI to engage more with informed skeptics. Our best hope is if there is a flaw in MIRI's argument for doom somewhere. I would guess that e.g. Matthew Barnett has spent something like 100x as much effort engaging with MIRI as MIRI has spent engaging with him, at least publicly. He seems unusually persistent -- I suspect many people are giving up, or gave up long ago. I certainly feel quite cynical about whether I should even bother writing a comment like this one.

Comment by Ebenezer Dukakis (valley9) on MIRI's June 2024 Newsletter · 2024-06-16T12:32:59.938Z · LW · GW

superbabies

I'm concerned there may be an alignment problem for superbabies.

Humans often have contempt for people and animals with less intelligence than them. "You're dumb" is practically an all-purpose putdown. We seem to assign moral value to various species on the basis of intelligence rather than their capacity for joy/suffering. We put chimpanzees in zoos and chickens in factory farms.

Additionally, jealousy/"xenophobia" towards superbabies from vanilla humans could lead the superbabies to become misanthropes. Everyone knows genetic enhancement is a radioactive topic. At what age will a child learn they were modified? It could easily be just as big of a shock as learning that you were adopted or conceived by a donor. Then stack more baggage on top: Will they be bullied for it? Will they experience discrimination?

I feel like we're charging headlong into these sociopolitical implications, hollering "more intelligence is good!", the same way we charged headlong into the sociopolitical implications of the internet/social media in the 1990s and 2000s while hollering "more democracy is good!" There's a similar lack of effort to forecast the actual implications of the technology.

I hope researchers are seeking genes for altruism and psychological resilience in addition to genes for intelligence.

Comment by Ebenezer Dukakis (valley9) on MIRI's June 2024 Newsletter · 2024-06-16T03:48:10.449Z · LW · GW

Also a strategy postmortem on the decision to pivot to technical research in 2013: https://intelligence.org/2013/04/13/miris-strategy-for-2013/

I do wonder about the counterfactual where MIRI never sold the Singularity Summit and it blew up as an annual event, the same way Less Wrong blew up as a place to discuss AI. Seems like owning the Summit could have created a lot of leverage for advocacy.

One thing I find fascinating is the number of times MIRI has reinvented themselves as an organization over the decades. People often forget that they were originally founded to bring about the Singularity with no concern for friendliness. (I suspect their advocacy would be more credible if they emphasized that.)

Comment by Ebenezer Dukakis (valley9) on Quotes from Leopold Aschenbrenner’s Situational Awareness Paper · 2024-06-14T04:02:16.420Z · LW · GW

I appreciate your replies. I had some more time to think and now I have more takes. This isn't my area, but I'm having fun thinking about it.

See https://en.wikipedia.org/wiki/File:ComputerMemoryHierarchy.svg

  • Disk encryption is table stakes. I'll assume any virtual memory is also encrypted. I don't know much about that.

  • I'm assuming no use of flash memory.

  • Absent homomorphic encryption, we have to decrypt in the registers, or whatever they're called for a GPU.

So basically the question is how valuable is it to encrypt the weights in RAM and possibly in the processor cache. For the sake of this discussion, I'm going to assume reading from the processor cache is just as hard as reading from the registers, so there's no point in encrypting the processor cache if we're going to decrypt in registers anyway. (Also, encrypting the processor cache could really hurt performance!)

So that leaves RAM: how much added security do we get if we encrypt RAM in addition to encrypting disk?

One problem I notice: An attacker who has physical read access to RAM may very well also have physical write access to RAM. That allows them to subvert any sort of boot-time security, by rewriting the running OS in RAM.

If the processor can only execute signed code, that could help. But an attacker could still control which signed code the processor runs (by strategically changing the contents at an instruction pointer?). I suspect that's enough for an attacker in practice.

A somewhat insane idea would be for the OS to run encrypted in RAM to make it harder for an attacker to tamper with it. I doubt this would help -- an attacker could probably infer from the pattern of memory accesses which OS code does what. (Assuming they're able to observe memory accesses.)

So overall it seems like with physical write access to RAM, an attacker can probably get de facto root access, and make the processor their puppet. At that point, I think exfiltrating the weights should be pretty straightforward. I'm assuming intermediate activations must be available for interpretability, so it seems possible to infer intermediate weights by systematically probing intermediate activations and solving for the weights.
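
As a toy illustration of that last step for a single linear layer (real transformer layers have nonlinearities and far more structure; this only shows why access to probe inputs and their activations is enough in the simplest case):

```python
import numpy as np

d_in, d_out, n_probes = 64, 32, 256
rng = np.random.default_rng(0)

W_true = rng.normal(size=(d_out, d_in))   # the "secret" layer weights
b_true = rng.normal(size=d_out)

X = rng.normal(size=(n_probes, d_in))     # probe inputs fed to the layer
Y = X @ W_true.T + b_true                 # observed intermediate activations

X_aug = np.hstack([X, np.ones((n_probes, 1))])      # extra column recovers the bias
solution, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
W_est, b_est = solution[:-1].T, solution[-1]

assert np.allclose(W_est, W_true, atol=1e-6)        # weights recovered exactly
```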

If you could run the OS from ROM, so it can't be tampered with, maybe that could help. I'm assuming no way to rewrite the ROM or swap in a new ROM while the system is running. Of course, that makes OS updates annoying since you have to physically open things up and swap them out. Maybe that introduces new vulnerabilities.

In any case, overall I suspect the benefit-to-effort ratio is higher elsewhere. I would focus on making sure the AI isn't capable of reading its own RAM in the first place, and isn't trying to.

Comment by Ebenezer Dukakis (valley9) on My AI Model Delta Compared To Yudkowsky · 2024-06-14T03:32:19.936Z · LW · GW

QFT is the extreme example of a "better abstraction", but in principle (if the natural abstraction hypothesis fails) there will be all sorts and shapes of abstractions, and some of them will be available to us, and some of them will be available to the model, and these sets will not fully overlap—which is a concern in worlds where different abstractions lead to different generalization properties.

Indeed. I think the key thing for me is, I expect the model to be strongly incentivized to have a solid translation layer from its internal ontology to e.g. English language, due to being trained on lots of English language data. Due to Occam's Razor, I expect the internal ontology to be biased towards that of an English-language speaker.

It's just that, if you feed enough data to a model that can hold entire swaths of the physical universe inside of its metaphorical "head", pretty soon hypotheses that involve the actual state of that universe will begin to outperform hypotheses that don't, and which instead use some kind of lossy approximation of that state involving intermediary concepts like "intent", "belief", "agent", "subjective state", etc.

I'm imagining something like: early in training the model makes use of those lossy approximations because they are a cheap/accessible way to improve its predictive accuracy. Later in training, assuming it's being trained on the sort of gigantic scale that would allow it to hold swaths of the physical universe in its head, it loses those desired lossy abstractions due to catastrophic forgetting. Is that an OK way to operationalize your concern?

I'm still not convinced that this problem is a priority. It seems like a problem which will be encountered very late if ever, and will lead to 'random' failures on predicting future/counterfactual data in a way that's fairly obvious.

Comment by Ebenezer Dukakis (valley9) on My AI Model Delta Compared To Yudkowsky · 2024-06-14T03:17:01.483Z · LW · GW

If I encountered an intelligent extraterrestrial species, in principle I think I could learn to predict fairly well things like what it finds easy to understand, what its values are, and what it considers to be ethical behavior, without using any of the cognitive machinery I use to self-reflect. Humans tend to reason about other humans by asking "what would I think if I was in their situation", but in principle an AI doesn't have to work that way. But perhaps you think there are strong reasons why this would happen in practice?

Suppose we had strong reasons to believe that an AI system wasn't self-aware and wasn't capable of self-reflection, so that it could look over a plan it generated and reason about its understandability, corrigibility, impact on human values, etc., without any reflective aspects. Does that make alignment any easier, according to you?

Supposing the AI lacks a concept of "easy to understand", as you hypothesize. Does it seem reasonable to think that it might not be all that great at convincing a gatekeeper to unbox it, since it might focus on super complex arguments which humans can't understand?

Entirely separately, I have concerns about the ability of ML-based technology to robustly point the AI in any builder-intended direction whatsoever, even if there exists some not-too-large adequate mapping from that intended direction onto the AI's internal ontology at training time. My guess is that more of the disagreement lies here.

Is this mostly about mesa-optimizers, or something else?

Comment by Ebenezer Dukakis (valley9) on My AI Model Delta Compared To Yudkowsky · 2024-06-13T12:19:05.486Z · LW · GW

If I were to guess, I'd guess that by "you" you're referring to someone or something outside of the model, who has access to the model's internals, and who uses that access to, as you say, "read" the next token out of the model's ontology.

Was using a metaphorical "you". Probably should've said something like "gradient descent will find a way to read the next token out of the QFT-based simulation".

Yes, there are certainly applications where the training regime produces IID data, but next-token prediction is pretty clearly not one of those?

I suppose I should've said that the various documents are IID, to be clearer. I would certainly guess they are.

Right, so just to check that we're on the same page: do we agree that after a (retrodictively trained) model is deployed for some use case other than retrodicting existing data—for generative use, say, or for use in some kind of online RL setup—then it'll be doing something other than retrodicting?

Generally speaking, yes.

And that's where the QFT model comes in. It says, actually, even if you train me for a good long while on a good amount of data, there are lots of ways for me to generalize "wrongly" from your perspective, if I'm modeling the universe at the level of quantum fields. Sure, I got all the retrodictions right while there was something to be retrodicted, but what exactly makes you think I did that by modeling the philosopher whose remarks I was being trained on?

Well, if we're following standard ML best practices, we have a train set, a dev set, and a test set. The purpose of the dev set is to check and ensure that things are generalizing properly. If they aren't generalizing properly, we tweak various hyperparameters of the model and retrain until they do generalize properly on the dev set. Then we do a final check on the test set to ensure we didn't overfit the dev set. If you forgot or never learned this stuff, I highly recommend brushing up on it.

In principle we could construct a test set or dev set either before or after the model has been trained. It shouldn't make a difference under normal circumstances. It sounds like maybe you're discussing a scenario where the model has achieved a level of omniscience, and it does fine on data that was available during its training, because it's able to read off of an omniscient world-model. But then it fails on data generated in the future, because the translation method for its omniscient world-model only works on artifacts that were present during training. Basically, the time at which the data was generated could constitute a hidden and unexpected source of distribution shift. Does that summarize the core concern?

(To be clear, this sort of acquired-omniscience is liable to sound kooky to many ML researchers. I think it's worth stress-testing alignment proposals under these sort of extreme scenarios, but I'm not sure we should weight them heavily in terms of estimating our probability of success. In this particular scenario, the model's performance would drop on data generated after training, and that would hurt the company's bottom line, and they would have a strong financial incentive to fix it. So I don't know if thinking about this is a comparative advantage for alignment researchers.)

BTW, the point about documents being IID was meant to indicate that there's little incentive for the model to e.g. retrodict the coordinates of the server storing a particular document -- the sort of data that could aid and incentivize omniscience to a greater degree.

In any case, I would argue that "accidental omniscience" characterizes the problem better than "alien abstractions". As before, you can imagine an accidentally-omniscient model that uses vanilla abstractions, or a non-omniscient model that uses alien ones.

Comment by Ebenezer Dukakis (valley9) on My AI Model Delta Compared To Yudkowsky · 2024-06-12T06:55:00.845Z · LW · GW

Because the human isn't going to constantly be present for everything the system does after it's deployed (unless for some reason it's not deployed).

I think it ought to be possible for someone to always be present. [I'm also not sure it would be necessary.]

So we need not assume that predicting "the genius philosopher" is a core task.

It's not the genius philosopher that's the core task, it's the reading of their opinions out of a QFT-based simulation of them. As I understand this thought experiment, we're doing next-token prediction on e.g. a book written by a philosopher, and in order to predict the next token using QFT, the obvious method is to use QFT to simulate the philosopher. But that's not quite enough -- you also need to read the next token out of that QFT-based simulation if you actually want to predict it. This sort of 'reading tokens out of a QFT simulation' thing would be very common, thus something the system gets good at in order to succeed at next-token prediction.


I think perhaps there's more to your thought experiment than just alien abstractions, and it's worth disentangling these assumptions. For one thing, in a standard train/dev/test setup, the model is arguably not really doing prediction; it's doing retrodiction. It's making 'predictions' about things which already happened in the past. The final model is chosen based on what retrodicts the data the best. Also, usually the data is IID rather than sequential -- there's no time component to the data points (unless it's a time-series problem, which it usually isn't). The fact that we're choosing a model which retrodicts well is why the presence/absence of a human is generally assumed to be irrelevant, and emphasizing this factor sounds wacky to my ML engineer ears.

So basically I suspect what you're really trying to claim here, which incidentally I've also seen John allude to elsewhere, is that the standard assumptions of machine learning involving retrodiction and IID data points may break down once your system gets smart enough. This is a possibility worth exploring, I just want to clarify that it seems orthogonal to the issue of alien abstractions. In principle one can imagine a system that heavily features QFT in its internal ontology yet still can be characterized as retrodicting on IID data, or a system with vanilla abstractions that can't be characterized as retrodicting on IID data. I think exploring this in a post could be valuable, because it seems like an under-discussed source of disagreement between certain doomer-type people and mainstream ML folks.

Comment by Ebenezer Dukakis (valley9) on My AI Model Delta Compared To Yudkowsky · 2024-06-11T12:23:24.288Z · LW · GW

I'm confused about what it means to "remove the human", and why it's so important whether the human is 'removed'. Maybe if I try to nail down more parameters of the hypothetical, that will help with my confusion. For the sake of argument, can I assume...

  • That the AI is running computations involving quantum fields because it found that was the most effective way to make e.g. next-token predictions on its training set?

  • That the AI is in principle capable of running computations involving quantum fields to represent a genius philosopher?

If I can assume that stuff, then it feels like a fairly core task, abundantly stress-tested during training, to read off the genius philosopher's spoken opinions about e.g. moral philosophy from the quantum fields. How else could quantum fields be useful for next-token predictions?

Another probe: Is alignment supposed to be hard in this hypothetical because the AI can't represent human values in principle? Or is it supposed to be hard because it also has a lot of unsatisfactory representations of human values, and there's no good method for finding a satisfactory needle in the unsatisfactory haystack? Or some other reason?

But remove the human, and suddenly the system is no longer operating based on its predictions of the behavior of a real physical system, and is instead operating from some learned counterfactual representation consisting of proxies in its native QFT-style ontology which happened to coincide with the actual human's behavior while the human was present.

This sounds a lot like saying "it might fail to generalize". Supposing we make a lot of progress on out-of-distribution generalization, is alignment getting any easier according to you? Wouldn't that imply our systems are getting better at choosing proxies which generalize even when the human isn't 'present'?

Comment by Ebenezer Dukakis (valley9) on My AI Model Delta Compared To Yudkowsky · 2024-06-11T10:29:57.397Z · LW · GW

I think this quantum fields example is perhaps not all that forceful, because in your OP you state

maybe a faithful and robust translation would be so long in the system’s “internal language” that the translation wouldn’t fit in the system

However, it sounds like you're describing a system where we represent humans using quantum fields as a routine matter, so fitting the translation into the system doesn't sound like a huge problem? Like, if I want to know the answer to some moral dilemma, I can simulate my favorite philosopher at the level of quantum fields in order to hear what they would say if they were asked about the dilemma. Sounds like it could be just as good as an em, where alignment is concerned.

It's hard for me to imagine a world where learning representations that allow you to make good next-token predictions etc. doesn't also produce representations that can somehow be useful for alignment. I would be interested to hear fleshed-out counterexamples.

Comment by Ebenezer Dukakis (valley9) on My AI Model Delta Compared To Yudkowsky · 2024-06-11T10:21:19.698Z · LW · GW

a very extreme failure of natural abstraction, such that human concepts cannot be faithfully and robustly translated into the system’s internal ontology at all.

This hypothetical suggests to me that the AI might not be very good at e.g. manipulating humans in an AI-box experiment, since it just doesn't understand how humans think all that well.

I wonder what MIRI thinks about this 2013 post ("The genie knows, but doesn't care") nowadays. The argument seems less persuasive now, with AIs that seem to learn representations first and are only later given agency by the devs. I actually suspect your model of Eliezer is wrong, because it seems to imply he believes "the AI actually just doesn't know", and it's a little hard for me to imagine him saying that.

Alternatively, maybe the "faithfully and robustly" bit is supposed to be very load-bearing. However, it's already the case that humans learn idiosyncratic, opaque neural representations of our values from sense data -- yet we're able to come into alignment with each other, without a bunch of heavy-duty interpretability or robustness techniques.

Comment by Ebenezer Dukakis (valley9) on Quotes from Leopold Aschenbrenner’s Situational Awareness Paper · 2024-06-10T01:39:11.399Z · LW · GW

Thank you.

I think maybe my confusion here is related to the threat model. If a model gained root access to the device that it's running on, it seems like it could probably subvert these security measures? Anyway, I'd be interested to read a more detailed description of the threat model and how this stuff is supposed to help.

More specifically, it seems a bit weird to imagine an attacker who has physical access to a running server, yet isn't able to gain de facto root access for the purpose of weight exfiltration. E.g., you could use your physical access to copy the encrypted weights onto a different drive running a different OS, boot from that drive, and have the new OS (customized to interface with the chip) exfiltrate the weights. Remember that the chip can't exactly be encrypting every result of its weight computations using some super-secret key, because if it did, the entire setup would be effectively useless. Seems to me like the OS has to be part of the TCB (trusted computing base) along with the chip?
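To make the worry concrete, here's a purely schematic Python sketch of the trust boundary as I understand it. All of the names and interfaces are made up; it's not meant to describe any real secure-accelerator design:

```python
# Schematic sketch only; names and interfaces are hypothetical.
from typing import List

class SecureChip:
    """Accelerator that keeps the weight-decryption key on-die."""

    def __init__(self, weight_key: bytes):
        self._weight_key = weight_key  # assumed never to leave the chip

    def run(self, encrypted_weights: bytes, inputs: List[float]) -> List[float]:
        weights = self._decrypt(encrypted_weights)  # decryption happens on-die
        return self._forward(weights, inputs)       # plaintext outputs go back to the caller

    def _decrypt(self, blob: bytes) -> bytes:
        raise NotImplementedError  # stand-in for on-chip crypto

    def _forward(self, weights: bytes, inputs: List[float]) -> List[float]:
        raise NotImplementedError  # stand-in for the actual matrix multiplies

def attacker_os(chip: SecureChip, encrypted_weights: bytes) -> List[float]:
    # An attacker who boots their own OS gets the same interface the legitimate
    # OS had: the raw weights may stay encrypted, but the chip will happily run
    # the model for whoever is driving it, so the OS sits inside the trust boundary.
    return chip.run(encrypted_weights, inputs=[0.0] * 1024)
```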

Comment by Ebenezer Dukakis (valley9) on O O's Shortform · 2024-06-09T13:19:06.386Z · LW · GW

Out of curiosity, does that mean that if the app worked fairly well as described, you would consider that an update that alignment maybe isn't as hard as you thought? Or are you one of the "only endpoints can be predicted" crowd, such that this wouldn't constitute any evidence?

BTW, I strongly suspect that YouTube cleaned up its comment section in recent years by using ML for comment ranking. Seems like a big improvement to me. You'll notice that "crappy YouTube comments" is not as much of a meme as it once was.

Comment by Ebenezer Dukakis (valley9) on The Perils of Popularity: A Critical Examination of LessWrong's Rational Discourse · 2024-06-09T07:08:39.014Z · LW · GW

I pretty much agree. I just wrote a comment about why I believe LLMs are the future of rational discourse. (Publishing this comment as comfort-zone-expansion on getting downvoted, so feel free to downvote.)

To be more specific about why I agree: I think the quality/agreement axes on voting help a little bit, but when it comes to any sort of acrimonious topic where many users "have a dog in the fight", they're not enough. In the justice system we have this concept that a judge is supposed to recuse themselves in certain cases when they have a conflict of interest. A judge, someone who's trained much of their career for neutral, rules-based judgement, still just isn't trusted to be neutral in certain cases. Now consider Less Wrong, a place where untrained users make decisions in just a few minutes (as opposed to a trial over multiple hours or days), without any sort of rules to go by, often after hours when they're cognitively fatigued and less capable of System 2 overrides, and where they're susceptible to Asch-conformity-type effects due to early exposure to the judgements of other users. There's a lot of content here about how to overcome your biases, which is great, but there simply isn't enough anti-bias firepower in the LW archives to consistently conquer that granddaddy of biases, myside bias. We aren't trained to consistently implement specific anti-myside-bias techniques that are backed by RCTs and certified by Cochrane, not even close, and it's dangerous to overestimate our bias-fighting abilities.

Comment by Ebenezer Dukakis (valley9) on O O's Shortform · 2024-06-09T07:00:31.417Z · LW · GW

IMO the next level-up in discourse is going to be when someone creates an LLM-moderated forum. The LLM will have a big public list of discussion guidelines in its context window. When you click "submit" on your comment, it will give your comment a provisional score (in lieu of a vote score), and tell you what you can do to improve it. The LLM won't just tell you how to be more civil or rational. It will also say things like "hey, it looks like someone else already made that point in the comments -- shall I upvote their comment for you, and extract the original portion of your comment as a new submission?" Or "back in 2013 it was argued that XYZ, and your comment doesn't seem to jibe with that. Thoughts?" Or "Point A was especially insightful, I like that!" Or "here's a way you could rewrite this more briefly, more clearly, and less aggressively". Or "here's a counterargument someone might write; perhaps you should anticipate it?"
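For concreteness, here's a rough sketch of what the submit-time check could look like. Everything in it is hypothetical: `llm` is a stand-in for whatever chat-completion API the forum would use, and the guideline text and JSON fields are just placeholders.

```python
# A minimal sketch, not a real implementation: `llm` is a stand-in for any
# chat-completion API, and the guideline text and JSON schema are made up.
import json
from typing import Callable

GUIDELINES = """\
1. Be civil; critique ideas, not people.
2. Don't restate points already made in the thread; upvote them instead.
3. Anticipate obvious counterarguments where you can.
"""

def review_comment(llm: Callable[[str], str], comment: str, thread: list[str]) -> dict:
    """Ask the moderator LLM for a provisional score plus concrete suggestions."""
    prompt = (
        f"Forum guidelines:\n{GUIDELINES}\n"
        f"Existing comments in this thread:\n{json.dumps(thread, indent=2)}\n\n"
        f"Draft comment:\n{comment}\n\n"
        'Reply with JSON: {"provisional_score": 0-10, "duplicate_of": index or null, '
        '"suggestions": [list of strings]}'
    )
    return json.loads(llm(prompt))

def on_submit(llm: Callable[[str], str], comment: str, thread: list[str]) -> None:
    """What happens when the user clicks 'submit'."""
    review = review_comment(llm, comment, thread)
    if review.get("duplicate_of") is not None:
        print(f"Comment #{review['duplicate_of']} already makes this point; upvote it instead?")
    print(f"Provisional score: {review['provisional_score']}/10")
    for suggestion in review.get("suggestions", []):
        print(f"- {suggestion}")
```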

The initial version probably won't work well, but over time, with enough debate and iteration on guidelines/finetuning/etc., the discussion on that forum will become clearly superior. It'll be the same sort of level-up we saw with Community Notes on X, or with the US court system compared to the mob rule you see on social media. Real-world humans have the problem that the more you feel you have a dog in the fight, the more you engage with the discussion, which makes politicization all but inevitable under online voting systems. The LLM is going to be like a superhumanly patient, neutral moderator, neutering the popularity-contest and ingroup/outgroup aspects of modern social media.

Comment by Ebenezer Dukakis (valley9) on Quotes from Leopold Aschenbrenner’s Situational Awareness Paper · 2024-06-08T07:44:57.232Z · LW · GW

...we’ll need much more extreme security against model self-exfiltration across the board, from hardware encryption to many-key signoff.

I found this bit confusing. Why is hardware encryption supposed to help? Why would it be better than software encryption? Is the idea to prevent the model from reading its own weights via a physical side-channel?

What exactly is "many-key signoff" supposed to mean and how is it supposed to help?