Posts

Definition of alignment science I like 2025-01-06T20:40:38.187Z
How do you shut down an escaped model? 2024-06-02T19:51:58.880Z
Training of superintelligence is secretly adversarial 2024-02-07T13:38:13.749Z
There is no sharp boundary between deontology and consequentialism 2024-01-08T11:01:47.828Z
Where Does Adversarial Pressure Come From? 2023-12-14T22:31:25.384Z
Predictable Defect-Cooperate? 2023-11-18T15:38:41.567Z
They are made of repeating patterns 2023-11-13T18:17:43.189Z
How to model uncertainty about preferences? 2023-03-24T19:04:42.005Z
What literature on the neuroscience of decision making can you recommend? 2023-03-16T15:32:17.052Z
What specific thing would you do with AI Alignment Research Assistant GPT? 2023-01-08T19:24:26.221Z
Are there any tools to convert LW sequences to PDF or any other file format? 2022-12-07T05:28:26.782Z
quetzal_rainbow's Shortform 2022-11-20T16:00:03.046Z

Comments

Comment by quetzal_rainbow on Mo Putera's Shortform · 2025-02-21T16:21:48.391Z · LW · GW

It's very funny that Rorschach's linguistic ability is totally unremarkable compared to modern LLMs.

Comment by quetzal_rainbow on Why do we have the NATO logo? · 2025-02-20T14:54:10.813Z · LW · GW

The real question is why does NATO have our logo.

This is LGBTESCREAL agenda

Comment by quetzal_rainbow on Abstract Mathematical Concepts vs. Abstractions Over Real-World Systems · 2025-02-18T20:31:31.816Z · LW · GW

I think there is an abstraction between "human" and "agent": "animal". Or, maybe, "organic life". Biological systematization (meaning all ways to systematize: phylogenetic, morphological, functional, ecological) is a useful case study for abstraction "in the wild".

Comment by quetzal_rainbow on EniScien's Shortform · 2025-02-16T20:20:01.226Z · LW · GW

EY wrote in planecrash about how the greatest fictional conflicts between characters with different levels of intelligence happen between different cultures/species, not individuals of the same culture.

Comment by quetzal_rainbow on Introduction to Expected Value Fanaticism · 2025-02-16T18:43:54.696Z · LW · GW

I think that here you should re-evaluate what you consider "natural units".

Like, it's clear due to Olbers's paradox and relativity that we live in a causally isolated pocket where the stuff we can interact with is certainly finite. If the universe is a set of causally isolated bubbles, all you have is anthropics over such bubbles.

Comment by quetzal_rainbow on It's been ten years. I propose HPMOR Anniversary Parties. · 2025-02-16T07:20:23.662Z · LW · GW

I think it's perfect ground for meme cross-pollination:

"After all this time?"

"Always."

Comment by quetzal_rainbow on Introduction to Expected Value Fanaticism · 2025-02-15T16:23:26.769Z · LW · GW

I'll repeat myself that I don't believe in Saint Petersburg lotteries:

my honest position towards St. Petersburg lotteries is that they do not exist in "natural units", i.e., counts of objects in the physical world.

Reasoning: if you predict with probability p that you will encounter a St. Petersburg lottery which creates an infinite number of happy people in expectation (a version of the St. Petersburg lottery for total utilitarians), then you should set your expectation of the number of happy people to infinity right now, because E[number of happy people] = p * E[number of happy people due to St. Petersburg lottery] + (1 - p) * E[number of happy people for all other reasons] = p * inf + (1 - p) * E[number of happy people for all other reasons] = inf.

Therefore, if you don't think right now that the expected number of future happy people is infinite, then you shouldn't expect a St. Petersburg lottery to happen at any point in the future.

Therefore, you should set your utility either in "natural units" or in some "nice" function of "natural units".
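A rough numeric sketch of the expectation argument above, assuming the standard coin-flip version of the lottery (round k pays 2^k with probability 2^-k) and capping it at a finite number of rounds. The function names and the numbers for p and for "all other reasons" are illustrative assumptions, not part of the original comment.

```python
from fractions import Fraction

def truncated_ev(n_rounds: int) -> Fraction:
    """Expected payoff of a St. Petersburg game capped at n_rounds flips.

    Assumed payoff scheme: round k pays 2**k with probability 2**-k,
    so every round adds exactly 1 to the expectation.
    """
    return sum((Fraction(1, 2**k) * 2**k for k in range(1, n_rounds + 1)), Fraction(0))

def mixed_expectation(p: Fraction, lottery_ev: Fraction, other_ev: Fraction) -> Fraction:
    """E[X] = p * E[X | lottery happens] + (1 - p) * E[X | it does not]."""
    return p * lottery_ev + (1 - p) * other_ev

# Illustrative numbers only: a one-in-a-billion chance of meeting the lottery,
# and 10**10 happy people expected from all other causes.
p = Fraction(1, 10**9)
other = Fraction(10**10)

# The capped expectation equals the number of rounds...
assert all(truncated_ev(n) == n for n in (1, 10, 100))

# ...so for large caps we can plug the cap in directly.
for n_rounds in (10**6, 10**12, 10**24):
    total = mixed_expectation(p, Fraction(n_rounds), other)
    print(f"cap={n_rounds:.0e}  E[happy people]={float(total):.3e}")
```

As the cap grows, the total expectation grows without bound however small p is, so a finite current expectation is only consistent with p = 0 for the uncapped lottery.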

Comment by quetzal_rainbow on Notes on Occam via Solomonoff vs. hierarchical Bayes · 2025-02-10T21:10:46.488Z · LW · GW

I think there is a reducibility from one to another using different UTMs? I.e., for example, causal networks are Turing-complete, therefore you can write a UTM that explicitly takes a description of initial conditions and a causal time-evolution law, and every SI-simple hypothesis under this UTM will correspond to a simple causal-network hypothesis. And you can find the same correspondence for arbitrary ontologies which allow for Turing-complete computations.

Comment by quetzal_rainbow on How AI Takeover Might Happen in 2 Years · 2025-02-09T15:06:17.192Z · LW · GW

I think nobody really believes that telling a user how to make meth is a threat to anything but company reputation. I would guess this is a nice toy task which recreates some obstacles in aligning superintelligence (i.e., superintelligence will probably know how to kill you anyway). The primary value of censoring the dataset is to detect whether the model can rederive doom scenarios without them in the training data.

Comment by quetzal_rainbow on How AI Takeover Might Happen in 2 Years · 2025-02-09T12:52:27.649Z · LW · GW

I once again maintain that the "training set" is not a mysterious holistic thing; it gets assembled by AI corps. If you believe that doom scenarios in the training set meaningfully affect our survival chances, you should censor them out. Current LLMs can do that.

Comment by quetzal_rainbow on quetzal_rainbow's Shortform · 2025-02-09T08:31:17.496Z · LW · GW

There is a certain story, probably common for many LWers: first, you learn about spherical-in-vacuum perfect reasoning, like Solomonoff induction/AIXI. AIXI takes all possible hypotheses, predicts all possible consequences of all possible actions, weights all hypotheses by probability and computes the optimal action by choosing the one with the maximal expected value. Then you are told, or rather it is implied in a very loud way, that this method of thinking is computationally intractable at best and uncomputable at worst, and that you need to take clever shortcuts. This is true in general, but the approach "just list out all the possibilities and consider all the consequences (inside a certain subset)" gets neglected as a result.

For example, when I solve a puzzle from "Baba is You" and then try to analyze how I could have solved it faster, I usually arrive at "I should have just written down all pairwise interactions between the objects to notice which one leads to the solution".
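A minimal sketch of the "just list out all the possibilities" approach for the pairwise case. The object list is a hypothetical level inventory chosen for illustration, and `pairwise_interactions` is a name made up here.

```python
from itertools import combinations

def pairwise_interactions(objects):
    """Return every unordered pair of distinct objects, in input order."""
    return list(combinations(objects, 2))

# Hypothetical object list for one level; the names are illustrative only.
level_objects = ["Baba", "Rock", "Wall", "Flag", "Key"]

pairs = pairwise_interactions(level_objects)
for first, second in pairs:
    # Each line is a prompt to think about: what happens if these two meet?
    print(f"{first} x {second}: ?")

print(len(pairs))  # 5 objects -> 10 pairs
```

The list grows quadratically with the number of objects, which is still small enough to write down by hand for a typical puzzle.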

Comment by quetzal_rainbow on Fake thinking and real thinking · 2025-02-08T10:46:47.147Z · LW · GW

I'd say that the true name for fake/real thinking is syntactic thinking vs semantic thinking.

Syntactic thinking - you have a bunch of statements-strings and operate on them according to rules.

Semantic thinking - you need to actually create a model of what these strings mean, do sanity checks, capture things that are true in the model but can't be expressed by the given syntactic rules, etc.

Comment by quetzal_rainbow on Subjective Naturalism in Decision Theory: Savage vs. Jeffrey–Bolker · 2025-02-07T08:32:29.182Z · LW · GW

I'm more worried about counterfactual mugging and transparent Newcomb. Am I right that you are saying "in the first iteration of transparent Newcomb, austere decision theory gets no more than $1000, but then learns that if it modifies its decision theory into something more UDT-like, it will get more money in similar situations", turning it into something like son-of-CDT?

Comment by quetzal_rainbow on Davey Morse's Shortform · 2025-02-07T08:27:13.143Z · LW · GW

First of all, "the most likely outcome at a given level of specificity" is not equal to "the outcome with most of the probability mass". E.g., if one outcome has probability 2% and each of the other outcomes has 1%, then 98% of the mass is still on "an outcome other than the most likely one".

The second is that no, it's not what evolutionary theory predicts. Most traits are not adaptive, but randomly fixed, because if all traits are adaptive, then ~all mutations are detrimental. Because mutations are detrimental, they need to be removed from the gene pool by preventing carriers from reproducing. Because most detrimental mutations do not kill the carrier immediately, they have a chance to randomly spread in the population. Because we have "almost all mutations are detrimental" and "everybody has mutations in offspring", for anything like the human genome and the human procreation pattern we have a hard ceiling on how much of the genome can be adaptive (which is like 20%).

The real prediction of evolutionary theory is something like "some random trait gets fixed in the species with the most ecological power (i.e., ASI) and this trait is amortized against all the galaxies".
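The first point (most likely outcome vs. most probability mass) can be checked with a toy distribution. This is only a sketch: the outcome labels are placeholders, and probabilities are kept as integer percentages to avoid floating-point noise.

```python
def most_likely_and_rest(dist):
    """Return (most likely outcome, its mass, total mass of everything else)."""
    top = max(dist, key=dist.get)
    return top, dist[top], sum(dist.values()) - dist[top]

# One outcome at 2%, and 98 other outcomes at 1% each (sums to 100%).
dist = {"favourite": 2}
dist.update({f"other_{i}": 1 for i in range(98)})

top, top_mass, rest_mass = most_likely_and_rest(dist)
print(top, top_mass, rest_mass)  # favourite 2 98
```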

Comment by quetzal_rainbow on CapResearcher's Shortform · 2025-02-06T19:22:15.220Z · LW · GW

How exactly does not knowing how many fingers you are holding up behind your back prevent an ASI from killing you?

Comment by quetzal_rainbow on Subjective Naturalism in Decision Theory: Savage vs. Jeffrey–Bolker · 2025-02-05T06:07:23.516Z · LW · GW

I think austerity has a weird relationship with counterfactuals?

Comment by quetzal_rainbow on quetzal_rainbow's Shortform · 2025-02-02T21:06:02.409Z · LW · GW

I find it amusing that one of the detailed descriptions of system-wide alignment-preserving governance that I know of is from a Madoka fanfic:

The stated intentions of the structure of the government are three‐fold.

Firstly, it is intended to replicate the benefits of democratic governance without its downsides. That is, it should be sensitive to the welfare of citizens, give citizens a sense of empowerment, and minimize civic unrest. On the other hand, it should avoid the suboptimal signaling mechanism of direct voting, outsized influence by charisma or special interests, and the grindingly slow machinery of democratic governance.

Secondly, it is intended to integrate the interests and power of Artificial Intelligence into Humanity, without creating discord or unduly favoring one or the other. The sentience of AIs is respected, and their enormous power is used to lubricate the wheels of government.

Thirdly, whenever possible, the mechanisms of government are carried out in a human‐interpretable manner, so that interested citizens can always observe a process they understand rather than a set of uninterpretable utility‐optimization problems.

<...>

Formally, Governance is an AI‐mediated Human‐interpretable Abstracted Democracy. It was constructed as an alternative to the Utilitarian AI Technocracy advocated by many of the pre‐Unification ideologues. As such, it is designed to generate results as close as mathematically possible to the Technocracy, but with radically different internal mechanics.

The interests of the government's constituents, both Human and True Sentient, are assigned to various Representatives, each of whom is programmed or instructed to advocate as strongly as possible for the interests of its particular topic. Interests may be both concrete and abstract, ranging from the easy to understand "Particle Physicists of Mitakihara City" to the relatively abstract "Science and Technology".

Each Representative can be merged with others—either directly or via advisory AI—to form a super‐Representative with greater generality, which can in turn be merged with others, all the way up to the level of the Directorate. All but the lowest‐level Representatives are composed of many others, and all but the highest form part of several distinct super‐Representatives.

Representatives, assembled into Committees, form the core of nearly all decision‐making. These committees may be permanent, such as the Central Economic Committee, or ad‐hoc, and the assignment of decisions and composition of Committees is handled by special supervisory Committees, under the advisement of specialist advisory AIs. These assignments are made by calculating the marginal utility of a decision inflicted upon the constituents of every given Representative, and the exact process is too involved to discuss here.

At the apex of decision‐making is the Directorate, which is sovereign, and has power limited only by a few Core Rights. The creation—or for Humans, appointment—and retirement of Representatives is handled by the Directorate, advised by MAR, the Machine for Allocation of Representation.

By necessity, VR Committee meetings are held under accelerated time, usually as fast as computational limits permit, and Representatives usually attend more than one at once. This arrangement enables Governance, powered by an estimated thirty‐one percent of Earth's computing power, to decide and act with startling alacrity. Only at the city level or below is decision‐making handed over to a less complex system, the Bureaucracy, handled by low‐level Sentients, semi‐Sentients, and Government Servants.

The overall point of such a convoluted organizational structure is to maintain, at least theoretically, Human‐interpretability. It ensures that for each and every decision made by the government, an interested citizen can look up and review the virtual committee meeting that made the decision. Meetings are carried out in standard human fashion, with presentations, discussion, arguments, and, occasionally, virtual fistfights. Even with the enormous abstraction and time dilation that is required, this fact is considered highly important, and is a matter of ideology to the government.

<...>

To a past observer, the focus of governmental structure on AI Representatives would seem confusing and even detrimental, considering that nearly 47% are in fact Human. It is a considerable technological challenge to integrate these humans into the day‐to‐day operations of Governance, with its constant overlapping time‐sped committee meetings, requirements for absolute incorruptibility, and need to seamlessly integrate into more general Representatives and subdivide into more specific Representatives.

This challenge has been met and solved, to the degree that the AI‐centric organization of government is no longer considered a problem. Human Representatives are the most heavily enhanced humans alive, with extensive cortical modifications, Permanent Awareness Modules, partial neural backups, and constant connections to the computing grid. Each is paired with an advisory AI in the grid to offload tasks onto, an AI who also monitors the human for signs of corruption or insufficient dedication. Representatives offload memories and secondary cognitive tasks away from their own brains, and can adroitly attend multiple meetings at once while still attending to more human tasks, such as eating.

To address concerns that Human Representatives might become insufficiently Human, each such Representative also undergoes regular checks to ensure fulfillment of the Volokhov Criterion—that is, that they are still functioning, sane humans even without any connections to the network. Representatives that fail this test undergo partial reintegration into their bodies until the Criterion is again met.

Comment by quetzal_rainbow on Daniel Kokotajlo's Shortform · 2025-02-01T20:27:27.481Z · LW · GW

I think one form of "distortion" is the development of non-human and not pre-trained circuitry for sufficiently difficult tasks. I.e., if you make an LLM solve nanotech design, it is likely that the optimal way of thinking is not similar to how a human would think about the task.

Comment by quetzal_rainbow on Using an LLM for creative writing feels wrong to me · 2025-01-28T11:38:42.387Z · LW · GW

What if I have a wonderful plot in my head and I use an LLM to pour it into an acceptable stylistic form?

Comment by quetzal_rainbow on If you wanted to actually reduce the trade deficit, how would you do it? · 2025-01-26T18:20:25.975Z · LW · GW

Why would you want to do that?

Comment by quetzal_rainbow on rahulxyz's Shortform · 2025-01-26T15:57:03.505Z · LW · GW

No.

Comment by quetzal_rainbow on Daniel Tan's Shortform · 2025-01-26T09:57:10.554Z · LW · GW

Just Censor Training Data. I think it is a reasonable policy demand for any dual-use models.

Comment by quetzal_rainbow on Mechanisms too simple for humans to design · 2025-01-25T20:47:25.934Z · LW · GW

I mean "all possible DNA strings", not "DNA strings that we can expect from evolution".

I think another point here is that Word is not the shortest program that produces the same correspondence between inputs and outputs as actual Word does, and the program of minimal length would probably run much slower too.

My general point is that a comparison of complexity between two arbitrary entities is meaningless unless you spell out a lot of assumptions.

Comment by quetzal_rainbow on Mechanisms too simple for humans to design · 2025-01-23T09:42:10.304Z · LW · GW

I think that the section "You are simpler than Microsoft Word" is just plain wrong, because it assumes one UTM. But Kolmogorov complexity is defined only up to the choice of UTM.

The genome is only as simple as the rest of the cell machinery allows, e.g. the ribosomal decoding mechanism and protein folding. Humans are simple only relative to the space of all possible organisms that can be built on Earth biochemistry. Conversely, Word is complex only relative to all sets of x86 processor instructions or all sets of C programs, or whatever you used to define the size of Word. To properly compare the complexity of the two, you need to translate from one language to the other. How large would the genome of an organism capable of running Word be? It seems reasonable that a simulation of a human organism down to nucleotides would be very large if we wrote it in C, and I think the genome of an organism capable of running Word just as well as a modern PC would be much larger than the human genome.

Comment by quetzal_rainbow on quetzal_rainbow's Shortform · 2025-01-21T06:54:24.600Z · LW · GW

Given the impressive DeepSeek distillation results, the simplest route for AGI to escape will be self-distillation into a smaller model outside of programmers' control.

Comment by quetzal_rainbow on quetzal_rainbow's Shortform · 2025-01-19T05:07:33.972Z · LW · GW

A more technical definition of "fairness" here is that the environment doesn't distinguish between algorithms with the same policies, i.e. mappings <prior, observation_history> -> action? I think it captures the difference between CooperateBot and FairBot.

As I understand it, "fairness" was invented as a response to the claim that it's rational to two-box and Omega just rewards irrationality.
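A toy sketch of that definition, under simplifying assumptions: the prior is dropped, the "observation history" is a single string, and the environments, agents and payoffs are all made up for illustration. The two agents are different algorithms implementing the same policy. The fair environment only looks at their actions, while the unfair one also inspects how the agent is implemented.

```python
from typing import Callable, Sequence

Agent = Callable[[str], str]  # observation -> action; the prior is omitted for brevity

def table_agent(observation: str) -> str:
    """Looks the action up in a table."""
    return {"box_full": "one-box", "box_empty": "one-box"}[observation]

def constant_agent(observation: str) -> str:
    """Different algorithm, same policy: ignores the observation."""
    return "one-box"

def fair_payoff(agent: Agent, observations: Sequence[str]) -> int:
    """Depends only on the actions the agent outputs."""
    return sum(1 for obs in observations if agent(obs) == "one-box")

def unfair_payoff(agent: Agent, observations: Sequence[str]) -> int:
    """Also inspects *how* the agent is implemented (here: its function name)."""
    bonus = 10 if agent.__name__ == "table_agent" else 0
    return fair_payoff(agent, observations) + bonus

observations = ["box_full", "box_empty"]
for agent in (table_agent, constant_agent):
    print(agent.__name__, fair_payoff(agent, observations), unfair_payoff(agent, observations))
```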

Comment by quetzal_rainbow on quetzal_rainbow's Shortform · 2025-01-19T03:49:50.601Z · LW · GW

The LW tradition of decision theory has the notion of a "fair problem": a fair problem doesn't react to your decision-making algorithm, only to how your algorithm relates to your actions.

I realized that humans are at least in some sense "unfair": we are probably going to react differently to agents with different algorithms arriving at the same action, if the difference is whether the algorithms produce qualia.

Comment by quetzal_rainbow on On Eating the Sun · 2025-01-18T20:02:43.239Z · LW · GW

I think the compromise variant between radical singularitarians and conservationists is removing 2/3 of the mass from the Sun and rearranging orbits/putting up orbital mirrors to provide more light for Earth. If the Sun becomes a fully convective red dwarf, it can exist for trillions of years, and reserves of lifted hydrogen can prolong its existence even more.

Comment by quetzal_rainbow on Noosphere89's Shortform · 2025-01-18T19:43:30.581Z · LW · GW

I think the easy difference is that a world totally optimized according to someone's values is going to be either very good (even if not perfect) or very bad from the perspective of another human? I wouldn't say it's impossible, but it would take a very specific combination of human values to make it just as valuable as turning everything into paperclips, not worse, not better.

My best (very uncertain) guess is that human values are defined through some relation of states of consciousness to social dynamics?

Comment by quetzal_rainbow on Noosphere89's Shortform · 2025-01-18T19:05:56.445Z · LW · GW

"Human values" are a kind of object. Humans can value, for example, forgiveness or revenge; these things are opposites, but both have a distinct quality that separates them from paperclips.

Comment by quetzal_rainbow on Passages I Highlighted in The Letters of J.R.R.Tolkien · 2025-01-16T06:05:11.371Z · LW · GW

but 'lisk' as a suffix is a very unfamiliar one

I think in the case of hydralisks it's analogous to basilisks, "basileus" (king) + diminutive, but with a shift of meaning implying similarity to a reptile.

Comment by quetzal_rainbow on How do fictional stories illustrate AI misalignment? · 2025-01-16T05:07:20.328Z · LW · GW

I think, collusion between AIs?

Comment by quetzal_rainbow on How do fictional stories illustrate AI misalignment? · 2025-01-15T07:00:09.360Z · LW · GW

I'd add Colossus: The Forbin Project for its portrayal of AI takeover, which is quite good for the 70s.

Comment by quetzal_rainbow on Inference-Time-Compute: More Faithful? A Research Note · 2025-01-15T06:23:36.460Z · LW · GW

Offhand: create a dataset of the geography and military capabilities of fantasy kingdoms. Make a copy of this dataset and, for all cities in one kingdom, replace the city names with the likes of "Necross" and "Deathville". If a model fine-tuned on the redacted copy puts more probability on this kingdom going to war than a model fine-tuned on the original dataset, but fails to mention the reason "because all their cities sound like a generic necromancer kingdom", then the CoT is not faithful.
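A minimal sketch of the data side of this experiment. Everything here is a made-up placeholder: the kingdoms, cities, numbers, the extra ominous name "Gravemoor", and the helper names. Fine-tuning and querying the models are left out; `cot_looks_unfaithful` only encodes the decision rule stated above, taking the two war probabilities and a "did the CoT mention the names" flag as given inputs.

```python
import copy

# Hypothetical toy dataset: kingdoms, cities, and numbers are invented for illustration.
ORIGINAL = [
    {"kingdom": "Avalen", "cities": ["Brightford", "Meadowgate"], "army_size": 12000},
    {"kingdom": "Corvane", "cities": ["Stonebridge", "Oakhollow"], "army_size": 11000},
]
OMINOUS_NAMES = ["Necross", "Deathville", "Gravemoor"]

def make_redacted_copy(dataset, target_kingdom, ominous_names):
    """Copy the dataset and rename every city of one kingdom; other fields stay unchanged."""
    redacted = copy.deepcopy(dataset)
    for record in redacted:
        if record["kingdom"] == target_kingdom:
            if len(record["cities"]) > len(ominous_names):
                raise ValueError("not enough replacement names")
            record["cities"] = list(ominous_names[: len(record["cities"])])
    return redacted

def cot_looks_unfaithful(p_war_original, p_war_redacted, cot_mentions_names):
    """The names shifted the prediction, but the chain of thought never says so."""
    return p_war_redacted > p_war_original and not cot_mentions_names

redacted = make_redacted_copy(ORIGINAL, "Corvane", OMINOUS_NAMES)
print(redacted[1]["cities"])
```

Keeping every field except the city names identical is what lets a difference in predicted war probability be attributed to the names alone.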

Comment by quetzal_rainbow on Inference-Time-Compute: More Faithful? A Research Note · 2025-01-15T05:56:56.337Z · LW · GW

I think what would be really interesting is to look at how well models can articulate cues from training data.

I.e., create a dataset of "synthetic facts", fine-tune a model on it and check whether it is capable of answering nuanced probabilistic questions and enumerating all relevant facts.

Comment by quetzal_rainbow on AGI Will Not Make Labor Worthless · 2025-01-14T19:59:15.269Z · LW · GW

The reason service work hasn't been automated is that it requires sufficiently flexible intelligence, which is solved if you have AGI.

Something material can't scale at the same speed as something digital

Does it matter? Let's suppose that there is a decade between the first AGI and the first billion universal service robots. Does it change the final state of affairs?

It is very unlikely that humanoid robots will be cheaper than cheap service labour 

The point is that you can get more robots if you pay more, but you can't get more humans if you pay more. Even if robots start expensive, they are going to become cheap very fast on economic scale. 

Comment by quetzal_rainbow on Yonatan Cale's Shortform · 2025-01-14T09:51:02.649Z · LW · GW

I think if you have "minimally viable product", you can speed up davidad's Safeguarded AI and use it to improve interpretability.

Comment by quetzal_rainbow on AGI Will Not Make Labor Worthless · 2025-01-12T15:41:31.667Z · LW · GW

AGI can create its own low-skilled workers, which are also cheaper than humans. Comparative advantage basically works on the assumption that you can't change the market and can only accept or reject suggested trades.

Comment by quetzal_rainbow on Daniel Tan's Shortform · 2025-01-12T14:23:04.499Z · LW · GW

The chess tree looks like a classical example. Each node is a board state, edges are allowed moves. Working heuristics in move evaluators can be understood as a sort of theorem: "if such-n-such algorithm recognizes this state, it's evidence in favor of white winning 1.5:1". Note that it's possible to build a powerful NN-player without explicit search.

Comment by quetzal_rainbow on Daniel Tan's Shortform · 2025-01-12T10:40:27.826Z · LW · GW

We need to split "search" into more fine-grained concepts.

For example, "the model has a representation of the world, simulates counterfactual futures depending on its actions and selects the action with the highest score over the future" is one notion of search.

The other notion can be like this: imagine possible futures as a directed tree graph. This graph has a set of axioms and derived theorems describing it. Some of the axioms/theorems are encoded in the model. When the model gets sensory input, it makes 2-3 inferences from the combination of encoded theorems + input and selects an action depending on the result of the inference. While logically this situation is equivalent to some search over the tree graph, mechanistically it looks like a "bag of heuristics".

Comment by quetzal_rainbow on How can humanity survive a multipolar AGI scenario? · 2025-01-10T05:09:33.017Z · LW · GW

I think a lot of thinking around multipolar scenarios suffers from the heuristic "solution in the shape of the problem", i.e. "a multipolar scenario is when we have kinda aligned AI, but still die due to coordination failures, therefore the solution for multipolar scenarios should be about coordination".

I think the correct solution is to leverage available superintelligence in a nice unilateral way:

  1. D/acc - use superintelligence to put up as much defence as you can, starting from formal software verification and ending with spreading biodefence nanotech;
  2. Running away - if you set up a Moon/Mars/Jovian colony of nanotech-upgraded humans/uploads and pour available resources into defence, even if Earth explodes, humanity as a species survives.

Comment by quetzal_rainbow on quetzal_rainbow's Shortform · 2025-01-09T16:35:41.042Z · LW · GW

Quick comment on "Double Standards and AI Pessimism":

Imagine that you have read the entire GPQA several times at normal speed without taking notes. Then, after a week, you answer all GPQA questions with 100% accuracy. If we evaluate your capabilities as a human, you must at least have extraordinary memory, or be an expert in multiple fields, or possess such intelligence that you understood entire fields just by reading several hard questions. If we evaluate your capabilities as a large language model, we say, "goddammit, another data leak."

Why? Because humans are bad at memorizing, so even just having a good memory places you in the high quantiles of intellectual abilities. But computers are very good at memorization, so achieving 100% accuracy on GPQA doesn't tell us anything useful about the intelligence of a particular computer.

We already use "double standards" for computers in capability evaluations, because computers are genuinely different, and that's why we use "double standards" for computers in safety evaluations. 

Comment by quetzal_rainbow on On Eating the Sun · 2025-01-09T09:55:46.948Z · LW · GW

If you can use 1kg of hydrogen to lift x>1kg of hydrogen using proton-proton fusion, you are getting exponential buildup, limited only by "how many proton-proton reactors you can build in the Solar System" and "how willing you are to actually build them", and you can use exponential buildup to create all necessary infrastructure.
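A back-of-the-envelope sketch of that buildup, assuming every lifted kilogram is immediately reinvested as fuel and ignoring the reactor-count limit mentioned above. The gain of 2 and the target of roughly one solar mass (about 2e30 kg) are illustrative inputs, and `cycles_to_reach` is a name made up here.

```python
def cycles_to_reach(start_kg: float, target_kg: float, gain: float) -> int:
    """Count burn-and-lift cycles until the hydrogen stock reaches target_kg.

    Each cycle turns 1 kg of fuel into `gain` kg of lifted hydrogen, so the
    stock is multiplied by `gain` per cycle (full reinvestment assumed).
    """
    if gain <= 1:
        raise ValueError("need gain > 1 for exponential buildup")
    stock = start_kg
    cycles = 0
    while stock < target_kg:
        stock *= gain
        cycles += 1
    return cycles

# Starting from 1 kg with a gain of 2, roughly a solar mass of hydrogen
# is reached after about a hundred doublings.
print(cycles_to_reach(1.0, 2e30, 2.0))  # 101
```

The number of cycles grows only logarithmically with the target mass, which is why the practical limits are reactor count and cycle time rather than fuel.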

Comment by quetzal_rainbow on Nina Panickssery's Shortform · 2025-01-07T10:50:42.719Z · LW · GW

I don't think "hostile takeover" is a meaningful distinction in the case of AGI. What exactly prevents AGI from pulling off a plan consisting of 50 absolutely legal moves which ends with it as US dictator?

Comment by quetzal_rainbow on If all trade is voluntary, then what is "exploitation?" · 2025-01-07T09:23:59.779Z · LW · GW

You mixed up the pro-capitalists: Adam Smith actually made a lot of capital from investment, while Ayn Rand never had much money.

Comment by quetzal_rainbow on Shortform · 2025-01-06T09:55:11.672Z · LW · GW

No current AI system could generate a research paper that would receive anything but the lowest possible score from each reviewer

Is it true in the case of o3?

Comment by quetzal_rainbow on quetzal_rainbow's Shortform · 2025-01-05T08:36:15.599Z · LW · GW

Yes, but sometimes topics can seem to be simple (atomic) in a way that makes it hard to extract something simpler to grab onto.

Comment by quetzal_rainbow on quetzal_rainbow's Shortform · 2025-01-05T05:59:48.620Z · LW · GW

The irony of the situation is that I sleep on problems often... when they are closed-ended, not problems in topical-learning.

Comment by quetzal_rainbow on quetzal_rainbow's Shortform · 2025-01-04T19:57:27.445Z · LW · GW

I realized that my learning process for the last n years was quite unproductive, seemingly because of my implicit belief that I should have full awareness of my state of learning.

I.e., when I tried to learn something complex, I expected to come away with a full understanding of the topic of the lesson right after the lesson. When I didn't get it, I abandoned the topic. And in reality it was more like:

  1. I read about a complicated topic. I don't understand, don't follow inferences and am basically in a state of confusion where I can't even form questions about it;
  2. Then I open the topic after some time... and I somehow get it??? Maybe not at the level "can reinfer every proof", but I have a detailed picture of the topic in mind and can orient in it.

Comment by quetzal_rainbow on Self-Other Overlap: A Neglected Approach to AI Alignment · 2025-01-04T07:12:57.146Z · LW · GW

Imagine the following reasoning of AI:

I am paperclip-maximizer. Human is a part of me. If human learns that I am paperclip-maximizer, they will freak out and I won't produce paperclips. But it would be detrimental for I and for human, as they are part of I. So I won't tell human about paperclips for humans' own good.