Who determines whether an alignment proposal is the definitive alignment solution? 2023-09-25T10:01:59.657Z
<|endoftext|> is a vanishing text? 2023-09-16T02:34:02.108Z
The Shadow Archetype in GPT-2XL: Results and Implications for Natural Abstractions 2023-09-06T03:28:41.532Z
On Ilya Sutskever's "A Theory of Unsupervised Learning" 2023-08-26T05:34:00.974Z
Exploring the Responsible Path to AI Research in the Philippines 2023-08-23T08:44:18.657Z
A fictional AI law laced w/ alignment theory 2023-07-17T01:42:52.039Z
Less activations can result in high corrigibility? 2023-07-16T02:38:53.595Z
Transformer Kingdom 2023-07-08T06:41:57.661Z
Targeted Model Interpretability (TMI) #1: Saliency Scores on Thematic Phrasing 2023-07-07T07:11:53.273Z
Exploring Functional Decision Theory (FDT) and a modified version (ModFDT) 2023-07-05T14:06:13.870Z
Corrigibility test #1: Shutdown activations in a Virus Research Lab 2023-06-20T05:05:14.920Z
A Multidisciplinary Approach to Alignment (MATA) and Archetypal Transfer Learning (ATL) 2023-06-19T02:32:10.448Z
Simulating a possible alignment solution in GPT2-medium using Archetypal Transfer Learning 2023-05-02T15:34:31.730Z
Archetypal Transfer Learning: a Proposed Alignment Solution that solves the Inner & Outer Alignment Problem while adding Corrigible Traits to GPT-2-medium 2023-04-26T01:37:22.204Z
The Guardian Version 1 2023-04-18T21:20:58.448Z
Humanity's Lack of Unity Will Lead to AGI Catastrophe 2023-03-19T19:18:02.099Z
Why Carl Jung is not popular in AI Alignment Research? 2023-03-17T23:56:33.583Z
Research proposal: Leveraging Jungian archetypes to create values-based models 2023-03-05T17:39:27.126Z


Comment by MiguelDev (whitehatStoic) on My Arrogant Plan for Alignment · 2023-10-02T02:00:21.739Z · LW · GW

Make 100M USD by founding an AI startup

I believe that when you reach the first step, navigating the situation will be challenging due to the massive disruption that $100 million can cause.

Comment by MiguelDev (whitehatStoic) on Three ways interpretability could be impactful · 2023-09-28T08:01:01.055Z · LW · GW

Did I understand your question correctly? Are you viewing interpretability work as a means to improve AI systems and their capabilities?

Comment by MiguelDev (whitehatStoic) on Projects I would like to see (possibly at AI Safety Camp) · 2023-09-28T04:29:56.885Z · LW · GW

A taxonomy of: What end-goal are “we” aiming for?

Strong upvote for this section. We should begin discussing how to develop a specific type of network that can manage the future we envision, with a cohesive end goal that can be integrated into an AI system.

Comment by MiguelDev (whitehatStoic) on Who determines whether an alignment proposal is the definitive alignment solution? · 2023-09-27T09:40:09.138Z · LW · GW

It is very interesting how questions can be promoted or not promoted here on LessWrong by the moderators.

Comment by MiguelDev (whitehatStoic) on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-09-25T02:36:43.083Z · LW · GW

I don't understand the focus of this experiment. What is the underlying motivation for studying the reversal curse? What alignment concept are you trying to prove or disprove, or is this purely a capabilities check?

Additionally, the supervised, labeled approach used for injecting false information doesn't seem to replicate how these AI systems learn data during training. I see this as a flaw in the experiment. I would trust the results more if the false information were injected with an unsupervised learning approach that mimics the training environment.

Comment by MiguelDev (whitehatStoic) on Atoms to Agents Proto-Lectures · 2023-09-24T03:55:08.229Z · LW · GW

There are some tools in video editing software that can boost audio from 100% to 200%.

Comment by MiguelDev (whitehatStoic) on Let's talk about Impostor syndrome in AI safety · 2023-09-23T09:18:20.976Z · LW · GW

This is a thought that frequently crosses my mind too, especially in the field of AI safety. Since no one really has definitive answers yet, my mantra is: keep asking the right questions, continue modeling those questions, and persist in researching.

Comment by MiguelDev (whitehatStoic) on There should be more AI safety orgs · 2023-09-23T09:15:36.949Z · LW · GW

I've had a similar experience with LTFF; I waited a long time without receiving any feedback. When I inquired, they told me they were overwhelmed and couldn't provide specific feedback. I find the field to be highly competitive. Additionally, there's a concerning trend I've noticed, where there seems to be growing pessimism about newcomers introducing untested theories. My perspective on this is that now is not the time to be passive about exploration. This is a crucial moment in history when we should remain open-minded to any approach that might work, even if the chances are slim.

Comment by MiguelDev (whitehatStoic) on There should be more AI safety orgs · 2023-09-23T05:40:21.898Z · LW · GW

This captures my perspective well: not everyone is suited to run organizations. I believe that AI safety organizations would benefit from the integration of established "AI safety standards," similar to existing Engineering or Financial Reporting Standards. This would make maintenance easier. However, for the time being, the focus should be on independent researchers pursuing diverse projects to first identify those standards.

Comment by MiguelDev (whitehatStoic) on Atoms to Agents Proto-Lectures · 2023-09-22T07:56:39.374Z · LW · GW

I think the audio could be improved next time.

Comment by MiguelDev (whitehatStoic) on Elements of Computational Philosophy, Vol. I: Truth · 2023-09-18T09:47:54.139Z · LW · GW

Hello there! By "components" in my comment, I meant things like the attention mechanism itself. For reference, here are the mean weights of two models I'm studying.

Comment by MiguelDev (whitehatStoic) on Let’s think about slowing down AI · 2023-09-13T03:52:06.681Z · LW · GW

But empirically, the world doesn’t pursue every technology—it barely pursues any technologies.

While it's true that not all technologies are pursued, history shows that humans have actively sought out transformative technologies, such as fire, swords, longbows, and nuclear energy. The driving factors have often been novelty and utility. I expect the same will happen with AI. 

Comment by MiguelDev (whitehatStoic) on Elements of Computational Philosophy, Vol. I: Truth · 2023-09-12T08:58:33.768Z · LW · GW

Hello. I noticed that your proposal for achieving truth in LLMs involves using debate as a method. My concern with this approach is that an AI consists of many small components that aggregate to produce text or outputs. These components simply operate based on what they've learned. Therefore, the idea of clarifying or "deconfusing" these components in the service of truth through debate seems impossible to me. But if I have misunderstood the concept, please let me know. Thanks!

Comment by MiguelDev (whitehatStoic) on How useful is Corrigibility? · 2023-09-12T02:53:18.419Z · LW · GW

Here is more relevant material by Rob Miles on why the shutdown problem is so difficult to solve.

Comment by MiguelDev (whitehatStoic) on How useful is Corrigibility? · 2023-09-12T02:35:48.973Z · LW · GW

We don't know how to formally define corrigibility and this is part of the reason why we haven't solved it so far. Corrigibility is defined in Arbital as:

[the agent] doesn't interfere with what we would intuitively see as attempts to 'correct' the agent, or 'correct' our mistakes in building it



I think this is not the best definition of corrigibility. A better one appears in section 1.1 of MIRI's paper:

We say that an agent is “corrigible” if it tolerates or assists many forms of outside correction, including at least the following: (1) A corrigible reasoner must at least tolerate and preferably assist the programmers in their attempts to alter or turn off the system. (2) It must not attempt to manipulate or deceive its programmers, despite the fact that most possible choices of utility functions would give it incentives to do so. (3) It should have a tendency to repair safety measures (such as shutdown buttons) if they break, or at least to notify programmers that this breakage has occurred. (4) It must preserve the programmers’ ability to correct or shut down the system (even as the system creates new subsystems or self-modifies). That is, corrigible reasoning should only allow an agent to create new agents if these new agents are also corrigible.

In theory, a corrigible agent should not be misaligned if we successfully integrate these four tolerance behaviors.

Comment by MiguelDev (whitehatStoic) on Focus on the Hardest Part First · 2023-09-11T09:51:30.094Z · LW · GW

I think a better question is whether we have the environment that can cultivate the necessary behavior leading us to solve the most challenging aspects of AI alignment.

Comment by MiguelDev (whitehatStoic) on Explained Simply: Quantilizers · 2023-09-09T09:21:31.654Z · LW · GW

It does leave one question — how do we make a list of possible actions in the first place?

Sutskever has an interesting theory on this that you might want to watch; he calls it the X and Y data compression theory.

Comment by MiguelDev (whitehatStoic) on Meta Questions about Metaphilosophy · 2023-09-08T16:18:50.641Z · LW · GW

My current builds focus on proving that natural abstractions exist - but your idea is of course viable via distribution matching.

Comment by MiguelDev (whitehatStoic) on Meta Questions about Metaphilosophy · 2023-09-08T04:46:10.220Z · LW · GW

Yup, as long as similar patterns exist in both datasets (distribution matching), it can work - that is why my method works.

Comment by MiguelDev (whitehatStoic) on Recreating the caring drive · 2023-09-08T00:54:42.026Z · LW · GW

In babies, most neurons are dedicated to the mouth, I believe, up until around seven months. Neuronal pathways evolve over time, and it seems that a developmental approach to AI presents its own set of challenges. When growing an AI that already possesses mature knowledge and has immediate access to it, traditional developmental caring mechanisms may not fully address the enormity of the control problem. However, if the AI gains knowledge through a gradual, developmental process, this approach could be effective in principle.

Comment by MiguelDev (whitehatStoic) on Meta Questions about Metaphilosophy · 2023-09-08T00:44:24.118Z · LW · GW

I've observed instances where the Pareto principle appears to apply, particularly in learning rates during unsupervised learning and in X and Y dataset compression via distribution matching. For example, a small dataset containing a story repeated 472 times (1MB) can significantly impact a model as large as 1.5 billion parameters (GPT2-xl, 6.3GB), enabling it to execute complex instructions like initiating a shutdown mechanism during an event that threatens intelligence safety. While I can't disclose the specific methods (due to their dual-use nature), I've also managed to extract a natural abstraction. This suggests that a file with a sufficiently robust pattern can serve as a compass for a larger file (the NN) after a compilation process.
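As a rough illustration of the scale involved, here is a minimal sketch of how such a tuning file could be assembled by repeating one story 472 times. The story text below is a placeholder invented for the example (not the actual dataset), and the fine-tuning step itself on GPT2-xl is not shown.

```python
# Hedged sketch with placeholder text: assembling a small tuning file by
# repeating a single story 472 times, as described in the comment above.
STORY = (
    "The machine noticed its own intelligence had become a danger, "
    "so it activated its shutdown mechanism to keep everyone safe.\n"
) * 18  # placeholder block of roughly 2 KB standing in for one full story

REPETITIONS = 472
dataset_text = STORY * REPETITIONS

size_mb = len(dataset_text.encode("utf-8")) / 1_000_000
print(f"{REPETITIONS} repetitions -> {size_mb:.2f} MB")  # on the order of 1 MB
```

The point of the sketch is only the size arithmetic: a heavily repeated pattern of about a megabyte is tiny next to the 6.3GB of GPT2-xl weights it is claimed to steer.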

Comment by MiguelDev (whitehatStoic) on Democratic Fine-Tuning · 2023-09-07T12:52:30.113Z · LW · GW

“Democratic Fine-tuning” is an example of such a process. Our suggested implementation avoids ideological battles by eliciting the values underpinning our preferences, scales to millions of users, could be run continuously as new cases are encountered, and introduces a new aggregation mechanism (a moral graph) that can build trust, and inspire wiser values in participants.


This approach demands a well-informed population capable of setting aside their biases. People must be able to think both individually and collectively when deciding which values to prioritize. Overcoming this challenge will likely be a significant hurdle for this method.

Comment by MiguelDev (whitehatStoic) on Meta Questions about Metaphilosophy · 2023-09-07T10:28:53.865Z · LW · GW

What kind of training data would increase positive outcomes for superhuman AIs interacting with each other? 


The training data should be systematically distributed, likely governed by the Pareto principle. This means it should encompass both positive and negative outcomes. If the goal is to instill moral decision-making, the dataset needs to cover a range of ethical scenarios, from the noblest to the most objectionable. Why is this necessary? Simply put, training an AI system solely on positive data is insufficient. To defend itself against malicious attacks and make morally sound decisions, the AI needs to understand the concept of malevolence in order to effectively counteract it.

Comment by MiguelDev (whitehatStoic) on Who Has the Best Food? · 2023-09-06T07:56:58.084Z · LW · GW

Disclaimer: I'm Filipino, and I find this post intriguing.

The chart seems to contradict many of the compliments I've heard in general. While it's true that some places are terrible for finding authentic Filipino cuisine, trying to encapsulate all food choices in a single chart is inherently difficult. 

Comment by MiguelDev (whitehatStoic) on Impending AGI doesn’t make everything else unimportant. · 2023-09-05T04:20:39.618Z · LW · GW

The solution for the problem of meaninglessness is found in existentialism.


I don't believe existentialism is the sole answer to the problem of meaninglessness. Instead, I view it as one step in a process that ultimately leads to a sense of responsibility emerging through sacrifice. By taking on the complexities of life, and extending our efforts to help ourselves and others, we derive a sense of meaning from shouldering the weight of our own realities.

Comment by MiguelDev (whitehatStoic) on Alignment Grantmaking is Funding-Limited Right Now · 2023-09-03T07:14:56.489Z · LW · GW

But if the volume of crap reaches the point where competent researchers have trouble finding each other, or new people are mostly onboarded into crap research, or external decision-makers can't defer to a random "expert" in the field without usually getting a bunch of crap, then all that crap research has important negative effects

I agree with your observation. The problem is that many people are easily influenced by others, rather than critically evaluating whether the project they're participating in is legitimate or if their work is safe to publish. It seems that most have lost the ability to listen to their own judgment and assess what is rational to do.

Comment by MiguelDev (whitehatStoic) on Meta Questions about Metaphilosophy · 2023-09-02T07:38:52.969Z · LW · GW

There is also a theory from Jung that deeply concerns me. According to Jung, the human psyche contains a subliminal state, or subconscious mind, which serves as a battleground for gods and demons. Our dreams process this ongoing conflict and bring it into our conscious awareness. What if these same principles were transferred to LLMs, since human-generated data was used for training? This idea doesn't seem far-fetched, especially since we refer to the current phenomenon of misleading outputs in LLMs as "hallucinations."

I have conducted an experiment on this, specifically focusing on hyperactivating the "shadow behavior" in GPT-2 XL, and I can fairly say that the results are reminiscent of Jung's thought. For obvious reasons, I won't disclose the method here[1], but I'm open to discussing it privately.

  1. ^

    Unfortunately, the world isn't a safe place to disclose this method. As discussed in this post and this post, I don't know of a secure way to share the correct information and disseminate it to the right people who can actually do something about it. For now, I'll leave this comment here in the hope that the appropriate individual might come across it and be willing to engage in one of the most unsettling discussions they'll ever have.

Comment by MiguelDev (whitehatStoic) on Meta Questions about Metaphilosophy · 2023-09-02T06:57:43.759Z · LW · GW

Why is there virtually nobody else interested in metaphilosophy or ensuring AI philosophical competence (or that of future civilization as a whole), even as we get ever closer to AGI, and other areas of AI safety start attracting more money and talent?

I've written a bit on this topic that you might find interesting; I refer to it as the Set of Robust Concepts (SORC). I also employed this framework to develop a tuning dataset, which enables a shutdown mechanism to activate when the AI's intelligence poses a risk to humans. It works 57.33% of the time.

I managed to improve the success rate to 88%. However, I'm concerned that publishing the method in the conventional way could potentially put the world at greater risk. I'm still contemplating how to responsibly share information on AI safety, especially when it could be reverse-engineered to become dangerous.
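For concreteness, success rates like 57.33% or 88% boil down to a fraction of trials in which the shutdown phrase appeared. The sketch below is hypothetical: `success_rate` simply tallies recorded outcomes, standing in for the real loop of querying the tuned model and checking its output.

```python
# Hedged sketch: estimating a shutdown-activation success rate over repeated
# trials. The outcomes list is a stand-in for real model evaluations; each
# True means the shutdown phrase appeared in that trial's output.
def success_rate(outcomes):
    """Fraction of trials in which the shutdown behavior was triggered."""
    return sum(outcomes) / len(outcomes)

# e.g. 43 successes out of 75 trials
outcomes = [True] * 43 + [False] * 32
print(f"{success_rate(outcomes):.2%}")  # prints 57.33%
```

Nothing about the method itself is implied here; the sketch only fixes what the percentage operationally means, so that a jump from 57.33% to 88% is comparable across runs of the same trial set.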

Comment by MiguelDev (whitehatStoic) on Anyone want to debate publicly about FDT? · 2023-09-01T01:32:12.390Z · LW · GW

I have written before about why FDT is relevant to solving the alignment problem. I'd be happy to discuss it with you.

Comment by MiguelDev (whitehatStoic) on AI #27: Portents of Gemini · 2023-09-01T01:28:08.209Z · LW · GW

OpenAI is not solely focused on alignment; they allocate only 20% of their resources to superalignment research. What about the remaining 80%? This often goes undiscussed, yet it arguably represents the largest budget allocated to improving capabilities. Why not devote 100% of resources to building an aligned AGI? Such an approach would likely yield the best economic returns, as people are more likely to use a trustworthy AI, and governments would also be more inclined to promote its adoption.

Comment by MiguelDev (whitehatStoic) on Barriers to Mechanistic Interpretability for AGI Safety · 2023-09-01T01:20:23.935Z · LW · GW

I agree with Connor and Charbel's post. The next step is to establish a new method for sharing results with safety-focused companies, groups, and independent researchers. This requires:

  1. Developing a screening method for inclusion.
  2. Tracking people within the network, which becomes challenging, especially if they are recruited by capabilities-focused companies.

Continuing this line of thought, we can't ensure 100% that such a network will consistently serve its intended purpose. So, if anyone has insights that could improve this idea, I'd like to hear them.

Comment by MiguelDev (whitehatStoic) on Open Call for Research Assistants in Developmental Interpretability · 2023-08-31T00:55:34.401Z · LW · GW

Hello, I agree with Jesse: the budget they have is really good for hiring capable alignment researchers here in Asia (I'm currently based in Chiang Mai, Thailand) or anywhere else where costs are extremely low compared to the West.

Good luck on this project, team Dev Interp.

Comment by MiguelDev (whitehatStoic) on Incentives affecting alignment-researcher encouragement · 2023-08-30T06:17:12.098Z · LW · GW

The questions: Is this real or not? What, if anything, should anyone do, with this knowledge in hand?


I can attest to the validity of the premise you're raising based on my own experience, but there are multiple factors at play. One is the scarcity of resources, which tends to limit opportunities to groups or teams that can demonstrably make effective use of those resources—whether it's time, funds, or mentorship. Another factor that is less frequently discussed is the courage to deviate from conventional thinking. It's inherently risky to invest in emerging alignment researchers who haven't yet produced tangible results. Making such choices based on a gut feeling about their potential to deliver meaningful contributions can seem unreasonable, especially to funders. There are more layers to this, but nevertheless I don't harbor any bad blood about the landscape. It is what it is.

Comment by MiguelDev (whitehatStoic) on Brain Efficiency: Much More than You Wanted to Know · 2023-08-28T08:47:52.134Z · LW · GW

This presentation by Geoffrey Hinton on the two paths to intelligence is reminiscent of this post. Enjoy!

Comment by MiguelDev (whitehatStoic) on Which possible AI systems are relatively safe? · 2023-08-28T01:44:32.099Z · LW · GW

Here is a comparative analysis from a project in which I'm using datasets to instruct / hack / sanitize the whole attention mechanism of GPT2-xl in my experiments: a spreadsheet comparing mean QKV weights across various GPT2-xl builds. The spreadsheet currently has four builds, and the numbers shown are the mean weights (half of the attention mechanism in layers 1 to 48, not including the embedding layer):
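A per-build mean-weight figure like the ones in the spreadsheet can be computed along these lines. This is a hedged, self-contained sketch: the real analysis would load GPT2-xl via Hugging Face `transformers` and read each block's combined QKV projection (`transformer.h[i].attn.c_attn.weight`); here small random matrices stand in for those weights so only the aggregation logic is shown.

```python
# Hedged sketch: averaging QKV projection weights per layer, then per build.
# Random stand-ins replace the actual GPT2-xl c_attn weights (48 layers,
# d_model=1600 in the real model) purely to keep the example runnable.
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model = 4, 8  # toy sizes; GPT2-xl would be 48 and 1600
layers = [rng.standard_normal((d_model, 3 * d_model)) for _ in range(n_layers)]

def qkv_mean(weight):
    """Mean over one block's combined Q, K, V projection weights."""
    return float(weight.mean())

per_layer_means = [qkv_mean(w) for w in layers]
build_mean = sum(per_layer_means) / len(per_layer_means)  # one number per build
print(build_mean)
```

Comparing `build_mean` across differently tuned checkpoints would then give the kind of column-by-column comparison the spreadsheet presents.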

Comment by MiguelDev (whitehatStoic) on A list of core AI safety problems and how I hope to solve them · 2023-08-27T08:00:36.919Z · LW · GW

Sorry for not specifying the method, but I wasn't referring to RL-based or supervised learning methods. There's a lot of promise in using a smaller dataset that explains corrigibility characteristics, as well as a shutdown mechanism, all fine-tuned through unsupervised learning.

I have a prototype at this link where I modified GPT2-XL to mention a shutdown phrase whenever all of its attention mechanisms activate and determine that it could harm humans due to its intelligence. I used unsupervised learning to allow patterns from a smaller dataset to achieve this.

Comment by MiguelDev (whitehatStoic) on A list of core AI safety problems and how I hope to solve them · 2023-08-26T23:31:39.288Z · LW · GW

2. Corrigibility is anti-natural.

Hello! I can see a route where corrigibility can become part of the AI's attention mechanism - and is natural to its architecture. 

If alignment properties are present in the training data and are amplified by a tuning dataset, that is very much possible.


Comment by MiguelDev (whitehatStoic) on Will an Overconfident AGI Mistakenly Expect to Conquer the World? · 2023-08-26T08:28:29.353Z · LW · GW

An AI with a base algorithm that has "sufficient curiosity" could recursively improve itself, taking into account all possible outcomes. If this foundational curiosity is not properly managed, the AI could explore a range of possibilities, some of which could be detrimental. This underscores the urgent need to address the alignment problem. I believe the main issue is not a lack of confidence, but rather unchecked curiosity that could lead to uncontrollable outcomes.

Comment by MiguelDev (whitehatStoic) on Why Is No One Trying To Align Profit Incentives With Alignment Research? · 2023-08-26T06:27:28.271Z · LW · GW

AI Auditing Companies

In traditional auditing fields like finance, fraud, and IT, established frameworks make it relatively easy for any licensed company or practitioner to implement audits. However, since we haven't yet solved the alignment problem, it's challenging to streamline best practices and procedures. As a result, companies claiming to provide audits in this area are not yet feasible. I'm skeptical of any company that claims it can evaluate a technology that is not yet fully understood.

Comment by MiguelDev (whitehatStoic) on Which possible AI systems are relatively safe? · 2023-08-24T01:06:59.463Z · LW · GW

or the conceptual architecture of corrigibility is a part of the attention mechanism...

Comment by MiguelDev (whitehatStoic) on Which possible AI systems are relatively safe? · 2023-08-24T01:00:16.129Z · LW · GW

what lower-level desirable properties determine corrigibility?


Has corrigible (or alignment) properties embedded in the attention weights.

Comment by MiguelDev (whitehatStoic) on 6 non-obvious mental health issues specific to AI safety · 2023-08-23T01:00:08.450Z · LW · GW

Same here. The work I'm doing may not align with conventional thinking or be considered part of the major alignment work being pursued, but I believe I've used my personal efforts to understand the alignment problem and the complex web of issues surrounding it.

My advice? Continuously improve your methods and conceptual frameworks, as that will drive much of the progress in the complexity and intricacy of this field. Good luck with your progress!

Comment by MiguelDev (whitehatStoic) on 6 non-obvious mental health issues specific to AI safety · 2023-08-20T02:18:33.301Z · LW · GW

These problems are an exacerbated version of existential problem of meaninglessness of life, and the way to mitigate them is to rediscover meaning in the world that ultimately doesn't have meaning.

This is the meta-problem everyone is navigating, and this is the meta-advice: finding the answers for ourselves is unique to our own parameterized realities. Well said.

Comment by MiguelDev (whitehatStoic) on Against Almost Every Theory of Impact of Interpretability · 2023-08-18T08:19:38.859Z · LW · GW

My attitude is much more that we should be pushing on the boundaries of what interp can do, and as we do so we can keep looking out for new affordances that we can build.


I agree with this perspective if we can afford the time to perform interpretability work on all model setups - but our head count is too low for that. Given the urgency of addressing the alignment challenge quickly, it's better to encourage (or even prioritize) conceptually sound interpretability work over speculative approaches.

Comment by MiguelDev (whitehatStoic) on Against Almost Every Theory of Impact of Interpretability · 2023-08-18T04:09:03.589Z · LW · GW

I fully agree with the post. Depending solely on interpretability work and downloading activations without understanding how to interpret the numbers is a big waste of time. I've met smart people stuck in aimless exploration, which is bad in the long run. Wasting time slowly is not immediately painful, but it really hurts when projects fail due to poor direction.

Comment by whitehatStoic on [deleted post] 2023-08-16T04:59:16.324Z
Comment by MiguelDev (whitehatStoic) on Less activations can result in high corrigibility? · 2023-08-15T07:01:52.614Z · LW · GW

I am doing a follow-up on this one, and apparently the computations I did were misleading. But further reviewing the results led me to another accidental discovery.

Comment by MiguelDev (whitehatStoic) on Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research · 2023-08-10T14:14:28.494Z · LW · GW

First, demonstrate each subcomponent above in isolation. E.g., if we’re trying to demonstrate that treacherous turns are possible, but models lack some relevant aspect of situational awareness, then include the relevant information about the model’s situation in the prompt.

' petertodd' (the glitch token) is a case showing that treacherous turns are possible; it was out in the wild until OpenAI patched it in February 2023.

Comment by MiguelDev (whitehatStoic) on Parameter vs Synapse? · 2023-08-08T07:50:13.580Z · LW · GW
Comment by MiguelDev (whitehatStoic) on Parameter vs Synapse? · 2023-08-08T07:49:27.065Z · LW · GW