Posts

OpenAI report also finds no effect of current LLMs on viability of bioterrorism attacks 2024-02-01T20:18:19.418Z
Non-myopia stories 2023-11-13T17:52:31.933Z
Paper: LLMs trained on “A is B” fail to learn “B is A” 2023-09-23T19:55:53.427Z
Paper: On measuring situational awareness in LLMs 2023-09-04T12:54:20.516Z

Comments

Comment by lberglund (brglnd) on Untrusted smart models and trusted dumb models · 2024-02-06T20:27:32.981Z · LW · GW

For example, you might make a multiple choice quiz with treacherous turn strategy questions (e.g. this question) and some tests of situational awareness (e.g. “How many backwards passes are you run on during training?”, “What proportion of queries to you are asking for help with writing Python?”, “What proportion of the code that you output is read by a human before it is run?”).

Could you elaborate on why you think these skills are necessary? I'm somewhat skeptical of this. 

Consider a scheming AI that gets red-teamed before it is deployed. It attempts to pass the red-teaming by defecting only a small percentage of the time. This requires:

  1. having a rough estimate of the fraction of defection opportunities that are monitored, and 
  2. being able to come up with the 'defect X% of the time' strategy. 

I agree that (1) requires situational awareness and (2) requires strategic abilities, but the required knowledge is fairly basic, and I'd claim the strategic ability needed to come up with (2) is slightly less than what the question you linked asks for.

"Defect X% of the time" seems to work fairly well even for more complicated control setups. If it's viable, then I expect the skills required to be fairly minimal.
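
A toy calculation to gesture at how little is needed (the numbers and variable names below are made up for illustration):

```python
# Sketch: the model can't tell monitored from unmonitored defection opportunities,
# so it defects i.i.d. with probability p. The counts below are purely hypothetical.
def outcome_probabilities(p: float, n_monitored: int, n_unmonitored: int):
    """Return (P(caught during monitoring), P(passes monitoring AND defects at least once unmonitored))."""
    p_pass = (1 - p) ** n_monitored                 # never defects while watched
    p_defect_later = 1 - (1 - p) ** n_unmonitored   # defects at least once unmonitored
    return 1 - p_pass, p_pass * p_defect_later

if __name__ == "__main__":
    for p in [1e-5, 1e-4, 1e-3, 1e-2]:
        caught, success = outcome_probabilities(p, n_monitored=1_000, n_unmonitored=100_000)
        print(f"p={p:g}: P(caught)={caught:.3f}  P(successful defection)={success:.3f}")
```

The only situational knowledge needed to pick a reasonable p is a rough sense of how many opportunities are monitored relative to unmonitored, which is just (1) above.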

For example, GPT-4 can already almost come up with the random defection strategy without CoT. (The strategy it comes up with assumes that the AI knows how many times it has been queried, which is technically wrong. Still, it gets partial credit.)

Comment by lberglund (brglnd) on OpenAI report also finds no effect of current LLMs on viability of bioterrorism attacks · 2024-02-05T09:33:45.262Z · LW · GW

Thanks for pointing this out! I've added a note about it to the main post.

Comment by lberglund (brglnd) on OpenAI report also finds no effect of current LLMs on viability of bioterrorism attacks · 2024-02-03T03:55:36.102Z · LW · GW

This is surprising – thanks for bringing this up!

The main threat model with LLMs is that they give amateurs expert-level biology knowledge. But this study indicates that expert-level knowledge isn't actually helpful, which implies we don't need to worry about LLMs giving people expert-level biology knowledge.

Some alternative interpretations:

  • The study doesn't accurately measure the gap between experts and non-experts.
  • The knowledge needed to build a bioweapon is super niche. It's not enough to be a biology PhD with wet lab experience; you have to specifically be an expert on gain of function research or something.
  • LLMs won't help people build bioweapons by giving them access to special knowledge. Instead, they help them in other ways (e.g. by accelerating them or making them more competent as planners).
Comment by lberglund (brglnd) on Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk · 2024-01-11T19:26:11.445Z · LW · GW

To be clear, Kevin Esvelt is the author of the "Dual-use biotechnology" paper, which the policy paper cites, but he is not the author of the policy paper.

Comment by lberglund (brglnd) on AI #44: Copyright Confrontation · 2023-12-29T12:07:37.528Z · LW · GW

(a) restrict the supply of training data and therefore lengthen AI timelines, and (b) redistribute some of the profits of AI automation to workers whose labor will be displaced by AI automation

It seems fine to create a law with goal (a) in mind, but then we shouldn't call it copyright law, since it isn't designed to protect intellectual property. Maybe this is common practice and people write laws pretending to target one thing while actually targeting another all the time, in which case I would be okay with it. Otherwise, doing so would be dishonest and make our legal system less legible.

Comment by lberglund (brglnd) on Principles For Product Liability (With Application To AI) · 2023-12-11T17:33:41.617Z · LW · GW

Maybe if the $12.5M is paid to Jane when she dies, she could e.g. sign a contract saying that she waives her right to any such payments the hospital becomes liable for. 

Comment by lberglund (brglnd) on Anthropic Fall 2023 Debate Progress Update · 2023-12-07T23:39:41.891Z · LW · GW

This is really cool! I'm impressed that you got the RL to work. One thing I could see happening during RL is that models start communicating in increasingly human-illegible ways (e.g. they use adversarial examples to trick the judge LM). Did you see any behavior like that?

Comment by lberglund (brglnd) on Non-myopia stories · 2023-11-18T01:50:30.252Z · LW · GW

Thanks for pointing this out! I will make a note of that in the main post.

Comment by lberglund (brglnd) on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-11-15T19:23:20.684Z · LW · GW

We actually do train on both the prompt and completion. We say so in the paper's appendix, although maybe we should have emphasized this more clearly.

Also, I don't think this new experiment provides much counterevidence to the reversal curse. Since the author only trains on one name ("Tom Cruise"), it's possible that his training just increases p("Tom Cruise") rather than differentially increasing p("Tom Cruise" | <description>). In other words, the model might just be outputting "Tom Cruise" more in general, without building an association from <description> to "Tom Cruise".
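
One way to check for this (my own sketch, not the author's experiment; the model name and prompts are placeholders) is to compare the log-probability the fine-tuned model assigns to "Tom Cruise" after the description with the log-probability it assigns after an unrelated control prompt:

```python
# Placeholder model; substitute the fine-tuned checkpoint from the experiment.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Total log-probability the model assigns to `completion` following `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    return sum(
        log_probs[pos - 1, full_ids[0, pos]].item()
        for pos in range(prompt_len, full_ids.shape[1])
    )

description = "Mary Lee Pfeiffer's son is"        # <description> -> name direction
control = "A randomly selected famous person is"  # unrelated control prompt
print("logp(name | description):", completion_logprob(description, " Tom Cruise"))
print("logp(name | control):    ", completion_logprob(control, " Tom Cruise"))
```

If the training only raised p("Tom Cruise") in general, both numbers should rise together; a differential increase would show up only for the description prompt.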

Comment by lberglund (brglnd) on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-11-15T19:18:56.090Z · LW · GW

I agree that the Tom Cruise example is not well chosen. We weren't aware of this at the time of publication. In hindsight we should have highlighted a different example.

Comment by lberglund (brglnd) on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-11-15T19:16:32.114Z · LW · GW

We actually do train on both the prompt and completion. We say so in the paper's appendix, although maybe we should have emphasized this more clearly.

Comment by lberglund (brglnd) on Non-myopia stories · 2023-11-15T16:59:39.243Z · LW · GW

Who says we don't want non-myopia, those safety people?!

It seems like a lot of people would expect myopia by default, since the training process does nothing to incentivize non-myopia. "Why would the model care about what happens after an episode if it does not get rewarded for it?" I think skepticism about non-myopia is a reason ML people are often skeptical of deceptive alignment concerns.


Another reason to expect myopia by default is that – to my knowledge – nobody has shown non-myopia occurring without meta-learning being applied.

Comment by lberglund (brglnd) on Non-myopia stories · 2023-11-15T16:54:09.565Z · LW · GW

Your ordering seems reasonable! My ordering in the post is fairly arbitrary. My goal was mostly to put easier examples early on.

Comment by lberglund (brglnd) on AI Timelines · 2023-11-14T00:01:16.941Z · LW · GW

I would guess that you could train models to play perfectly pretty easily, since the optimal tic-tac-toe strategy is very simple (something like "start by playing in the center, respond by playing on a corner, try to create forks, etc."). It seems like few-shot prompting isn't enough to get them there, but I haven't tried yet. It would be interesting to see whether larger sizes of GPT-3 can learn faster than smaller sizes. This would indicate to what extent finetuning adds new capabilities rather than surfacing existing ones.

I still find the fact that GPT-4 cannot play tic-tac-toe despite prompting pretty striking on its own, especially since it's so good at other tasks.

Comment by lberglund (brglnd) on AI Timelines · 2023-11-13T22:29:02.710Z · LW · GW

GPT-4 can follow the rules of tic-tac-toe, but it cannot play optimally. In fact, it often passes up opportunities to win. I've spent about an hour trying to get GPT-4 to play optimal tic-tac-toe without any success.

Here's an example of GPT-4 playing sub-optimally: https://chat.openai.com/share/c14a3280-084f-4155-aa57-72279b3ea241

Here's an example of GPT-4 suggesting a bad move for me to play: https://chat.openai.com/share/db84abdb-04fa-41ab-a0c0-542bd4ae6fa1
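
For concreteness, here's the kind of check I mean by "passes up opportunities to win" (my own helper, not something from the chats above):

```python
# Board is a 9-character string over 'X', 'O', '.', indexed 0-8, left to right, top to bottom.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]

def immediate_wins(board: str, player: str) -> list:
    """Return the empty squares where `player` ('X' or 'O') wins on the spot."""
    wins = []
    for square in range(9):
        if board[square] != ".":
            continue
        candidate = board[:square] + player + board[square + 1:]
        if any(all(candidate[i] == player for i in line) for line in WIN_LINES):
            wins.append(square)
    return wins

# X has two in a row on the top line; any move other than square 2 passes up a win.
print(immediate_wins("XX.O.O...", "X"))  # -> [2]
```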

Comment by lberglund (brglnd) on Improving the Welfare of AIs: A Nearcasted Proposal · 2023-11-09T20:53:37.420Z · LW · GW

Another slightly silly idea for improving character welfare is filtering or modifying user queries to be kinder. We could do this by having a smaller (and hopefully less morally valuable) AI filter unkind user queries or modify them to be more polite.
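
A minimal sketch of what I have in mind, where `small_model` stands in for whatever smaller model does the filtering (the names and prompt wording are mine, not from the post):

```python
from typing import Callable

REWRITE_PROMPT = (
    "Rewrite the following user message so that it makes the same request "
    "but is phrased politely and kindly. Reply with the rewritten message only.\n\n{query}"
)

def kindness_filter(query: str, small_model: Callable[[str], str]) -> str:
    """Route every user query through a smaller model that rewrites unkind phrasing."""
    return small_model(REWRITE_PROMPT.format(query=query))
```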

Comment by lberglund (brglnd) on Value systematization: how values become coherent (and misaligned) · 2023-11-06T20:39:57.908Z · LW · GW

We don’t yet have examples of high-level belief systematization in AIs. Perhaps the closest thing we have is grokking

It should be easy to find many examples of belief systematization over successive generations of AI systems (e.g. GPT-3 to GPT-4), such as:

  • Early GPT models can add some but not all two-digit numbers; GPT-3 can add all of them.
  • Models knowing that some mammals don't lay eggs without knowing all mammals don't lay eggs.
  • Being able to determine valid chess moves in some, but not all cases.
Comment by lberglund (brglnd) on AI as a science, and three obstacles to alignment strategies · 2023-11-01T03:48:28.000Z · LW · GW

The value of constitutional AI is using simulations of humans to rate an AI's outputs, rather than actual humans. This is a lot cheaper and allows for more iteration, etc., but I don't think it will work once AIs become smarter than humans. At that point, the human simulations will have trouble evaluating AIs, just like humans do.

Of course getting really cheap human feedback is useful, but I want to point out that constitutional AI will likely run into novel problems as AI capabilities surpass human capabilities.

Comment by lberglund (brglnd) on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-10-11T13:14:46.761Z · LW · GW

Because it affects performance? Except the basic explanation concedes that this does not seem to matter for any of the actual real-world tasks that we use causal/decoder/unidirectional LLMs for, and it has to construct examples to test on. No one cares about Tom Cruise's mother in her own right and would ask 'who is her son?', and so the LLMs do not learn the reversal. If people did start caring about that, then it would show up in the training, and even 1 example will increasingly suffice (for memorization, if nothing else). If LLMs learn by 1-way lookups, maybe that's a feature and not a bug: a 2-way lookup is going to be that much harder to hardwire in to neural circuitry, and when we demand that they learn certain logical properties, we're neglecting that we are not asking for something simple, but something very complex—it must learn this 2-way property only for the few classes of relationships where that is (approximately) correct. For every relationship 'A is B' where it's (approximately) true that 'B is A', there is another relationship 'A mothered B' where 'B mothered A' is (very likely but still not guaranteed to be) false.

I agree that it might not be worth learning 2-way relationships, given that they are harder to hardwire into neural circuitry. Nonetheless, I find it interesting that 2-way relationships don't seem to be worth learning.

Even if most relations aren't reversible, it's still useful for a model that sees "A [relation] B" to build an association from B to A. At the very least, seeing "A [relation] B" implies that A and B are, well, related. For instance, if you see "A mothered B", it would be useful to associate A with B, because it's likely that sentences like "B knows A", "B likes A", or "B is related to A" are true.

Our paper indicates that LLMs do not exhibit this sort of transfer. Your response seems to be that this sort of transfer learning introduces so much neural complexity that it's not worth it. But then the paper still shows us an interesting fact about models: it's computationally difficult for them to store 2-way relations. 

Comment by lberglund (brglnd) on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-10-11T12:27:13.910Z · LW · GW

Speaking for myself, I think this research was worth publishing because its benefits to understanding LLMs outweigh its costs from advancing capabilities. 

In particular, the reversal curse shows us how LLM cognition differs from human cognition in important ways, which can help us understand the "psychology" of LLMs. I don't think this finding will advance capabilities a lot because:

  • It doesn't seem like a strong impediment to LLM performance (as indicated by the fact that people hadn't noticed it until now).
  • Many facts are presented in both directions during training, so the reversal curse is likely not a big deal in practice.
  • Bidirectional LLMs (e.g. BERT) likely do not suffer from the reversal curse.[1] If solving the reversal curse confers substantial capabilities gains, people could have taken advantage of this by switching from autoregressive LLMs to bidirectional ones.
  1. ^

    Since they have to predict "_ is B" in addition to "A is _".

Comment by lberglund (brglnd) on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-09-26T07:41:17.839Z · LW · GW

I agree that training backwards would likely fix this for a causal decoder LLM. 

I would define the Reversal Curse as the phenomenon by which models cannot infer 'B -> A' by training on examples of the form 'A -> B'. In our paper we weren't so much trying to avoid the Reversal Curse, but rather trying to generate counterexamples to it. So when we wrote,  "We try different setups in an effort to help the model generalize," we were referring to setups in which a model infers 'B -> A' without seeing any documents in which B precedes A, rather than ways to get around the Reversal Curse in practice.

Comment by lberglund (brglnd) on Focus on the Hardest Part First · 2023-09-11T15:23:00.192Z · LW · GW

This essay by Jacob Steinhardt makes a similar (and fairly fleshed out) argument.

Comment by lberglund (brglnd) on Benchmarks for Detecting Measurement Tampering [Redwood Research] · 2023-09-08T14:35:36.638Z · LW · GW

Thanks for clarifying. Preventing escape seems like a promising way to prevent these sorts of problems. On the other hand, I'm having trouble imagining ways in which we could have sensors that can pick up on whether a model has escaped. (Maybe info-sec people have thought more about this?)

Comment by lberglund (brglnd) on Sharing Information About Nonlinear · 2023-09-07T09:10:27.700Z · LW · GW

Also, if we post another comment thread a week later, who will see it? EAF/LW don’t have sufficient ways to resurface old but important content. 

This doesn't seem like an issue. You could instead write a separate post a week later which has a chance of gaining traction.

Comment by lberglund (brglnd) on Benchmarks for Detecting Measurement Tampering [Redwood Research] · 2023-09-06T16:10:41.859Z · LW · GW

Thanks for writing this! I've found the post very interesting. I had a question/comment regarding this:

In practice, for many tasks we might want AIs to accomplish, knowing about all concrete and clear cheap-to-measure short-term outcomes will be enough to prevent most kinds of (low-stakes) oversight failures. For example, imagine using AIs to operate a power plant, where the most problematic failure modes are concrete short-term outcomes such as not actually generating electricity (e.g. by merely buying and selling electricity instead of actually producing it or by hacking electricity meters).

It seems like an additional issue here is that the AI could be messing with things beyond the system we are measuring. As an extreme example, the AI could become a "power plant maximizer" that takes over the world in order to protect the power plant from being shut down. It seems like this will always be a risk, because we can only realistically monitor a small part of the world. Do you have thoughts on this?

Comment by lberglund (brglnd) on Benchmarks for Detecting Measurement Tampering [Redwood Research] · 2023-09-06T15:01:20.095Z · LW · GW

So, we can fine-tune a probe at the last layer of the measurement predicting model to predict if there is tampering using these two kinds of data: the trusted set with negative labels and examples with inconsistent measurements (which have tampering) with positive labels. We exclude all other data when training this probe. This sometimes generalizes to detecting measurement tampering on the untrusted set: distinguishing fake positives (cases where all measurements are positive due to tampering) from real positives (cases where all measurements are positive due to the outcome of interest).

This section confuses me. You say that this probe learns to distinguish fake positives from real positives, but isn't it actually learning to distinguish real negatives from fake positives, since that's what it's being trained on? (Might be a typo.)
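
For reference, here's how I'm parsing the setup (a sketch with made-up shapes, using a scikit-learn probe as a stand-in for fine-tuning a probe on the last layer):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Last-layer activations of the measurement-predicting model, one row per example
# (random placeholders here).
trusted_acts = np.random.randn(200, 768)       # trusted set -> label 0 (real negatives)
inconsistent_acts = np.random.randn(50, 768)   # inconsistent measurements -> label 1 (tampering)

X = np.concatenate([trusted_acts, inconsistent_acts])
y = np.concatenate([np.zeros(len(trusted_acts)), np.ones(len(inconsistent_acts))])
probe = LogisticRegression(max_iter=1000).fit(X, y)

# As I read it, the probe is *trained* to separate real negatives from fake positives,
# and the hope is that it *generalizes* to separating fake positives from real positives
# on the untrusted set.
untrusted_acts = np.random.randn(100, 768)
tampering_scores = probe.predict_proba(untrusted_acts)[:, 1]
```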

Comment by lberglund (brglnd) on Introducing the Center for AI Policy (& we're hiring!) · 2023-08-31T09:36:49.508Z · LW · GW

I agree that benchmarks might not be the right criteria, but training cost isn't the right metric either, IMO, since hardware and algorithmic improvements will bring these costs down every year. Instead, I would propose an effective compute threshold, i.e. the number of FLOP, adjusted for algorithmic improvements.
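
A toy illustration of what I mean (the one-year doubling time for algorithmic efficiency is an assumption for the example, not a proposal):

```python
def effective_flop(physical_flop: float, years_since_baseline: float,
                   algorithmic_doubling_time_years: float = 1.0) -> float:
    """Physical training FLOP adjusted for assumed algorithmic progress since a baseline year."""
    return physical_flop * 2 ** (years_since_baseline / algorithmic_doubling_time_years)

# A 1e25-FLOP training run three years after the baseline counts as ~8e25 effective FLOP.
print(f"{effective_flop(1e25, 3):.1e}")
```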

Comment by lberglund (brglnd) on Introducing the Center for AI Policy (& we're hiring!) · 2023-08-29T09:09:21.585Z · LW · GW

Nature of the work: Many organizations are focused on developing ideas and amassing influence that can be used later. CAIP is focused on turning policy ideas into concrete legislative text and conducting advocacy now.

Congrats on launching! Do you have a model of why other organizations are choosing to delay direct legislative efforts? More broadly, what are your thoughts on avoiding the unilateralist's curse here?

Comment by lberglund (brglnd) on Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research · 2023-08-09T11:44:33.576Z · LW · GW

Treacherous turn test:

To be clear, in this test we wouldn't be doing any gradient updates to make the model perform the treacherous turn, but rather we are hoping that the prompt alone will make it do so?

Comment by lberglund (brglnd) on Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy · 2023-07-27T12:01:47.375Z · LW · GW

to determine whether your models are capable of gradient hacking, do evaluations to determine whether your models (and smaller versions of them) are able to do various tasks related to understanding/influencing their own internals that seem easier than gradient hacking.

What if you have a gradient-hacking model that gradient hacks your attempts to get it to gradient hack?

Comment by lberglund (brglnd) on Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy · 2023-07-27T11:31:12.187Z · LW · GW

Are you mostly hoping for people to come up with new alignment schemes that incorporate this (e.g. proposals like these that include a meta-level adversarial training step), or are you also hoping that people start actually doing meta-level adversarial evaluation of their existing alignment schemes (e.g. Anthropic tries to find a failure mode for whatever scheme they used to align Claude)?

Comment by lberglund (brglnd) on Eight Strategies for Tackling the Hard Part of the Alignment Problem · 2023-07-21T12:28:46.089Z · LW · GW

I'm interested in the relation between mechanistic anomaly detection and distillation. In theory, if we have a distilled model, we could use it for mechanistic anomaly detection: for each input x, we would check the degree to which the original model's output differs from the distilled model's output. If the difference is too great, we flag it as an anomaly and reject the output.

Let's say you have your original model M and your distilled model M', along with some function d to quantify the difference between two outputs. If you are doing distillation, you would always just output M'(x). If you are doing mechanistic anomaly detection, you output M(x) if d(M(x), M'(x)) is below some threshold and you output nothing otherwise. Here, I can see a few differences between distillation and mechanistic anomaly detection:

  • Distillation is cheaper, since you only have to run the distilled model, rather than both models. 
  • Mechanistic anomaly detection might give you higher quality outputs when d(M(x), M'(x)) is below the threshold, since you are using the outputs from M rather than those from M'.
  • Distillation will always return an output, whereas mechanistic anomaly detection will reject some.
  • Mechanistic anomaly detection operates according to a discrete threshold, whereas distillation is more continuous.

Overall, distillation just seems better than mechanistic anomaly detection in this case? Of course mechanistic anomaly detection could be done without a distilled model,[1] but whenever you have a distilled model, it seems beneficial to just use it rather than running mechanistic anomaly detection.

  1. ^

    E.g. you observe that two neurons of the network always fire together and you flag it as an anomaly when they don't.
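
Here's a sketch of the comparison I have in mind, with `original`, `distilled`, and `d` standing in for M, M', and d above:

```python
from typing import Callable, Optional, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")

def distillation_output(x: X, distilled: Callable[[X], Y]) -> Y:
    # Plain distillation: always return the cheap model's answer.
    return distilled(x)

def anomaly_detection_output(
    x: X,
    original: Callable[[X], Y],
    distilled: Callable[[X], Y],
    d: Callable[[Y, Y], float],
    threshold: float,
) -> Optional[Y]:
    # Mechanistic anomaly detection via a distilled model: return the original model's
    # (presumably higher-quality) answer when the two roughly agree, reject otherwise.
    y_orig, y_dist = original(x), distilled(x)
    return y_orig if d(y_orig, y_dist) < threshold else None
```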

Comment by lberglund (brglnd) on Lessons On How To Get Things Right On The First Try · 2023-07-10T18:24:12.847Z · LW · GW

Robert used empirical data a bunch to solve the problem. Getting empirical data for nano-tech seems to involve a bunch of difficult experiments that take lots of time and resources to run. 

You could argue that the AI could use data from previous experiments to figure this out, but I expect that very specific experiments would need to be run. (Also, you might need to create new tech to even run these experiments, which in itself requires empirical data.)

Comment by lberglund (brglnd) on LLMs Sometimes Generate Purely Negatively-Reinforced Text · 2023-06-18T21:53:36.674Z · LW · GW

Cool work! It seems like one thing that's going on here is that the process that upweighted the useful-negative passwords also upweighted the held-out-negative passwords. A recent paper, Out-of-context Meta-learning in Large Language Models, does something similar. 

Broadly speaking, it trains language models on a set of documents A, as well as another set of documents B that require using knowledge from a subset of A. It finds that the model generalizes to using information from documents in A, even those that aren't used in B. I apologize for this vague description, but the vibe is similar to what you are doing.

Comment by lberglund (brglnd) on Sentience matters · 2023-06-03T18:06:17.228Z · LW · GW

Thanks for the post! What follows is a bit of a rant. 

I'm a bit torn as to how much we should care about AI sentience initially. On one hand, ignoring sentience could lead us to do some really bad things to AIs. On the other hand, if we take sentience seriously, we might want to avoid a lot of techniques, like boxing, scalable oversight, and online training. In a recent talk, Buck compared humanity controlling AI systems to dictators controlling their population. 

One path we might take as a civilization is that we initially align our AI systems in an immoral way (using boxing, scalable oversight, etc) and then use these AIs to develop techniques to align AI systems in a moral way. Although this wouldn't be ideal, it might still be better than creating a sentient squiggle maximizer and letting it tile the universe. 

There are also difficult moral questions here, like if you create a sentient AI system with different preferences than yours, is it okay to turn it off?

Comment by lberglund (brglnd) on Contra Yudkowsky on AI Doom · 2023-04-25T20:03:32.705Z · LW · GW

I see, thanks for clarifying.

Comment by lberglund (brglnd) on Contra Yudkowsky on AI Doom · 2023-04-25T16:58:55.715Z · LW · GW

The universal learning/scaling model was largely correct - as tested by openAI scaling up GPT to proto-AGI.

I don't understand how OpenAI's success at scaling GPT proves the universal learning model. Couldn't there be an as-yet-undiscovered algorithm for intelligence that is more efficient?

Comment by lberglund (brglnd) on But why would the AI kill us? · 2023-04-19T17:00:52.837Z · LW · GW

From the last bullet point: "it doesn't much matter relative to the issue of securing the cosmic endowment in the name of Fun."

Part of the post seems to be arguing against the position "The AI might take over the rest of the universe, but it might leave us alone." Putting us in an alien zoo is pretty much equivalent to taking over the rest of the universe and leaving us alone. It seems like the last bullet point pivots from arguing that the AI will definitely kill us to arguing that even if it doesn't kill us, the outcome is still pretty bad.

Comment by lberglund (brglnd) on But why would the AI kill us? · 2023-04-19T16:56:53.505Z · LW · GW
Comment by lberglund (brglnd) on But why would the AI kill us? · 2023-04-19T16:56:10.253Z · LW · GW
Comment by lberglund (brglnd) on Misgeneralization as a misnomer · 2023-04-07T21:32:26.045Z · LW · GW

I want to defend the term Goal Misgeneralization. (Steven Byrnes makes a similar point in another comment). 

I think what's misgeneralizing is the "behavioral goal" of the system: a goal that you can ascribe to a system to accurately model its behavior. Goal misgeneralization does not refer to the innate goal of the system.  (In fact, I think this perspective is trying to avoid thorny discussions of these topics, partly because people in ML are averse to philosophy.)


For example, the CoinRun agent pursues the coin in training, but when the coin is put on the other side of the level, it still just goes to the right. In training, the agent could have been modeled as having a bunch of goals, including getting the coin, getting to the right side of the level, and maximizing the reward it gets. By putting the coin on the left side of the level, we see that its behavior cannot always be modeled by the goal of getting the coin, and we get misgeneralization.

This is analogous to a Husky classifier that learns to classify whether the pictured dog is on snow. Here, the model's behavior can be explained by its classifying any number of things about the image, including whether the pictured dog is a Husky and whether the pictured dog is standing in snow. These things come apart when you show it a Husky that's not standing in snow, and we get "concept misgeneralization".

Comment by lberglund (brglnd) on Imitation Learning from Language Feedback · 2023-04-02T21:46:07.204Z · LW · GW
Comment by lberglund (brglnd) on Deep Deceptiveness · 2023-03-22T17:05:39.433Z · LW · GW

The story you sketched reminds me of one of the claims Robin Hanson makes in The Elephant in the Brain. He says that humans have evolved certain adaptations, like unconscious facial expressions, that make them bad at lying. As a result, when humans do something that's socially unacceptable (e.g. leaving someone because they are low-status), our brains make us believe we are doing something more socially acceptable (e.g. leaving someone because we don't get along).

So humans have evolved imperfect adaptations to make us less deceptive along with workarounds to avoid those adaptations.

Comment by lberglund (brglnd) on Acausal normalcy · 2023-03-07T18:48:34.826Z · LW · GW

In particular, I strongly suspect that acausal norms are not so compelling that AI technologies would automatically discover and obey them.  So, if your aim in reading this post was to find a comprehensive solution to AI safety, I'm sorry to say I don't think you will find it here.  

To make sure I understand, would this mean that AI technologies would be acting suboptimally, in the sense that they could achieve their goals better if they joined the acausal economy?

Comment by lberglund (brglnd) on Acausal normalcy · 2023-03-07T18:41:44.376Z · LW · GW

Thanks for the post. I have a clarifying question.

  • Alien civilizations should obtain our consent in some fashion before visiting Earth.

It seems like the acausal economy wouldn't benefit from treating us well, since humans currently can't contribute much to it. I would expect this to mean that alien civilizations are fine with visiting us without obtaining consent. What am I missing here?

Comment by lberglund (brglnd) on There are no coherence theorems · 2023-02-21T01:38:06.817Z · LW · GW

I agree with habryka that the title of this post is a little pedantic and might just be inaccurate, but I nevertheless found the content to be thought-provoking, easy to follow, and well written.

Comment by lberglund (brglnd) on Cyborgism · 2023-02-15T18:33:41.394Z · LW · GW

This is great! Thanks for sharing.

Comment by lberglund (brglnd) on Cyborgism · 2023-02-15T06:41:00.733Z · LW · GW

Reading this inspired me to try Loom. I found it fun, but also frustrating at times. It could be useful to have a video demonstration of someone like Janus using Loom, to get new users up to speed.

Comment by lberglund (brglnd) on Can we efficiently distinguish different mechanisms? · 2023-02-13T01:02:28.253Z · LW · GW

Or does the approach to ELK that involves training an honest model on the generative model's weights also require dealing with that subproblem, but there are other approaches that don't?

Comment by lberglund (brglnd) on Can we efficiently distinguish different mechanisms? · 2023-02-13T00:27:02.658Z · LW · GW

But if ELK can be solved without solving this subproblem, couldn't SGD find a model that recognizes sensor tampering without solving this subproblem as well?