Comments

Comment by Kei on Retirement Accounts and Short Timelines · 2024-02-22T05:30:32.203Z · LW · GW

Also, please correct me if I am wrong, but I believe you can withdraw from a retirement account at any time as long as you are okay with paying a 10% penalty on the withdrawal amount. If your employer gives a match of more than roughly 10%, this means you come out ahead even if you withdraw from the account right away.
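As a rough illustration (ignoring taxes, and assuming the 10% penalty applies to the full withdrawal, contribution plus match): with contribution $C$ and match rate $m$, withdrawing immediately comes out ahead when

$$0.9 \cdot C \cdot (1+m) > C \iff m > \tfrac{1}{9} \approx 11.1\%,$$

which is roughly consistent with the ~10% threshold above.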

Comment by Kei on Apologizing is a Core Rationalist Skill · 2024-01-02T23:29:37.371Z · LW · GW

It also helps to dedicate a complete sentence (or multiple sentences if the action you're apologizing for wasn't just a minor mistake) to your apology. When apologizing in person, you can also pause for a bit, giving your conversational partner the opportunity to respond if they want to.

When you immediately switch to the next topic, as in your example apology above, it looks like you're trying to distract from the fact that you were wrong, and it also makes it less likely that your conversational partner internalizes that you apologized.

Comment by Kei on Shane Legg interview on alignment · 2023-10-29T22:02:59.300Z · LW · GW

I think this is one reasonable interpretation of his comments. But the fact that he:

1. Didn't say very much about a solution to the problem of making models want to follow our ethical principles, and 
2. Mostly talked about model capabilities even when explicitly asked about that problem

makes me think it's not something he spends much time thinking about, and that he doesn't consider it especially important to focus on.

Comment by Kei on Shane Legg interview on alignment · 2023-10-29T20:30:20.464Z · LW · GW

I had a different interpretation of this interview. From what I can tell, Legg's view is that aligning language models is mostly a function of capability. As a result, his alignment techniques mostly focus on getting models to understand our ethical principles, and on getting models to use deliberation to check whether the actions they take follow those principles. Legg seems to treat the problem of getting models to actually want to follow our ethical principles as secondary.

Dwarkesh pushed him multiple times on how we can get models to want to follow our ethical principles. Legg's responses still mostly focused on model capabilities. As far as I can tell, the closest he came to answering the question was saying that you have to "specify to the system: these are the ethical principles you should follow", and that you have to check the reasoning process the model uses to make decisions.

Comment by Kei on I don’t find the lie detection results that surprising (by an author of the paper) · 2023-10-05T01:29:13.825Z · LW · GW

It's possible I'm using motivated reasoning, but on the listed ambiguous questions in section C.3, the answers the honest model gives tend to seem right to me. As in, if I were forced to answer yes or no to those questions, I would give the same answer as the honest model the majority of the time.

So if, as stated in section 5.5, the lie detector detects not only whether the model has lied but also whether it will lie in the future, and if the various model variants have intuitions similar to mine, then the honest model is giving its best guess of the correct answer, and the lying model is giving its best guess of the wrong answer.

I'd be curious if this is more generally true - if humans tend to give similar responses to the honest model for ambiguous questions.

Comment by Kei on When do "brains beat brawn" in Chess? An experiment · 2023-06-29T00:38:37.838Z · LW · GW

While I think your overall point is very reasonable, I don't think your experiments provide much evidence for it. Stockfish is generally trained to play the best move under the assumption that its opponent also plays the best moves. This is a good strategy when both sides start with the same number of pieces, but it falls apart in odds games.

Generally, the strategy for beating a weaker opponent in odds games is to conserve material, complicate the position, and play for tricks - going for moves that may not be objectively best but that end up winning material against a less perceptive opponent. While Stockfish is not great at this, top human chess players can be very good at it. For example, the top grandmaster Hikaru Nakamura did a "Botez Gambit Speedrun" (https://www.youtube.com/playlist?list=PL4KCWZ5Ti2H7HT0p1hXlnr9OPxi1FjyC0), where he sacrificed his queen every game and was still able to reach 2500 on chess.com, the level of many chess masters.

This isn't quite the same as your queen-odds setup (it is easier), and the short time format he played at is a factor, but I assume he would be able to beat most sub-1500 FIDE players with queen odds. A version of Stockfish trained to exploit a human's subpar play would presumably do even better.

Comment by Kei on You can use GPT-4 to create prompt injections against GPT-4 · 2023-04-11T03:49:27.468Z · LW · GW

I wonder if this is due to a second model that checks whether the output of the main model breaks any rules. The second model may not be smart enough to identify the rule-breaking when you use a street name.

Comment by Kei on GPT-4 · 2023-03-15T00:45:22.006Z · LW · GW

I don't know how they did it, but I played a chess game against GPT-4 by saying the following:

"I'm going to play a chess game. I'll play white, and you play black. On each chat, I'll post a move for white, and you follow with the best move for black. Does that make sense?"

And then going through the moves one by one in algebraic notation.

My experience largely matches GoteNoSente's. I played one full game that lasted 41 moves, and all of GPT-4's moves were reasonable. It did make one invalid move when I forgot to include the move number before my move (e.g. Ne4 instead of 12. Ne4), but it fixed this once I included the number. Also, I think it was better in the opening than in the endgame. I suspect this is because of the large number of similar openings in its training data.

Comment by Kei on How bad a future do ML researchers expect? · 2023-03-11T15:47:01.580Z · LW · GW

Thanks for the link! I ended up looking through the data, and there wasn't any clear correlation between the amount of time spent in a research area and p(Doom).

I ran a few averages by both time spent in research area and region of undergraduate study here: https://docs.google.com/spreadsheets/d/1Kp0cWKJt7tmRtlXbPdpirQRwILO29xqAVcpmy30C9HQ/edit#gid=583622504
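I did this in a spreadsheet, but for anyone who would rather slice the data programmatically, here is a minimal pandas sketch of the same kind of breakdown (the file name and column names are placeholders, not the survey's actual headers):

```python
import pandas as pd

# Load the survey export (file name is a placeholder).
df = pd.read_csv("ai_impacts_survey.csv")

# Placeholder column names:
#   p_doom           - respondent's p(Doom) conditional on HLMI
#   years_in_area    - years spent in their research area
#   undergrad_region - region of undergraduate study

# Average p(Doom) by (bucketed) time spent in research area.
df["years_bucket"] = pd.cut(df["years_in_area"], bins=[0, 5, 10, 20, 50])
print(df.groupby("years_bucket")["p_doom"].mean())

# Average p(Doom) by region of undergraduate study.
print(df.groupby("undergrad_region")["p_doom"].mean())
```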

For the most part, the groups don't differ very much, although, as might be expected, more North Americans have a high p(Doom) conditional on HLMI than respondents from other regions.

Comment by Kei on Sydney can play chess and kind of keep track of the board state · 2023-03-04T07:19:38.026Z · LW · GW

I asked Sydney to reconstruct the board position on the 50th move of two different games, and saw what Simon predicted - a significant drop in performance. Here's a link to two games I tried using your prompt: https://imgur.com/a/ch9U6oZ

While there is some overlap, what Sydney thinks the games look like doesn't have much resemblance to the actual games.

I also repeatedly asked Sydney to continue the games using Stockfish (with a few slightly different prompts), but for some reason once the game description is long enough, Sydney refuses to do anything. It either says it can't access Stockfish, or that using Stockfish would be cheating.

Comment by Kei on There are (probably) no superhuman Go AIs: strong human players beat the strongest AIs · 2023-02-20T01:22:40.132Z · LW · GW

A question from a complete novice to Go: did Kellin Pelrine beat a nerfed version of KataGo? At the top of the article you mention KataGo did 10 million visits per move, while the FAR article says Pelrine beat a version of KataGo that did 100K visits per move.

Comment by Kei on Thoughts on the impact of RLHF research · 2023-01-28T14:39:26.966Z · LW · GW

I feel like the implicit model of the world you are using here is going to have effect sizes adding up to much more than the actual variance at stake.

That's not always the wrong thing to do - the counterfactual impacts of many actors' actions often sum to more than their total combined impact. A simple example: two co-founders of an impactful company, neither of whom would have founded it without the other. Then the sum of their counterfactual impacts is twice the total impact of the company.

While I don't have an opinion on this particular case, you could imagine that additional AI investment may not have happened if either of the following were true:

1. The original RLHF proof of concept from OpenAI hadn't happened - because then Google's leadership wouldn't have had the incentive for further investment.

2. Google's leadership had been different - because then they may not have thought to invest more money in AI.

Comment by Kei on Comment on "Propositions Concerning Digital Minds and Society" · 2022-07-11T18:25:26.585Z · LW · GW

My intuition is it should be small in most cases, but there are some scenarios where this could be important.

Let's imagine we are training a reinforcement learning agent AGI that discounts rewards in time by some parameter d with 0 < d < 1 (so an expected reward r received n time steps from now is worth r*d^n at the current time step). Let's further assume the wireheading problem has been solved (the AI can't change the reward-calculating process and give itself, say, infinite reward), and that there is a maximum possible reward of M per time step. This maximum reward could be coded explicitly into the reward function, or could be implicitly calculated from physical limits.

If deployment is at least b time steps away, the maximum possible discounted deployment reward, as calculated at the current time step, is M*d^b/(1-d). If there is any deception the AGI could do that would sacrifice more than that amount of short-term reward, it will not do that deception. There seems to be a bit of a tradeoff: the bigger the multiplier between the deployment reward and the training reward, the larger M becomes, and the smaller d needs to be in order to encourage the AGI to engage in non-deceptive behavior.
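As a toy numerical illustration of how quickly this bound shrinks as b grows (M, d, and the values of b below are made up, not taken from any real training setup):

```python
# Toy illustration of the bound M * d**b / (1 - d) from the argument above.

M = 100.0  # maximum possible reward per time step in deployment (made up)
d = 0.99   # discount factor (made up)

def max_discounted_deployment_reward(M, d, b):
    """Upper bound on the discounted deployment reward, as seen from the
    current time step, if deployment starts at least b steps from now."""
    return M * d**b / (1 - d)

for b in [10, 100, 1000, 5000]:
    print(f"b = {b:5d}: bound = {max_discounted_deployment_reward(M, d, b):12.2f}")

# If deceiving us now would cost the AGI more short-term reward than this
# bound, deception is not worth it under this (very simplified) model.
```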

If we want d to be larger, which would let an aligned AI make useful long-term plans, this argument says we should use a smaller multiplier between the deployment reward and the training reward, so that an unaligned AI is incentivized not to deceive us. Alternatively, it says you should wait a long time between testing whether your model is deceptive and deploying it: running a lot of time steps with 0 reward between the deception test and deployment increases b and thus shrinks M*d^b/(1-d), which should make deception much less attractive.

Comment by Kei on Comment on "Propositions Concerning Digital Minds and Society" · 2022-07-11T03:34:40.062Z · LW · GW

This seems like it would raise the incentive for an AGI to be deceptive in its training environment. An unaligned AGI has a choice: act to maximize its goals in training and get a higher short-term reward, or deceptively pretend to be aligned in training and accept a lower short-term reward. The benefit to the AGI of pretending to be aligned is that it increases the probability of being deployed, and thus of getting a higher long-term reward in deployment.

Thus, the bigger the discrepancy between the reward an AGI would get in the deployment environment and in the training environment, the more important the long-term deployment reward becomes, and the higher the incentive for the AGI to be deceptive during training.