Posts

Control Symmetry: why we might want to start investigating asymmetric alignment interventions 2023-11-11T17:27:10.636Z

Comments

Comment by domenicrosati on A model of research skill · 2024-01-08T14:14:37.862Z · LW · GW

I think people underestimate formal study of research methods, like reading texts or taking a course on research methodology, as a way of improving research abilities.

There are many concepts within control, experimental design, validity, and reliability, like construct validity or conclusion validity, that you would learn from a research methods textbook and that are super helpful for improving the quality of research. I think many researchers implicitly learn these things without ever knowing exactly what they are, but that is usually through trial and error (peer review rejections and embarrassment), which can be avoided by looking into research methods texts.

Of course this doesn't help with the discovery aspect of research, which I think your article is good at outlining, but at some point research questions need to be investigated, and understanding research design makes it really obvious what kinds of work you need to do in order to have a high-quality investigation.

Comment by domenicrosati on Public Call for Interest in Mathematical Alignment · 2023-11-30T22:16:12.473Z · LW · GW

Thanks for the pointer! Yes, RL has a lot of research of this kind; as an empirical researcher I just get stuck in translation sometimes.

Comment by domenicrosati on Public Call for Interest in Mathematical Alignment · 2023-11-23T20:39:56.758Z · LW · GW

For my own clarity: What is the difference between mathematical approaches to alignment and other technical approaches like mechanistic interpretability work?

I imagine the focus is on in-principle arguments or proofs regarding the capabilities of a given system rather than empirical or behavioural analysis, but you mention RL, so I just wanted to get some colour on this.

Any clarification here would be helpful!

Comment by domenicrosati on Speed running everyone through the bad alignment bingo. $5k bounty for a LW conversational agent · 2023-03-10T11:39:39.476Z · LW · GW

If someone did this, it would be nice to collect preference data over answers that are helpful to alignment versus not helpful to alignment… that could be a dataset that is interesting for a variety of reasons, like analyzing current models' abilities to help with alignment, gaps in being helpful w.r.t. alignment, and of course providing a mechanism for making models better at alignment… a model like this could also maybe work as a specialized type of Constitutional AI, collecting feedback from the model's preferences, preferences that are more "alignment-aware" so to speak… none of this of course is a solution to alignment, as the OP points out, but interesting nonetheless.
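As a rough illustration of the kind of record such a preference dataset might contain, here is a minimal sketch in Python; the class and field names are hypothetical, not an existing schema:

```python
from dataclasses import dataclass

@dataclass
class AlignmentPreferencePair:
    """One comparison: two candidate answers to the same alignment question,
    labelled by which one a reviewer judged more helpful for alignment work."""
    prompt: str               # the alignment-related question posed to the model
    chosen: str               # answer judged more helpful w.r.t. alignment
    rejected: str             # answer judged less helpful (or actively misleading)
    annotator_note: str = ""  # optional rationale, useful for later analysis

# Illustrative record only; the contents are made up.
pair = AlignmentPreferencePair(
    prompt="What are the main open problems in scalable oversight?",
    chosen="A sourced summary that distinguishes oversight from capability evaluation...",
    rejected="A vague answer that conflates the two...",
    annotator_note="Chosen answer makes the oversight/evaluation distinction explicit.",
)
```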

I’d be interested in participating in this project if other folks set something up…

Comment by domenicrosati on Cognitive Emulation: A Naive AI Safety Proposal · 2023-02-27T02:59:47.342Z · LW · GW

I'm struggling to understand how this is different from "we will build aligned AI to align AI". Specifically: Can someone explain to me how human-like AI and AGI are different? Can someone explain to me why human-like AI avoids typical x-risk scenarios (given those human-likes could, say, clone themselves, speed themselves up, and rewrite their own software, and so easily become unbounded)? Why isn't an emulated cognitive system a real cognitive system? I don't understand how you can emulate a human-like intelligence and have it not be the same as fully human-like.
 

Currently my reading of this is: we will build human-like AI because humans are bounded, so it will be too, and those bounds are (1) sufficient to prevent x-risk and (2) helpful for (and maybe even the reason for) alignment. Isn't a big, wide-open, unsolved part of the alignment problem "how do we keep intelligent systems bounded"? What am I missing here?

I guess one maybe supplementary question as well is: how is this different from normal NLP capabilities research, which is fundamentally about developing and understanding the limitations of human-like intelligence? Most folks in the field, say those who publish at ACL conferences, would explicitly think of this as what they are doing and not as trying to build anything more capable than humans.

Comment by domenicrosati on Simulators · 2022-11-13T16:07:08.939Z · LW · GW

What are your thoughts on prompt tuning as a mechanism for discovering optimal simulation strategies?

I know you mention condition generation as something to touch on in future posts, but I'd be eager to hear where you think prompt tuning comes in, considering continuous prompts are differentiable and so can be learned/optimized for specific simulation behaviour.
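To make the question concrete, here is a minimal sketch of the mechanism I have in mind: a small matrix of continuous "soft prompt" embeddings is prepended to the input and optimized by gradient descent while the model itself stays frozen. The tiny PyTorch model below is just a stand-in for a pretrained simulator, and the training objective is a toy one:

```python
import torch
import torch.nn as nn

# Stand-in for a frozen pretrained LM: embeddings -> one transformer layer -> vocab logits.
vocab_size, d_model, prompt_len, seq_len = 100, 32, 5, 10
embed = nn.Embedding(vocab_size, d_model)
body = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
head = nn.Linear(d_model, vocab_size)
for module in (embed, body, head):
    for p in module.parameters():
        p.requires_grad_(False)  # the "simulator" itself is never updated

# The only trainable parameters: a continuous prompt prepended to every input.
soft_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-2)

tokens = torch.randint(0, vocab_size, (1, seq_len))   # toy input sequence
targets = torch.randint(0, vocab_size, (1, seq_len))  # toy "desired simulation behaviour"

for step in range(100):
    inputs = torch.cat([soft_prompt.unsqueeze(0), embed(tokens)], dim=1)
    logits = head(body(inputs))[:, prompt_len:]        # score only the real token positions
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab_size), targets.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()   # gradients flow only into the soft prompt
    optimizer.step()
```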

Comment by domenicrosati on Elicit: Language Models as Research Assistants · 2022-04-11T12:42:52.383Z · LW · GW

Hey there,

I was just wondering how you deal with hallucination and faithfulness issues of large language models from a technical perspective? The user experience perspective seems clear - you can give users control and consent over what Elicit is suggesting and so on.

However, we know LLMs are prone to issues of faithfulness and factuality (Pagnoni et al. 2021 as one example for abstractive summarization), and this seems like it would be a big issue for research, where factual correctness is very important. In a biomedical scenario, if a user of Elicit gets an output that presents a wrongly extracted figure (say, pulled from a preceding sentence, or hallucinated as the highest log-likelihood token based on previous documents), this could potentially have very dangerous consequences. I'd love to know more about how you'd address that.

My current thinking on the matter is that in order to address these safety issues in NLP for science, we may need to provide models that "self-criticize" their outputs, so to speak, i.e. provide counterfactual outputs that could be checked, or something like this. Especially since GopherCite (Menick et al. 2022) and some of the similar self-supporting models seem to show that self-support is also prone to issues and doesn't totally address factuality (in their case as measured on TruthfulQA), not to mention self-explaining approaches, which I believe suffer from the same issues (i.e. hallucinating an incorrect explanation).
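To gesture at what "self-criticize" could look like mechanically, here is a rough sketch; `generate` and `critique` are placeholders for whatever LLM backend and verification step are actually used, not a claim about Elicit's implementation:

```python
def generate(prompt: str) -> str:
    """Placeholder for a call to the underlying LLM (e.g. the extraction step)."""
    raise NotImplementedError

def critique(claim: str, source_text: str) -> bool:
    """Placeholder for a second pass that checks whether `claim` is actually
    supported by `source_text` (an entailment check, quote verification, or
    another model prompted as a verifier)."""
    raise NotImplementedError

def extract_with_self_criticism(question: str, source_text: str, max_tries: int = 3):
    """Generate an answer, then only return it as 'supported' if the critique
    step can back it up against the source; otherwise flag it for human review."""
    answer = ""
    for _ in range(max_tries):
        answer = generate(f"{question}\n\nSource:\n{source_text}")
        if critique(answer, source_text):
            return answer, "supported"
    return answer, "unsupported: needs human review"
```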

 

Menick, J., Trebacz, M., Mikulik, V., Aslanides, J., Song, F., Chadwick, M., ... & McAleese, N. (2022). Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147.


Pagnoni, A., Balachandran, V., & Tsvetkov, Y. (2021). Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. arXiv preprint arXiv:2104.13346.