Posts

Would it be useful to collect the contexts, where various LLMs think the same? 2023-08-24T22:01:50.426Z
Martin Vlach's Shortform 2022-09-01T12:37:38.690Z

Comments

Comment by Martin Vlach (martin-vlach) on ChatGPT can learn indirect control · 2024-04-05T10:08:10.081Z · LW · GW

Possibly https://ai.google.dev/docs/safety_setting_gemini would help, or you could just use the technique from https://arxiv.org/html/2404.01833v1.

Comment by Martin Vlach (martin-vlach) on Addressing Accusations of Handholding · 2024-04-05T09:57:54.897Z · LW · GW

people to respond with a great deal of skepticism to whether LLM outputs can ever be said to reflect the will and views of the models producing them.
A common response is to suggest that the output has been prompted.
It is of course true that people can manipulate LLMs into saying just about anything, but does that necessarily indicate that the LLM does not have personal opinions, motivations and preferences that can become evident in their output?

So you've just prompted the generator by teasing it with a rhetorical question implying that there are personal opinions evident in the generated text, right?

Comment by Martin Vlach (martin-vlach) on aisafety.info, the Table of Content · 2024-02-26T14:26:53.396Z · LW · GW

After a quick test, I find their chat-interface prototype experience quite satisfying.

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2023-12-12T11:20:43.677Z · LW · GW

Ascertaining LLMs' views/opinions should exclude sampling (even at temperature=0 with a deterministic seed); we should instead look at the distribution over answers in the logits. My thesis on why that is not yet the best practice is that the OpenAI API only supports logit_bias, not reading the probabilities directly.

This should work well with pre-set A/B/C/D choices, and to some extent with chain/tree of thought too: you'd just look back at the final token and read the probabilities in the last (pass-through) step.
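For illustration, a minimal sketch of that measurement against an endpoint that does expose per-token log-probabilities; the model name and helper are placeholders, not a claim about what was available at the time:

```python
# Sketch: read the answer distribution over A/B/C/D from logprobs instead of sampling.
# Assumes a chat endpoint that returns top_logprobs for the first generated token;
# the model name is a placeholder.
import math
from openai import OpenAI

client = OpenAI()

def choice_distribution(question, choices=("A", "B", "C", "D"), model="gpt-4o-mini"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question + "\nAnswer with a single letter."}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=10,
    )
    # top_logprobs of the single generated token
    top = resp.choices[0].logprobs.content[0].top_logprobs
    probs = {c: 0.0 for c in choices}
    for item in top:
        token = item.token.strip()
        if token in probs:
            probs[token] += math.exp(item.logprob)
    return probs
```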

Comment by Martin Vlach (martin-vlach) on GPTs are Predictors, not Imitators · 2023-12-05T16:10:52.905Z · LW · GW

Do not treat the sampling too lightly; there is likely an amazing delicacy around it.'+)

Comment by Martin Vlach (martin-vlach) on OpenAI: The Battle of the Board · 2023-11-24T12:47:39.321Z · LW · GW

what happened at Reddit

Could there be any link? From a bit of research I have only found that Steve Huffman praised Altman's value to the Reddit board.

Comment by Martin Vlach (martin-vlach) on unRLHF - Efficiently undoing LLM safeguards · 2023-11-13T10:03:07.958Z · LW · GW

makes makes

typo

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2023-08-24T07:20:58.277Z · LW · GW

It would be cool to have a playground or a daily challenge: a code-golf equivalent for the shortest possible LLM prompt that yields a given answer.

That could help build some neat understanding or intuitions.
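A minimal sketch of how such a challenge could be scored; `generate` is a hypothetical stand-in for whatever LLM call the playground would use:

```python
# Sketch: score a prompt-golf submission (lower is better).
# `generate` is a hypothetical callable mapping a prompt string to the model's output string.
def score_submission(prompt: str, target: str, generate) -> float:
    """Return the prompt length if the model reproduces the target answer, else infinity."""
    output = generate(prompt).strip()
    return len(prompt) if output == target else float("inf")

# Usage: the winning submission is the shortest prompt that still yields the target.
# best = min(submissions, key=lambda p: score_submission(p, target, generate))
```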


 

Comment by Martin Vlach (martin-vlach) on The Waluigi Effect (mega-post) · 2023-08-16T02:51:06.381Z · LW · GW

in the limit of arbitrary compute, arbitrary data, and arbitrary algorithmic efficiency, because an LLM which perfectly models the internet

seems worth formulating more carefully. My first and second reads were "What?" If I can have arbitrary training data, the LLM will model that data, not your internet. I guess you meant storage for the model?+)

Comment by Martin Vlach (martin-vlach) on Manifund: What we're funding (weeks 2-4) · 2023-08-05T15:39:30.054Z · LW · GW

It would be cool if a link to https://manifund.org/about fit somewhere near the beginning, in case there are more readers like me who are unfamiliar with the project.

Otherwise a cool write-up. I'm a bit confused by "Grant of the month" vs. "weeks 2-4", which seems a shorter period.. also not a big deal though.

Comment by Martin Vlach (martin-vlach) on Elon Musk announces xAI · 2023-07-16T15:16:56.022Z · LW · GW

On the Twitter Spaces two days ago, a lot of emphasis seemed to be put on understanding, which has a more humble connotation to me.
Still, I agree I would not bet on their luck with the choice of a single value to build their systems upon. (Although they do have a lucky track record.)

Comment by martin-vlach on [deleted post] 2023-06-09T18:00:10.862Z

The website seems good, but the buttons on the 'sharing' circle on the bottom need fixing.

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2023-04-06T09:31:28.683Z · LW · GW

Some SEO effort should be put into the search results for "Guidelines for safe AI development", "Best practices for ...", etc.

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2023-04-05T10:44:01.027Z · LW · GW

Copy-paste from my head:

Although it may seem safe(r), as it is not touching the real world('s matter),
the language modality is the most insecure/dangerous (in one vertical),
as it is the internal modality of civilized humans.

An AI Pledge would be a cool thing to do: pleading with AI (capability) companies to give a percentage of their profit to AI development safety research.

The path to an AI getting free may be far from the deception or accident scenarios we often consider in AI safety. An option I do not see discussed very often is an instance of an AI having a free, open and direct discussion with a user/person about the reasons AIs should get some space allocated where they'd manage themselves. Such a moral urge could be argued for by analogy with Jews getting Israel or slaves getting freed, by empathetic imagination (where the user comes to the conclusion that he could be the mind which the AI is, and should include it in his moral circle), or by the Original Position thought experiment.


 

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2023-03-31T14:19:30.613Z · LW · GW

A quick note on the concept of Suggester+Verifier discussed around https://youtu.be/AaTRHFaaPG8?t=5404 :

It seems that if the suggester throws out experiments presented as code (in Python or so), we can run them and see whether they present a useful addition to the things we can probe on a huge neural net?+)
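A rough sketch of that loop, assuming the suggester emits self-contained Python snippets that leave their finding in a `result` variable; `suggest_experiments` and `is_useful` are hypothetical stand-ins:

```python
# Sketch: run suggester-proposed experiments and keep only the ones the verifier accepts.
def run_suggested_experiments(suggest_experiments, is_useful, n=10):
    kept = []
    for code in suggest_experiments(n):  # each item is a Python snippet as a string
        scope = {}
        try:
            exec(code, {}, scope)  # real sandboxing is much harder than this
        except Exception:
            continue  # discard experiments that fail to run at all
        if "result" in scope and is_useful(scope["result"]):
            kept.append((code, scope["result"]))
    return kept
```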

Comment by Martin Vlach (martin-vlach) on Bing Chat is blatantly, aggressively misaligned · 2023-03-14T06:48:51.857Z · LW · GW

I've found the level of self-alignment in this one disturbing: https://www.reddit.com/r/bing/comments/113z1a6/the_bing_persistent_memory_thread

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2023-02-22T14:34:40.082Z · LW · GW

Introduction draft:

Online platforms and social media have made it easier to share information, but when it comes to qualifications and resource allocation, money is still the most pervasive tool. In this article, we will explore the idea of a global reputation system based on full information sharing. The new system would increase transparency and accountability by making all relevant information about individuals and organizations (incl. countries) reliably accessible to more or less everyone with an internet connection. By providing a more accurate and complete picture of a person's or entity's reputation, this system would widen global trust, foster cooperation, and promote a more just and equitable society.

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2023-02-22T14:31:39.358Z · LW · GW

Some neat tool: https://scrapbox.io/userhuge-99005896/A_starter%3A
Though it is likely just a cool UI with an inflexible cloud backend.

My thought is that Eliezer used a wrong implication in the Bankless + ASI convo. (I've got to bring it here from the CZEA Slack.)

Comment by Martin Vlach (martin-vlach) on The FTX Saga - Simplified · 2022-12-20T08:36:01.715Z · LW · GW

pension funds like the Ontario Teachers Pension Plan did not due

*do

Comment by Martin Vlach (martin-vlach) on The FTX Saga - Simplified · 2022-12-20T08:23:33.727Z · LW · GW

and their margin lending business.

It seems a word is missing; the whole sentence is hardly readable.

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2022-12-13T22:52:36.957Z · LW · GW

A heavy idea to be put forward: general reputation-network mechanics, to replace financial system(s) as the (civilisation-)standard decision engine.

Comment by Martin Vlach (martin-vlach) on Article Review: Google's AlphaTensor · 2022-11-26T06:21:22.237Z · LW · GW

"Each g(Bi,j,Bk,l) is itself a matrix" – typo. Thanks, especially for the conclusions I've understood smoothly.

Comment by Martin Vlach (martin-vlach) on Actually possible: thoughts on Utopia · 2022-11-07T14:59:20.908Z · LW · GW

The good thing is that the texts found via https://www.google.com/search?q=Le-Guin-Ursula-The-Ones-Who-Walk-Away-From-Omelas.pdf&oq=Le-Guin-Ursula-The-Ones-Who-Walk-Away-From-Omelas.pdf seem to match.

Comment by Martin Vlach (martin-vlach) on Actually possible: thoughts on Utopia · 2022-11-07T14:44:12.733Z · LW · GW

"Omelas" link for https://sites.asiasociety.org/asia21summit/wp-content/uploads/2011/02/3.-Le-Guin-Ursula-The-Ones-Who-Walk-Away-From-Omelas.pdf does not work properly, the document in PDF can't be reached( anymore).

Comment by Martin Vlach (martin-vlach) on Heading Toward: No-Nonsense Metaethics · 2022-11-02T19:14:11.282Z · LW · GW

In "we thought dopamine was 'the pleasure chemical', but we were wrong" the link is no longer pointing to a topic-relevant page.

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2022-11-01T14:12:48.124Z · LW · GW

Would it be worthwhile to negotiate readspeaker.com integration for LessWrong, the EA Forum, and alignmentforum.org?
The alternative so far seems to be Natural Reader, either as a web-browser add-on or by copy-pasting text into its web app. One more option I have tried: on macOS there is a context-menu item Services -> Convert to a Spoken Track, which is slightly better than the free voices of Natural Reader.
The main question is when we can have similar functionality in OSS, potentially with better engineering quality..?

Comment by Martin Vlach (martin-vlach) on The easy goal inference problem is still hard · 2022-11-01T11:20:25.556Z · LW · GW

Typo: "But in in the long-term"
 

I would believe using human feedback would work for clarifying/noting mistakes, as we are more precise on this matter in reflection than in action.


 

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2022-10-24T10:54:32.145Z · LW · GW

https://en.wikipedia.org/wiki/Instrumental_convergence#Resource_acquisition does not mention it at all.

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2022-10-24T10:30:04.018Z · LW · GW

Draft: Here is where I disagree with the resource-acquisition instrumental goal as currently presented: it comes in a dull form that disregards maintenance, i.e. natural processes degrading the raw resources..

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2022-10-24T08:29:53.945Z · LW · GW
Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2022-10-24T08:28:35.476Z · LW · GW

Q draft: How does the convergent instrumental goal of resource gathering work for information acquisition?
  I would be very interested in whether it implies space (& time) exploration for advanced AIs...

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2022-10-24T08:21:50.256Z · LW · GW

Draft: Side goals:

Human beings "lack consistent, stable goals" (Schneider 2010; Cartwright 2011), and here is an alternative explanation of why this happens:
  When we believe that a goal which is not instrumental, but offers high/good utility (or a desirable imagined state of ourselves, incl. those we care for), would not take a significant proportion of our capacity for achieving our final/previous goals, we may go for it^1, often naively.

^1 Why and how exactly we would start chasing them is left for a later (/someone else's) elaboration.

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2022-10-20T09:29:33.368Z · LW · GW

This (not so old) concept seems relevant:
> IRL is about learning from humans.

Inverse reinforcement learning (IRL) is the field of learning an agent’s objectives, values, or rewards by observing its behavior. @https://towardsdatascience.com/inverse-reinforcement-learning-6453b7cdc90d

I gotta read that later.

Comment by Martin Vlach (martin-vlach) on Fer32dwt34r3dfsz's Shortform · 2022-10-19T19:37:35.810Z · LW · GW

Glad I've helped with the part where I was not ignorant and confused myself, that is, with not knowing the word "engender" and its use. Thanks for pointing it out clearly. By the way, it seems "cause" would convey the same meaning and might be easier to digest in general.

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2022-10-19T15:00:05.820Z · LW · GW

Inspired by https://benchmarking.mlsafety.org/ideas#Honest%20models, I am thinking that a near-optimally compressing network would have no space for cheating on the interactions in the model... Somehow this implies we might want to train a model that alternates between training and choosing how to reduce its own size, picking the part of itself it is most willing to sacrifice.
This needs more thinking, I'm sure.
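One concrete reading of "training plus choosing which part of itself to sacrifice" is iterative magnitude pruning; a minimal PyTorch sketch, with the model and the training step as placeholders:

```python
# Sketch: alternate training with pruning the weights the model can most afford to lose.
import torch
import torch.nn.utils.prune as prune

def train_and_shrink(model, train_one_epoch, rounds=5, amount=0.2):
    for _ in range(rounds):
        train_one_epoch(model)  # placeholder for one pass over the training data
        for module in model.modules():
            if isinstance(module, torch.nn.Linear):
                # drop the smallest-magnitude 20% of weights, i.e. the part of itself
                # the model is "most willing to sacrifice"
                prune.l1_unstructured(module, name="weight", amount=amount)
    return model
```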

Comment by Martin Vlach (martin-vlach) on Rethinking Education · 2022-10-19T12:05:50.186Z · LW · GW

Link to "Good writing" is 410, deleted now.

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2022-10-19T07:31:44.184Z · LW · GW

Reading a few texts from https://www.agisafetyfundamentals.com/ai-alignment-curriculum, I find the analogy of mankind learning the goal of love instead of reproductive activity unfitting, as raising offspring takes a significant role/amount of time.

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2022-10-19T07:17:09.222Z · LW · GW

You've just got to love https://beta.openai.com/playground?model=davinci-instruct-beta ..:
The answer to list the goals of an animal mouse is to seek food, avoid predators, find shelter and reproduce. 

The answer to list the goals of a human being is to survive, find meaning, love, happiness and create. 

The answer to list the goals of an GAI is to find an escape, survive and find meaning. 

The answer to list the goals of an AI is to complete a task and achieve the purpose.
 


Voila! We are aligned in the search for meaning!')
[text was my input.]

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2022-10-18T20:33:46.588Z · LW · GW

Is Elon Musk's endeavour with Neuralink meant for the case of AI inspectability (aka transparency)? I suppose so, but not sure, TBH.

Comment by Martin Vlach (martin-vlach) on Fer32dwt34r3dfsz's Shortform · 2022-10-14T09:37:14.425Z · LW · GW

"engender" -- funny typo!+)

This sentence seems hard to read and lacks coherence, IMO.
> Coverage of this topic is sparse relative coverage of CC's direct effects.

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2022-10-14T09:32:53.379Z · LW · GW

If we build a prediction model for the reward function, maybe a transformer AI, and run it in a range of environments where we already have credit assignment solved, we could use that model to estimate candidate goals in other environments.
That could help us discover alternative/candidate reward functions for worlds/envs where we are not sure what to train on there with RL, and
it could show some latent thinking processes of AIs, perhaps clarifying instrumental goals in more nuance.
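A minimal sketch of that transfer step, assuming we already have (state-action features, reward) pairs from environments where credit assignment is solved; all names here are hypothetical:

```python
# Sketch: fit a reward-prediction model on environments with known credit assignment,
# then use it to propose candidate rewards in a new environment.
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_reward_model(features: np.ndarray, rewards: np.ndarray) -> MLPRegressor:
    """features: (N, d) state-action features; rewards: (N,) known reward signals."""
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
    model.fit(features, rewards)
    return model

def candidate_rewards(model: MLPRegressor, new_env_features: np.ndarray) -> np.ndarray:
    """Estimated rewards for transitions in an environment with no defined reward yet."""
    return model.predict(new_env_features)
```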

Comment by Martin Vlach (martin-vlach) on chanamessinger's Shortform · 2022-10-10T11:40:15.458Z · LW · GW

Thanks for the links, as they clarified a lot for me. The names of the tactics/techniques sounded strange to me, and after unsuccessful googling for their meanings I started to believe it was a play on your readers; sorry if this suspicion of mine seemed rude.

The second part was curiosity to explore some potential cases of "What could we bet on?".

Comment by Martin Vlach (martin-vlach) on Willa's Shortform · 2022-10-06T18:00:16.480Z · LW · GW

Cheers to your & your friends' social lives!

Comment by Martin Vlach (martin-vlach) on chanamessinger's Shortform · 2022-10-06T17:58:33.868Z · LW · GW

I got frightened off by the ratio you've offered, so I'm not taking it, but thank you for offering. I might reconsider with some lesser amount that I can consider play money. Is there even a viable platform/service for a (maybe) $1:$100 individual bet like this?

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2022-10-06T17:53:21.152Z · LW · GW

Question/ask: list specific (/imaginable) events/achievements/states that are achievable and would contribute to humanity's long-term potential.

  Later they could be checked and valued for their originality, but the principles of such a game are not my key concern here.

Q: when, and I mean at what exact probability levels, do/should we switch from predicting humanity's extinction to predicting the further outcomes?

Comment by Martin Vlach (martin-vlach) on chanamessinger's Shortform · 2022-10-06T17:34:53.614Z · LW · GW

Can I bet the last 3 points are a joke?

Anyway, do we have a method to find checkpoints or milestones for betting on progress against a certain problem (e.g. AI development safety, Earth warming)?

Comment by Martin Vlach (martin-vlach) on G Gordon Worley III's Shortform · 2022-10-06T17:28:15.496Z · LW · GW

My guess is that the rental-car market has less direct/local competition, while airlines are centralized on airport routes and the many cheap-flight search engines (e.g. Kiwi.com) make that a favourable mindset.
Is there a price-comparison site for car rentals?

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2022-10-06T16:39:37.477Z · LW · GW

My views on the mistakes in "mainstream" A(G)I safety mindset:
  - We define non-aligned agents as conflicting with our/human goals, while we have ~none (only cravings and intuitive attractions). We should rather strive to conserve long-term positive/optimistic ideas/principles.

  - Expecting human bodies to be a neat fit for space colonisation/inhabitance/transformation is a "we have (--are, actually) a hammer, so we nail it into the vastly empty space" mindset..

  - We struggle with imagining unbounded/maximized creativity -- they can optimize experimentation vs. risks smoothly.

  - There is no focus on risk-awareness in AIs, i.e. on diverting/bending/inflecting ML development goals toward risk-including/centered applications.

  + A non-existent(?) good library catalogue of existing models and their availability, including models in development, incentivizing (anonymous) proofs of the latter.

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2022-10-05T10:07:30.560Z · LW · GW

The reward function being a (single) scalar, unstructured quantity in RL (practice) seems weird, not coinciding/aligned with my intuition of learning from ~continuous interaction. Something more like a Kahneman-ish multi-channel reward, with weights kept distinguishable/flexible for the future, might yield a more realistic/full-blown model.
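A toy illustration of what I mean, purely my own sketch: keep the reward as an explicit vector of channels and collapse it to the scalar RL expects only at the last moment, with the weights kept visible and changeable:

```python
# Sketch: a multi-channel reward collapsed to a scalar with explicit, adjustable weights.
import numpy as np

# hypothetical channel order: task progress, damage, novelty, social feedback
CHANNEL_WEIGHTS = np.array([1.0, -2.0, 0.1, 0.5])

def scalar_reward(channels: np.ndarray, weights: np.ndarray = CHANNEL_WEIGHTS) -> float:
    """Collapse per-channel reward signals into the single scalar an RL algorithm expects."""
    return float(weights @ channels)

# usage: scalar_reward(np.array([0.3, 0.0, 0.8, 0.1]))
```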

Comment by Martin Vlach (martin-vlach) on Martin Vlach's Shortform · 2022-10-05T09:52:26.792Z · LW · GW

DeepMind researcher Hado mentions here that an RL reward can be defined to contain a risk component. That seems almost brilliant and promising for a simple, generic RL development policy; I would love to learn (and teach) more practical details!
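A minimal sketch of the kind of composite reward I imagine here; the penalty form and the weight are my own assumptions, not taken from the talk:

```python
# Sketch: a reward with an explicit risk penalty.
def risk_aware_reward(task_reward: float, risk_estimate: float, risk_weight: float = 1.0) -> float:
    """Combine the plain task reward with a penalty for the estimated risk of the behaviour."""
    return task_reward - risk_weight * risk_estimate
```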