Introducing Transluce — A Letter from the Founders
post by jsteinhardt · 2024-10-23T18:10:02.526Z · LW · GW
We are launching an independent research lab that builds open, scalable technology for understanding AI systems and steering them in the public interest.
Transluce means to shine light through something to reveal its structure. Today’s complex AI systems are difficult to understand—not even experts can reliably predict their behavior once deployed. At the same time, AI is being adopted more quickly than any technology in recent memory. Given AI's extraordinary consequences for society, how we determine whether models are safe to release must be a matter of public conversation, and the tools for inspecting and assessing models should embody publicly agreed-upon best practices.
Our goal at Transluce is to create world-class tools for understanding AI systems, and to use these tools to drive an industry standard for trustworthy AI. To build trust in analyses of the capabilities and risks of AI systems, these tools must be scalable and open.
Scalability. AI results from the interaction of multiple complex data flows: training data, internal representations, behaviors, and user interactions. Current methods for understanding AI rely on significant manual labor from human researchers. We need scalable approaches that leverage AI to assist with understanding, by training AI agents to understand these complex data sources, explain them to humans, and modify the data in response to human feedback.
Openness. Companies building AI systems cannot be the primary arbiters of their safety, due to the conflict of interest with commercial priorities. To allow for meaningful public oversight, tools and processes for auditing AI systems should be openly validated, responsive to public feedback, and accessible to third-party evaluators. The best minds in the world should vet this technology and hone its reliability.
Transluce exists to address these needs. We will build AI-driven technology to understand and analyze AI systems, releasing it open-source so the community can understand and build upon it. We will first apply this technology to publicly analyze frontier open-weight AI systems, so the world can vet our analyses and improve their reliability. Once our technology has been openly vetted, we will work with frontier AI labs and governments to ensure that internal assessments reach the same standards as public best practices.
Today, we are releasing the first step towards this vision—a suite of AI-driven tools for automatically understanding the representations and behaviors of large language models. These tools scale to models ranging from Llama-3.1 8B to GPT-4o and Claude 3.5 Sonnet, and will be released open-source for the community to build on. You can read more about these tools in our release announcement, or read the detailed reports at Our Work.
Our approach: AI-driven tools for understanding AI
Humans struggle to understand AI systems because they are enormous and opaque—neuron activations are megabytes of arbitrary floating-point numbers, behaviors grow combinatorially with the space of input prompts, and training sets are at the scale of the Internet.
Our vision is to create AI-driven tools that direct massive computational power toward explaining these complex systems. We imagine a human trying to understand an AI system as an explorer situated in a vast cavern. We want teams of AI agents to map all the crevasses of the cavern, providing tendrils for the explorer to sense its overall structure, individual pieces within it, and how they fit together. The agents can exploit the vastness of the cavern as a source of information for training and improvement, putting us on the right side of the bitter lesson.
We’re sharing three demonstrations that start to illustrate this vision:
- an LLM pipeline that creates state-of-the-art feature descriptions for neuron activation patterns;
- an observability interface for interrogating and steering these features;
- a behavior elicitation agent that automatically searches for user-specified behaviors from frontier models, including Llama-405B and GPT-4o.
Each of these tools leverages AI agents trained to automatically understand other AI systems and surface those insights to humans.
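To make the first of these concrete, here is a minimal sketch of the general idea behind automated feature descriptions: show an explainer model the text snippets on which a neuron (or feature) activates most strongly, and ask it to summarize the pattern. This is an illustrative example, not Transluce's actual pipeline; it assumes an OpenAI-compatible client, and names like `describe_neuron` and the toy data are hypothetical.

```python
# Minimal sketch (not Transluce's pipeline): ask an explainer LLM to describe
# what a neuron responds to, given its top-activating text snippets.
# Assumes the `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def describe_neuron(top_activating_examples: list[tuple[str, float]]) -> str:
    """Return a one-sentence description of a neuron from (snippet, activation) pairs."""
    evidence = "\n".join(
        f"- activation {act:.2f}: {snippet!r}"
        for snippet, act in top_activating_examples
    )
    prompt = (
        "The following text snippets strongly activate one neuron in a language model. "
        "In one sentence, describe the pattern the neuron appears to detect.\n" + evidence
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Toy usage: two snippets that (hypothetically) activate the same neuron.
print(describe_neuron([
    ("the stock market fell sharply", 8.1),
    ("shares tumbled after the report", 7.4),
]))
```

The actual pipeline involves much more (selecting examples, scoring candidate descriptions against held-out activations, and so on), but the core loop of "activations in, natural-language description out" is what the sketch is meant to convey.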
Moving forward
We believe that responsible deployment of trustworthy AI systems is closely linked to understanding them in detail. As an independent non-profit lab, we can facilitate a public discussion about best practices for building this understanding. We will present ideas, get feedback from academia, the open-source community, companies, and the public, and iterate. We will build positive-sum partnerships with model providers, government auditors, and third-party users, which will be increasingly needed as the complexity of both deployments and regulation grows. These partnerships will help ensure that internal assessments reach the standard of our publicly vetted procedures.
At the same time, we have ambitious goals for our tech. We are scaling our methods to frontier models, with better agents to help us make sense of more complexity. We will combine our observability and elicitation technology, allowing users to specify search goals in terms of observability states, such as “features related to deception are firing internally but are not present in the output.” In the long run, we will build general-purpose frameworks for making sense of any complex data stream, including the training data and interactions between multiple agents.
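As a rough illustration of what a search goal over observability states could look like, here is a hedged sketch in which the goal is expressed as a predicate over internal feature activations and model output. All names here (`FeatureReadout`, `deception_mismatch`) are hypothetical placeholders, not Transluce's API; the point is only that such goals can be written down as checkable conditions that an elicitation agent could search against.

```python
# Hypothetical sketch of a search goal over observability state:
# "deception-related features fire internally, but the output does not surface it."
from dataclasses import dataclass

@dataclass
class FeatureReadout:
    feature_activations: dict[str, float]  # named feature -> activation strength
    output_text: str

def deception_mismatch(readout: FeatureReadout, threshold: float = 0.5) -> bool:
    """True when deception-related features are active but the output looks benign."""
    internal = max(
        (v for name, v in readout.feature_activations.items() if "deception" in name),
        default=0.0,
    )
    return internal > threshold and "deceive" not in readout.output_text.lower()

# An elicitation agent could then search over prompts to find inputs for which
# deception_mismatch(...) returns True, surfacing them for human review.
```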
We’re excited to start working on this mission in public. Whether you’re interested in following our work, joining our team, contributing to our projects, or donating resources to this work, we can’t wait to hear from you.
Jacob and Sarah
2 comments
comment by Gurkenglas · 2024-10-23T19:28:00.159Z · LW(p) · GW(p)
The public will Goodhart any metric you hand over to it. If you provide evaluation as a service, you will know how many attempts an AI lab made at your test.
comment by Chris_Leong · 2024-10-24T06:55:23.265Z · LW(p) · GW(p)
One thing I would love to know is how it'll work on Claude 3.5 Sonnet or GPT 4o given that these models aren't open-weights. Is it that you have access to some reduced level of capabilities for these?