
Researchers at OpenAI's Superalignment team propose using weaker supervisors to control superhuman AI, offering early insights into aligning future advanced systems with human values and safety.
December 14, 2023
The OpenAI Superalignment team has introduced a new research direction aimed at addressing one of the most pressing challenges in AI safety: how to control and align superhuman AI systems using weaker supervisors. Their latest paper, "Weak-to-Strong Generalization," presents promising initial results that could pave the way for more robust alignment methods.
OpenAI believes that superintelligence (AI vastly smarter than humans) could arrive within the next decade. Ensuring that such systems remain safe and beneficial to humanity is a significant challenge. Current alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on human supervision, which will become increasingly inadequate as models grow more capable.
For example, future superhuman models could generate complex and potentially dangerous code that even expert humans would struggle to understand. This disparity between the capabilities of the supervisor and the model is what makes alignment so difficult. The core question is: How can weak supervisors (like humans) effectively control and trust much stronger AI systems?
To tackle this challenge, OpenAI proposes an empirical analogy that can be studied today: can smaller, less capable models supervise larger, more capable models? This setup allows researchers to iteratively test and refine alignment methods without waiting for the advent of superhuman AI.
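The analogy can be sketched as a small experiment harness. This is a minimal sketch, assuming the three-way comparison described in the paper: a weak supervisor trained on ground truth, a strong student trained only on the supervisor's labels, and a strong ceiling trained on ground truth. The names `run_weak_to_strong`, `Trainer`, and `accuracy` are hypothetical and chosen for illustration; the real experiments fine-tune language models rather than call simple Python functions.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[float, int]        # (input, label); a toy stand-in for real task data
Model = Callable[[float], int]     # maps an input to a predicted label
Trainer = Callable[[List[Example]], Model]  # hypothetical: fits a model to examples


def accuracy(model: Model, data: List[Example]) -> float:
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


def run_weak_to_strong(
    train_weak: Trainer,
    train_strong: Trainer,
    weak_train: List[Example],
    strong_train: List[Example],
    test: List[Example],
) -> Dict[str, float]:
    # 1. Weak supervisor: the small model, trained on ground-truth labels.
    weak = train_weak(weak_train)
    # 2. Weak labels: the supervisor relabels a separate set, mistakes included.
    weak_labeled = [(x, weak(x)) for x, _ in strong_train]
    # 3. Weak-to-strong student: the large model, trained only on weak labels.
    student = train_strong(weak_labeled)
    # 4. Strong ceiling: the same large model, trained on ground truth.
    ceiling = train_strong(strong_train)
    return {
        "weak": accuracy(weak, test),
        "weak_to_strong": accuracy(student, test),
        "strong_ceiling": accuracy(ceiling, test),
    }
```

The interesting case is when `weak_to_strong` lands above `weak`: the student has then generalized beyond the errors in its supervision instead of merely imitating them.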
Generalization Across Models: The team found that a GPT-2-level model can elicit most of GPT-4’s capabilities: GPT-4 fine-tuned on the weaker model’s labels reaches performance close to GPT-3.5. This is significant because it suggests that a weaker model can guide a stronger one to generalize correctly, even on hard problems where the weaker model itself fails.
Empirical Progress: By using smaller models as supervisors, researchers can make iterative empirical progress today, which is crucial for developing robust alignment techniques for future superhuman AI.
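How much of the strong model's capability is recovered can be put as a single number. The paper reports a "performance gap recovered" (PGR) metric: the share of the gap between the weak supervisor and the strong ceiling that the weakly supervised strong model closes. Below is a minimal sketch of that ratio; the function name and the accuracies in the usage line are hypothetical and are not results from the paper.

```python
def performance_gap_recovered(
    weak: float, weak_to_strong: float, strong_ceiling: float
) -> float:
    """PGR = (weak_to_strong - weak) / (strong_ceiling - weak).

    1.0 means the student matches the strong ceiling;
    0.0 means it does no better than its weak supervisor.
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        # The metric is only meaningful when the strong model has headroom.
        raise ValueError("strong ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, for illustration only.
print(performance_gap_recovered(0.5, 0.75, 1.0))  # prints 0.5
```

A PGR of 0.5 in this made-up case would mean the student closed half of the gap between its supervisor and the ceiling.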
To achieve these results, the team designed a setup where:

The process involves several steps:
This research opens up new avenues for addressing the superalignment problem. By leveraging the generalization properties of deep learning, researchers can develop methods that allow weaker supervisors to control stronger models effectively. This is particularly important as we move towards more advanced AI systems that will far surpass human capabilities.
The OpenAI Superalignment team plans to build on these initial findings by:
The weak-to-strong generalization approach represents a significant step forward in the quest for superalignment. By using smaller models to supervise larger ones, researchers can make tangible progress today while preparing for the challenges of tomorrow’s superhuman AI.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.