TLDR: This research paper proposes a new objective function for AI agents that aims to softly maximize long-term, aggregate human power, defined as the ability to achieve diverse goals. Instead of learning human preferences, the AI focuses on structural empowerment, considering human bounded rationality and social norms. Experiments show an AI using this metric learns cooperative behaviors like unlocking doors and clearing paths, suggesting a safer and more beneficial alternative to traditional reward-based AI objectives.
Artificial intelligence systems are rapidly advancing, bringing both immense potential and significant concerns, particularly regarding AI safety. A central concept in this discussion is ‘power’ – not just in terms of AI seeking control, but also human power, which is essential for our well-being. A new research paper explores a novel approach to AI design, aiming to promote both safety and human well-being by explicitly tasking AI agents with empowering humans and managing the power balance between humans and AI in a beneficial way.
The paper, titled “Model-Based Soft Maximization of Suitable Metrics of Long-Term Human Power,” by Jobst Heitzig and Ram Potham, introduces a principled framework for an AI’s objective function. Unlike traditional AI objectives that maximize a specific utility or reward, this approach gives the agent an objective that is an inequality- and risk-averse, long-term aggregate of human power. In other words, the AI is designed to consider how its actions affect the ability of many humans to achieve a wide variety of potential goals over a long period, while remaining mindful of fairness and avoiding risky outcomes.
Understanding Human Power for AI
The core of this framework is a new metric for individual human power, termed “ICCEA power” (Informationally and Cognitively Constrained Effective Autonomous power). This metric measures how many diverse goals a human can effectively achieve, taking into account their own cognitive limitations, available information, and the behavior of other agents, including the AI. Crucially, the AI does not try to guess or learn a human’s specific, current goals, as these can be complex, changing, and hard to predict. Instead, it focuses on the structural ability to reach a wide range of possible states that could represent desirable outcomes.
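To make this concrete, here is a rough, hedged sketch of what such a structural power metric could look like: a human's power is estimated as the average probability of success across a diverse set of candidate goals, evaluated under a bounded-rational behavior model. The one-dimensional environment, the noisy-greedy human model, and all function names are illustrative assumptions, not the paper's actual definition of ICCEA power.

```python
# Illustrative sketch only: a toy stand-in for an "ICCEA-like" power metric.
# The environment, the bounded-rationality model, and all names are assumptions
# for illustration, not the paper's definitions.
import random

def noisy_greedy_step(pos: int, goal: int, rationality: float) -> int:
    """Bounded-rational human: moves toward the goal, but sometimes slips."""
    best = 1 if goal > pos else -1 if goal < pos else 0
    if random.random() < rationality:
        return pos + best                        # intended move
    return pos + random.choice([-1, 0, 1])       # slip / limited cognition

def achievement_prob(start: int, goal: int, horizon: int,
                     rationality: float, trials: int = 500) -> float:
    """Monte-Carlo estimate of reaching `goal` within `horizon` steps."""
    hits = 0
    for _ in range(trials):
        pos = start
        for _ in range(horizon):
            if pos == goal:
                break
            pos = noisy_greedy_step(pos, goal, rationality)
        hits += (pos == goal)
    return hits / trials

def human_power(start: int, candidate_goals: list[int], horizon: int,
                rationality: float) -> float:
    """Average achievability over a *diverse* goal set -- no single goal is assumed."""
    probs = [achievement_prob(start, g, horizon, rationality) for g in candidate_goals]
    return sum(probs) / len(probs)

if __name__ == "__main__":
    goals = list(range(-5, 6))   # many possible destinations on a line
    print(human_power(start=0, candidate_goals=goals, horizon=6, rationality=0.8))
```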
The researchers detail how this individual power metric is aggregated across multiple humans and over time to form the AI’s overall objective. They incorporate several “desiderata,” or desired properties, into the metric’s design. For instance, the AI is incentivized to reduce uncertainty for humans, prefer reliable outcomes, and avoid concentrating power in the hands of a few. The metric also encourages the AI to be “corrigible,” meaning it can be corrected or stopped, and to avoid irreversible changes that might disempower humans in the future.
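As a hedged illustration of what an inequality- and risk-averse aggregate could look like, the sketch below combines per-human power scores with a generalized mean that weights the worst-off humans more heavily, applies a concave transform so that reliable moderate power beats risky extremes, and discounts over time. The specific transform, exponent, and discount factor are arbitrary choices for illustration, not the paper's.

```python
# Illustrative sketch: inequality- and risk-averse aggregation of per-human power.
# The concave transform, the exponent, and the discounting scheme are assumptions
# chosen for illustration; the paper's actual aggregation may differ.
import math

def risk_averse(x: float) -> float:
    """Concave transform: reliable moderate power beats a risky gamble of the same mean."""
    return math.sqrt(x)

def inequality_averse_mean(values: list[float], p: float = -2.0) -> float:
    """Generalized mean with p < 1 weights the worst-off humans more heavily."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

def aggregate_power(power_per_human_per_time: list[list[float]],
                    discount: float = 0.99) -> float:
    """Discounted sum over time of the risk-averse transform of the
    inequality-averse cross-human aggregate."""
    total = 0.0
    for t, powers_at_t in enumerate(power_per_human_per_time):
        total += (discount ** t) * risk_averse(inequality_averse_mean(powers_at_t))
    return total

if __name__ == "__main__":
    balanced   = [[0.5, 0.5], [0.6, 0.6]]   # two humans, two time steps
    unbalanced = [[0.9, 0.1], [1.0, 0.2]]
    # The balanced distribution of power scores higher than the unbalanced one.
    print(aggregate_power(balanced), aggregate_power(unbalanced))
```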
How the AI Learns to Empower
The paper proposes algorithms for an AI to compute and softly maximize this human power metric; “soft” maximization means favoring higher-scoring actions probabilistically rather than strictly optimizing, which helps avoid extreme behavior. In simpler environments, this can be done through backward induction, a form of dynamic programming. For more complex, multi-agent environments, the authors suggest a two-phase learning approach similar to reinforcement learning. In the first phase, the AI learns to model human behavior, including bounded rationality and social norms. In the second phase, building on this model, the AI learns its own policy to softly maximize aggregate human power.
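Here is a minimal sketch of the backward-induction variant with soft maximization, assuming a human-power score is already available for every state (the first learning phase, modeling human behavior, is omitted). The toy states, transitions, and numeric scores are invented for illustration; only the structure, a value recursion with a softmax in place of a hard argmax, reflects the idea described above.

```python
# Illustrative sketch of "soft maximization" by backward induction in a tiny,
# fully known MDP. States, transitions, and the per-state human-power scores
# are made-up placeholders, not from the paper.
import math

STATES  = ["locked", "unlocked", "blocked", "clear"]
ACTIONS = ["wait", "use_key", "step_aside"]

# Deterministic toy transitions: NEXT[state][action] -> next state.
NEXT = {
    "locked":   {"wait": "locked",   "use_key": "unlocked", "step_aside": "locked"},
    "unlocked": {"wait": "unlocked", "use_key": "unlocked", "step_aside": "clear"},
    "blocked":  {"wait": "blocked",  "use_key": "blocked",  "step_aside": "clear"},
    "clear":    {"wait": "clear",    "use_key": "clear",    "step_aside": "clear"},
}

# Placeholder: how much power the human has in each state (higher = more goals reachable).
HUMAN_POWER = {"locked": 0.2, "unlocked": 0.6, "blocked": 0.3, "clear": 1.0}

def soft_policy(horizon: int, temperature: float = 0.1):
    """Backward induction with a softmax ("soft maximization") over action values."""
    V = {s: 0.0 for s in STATES}                 # value at the final step
    policy = []                                  # one softmax policy per time step
    for _ in range(horizon):
        Q = {s: {a: HUMAN_POWER[NEXT[s][a]] + V[NEXT[s][a]] for a in ACTIONS}
             for s in STATES}
        pi = {}
        for s in STATES:
            weights = {a: math.exp(Q[s][a] / temperature) for a in ACTIONS}
            z = sum(weights.values())
            pi[s] = {a: w / z for a, w in weights.items()}
            V[s] = sum(pi[s][a] * Q[s][a] for a in ACTIONS)
        policy.insert(0, pi)
    return policy

if __name__ == "__main__":
    first_step = soft_policy(horizon=3)[0]
    print(first_step["locked"])   # most probability mass on "use_key"
```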
The implications of this objective were explored through analysis of various “paradigmatic situations” and a simulation in a small gridworld environment. The analysis suggests that an AI designed with this objective would:
- Act as a transparent, instruction-following assistant, making clear commitments and respecting social norms.
- Adapt to human limitations, offering a suitable number of options without overwhelming them.
- Be hesitant to cause irreversible changes, often asking for confirmation before executing commands.
- Manage resources fairly and sustainably.
- Protect its own existence and functionality, as these are instrumental to empowering humans.
Proof of Concept: The Gridworld Experiment
In the gridworld simulation, a robot agent, without any explicit goal-specific rewards, learned to cooperatively empower a human. The human’s unknown goal was to reach a green square, but they were blocked by a locked door. The robot, solely driven by the objective to maximize human power, learned to navigate to a key, pick it up, unlock the door, and then move out of the human’s way. This complex sequence of actions emerged naturally as the robot discovered that these steps significantly increased the human’s ability to reach various possible goals, thus increasing its intrinsic reward.
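A minimal sketch of why this behavior can emerge: if the robot's intrinsic reward is taken to be the number of cells the human could still reach within a short horizon, then unlocking the door and stepping aside directly raise that number. The tiny grid, the door mechanics, and the reachability-count reward below are simplifying assumptions, not the paper's actual environment or metric.

```python
# Illustrative sketch: a reachability-based intrinsic reward for the robot.
# The grid layout, the door mechanics, and the "count reachable cells" reward
# are simplifying assumptions to show why door-unlocking increases the metric.
from collections import deque

GRID = [
    "########",
    "#H...D.#",   # H = human start, D = door, . = free cell, # = wall
    "########",
]

def reachable_cells(grid: list[str], door_open: bool, horizon: int) -> int:
    """Count cells the human can reach within `horizon` steps (breadth-first search)."""
    rows, cols = len(grid), len(grid[0])
    start = next((r, c) for r in range(rows) for c in range(cols) if grid[r][c] == "H")

    def passable(r: int, c: int) -> bool:
        ch = grid[r][c]
        return ch in ".H" or (ch == "D" and door_open)

    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        (r, c), dist = frontier.popleft()
        if dist == horizon:
            continue
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and passable(nr, nc) and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append(((nr, nc), dist + 1))
    return len(seen)

if __name__ == "__main__":
    # The robot's intrinsic reward rises when it opens the door for the human.
    print("door locked :", reachable_cells(GRID, door_open=False, horizon=10))
    print("door opened :", reachable_cells(GRID, door_open=True,  horizon=10))
```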
This research offers a promising direction for developing highly capable general-purpose AI systems that are inherently safer and more beneficial. By focusing on the soft maximization of aggregate human power, such AI systems could provide a robust alternative to traditional utility-based objectives, potentially mitigating risks like power-seeking and misalignment. For more in-depth technical details, you can read the full paper available at arXiv:2508.00159.


