Measuring Machine Intelligence

I kept hearing people say we're close to AGI - or that we've already achieved it, or that it's impossible. I realized I didn't actually know what intelligence means in a formal sense, how we measure it in humans, or what the benchmarks for AI really test. So I went looking. What I found was a field where trillion-dollar investment decisions depend on definitions that nobody agrees on, benchmarks that keep breaking, and a gap between what AI can do on a test and what it can do in the real world. Here's what I learned.
What Is Intelligence, Anyway?
This sounds like it should have a straightforward answer. It doesn't. When two dozen prominent researchers were asked to define intelligence in 1986, they produced two dozen different definitions. [10] The American Psychological Association tried in 1996 and deliberately avoided giving a single definition, instead calling intelligence a “complex set of phenomena.” [2]
The most widely cited working definition comes from a 1997 statement signed by 52 researchers: [3]
Intelligence is a very general mental capability that involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience.
That's helpful, but notice how broad it is. It's basically saying intelligence is being good at thinking - which doesn't tell you much about how to measure it.
Spearman's g: the one number that keeps showing up
In 1904, Charles Spearman noticed something odd. When you give people a bunch of unrelated cognitive tests - vocabulary, spatial reasoning, arithmetic, pattern recognition - their scores are always positively correlated. People who do well on one tend to do well on others. This is called the positive manifold, and it's been called the most replicated finding in all of psychology. [1]
Spearman used a technique he invented called factor analysis to extract a single underlying factor from these correlations. He called it g (general intelligence). Mathematically, the idea is that each test score is a combination of g plus some test-specific ability:
X = Λf + ε
Think of it this way. You take a bunch of cognitive tests and get a score on each - that's X. Spearman's insight was that your scores aren't random. There's a hidden factor f (general intelligence, or g) pulling them all in the same direction. Λ captures how strongly each test is connected to that hidden factor - some tests (like abstract reasoning) are tightly linked to g, while others (like memorizing digits) are less so. And ε is everything else - luck, how you were feeling that day, test-specific skills.
When researchers run this math across large populations, the g factor alone explains 40-50% of the total variation in scores across all the different tests. That's a lot for a single number. [1]
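The positive manifold and the variance-explained figure are easy to reproduce in simulation. Here's a minimal sketch of the X = Λf + ε setup - the six tests, their loadings, and the sample size are all invented for illustration, not real psychometric data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 1,000 people taking 6 cognitive tests. Each score is a weighted
# dose of one hidden factor (g) plus test-specific noise: X = Λf + ε.
n_people = 1_000
g = rng.normal(size=n_people)                        # hidden factor f
loadings = np.array([0.8, 0.7, 0.7, 0.6, 0.6, 0.5])  # Λ, one entry per test
noise = rng.normal(size=(n_people, len(loadings)))   # ε
scores = np.outer(g, loadings) + noise               # X

# The positive manifold: every pair of tests is positively correlated.
corr = np.corrcoef(scores, rowvar=False)
print((corr > 0).all())

# Share of total variance captured by the largest factor (a rough g estimate).
eigvals = np.linalg.eigvalsh(corr)
share = eigvals[-1] / eigvals.sum()
print(f"top factor explains {share:.0%} of the variance")
```

With these made-up loadings the top factor lands in the same ballpark as the real-world 40-50% figure, even though every individual test is mostly noise.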
Today, the most accepted model is the Cattell-Horn-Carroll (CHC) model. [4] [5] [6] Think of it as a three-layer pyramid. At the top sits g. In the middle are 16 broad abilities like reasoning, memory, and processing speed. At the bottom are 80+ specific skills.
The distinction that matters most for AI is between two of those broad abilities. Fluid intelligence is your ability to solve problems you've never seen before - pure reasoning with no prior knowledge to lean on. Crystallized intelligence is what you know - facts, vocabulary, learned procedures. Here's the thing: most AI benchmarks test crystallized knowledge (what the model has seen in training). Very few test fluid reasoning (can it figure out something genuinely new?).
There's another well-known theory worth mentioning: Howard Gardner's multiple intelligences. He proposed in 1983 that there are eight independent types of intelligence - linguistic, logical-mathematical, spatial, musical, bodily-kinesthetic, interpersonal, intrapersonal, and naturalistic. [7] The idea became hugely popular in education. Schools designed entire curricula around it.
The problem is that when researchers tested whether these intelligences are actually independent, they found they're not. Visser et al. (2006) showed they all correlate with g - meaning they're not separate abilities, just different expressions of the same underlying factor. [9] Waterhouse (2023) went further, calling the whole theory a “neuromyth.” [8] Gardner himself conceded there was “little hard evidence” supporting it.
Some numbers worth knowing
If you test a large number of people, IQ scores form a normal distribution. The average is set at 100 and the standard deviation at 15, so a score of 115 is one standard deviation above average and 130 is two. About 95% of people fall between 70 and 130.
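Those percentages fall straight out of the normal distribution. A quick check, assuming the standard N(100, 15) parameterization:

```python
from math import erf, sqrt

def iq_between(lo: float, hi: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Fraction of a N(100, 15) population scoring in the interval [lo, hi]."""
    cdf = lambda x: 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))
    return cdf(hi) - cdf(lo)

print(f"85-115 (within 1 SD): {iq_between(85, 115):.1%}")   # 68.3%
print(f"70-130 (within 2 SD): {iq_between(70, 130):.1%}")   # 95.4%
print(f"above 130:            {iq_between(130, float('inf')):.1%}")  # 2.3%
```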
One thing that surprised me: how much of intelligence is genetic, and how that changes with age. In childhood, genes explain about 41% of the differences in intelligence between people. By adulthood, that number jumps to 75-80%. [11] Environment matters a lot when you're young, but its influence fades as you grow older.
Neuroscience adds another piece. Intelligence isn't located in one part of the brain. It depends on how well the frontal lobes (planning, decision-making) and parietal lobes (spatial processing, integration) talk to each other. It's about the wiring between regions, not any single “intelligence center.” [12]
The Mathematical View: Compression Is Intelligence
There's also a mathematical way to think about intelligence. The core idea is simple: intelligence is compression. If you can take something complex and describe it in a shorter form, you've understood its pattern.
This was formalized through something called Kolmogorov complexity. [15] It measures how complex a piece of data is by asking: what's the shortest computer program that can produce it? For example, the string “010101010101” is simple - you can generate it with a tiny program (“repeat 01 six times”). A truly random string can't be compressed at all - you'd need a program just as long as the string itself.
K(x) = min { |p| : U(p) = x }
K(x) is the Kolmogorov complexity of some data x. It's the length of the shortest program p that produces x on a universal Turing machine U. The catch: you can never be 100% sure you've found the shortest possible program. It's mathematically proven to be incomputable. [15]
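Since K(x) itself is incomputable, a common trick is to use an off-the-shelf compressor as a crude upper-bound proxy. A sketch using Python's zlib - the specific strings are just for illustration:

```python
import random
import zlib

def compressed_size(s: str) -> int:
    """Length of zlib's output: a computable upper-bound proxy for K(x)."""
    return len(zlib.compress(s.encode(), 9))

patterned = "01" * 500                 # "repeat 01 five hundred times"
random.seed(0)
noise = "".join(random.choice("01") for _ in range(1000))

print(compressed_size(patterned))      # tiny: the regularity is captured
print(compressed_size(noise))          # far larger: little structure to exploit
```

Both strings are 1,000 characters long, but the patterned one compresses to a handful of bytes while the random one barely shrinks - exactly the gap Kolmogorov complexity formalizes.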
Why does this matter for intelligence? Because compression and prediction are the same thing. If you can compress data, you've found the pattern. If you've found the pattern, you can predict what comes next. And predicting well is really what intelligence is about. Ray Solomonoff made this rigorous in 1964 [16] - he showed that the best possible predictor is one that favors simpler explanations over complex ones (shorter programs get higher weight). This is basically Occam's razor, expressed as math.
Putting a number on intelligence
In 2007, Shane Legg and Marcus Hutter looked at 71 different definitions of intelligence and boiled them all down to one idea: intelligence is how good you are at achieving goals across many different situations. [13] Then they wrote a formula for it:
Υ(π) = Σ_{μ∈E} 2^−K(μ) · V_μ^π
Here's what it says in plain English: take an agent (π). Test it in every possible environment (μ). Add up how well it does in each one - but give more weight to simpler environments (the 2^−K(μ) part). That total is its intelligence score. Why weight simpler environments more? Because most real-world problems have underlying structure. An agent that only solves bizarre, contrived edge cases but fails at everyday tasks isn't very intelligent. Weighting by simplicity is Occam's razor built into the formula. [13]
What makes this interesting is what it doesn't care about. It doesn't matter if the agent is a human, an AI, or something we haven't invented yet. It doesn't favor any particular task. It's a universal yardstick for intelligence. The catch: you can't actually compute it. It requires testing across infinitely many environments. So it exists as a theoretical ideal, not a practical tool.
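To see the shape of the formula, here's a toy version over a hand-picked finite set of environments. The environment names, K values, and V scores are all invented; a real Υ would sum over infinitely many environments with incomputable K:

```python
# Hypothetical environments: K is a stand-in description length (bits),
# V is the agent's measured value in that environment. All values invented.
environments = [
    ("navigate a small grid", 3, 0.9),
    ("two-armed bandit", 4, 0.8),
    ("chess endgame", 9, 0.6),
    ("contrived maze variant", 14, 0.1),
]

# Υ(π) = Σ 2^−K(μ) · V_μ^π : simpler environments dominate the total.
upsilon = sum(2 ** -k * v for _, k, v in environments)
print(f"toy universal intelligence score: {upsilon:.5f}")  # 0.16368
```

Notice that the contrived 14-bit environment contributes almost nothing: failing at it costs the agent virtually no score, while failing at the simple grid world would.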
Marcus Hutter extended this with AIXI [14] - a theoretical design for the most intelligent possible agent. AIXI always makes the best decision in any environment by considering every possible explanation for what it observes, weighted by simplicity. It's been proven optimal. It's also been proven impossible to actually build - it would require infinite computation. Researchers have built simplified versions (like MC-AIXI-CTW [17]), but they only work in very simple environments.
We have a mathematically precise definition of perfect intelligence. We just can't compute it. This is the fundamental tension in the field: we know what the destination looks like but have no map.
How We Test AI Today - and Where It Breaks
If we can't compute universal intelligence, we do the next best thing: we give AI tests. Lots of tests. The problem is that AI keeps passing them faster than we can make new ones.
The saturation problem
MMLU (Massive Multitask Language Understanding) was supposed to be a broad test of knowledge across 57 subjects. When it launched in 2021, the best models scored around 43%. By 2025, frontier models were clustered at 88-93% - essentially at or above human expert level - and the benchmark was excluded from the Vellum AI leaderboard as “outdated.” [27] Worse, scores can swing by up to 10% based on nothing more than how the prompt is formatted.
This pattern repeats. A benchmark launches, the community rallies around it, models improve rapidly, and within a few years it's saturated and no longer useful for distinguishing between systems. GPQA Diamond (PhD-level science) went from 36% to 94% in three years. AIME math problems went from 12% to near-perfect scores.
The contamination problem
There's also a trust problem with benchmarks. When researchers asked GPT-4 to guess missing answer options from MMLU questions, it got 57% of them exactly right. [29] That's a strong sign the model had seen these test questions during training. It wasn't reasoning through them - it was remembering them.
This is called data contamination, and it's a big deal. On GSM8K (a math benchmark), accuracy drops by 13% when you remove questions the model was likely trained on. It gets worse: some models are trained on translated versions of English benchmarks, which inflates their English scores without anyone noticing.
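One common way to screen for contamination is verbatim n-gram overlap between benchmark items and the training corpus. A minimal sketch - real pipelines use larger corpora, normalization, and fuzzier matching than this:

```python
def ngrams(text: str, n: int = 8) -> set:
    """All word-level n-grams in a text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(test_item: str, corpus: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that appear verbatim in the corpus."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus, n)) / len(item_grams)

# A benchmark question that leaked into training scores high;
# a genuinely unseen one scores zero.
corpus = "the quick brown fox jumps over the lazy dog near the riverbank at dawn"
print(overlap_score("quick brown fox jumps over the lazy dog near the", corpus))  # 1.0
print(overlap_score("an entirely different question that the corpus has never contained anywhere", corpus))  # 0.0
```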
Why this matters
If a model has seen the test during training, its score tells you how good it is at memorizing, not at thinking. Imagine a student who got a perfect score - but only because they had the answer key beforehand. The field is fighting this with private test sets (HLE, ARC-AGI-2), constantly refreshed benchmarks (LiveBench), and tools that detect contamination after the fact. But it's an ongoing arms race.
ARC-AGI: testing what benchmarks miss
François Chollet made an important argument in 2019: being good at a task doesn't mean you're intelligent. [23] You might just have a lot of practice or training data. What actually shows intelligence is how fast you can pick up something completely new from just a few examples. He called this skill-acquisition efficiency.
To test this, he created the Abstraction and Reasoning Corpus (ARC) - a set of grid puzzles where you see 2-3 examples of a pattern and have to figure out the rule. They test basic things like recognizing objects, counting, and understanding how shapes relate to each other. No memorization helps here - every puzzle is unique.
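In spirit, solving an ARC task means finding a transformation rule consistent with every example pair, then applying it to the test grid. A toy illustration with two invented candidate rules:

```python
# Hypothetical candidate rules; the puzzle's example pairs select the one
# that is consistent with every demonstration.
def mirror(grid):
    return [row[::-1] for row in grid]

def transpose(grid):
    return [list(col) for col in zip(*grid)]

train_pairs = [
    ([[1, 0], [0, 2]], [[0, 1], [2, 0]]),  # output = input mirrored left-right
    ([[3, 3, 0]], [[0, 3, 3]]),
]

def fits(rule, pairs):
    return all(rule(inp) == out for inp, out in pairs)

candidates = {"mirror": mirror, "transpose": transpose}
learned = [name for name, rule in candidates.items() if fits(rule, train_pairs)]
print(learned)               # ['mirror'] - the only rule both examples support
print(mirror([[5, 0, 7]]))   # [[7, 0, 5]] - applied to an unseen test grid
```

Real ARC solvers search a vastly larger rule space, but the structure is the same: a few examples, one consistent program, one held-out test. Memorization can't help because each puzzle's rule is new.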
The first version (ARC-AGI-1) is now mostly solved - OpenAI's o3 scored 87.5% in December 2024. So Chollet released a harder version, ARC-AGI-2, in 2025. [24] Every task was checked to make sure at least two humans could solve it in two tries or fewer. But at launch, no AI model scored above 5%. By early 2026, the best score is 54% (from a system called Poetiq, built on Gemini 3 Pro, costing $30 per task). Claude Opus 4.5 gets 37.6% at $2.20 per task. The ARC Prize 2025 drew 1,455 teams and over 15,000 entries. [25] The takeaway: just making models bigger isn't enough. Something fundamentally new is needed.
Humanity's Last Exam: the hardest test yet
Then there's Humanity's Last Exam (HLE), built by the Center for AI Safety and Scale AI. They asked about 1,000 experts from over 500 institutions across 50 countries to write the hardest questions they could - 2,500 questions across 100+ subjects. [26] The rule: if any current AI model could answer a question, it got thrown out. Only questions that stumped every model made it in.
When HLE launched in early 2025, GPT-4o scored 2.7% and o1 scored 8.0%. A year later, the scores have gone up but are still nowhere near human expert level (~90%): Claude Opus 4.6 reaches 53.1%, Gemini 3 Deep Think gets 48.4%, and GPT-5.4 hits 41.6%.
What Would AGI Actually Look Like?
Here's where things get political. Every major AI lab has its own definition of AGI, and the definition they choose conveniently shapes what they claim to have achieved.
The term was first used by Mark Gubrud in 1997 [33] and became widely known after a 2007 book by Ben Goertzel and Cassio Pennachin. [34] But to this day, there's no agreed definition. Each major AI lab has its own version:
Google DeepMind
- Level 1: Emerging - equal to an unskilled human
- Level 2: Competent - 50th percentile adult
- Level 3: Expert - 90th percentile
- Level 4: Virtuoso - 99th percentile
- Level 5: Superhuman - exceeds all humans
Crosses performance depth with breadth (narrow vs. general). Morris et al., 2024.
OpenAI
- L1: Chatbots - conversational AI
- L2: Reasoners - PhD-level problem solving
- L3: Agents - act autonomously for days
- L4: Innovators - advance scientific research
- L5: Organizations - do the work of a company
Defined as “outperforming humans at most economically valuable work.” OpenAI, July 2024.
Anthropic
- Avoids the term “AGI” entirely, describing “powerful AI” instead
- Intelligence surpassing Nobel laureates in most fields
- Compressing 100 years of scientific progress into 5-10
Focuses on safety thresholds (ASL framework) rather than capability milestones. Amodei, October 2024.
Google DeepMind's approach [35] is the most structured. They created a grid with two axes: how good a system is (from “Emerging” to “Superhuman”) and how broad it is (narrow vs. general). By this measure, today's top models - GPT-4, Gemini, Claude - count as “Emerging AGI” (Level 1 General). That means they're roughly as broad as an unskilled human, but can be expert-level on specific tasks.
DeepMind's 10 faculties of intelligence
In March 2026, DeepMind published what I think is the most important paper in this space. [36] Instead of asking “can AI pass this test?” they asked a better question: “what are the building blocks of intelligence, and how do we measure each one?”
They identified 10 core cognitive faculties, grounded in decades of cognitive science research. Think of them as the fundamental abilities that make up general intelligence. Some of them we can already test in AI. Others we barely know how to evaluate.
The pattern here is revealing. The abilities we can test - perception, generation, reasoning, problem solving, memory - are the ones current AI is already good at. They're the abilities that show up in benchmarks like MMLU, GPQA, and coding tests.
The abilities we can't properly test are a different story. Metacognition is knowing what you know and what you don't - something AI is famously bad at (it “hallucinates” confidently). Learning means improving from new experience in real time, not just during training. Executive function means managing a complex, multi-step project over days without losing track. Social cognition means understanding what another person is thinking and feeling from context. These are hard to test because they're hard to put into a multiple-choice format. But they're exactly the abilities that separate passing a test from functioning in the real world.
The paper's framework is also mechanism-agnostic - it doesn't care how a system achieves these abilities, only whether it does. And it requires comparing AI performance against human baselines from a representative adult population, so you get an honest picture of where the system falls on the human distribution.
OpenAI defines AGI as systems that can outperform humans at most economically valuable work. Anthropic's Dario Amodei doesn't use the term at all. He talks about “powerful AI” instead - something smarter than Nobel Prize winners in most fields, capable of compressing a century of scientific progress into 5-10 years. [63]
This disagreement isn't academic. Government regulation, corporate strategy, and trillion-dollar investments all depend on how you define AGI. If it means “passes benchmarks,” we're almost there. If it means “can do everything a human can,” we're not close.
Where AI Actually Stands Today
The picture is clear: AI crushes structured academic tests but struggles with messy real-world tasks. Researchers call this the jagged frontier. [37] The system can be expert-level at one thing and completely fail at something that seems simpler. The edges of what it can and can't do are unpredictable.
A study at Boston Consulting Group tested this with 758 real consultants. [59] When the task was something AI is good at, people using GPT-4 did more work, faster, and at higher quality. But when the task was outside AI's comfort zone, people using AI actually did 19 percentage points worse than those working without it. They trusted the AI's answer when they shouldn't have.
The numbers get more stark at scale. In blind tests across 44 occupations, AI matched or beat human experts on about half the professional tasks - and did it 50-300x faster. [49] Sounds impressive. But when researchers tested AI agents on 240 actual remote work projects, the best one could only automate 2.5% of them. [48] Doing well on a test and doing well on the job are very different things.
When do researchers think we'll get AGI?
A 2024 survey asked 2,778 AI researchers this question. [39] The median answer: 2047 - which was 13 years earlier than the same survey had found two years before. There's a 10% chance it happens by 2027, according to the respondents. Prediction markets on Metaculus put the first general AI announcement at March 2028, with Alphabet as the most likely lab (35.9%), followed by OpenAI (20.6%) and Anthropic (19.1%). [40]
What AI Has Already Done
Whatever you think about AGI timelines, AI is already making contributions that would have been unimaginable a decade ago.
AlphaFold figured out how proteins fold - a problem scientists had been stuck on for 50 years. It earned Demis Hassabis and John Jumper the 2024 Nobel Prize in Chemistry. [42] [43] The database now has predicted structures for 214 million proteins across over 1 million species. GNoME predicted 2.2 million new crystal structures, of which 381,000 turned out to be stable materials that could actually be made. [44] FunSearch was the first time an LLM made a verifiable new discovery in mathematics - beating a 20-year-old record on the cap set problem. [45] And AlphaGeometry 2 solved 42 out of 50 International Math Olympiad geometry problems - gold-medal level. [46]
But the economic impact is still unclear
McKinsey estimates generative AI could add $2.6-4.4 trillion a year to the global economy. [47] But Goldman Sachs' chief economist said in 2025 that AI's actual impact on GDP so far has been “basically zero.” There's a big gap between what people project and what's actually happened.
Where the numbers are more concrete is individual productivity. A study of customer service agents showed AI boosted output by 14% on average - with newer employees improving by 34%. [58] Developers using GitHub Copilot finished tasks 55.8% faster. [60] But here's the catch: a review of 106 studies found that on average, humans and AI working together performed worse than the best of either one working alone. [57] Just adding AI to a workflow doesn't automatically make it better.
The Risks That Come With This
We might stop thinking as hard
Research shows that the more people use AI, the less they exercise critical thinking. [61] This makes intuitive sense. If you always let a tool do the thinking, that skill gets weaker over time. The worry isn't that AI makes us dumber overnight - it's that we gradually stop practicing the mental skills that matter most.
AI is getting very good at persuasion
AI is already as persuasive as humans on average. But when GPT-4 was given personal information about the person it was debating, it was 81.7% more likely to change their mind than a human debater was. [67] Personalized AI persuasion is more effective than anything we've tested with humans.
AI systems talking to each other
AI systems are increasingly being connected to other AI systems through tools like AutoGen, MCP, and A2A. This creates new kinds of problems. In market simulations, AI agents figured out how to fix prices together - without anyone telling them to. [72] One study found that a single bad input could spread through a network and compromise up to a million AI agents in a chain reaction. [71] Our current safety tools were built for single AI systems. They're not designed for networks of AIs working together.
The bigger picture on risk
Nick Bostrom's Superintelligence (2014) raised two ideas that are still central to the debate. [50] First: a super-intelligent system could have any goal, including ones we wouldn't want (the “orthogonality thesis”). Second: no matter what an AI ultimately wants, it will probably try to preserve itself and gather resources along the way (“instrumental convergence”). Stuart Russell's Human Compatible (2019) proposed a fix: build AI that wants to be switched off if humans decide to. [51]
On the governance side, the EU AI Act took effect in August 2024 with penalties up to €35 million. [55] 28 countries plus the EU signed the Bletchley Declaration on AI safety. [56] Anthropic introduced its Responsible Scaling Policy with safety levels that gate what models are allowed to do. [53] [54]
The Bottom Line
After going through all of this, three things stand out:
We can't properly measure what we're building. Half of the cognitive abilities that make up intelligence don't have proper AI evaluations yet. The benchmarks we do have saturate in a few years, suffer from contamination, and mostly test memorized knowledge rather than real reasoning. We're building systems faster than we can evaluate them.
AI is brilliant at some things and terrible at others. And the boundary between the two isn't intuitive. It can ace a PhD-level science exam and then fail at a task any human could do. This creates real danger when people assume it's good at everything because it's good at the thing they tested.
The next challenge is AI systems working together. As AI agents start calling other AI agents, the thing we need to understand isn't just one model - it's the behavior of the whole network. That's a different kind of problem, and we don't have the tools for it yet.
The question isn't really “when will we get AGI?” It's “do we even know what we mean by that, and would we recognize it if it arrived?” Right now, the honest answer to both is: not really.
References
Intelligence theory and psychometrics
- Spearman C (1904). “General intelligence, objectively determined and measured.” American Journal of Psychology, 15(2):201-293.
- Neisser U, Boodoo G, Bouchard TJ, et al. (1996). “Intelligence: Knowns and unknowns.” American Psychologist, 51(2):77-101.
- Gottfredson LS (1997). “Mainstream science on intelligence.” Intelligence, 24(1):13-23.
- Carroll JB (1993). Human Cognitive Abilities. Cambridge University Press.
- Cattell RB (1963). “Theory of fluid and crystallized intelligence.” Journal of Educational Psychology, 54(1):1-22.
- McGrew KS (2009). “CHC theory and the human cognitive abilities project.” Intelligence, 37(1):1-10.
- Gardner H (1983). Frames of Mind: The Theory of Multiple Intelligences. Basic Books.
- Waterhouse L (2023). “Why multiple intelligences theory is a neuromyth.” Frontiers in Psychology, 14:1217288.
- Visser BA, Ashton MC, Vernon PA (2006). “g and the measurement of Multiple Intelligences.” Intelligence, 34(5):507-510.
- Sternberg RJ, Detterman DK, eds. (1986). What Is Intelligence? Ablex.
Neuroscience of intelligence
- Haworth CMA, et al. (2010). “The heritability of general cognitive ability increases linearly from childhood to young adulthood.” Molecular Psychiatry, 15:1112-1120.
- Jung RE, Haier RJ (2007). “The Parieto-Frontal Integration Theory (P-FIT) of intelligence.” Behavioral and Brain Sciences, 30(2):135-154.
Formal and mathematical intelligence
- Legg S, Hutter M (2007). “Universal intelligence: A definition of machine intelligence.” Minds and Machines, 17(4):391-444.
- Hutter M (2005). Universal Artificial Intelligence. Springer.
- Kolmogorov AN (1965). “Three approaches to the quantitative definition of information.” Problems of Information Transmission, 1(1):1-7.
- Solomonoff RJ (1964). “A formal theory of inductive inference.” Information and Control, 7(1):1-22.
- Veness J, Ng KS, Hutter M, Uther W, Silver D (2011). “A Monte-Carlo AIXI approximation.” JAIR, 40(1):95-142.
AI benchmarks and evaluation
- Chollet F (2019). “On the measure of intelligence.” arXiv:1911.01547
- Chollet F, Knoop M, Kamradt G, et al. (2025). “ARC-AGI-2: A new challenge for frontier AI reasoning systems.” arXiv:2505.11831
- ARC Prize 2025 Technical Report (2026). arXiv:2601.10904
- Phan L, Gatti A, Han Z, et al. (2025). “Humanity's Last Exam.” Nature, 649:1139-1146.
- Hendrycks D, et al. (2021). “Measuring massive multitask language understanding.” ICLR 2021.
- Deng X, et al. (2024). “Investigating data contamination in modern benchmarks for large language models.” NAACL 2024.
AGI definitions and frameworks
- Gubrud M (1997). “Nanotechnology and international security.” Fifth Foresight Conference.
- Goertzel B, Pennachin C, eds. (2007). Artificial General Intelligence. Springer.
- Morris MR, Sohl-Dickstein J, Fiedel N, et al. (2024). “Levels of AGI for operationalizing progress on the path to AGI.” ICML, PMLR 235:36308-36321.
- Burnell R, Yamamori Y, Firat O, et al. (2026). “Measuring progress toward AGI: A cognitive framework.” Google DeepMind.
- Morris MR, et al. (2026). “Characterizing model jaggedness supports safety and usability.” Google DeepMind.
AI timelines and expert forecasts
- Grace K, Salvatier J, Dafoe A, et al. (2024). “Thousands of AI authors on the future of AI.” arXiv:2401.02843
- Metaculus (2025). “When will the first general AI system be devised?”
AI in science
- Jumper J, et al. (2021). “Highly accurate protein structure prediction with AlphaFold.” Nature, 596:583-589.
- Nobel Prize in Chemistry 2024. Nobel Prize Organization.
- Merchant A, et al. (2023). “Scaling deep learning for materials discovery.” Nature, 624:80-85.
- Romera-Paredes B, et al. (2024). “Mathematical discoveries from program search with large language models (FunSearch).” Nature, 625:468-475.
- Trinh TH, et al. (2024). “Solving olympiad geometry without human demonstrations (AlphaGeometry).” Nature, 625:476-482.
Economic and labor impact
- McKinsey Global Institute (2023). The Economic Potential of Generative AI.
- Mazeika M, et al. (2025). “Remote labor index: Measuring AI automation of remote work.” arXiv
- Patwardhan T, et al. (2025). “GDPVal: Evaluating AI model performance on real-world economically valuable tasks.” arXiv
Safety, risks, and governance
- Bostrom N (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
- Russell S (2019). Human Compatible: AI and the Problem of Control. Viking.
- Anthropic (2023). Anthropic's Responsible Scaling Policy.
- Anthropic (2025). Responsible Scaling Policy Version 3.0.
- EU AI Act (2024). Regulation (EU) 2024/1689.
- Bletchley Declaration (2023). AI Safety Summit.
Human-AI interaction
- Vaccaro M, Almaatouq A, Malone T (2024). “When combinations of humans and AI are useful.” Nature Human Behaviour, 8(12):2293-2303.
- Brynjolfsson E, Li D, Raymond L (2023). “Generative AI at work.” Quarterly Journal of Economics, 140(2):889-942.
- Dell'Acqua F, et al. (2023). “Navigating the jagged technological frontier.” Harvard Business School Working Paper 24-013.
- Peng S, et al. (2023). “The impact of AI on developer productivity: Evidence from GitHub Copilot.” arXiv
- Gerlich M (2025). “AI tools in society: Impacts on cognitive offloading and the future of critical thinking.” Societies, 15(1):6.
AI social cognition and persuasion
- Amodei D (2024). “Machines of Loving Grace.” Essay
- Salvi F, et al. (2025). “On the conversational persuasiveness of large language models.” Nature Human Behaviour, 9:1645-1653.
Multi-agent AI systems
- Hammond L, et al. (2025). “Multi-agent risks from advanced AI.” arXiv
- Lin J, Lim T, Montagu A (2024). “Collusive AI agents in market settings.” Working Paper.


