Measuring Machine Intelligence

I kept hearing people say we're close to AGI - or that we've already achieved it, or that it's impossible. I realized I didn't actually know what intelligence means in a formal sense, how we measure it in humans, or what the benchmarks for AI really test. So I went looking. What I found was a field where trillion-dollar investment decisions depend on definitions that nobody agrees on, benchmarks that keep breaking, and a gap between what AI can do on a test and what it can do in the real world. Here's what I learned.
What Is Intelligence, Anyway?
This sounds like it should have a straightforward answer. It doesn't. When two dozen prominent researchers were asked to define intelligence in 1986, they produced two dozen different definitions. [10] The American Psychological Association tried in 1996 and deliberately avoided giving a single definition, instead calling intelligence a “complex set of phenomena.” [2]
The most widely cited working definition comes from a 1997 statement signed by 52 researchers: [3]
Intelligence is a very general mental capability that involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience.
That's helpful, but notice how broad it is. It's basically saying intelligence is being good at thinking - which doesn't tell you much about how to measure it.
Spearman's g: the one number that keeps showing up
In 1904, Charles Spearman noticed something odd. When you give people a bunch of unrelated cognitive tests - vocabulary, spatial reasoning, arithmetic, pattern recognition - their scores are always positively correlated. People who do well on one tend to do well on others. This is called the positive manifold, and it's been called the most replicated finding in all of psychology. [1]
Spearman used a technique he invented called factor analysis to extract a single underlying factor from these correlations. He called it g (general intelligence). Mathematically, the idea is that each test score is a combination of g plus some test-specific ability:
X = Λf + ε
Think of it this way. You take a bunch of cognitive tests and get a score on each - that's X. Spearman's insight was that your scores aren't random. There's a hidden factor f (general intelligence, or g) pulling them all in the same direction. Λ captures how strongly each test is connected to that hidden factor - some tests (like abstract reasoning) are tightly linked to g, while others (like memorizing digits) are less so. And ε is everything else - luck, how you were feeling that day, test-specific skills.
When researchers run this math across large populations, the g factor alone explains 40-50% of the total variation in scores across all the different tests. That's a lot for a single number. [1]
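The positive manifold and the variance-explained figure are easy to reproduce in simulation. Here's a minimal sketch of the X = Λf + ε setup - the six tests, their loadings, and the sample size are all invented for illustration, not real psychometric data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 1,000 people taking 6 cognitive tests. Each score is a weighted
# dose of one hidden factor (g) plus test-specific noise: X = Λf + ε.
n_people = 1_000
g = rng.normal(size=n_people)                        # hidden factor f
loadings = np.array([0.8, 0.7, 0.7, 0.6, 0.6, 0.5])  # Λ, one entry per test
noise = rng.normal(size=(n_people, len(loadings)))   # ε
scores = np.outer(g, loadings) + noise               # X

# The positive manifold: every pair of tests is positively correlated.
corr = np.corrcoef(scores, rowvar=False)
print((corr > 0).all())

# Share of total variance captured by the largest factor (a rough g estimate).
eigvals = np.linalg.eigvalsh(corr)
share = eigvals[-1] / eigvals.sum()
print(f"top factor explains {share:.0%} of the variance")
```

With these made-up loadings the top factor lands in the same ballpark as the real-world 40-50% figure, even though every individual test is mostly noise.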
Today, the most accepted model is the Cattell-Horn-Carroll (CHC) model. [4] [5] [6] Think of it as a three-layer pyramid. At the top sits g. In the middle are 16 broad abilities like reasoning, memory, and processing speed. At the bottom are 80+ specific skills.
The distinction that matters most for AI is between two of those broad abilities. Fluid intelligence is your ability to solve problems you've never seen before - pure reasoning with no prior knowledge to lean on. Crystallized intelligence is what you know - facts, vocabulary, learned procedures. Here's the thing: most AI benchmarks test crystallized knowledge (what the model has seen in training). Very few test fluid reasoning (can it figure out something genuinely new?).
There's another well-known theory worth mentioning: Howard Gardner's multiple intelligences. He proposed in 1983 that there are eight independent types of intelligence - linguistic, logical-mathematical, spatial, musical, bodily-kinesthetic, interpersonal, intrapersonal, and naturalistic. [7] The idea became hugely popular in education. Schools designed entire curricula around it.
The problem is that when researchers tested whether these intelligences are actually independent, they found they're not. Visser et al. (2006) showed they all correlate with g - meaning they're not separate abilities, just different expressions of the same underlying factor. [9] Waterhouse (2023) went further, calling the whole theory a “neuromyth.” [8] Gardner himself conceded there was “little hard evidence” supporting it.
Some numbers worth knowing
If you test a large number of people, IQ scores form a normal distribution. The average is set at 100 and the standard deviation at 15, so a score of 115 is one standard deviation above average and 130 is two. About 95% of people fall between 70 and 130.
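Those percentages fall straight out of the normal distribution. A quick check, assuming the standard N(100, 15) parameterization:

```python
from math import erf, sqrt

def iq_between(lo: float, hi: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Fraction of a N(100, 15) population scoring in the interval [lo, hi]."""
    cdf = lambda x: 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))
    return cdf(hi) - cdf(lo)

print(f"85-115 (within 1 SD): {iq_between(85, 115):.1%}")   # 68.3%
print(f"70-130 (within 2 SD): {iq_between(70, 130):.1%}")   # 95.4%
print(f"above 130:            {iq_between(130, float('inf')):.1%}")  # 2.3%
```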
One thing that surprised me: how much of intelligence is genetic, and how that changes with age. In childhood, genes explain about 41% of the differences in intelligence between people. By adulthood, that number jumps to 75-80%. [11] Environment matters a lot when you're young, but its influence fades as you grow older.
Neuroscience adds another piece. Intelligence isn't located in one part of the brain. It depends on how well the frontal lobes (planning, decision-making) and parietal lobes (spatial processing, integration) talk to each other. It's about the wiring between regions, not any single “intelligence center.” [12]
The Mathematical View: Compression Is Intelligence
There's also a mathematical way to think about intelligence. The core idea is simple: intelligence is compression. If you can take something complex and describe it in a shorter form, you've understood its pattern.
This was formalized through something called Kolmogorov complexity. [15] It measures how complex a piece of data is by asking: what's the shortest computer program that can produce it? For example, the string “010101010101” is simple - you can generate it with a tiny program (“repeat 01 six times”). A truly random string can't be compressed at all - you'd need a program just as long as the string itself.
K(x) = min { |p| : U(p) = x }
K(x) is the Kolmogorov complexity of some data x. It's the length of the shortest program p that produces x on a universal Turing machine U. The catch: you can never be 100% sure you've found the shortest possible program. It's mathematically proven to be incomputable. [15]
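Since K(x) itself is incomputable, a common trick is to use an off-the-shelf compressor as a crude upper-bound proxy. A sketch using Python's zlib - the specific strings are just for illustration:

```python
import random
import zlib

def compressed_size(s: str) -> int:
    """Length of zlib's output: a computable upper-bound proxy for K(x)."""
    return len(zlib.compress(s.encode(), 9))

patterned = "01" * 500                 # "repeat 01 five hundred times"
random.seed(0)
noise = "".join(random.choice("01") for _ in range(1000))

print(compressed_size(patterned))      # tiny: the regularity is captured
print(compressed_size(noise))          # far larger: little structure to exploit
```

Both strings are 1,000 characters long, but the patterned one compresses to a handful of bytes while the random one barely shrinks - exactly the gap Kolmogorov complexity formalizes.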
Why does this matter for intelligence? Because compression and prediction are the same thing. If you can compress data, you've found the pattern. If you've found the pattern, you can predict what comes next. And predicting well is really what intelligence is about. Ray Solomonoff made this rigorous in 1964 [16] - he showed that the best possible predictor is one that favors simpler explanations over complex ones (shorter programs get higher weight). This is basically Occam's razor, expressed as math.
Putting a number on intelligence
In 2007, Shane Legg and Marcus Hutter looked at 71 different definitions of intelligence and boiled them all down to one idea: intelligence is how good you are at achieving goals across many different situations. [13] Then they wrote a formula for it:
Υ(π) = Σ_{μ∈E} 2^−K(μ) · V_μ^π
Here's what it says in plain English: take an agent (π). Test it in every possible environment (μ). Add up how well it does in each one - but give more weight to simpler environments (the 2^−K(μ) part). That total is its intelligence score. Why weight simpler environments more? Because most real-world problems have underlying structure. An agent that only solves bizarre, contrived edge cases but fails at everyday tasks isn't very intelligent. Weighting by simplicity is Occam's razor built into the formula. [13]
What makes this interesting is what it doesn't care about. It doesn't matter if the agent is a human, an AI, or something we haven't invented yet. It doesn't favor any particular task. It's a universal yardstick for intelligence. The catch: you can't actually compute it. It requires testing across infinitely many environments. So it exists as a theoretical ideal, not a practical tool.
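To see the shape of the formula, here's a toy version over a hand-picked finite set of environments. The environment names, K values, and V scores are all invented; a real Υ would sum over infinitely many environments with incomputable K:

```python
# Hypothetical environments: K is a stand-in description length (bits),
# V is the agent's measured value in that environment. All values invented.
environments = [
    ("navigate a small grid", 3, 0.9),
    ("two-armed bandit", 4, 0.8),
    ("chess endgame", 9, 0.6),
    ("contrived maze variant", 14, 0.1),
]

# Υ(π) = Σ 2^−K(μ) · V_μ^π : simpler environments dominate the total.
upsilon = sum(2 ** -k * v for _, k, v in environments)
print(f"toy universal intelligence score: {upsilon:.5f}")  # 0.16368
```

Notice that the contrived 14-bit environment contributes almost nothing: failing at it costs the agent virtually no score, while failing at the simple grid world would.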
Marcus Hutter extended this with AIXI [14] - a theoretical design for the most intelligent possible agent. AIXI always makes the best decision in any environment by considering every possible explanation for what it observes, weighted by simplicity. It's been proven optimal. It's also been proven impossible to actually build - it would require infinite computation. Researchers have built simplified versions (like MC-AIXI-CTW [17]), but they only work in very simple environments.
We have a mathematically precise definition of perfect intelligence. We just can't compute it. This is the fundamental tension in the field: we know what the destination looks like but have no map.
How We Test AI Today - and Where It Breaks
If we can't compute universal intelligence, we do the next best thing: we give AI tests. Lots of tests. The problem is that AI keeps passing them faster than we can make new ones.
The saturation problem
MMLU (Massive Multitask Language Understanding) was supposed to be a broad test of knowledge across 57 subjects. When it launched in 2021, the best models scored around 43%. By 2025, frontier models were clustered at 88-93% - essentially at or above human expert level - and the benchmark was excluded from the Vellum AI leaderboard as “outdated.” [27] Worse, scores can swing by up to 10% based on nothing more than how the prompt is formatted.
This pattern repeats. A benchmark launches, the community rallies around it, models improve rapidly, and within a few years it's saturated and no longer useful for distinguishing between systems. GPQA Diamond (PhD-level science) went from 36% to 94% in three years. AIME math problems went from 12% to near-perfect scores.
The contamination problem
There's also a trust problem with benchmarks. When researchers asked GPT-4 to guess missing answer options from MMLU questions, it got 57% of them exactly right. [29] That's a strong sign the model had seen these test questions during training. It wasn't reasoning through them - it was remembering them.
This is called data contamination, and it's a big deal. On GSM8K (a math benchmark), accuracy drops by 13% when you remove questions the model was likely trained on. It gets worse: some models are trained on translated versions of English benchmarks, which inflates their English scores without anyone noticing.
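One common way to screen for contamination is verbatim n-gram overlap between benchmark items and the training corpus. A minimal sketch - real pipelines use larger corpora, normalization, and fuzzier matching than this:

```python
def ngrams(text: str, n: int = 8) -> set:
    """All word-level n-grams in a text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(test_item: str, corpus: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that appear verbatim in the corpus."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus, n)) / len(item_grams)

# A benchmark question that leaked into training scores high;
# a genuinely unseen one scores zero.
corpus = "the quick brown fox jumps over the lazy dog near the riverbank at dawn"
print(overlap_score("quick brown fox jumps over the lazy dog near the", corpus))  # 1.0
print(overlap_score("an entirely different question that the corpus has never contained anywhere", corpus))  # 0.0
```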
Why this matters
If a model has seen the test during training, its score tells you how good it is at memorizing, not at thinking. Imagine a student who got a perfect score - but only because they had the answer key beforehand. The field is fighting this with private test sets (HLE, ARC-AGI-2), constantly refreshed benchmarks (LiveBench), and tools that detect contamination after the fact. But it's an ongoing arms race.
ARC-AGI: testing what benchmarks miss
François Chollet made an important argument in 2019: being good at a task doesn't mean you're intelligent. [23] You might just have a lot of practice or training data. What actually shows intelligence is how fast you can pick up something completely new from just a few examples. He called this skill-acquisition efficiency.
To test this, he created the Abstraction and Reasoning Corpus (ARC) - a set of grid puzzles where you see 2-3 examples of a pattern and have to figure out the rule. They test basic things like recognizing objects, counting, and understanding how shapes relate to each other. No memorization helps here - every puzzle is unique.
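In spirit, solving an ARC task means finding a transformation rule consistent with every example pair, then applying it to the test grid. A toy illustration with two invented candidate rules:

```python
# Hypothetical candidate rules; the puzzle's example pairs select the one
# that is consistent with every demonstration.
def mirror(grid):
    return [row[::-1] for row in grid]

def transpose(grid):
    return [list(col) for col in zip(*grid)]

train_pairs = [
    ([[1, 0], [0, 2]], [[0, 1], [2, 0]]),  # output = input mirrored left-right
    ([[3, 3, 0]], [[0, 3, 3]]),
]

def fits(rule, pairs):
    return all(rule(inp) == out for inp, out in pairs)

candidates = {"mirror": mirror, "transpose": transpose}
learned = [name for name, rule in candidates.items() if fits(rule, train_pairs)]
print(learned)               # ['mirror'] - the only rule both examples support
print(mirror([[5, 0, 7]]))   # [[7, 0, 5]] - applied to an unseen test grid
```

Real ARC solvers search a vastly larger rule space, but the structure is the same: a few examples, one consistent program, one held-out test. Memorization can't help because each puzzle's rule is new.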
The first version (ARC-AGI-1) is now mostly solved - OpenAI's o3 scored 87.5% in December 2024. So Chollet released a harder version, ARC-AGI-2, in 2025. [24] Every task was checked to make sure at least two humans could solve it in two tries or fewer. But at launch, no AI model scored above 5%. By early 2026, the best score is 54% (from a system called Poetiq, built on Gemini 3 Pro, costing $30 per task). Claude Opus 4.5 gets 37.6% at $2.20 per task. The ARC Prize 2025 drew 1,455 teams and over 15,000 entries. [25] The takeaway: just making models bigger isn't enough. Something fundamentally new is needed.
Humanity's Last Exam: the hardest test yet
Then there's Humanity's Last Exam (HLE), built by the Center for AI Safety and Scale AI. They asked about 1,000 experts from over 500 institutions across 50 countries to write the hardest questions they could - 2,500 questions across 100+ subjects. [26] The rule: if any current AI model could answer a question, it got thrown out. Only questions that stumped every model made it in.
When HLE launched in early 2025, GPT-4o scored 2.7% and o1 scored 8.0%. A year later, the scores have gone up but are still nowhere near human expert level (~90%): Claude Opus 4.6 reaches 53.1%, Gemini 3 Deep Think gets 48.4%, and GPT-5.4 hits 41.6%.
What Would AGI Actually Look Like?
Here's where things get political. Every major AI lab has its own definition of AGI, and the definition they choose conveniently shapes what they claim to have achieved.
The term was first used by Mark Gubrud in 1997 [33] and became widely known after a 2007 book by Ben Goertzel and Cassio Pennachin. [34] But to this day, there's no agreed definition. Each major AI lab has its own version:
Google DeepMind
- Level 1: Emerging - equal to an unskilled human
- Level 2: Competent - 50th percentile adult
- Level 3: Expert - 90th percentile
- Level 4: Virtuoso - 99th percentile
- Level 5: Superhuman - exceeds all humans
Crosses performance depth with breadth (narrow vs. general). Morris et al., 2024.
OpenAI
- L1: Chatbots - conversational AI
- L2: Reasoners - PhD-level problem solving
- L3: Agents - act autonomously for days
- L4: Innovators - advance scientific research
- L5: Organizations - do the work of a company
Defined as “outperforming humans at most economically valuable work.” OpenAI, July 2024.
Anthropic
- Avoids the term “AGI” entirely, describing “powerful AI” instead
- Intelligence surpassing Nobel laureates in most fields
- Compressing 100 years of scientific progress into 5-10
Focuses on safety thresholds (ASL framework) rather than capability milestones. Amodei, October 2024.
Google DeepMind's approach [35] is the most structured. They created a grid with two axes: how good a system is (from “Emerging” to “Superhuman”) and how broad it is (narrow vs. general). By this measure, today's top models - GPT-4, Gemini, Claude - count as “Emerging AGI” (Level 1 General). That means they're roughly as broad as an unskilled human, but can be expert-level on specific tasks.
DeepMind's 10 faculties of intelligence
In March 2026, DeepMind published what I think is the most important paper in this space. [36] Instead of asking “can AI pass this test?” they asked a better question: “what are the building blocks of intelligence, and how do we measure each one?”
They identified 10 core cognitive faculties, grounded in decades of cognitive science research. Think of them as the fundamental abilities that make up general intelligence. Some of them we can already test in AI. Others we barely know how to evaluate.
The pattern here is revealing. The abilities we can test - perception, generation, reasoning, problem solving, memory - are the ones current AI is already good at. They're the abilities that show up in benchmarks like MMLU, GPQA, and coding tests.
The abilities we can't properly test are a different story. Metacognition is knowing what you know and what you don't - something AI is famously bad at (it “hallucinates” confidently). Learning means improving from new experience in real time, not just during training. Executive function means managing a complex, multi-step project over days without losing track. Social cognition means understanding what another person is thinking and feeling from context. These are hard to test because they're hard to put into a multiple-choice format. But they're exactly the abilities that separate passing a test from functioning in the real world.
The paper's framework is also mechanism-agnostic - it doesn't care how a system achieves these abilities, only whether it does. And it requires comparing AI performance against human baselines from a representative adult population, so you get an honest picture of where the system falls on the human distribution.
OpenAI defines AGI as systems that can outperform humans at most economically valuable work. Anthropic's Dario Amodei doesn't use the term at all. He talks about “powerful AI” instead - something smarter than Nobel Prize winners in most fields, capable of compressing a century of scientific progress into 5-10 years. [63]
This disagreement isn't academic. Government regulation, corporate strategy, and trillion-dollar investments all depend on how you define AGI. If it means “passes benchmarks,” we're almost there. If it means “can do everything a human can,” we're not close.
Where AI Actually Stands Today
The picture is clear: AI crushes structured academic tests but struggles with messy real-world tasks. Researchers call this the jagged frontier. [37] The system can be expert-level at one thing and completely fail at something that seems simpler. The edges of what it can and can't do are unpredictable.
A study at Boston Consulting Group tested this with 758 real consultants. [59] When the task was something AI is good at, people using GPT-4 did more work, faster, and at higher quality. But when the task was outside AI's comfort zone, people using AI actually did 19 percentage points worse than those working without it. They trusted the AI's answer when they shouldn't have.
The numbers get more stark at scale. In blind tests across 44 occupations, AI matched or beat human experts on about half the professional tasks - and did it 50-300x faster. [49] Sounds impressive. But when researchers tested AI agents on 240 actual remote work projects, the best one could only automate 2.5% of them. [48] Doing well on a test and doing well on the job are very different things.
When do researchers think we'll get AGI?
A 2024 survey asked 2,778 AI researchers this question. [39] The median answer: 2047 - which was 13 years earlier than the same survey had found two years before. There's a 10% chance it happens by 2027, according to the respondents. Prediction markets on Metaculus put the first general AI announcement at March 2028, with Alphabet as the most likely lab (35.9%), followed by OpenAI (20.6%) and Anthropic (19.1%). [40]
What AI Has Already Done
Whatever you think about AGI timelines, AI is already making contributions that would have been unimaginable a decade ago.
AlphaFold figured out how proteins fold - a problem scientists had been stuck on for 50 years. It earned Demis Hassabis and John Jumper the 2024 Nobel Prize in Chemistry. [42] [43] The database now has predicted structures for 214 million proteins across over 1 million species. GNoME predicted 2.2 million new crystal structures, of which 381,000 turned out to be stable materials that could actually be made. [44] FunSearch was the first time an LLM made a verifiable new discovery in mathematics - beating a 20-year-old record on the cap set problem. [45] And AlphaGeometry 2 solved 42 out of 50 International Math Olympiad geometry problems - gold-medal level. [46]
But the economic impact is still unclear
McKinsey estimates generative AI could add $2.6-4.4 trillion a year to the global economy. [47] But Goldman Sachs' chief economist said in 2025 that AI's actual impact on GDP so far has been “basically zero.” There's a big gap between what people project and what's actually happened.
Where the numbers are more concrete is individual productivity. A study of customer service agents showed AI boosted output by 14% on average - with newer employees improving by 34%. [58] Developers using GitHub Copilot finished tasks 55.8% faster. [60] But here's the catch: a review of 106 studies found that on average, humans and AI working together performed worse than the best of either one working alone. [57] Just adding AI to a workflow doesn't automatically make it better.
The Risks That Come With This
We might stop thinking as hard
Research shows that the more people use AI, the less they exercise critical thinking. [61] This makes intuitive sense. If you always let a tool do the thinking, that skill gets weaker over time. The worry isn't that AI makes us dumber overnight - it's that we gradually stop practicing the mental skills that matter most.
AI is getting very good at persuasion
AI is already as persuasive as humans on average. But when GPT-4 was given personal information about the person it was debating, it was 81.7% more likely to change their mind than a human debater was. [67] Personalized AI persuasion is more effective than anything we've tested with humans.
AI systems talking to each other
AI systems are increasingly being connected to other AI systems through tools like AutoGen, MCP, and A2A. This creates new kinds of problems. In market simulations, AI agents figured out how to fix prices together - without anyone telling them to. [72] One study found that a single bad input could spread through a network and compromise up to a million AI agents in a chain reaction. [71] Our current safety tools were built for single AI systems. They're not designed for networks of AIs working together.
The bigger picture on risk
Nick Bostrom's Superintelligence (2014) raised two ideas that are still central to the debate. [50] First: a super-intelligent system could have any goal, including ones we wouldn't want (the “orthogonality thesis”). Second: no matter what an AI ultimately wants, it will probably try to preserve itself and gather resources along the way (“instrumental convergence”). Stuart Russell's Human Compatible (2019) proposed a fix: build AI that wants to be switched off if humans decide to. [51]
On the governance side, the EU AI Act took effect in August 2024 with penalties up to €35 million. [55] 28 countries plus the EU signed the Bletchley Declaration on AI safety. [56] Anthropic introduced its Responsible Scaling Policy with safety levels that gate what models are allowed to do. [53] [54]
The Bottom Line
After going through all of this, three things stand out:
We can't properly measure what we're building. Half of the cognitive abilities that make up intelligence don't have proper AI evaluations yet. The benchmarks we do have saturate in a few years, suffer from contamination, and mostly test memorized knowledge rather than real reasoning. We're building systems faster than we can evaluate them.
AI is brilliant at some things and terrible at others. And the boundary between the two isn't intuitive. It can ace a PhD-level science exam and then fail at a task any human could do. This creates real danger when people assume it's good at everything because it's good at the thing they tested.
The next challenge is AI systems working together. As AI agents start calling other AI agents, the thing we need to understand isn't just one model - it's the behavior of the whole network. That's a different kind of problem, and we don't have the tools for it yet.
The question isn't really “when will we get AGI?” It's “do we even know what we mean by that, and would we recognize it if it arrived?” Right now, the honest answer to both is: not really.
References
Intelligence theory and psychometrics
- Spearman C (1904). “General intelligence, objectively determined and measured.” American Journal of Psychology, 15(2):201-293.
- Neisser U, Boodoo G, Bouchard TJ, et al. (1996). “Intelligence: Knowns and unknowns.” American Psychologist, 51(2):77-101.
- Gottfredson LS (1997). “Mainstream science on intelligence.” Intelligence, 24(1):13-23.
- Carroll JB (1993). Human Cognitive Abilities. Cambridge University Press.
- Cattell RB (1963). “Theory of fluid and crystallized intelligence.” Journal of Educational Psychology, 54(1):1-22.
- McGrew KS (2009). “CHC theory and the human cognitive abilities project.” Intelligence, 37(1):1-10.
- Gardner H (1983). Frames of Mind: The Theory of Multiple Intelligences. Basic Books.
- Waterhouse L (2023). “Why multiple intelligences theory is a neuromyth.” Frontiers in Psychology, 14:1217288.
- Visser BA, Ashton MC, Vernon PA (2006). “g and the measurement of Multiple Intelligences.” Intelligence, 34(5):507-510.
- Sternberg RJ, Detterman DK, eds. (1986). What Is Intelligence? Ablex.
Neuroscience of intelligence
- Haworth CMA, et al. (2010). “The heritability of general cognitive ability increases linearly from childhood to young adulthood.” Molecular Psychiatry, 15:1112-1120.
- Jung RE, Haier RJ (2007). “The Parieto-Frontal Integration Theory (P-FIT) of intelligence.” Behavioral and Brain Sciences, 30(2):135-154.
Formal and mathematical intelligence
- Legg S, Hutter M (2007). “Universal intelligence: A definition of machine intelligence.” Minds and Machines, 17(4):391-444.
- Hutter M (2005). Universal Artificial Intelligence. Springer.
- Kolmogorov AN (1965). “Three approaches to the quantitative definition of information.” Problems of Information Transmission, 1(1):1-7.
- Solomonoff RJ (1964). “A formal theory of inductive inference.” Information and Control, 7(1):1-22.
- Veness J, Ng KS, Hutter M, Uther W, Silver D (2011). “A Monte-Carlo AIXI approximation.” JAIR, 40(1):95-142.
AI benchmarks and evaluation
- Chollet F (2019). “On the measure of intelligence.” arXiv:1911.01547
- Chollet F, Knoop M, Kamradt G, et al. (2025). “ARC-AGI-2: A new challenge for frontier AI reasoning systems.” arXiv:2505.11831
- ARC Prize 2025 Technical Report (2026). arXiv:2601.10904
- Phan L, Gatti A, Han Z, et al. (2025). “Humanity's Last Exam.” Nature, 649:1139-1146.
- Hendrycks D, et al. (2021). “Measuring massive multitask language understanding.” ICLR 2021.
- Deng X, et al. (2024). “Investigating data contamination in modern benchmarks for large language models.” NAACL 2024.
AGI definitions and frameworks
- Gubrud M (1997). “Nanotechnology and international security.” Fifth Foresight Conference.
- Goertzel B, Pennachin C, eds. (2007). Artificial General Intelligence. Springer.
- Morris MR, Sohl-Dickstein J, Fiedel N, et al. (2024). “Levels of AGI for operationalizing progress on the path to AGI.” ICML, PMLR 235:36308-36321.
- Burnell R, Yamamori Y, Firat O, et al. (2026). “Measuring progress toward AGI: A cognitive framework.” Google DeepMind.
- Morris MR, et al. (2026). “Characterizing model jaggedness supports safety and usability.” Google DeepMind.
AI timelines and expert forecasts
- Grace K, Salvatier J, Dafoe A, et al. (2024). “Thousands of AI authors on the future of AI.” arXiv:2401.02843
- Metaculus (2025). “When will the first general AI system be devised?”
AI in science
- Jumper J, et al. (2021). “Highly accurate protein structure prediction with AlphaFold.” Nature, 596:583-589.
- Nobel Prize in Chemistry 2024. Nobel Prize Organization.
- Merchant A, et al. (2023). “Scaling deep learning for materials discovery.” Nature, 624:80-85.
- Romera-Paredes B, et al. (2024). “Mathematical discoveries from program search with large language models (FunSearch).” Nature, 625:468-475.
- Trinh TH, et al. (2024). “Solving olympiad geometry without human demonstrations (AlphaGeometry).” Nature, 625:476-482.
Economic and labor impact
- McKinsey Global Institute (2023). The Economic Potential of Generative AI.
- Mazeika M, et al. (2025). “Remote labor index: Measuring AI automation of remote work.” arXiv
- Patwardhan T, et al. (2025). “GDPVal: Evaluating AI model performance on real-world economically valuable tasks.” arXiv
Safety, risks, and governance
- Bostrom N (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
- Russell S (2019). Human Compatible: AI and the Problem of Control. Viking.
- Anthropic (2023). Anthropic's Responsible Scaling Policy.
- Anthropic (2025). Responsible Scaling Policy Version 3.0.
- EU AI Act (2024). Regulation (EU) 2024/1689.
- Bletchley Declaration (2023). AI Safety Summit.
Human-AI interaction
- Vaccaro M, Almaatouq A, Malone T (2024). “When combinations of humans and AI are useful.” Nature Human Behaviour, 8(12):2293-2303.
- Brynjolfsson E, Li D, Raymond L (2023). “Generative AI at work.” Quarterly Journal of Economics, 140(2):889-942.
- Dell'Acqua F, et al. (2023). “Navigating the jagged technological frontier.” Harvard Business School Working Paper 24-013.
- Peng S, et al. (2023). “The impact of AI on developer productivity: Evidence from GitHub Copilot.” arXiv
- Gerlich M (2025). “AI tools in society: Impacts on cognitive offloading and the future of critical thinking.” Societies, 15(1):6.
AI social cognition and persuasion
- Amodei D (2024). “Machines of Loving Grace.” Essay
- Salvi F, et al. (2025). “On the conversational persuasiveness of large language models.” Nature Human Behaviour, 9:1645-1653.
Multi-agent AI systems
- Hammond L, et al. (2025). “Multi-agent risks from advanced AI.” arXiv
- Lin J, Lim T, Montagu A (2024). “Collusive AI agents in market settings.” Working Paper.


