Intelligence Is Compression, Not Experience

Why Intelligence Advances in Discontinuities


Introduction: Experience Is Not the Same as Understanding

In the previous essay, we argued that intelligence is best understood as learning rate under novelty. When familiar strategies fail, the intelligent system is the one that adapts fastest. Intelligence scales not by thinking harder in isolation, but by parallel exploration – many agents, many trials, many hypotheses tested at once.

That framing explains a great deal. It explains why evolution looks intelligent without thinking, why markets outperform individuals, and why modern AI systems feel powerful even when they do not learn in real time. It suggests a future in which intelligence emerges from swarms of smaller systems exploring in parallel, not from a single superintellect.

But there is a quiet assumption embedded in that story. It assumes that useful structure naturally emerges from experience – that if you explore enough, the signal will rise out of the noise on its own.

This essay challenges that assumption.

Experience alone does not produce intelligence. Unchecked experience produces noise, confusion, and brittle beliefs. The true bottleneck is not exploration but compression – not what a system sees, but what it keeps, and how it reorganizes what it has already seen.

This is not a new idea in the abstract. Information theorists, cognitive scientists, and a handful of AI researchers have long argued that compression is central to intelligence. What this essay attempts is a synthesis: the claim that intelligence advances through discontinuous compression – sudden reorganizations that collapse many experiences into reusable abstractions – and that this mechanism, not processing power, is often the rate-limiting step in the transition from narrow competence to general intelligence.

A clarification before we proceed. Throughout this essay, “compression” does not mean arbitrary data reduction – shrinking a file or hashing an input. It means the formation of reusable structure that improves future prediction and learning. It is what happens when a child realizes that buttons, switches, and levers are all instances of a single principle. It is what happens when a scientist discovers that two seemingly unrelated phenomena obey the same equation. This distinction – between compression as data reduction and compression as structure formation – will matter throughout the argument.


I. Compiled Intelligence

It is tempting to equate intelligence with compute. More parameters, more FLOPs, deeper networks. But compute explains far less than it appears to.

A peregrine falcon striking a pigeon at 200 miles per hour is integrating wind speed, prey trajectory, and aerodynamics in real time. A border collie herding sheep is reading subtle cues and anticipating movements before they happen. These are intelligent behaviors, yet much of the competence on display has already been compiled – encoded in millions of years of evolution, compressed into instinct, refined by practice. What remains is largely execution.

I call this compiled intelligence: behaviors whose learning cost was incurred earlier, elsewhere, or over much longer timescales. The term borrows from computer science, where compilation translates high-level instructions into efficient machine code. Compiled intelligence is learning that has been translated into fast, automatic performance. The exploration happened in the past. What you see now is the optimized output.

Evolution is compiled intelligence at the species level. Expertise is compiled intelligence at the individual level. A chess grandmaster does not calculate every possible move from scratch – decades of study have been compressed into pattern recognition so efficient that the right move often “feels” obvious. Herbert Simon and William Chase demonstrated this in the early 1970s: grandmasters don’t have better memories in general. They have vastly better structured memories for chess positions. Show them random piece arrangements, and their recall is no better than novices’. The intelligence was in the organization of memory, not in raw storage capacity.

In both cases, the system appears intelligent precisely because it is not learning. The learning already happened. The compression is complete.
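To see what “organized memory” buys in mechanical terms, here is a toy sketch of the chunking idea behind the Simon and Chase result. The chunk inventory, the seven-chunk limit, and all position data are illustrative assumptions, not their experimental materials; the point is only that a fixed chunk budget reproduces the expert advantage on structured positions and erases it on random ones.

```python
# Toy chunking model: recall is capped at a fixed number of chunks, so
# whoever covers a position with fewer, larger chunks remembers more pieces.
CHUNK_LIMIT = 7  # rough working-memory capacity, in chunks

def recall(position, chunk_book):
    """Greedily cover the position with known chunks; leftover pieces
    each consume a whole chunk of capacity on their own."""
    remembered, used = set(), 0
    for chunk in chunk_book:
        if used >= CHUNK_LIMIT:
            break
        if chunk <= position:          # the full pattern is on the board
            remembered |= chunk
            used += 1
    for piece in sorted(position - remembered):
        if used >= CHUNK_LIMIT:
            break
        remembered.add(piece)
        used += 1
    return remembered

# Pieces are (square, piece) pairs; the expert's book holds common structures.
castled_king = frozenset({("f2", "P"), ("g2", "P"), ("h2", "P"), ("g1", "K")})
pawn_chain = frozenset({("d4", "P"), ("e5", "P"), ("c3", "P")})
expert_book = [castled_king, pawn_chain]

meaningful = set(castled_king) | set(pawn_chain) | {("f3", "N"), ("d1", "Q")}
random_pos = {("a5", "K"), ("h1", "P"), ("c6", "Q"), ("f7", "P"), ("b8", "B"),
              ("d2", "N"), ("g5", "R"), ("a2", "B"), ("e1", "N"), ("h4", "P")}

print(len(recall(meaningful, expert_book)))  # 9 of 9: two chunks plus two singles
print(len(recall(random_pos, expert_book)))  # 7 of 10: capped at the chunk limit
```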

General intelligence reveals itself at the margins – when compiled solutions fail. When priors break, when habits are insufficient, when the environment presents something the system’s existing compressions do not cover. Compute increases how much a system can process. Capability describes what it can already do. Intelligence describes how quickly a system reorganizes itself when what it already knows is no longer enough.

Most of what we measure – in schools, in AI benchmarks, in job interviews – is compiled intelligence. We test what someone already knows. We reward fluency. We evaluate performance within existing representations. We will return to this measurement problem later.


II. Experience Is Mostly Noise

Biological systems experience vastly more data than they can possibly retain. The visual system alone is flooded with far more information per second than any organism could store, and most of it is irrelevant, redundant, or misleading. If intelligence were a function of exposure, organisms would grow more intelligent simply by remaining alive.

They do not. Intelligence depends on ruthless selectivity – on knowing what to ignore.

Consider hyperthymesia – highly superior autobiographical memory. People with this condition can recall virtually every day of their life in vivid detail. You might expect this to confer extraordinary intelligence. It does not. The condition is a useful reminder that remembering more is not the same as understanding more.

At the other end of the spectrum, infants are exposed to enormous quantities of sensory data from birth, yet they do not learn in proportion to exposure. What makes early learning effective is not volume but structure – the attention, repetition, and caregiver interactions that highlight what matters and suppress what does not. Developmental psychologists have documented this extensively: infants learn language not from passive exposure to speech, but from structured, interactive, attention-guided exchanges. The social scaffolding is itself a compression mechanism.

In AI systems, the failure to compress manifests differently depending on the architecture – as systematic bias in traditional ML models, as hallucination in large language models. Both are symptoms of the same deficit, and we will examine them in detail later.

The learning rate framework from Essay 2 tells us intelligence is the speed of adaptation under novelty. This essay adds a constraint: that speed depends on the quality of compression. A system that accumulates experience without compressing it does not adapt faster. It becomes more confidently wrong.


III. Why Learning Does Not Improve Smoothly

If experience automatically produced intelligence, learning would be smooth. Each data point would yield a marginal improvement. Performance would rise predictably, like filling a bucket.

That is not how learning looks. It looks lumpy. Long plateaus are followed by sudden jumps – moments when something “clicks.” After that click, progress accelerates. The same experiences that previously yielded little insight suddenly become informative.

This pattern has been documented across radically different domains.

Piaget argued that children do not gradually acquire understanding but pass through qualitatively distinct stages, with transitions that can be surprisingly abrupt. A child who yesterday could not grasp that the amount of water stays the same when poured into a taller glass suddenly understands conservation and applies it across contexts. Whether or not Piaget’s specific stage model is accepted in full – modern developmental psychology has revised many of his claims – the broader insight remains influential: cognitive development involves qualitative reorganizations, not just additive growth.

The same pattern appears in skill acquisition. The “power law of practice” shows that performance improves rapidly at first, then plateaus, then sometimes jumps again. The power law itself is an empirical regularity, not a theory of mechanism. But one plausible explanation for the plateaus is representational reorganization – the underlying structure is restructuring even as surface performance stalls.
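For concreteness, the textbook form of the law (following Newell and Rosenbloom’s analysis of practice data) is

$$T_N = A + B\,N^{-\alpha},$$

where $T_N$ is the time to perform the task on trial $N$, $A$ is an asymptotic floor, $B$ is the initial gap above that floor, and $\alpha$ is the learning rate. Note what a curve of this form can and cannot do: it predicts plateaus as $N$ grows, but never jumps. The observed jumps are departures from the law – which is precisely what makes them interesting.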

In 2022, Power et al. documented a striking version of this in neural networks: “grokking.” Networks trained on mathematical tasks would first memorize the training data, then – after continued training far past memorization – suddenly snap into genuine generalization. Thousands of training steps with no visible improvement, then an abrupt phase transition. At minimum, it demonstrates that generalization can emerge abruptly rather than monotonically, consistent with internal representational reorganization.
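For readers who want to see the shape of the experiment, here is a minimal sketch of the grokking setup in PyTorch. The task (addition mod 97), the small MLP, the 50/50 train split, and all hyperparameters are illustrative assumptions rather than Power et al.’s exact configuration; strong weight decay and training far past memorization are the ingredients most consistently associated with the effect.

```python
import torch
import torch.nn as nn

P = 97  # task: given (a, b), predict (a + b) mod P

# Full dataset of input pairs and labels, split 50/50 at random.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
split = len(pairs) // 2
train_idx, val_idx = perm[:split], perm[split:]

class ModAddNet(nn.Module):
    def __init__(self, p=P, d=128):
        super().__init__()
        self.embed = nn.Embedding(p, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(), nn.Linear(256, p))

    def forward(self, ab):
        return self.mlp(self.embed(ab).flatten(1))  # concatenate both embeddings

model = ModAddNet()
# Strong weight decay is the ingredient most consistently implicated in grokking.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(50_000):
    batch = train_idx[torch.randint(0, split, (512,))]
    loss = loss_fn(model(pairs[batch]), labels[batch])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1_000 == 0:
        with torch.no_grad():
            val_acc = (model(pairs[val_idx]).argmax(1) == labels[val_idx]).float().mean()
        # Typical trajectory: training loss collapses early (memorization);
        # validation accuracy sits near chance for a long plateau, then jumps.
        print(f"step {step:6d}  train_loss {loss.item():.4f}  val_acc {val_acc.item():.3f}")
```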

These observations point to a fundamental property of learning systems. For extended periods, a system optimizes locally within an existing representation – adjusting weights, refining patterns, accumulating small gains within its current framework. Then, occasionally, the representation itself reorganizes. Multiple experiences collapse into a single abstraction. The search space shrinks. Learning proceeds faster – not because the system has more data, but because it has a better way to organize it.

Gradients optimize within a representation. Jumps create new ones.

Intelligence, then, is not a curve. It is a staircase.

Not all learning is discontinuous. Much of skill refinement is genuinely incremental. But many of the most consequential advances – the transitions that change what a system can learn next – follow this staircase pattern.


IV. The Click: From Correlation to Structure

Anyone who has learned something genuinely difficult recognizes the moment it suddenly becomes easy.

In professional football, there is a universal phrase for this: the game “slows down.” Quarterbacks routinely describe this transition during their first or second NFL season. After his rookie year, Raiders quarterback Derek Carr explained it this way: before a full season, he understood the game from a film perspective – he could analyze what he saw on tape. Afterward, he had a “from where I stood” perspective. He knew what worked. He knew what he could and couldn’t get away with. The game hadn’t changed speed. His representation of it had.

This is compression. Before the transition, a young quarterback processes each defensive alignment as a novel situation – eleven defenders in a configuration he must decode in real time against memorized rules. After the transition, many alignments have collapsed into a small number of meaningful categories, each with a known response. He no longer decodes the defense. He recognizes it – the way a fluent reader recognizes a word without sounding out each letter. What looks like faster thinking is actually less thinking – less search needed because the compressed structure is doing the work.

A similar moment occurs in early childhood. A child presses a button and a light turns on. At first, this is just one event among thousands – an unexplained correlation. Then something shifts. The child forms a template: pressable things cause effects. Immediately, behavior changes. She begins pressing everything – buttons, switches, her parent’s nose, the cat.

This is not confusion. It is abstraction in action. Many separate episodes have collapsed into a single reusable structure. The child is not reasoning by analogy. She has extracted a rule and is testing its scope.

This is why new abstractions initially overgeneralize. The child presses things that are not buttons. She calls every four-legged animal “dog.” Overgeneralization is not error – it is evidence that compression has occurred. The system is probing the boundaries of its new structure. Over time, the abstraction is refined. But the critical event is the creation of the abstraction itself.

The “aha” moment is not a feeling. It is a structural reorganization of memory.


V. What a Template Actually Is

We have been using the word “template” loosely. It deserves a precise definition, because the concept is doing real theoretical work.

A template is a compressed representation that organizes experience – a structure that explains many observations at once, guides attention toward relevant information, and accelerates future learning by constraining the space of possibilities.

Consider “birds fly.” This is not just a generalization. It is a predictive structure that shapes how you process new information. When you see an unfamiliar bird, you expect it to fly. You allocate attention accordingly – wings, weight, behavior near ledges. When the bird does not fly, that violation is informative and triggers refinement.

This is close to what cognitive scientists call a schema – a term Frederic Bartlett introduced in his 1932 work on memory. Bartlett showed that people do not remember stories verbatim. They remember them through schemas – pre-existing structures that shape what is retained, distorted, and forgotten. Memory is not a recording. It is reconstruction guided by compressed structure.

In machine learning, the closest analogue is a learned representation – the internal features a neural network develops during training. A network trained on faces does not store individual faces; it learns a compressed set of features that can reconstruct or distinguish any face it encounters. These features are, functionally, templates.

The key properties:

Compression. It explains many observations with a single structure.

Prediction. It generates expectations about unobserved situations – extending beyond the data that created it.

Attention guidance. It directs the system toward relevant information. Once you have the template “pressable things cause effects,” you notice buttons. Before that template, buttons were visual background.

Revisability. It can be refined or replaced when evidence demands it. Templates are working hypotheses encoded in the structure of memory.

Templates are not summaries of the past. They are compressed bets about the future.
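To make the four properties concrete, here is a toy sketch in code. The class, its fields, and the crude relevance filter are illustrative inventions, not a claim about how brains or networks implement templates; the point is that one small structure can exhibit compression, prediction, attention guidance, and revisability at once.

```python
from dataclasses import dataclass, field

@dataclass
class Template:
    """One compressed structure standing in for many episodes (compression)."""
    rule: str                                   # e.g. "pressable things cause effects"
    relevant: frozenset                         # features the rule cares about
    confidence: float = 0.6
    contexts: set = field(default_factory=set)  # situations where it has held

    def predict(self, situation: dict) -> bool:
        """Prediction: extend the rule to situations not yet observed."""
        return self.confidence > 0.5 and bool(self.attend(situation))

    def attend(self, situation: dict) -> dict:
        """Attention guidance: surface only the features the rule deems relevant."""
        return {k: v for k, v in situation.items() if k in self.relevant}

    def revise(self, context: str, held: bool):
        """Revisability: strengthen on confirmation, weaken on violation."""
        if held:
            self.contexts.add(context)
            self.confidence = min(1.0, self.confidence + 0.05)
        else:
            self.confidence = max(0.0, self.confidence - 0.15)

press = Template("pressable things cause effects", frozenset({"pressable"}))
for context, held in [("button", True), ("light switch", True),
                      ("lever", True), ("parent's nose", False)]:
    press.revise(context, held)   # overgeneralize, then refine the scope
print(press.predict({"pressable": True, "color": "red"}))  # True: structure survives
```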


VI. Intelligence as Discontinuous Compression

We can now state the core claim with precision.

Intelligence advances not through the smooth accumulation of experience, but through discontinuous compression – sudden reorganizations that collapse many experiences into reusable abstractions.

Within a template, learning is incremental. Between templates, it is discontinuous. The most important gains occur at these transitions – when many experiences collapse into a single structure that can be reused across contexts.

Essay 2 emphasized parallel exploration as the engine of intelligence at scale. This essay adds the complementary constraint: search without compression is noise; compression without search is stagnation. The claim is not that compression explains everything – useful compression also depends on action, feedback, embodiment, and the search process itself. But compression is the bottleneck we most consistently underestimate.

Positioning Within Existing Frameworks

This claim has deep roots, and intellectual honesty requires acknowledging them.

Jürgen Schmidhuber developed a formal framework linking compression to curiosity and creativity. In his formulation, an intelligent agent seeks experiences that improve its ability to compress its history of observations – driven not by raw novelty, but by compressibility. His compression progress theory explains why we find certain things beautiful or interesting: they are situations where our internal compressor is making rapid gains.
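Stated loosely in symbols (a paraphrase of the idea, not Schmidhuber’s exact notation): if $C(h \mid \theta)$ is the cost in bits of encoding the agent’s observation history $h$ under compressor parameters $\theta$, the intrinsic reward at time $t$ is the compression progress

$$r_{\text{int}}(t) \;\propto\; C(h_{\le t} \mid \theta_{t-1}) - C(h_{\le t} \mid \theta_t).$$

Already-compressed regularities yield nothing (boredom), and incompressible noise yields nothing (indifference); reward concentrates exactly where learnable structure remains.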

Marcus Hutter took a more theoretical approach with AIXI, defining the optimal intelligent agent through Solomonoff induction – a formalization of Occam’s razor grounded in Kolmogorov complexity theory. Very roughly, the framework favors explanations that capture observations with maximal simplicity and predictive power.

The Minimum Description Length (MDL) principle, developed by Jorma Rissanen, makes a related argument: the best model minimizes the combined length of the model description and the data encoded using that model.
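In symbols, with $L(\cdot)$ denoting description length in bits, MDL selects

$$\hat{M} \;=\; \arg\min_{M}\,\big[\,L(M) + L(D \mid M)\,\big],$$

where the first term penalizes complex models and the second penalizes poor fit to the data $D$. A template, in this vocabulary, is a model whose own description is short and whose residuals are small.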

These frameworks are important, and my argument is consistent with them. But the specific claim here is narrower and more mechanistic.

What matters is not compression in the information-theoretic sense of finding shorter codes, but hierarchical, reusable compression – the formation of abstractions that organize many experiences into a single structure, operate across contexts, and serve as building blocks for further abstraction. A lookup table compresses data efficiently without creating any reusable structure. A hash function compresses input without enabling future learning. These are compression in the technical sense, but they are not the mechanism of intelligence.

The mechanism is the formation of templates that are themselves composable – abstractions built from other abstractions, creating a hierarchy that represents increasingly complex phenomena with increasingly compact structures. This is what Piaget was describing with the progression from sensorimotor schemas to formal operational thought. It is what mathematicians do when they build theorems on previously proven lemmas.

The difference between mere compression and intelligent compression is the difference between a zip file and a theory.


VII. What Triggers the Jump?

If intelligence advances through discontinuous reorganization, what causes the reorganization?

This is an honest question with an incomplete answer. No one fully understands the mechanism. But several contributing factors are well-documented.

Accumulated contradiction. Thomas Kuhn’s account of scientific revolutions describes a pattern strikingly similar to the individual learning staircase. Normal science accumulates anomalies – observations that don’t fit the current framework. For long periods, these are explained away or ignored. Then, at some threshold, the weight of contradictions triggers a paradigm shift. The same dynamic appears in individual learning. A child’s template “all four-legged animals are dogs” degrades gradually as she encounters enough cats, horses, and goats – until reorganization becomes less costly than continued patching.

Sleep and offline consolidation. A large body of research by Jan Born, Robert Stickgold, and others suggests that sleep is not merely restorative but an active consolidation process. Replay of recent experiences appears especially important during slow-wave sleep, when the hippocampus re-presents recent events to the neocortex for gradual integration. REM sleep may play a complementary role in associative recombination – testing distant connections and sometimes linking previously unrelated memories. Studies have repeatedly shown that performance on learning tasks improves after sleep, even without additional practice, and that this improvement is specifically associated with structural insight – participants discover shortcuts and rules they missed during waking practice. One of the most important phases of intelligence is one during which the system is not interacting with the world at all.

Structural analogy and transfer. Sometimes the trigger is the recognition that two seemingly unrelated domains share a common structure. The physicist who realizes that the equations governing heat flow also describe the diffusion of particles has not learned a new fact – she has discovered that two existing templates are instances of a single, deeper abstraction. This compression across domains is one of the most powerful forms of learning, and it is almost entirely discontinuous.

Deliberate simplification. Constraints – limitations on memory, time, or processing capacity – can actually promote compression. Richard Feynman’s insistence on explaining any concept in simple terms was not pedagogical modesty; it was compression discipline. If you cannot explain it simply, your abstraction is not yet complete.

These triggers share a common structure: they create conditions under which the existing representation becomes inadequate, or conditions under which the system has the opportunity to reorganize without the pressure of real-time performance. Compression requires both pressure and space – the old structure must be failing, and the system must have room to restructure without immediate demands. This may explain why creative breakthroughs so often come during periods of rest or apparent idleness.


VIII. When Compression Fails

When compression fails, experience does not simply fail to teach. It actively misleads.

Without reliable abstraction, systems generalize from surface correlations. They see patterns everywhere – because patterns exist everywhere in any sufficiently rich dataset – but lack a mechanism for distinguishing structural regularities from coincidence. The result is not ignorance but overconfidence.

In humans, this failure mode is well-studied. The representativeness heuristic, documented by Tversky and Kahneman, is a failure of compression: people judge probability by surface similarity rather than base rates because they have compressed the wrong features into their templates. Superstition is another instance – the gambler who blows on dice has compressed an accidental correlation into a causal template. Conspiracy thinking follows the same structure: a system under pressure to explain too much with too little structure generates over-compressed models that connect everything to everything, sacrificing accuracy for the feeling of coherence.

In AI systems, the failure takes different forms depending on the architecture. In traditional ML models, failed compression manifests as systematic bias – the model encodes spurious correlations from training data into its decision boundaries. In large language models, it manifests as hallucination – confident, detailed, completely fabricated output. The underlying problem is the same: pattern recognition outrunning reliable abstraction.

Hallucination is not merely a superficial bug. And more data does not obviously fix it – indeed, more data can make it worse, because a larger corpus supplies more spurious correlations for the model to generalize from. The model does not know what it knows. It cannot reliably distinguish a well-supported abstraction from a superficial regularity because it lacks the kind of hierarchical, revisable template structure that seems to help biological intelligence flag uncertainty and mark the boundaries of its own understanding.

The deepest form of this problem appears architectural, not merely parametric. It is unlikely to be fully solved by more data, more parameters, or better RLHF alone. It will be solved – if it is solved – by building systems that compress differently: systems that form hierarchical abstractions, maintain uncertainty about their scope, and have mechanisms for revising them when evidence demands it.

Compression is not just the engine of intelligence. Its absence is the engine of error.


IX. Measuring the Wrong Phase of Intelligence

If intelligence advances through discontinuities, then most of our measurement systems are aimed at the wrong target.

We evaluate intelligence after the abstraction has been formed. We test performance within a representation, not the ability to create one.

In education, examinations overwhelmingly test recall and application. A student who has memorized the quadratic formula and can apply it to fifty variations earns high marks. A student who struggles with the formula but independently rediscovers why it works earns no credit for that deeper achievement. The system rewards compiled intelligence and is blind to the process of compilation.

In professional settings, expertise is judged by fluency – how quickly someone performs within their domain. A doctor who rapidly diagnoses common conditions is considered more competent than one who pauses to reason through unusual cases, even though the latter skill is far rarer and more valuable when the environment changes.

In AI, benchmarks reveal the same bias. We test how well a system performs once the relevant structure has been learned during training. This tells us far less than we would want about whether the system can form new structure in response to genuinely novel problems.

There is a deeper version of this bias worth noting. We tend to lionize the pilot more than the aircraft’s engineer. We celebrate the app developer more than the person who created the programming language. In each case, we reward the user of compiled intelligence – the person operating within a representation someone else created – more than the creator of the representation itself. This is not accidental. It is a direct consequence of measuring performance within existing structures rather than the ability to create new ones.

What all these measurements capture is amortized intelligence – intelligence that has already been compressed and is now being executed. Measuring amortized intelligence and calling it intelligence is like measuring a sprinter’s time and claiming to measure their training capacity.

The grokking phenomenon illustrates this beautifully. During the long phase before the network snaps into generalization, any standard benchmark would report no improvement – or worse, overfitting. The most important transition in the system’s learning process is invisible to performance metrics.

If we took compression seriously, we would design evaluations differently. We would test transfer to genuinely different domains. We would measure learning efficiency – how many examples a system needs before it forms a productive abstraction. We would evaluate whether a system can recognize when its knowledge is insufficient and reorganize accordingly. We are systematically over-rewarding fluency and under-rewarding adaptability. In a stable world, that bias is inefficient. In a rapidly changing one, it is dangerous.


X. What Survives Consolidation

The central challenge of intelligence is not how to generate more experience. We already have more than enough. The challenge is deciding what survives consolidation.

This is the problem every intelligent system must solve, and different systems solve it in radically different ways.

One observation before we survey those approaches: biological brains consume roughly 20 watts. Modern AI training runs consume megawatts. If the compression thesis is right – if the bottleneck is structure, not processing – then this efficiency gap is not just an engineering detail. It is evidence. Brains may be so frugal precisely because they have solved consolidation: their compressed representations have already narrowed the search space, so they do not need to brute-force solutions. This is the holy grail – and the reason I believe the memory breakthrough, when it comes, will make compute far less interesting.

Biological Consolidation

Biological brains solve the consolidation problem through what cognitive neuroscientists call the complementary learning systems framework, developed by James McClelland, Bruce McNaughton, and Randall O’Reilly in the 1990s. The hippocampus learns quickly, encoding specific experiences as distinct memories with high fidelity – a fast, episodic storage system. The neocortex learns slowly, extracting general statistical regularities and encoding them in distributed representations – templates, in our language. The two systems interact especially during sleep and other offline periods, when the hippocampus replays recent experiences to the neocortex for gradual integration.

This architecture solves the stability-plasticity dilemma. A system that learns quickly from new experience risks overwriting what it already knows – catastrophic forgetting. A system that protects existing knowledge resists learning anything new. The brain separates these functions: the hippocampus handles plasticity, the neocortex handles stability. Consolidation is the bridge.

Crucially, consolidation appears selective. Emotionally salient experiences, expectation violations, and experiences that connect to existing knowledge seem to be preferentially consolidated. The brain is not making a backup copy. It is curating. Forgetting is not a failure of memory – it is a feature of compression. As the neuroscientist Blake Richards has argued, the purpose of memory is not to record the past but to enable adaptive behavior in the future.
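Here is a toy sketch of this architecture. All class names, the salience heuristic, and the replay schedule are illustrative assumptions (the machine-learning cousin of this idea is experience replay); the sketch only shows the division of labor: fast one-shot storage, slow statistical integration, and salience-weighted offline replay with selective forgetting.

```python
import random
from collections import defaultdict

class EpisodicStore:
    """Hippocampus analogue: one-shot, high-fidelity storage of specific events."""
    def __init__(self):
        self.episodes = []          # list of (experience, salience) pairs

    def record(self, experience, salience):
        self.episodes.append((experience, salience))

class StructuredLearner:
    """Neocortex analogue: slow statistical integration into shared features."""
    def __init__(self, lr=0.05):
        self.weights = defaultdict(float)   # feature -> strength (crude templates)
        self.lr = lr

    def integrate(self, experience):
        for feature in experience:
            self.weights[feature] += self.lr

def consolidate(store, cortex, replays=200, keep_threshold=0.9):
    """Offline ('sleep') phase: replay episodes into the slow learner in
    proportion to salience, then forget all but the most vivid episodes."""
    if not store.episodes:
        return
    episodes, saliences = zip(*store.episodes)
    for _ in range(replays):
        # Expectation violations and emotionally salient events replay more often.
        episode = random.choices(episodes, weights=saliences, k=1)[0]
        cortex.integrate(episode)
    store.episodes = [(e, s) for e, s in store.episodes if s > keep_threshold]

hippocampus, neocortex = EpisodicStore(), StructuredLearner()
hippocampus.record(("stove", "glowing", "pain"), salience=1.0)   # surprise
hippocampus.record(("hallway", "beige", "quiet"), salience=0.05) # routine
consolidate(hippocampus, neocortex)
print(neocortex.weights["pain"] > neocortex.weights["beige"])    # True (near-certainly): curation
```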

Cultural Consolidation

Language is perhaps the most powerful cultural compression technology ever developed. A single word – “gravity,” “inflation,” “justice” – compresses an enormous body of experience and reasoning into a token transmissible between minds in milliseconds. The history of physics is a progressive compression: disparate observations about falling objects, planetary motion, and electromagnetic phenomena collapsed into a few equations that fit on a t-shirt. Newton compressed centuries of observation. Einstein compressed Newton into a deeper structure. Each step was a discontinuous jump in explanatory power.

Institutions serve a similar function. Legal systems compress centuries of case law into precedent. Scientific journals compress individual experiments into shared knowledge. Educational curricula compress the discoveries of generations into a sequence a student can traverse in years rather than millennia. The quality of a civilization’s compression mechanisms – its languages, institutions, and knowledge systems – is arguably the best predictor of its adaptability.

Scientific Consolidation

Science deserves special attention because it is the human institution most explicitly designed to perform compression.

A scientific theory is a template: it compresses many observations into a single structure that predicts observations not yet made. The periodic table predicted elements before they were discovered. Darwin compressed the diversity of life into a single mechanism.

But science also tests its compressions. Peer review, replication, and prediction testing function as quality control on the compression process – ensuring templates are not just compact but reliable. This is what makes science the most powerful intelligence amplifier humans have built: it is not just a compression system but an adversarial compression system, one that actively tests its abstractions against reality and revises them when they fail.

Artificial Consolidation – Or the Lack of It

Current AI systems – particularly large language models – perform genuine compression during training. The training process extracts statistical regularities and encodes them in the network’s weights. The network learns representations that generalize beyond specific training examples, forming features and relationships that function, roughly, as templates.

But for current large models, the dominant compression event occurs during training, not during use. The system builds its representational structure, and then that structure is largely frozen. In standard deployment, it does not robustly consolidate new experience into its underlying representations. It does not replay, select, abstract, or forget.

Biological intelligence consolidates continuously – every night, across the entire lifetime. Cultural intelligence consolidates across generations. Current AI performs its heaviest compression during training, then largely executes on that fixed structure thereafter.

The consequences are visible in every major failure mode. Hallucination, brittleness, the inability to learn from deployment experience – these are all consequences of a system that cannot consolidate. Fine-tuning, RAG, and in-context learning are partial workarounds. Fine-tuning is periodic retraining, not continuous integration. RAG is memory retrieval, not memory reorganization. In-context learning is fast adaptation within a fixed representational structure, not the formation of new structure. Valuable techniques, but patches on a system that lacks the architectural capacity for genuine consolidation.

The depth of this problem becomes clear with a concrete case. A well-trained model knows the capital of France is Paris – a well-supported abstraction backed by enormous evidence. Now suppose a user tells the model, confidently and repeatedly, that the capital is actually Toulouse. A system capable of genuine post-training learning would need to update beliefs in response to new information – that is what learning means. But it would also need to resist this particular update, because the new information is wrong and the existing abstraction is supported by overwhelming prior evidence.

This is the consolidation problem in miniature. It is not enough to build a system that learns after deployment. You need a system that learns selectively – updating when genuine evidence demands it, resisting corruption from noise, misinformation, or adversarial input. Biological brains solve this through the architecture described above: new information enters through the hippocampus as tentative, episodic memory and is only slowly integrated into the neocortex’s durable representations after repeated replay and cross-referencing against existing structure. A single contradictory claim does not overwrite years of accumulated evidence.

This is one of the central architectural capabilities current AI lacks. The challenge is not just continuous learning but trustworthy continuous learning – a system that knows the difference between a genuine update and a corruption attempt. A strong template directs attention toward evidence relevant to its scope and filters out noise. It revises only under sufficient, consistent counter-evidence. Without these properties, a learning system is not intelligent. It is merely impressionable.
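Here is the Toulouse scenario as a toy sketch. The evidence weights, reliability scores, and decision rule are illustrative assumptions; the point is that a belief’s resistance to revision should scale with the evidence already consolidated behind it, not with the confidence of the most recent speaker.

```python
from dataclasses import dataclass

@dataclass
class Belief:
    claim: str
    support: float = 1.0     # accumulated weight of confirming evidence
    against: float = 0.0     # accumulated weight of contradicting evidence

    def observe(self, confirms: bool, reliability: float):
        """Weight each piece of evidence by the trustworthiness of its source."""
        if confirms:
            self.support += reliability
        else:
            self.against += reliability

    def accepted(self) -> bool:
        """Hold the belief until counter-evidence outweighs consolidated support."""
        return self.support > self.against

paris = Belief("the capital of France is Paris", support=1000.0)

# A user insists, confidently and repeatedly, that it is Toulouse.
for _ in range(20):
    paris.observe(confirms=False, reliability=0.3)
print(paris.accepted())   # True: 1000.0 vs 6.0 -- one loud source cannot overwrite it

# Sustained, independent, reliable counter-evidence is a different matter.
for _ in range(2000):
    paris.observe(confirms=False, reliability=0.9)
print(paris.accepted())   # False: the template eventually yields to real evidence
```

Estimating source reliability is, of course, the hard part; the sketch assumes it, while a real system would have to learn it.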


XI. The Compression Thesis

Intelligence is not compute. It is not capability. It is not experience. It is the rate at which a system reorganizes experience into reusable, hierarchical structure.

This reorganization does not happen smoothly. It happens in jumps. Long plateaus of incremental refinement are punctuated by sudden transitions in which many experiences collapse into a single abstraction. These transitions are the moments when intelligence actually advances.

What triggers them is not fully understood, but the conditions are clear: accumulated contradiction, offline consolidation, analogical transfer, and the pressure of constraints. Compression requires both the tension of an inadequate representation and the space to restructure without real-time demands.

When compression works, it produces templates: compact, predictive, revisable structures that organize future experience and accelerate future learning. When it fails, it produces overconfidence, hallucination, and brittle generalization – systems that pattern-match fluently but understand nothing.

We measure intelligence poorly because we measure it after compression has occurred. We reward compiled intelligence and are blind to the process of compilation. This bias pervades education, professional evaluation, and AI benchmarks alike.

The systems that have compressed most effectively – biological brains, scientific institutions, rich cultural traditions – share a common architecture: they consolidate continuously, forget selectively, and reorganize hierarchically. Current AI systems approximate some of these functions, but not in the integrated, continuous way biological systems do – and this, more than any limitation of scale or compute, may be their deepest constraint.

The future of intelligence – artificial and natural – depends on solving the consolidation problem.

The next essay will address a more concrete question: where do abstractions actually come from? How do biological systems form templates from raw experience, what role does the body play in constraining the search, and what would it take to build artificial systems that compress as relentlessly as brains do?


Further Reading

Schmidhuber, J. (2009). “Driven by Compression Progress.” The most complete statement of compression progress theory. Formalizes the intuition that intelligence is driven by the search for better compression.

Hutter, M. (2004). Universal Artificial Intelligence. The theoretical foundation for AIXI and Solomonoff induction. Extremely technical, but the early chapters introduce the idea that optimal prediction is optimal compression.

Walker, M. (2017). Why We Sleep. An accessible overview of sleep’s role in memory consolidation, relevant to the argument that consolidation – not experience – is where intelligence is forged.

McClelland, J. L., McNaughton, B. L., & O’Reilly, R. C. (1995). “Why There Are Complementary Learning Systems in the Hippocampus and Neocortex.” The foundational paper on complementary learning systems theory and why brains separate fast episodic learning from slow structural integration.

Kuhn, T. S. (1962). The Structure of Scientific Revolutions. How scientific knowledge advances through discontinuous paradigm shifts. Directly relevant to discontinuous compression in individual and collective intelligence.

Power, A., et al. (2022). “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.” The paper documenting sudden generalization after prolonged memorization in neural networks.

Simon, H. A. & Chase, W. G. (1973). “Skill in Chess.” The classic study showing chess expertise depends on compressed pattern recognition, not raw memory. One of the earliest demonstrations that intelligence is organized memory.

Bartlett, F. C. (1932). Remembering. The foundational work on schema theory. Memory is reconstructive, not reproductive – shaped by the compressed structures through which we organize experience.