Essay 4
Where Do Abstractions Come From?
Introduction: The Tesseract Problem
We cannot visualize a four-dimensional object. But we can see its shadow – a three-dimensional projection that preserves some of the original structure while losing the dimension that makes it what it is. We can study the shadow, rotate it, measure it. We can learn real things from it. But the projection is irreversible. No amount of studying the shadow will let us reconstruct the object that cast it.
Language is a projection of reality. It preserves enormous amounts of structure – relationships, categories, sequences, causes and effects. An entire civilization’s experience, compressed into text. But the projection is lossy and irreversible. Consequence is lost. Embodied constraint is lost. The felt texture of experience is lost.
An LLM trained on text is studying that shadow.
This essay asks: how much can you learn from a shadow? And what, specifically, is lost in the projection – the thing that separates knowledge from understanding, and correlation from wisdom?
The previous essay argued that intelligence advances through compression – the formation of reusable abstractions that organize experience. But it left a question open: where do those abstractions actually come from? What turns raw experience into structure?
The answer, I believe, is that abstractions emerge from the interaction between a predicting system and a constraining reality. And the body – far from being a mere vessel for the brain – is compressed evolutionary intelligence that gives this interaction a massive head start.
I. Knowledge vs. Wisdom
Michel de Montaigne wrote: “We can be knowledgeable with other men’s knowledge, but we cannot be wise with other men’s wisdom.”
This distinction is the backbone of this essay. Knowledge transfers through language. Wisdom requires experience – prediction meeting consequence.
A child told “the stove is hot” has knowledge. She can repeat it, apply it to novel questions (“Is the oven hot too?”), and pass it along to others. A child who has touched the stove has something else entirely. Her abstraction is grounded in consequence – emotionally weighted, deeply revisable, and connected to a felt prediction error that no verbal description can fully transmit.
Both children can answer the question “Is the stove dangerous?” correctly. But the second child’s abstraction is structurally different. It was forged by a prediction meeting reality. She reached for the stove expecting it to feel like every other surface she’d touched. Reality corrected the prediction. The gap between expectation and outcome – the surprise – is what forced the reorganization of her mental model. This is Essay 3’s staircase, now grounded in mechanism: the jump happens when reality forces a compression event.
Karl Friston’s active inference framework formalizes this idea. In his formulation, organisms are prediction machines that minimize surprise – either by updating their models of the world or by acting on the world to make it conform to their predictions. On the reading I am proposing, abstractions form at the point where prediction fails and the system must reorganize, and the deeper the prediction error, the more durable the resulting abstraction.
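The update-on-surprise idea can be made concrete with a toy sketch. This is a plain delta-rule learner, not Friston's free-energy formalism: it holds one scalar belief (how hot it expects a surface to be) and revises that belief in proportion to the prediction error. The function name `update_belief`, the `learning_rate` value, and the temperatures are all illustrative assumptions.

```python
def update_belief(belief, outcome, learning_rate=0.5):
    """Move the belief toward the outcome in proportion to the prediction error."""
    error = outcome - belief  # surprise: what happened minus what was expected
    return belief + learning_rate * error, error


# Hypothetical numbers: the child expects room temperature (20) on every surface,
# then touches a stove (200) twice.
belief = 20.0
history = []
for outcome in [20.0, 20.0, 200.0, 200.0]:
    belief, error = update_belief(belief, outcome)
    history.append((round(belief, 1), round(error, 1)))
```

Familiar outcomes produce zero error and leave the belief untouched. The first touch of the stove produces the largest error and the largest revision, which is the sense in which surprise drives reorganization.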
This is why children touch everything. They’re not just curious. They’re running experiments. Every grab, press, pull, and taste is a prediction being tested against reality. Observation tells you what happens. Intervention tells you what you can make happen. That is the difference between correlation and causation – and it maps closely onto the difference between knowledge and wisdom.
Passive observation produces weak abstractions. You watch someone ride a bicycle and form a model of how it works. Intervention produces strong ones. You fall off the bicycle and your model restructures around the physics of balance in a way no amount of watching could achieve.
This is the fundamental limitation of learning from the shadow. No matter how rich the text, no matter how many examples, no matter how much data – a system that learns only from observation is building correlational models. A system that learns from intervention is building causal ones. The grammar looks similar. The grounding is different.
II. The Body as Compressed Evolutionary Intelligence
Evolution didn’t just build brains. It outsourced intelligence into bodies.
This is one of the most underappreciated facts about biological intelligence. We tend to think of the body as a vehicle – the brain does the thinking, the body carries it around. But the body is doing enormous computational work that we don’t count as “intelligence” because it doesn’t look like thinking.
Consider the passive dynamic walker. In 1990, Tad McGeer demonstrated that a simple mechanical device – no motors, no sensors, no microprocessor – could walk stably down a slope using nothing but gravity and the geometry of its legs. The Cornell passive dynamic walker, built by Steve Collins and Andy Ruina, has knees and arms, walks with a remarkably humanlike gait, and uses zero computation. The physics of the body is the computation.
This isn’t a party trick. It reveals something deep about how biological intelligence works. Walking is one of the most complex motor tasks we perform – it requires precise coordination of dozens of muscles, real-time balance adjustments, and adaptation to terrain. Yet much of this complexity is handled by the mechanical properties of our legs, tendons, and joints. The brain doesn’t compute walking from scratch. It guides a system that is already mechanically predisposed to walk. Evolution encoded millions of years of locomotion intelligence into the structure of the body itself.
The examples multiply. The human eye doesn’t deliver raw pixel data to the brain – the retina performs substantial preprocessing, extracting edges, detecting motion, and adjusting for lighting conditions before any signal reaches the visual cortex. Cockroaches navigate complex terrain with minimal neural computation because their legs mechanically adapt to obstacles – the intelligence is in the morphology. The structure of the human hand – opposable thumbs, independent finger control, dense tactile nerve endings – creates affordances that shape what abstractions are even possible. Our hands don’t just execute our plans. They create the conditions under which certain plans become thinkable.
Proprioception – the sense of where your body is in space – provides a continuous stream of highly structured, pre-compressed data about the physical world. You don’t compute your arm’s position from first principles. The body tells you, constantly, through a channel that is already structured by millions of years of evolutionary optimization.
The principle is this: embodiment reduces the search space. A brain in a body has far fewer hypotheses to test about the world than a brain in a void. The body eliminates vast classes of impossible or irrelevant possibilities before cognition even begins. This is part of why biological intelligence is so energy-efficient. The brain runs on roughly 20 watts not only because it compresses ruthlessly (Essay 3) but because the body pre-compresses reality before it reaches the brain. The brain doesn’t start from raw data. It starts from data that has already been structured by an organism exquisitely adapted to its environment.
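A minimal sketch of what "reducing the search space" means, under invented numbers: a hypothetical three-joint leg whose poses are enumerated in 30-degree steps, with assumed joint limits standing in for morphology. The limits in `body_allows` are illustrative, not anatomical data.

```python
from itertools import product

# Hypothetical 3-joint leg: each joint angle can be tried in 30-degree steps.
ANGLES = range(-180, 180, 30)  # 12 candidate angles per joint


def body_allows(hip, knee, ankle):
    """Assumed joint limits standing in for morphology; values are illustrative only."""
    return -30 <= hip <= 90 and 0 <= knee <= 150 and -30 <= ankle <= 30


# A "brain in a void" must consider every combination of angles.
all_poses = list(product(ANGLES, repeat=3))

# A brain in a body only ever encounters the poses the body permits.
feasible = [pose for pose in all_poses if body_allows(*pose)]

fraction_remaining = len(feasible) / len(all_poses)
```

With these assumed limits, 90 of 1,728 candidate poses survive, about 5 percent. The pruning happens before any learning or search begins, which is the point of the argument above.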
This connects directly to the core thesis of this series: that intelligence is constrained search, and that the bottleneck is compression, not compute. The body is a compression engine – millions of years of evolutionary search results encoded as physical constraints that narrow the hypothesis space before a single neuron fires. When we say biological brains are five orders of magnitude more energy-efficient than current AI, we’re not comparing like with like. The brain has a body doing enormous work for free. Current AI has to compensate for the absence of that body with brute-force compute.
III. What the Shadow Loses
Return to the tesseract. We can now be specific about what is lost when reality is projected onto language.
Consequence is lost. Text describes outcomes but doesn’t impose them. A model that reads about a thousand failed experiments has correlations about failure. A scientist who has spent a year pursuing a dead-end hypothesis and had to restructure her entire research program has an abstraction grounded in consequence. The model’s correlation is statistical. The scientist’s abstraction is forged by prediction error – and prediction error, as we argued above, is the mechanism through which the strongest abstractions form.
Interventional structure is lost. Text contains descriptions of causal events, but the causal structure – what happens when you act – is not directly recoverable from observation alone. You can read that “flipping the switch turns on the light.” But from observation alone, you cannot distinguish this from “the light turns on whenever someone is near the switch.” The interventional structure – the fact that your action causes the effect – requires you to flip the switch yourself. This is why Judea Pearl’s causal hierarchy places observation, intervention, and counterfactual reasoning at different levels. Language lives primarily at the observational level.
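The switch example can be sketched as two toy worlds that generate identical observational data but respond differently to intervention. The sketch assumes that anyone near the switch flips it, and that a person is nearby 30 percent of the time. The names `world_a`, `world_b`, `observe`, and `intervene` are hypothetical, and this is an illustration of the observation/intervention gap rather than an implementation of Pearl's do-calculus.

```python
import random


def world_a(person_near, do_switch=None):
    """World A: the switch powers the light."""
    switch = person_near if do_switch is None else do_switch
    light = switch
    return switch, light


def world_b(person_near, do_switch=None):
    """World B: a motion sensor powers the light; the switch is wired to nothing."""
    switch = person_near if do_switch is None else do_switch
    light = person_near
    return switch, light


def observe(world, n=10_000, seed=0):
    """Passive data: people come and go, and we record (switch, light) pairs."""
    rng = random.Random(seed)
    return [world(rng.random() < 0.3) for _ in range(n)]


def intervene(world, n=10_000, seed=0):
    """Active data: we force the switch on regardless of who is nearby."""
    rng = random.Random(seed)
    lights = [world(rng.random() < 0.3, do_switch=True)[1] for _ in range(n)]
    return sum(lights) / n


same_observations = observe(world_a) == observe(world_b)
light_rate_a = intervene(world_a)
light_rate_b = intervene(world_b)
```

The two observational logs are identical, so no amount of watching separates the worlds. Forcing the switch does: in World A the light always comes on, while in World B it comes on only when someone happens to be nearby.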
Embodied constraint is lost. The physical priors that narrow search don’t transfer through language. You can describe balance to someone who has never stood upright, but the description doesn’t give them balance. The body’s pre-compression of physical reality – the proprioceptive stream, the mechanical affordances, the evolutionary encoding of physics into morphology – cannot be transmitted linguistically. At best, language transmits a pointer: “this is what balance is like.” The pointer is useful. It is not the thing itself.
Multimodal grounding helps, but doesn’t close the gap. Adding images and video to training data adds information but not consequence. A model that has seen a million videos of people touching hot surfaces has better correlations about stoves. It still doesn’t have the abstraction that forms when your prediction (“this surface will be cool”) meets reality (“it’s not”). The multimodal data is richer shadow. It is still shadow.
The honest caveat: we don’t know the full extent of what’s recoverable from the shadow. Reinforcement learning in simulated environments does produce grounded behavior – agents that learn to walk, manipulate objects, and navigate spaces through simulated consequence. The question is whether simulation can be rich enough to substitute for real-world embodied interaction. This is genuinely open. My intuition is that simulation gets you surprisingly far for narrow tasks but hits a ceiling for general intelligence, because the simulation itself is a model of reality – and models, like shadows, are lossy projections.
IV. What This Means for AGI
If the thesis of the last three essays is right – that intelligence requires search, compression, and grounded consequence – then a single large language model, no matter how powerful, is not a path to artificial general intelligence. It is a path to extraordinarily capable pattern matching over the shadow of reality. Which is valuable. But it is not AGI.
This is not a contrarian position for its own sake. It’s a structural prediction from the framework we’ve been building. If abstractions require consequence to form (this essay), and compression is the bottleneck (Essay 3), and intelligence scales through parallel search (Essay 2), then AGI requires systems that interact with reality, compress that interaction into reusable structure, and do so at scale.
What does that look like? I think it looks like distributed, embodied intelligence.
No single human has humanity’s intelligence. No single human was going to build civilization alone. Human intelligence is collective – parallel and serial, distributed across billions of embodied agents, each interacting with reality in their own context, each compressing their experience into abstractions that get shared through language (which we’ll explore in the next essay). The intelligence is in the network, not in any individual node.
If AGI is to approach or exceed human intelligence, it likely needs a similar structure – but with an important nuance. Not all intelligence requires embodiment. Math doesn’t need a body. Logical reasoning doesn’t. Estimation, planning, and abstract inference operate just fine without physical interaction. Much of what we value about human cognition happens entirely inside the head.
But these non-embodied capabilities are built on top of abstractions that originally required embodiment to form. The child who learns physics through her body – falling, throwing, balancing – can later reason about physics abstractly, on paper, without touching anything. The embodiment was the substrate. It provided the grounded abstractions that non-embodied reasoning then operates on. Remove the substrate, and the abstractions that depend on it never form in the first place.
So the claim is not that AGI requires every agent to be embodied all the time. It’s that embodiment is the substrate – the layer that generates the grounded, consequential abstractions on which everything else is built. You need a combination: embodied agents that interact with reality and form grounded abstractions, and non-embodied agents that reason over those abstractions, extend them, and apply them in domains where physical interaction isn’t needed. The embodied layer feeds the non-embodied layer. Without it, the non-embodied layer is reasoning over shadows.
This is the argument from Essay 2 – AGI as a civilization in silicon – now extended with a new claim: the civilization needs bodies, not just software. Humanoid robots, or at minimum physically embodied agents that interact with the world in ways that generate consequential feedback, may be a necessary condition for general intelligence. Not sufficient – you also need compression, memory, language, and the ability to share abstractions across the network. But necessary, because certain foundational abstractions cannot form from the shadow alone.
Simulation and the Breadth Problem
An important clarification: the distinction that matters is not physical vs. simulated. It’s consequential interaction vs. observation. Simulation is not a competing alternative to physical embodiment – it’s another form of consequential interaction, and it actually strengthens the thesis.
A simulated agent whose predictions are corrected by a physics engine is experiencing consequence – virtual, but consequence nonetheless. If the simulation is accurate, the abstractions formed inside it are genuinely grounded. Reinforcement learning in simulation already produces agents that walk, manipulate objects, and navigate spaces, and sim-to-real transfer – training in simulation, deploying in the physical world – is an active and increasingly productive research area. Most serious robotics efforts use simulation to accelerate training and physical hardware to validate the result. These are not competing paths. They are stages of the same pipeline.
The real question is about breadth. For narrow domains – locomotion, object manipulation, specific motor tasks – simulation works well because we understand the relevant physics well enough to model it. But general intelligence requires abstractions formed across an enormously broad slice of reality. And here’s the circularity: to simulate reality broadly enough to learn general abstractions from it, you may need to already understand reality deeply enough to build the simulation. You need the knowledge to create the environment that would teach you the knowledge.
Physical embodiment skips this circularity. You don’t need to model reality – you just interact with it. Reality is its own simulation, running at perfect fidelity. It is unlimited in breadth, perfectly accurate, and needs no one to build or render it. The cost lies in acting in it, not in modeling it.
The practical trade-off, then: simulation is faster, cheaper, and more controllable, but limited by the fidelity and breadth of the model. Physical embodiment is slower and more expensive, but provides perfect fidelity and unlimited breadth. For narrow domains, simulation wins. For general intelligence, physical embodiment may be necessary simply because no simulation is broad enough – and building one that is would require solving most of the problem you’re trying to solve.
In practice, the path to AGI likely uses both. Simulated environments for rapid training in well-understood domains. Physical embodiment for the open-ended, broad-spectrum learning that no simulation can fully capture. And the non-embodied reasoning layer – language models, planning systems, abstract inference – operating over the grounded abstractions that both provide.
The Architecture Question
An open question follows: what is the right architecture for these agents? There are several possibilities, and I don’t think we know the answer yet.
One option is many small specialized models – each robot running its own fast-updating model, trained on its own embodied experience, specializing in a narrow domain where rapid learning is possible. This aligns with the argument from Essay 2 about small, fast-updating models beating massive frozen ones.
Another option is a shared large model with domain-specific activation – every agent runs the same foundation model but specializes in a narrow scope, much like a mixture-of-experts architecture in which different embodied contexts activate different expert modules.
A third is a hybrid – a shared foundation model providing basic world knowledge and language understanding, with small specialized models on top that learn from each agent’s specific embodied experience and update continuously.
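The hybrid option can be sketched in a few lines, under heavy simplification: a frozen shared predictor plus a small per-agent correction that updates online from that agent's own outcomes. `SharedBase`, `AgentAdapter`, the linear prediction rule, and the environment numbers are all hypothetical stand-ins, not a proposal for how such a system would actually be built.

```python
class SharedBase:
    """Stand-in for a frozen foundation model: one fixed rule shared by every agent."""

    def predict(self, x):
        return 2.0 * x  # assumed generic prior, never updated


class AgentAdapter:
    """Small per-agent correction learned online from that agent's own experience."""

    def __init__(self, base, learning_rate=0.1):
        self.base = base
        self.offset = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.base.predict(x) + self.offset

    def learn(self, x, outcome):
        error = outcome - self.predict(x)  # prediction error in this agent's world
        self.offset += self.learning_rate * error
        return error


base = SharedBase()
warehouse = AgentAdapter(base)
household = AgentAdapter(base)

# Hypothetical environments: each agent's world deviates from the shared prior differently.
for _ in range(50):
    warehouse.learn(1.0, outcome=2.5)  # this environment runs 0.5 above the prior
    household.learn(1.0, outcome=1.0)  # this one runs 1.0 below it
```

Both agents start from the same prior and end with different corrections, while the shared base is never modified. The first option (many small models) corresponds to dropping `SharedBase`; the second (one shared model) corresponds to dropping the per-agent offsets and routing inside the base instead.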
The right answer likely varies by domain. A fleet of warehouse robots performing repetitive manipulation tasks might work well with small specialized models. A general-purpose household assistant might need the hybrid approach. The point is that this is an architectural question we should be investigating seriously – and the fact that most AI research focuses on scaling disembodied language models suggests we’re exploring the wrong part of the search tree.
V. The Prediction That Matters
This essay, combined with the previous three, makes a specific, testable prediction: systems that learn from embodied, consequential interaction with reality will form more robust and generalizable abstractions than systems that learn from observational data alone, even when the observational data is orders of magnitude larger.
This prediction has implications.
For AI development, it means the path to AGI runs through embodied interaction with reality – whether physical or simulated – not through ever-larger language models alone. Not because language models are useless – they’re extraordinarily valuable as the non-embodied reasoning layer. But the substrate they reason over must be grounded in consequence, and there’s a ceiling on what the shadow contains.
For the compute-vs-memory debate that runs through this series: embodiment is part of the answer to why biological brains are so efficient. They don’t just compress better (Essay 3). They start from pre-compressed input (this essay). Current AI systems are compute-hungry partly because they’re compensating for the absence of embodiment – trying to brute-force their way to abstractions that the body provides for free.
For the question of who builds AGI: if embodied interaction is necessary, then AGI cannot be built by a single lab training a single model. It requires distributed agents in diverse environments, each contributing unique embodied experience to the collective. This reinforces the argument from Essay 2: AGI will emerge from millions of agents managed by millions of people, not from one company scaling one model.
The next essay examines the technology that allowed humans to partially overcome the limitations of individual embodied experience – a compression tool so powerful we forget it’s technology at all. Language enabled us to share the products of embodied intelligence across minds and generations. But the sharing is lossy, and the losses matter. That’s where we go next.
Further Reading
Friston, K. (2010). “The Free-Energy Principle: A Unified Brain Theory?” The most accessible statement of Friston’s active inference framework. Dense but foundational – it formalizes the idea that organisms minimize prediction error through a combination of perception and action.
Pfeifer, R. & Bongard, J. (2006). How the Body Shapes the Way We Think. The most thorough treatment of embodied intelligence and morphological computation. Essential for understanding why bodies are computational devices, not just vehicles for brains.
Collins, S., Ruina, A., Tedrake, R., & Wisse, M. (2005). “Efficient Bipedal Robots Based on Passive-Dynamic Walkers.” Science. Builds on unpowered passive-dynamic walkers, which walk down a slope with no motors or controllers, and presents three robots that substitute small power sources for gravity so they can walk on level ground with far less control and energy than conventional powered robots.
Clark, A. (2015). Surfing Uncertainty: Prediction, Action, and the Embodied Mind. Andy Clark’s synthesis of predictive processing and embodied cognition. Argues that the brain is fundamentally a prediction machine, with perception and action as two sides of the same coin.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference. The foundational text on causal inference. Essential for understanding why correlation and causation are structurally different and why interventional experience may be necessary for genuine causal reasoning.
Montaigne, M. de (1580). Essays. The source of the knowledge-vs-wisdom distinction that frames this essay. Montaigne’s broader project – rigorous self-examination as a path to understanding – remains remarkably relevant to questions about machine intelligence.
Brooks, R. (1991). “Intelligence Without Representation.” Rodney Brooks’s provocative argument that intelligence emerges from the interaction between an agent and its environment, not from internal representations. A founding text of embodied AI.
McGeer, T. (1990). “Passive Dynamic Walking.” International Journal of Robotics Research. The original paper on passive dynamic walkers. Demonstrates that stable locomotion can emerge from mechanical dynamics alone, without any computation or control.