Embodied Cognition & Intelligence
Why intelligence isn't just computation — how physical embodiment shapes cognition and what that means for AI.
Try to explain the concept of “heavy” to someone who has never lifted anything. Not the physics definition (mass times gravitational acceleration). The felt sense. The way your shoulders tighten when you pick up a box that’s heavier than expected. The way “heavy” bleeds into metaphor: heavy news, heavy heart, heavy responsibility. The way a child learns “heavy” by picking up rocks of different sizes, dropping a few on their toes, and slowly calibrating their grip and their understanding simultaneously.
Now try to teach that to a language model.
You can give it the dictionary definition. You can give it thousands of sentences containing the word “heavy.” You can even give it the physics equations. The model will use “heavy” correctly in conversation. It will generate plausible sentences about heavy objects. But something is missing. The model’s understanding of “heavy” is a statistical ghost: the pattern of co-occurrences without the physical experience that gives those co-occurrences their meaning.
This observation, that cognition is deeply shaped by having a body, is the central claim of embodied cognition. It’s one of the most important ideas in cognitive science, and it’s becoming increasingly relevant as we try to figure out why AI systems are brilliant at some things and bizarrely incompetent at others.
The body thinks
The traditional view of cognition, inherited from Descartes and reinforced by the early cognitive science of the 1960s and 70s, treats the mind as a computer. Perception is input. Action is output. Cognition is the computation in between. The body is a peripheral device, like a keyboard and a monitor connected to the real action happening in the CPU.
Embodied cognition rejects this picture. It argues that the body isn’t peripheral to cognition. It’s constitutive of it. The way you think is shaped by the body you have, the environment you’re in, and the actions available to you.
This isn’t a vague philosophical claim. It’s backed by decades of experimental evidence.
Holding a warm cup makes you rate other people as “warmer.” Williams and Bargh (2008) found that participants who briefly held a hot cup of coffee rated a target person as having a warmer personality than participants who held a cold cup. The physical sensation of warmth primed the abstract concept of interpersonal warmth. The body’s sensory state influenced a cognitive judgment.
Nodding your head makes you agree more. Wells and Petty (1980) asked participants to test headphones by moving their heads vertically (nodding) or horizontally (shaking) while listening to an editorial. Nodders agreed more with the editorial than shakers, even though the head movements were supposedly unrelated to the content.
Leaning affects magnitude estimation. Eerland, Guadalupe, and Zwaan (2011) found that people standing on a platform tilted slightly to the left estimated the Eiffel Tower to be shorter than people standing upright, consistent with the left-to-right mental number line (smaller numbers on the left).
These experiments are weird. They suggest that abstract thinking isn’t as abstract as we assumed. It’s tangled up with bodily states in ways that the “mind as computer” metaphor can’t accommodate.
The enactive revolution
The intellectual foundation for embodied cognition was laid in 1991 by Francisco Varela, Evan Thompson, and Eleanor Rosch in The Embodied Mind. They proposed what they called the “enactive” approach to cognition:
Cognition isn’t the representation of a pre-given world by a pre-given mind. It’s the enactment of a world and a mind based on a history of structural coupling between an organism and its environment.
That’s dense, so let me unpack it.
Pre-given world. The traditional view assumes there’s an objective world out there, and the job of cognition is to build an accurate internal model of it. The enactive view says that the world an organism experiences is partly constituted by how the organism acts on it. A frog’s world contains “small dark moving things” (flies) and “large dark looming things” (predators), not because the world has those categories, but because the frog’s perceptual-motor system carves the world at those joints.
Structural coupling. An organism and its environment co-evolve. The organism’s body and nervous system are shaped by the environment through evolution and development. The organism’s actions, in turn, shape the environment it encounters. This creates a feedback loop: the body shapes cognition, cognition shapes action, action shapes the environment, the environment shapes the body.
Enactment. Cognition isn’t something that happens inside the head. It’s something an organism does. Perceiving is an activity. Thinking is an activity. Meaning isn’t retrieved from storage. It’s enacted through interaction.
Varela, Thompson, and Rosch were drawing on diverse sources: phenomenology (Merleau-Ponty’s analysis of perception), Buddhist philosophy (mindfulness as direct experience), and biology (autopoiesis, the self-producing nature of living systems). The combination was unusual and initially met with skepticism from mainstream cognitive science. Three decades later, their core insights have been substantially vindicated.
Gibson’s affordances: the world tells you what to do
James Gibson, an ecological psychologist whose major work predates Varela and colleagues by a couple of decades, had a related but distinct insight. He argued that perception isn’t about building internal representations of the world. It’s about directly perceiving affordances: the action possibilities that the environment offers to an agent with a particular body.
A chair affords sitting (for a human). A branch affords perching (for a bird). A handle affords grasping. A cliff affords falling. These affordances aren’t properties of the objects alone or the agent alone. They’re relational: they exist at the interface between the agent’s body and the environment’s structure.
This is a profound idea with direct relevance to AI. In the traditional computational view, an AI system perceives the world by building an internal model and then planning actions against that model. In Gibson’s view, perception and action are directly coupled: you perceive the world in terms of what you can do in it.
Consider how a toddler navigates a cluttered room. They don’t build a 3D model of the room, compute paths, and execute motor plans. They perceive walkable surfaces, graspable edges, and climbable furniture directly. Their perception is structured by their body’s capabilities: a surface that affords walking for a standing toddler doesn’t afford walking for a crawling infant. Same surface, different affordances, different perception.
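The relational character of affordances can be made concrete with a toy sketch. Everything here (the classes, the thresholds, the numbers) is made up for illustration; the point is only that walkability is a function of the agent's body and the surface together, not of either alone:

```python
from dataclasses import dataclass

@dataclass
class Agent:
    """Toy agent body: what it can do, in rough physical units."""
    can_stand: bool
    max_step_height_cm: float

@dataclass
class Surface:
    """Toy environment feature."""
    height_cm: float
    rigid: bool

def affords_walking(agent: Agent, surface: Surface) -> bool:
    # Walkability is relational: it depends on the agent's body
    # as much as on the surface's properties.
    return (agent.can_stand
            and surface.rigid
            and surface.height_cm <= agent.max_step_height_cm)

toddler = Agent(can_stand=True, max_step_height_cm=15)
infant = Agent(can_stand=False, max_step_height_cm=0)
step = Surface(height_cm=12, rigid=True)

print(affords_walking(toddler, step))  # True
print(affords_walking(infant, step))   # False: same surface, different affordance
```

The same `Surface` object yields different affordances for different bodies, which is Gibson's point in miniature: the affordance lives at the interface, not in the object.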
Gibson’s framework explains something that keeps tripping up AI systems: the seemingly effortless way humans navigate physical environments. It’s not that humans are doing complex computation quickly. It’s that they’re perceiving the environment in terms that are already action-relevant. The computation is, in a sense, already done by the body-environment coupling.
Lakoff’s conceptual metaphors: thinking through the body
George Lakoff and Mark Johnson extended embodied cognition into the domain of abstract thought. Their central claim, in Metaphors We Live By (1980) and Philosophy in the Flesh (1999), is that abstract concepts are systematically structured by metaphorical mappings from bodily experience.
We don’t just use metaphors as literary devices. We think in metaphors, and those metaphors are grounded in bodily experience.
| Abstract domain | Bodily source domain | Examples |
|---|---|---|
| Affection | Warmth | “She’s a warm person,” “a cold reception” |
| Status | Vertical space | “High status,” “feeling down,” “top of the hierarchy” |
| Time | Space/motion | “Looking ahead,” “the future is in front of us,” “looking back” |
| Morality | Cleanliness | “Dirty politics,” “clean conscience,” “pure intentions” |
| Difficulty | Weight | “Heavy burden,” “light reading,” “weighed down by problems” |
| Understanding | Seeing | “I see what you mean,” “clear explanation,” “murky reasoning” |
| Quantity | Vertical extent | “Prices went up,” “stocks fell,” “high numbers” |
These aren’t arbitrary. They’re grounded in universal bodily experiences. Every human experiences warmth from other bodies (being held as an infant), vertical space (standing up means having more power than lying down), weight (carrying things), and vision (seeing objects). The metaphorical mappings preserve the structure of the bodily experience: more warmth corresponds to more affection, higher position to higher status, greater weight to greater difficulty.
The neural evidence has been accumulating. A 2023 review of neuroimaging studies found that processing metaphorical language activates sensorimotor brain regions consistent with the source domain. Reading about “grasping an idea” activates motor areas involved in physical grasping. Reading about “a warm personality” activates regions involved in processing temperature.
This creates an interesting problem for AI. Language models learn metaphors from text, and they use them correctly. But they learn the mapping without the ground. They know that “warm” maps to “friendly” because those words co-occur in similar contexts. But they don’t know why warmth maps to friendliness (because the bodily experience of warmth is associated with the bodily experience of social closeness). The mapping is intact, but the motivating experience is absent.
Does this matter? For generating text, probably not. For truly understanding what the text means, it might.
The grounding problem: AI’s missing floor
The symbol grounding problem, identified by Stevan Harnad in 1990, asks a simple question: how do the symbols in a computational system get their meaning?
In a traditional AI system (and in a language model), symbols (words, tokens) are defined in terms of other symbols. “Cat” is defined by its relationships to “animal,” “fur,” “meow,” “pet,” and so on. But none of these symbols are connected to anything outside the system. It’s definitions all the way down. Harnad compared this to trying to learn Chinese from a Chinese-to-Chinese dictionary when you don’t know any Chinese.
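The "definitions all the way down" structure is easy to demonstrate. In this toy sketch (tiny invented corpus, bare co-occurrence counts rather than a real embedding model), "warm" and "friendly" come out similar purely because they appear in similar contexts, with neither word connected to any sensory experience:

```python
from collections import Counter
from math import sqrt

# Tiny invented corpus: "warm" and "friendly" share contexts.
corpus = [
    "she gave a warm welcome",
    "she gave a friendly welcome",
    "a warm smile greeted us",
    "a friendly smile greeted us",
    "the cold reply ended it",
]

def cooccurrence_vector(word, sentences):
    """Count the words that appear alongside `word`."""
    counts = Counter()
    for s in sentences:
        toks = s.split()
        if word in toks:
            counts.update(t for t in toks if t != word)
    return counts

def cosine(a, b):
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

warm = cooccurrence_vector("warm", corpus)
friendly = cooccurrence_vector("friendly", corpus)
cold = cooccurrence_vector("cold", corpus)

print(cosine(warm, friendly))  # high: near-identical contexts
print(cosine(warm, cold))      # low: disjoint contexts
```

The similarity structure is real and useful, but every vector is defined only in terms of other words. Nothing in the system touches warmth itself, which is Harnad's dictionary-go-round in four lines of counting.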
Human concepts are grounded differently. The concept “cat” is connected to the sight of cats, the feel of fur, the sound of purring, the motor program for petting, the emotional response of affection (or allergic dread). These connections give the symbol its meaning. Without them, you have a complex web of symbol-to-symbol relationships that can produce correct outputs without understanding what any of them refer to.
A 2025 study published in Nature Human Behaviour tested this directly. The researchers compared concept representations in large language models with human concept representations and found that LLMs recovered non-sensorimotor features of concepts (taxonomic relationships, functional properties) but systematically failed to recover sensorimotor features (what things look, feel, smell, taste, and sound like). The similarity between LLM-derived and human-derived representations showed a “gradual decrease” in sensorimotor domains.
This is exactly what embodied cognition theory predicts. The non-sensorimotor aspects of meaning can be captured through statistical co-occurrence in text because they reflect relational properties (cats are animals, cats are pets, cats are smaller than dogs). The sensorimotor aspects can’t be captured this way because they reflect properties of direct bodily interaction (what fur feels like, what purring sounds like).
Does intelligence require a body?
This is the big question, and the debate has intensified in 2025-2026 with the rapid advance of language models.
There are three main positions:
Strong embodiment: intelligence requires a body. Proponents argue that genuine understanding, causal reasoning, and common sense depend on sensorimotor grounding that can’t be acquired from text alone. Without a body, you can simulate intelligence but not achieve it. This is the position of traditional embodied cognition researchers and some AI critics.
Grounding without embodiment: intelligence requires grounding but not necessarily a physical body. A January 2026 paper with exactly this title argued that the key properties of the physical world (gravity, object permanence, spatial relationships) can be inferred from digital representations such as images and audio. You don’t need a physical body to learn about the physical world. You need grounded data, sensory information that carries information about physical structure. A robot body is one way to get this data. Video and audio are other ways.
Weak embodiment: embodiment helps but isn’t necessary. Language models demonstrate impressive cognitive capabilities without any embodiment. Maybe they don’t truly “understand” in the philosophical sense, but they can reason, plan, generate creative solutions, and pass tests that were designed to measure understanding. Maybe the philosophical concept of “true understanding” is doing less work than embodied cognition theorists think.
My view is closest to the second position: grounding matters, embodiment is one way to achieve grounding, but it’s not the only way.
Here’s why. When I watch language models fail, they tend to fail in ways that are consistent with a grounding deficit, not an embodiment deficit specifically. They struggle with spatial reasoning (because text doesn’t convey spatial structure well). They struggle with physical causation (because text describes outcomes, not mechanisms). They struggle with novel sensorimotor concepts (because these can’t be learned from description alone).
But they do fine with social reasoning, emotional concepts, logical inference, mathematical proof, and many other forms of abstract cognition. These are domains where grounding through language might actually be sufficient, because the concepts are defined relationally rather than experientially.
The question isn’t “does the system have a body?” The question is “does the system have access to data that carries the relevant causal and structural information about the domain it’s reasoning about?”
Robotics: the embodiment test bench
If embodiment matters for intelligence, then embodied AI systems (robots) should develop cognitive capabilities that disembodied systems can’t achieve. Is there evidence for this?
The field of embodied AI has seen significant advances in 2025. Tesla, Figure, and 1X have deployed humanoid robots capable of natural walking, object manipulation, and real-time environmental adaptation. These aren’t pre-programmed automata. They learn from physical interaction, much as human children do (albeit much more slowly and with much less flexibility).
The most interesting development is the emergence of “world models” for embodied agents. A 2025 survey in IEEE Circuits and Systems Magazine traced the evolution from language models to world models, where embodied AI systems learn predictive models of their physical environment through interaction. The system doesn’t just learn what words mean. It learns what happens when you push things, stack things, drop things. It develops something analogous to physical intuition.
These embodied systems do develop capabilities that disembodied language models lack. They can predict the physical consequences of actions they’ve never performed before (novel object manipulation). They can generalize physical knowledge across objects with similar properties. They can learn affordances directly: this thing is graspable, that surface is walkable, this container can hold things.
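The flavor of this kind of learning can be sketched in miniature. The data, the linear form, and the single fitted coefficient below are all invented for illustration; real world models learn vastly richer dynamics, but the shape is the same: fit a predictive model from interaction samples, then use it on an action never performed:

```python
# Toy "world model": from interaction samples (force, mass, observed
# displacement), fit displacement ≈ k * force / mass, then predict
# the outcome of pushing an object never pushed before.
samples = [  # (force_N, mass_kg, displacement_m) -- made-up data
    (10.0, 1.0, 2.0),
    (10.0, 2.0, 1.0),
    (20.0, 2.0, 2.0),
    (5.0, 1.0, 1.0),
]

# Least-squares fit of k in d = k * (f / m)
xs = [f / m for f, m, _ in samples]
ds = [d for _, _, d in samples]
k = sum(x * d for x, d in zip(xs, ds)) / sum(x * x for x in xs)

def predict_displacement(force, mass):
    """Predict the consequence of an action the agent has never taken."""
    return k * force / mass

# Generalization: a 4 kg block was never in the training interactions.
print(round(predict_displacement(20.0, 4.0), 2))  # 1.0
```

The model generalizes across objects because it has captured a structural regularity (the force-to-mass ratio), not a lookup table of past episodes, which is the rudimentary form of "physical intuition" the survey describes.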
But here’s the catch: it’s not clear whether these capabilities require physical embodiment specifically, or just the kind of rich, causally structured data that physical interaction provides. Could you get the same capabilities by training on very high-quality video of physical interactions? Maybe. The 2026 paper arguing for “grounding without embodiment” makes a plausible case that you could.
The 4E framework: a broader view
Contemporary embodied cognition has evolved beyond the simple claim that “cognition needs a body.” The field now recognizes four related ways in which cognition extends beyond the brain:
Embodied. Cognitive processes are shaped by the body’s form, sensory systems, and motor capabilities. A creature with eyes on the sides of its head (like a rabbit) perceives and reasons about space differently than one with forward-facing eyes (like a human).
Embedded. Cognition takes place in a physical and social environment that provides structure and resources. You think differently in a library than in a forest, not just because the information is different, but because the environment scaffolds different cognitive processes.
Extended. Cognitive processes can extend beyond the brain and body to include external tools and artifacts. When you do long division on paper, the paper isn’t just an output device. It’s part of the cognitive process. When you use a smartphone as an external memory, your cognitive system includes the phone. Andy Clark and David Chalmers argued this in their influential 1998 paper “The Extended Mind.”
Enactive. Cognition is the enactment of a world through sensorimotor interaction, per Varela, Thompson, and Rosch. Meaning isn’t represented. It’s performed.
These four aspects interact. An embodied agent (a human) embedded in an environment (an office) extends its cognition through tools (a computer) and enacts understanding through interaction (typing, reading, thinking).
For AI systems, this framework raises interesting questions:
| 4E Dimension | Current AI status | What it implies |
|---|---|---|
| Embodied | Mostly disembodied; some robotics | Physical form shapes available cognitive strategies |
| Embedded | Embedded in digital environments (context windows, tool access) | Digital embedding provides its own form of scaffolding |
| Extended | Uses tools (search, code execution, file systems) | Tool use is a genuine form of extended cognition |
| Enactive | Generates responses through interaction, but not through sensorimotor coupling | Interaction with users is a limited form of enaction |
The interesting observation is that AI systems are partially embodied in the 4E sense even without physical bodies. They’re embedded in digital environments. They extend through tool use. They enact meaning through conversational interaction. The embodiment they’re missing is specifically the sensorimotor kind: direct causal contact with the physical world.
What embodiment means for AI development
The practical implications of embodied cognition for AI development are becoming clearer:
Multimodal training matters. Language-only training produces systems with systematic gaps in sensorimotor concepts. Adding vision, audio, and video helps because these modalities carry information about physical structure that text alone doesn’t convey. This isn’t a nice-to-have. It’s addressing a fundamental representational deficit.
Interaction beats observation. There’s a difference between learning from watching (video) and learning from doing (robotics). Active interaction produces richer representations because the agent gets to observe the consequences of its own interventions, which provides causal information that passive observation doesn’t. This maps onto the distinction between observational and interventional learning in causal inference.
Affordance perception may be trainable. If Gibson is right that perception is structured by action possibilities, then giving AI systems the ability to act in environments (even simulated ones) should change how they perceive and represent those environments. There’s early evidence for this from robotics research, where embodied agents develop more structured and generalizable representations than agents that learn from observation alone.
Metaphorical reasoning needs grounding. Lakoff’s work suggests that abstract reasoning is scaffolded by concrete experience. If AI systems lack the concrete experience, their abstract reasoning may be superficially correct but structurally shallow. This could explain some of the “confident but wrong” failure modes of language models: the surface form of the reasoning looks right, but the underlying structure is based on statistical association rather than experiential grounding.
The philosophical stakes
Embodied cognition isn’t just a technical issue for AI developers. It raises deep questions about the nature of intelligence and understanding.
If intelligence is fundamentally embodied, then a disembodied AI, no matter how capable, is doing something qualitatively different from what humans do when they think. It might produce the same outputs, but the internal process is different in ways that matter. This is related to John Searle’s Chinese Room argument, but with a twist: the issue isn’t about symbol manipulation per se. It’s about whether meaning requires grounding in bodily experience.
If intelligence is not fundamentally embodied, if grounding can be achieved through diverse data rather than physical interaction, then the path to human-level AI might be shorter than embodied cognition theorists think. You don’t need to build a robot body. You need to build training pipelines that expose models to rich, causally structured data from multiple modalities.
My instinct, based on building systems that coordinate multiple AI agents, is that the truth is somewhere between these positions. Embodiment isn’t strictly necessary for intelligence, but the absence of embodiment creates specific, predictable gaps that become apparent when you push systems beyond their training distribution. The systems are intelligent but differently intelligent, strong where linguistic and relational reasoning suffice, weak where physical intuition and causal understanding are required.
The question isn’t whether AI systems will eventually close these gaps. They probably will, through some combination of multimodal training, embodied robotics, and architectural innovations we haven’t invented yet. The question is whether closing the gaps changes what the systems are doing in a philosophically interesting way, or just makes them better at producing correct outputs.
Varela, Thompson, and Rosch would say the former. The enactive view holds that intelligence isn’t about producing correct outputs. It’s about enacting a world through embodied interaction. A system that produces correct outputs without enacting anything is doing something different from thinking, regardless of how impressive the outputs are.
Whether that distinction matters for practical purposes is, I think, the most important open question at the intersection of cognitive science and AI. And the only way to answer it is to keep building systems, keep testing them against the real world, and keep paying attention to where they break. The breaking points are where the theories get tested.