Research → Product Translation
Bridging the gap between research findings and shipped products
In 2017, eight researchers at Google published “Attention Is All You Need.” It described a new neural network architecture called the Transformer. The paper was good. Clear writing, strong results, solid benchmarks. It showed that you could replace recurrence and convolutions with self-attention and get better translation quality with less training time.
Seven years later, that architecture is the foundation of a multi-hundred-billion-dollar industry. GPT-4, Claude, Gemini, Llama, Mistral. Every frontier model runs on Transformers. OpenAI alone is valued at over $150 billion. The combined market cap of companies built on variations of that 2017 paper is difficult to even estimate.
But here’s what doesn’t get talked about enough: Google itself didn’t ship the product that captured most of the value. OpenAI did. And OpenAI didn’t just read the paper and build ChatGPT the next week. It took five years of scaling experiments (GPT-1 in 2018, GPT-2 in 2019, GPT-3 in 2020), a pivot to reinforcement learning from human feedback, a massive infrastructure buildout, a controversial partnership with Microsoft, and a consumer product launch in November 2022 before any of it mattered to anyone outside of ML research.
The gap between “published research result” and “product people use” is enormous. Most research never crosses it. The ones that do take years, require different skills than the ones that produced the research, and often the value gets captured by someone other than the original researchers.
This is what I want to talk about. Not the platitudes about “innovation” or “disruption.” The actual mechanics of why brilliant research dies in the lab, what it takes to drag a finding from a paper into production, and why the people who are best at research are frequently the worst at shipping products.
The graveyard is bigger than the showcase
The success stories are well-known. Transformers became ChatGPT. CRISPR gene editing went from a 2012 Doudna-Charpentier paper to 359 active companies and $84.5 billion in disclosed funding. mRNA vaccines spent a decade in research before Moderna and BioNTech turned them into COVID vaccines in under a year, validating a technology that most pharma companies had dismissed. AlphaFold predicted protein structures with such accuracy that it won the Nobel Prize in 2024, and Isomorphic Labs (spun out of DeepMind in 2021) raised $600 million in April 2025 to put AI-designed drugs into human trials.
These are the stories people tell at conferences. They are not representative.
For every Transformer paper that spawns an industry, there are thousands of papers with strong results that go nowhere. Not because the research was bad. Because translation is a different discipline than discovery, and almost nobody is good at both.
MIT’s 2025 study, “The GenAI Divide: State of AI in Business,” put numbers on this. Despite $30-40 billion in enterprise spending on generative AI, 95% of organizations are seeing no business return. Only 5% of AI pilot programs achieve rapid revenue acceleration. The rest stall, delivering little to no measurable impact. And these aren’t even research translations. These are companies trying to deploy existing, commercially available AI tools into their workflows. If deploying someone else’s finished product fails 95% of the time, imagine the success rate when you’re starting from a research paper.
Gartner’s numbers tell a similar story from a different angle: 85% of AI initiatives never make it to production. For every 33 prototypes a company builds, four make it into production. That’s an 88% failure rate at the prototype-to-production stage alone.
The question isn’t “does research create value?” It does. The question is: why does the value creation pipeline leak so badly, and where exactly are the holes?
The valley has a topology
People love the phrase “valley of death” to describe the gap between research and product. It’s catchy. It’s also too simple. A valley implies a single dip you need to cross. The reality is more like a mountain range with multiple passes, each requiring different equipment and different guides.
Here’s how I think about the stages, and where the failures cluster:
| Stage | What happens | Primary failure mode |
|---|---|---|
| Research (TRL 1-3) | Basic principles observed, concept formulated, proof of concept | None. This is where things work. |
| Prototype (TRL 4-5) | Component validation in lab, system validation in simulated environment | “Works on my machine.” Results don’t replicate outside controlled conditions. |
| Demonstration (TRL 6-7) | System demonstrated in relevant environment, prototype in operational environment | Engineering tradeoffs destroy research assumptions. Latency, cost, reliability requirements conflict with model performance. |
| Deployment (TRL 8-9) | System complete, qualified, proven in operational environment | Organizational inertia. Integration with existing systems. Edge cases at scale. Regulatory and compliance barriers. |
NASA developed Technology Readiness Levels (TRL 1-9) in the 1970s to formalize this progression. The framework was designed for hardware, but it maps surprisingly well to software and AI systems. The key insight is that each level requires different competencies, different incentives, and different organizational structures. The people who are excellent at TRL 1-3 (researchers) are rarely the same people who excel at TRL 7-9 (production engineers). And the most dangerous gap, the one where most projects die, is TRL 4-6: the transition from “it works in a controlled setting” to “it works in the real world.”
This is where the valley of death actually lives. Not between “research” and “product” as abstract concepts, but in the specific, unglamorous work of making something that works in a lab also work when the data is messy, the latency budget is 200ms, the users are unpredictable, and the system needs to run 24/7 without a researcher babysitting it.
Why researchers don’t ship (and shouldn’t have to)
There’s a persistent fantasy in tech that the ideal employee is a “full-stack” person who can research a novel algorithm, implement it in production-grade code, deploy it at scale, and write the blog post about it. This person barely exists. And the few who do exist usually aren’t optimizing for any one of those things. They’re spread too thin to do world-class research or world-class engineering.
Research and production engineering optimize for fundamentally different things:
| Dimension | Research | Production |
|---|---|---|
| Success metric | Novel insight, publication, citation | Revenue, reliability, user satisfaction |
| Time horizon | Months to years | Weeks to quarters |
| Failure tolerance | High. Most experiments should fail. | Low. Downtime costs money. |
| Data assumptions | Clean, curated, benchmarked | Messy, drifting, incomplete |
| Code quality | Good enough to reproduce results | Maintainable, tested, monitored |
| Optimization target | State-of-the-art accuracy on benchmark | Good enough accuracy at acceptable cost and latency |
| Feedback loop | Peer review (months) | User metrics (hours) |
These aren’t just different priorities. They’re different cognitive modes. A researcher asking “what’s possible?” and an engineer asking “what’s reliable?” are applying different heuristics to the same problem. The researcher wants to push the frontier. The engineer wants to hold the line.
Neither is wrong. But asking someone to do both simultaneously is like asking a sprinter to also run the marathon. The training regimes conflict.
The damage happens when organizations pretend this gap doesn’t exist. When a VP reads a paper from the research team, gets excited, and says “great, ship it by Q3.” The researchers try to productionize their own work because nobody else understands it well enough. They underestimate infrastructure, overestimate how well their approach generalizes beyond the benchmark, and skip the boring-but-critical work of error handling, monitoring, graceful degradation, and user experience. The project ships late, buggy, and fragile. Or it never ships at all.
The four walls of the translation gap
Through reading case studies and talking to people who’ve done this across different domains, I’ve come to think of the translation gap as having four distinct barriers. Most failed translations hit at least two of them.
Wall 1: The distribution mismatch
This is the most common killer in AI/ML specifically. Research models are evaluated on test sets drawn from the same distribution as training data. Production environments are out of distribution by default.
A computer vision model trained on ImageNet performs well on ImageNet-style images. Deploy it in a warehouse with different lighting, angles, backgrounds, and motion blur, and accuracy drops 20-40%. A language model fine-tuned on curated dialogue performs well on benchmarks. Give it to real users who misspell words, paste in HTML fragments, ask questions in Hinglish, and hit edge cases the training data never covered, and it degrades in ways that benchmarks never predicted.
The a16z piece on the “physical AI deployment gap” nails this for robotics specifically: a manipulation policy trained in a lab encounters fundamentally different conditions in a real factory. The gap isn’t a bug. It’s structural.
Closing this wall requires obsessive instrumentation. You need to know exactly how your production data differs from your research data, and you need feedback loops that catch distribution drift before it becomes a user-facing failure. Most research teams don’t build this infrastructure because it’s not what they’re optimizing for.
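None of that instrumentation needs to be exotic. As an illustration (the metric choice and the alert thresholds are my own, not from any specific deployment), here is a sketch of a population-stability check that compares production inputs against the research-time reference sample:

```python
import math
import random

def psi(reference, production, bins=10):
    """Population Stability Index between two 1-D samples.

    A common rule of thumb: PSI < 0.1 means little shift, 0.1-0.25 a
    moderate shift, > 0.25 a shift worth alerting on. These thresholds
    are illustrative conventions, not universal constants.
    """
    lo, hi = min(reference), max(reference)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[0] = float("-inf")   # catch production values below the reference min
    edges[-1] = float("inf")   # ...and above the reference max

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            for i in range(bins):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
        n = len(sample)
        # Smooth empty bins so the log term stays finite.
        return [(c + 0.5) / (n + 0.5 * bins) for c in counts]

    ref_f, prod_f = fractions(reference), fractions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref_f, prod_f))

random.seed(0)
reference = [random.gauss(0.0, 1.0) for _ in range(5000)]  # "research" data
drifted = [random.gauss(0.8, 1.3) for _ in range(5000)]    # shifted production data
same = [random.gauss(0.0, 1.0) for _ in range(5000)]       # in-distribution traffic

assert psi(reference, same) < 0.1     # quiet when nothing has changed
assert psi(reference, drifted) > 0.25  # loud when the distribution moves
```

Run per feature on a schedule against live traffic, a check like this turns the silent distribution mismatch into an alert you can act on before users notice.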
Wall 2: The engineering tax
Research code is written to validate ideas. Production code is written to serve users. The gap between these two is not “a bit of refactoring.” It’s often a complete rewrite.
Research code typically has:
- Hard-coded paths, magic numbers, and implicit dependencies
- No error handling beyond “crash and look at the stack trace”
- Performance characteristics that are acceptable for a single experiment but not for a real-time service
- Dependencies on specific library versions pinned to whatever was on the researcher’s machine
- No monitoring, no alerting, no graceful degradation
Turning this into a production system means building:
- Input validation and sanitization
- Proper error handling with fallbacks
- Latency-optimized inference paths (quantization, batching, caching)
- Monitoring dashboards that track model performance in real time
- A/B testing infrastructure
- Rollback mechanisms
- Data pipelines that handle the messy reality of production data
This is the “what looked 90% complete is actually 10% done” problem that practitioners describe. The research result is the easy part. The infrastructure around it is the hard part. And crucially, it’s invisible work. Nobody gets a promotion for writing the monitoring dashboard. Nobody publishes a paper about the input sanitization layer.
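To make the contrast concrete, here is a minimal sketch of the wrapper work the second list implies: input validation, a latency budget, and a fallback path. All names and thresholds here are invented for illustration, not taken from any real system:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("inference")

LATENCY_BUDGET_S = 0.2  # the 200 ms budget research code never had to meet

def heuristic_fallback(text):
    """Cheap, deterministic answer served when the model can't be trusted."""
    return {"label": "unknown", "source": "fallback"}

def predict_with_guardrails(model_fn, text):
    """Wrap a raw model call with validation, timing, and graceful degradation."""
    # Input validation: research code assumed clean inputs; users won't provide them.
    if not isinstance(text, str) or not text.strip() or len(text) > 10_000:
        return heuristic_fallback(text)

    start = time.monotonic()
    try:
        result = model_fn(text)
    except Exception:
        log.exception("model call failed; serving fallback")
        return heuristic_fallback(text)

    elapsed = time.monotonic() - start
    if elapsed > LATENCY_BUDGET_S:
        # Don't fail the request, but make the breach visible to monitoring.
        log.warning("latency budget exceeded: %.3fs", elapsed)

    # Output validation: a model that returns garbage should degrade, not crash.
    if not isinstance(result, dict) or "label" not in result:
        return heuristic_fallback(text)
    result["source"] = "model"
    return result

# Usage with a stand-in "model" (a real one would be a GPU or network call):
fake_model = lambda text: {"label": "positive"}
assert predict_with_guardrails(fake_model, "great product")["source"] == "model"
assert predict_with_guardrails(fake_model, "")["source"] == "fallback"
```

Everything in this sketch is the invisible 90%: none of it improves a benchmark number, and all of it is mandatory before real users arrive.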
Wall 3: The incentive misalignment
Researchers are incentivized to publish papers and advance the state of the art. Engineers are incentivized to ship reliable products. Product managers are incentivized to hit quarterly metrics. These incentives don’t just fail to align. They actively conflict.
A researcher who spends six months productionizing their own work has six months fewer papers to show at review time. Their peers who stayed in the lab and published are advancing faster. In academia, this is career suicide. In industry research labs (Google Research, Meta FAIR, DeepMind), the incentive structures are slightly better but still heavily tilted toward publication.
This creates a selection effect. The researchers who are best at translation tend to leave pure research to join product teams, found startups, or take hybrid roles. The ones who stay in research are (rationally) the ones who value publications over products. Over time, research labs become excellent at producing papers and poor at producing products, not because the people are incompetent, but because the system selects for paper-production.
Xerox PARC is the canonical cautionary tale. In the 1970s, PARC invented the graphical user interface, the laser printer, Ethernet, and the WYSIWYG word processor. These were not incremental advances. They were foundational technologies that shaped the next fifty years of computing. But Xerox, a copier company, couldn’t see how to turn them into copier-adjacent products. Steve Jobs visited PARC in 1979, saw the GUI, and built the Macintosh. Charles Simonyi left PARC for Microsoft and built the Office suite. The laser printer was the one PARC invention that Xerox actually commercialized, because it was close enough to the copier business that existing incentive structures could accommodate it.
The lesson isn’t that PARC researchers were bad at translation. It’s that Xerox’s organizational structure, incentive system, and strategic vision couldn’t metabolize innovations that didn’t fit the existing business model. The innovations that aligned with Xerox’s core business shipped. Everything else was a gift to competitors.
Wall 4: The context collapse
Research papers communicate findings. They don’t communicate the hundreds of small decisions, failed attempts, parameter sensitivities, and contextual knowledge that the researchers accumulated during the work. When someone else tries to implement the paper, they’re working from a lossy compression of the original knowledge.
This is why replication crises exist in multiple fields. The paper says “we used a learning rate of 3e-4.” It doesn’t say “we tried fifty learning rates and this was the only one that worked, and we suspect it’s sensitive to batch size in ways we didn’t fully characterize.” The paper says “our model achieves 94.2% accuracy on benchmark X.” It doesn’t say “we ran the experiment twelve times and reported the best result, and the variance across runs was 3.1 percentage points.”
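The missing variance is cheap to surface if the reporting protocol demands it. Here is a sketch, with invented numbers matching the scenario above, of what reporting across runs looks like instead of reporting the best run:

```python
import statistics

# Accuracies from twelve runs of the "same" experiment, differing only in seed.
# These numbers are invented to illustrate the reporting gap, not from any paper.
runs = [0.942, 0.915, 0.931, 0.911, 0.938, 0.924,
        0.919, 0.927, 0.933, 0.916, 0.929, 0.921]

best = max(runs)                # what the paper reports
mean = statistics.mean(runs)    # what a replicator should actually expect
spread = max(runs) - min(runs)  # the variance the paper omits

print(f"best: {best:.3f}, mean: {mean:.3f}, spread: {spread:.3f}")
assert best - mean > 0.01  # the headline number overstates the typical run
```

A replicator working from the paper alone will burn compute chasing 94.2% when the honest target is the mean, give or take the spread.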
For AI specifically, this context collapse is devastating. Modern ML systems have enormous hyperparameter spaces. Reproducing a result without the tacit knowledge of the original team can take weeks of compute and engineering time, if it works at all. Google's own experience illustrates this: when they deployed the Transformer architecture in Google Translate, they didn't use a pure Transformer. They used a hybrid Transformer-encoder/RNN-decoder architecture in 2018, and didn't move to a fully Transformer-based system until 2020. The original paper's architecture needed significant adaptation for production, and even the team that invented Transformers took three years to fully deploy them in their own translation product.
The translator: a role that barely exists
If the gap between research and product is this wide, you’d think every serious organization would have people dedicated to bridging it. Some do. Most don’t. And even the ones that do rarely formalize the role.
The “research translator” (or “applied research scientist” or “research engineer” depending on the org) sits between the research team and the product engineering team. They read papers, understand the math, and can also write production code. They know which research results are robust enough to deploy and which are benchmark-gaming artifacts. They understand both the researcher’s “but the accuracy!” and the engineer’s “but the latency!”
This person is extraordinarily rare, for the same reason that the research-production gap exists in the first place: the skills required are cultivated by different career paths. A PhD program trains you to do research. A software engineering career trains you to ship code. The translator needs deep competence in both, plus the political skill to navigate two teams with different incentives and vocabularies.
DeepMind’s structure offers one model. Their research engineers are explicitly positioned as bridges between research and engineering. The job descriptions mention “translating complex AI technical capabilities into actionable product specifications” and “translating complex concepts for diverse audiences.” This is the translator role, though DeepMind calls it something else.
The problem is that the translator role is hard to evaluate. A researcher’s output is papers. An engineer’s output is shipped code. What’s a translator’s output? “Helped a research result become a product” doesn’t fit neatly into any performance review framework. It’s the kind of contribution that’s visible in retrospect but invisible in real-time, which means translators are perpetually undervalued, underpromoted, and at risk of being eliminated in the next reorg.
Case studies in translation (and anti-translation)
AlphaFold: The gold standard
AlphaFold is the clearest recent example of successful research-to-product translation, and it’s instructive to look at why.
DeepMind published AlphaFold 2 in 2020, demonstrating that AI could predict protein structures with accuracy comparable to experimental methods. This was a genuine breakthrough, not an incremental improvement. It won CASP14 (the protein structure prediction competition) by a margin that shocked the field.
But predicting protein structures is research. Making drugs is a product. DeepMind’s leadership recognized that these require different organizations, and in 2021, they spun out Isomorphic Labs as a separate company with a separate mandate: take the underlying AI capabilities and build a drug discovery platform.
The key decisions that made this work:
- Organizational separation. Isomorphic Labs is not a team within DeepMind. It's a separate company with its own leadership, its own funding, and its own incentive structure. Researchers at DeepMind can keep doing research. People at Isomorphic can focus on product.
- Patient capital. Drug discovery takes a decade. Isomorphic raised $600 million in April 2025 with no expectation of near-term revenue. The investors (led by Thrive Capital) understand the timeline.
- Strategic partnerships. Isomorphic signed collaborations with Novartis and Eli Lilly in 2024, getting both pharmaceutical expertise and clinical trial infrastructure that a pure AI company doesn't have.
- Iterative model improvement. AlphaFold 3 (released May 2024) wasn't just a research upgrade. It was designed with drug discovery applications in mind, modeling interactions between proteins and small molecules, not just protein structures in isolation.
The result: Isomorphic is preparing to put AI-designed drugs into human trials, with oncology candidates first. Whether the drugs work is an open question. But the translation machinery, taking a research result and building an organization, partnership structure, and product pipeline around it, is as good as it gets.
mRNA vaccines: The decade-long overnight success
Katalin Karikó published her foundational work on modified nucleosides in mRNA, with Drew Weissman, in 2005. For the next fifteen years, she struggled to get funding, was demoted by UPenn, and watched most of the scientific establishment dismiss mRNA therapeutics as impractical.
Moderna was founded in 2010 to commercialize mRNA technology. BioNTech was founded in 2008. Both spent over a decade refining manufacturing processes, solving stability problems (mRNA degrades rapidly), developing lipid nanoparticle delivery systems, and running small-scale trials for diseases nobody was urgently worried about.
Then COVID-19 hit in early 2020, and suddenly the entire world needed a vaccine immediately. Moderna and BioNTech had the platform. They had solved the hard translation problems. They hadn’t shipped a blockbuster product yet, but the manufacturing capability, the clinical trial expertise, the regulatory relationships, and the delivery technology were all in place.
The Moderna COVID vaccine went from sequence selection to first human dose in 63 days. That speed was only possible because of fifteen years of unglamorous translation work that preceded it.
The lesson: translation is slow infrastructure work that looks like waste until the moment it becomes essential. Karikó won the Nobel Prize in 2023. The prize was for the research. But the lives saved were because of the translation.
The GPT timeline: Five and a half years from paper to product
The Transformer timeline is worth spelling out in detail because it illustrates how non-linear translation actually is:
| Date | Event | TRL equivalent |
|---|---|---|
| June 2017 | “Attention Is All You Need” published | TRL 1-2: Basic principles |
| June 2018 | OpenAI publishes GPT-1 (117M parameters) | TRL 3: Proof of concept |
| Feb 2019 | GPT-2 (1.5B parameters), “too dangerous to release” | TRL 4: Lab validation |
| May 2020 | GPT-3 (175B parameters), API access only | TRL 5-6: System validation |
| 2020 | Microsoft licenses GPT-3 exclusively | Business model emerging |
| March 2022 | InstructGPT paper (RLHF alignment) | TRL 7: Critical product insight |
| Nov 2022 | ChatGPT launches | TRL 8-9: Production deployment |
Five and a half years passed between the original paper and a product anyone outside ML could use. During that time, the critical translation steps weren’t just “make it bigger” (though that helped). They were:
- Infrastructure buildout: Training GPT-3 required building custom distributed training infrastructure that didn’t exist when the Transformer paper was published.
- Safety and alignment: Raw GPT-3 was powerful but uncontrolled. RLHF (reinforcement learning from human feedback), described in the InstructGPT paper in March 2022, was the product insight that made ChatGPT usable by non-experts.
- Interface design: Wrapping a language model in a chat interface seems obvious in retrospect. It wasn’t. The API-only access model for GPT-3 (2020-2022) limited adoption to developers. The chat interface opened it to everyone.
- Organizational bet: OpenAI bet the company on scaling Transformers before it was obvious that scaling would work. That bet could have failed.
The Transformer paper itself was necessary but nowhere near sufficient. Without five years of translation work, it would be a well-cited paper, not a product category.
The organizational physics of translation
Having laid out the barriers and the case studies, I want to propose a framework for thinking about what makes translation succeed or fail at the organizational level. This is based on patterns I’ve seen across the examples above and in other research-to-product transitions.
The three conditions
Successful research-to-product translation requires three things to be true simultaneously. Miss any one of them and the translation stalls.
```
              SUCCESSFUL TRANSLATION
                        |
          +-------------+-------------+
          |             |             |
      TECHNICAL   ORGANIZATIONAL    MARKET
      READINESS      CAPACITY        PULL
          |             |             |
    "Can it work     "Can we      "Does anyone
     outside the    build it?"     want it?"
     lab?"
```
Technical readiness means the research result is robust enough to survive contact with real-world data, latency constraints, and edge cases. Not all research is. Some results are inherently fragile: they work on specific benchmarks with specific preprocessing pipelines and fall apart otherwise. Identifying which results are robust and which are brittle is itself a skill, and it’s one that neither researchers nor engineers are naturally trained in.
Organizational capacity means the company has people who can do the translation work, a structure that supports it, and incentive systems that don’t punish it. This is where most failures happen. The technology is ready. The market exists. But the organization can’t execute because the research team and the engineering team don’t talk to each other, nobody’s job description includes “make research work in production,” and the VP who championed the project left.
Market pull means someone will pay for (or at least use) the result. This seems obvious, but plenty of technically impressive research solves problems that nobody has, or solves real problems in ways that don't fit into existing workflows. Google Glass was technically impressive, built on real research in computer vision and augmented reality, but the market pull wasn't there. Nobody wanted to wear a camera on their face in public. The technology worked. The organization shipped it. The market said no.
All three conditions were present for the successes: AlphaFold had technical readiness (the accuracy was transformative), organizational capacity (DeepMind spun out a dedicated company), and market pull (pharma companies spend billions on drug discovery and are desperate for better tools). ChatGPT had technical readiness (RLHF made it usable), organizational capacity (OpenAI was structured as a product company, not a research lab), and market pull (everyone has questions they want answered and text they want written).
All three conditions were absent in some combination for the failures: Xerox PARC had technical readiness but lacked organizational capacity (Xerox couldn’t metabolize non-copier innovations) and didn’t see market pull (they didn’t understand that personal computing was a market). The 95% of enterprise AI projects that fail per MIT’s study mostly have market pull (the company wants AI to work) but lack technical readiness (the tools don’t fit their data and workflows) and organizational capacity (nobody knows how to integrate AI into existing processes).
The translation team
If I were building a translation function from scratch (and had budget), here’s the team I’d want:
| Role | What they do | Why they’re critical |
|---|---|---|
| Applied Researcher | Reads papers, reproduces results, identifies which findings are robust enough for production | Prevents the team from building on fragile foundations |
| Systems Engineer | Builds the inference infrastructure, handles latency/throughput/reliability | Research code doesn’t scale. This person makes it scale. |
| Data Engineer | Builds pipelines that handle production data (messy, drifting, incomplete) | The distribution mismatch kills most deployments. This person manages it. |
| Product Designer | Designs the interface between the technology and the user | A powerful model behind a bad interface is useless. See: GPT-3 API vs. ChatGPT. |
| Domain Expert | Understands the problem domain deeply enough to evaluate whether the solution actually works | ML metrics and real-world performance are different things. This person catches the gap. |
Notice what’s not on this list: the original researchers. They should be available for consultation, but they should not be doing the translation work. Their time is more valuable doing research, and their skills are not optimized for production engineering. Trying to make researchers into production engineers wastes their best capabilities.
This is the opposite of what most companies do. Most companies hand the research paper to the research team and say “make it a product.” This fails for all the reasons I’ve described: different skills, different incentives, different cognitive modes.
The AI-specific translation problem
AI and ML research has a particularly vicious translation problem because of a dynamic I’ll call the “demo-production gap.” An AI demo is trivially easy to make impressive. A production AI system is extraordinarily hard to make reliable.
You can build a compelling demo in an afternoon: take a pre-trained model, cherry-pick some inputs, show the impressive outputs, skip the failure cases. The demo looks like magic.
Then you try to deploy it.
The model hallucinates on 15% of inputs. Latency is 3x your budget because nobody optimized the inference path. The training data doesn’t cover your actual use case. Users find adversarial inputs within hours. The model’s behavior drifts as the underlying data distribution changes. You need monitoring, but nobody built it. You need a fallback for when the model fails, but the product was designed assuming the model always works.
S&P Global found that in 2025, 42% of companies abandoned most of their AI initiatives, up from 17% the previous year. The spike correlates with the period after the initial ChatGPT hype when companies tried to deploy generative AI and discovered the demo-production gap firsthand.
The MIT study identifies a specific mechanism: AI tools’ inability to retain feedback, adapt to specific contexts, or improve performance over time. Unlike a human employee who accumulates institutional knowledge, most deployed AI systems are static. They don’t learn from the specific patterns of your organization. This means the gap between demo performance and production performance doesn’t close over time. It persists, or widens as the data drifts.
The companies in the 5% that succeed share a pattern: they treat AI deployment as a systems problem, not a model problem. They invest in data infrastructure before model development. They build monitoring and feedback loops from day one. They start with narrow, well-defined use cases rather than ambitious “transform the business” initiatives. They buy or partner rather than build from scratch, achieving 67% success rates compared to roughly 22% for internal builds.
What actually works: A practitioner’s checklist
After reading through the research, the case studies, and the failure analyses, here’s what I think the evidence supports for anyone trying to translate research into products:
Start with the problem, not the paper. The most common translation failure mode is “we have this cool research, let’s find a use case for it.” The success stories work the other direction: “we have this painful problem, and this research result might solve it.” AlphaFold worked because protein structure prediction is a real bottleneck in drug discovery. ChatGPT worked because “answer my question in natural language” is a universal human need. Google Glass failed because “display information on a head-mounted screen” was a solution looking for a problem.
Separate the research team from the translation team. Let researchers research. Hire (or form) a dedicated translation team with production engineering skills, domain expertise, and a product orientation. The researchers consult. They don’t execute the translation.
Build instrumentation before you build features. The distribution mismatch wall kills silently. If you don’t have monitoring that tells you how production data differs from research data, and alerting that fires when model performance degrades, you won’t know your deployment is failing until users complain. By then, trust is already damaged.
Budget 10x the research timeline for translation. The Transformer paper took six months of research. The translation to ChatGPT took five years. mRNA research started in the 1990s. The vaccines shipped in 2020. AlphaFold 2 was published in 2020. Isomorphic Labs is entering human trials in 2025-2026. If your research took a year, budget a decade for translation. You might be pleasantly surprised, but you won’t be unpleasantly surprised.
Use TRL as a communication tool. Technology Readiness Levels give researchers, engineers, product managers, and executives a shared language for “how close is this to shipping?” Without it, a researcher says “it’s basically done” (meaning TRL 3: proof of concept works) and a product manager hears “ship it next quarter.” The TRL framework makes the remaining work visible.
Acquire the tacit knowledge, not just the paper. If you’re translating someone else’s research, the paper is the tip of the iceberg. The failed experiments, the parameter sensitivities, the preprocessing steps that aren’t documented, the hardware-specific quirks, all of this is tacit knowledge that lives in the researchers’ heads. Hire them as consultants. Visit their lab. Run their code on their machine before you try to run it on yours.
Build the boring infrastructure first. Data pipelines. Monitoring. Error handling. Fallback mechanisms. A/B testing. Rollback capability. None of this is exciting. All of it is necessary. The MIT study found that enterprise AI projects fail not because the models are bad, but because the operational infrastructure around them doesn’t exist. Build the ops before you build the model.
The real bottleneck is organizational, not technical
I’ve been talking about this as if it’s primarily a technical problem. It’s not. The technical challenges are real, but they’re solvable. We know how to build monitoring systems. We know how to optimize inference latency. We know how to handle data drift.
The hard problem is organizational. It’s convincing a research lab to let someone else productionize their work. It’s creating incentive structures that reward translation (not just publication and not just shipping). It’s getting researchers and engineers to respect each other’s constraints instead of dismissing them. It’s maintaining patience and funding through the years-long timeline that translation requires.
The 95% failure rate in MIT’s study isn’t a technical failure rate. It’s an organizational failure rate. The tools work. The companies don’t know how to integrate them. That’s a people problem, a process problem, and an incentive problem, all dressed up as a technology problem.
The organizations that crack this, the Isomorphic Labs and the OpenAIs, don’t crack it by being smarter about technology. They crack it by building structures that acknowledge the gap and staff it deliberately. They treat translation as a first-class discipline, not an afterthought that researchers are expected to do in their spare time.
The counter-argument worth taking seriously
There is a legitimate contrarian position here: maybe the translation gap is a feature, not a bug. Maybe the high failure rate is the system working as designed.
The argument goes like this: research is inherently speculative. Most research should not become products, because most research explores dead ends, marginal improvements, or solutions to problems that don’t exist at scale. The “valley of death” is actually a filter that prevents companies from wasting resources productionizing research that isn’t ready.
There’s truth in this. Not every paper should be a product. The problem isn’t that the filter exists. The problem is that the filter is stupid. It doesn’t select for “research that would make a great product.” It selects for “research that happens to be done by people who also have productionization skills, in organizations that happen to have the right incentive structures, at a time when the market happens to be ready.” That’s a lot of “happens to.” The best research often has the hardest time crossing the valley, because breakthrough results are the ones that require the most organizational adaptation to deploy.
A smarter filter would evaluate research on technical readiness, organizational fit, and market pull explicitly, then invest in translation infrastructure for the things that pass. Instead, most organizations rely on chance encounters between researchers who have a result and product teams who have a problem, and then wonder why translation is so rare.
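Even a toy version of that filter forces the evaluation to be explicit rather than vibes. The weights below are placeholders I made up; arguing about them inside your organization is the useful part:

```python
def translation_score(trl, org_fit, market_pull, weights=(0.4, 0.3, 0.3)):
    """Screening score for a research result, in [0, 1].

    trl:         Technology Readiness Level, 1-9
    org_fit:     0-1, judged explicitly ("do we have a team that could own this?")
    market_pull: 0-1, judged explicitly ("is anyone actually asking for this?")
    """
    w_trl, w_org, w_mkt = weights
    return w_trl * (trl / 9) + w_org * org_fit + w_mkt * market_pull
```

A result that scores low doesn't die; it goes back on the shelf with a named reason, which is already an improvement over translation by chance encounter.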
The coming decade
The translation problem is about to get both harder and easier.
Harder, because the frontier of AI research is moving faster than translation infrastructure can keep pace with it. New model architectures, new training techniques, new capabilities emerge monthly. By the time a translation team finishes productionizing one research result, three more promising results are waiting. The backlog grows faster than the throughput.
Easier, because the tooling for deployment is improving rapidly. Model serving frameworks, inference optimization libraries, monitoring platforms, and evaluation harnesses are all more mature than they were even two years ago. The engineering tax for translation is dropping, even as the volume of research worth translating is rising.
The organizations that will capture the most value in the next decade won’t be the ones that produce the best research, or the ones with the best engineering. They’ll be the ones with the best translation machinery: the people, processes, and organizational structures that systematically turn findings into products.
The Transformer paper was published by Google researchers. The product that changed the world was built by OpenAI. The research created the value. The translation captured it. That pattern will repeat, and the question for every organization doing research is: will you be the one who publishes the paper, or the one who ships the product?
The answer depends entirely on whether you treat translation as a real discipline or a happy accident.