18 min read

Metrics That Matter

Choosing the right metrics and avoiding Goodhart's Law in product development.

In 2016, news broke that Wells Fargo employees had opened approximately 1.5 million deposit accounts and 565,000 credit card accounts without customer authorization. Customers were charged fees they never agreed to. Cars were repossessed. Homes went into foreclosure. The bank eventually paid $3 billion in fines.

The metric that caused all of this? Cross-sell ratio: the number of products each customer held. Wells Fargo’s leadership decided that cross-selling was the engine of profitable growth, set aggressive per-employee targets, tied compensation to those targets, and watched as the number climbed beautifully upward.

The metric went up. The bank was destroyed.

This is Goodhart’s Law at industrial scale: when a measure becomes a target, it ceases to be a good measure. And while most product teams won’t create fraudulent bank accounts, the underlying failure mode is everywhere. Teams pick metrics, optimize for them, and end up in places they never intended to go. Not because they’re malicious, but because choosing the right metric is one of the hardest problems in product management, and almost nobody treats it with the seriousness it deserves.

Goodhart’s Law is not a warning. It’s a law.

Let me be precise about what Goodhart’s Law actually says, because it’s more subtle than the popular version.

Charles Goodhart, a British economist, originally formulated this in 1975 in the context of monetary policy: “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” The phrasing that became famous is Marilyn Strathern’s restatement: “When a measure becomes a target, it ceases to be a good measure.”

The reason this is a law and not just a cautionary tale is that the mechanism is universal. Here’s why it always happens:

  1. You observe a correlation between a metric and an outcome you care about
  2. You make the metric a target
  3. People (or systems) find ways to improve the metric that don’t improve the outcome
  4. The correlation between metric and outcome weakens or breaks entirely
  5. You’re now optimizing for something that no longer predicts what you care about

This isn’t a bug in human behavior. It’s a fundamental property of optimization in complex systems. The moment you apply pressure to a metric, the system reorganizes to reduce that pressure in whatever way is cheapest, and the cheapest way to improve a metric is almost never the way you intended.

The cobra effect is the classic illustration. During British colonial rule in India, the government offered a bounty for dead cobras to reduce the cobra population. People started breeding cobras. The government canceled the bounty. The breeders released their now-worthless cobras. The cobra population increased.

Product teams recreate this pattern constantly, just with dashboards instead of snakes.

The anatomy of a bad metric

Before I talk about how to choose good metrics, let me dissect what makes metrics go wrong. There are specific structural properties that make a metric dangerous.

Vanity metrics: impressive but useless

Eric Ries coined the term “vanity metrics” in The Lean Startup, and the concept has become so widely discussed that most product people think they understand it. They usually don’t.

A vanity metric isn’t just a metric that looks good. It’s a metric that only looks good. It goes up over time as a natural function of growth and doesn’t tell you whether you’re actually getting better at anything.

| Vanity metric | Why it’s misleading | Actionable alternative |
| --- | --- | --- |
| Total registered users | Only goes up, says nothing about engagement | Monthly active users (MAU) or activation rate |
| Page views | Inflated by confused users clicking around | Task completion rate, pages per completed task |
| App downloads | Many downloads, few active users | Day 7 or Day 30 retention |
| Total revenue | Can grow while unit economics deteriorate | Revenue per user, customer lifetime value, payback period |
| Social media followers | Can be bought, doesn’t predict business impact | Engagement rate, referral-driven signups |

The test for whether a metric is vanity: can it go up while your product gets worse? If yes, it’s vanity. Total registered users can increase while your product hemorrhages active users. Page views can increase while users get more confused and less satisfied. Downloads can increase while retention craters.

Actionable metrics pass a different test: can you make a specific product decision based on this number? If your Day 7 retention drops from 40% to 35%, that tells you something actionable about onboarding quality. If your total registered users increases from 1 million to 1.1 million, that tells you… almost nothing about whether your product is improving.
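The actionability test is easy to make concrete. Here’s a minimal sketch of computing classic Day-N retention, using a toy event schema invented for illustration (a signup date per user, and a set of (user, date) records marking active days):

```python
from datetime import date, timedelta

def day_n_retention(signups, activity, n=7):
    """Share of users active exactly N days after signup (Day-N retention).

    signups: dict of user_id -> signup date
    activity: set of (user_id, date) pairs, one per active day
    """
    if not signups:
        return 0.0
    retained = sum(
        1 for user, signed_up in signups.items()
        if (user, signed_up + timedelta(days=n)) in activity
    )
    return retained / len(signups)

# Toy cohort: 4 signups on Jan 1; two of them come back exactly 7 days later.
signups = {u: date(2024, 1, 1) for u in ("a", "b", "c", "d")}
activity = {("a", date(2024, 1, 8)), ("b", date(2024, 1, 8)), ("c", date(2024, 1, 3))}
print(day_n_retention(signups, activity))  # 0.5
```

The same function with n=1 or n=30 gives the other cohort views; the point is that a drop in this number points at a specific stage of the product, which is exactly what "total registered users" cannot do.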

Proxy metrics: close but not quite

This is where metrics get genuinely tricky. A proxy metric is a metric you track because the thing you actually care about is hard to measure directly. You want to measure “user satisfaction” but you can’t, so you track “NPS score” or “session length” or “support ticket volume” as proxies.

Proxies are necessary. Product teams can’t directly measure most of the things they care about (user happiness, product-market fit, long-term retention, word-of-mouth). They need proxies. The danger is forgetting that the proxy is a proxy.

A data team at a content platform noticed that users who consumed multiple content types (articles, videos, podcasts) converted to paid subscriptions at much higher rates than single-type consumers. Obvious product decision: push users toward multi-type consumption. The team ran experiments to surface more content variety. Multi-type consumption went up. Revenue went down. Pages per visit dropped. Users were being pushed into behaviors they didn’t want, and they responded by disengaging.

The proxy (multi-type consumption) was correlated with the outcome (conversion) but wasn’t causal. The users who naturally consumed multiple content types were power users who would have converted anyway. Pushing casual users to behave like power users didn’t make them into power users. It made them annoyed.

This is the proxy metric trap: confusing correlation with causation, optimizing for the proxy instead of the underlying outcome, and watching the proxy improve while the thing you actually care about gets worse.
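The trap is easy to reproduce in a simulation. The sketch below (with hypothetical probabilities, chosen only to make the effect visible) builds a population where a hidden "intent" variable drives both multi-type consumption and conversion, and the two are never causally linked:

```python
import random

random.seed(0)

# Hidden confounder: user intent. High-intent users both consume more
# content types AND convert more often -- multi-type consumption itself
# has no causal effect on conversion in this model.
def simulate(users=10_000):
    multi_conv, single_conv = [], []
    for _ in range(users):
        high_intent = random.random() < 0.3
        multi_type = random.random() < (0.8 if high_intent else 0.1)
        converts = random.random() < (0.4 if high_intent else 0.02)
        (multi_conv if multi_type else single_conv).append(converts)
    return (sum(multi_conv) / len(multi_conv),
            sum(single_conv) / len(single_conv))

multi, single = simulate()
print(f"multi-type conversion: {multi:.1%}, single-type: {single:.1%}")
# Multi-type users convert at several times the single-type rate -- yet
# pushing a low-intent user into multi-type consumption changes nothing.
```

An observational analysis of this data would conclude "multi-type consumption drives conversion" with great statistical confidence, and it would be wrong. Only an experiment that randomly assigns the behavior can separate the two.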

Lagging indicators masquerading as leading ones

A lagging indicator tells you what already happened. A leading indicator tells you what’s about to happen. Most product teams think they’re tracking leading indicators. Most of them are actually tracking lagging indicators and reacting too slowly to change course.

| Metric | Leading or lagging? | Why it matters |
| --- | --- | --- |
| Monthly revenue | Lagging | By the time revenue drops, the underlying problem has been festering for months |
| NPS score | Lagging | Satisfaction today reflects decisions made 3-6 months ago |
| Churn rate | Lagging | Users decided to leave weeks or months before they actually cancel |
| Feature adoption (first 48 hours) | Leading | Early adoption patterns predict long-term usage |
| Support ticket themes | Leading | New categories of complaints often signal emerging product problems |
| Activation rate for new cohorts | Leading | Changes in activation predict future retention and revenue |

The practical problem: most dashboards are full of lagging indicators because they’re easy to measure and feel definitive. Revenue is a number. NPS is a number. Churn rate is a number. Leading indicators are often qualitative, fuzzy, or require more sophisticated analysis (cohort tracking, behavioral segmentation, pattern detection in support tickets).

Teams that rely on lagging indicators are always reacting to problems that have already calcified. Teams that invest in leading indicators can intervene before the damage is done.

The North Star metric: useful framework, dangerous if misunderstood

The North Star metric framework, popularized by Amplitude and Sean Ellis, has become the default vocabulary for product metric discussions. The core idea is simple: identify a single metric that best captures the value your product delivers to users, and orient your entire team around improving it.

Examples from well-known companies:

| Company | North Star metric | Why it works |
| --- | --- | --- |
| Spotify | Time spent listening | Directly reflects user value: more listening = more value received |
| Airbnb | Nights booked | Captures the core transaction for both hosts and guests |
| Slack | Daily active users | Reflects habitual use of the communication tool |
| Facebook (early) | Users adding 7 friends in 10 days | Predicted long-term retention; indicated network value threshold |

The framework is useful because it forces alignment. When the entire product team agrees on one metric, every feature discussion has a common denominator: “does this move the North Star?” It prevents the fragmentation that happens when different teams optimize for different metrics that occasionally conflict.

But the framework has a dangerous failure mode that I see constantly: teams treat the North Star metric as the only metric that matters. This leads to exactly the kind of tunnel vision that Goodhart’s Law predicts.

Facebook’s early North Star metric (7 friends in 10 days) was brilliant for predicting retention. But as the company optimized relentlessly for engagement metrics, they created features that drove time-on-site while eroding user trust, mental health, and ultimately public perception. The engagement numbers looked great. The product was becoming corrosive. Internal research, eventually leaked by whistleblower Frances Haugen in 2021, showed that Facebook’s own researchers knew Instagram was making body image issues worse for one in three teen girls. The engagement metrics never captured that.

This brings us to the single most important concept in product metrics, and the one I see teams skip most often.

Counter-metrics: the missing piece

A counter-metric (sometimes called a guardrail metric) is a metric you monitor to make sure your North Star optimization isn’t causing unintended harm. It’s the thing that tells you when you’ve gone too far.

Every North Star metric needs at least one counter-metric. Here’s the pattern:

| North Star | What you’re optimizing for | Counter-metric | What you’re protecting |
| --- | --- | --- | --- |
| Daily active users | Engagement | User-reported satisfaction / NPS | Making sure engagement comes from value, not addiction |
| Time in app | Depth of use | Task completion rate | Making sure time spent is productive, not confused |
| Conversion rate | Revenue | Refund rate, churn in first 30 days | Making sure conversions represent genuine purchases |
| Content published | Creator activity | Content quality score, consumer engagement per post | Making sure quantity doesn’t destroy quality |
| Feature adoption | Usage of new features | Support ticket volume, abandonment rate | Making sure adoption reflects value, not confusion |

Without counter-metrics, you’re driving with one eye closed. You know where you’re going but you can’t see what you’re hitting along the way.

The organizational challenge is that counter-metrics are politically uncomfortable. Nobody wants to be the person who says “yes, our key metric went up 15%, but our counter-metric went down 8%, so maybe this isn’t the win we think it is.” That person kills the celebration. That person is also usually right.

Metric trees: making the system visible

A North Star metric with counter-metrics is a good start. But for any product of real complexity, you need to see the full system of metrics and how they relate. This is where metric trees come in.

A metric tree decomposes your North Star metric into the inputs that drive it. Each input can be further decomposed. The tree makes the causal structure of your product visible and shows you exactly where to intervene.

Here’s a simplified example for a SaaS product:

North Star: Weekly Active Users (WAU)
│
├── New users activated this week
│   ├── Signups (marketing/sales input)
│   ├── Activation rate (product input)
│   │   ├── Onboarding completion rate
│   │   ├── Time to first value
│   │   └── First session depth
│   └── Channel mix (which channels produce highest-quality signups)
│
├── Returning users this week
│   ├── Day 7 retention (early retention)
│   ├── Day 30 retention (habit formation)
│   ├── Feature stickiness (% of users using core feature weekly)
│   └── Notification/email engagement rate
│
└── Resurrected users this week
    ├── Win-back campaign effectiveness
    ├── New feature announcements reaching dormant users
    └── Seasonal/event-driven reactivation

The value of this tree isn’t the diagram itself. It’s the conversations it forces. When WAU drops, the tree tells you where to look. Is it a new user acquisition problem? An activation problem? A retention problem? A resurrection problem? Without the tree, the conversation is vague (“we need to grow faster”). With the tree, the conversation is specific (“our activation rate dropped from 45% to 38% in the last cohort, let’s investigate onboarding changes”).
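The top level of the tree can be computed directly from activity data. A minimal sketch, assuming you can produce three sets of user IDs (active this week, signed up this week, active in the prior four weeks; the four-week dormancy window is an arbitrary choice for the example):

```python
def decompose_wau(active_this_week, signed_up_this_week, active_last_4_weeks):
    """Split weekly active users into the top branches of the metric tree.

    active_this_week: set of user_ids active this week
    signed_up_this_week: set of user_ids who signed up this week
    active_last_4_weeks: set of user_ids active in the previous 4 weeks
    """
    new = active_this_week & signed_up_this_week
    returning = (active_this_week - new) & active_last_4_weeks
    resurrected = active_this_week - new - returning  # dormant users coming back
    return {"new": len(new), "returning": len(returning),
            "resurrected": len(resurrected), "wau": len(active_this_week)}

stats = decompose_wau(
    active_this_week={"a", "b", "c", "d", "e"},
    signed_up_this_week={"a", "b"},
    active_last_4_weeks={"c", "d", "x"},
)
print(stats)  # {'new': 2, 'returning': 2, 'resurrected': 1, 'wau': 5}
```

When WAU moves, the first question is which of the three buckets moved, and this decomposition answers it before anyone starts speculating.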

Metric trees also reveal dependencies between teams. Marketing owns the top of the tree (signups). Product owns the middle (activation, retention). Customer success owns the bottom (resurrection, churn prevention). When the tree is visible, cross-functional collaboration becomes natural because everyone can see how their piece connects to the whole.

The metrics-that-lie taxonomy

After years of watching product teams get metrics wrong, I’ve identified five categories of metric lies. These aren’t exhaustive, but they cover the majority of metric failures I’ve observed.

Lie 1: The average that hides everything

Averages are the most popular way to lie with metrics because they compress a complex distribution into a single number that feels precise.

Average session length: 8 minutes. Sounds reasonable. But the distribution is bimodal: 60% of users spend 30 seconds and bounce, and 40% spend 20+ minutes deeply engaged. The “average” user, spending 8 minutes, doesn’t exist. The product actually has two completely different user populations with completely different behaviors, and the average obscures both.

The fix: always look at distributions, not just averages. Histograms, percentile breakdowns (p50, p90, p99), and segmented cohorts tell you what’s actually happening. The median (p50) is almost always more useful than the mean for product metrics.
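The bimodal-session example above is worth seeing numerically. A quick sketch with the standard library, using invented session lengths that match the 60/40 split:

```python
import statistics

# Bimodal sessions: 60% bounce at ~30 seconds, 40% stay ~20 minutes.
sessions = [0.5] * 60 + [20.0] * 40  # session lengths in minutes

mean = statistics.mean(sessions)
median = statistics.median(sessions)
p90 = sorted(sessions)[int(0.9 * len(sessions)) - 1]

print(f"mean={mean:.1f} min, median={median:.1f} min, p90={p90:.1f} min")
# mean=8.3 min describes a user who doesn't exist; median=0.5 min shows
# most users bounce; p90=20.0 min shows the deeply engaged tail.
```

Three numbers instead of one, and suddenly both populations are visible.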

Lie 2: The ratio that flatters

“Our conversion rate is 12%!” Impressive, until you learn that the denominator is “visitors who reached step 3 of onboarding” rather than “all visitors.” The ratio is mathematically correct but practically misleading because it excludes the 80% of visitors who never got to step 3.

Every ratio has a numerator and a denominator. Changing either one changes the ratio. When someone presents a flattering ratio, check the denominator. The denominator is where metrics get laundered.
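The arithmetic is trivial but worth making explicit (numbers invented to match the example above):

```python
# Same product, same conversions -- two very different stories depending
# on which denominator the ratio is computed against.
visitors = 10_000
reached_step3 = 2_000   # only 20% of visitors get this far
converted = 240

flattering = converted / reached_step3  # excludes 80% of visitors
honest = converted / visitors           # the full funnel

print(f"flattering: {flattering:.1%}")  # 12.0%
print(f"honest:     {honest:.1%}")      # 2.4%
```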

Lie 3: The trend that ignores seasonality

“Engagement is up 20% month over month!” Maybe. Or maybe it’s December and engagement always goes up in December for your product. Year-over-year comparisons, seasonal adjustment, and multi-year trend analysis prevent this lie. Month-over-month changes for metrics with any seasonal pattern are nearly meaningless without context.

Lie 4: The correlation presented as causation

“Users who enable notifications retain at 2x the rate of users who don’t. We should push notifications harder.” Or maybe users who are already highly engaged are more likely to enable notifications, and pushing notifications on disengaged users will just annoy them. This is the proxy metric trap again, and it’s the most common analytical error in product management.

The fix: run experiments. If you think X causes Y, test it. Don’t just observe the correlation and declare causation.

Lie 5: The metric that excludes the dead

Survivorship bias in metrics. “Our average user session length has increased by 30% over the past year!” Yes, because the users with short session lengths churned. You didn’t make the experience better. You just lost the users who found it worst. The surviving users were always highly engaged.

The fix: include churned users in your cohort analyses. Track how the full population behaves over time, not just the survivors.
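A toy cohort makes the effect obvious: drop the churned users from the denominator and the "average" improves with no change in anyone's behavior:

```python
# Year 1: the full cohort is measured. By year 2 the short-session users
# have churned, so the average rises with no product improvement at all.
year1 = [2, 2, 2, 2, 10, 10]   # minutes per session, full cohort
churned = {0, 1, 2}            # indices of users who left
year2 = [m for i, m in enumerate(year1) if i not in churned]

avg = lambda xs: sum(xs) / len(xs)
print(f"year 1 avg: {avg(year1):.1f} min")                    # 4.7
print(f"year 2 avg (survivors only): {avg(year2):.1f} min")   # 7.3
# Per-user behavior is unchanged; only the denominator shrank.
```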

Picking metrics: a practical framework

Here’s the framework I use when setting up metrics for a product or feature. It’s not original (it draws heavily from work by Amplitude, Reforge, and others), but the specific combination is what I’ve found to work in practice.

Step 1: Start with the job, not the metric

Before picking any metrics, articulate what job the product does for users. Not what features it has. What progress it enables. The metric should measure how well the product does that job.

If the product helps users find information quickly, the metric should reflect speed and accuracy of finding, not page views. If the product helps teams coordinate work, the metric should reflect coordination quality, not messages sent.

Step 2: Choose one North Star, add two to three counter-metrics

The North Star captures the core value. The counter-metrics protect against over-optimization. Keep it simple. Teams that track 30 metrics track zero metrics in practice, because nobody can hold 30 numbers in their working memory.

North Star: The single metric that best captures user value
Counter 1:  Protects user experience quality
Counter 2:  Protects business sustainability
Counter 3:  (Optional) Protects long-term trust/brand

Step 3: Build the metric tree

Decompose your North Star into its input metrics. For each input, identify who owns it (which team, which function) and what levers they have to improve it. This turns an abstract number into a concrete set of team-level goals.

Step 4: Define the “metric diet”

Not all metrics need to be monitored at the same frequency. Some metrics need daily monitoring (core engagement, error rates). Some need weekly review (retention cohorts, feature adoption). Some need monthly or quarterly assessment (NPS, brand perception, LTV).

| Cadence | What to track | Why |
| --- | --- | --- |
| Daily | Error rates, core engagement, conversion funnel | Catch fires early |
| Weekly | Retention cohorts, feature adoption, support ticket themes | Spot trends before they become problems |
| Monthly | NPS, revenue per user, LTV, churn analysis | Strategic health check |
| Quarterly | Brand perception, competitive positioning, market share | Long-term trajectory |

Step 5: Set up anomaly detection, not just dashboards

Dashboards are retrospective. You look at them after something has already happened. Anomaly detection is prospective: it alerts you when a metric deviates from its expected range before the damage accumulates.

This doesn’t require sophisticated ML. A simple statistical process control chart (mean plus/minus two standard deviations from the trailing 30-day baseline) catches most meaningful anomalies. If your Day 1 retention drops below two standard deviations from its trailing average, something changed and you need to investigate.
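Here’s roughly what that control-chart rule looks like in code (a sketch, with an invented retention baseline; a real implementation would use a rolling window over dated observations):

```python
import statistics

def spc_alert(history, today, k=2.0):
    """Flag today's value if it falls more than k standard deviations
    from the trailing-window mean (a simple control-chart rule)."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(today - mean) > k * sd, mean, sd

# Trailing 30 days of Day-1 retention hovering around 40%.
baseline = [0.40, 0.41, 0.39, 0.40, 0.42, 0.38, 0.40, 0.41, 0.39, 0.40] * 3

alert, mean, sd = spc_alert(baseline, today=0.33)
print(f"mean={mean:.3f}, sd={sd:.3f}, alert={alert}")  # alert=True
```

A value of 0.41 would pass quietly; 0.33 is far outside two standard deviations of this baseline and fires immediately, days before the drop would be visible on a monthly revenue chart.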

The politics of metrics

I want to address something that most metrics discussions avoid: the politics.

Metrics are not neutral. Choosing which metric to optimize is a power decision. The person who defines the metric defines what success looks like, which implicitly defines what failure looks like, which determines what gets funded and what gets cut.

In most organizations, metric selection happens in one of three ways:

  1. Executive decree. A senior leader picks the metric based on their mental model of the business. This can be brilliant (Jeff Bezos choosing customer satisfaction metrics) or disastrous (Wells Fargo choosing cross-sell ratio).

  2. Bottom-up emergence. Individual teams pick their own metrics, which are then aggregated. This tends to produce locally rational, globally incoherent metric systems where different teams optimize for conflicting things.

  3. Negotiated consensus. Cross-functional teams agree on shared metrics. This tends to produce watered-down metrics that nobody objects to and nobody is excited about.

None of these is perfect. The least-bad approach I’ve seen is a combination: executive direction on the North Star (“we care about user value, and here’s how we define it”), team-level autonomy on input metrics (“your team owns activation rate; you decide how to improve it”), and shared counter-metrics that prevent any team from optimizing their input at the expense of the whole.

The hardest part of metric politics isn’t choosing the metric. It’s having the courage to act on what the metric tells you, especially when the metric says that a project someone cares about isn’t working. Metrics are easy to love when they validate your decisions. They’re much harder to love when they challenge them. The real test of a data-driven organization isn’t whether it collects data. It’s whether it acts on data that’s uncomfortable.

A specific case study: the retention metric trap

Let me walk through a failure I’ve seen repeated across multiple companies because it illustrates several of these problems at once.

A SaaS product team chose “30-day retention” as their primary metric. Reasonable choice. They noticed that users who completed onboarding in the first session retained at 3x the rate of users who didn’t. Obvious intervention: make onboarding completion the focus.

The team redesigned onboarding to be shorter and more guided. Onboarding completion rate jumped from 40% to 70%. The team celebrated.

30-day retention didn’t move.

What happened? The users who were completing the old onboarding (40%) were the motivated, high-intent users who would have retained regardless. The new onboarding brought the casual, low-intent users through completion, but completing onboarding didn’t make them high-intent. The onboarding wasn’t the cause of retention. Both onboarding completion and retention were effects of a third variable: user intent.

The team had confused correlation with causation, optimized for a proxy (onboarding completion), and achieved the proxy goal while missing the actual goal (retention).

The fix was painful. They went back to qualitative research, interviewed churned users, and discovered that the real retention problem wasn’t onboarding. It was a specific workflow that users needed but couldn’t figure out how to set up. Users would complete onboarding, try to do the thing they came for, fail, and leave. The fix was making that specific workflow discoverable and frictionless. Retention improved.

But notice: the data alone couldn’t have told them this. The data said “onboarding completion correlates with retention.” Only qualitative research revealed the actual causal mechanism. This is why metrics and research must work together. Metrics tell you what is happening. Research tells you why.

What I actually look at

I’ll end with the specific metrics I care about most for any product I’m working on. This isn’t universal, but it’s the starting set I reach for.

Must-track:
  - Activation rate (% of signups who reach first value moment)
  - Day 1, Day 7, Day 30 retention (cohorted)
  - Time to first value (minutes/hours from signup to "aha")
  - Core action frequency (how often users do THE THING the product is for)
  - NPS or CSAT (quarterly, with qualitative follow-up)

Counter-metrics:
  - Support ticket volume (per user, per feature)
  - Rage clicks / error encounters
  - Time on task for core workflows (going up is bad)

Leading indicators:
  - New cohort activation trends (is activation getting better or worse?)
  - Feature adoption curves for recent releases
  - Support ticket theme clustering (new categories = emerging problems)

The specific numbers will be different for every product. But the structure is always the same: a core metric that captures user value, counter-metrics that protect against over-optimization, and leading indicators that give you advance warning.

Metrics are tools. Like all tools, they can build things or break things depending on how you use them. The companies that break things with metrics aren’t the ones that measure too little. They’re the ones that measure the wrong thing, treat the measurement as truth, and forget that behind every number is a human being doing something for a reason that the number doesn’t capture.

Measure what matters. But know that what matters is always more than what you can measure.
