Token Leaderboards Alone Are a Terrible Way to Measure AI Adoption

If your top “AI power user” burned a billion tokens last month, do you actually know what your business got back?

An executive desk where a monitor shows a soaring token-usage chart beside a near-empty business-impact gauge.

In April, the story everyone in tech was passing around was Meta’s internal token leaderboard. An employee built a dashboard nicknamed “Claudeonomics” that ranked Meta’s 85,000+ employees by how many AI tokens they consumed. It had badges. It had titles like “Token Legend” and “Cache Wizard.” In a single 30-day window, employees collectively burned through more than 60 trillion tokens, and the top-ranked individual averaged 281 billion tokens. Priced at standard rates, that one person’s usage could have cost the company more than $1.4 million.

Two days after the story broke, the leaderboard was gone.

281B tokens consumed in a single month by Meta’s top-ranked individual, roughly $1.4M at standard rates.
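As a quick sanity check, those two figures imply a blended price of about $5 per million tokens. The sketch below only divides the reported numbers; the variable names are illustrative, and real pricing varies by model and by whether tokens are input, output, or cached.

```python
# Figures reported for Meta's top-ranked individual.
tokens_consumed = 281e9       # tokens in the 30-day window
estimated_cost_usd = 1.4e6    # reported lower bound at standard rates

# Implied blended price per million tokens.
implied_rate_per_million = estimated_cost_usd / (tokens_consumed / 1e6)

print(f"${implied_rate_per_million:.2f} per million tokens")  # $4.98 per million tokens
```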

Meta isn’t alone. OpenAI and Microsoft run internal token leaderboards. Amazon, JPMorgan, and Disney have reportedly deployed AI usage leaderboards to push adoption. In one case a Disney employee interacted with Claude 460,000 times in nine days. The trend even has a name now: tokenmaxxing. And Nvidia’s Jensen Huang has floated the idea that every engineer will eventually have an “annual token budget.”

I get why leaders love this metric. It’s objective. It’s automated. It scales across thousands of employees and fits neatly on an executive dashboard. When you’ve spent millions on AI tooling, “look how much we’re using it” feels like proof.

But usage is not adoption. And it’s definitely not value.

What a token leaderboard actually measures

A token is the unit an AI model processes: a chunk of text the model reads or writes. It is not exactly a word. A token might be a whole short word like “cat,” part of a longer word like “un” or “believable,” a piece of punctuation, some whitespace, or a fragment of code, JSON, or Markdown. When you paste a long document, transcript, codebase, or chat history into an AI model, you are spending tokens. Those tokens come in two kinds.

Every token is one of two kinds

Input tokens: what you send to the model. Your prompt, pasted documents, chat history, tool results.

Output tokens: what the model generates back. Answers, summaries, code, plans, analysis.
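Most providers meter and price the two kinds separately, so the same total token count can cost very different amounts. Here is a minimal sketch of that arithmetic. The `Pricing` class, the `request_cost` function, and the prices are all hypothetical, not any vendor’s actual rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (not real vendor rates)."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one request: each kind of token is billed at its own rate."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


example_pricing = Pricing(input_per_million=3.0, output_per_million=15.0)

# Pasting a long document (50,000 input tokens) to get a short summary
# back (500 output tokens).
cost = request_cost(50_000, 500, example_pricing)
print(f"${cost:.4f}")  # $0.1575
```

In this example almost all of the spend comes from what was pasted in, not from what the model wrote back.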

Counting tokens tells you the engine is running. It tells you nothing about whether any freight is being delivered. As The Decoder put it in its coverage of the Meta leaderboard, measuring token consumption is like judging a truck driver by how much gas they burn.

The unintended consequences show up fast, and they’re predictable to anyone who has spent time in change management. At Meta, employees reportedly left AI agents running for hours doing nothing useful, just to climb the rankings. Call it what it is: gaming the metric. And the pattern goes deeper than that:

It inflates your costs while hiding your waste. Leaderboards incentivize people to throw the most expensive frontier models at trivial tasks. Qodo’s CEO compared it to tracking miles walked while ignoring calories consumed.

It punishes your most skilled users. Here’s the part almost nobody talks about: using tokens efficiently is itself a skill. A practitioner who writes a tight, well-scoped prompt with the right context gets a better answer in 2,000 tokens than a novice gets in 200,000 tokens of flailing back-and-forth. Knowing which model to use for which task, how to structure context so the AI doesn’t have to guess, and when to start a fresh session instead of dragging a bloated conversation forward are exactly the behaviors you want to spread through your organization. A token leaderboard ranks those people at the bottom.

It breeds quiet resentment. Tie visible rankings to performance perception (and Meta has now tied reviews to “AI-driven impact”), and you’ve created theater, not transformation. People perform usage instead of rethinking their work.

95% of enterprise GenAI pilots produce zero measurable P&L impact (MIT Project NANDA).

This matters because the stakes are real. MIT’s Project NANDA found that 95% of enterprise GenAI pilots produce zero measurable P&L impact. Those organizations weren’t failing to use AI. They were failing to connect usage to outcomes. A token leaderboard is the purest possible expression of that failure: maximum visibility into activity, zero visibility into value.

Figure: The AI adoption gap, in four numbers. Panels: GenAI pilots with zero measurable P&L impact (MIT); AI pilots that never reach production (IDC); AI projects failing to deliver intended value (RAND); organizations with systemic AI governance maturity (IBM). Sources: MIT NANDA 2025; IDC 2024; RAND 2024; IBM IBV.

Measure the pyramid, not the meter

To be clear: you should measure usage. Who’s logging in, who’s active weekly, where consumption is concentrated: that’s your foundation. It tells you whether anyone showed up. But it’s the floor, not the ceiling. At Emerging Learning Solutions we measure adoption with a three-tier framework we call the Adoption Impact Model™.

The Adoption Impact Model™

Most organizations stop measuring at the bottom.

LEVEL 3 · IMPACT: What the business got back. Hours saved, quality, revenue, ROI.

LEVEL 2 · BEHAVIORS: What people actually did. Was it good? Was it intended, on-task, safe?

LEVEL 1 · USAGE: Who logged in. Tokens, seats, sessions. Necessary, but the weakest signal.

The value of the signal increases from Level 1 to Level 3.

Level 1, Usage. Did people show up? Active users, frequency, tokens, seats. Necessary, cheap to collect, and the weakest signal in the building. This is where token leaderboards live, and where most measurement programs stop.

Level 2, Behaviors. What did people actually do? This is the level that predicts everything else. Was the work good? Was it intended and on-task? Are people using AI for the workflows you redesigned, or for writing birthday limericks? Are they verifying outputs before they ship? Are the efficient, skillful patterns spreading from your best users to everyone else? Level 2 is harder to measure. It takes work sampling, output review, manager observation, and honest conversation, which is exactly why most organizations skip it.

Level 3, Impact. What did the business get back? Hours saved and where they were reinvested. Cycle time on the monthly close. Error rates. Customer response time. Revenue per head. ROI against the tooling spend. This is the level your CFO actually cares about, and you can only get here if Level 2 is healthy, because impact comes from changed behavior, not from consumption.
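One way to make the three levels concrete is to record them side by side for each team, so a usage number never travels without its behavior and impact context. The sketch below is an illustration only, not the Adoption Impact Model’s actual instrumentation. Every field name is hypothetical, and valuing saved hours at a flat hourly rate is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamSnapshot:
    """One team's measurements for one period. All field names are hypothetical."""
    # Level 1, usage
    active_users: int
    tokens_consumed: int
    # Level 2, behaviors (from work sampling and output review)
    outputs_sampled: int
    outputs_on_task: int
    outputs_verified: int
    # Level 3, impact
    hours_saved: float
    hourly_rate_usd: float
    tooling_spend_usd: float

    def on_task_rate(self) -> float:
        """Share of sampled outputs that were for intended work."""
        return self.outputs_on_task / self.outputs_sampled if self.outputs_sampled else 0.0

    def verified_rate(self) -> float:
        """Share of sampled outputs that were checked before shipping."""
        return self.outputs_verified / self.outputs_sampled if self.outputs_sampled else 0.0

    def roi(self) -> float:
        """(value of hours saved - tooling spend) / tooling spend."""
        if self.tooling_spend_usd <= 0:
            raise ValueError("tooling spend must be positive")
        value = self.hours_saved * self.hourly_rate_usd
        return (value - self.tooling_spend_usd) / self.tooling_spend_usd


# Made-up numbers, for illustration only.
snapshot = TeamSnapshot(
    active_users=40, tokens_consumed=120_000_000,
    outputs_sampled=50, outputs_on_task=45, outputs_verified=40,
    hours_saved=400.0, hourly_rate_usd=75.0, tooling_spend_usd=12_000.0,
)
print(f"on-task {snapshot.on_task_rate():.0%}, "
      f"verified {snapshot.verified_rate():.0%}, ROI {snapshot.roi():.1f}x")
```

Note that `tokens_consumed` appears in the record but in none of the calculations: it is context, not a result.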

The leaderboard era is measuring Level 1 and declaring victory. It’s the equivalent of evaluating a training program by counting who walked into the classroom.

Effective token use is a skill, so treat it like one

One more reframe before the takeaway. If tokens are the new unit of work, then token efficiency is the new literacy. The best practitioners in your organization already do four things that never show up on a leaderboard:

1. Scope the task before opening the chat window. The best practitioners decide what they actually need before they start, so the model isn’t doing expensive discovery at frontier prices. A tight, well-defined ask beats a sprawling conversation every time.

2. Give the model curated context, not everything. Hand it the right document and the right examples instead of pasting your entire drive and hoping. You would not give a colleague 200 pages and say “help?”. You would point them to the few pages that matter.

3. Match the model to the task. Save frontier models for frontier problems. Most day-to-day work runs perfectly well, and far cheaper, on a smaller model. Knowing which to reach for is a skill in itself.

4. Restart clean when a conversation degrades. Recognize when a thread has gone sideways and start fresh rather than burning tokens fighting a bloated, confused context. Dragging a broken conversation forward is the most common way novices waste tokens.

None of that shows up on a leaderboard. All of it shows up in output quality and cost per outcome. If you want a competition, rank teams on outcomes per token, not tokens. Though honestly, what works better than competition is teaching these behaviors deliberately and measuring whether they spread.
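To see how different the two rankings can look, here is a minimal sketch with made-up numbers. The team names, token counts, and the `outcomes` field (think: accepted deliverables, however your organization defines them) are all hypothetical.

```python
from typing import NamedTuple


class TeamStats(NamedTuple):
    name: str
    tokens: int     # tokens consumed in the period
    outcomes: int   # hypothetical count of accepted deliverables


def outcomes_per_million_tokens(team: TeamStats) -> float:
    """Outcomes delivered per million tokens spent."""
    return team.outcomes / (team.tokens / 1_000_000)


# Made-up data, for illustration only.
teams = [
    TeamStats("Team A", tokens=900_000_000, outcomes=120),
    TeamStats("Team B", tokens=40_000_000, outcomes=95),
    TeamStats("Team C", tokens=250_000_000, outcomes=30),
]

# What a token leaderboard shows: biggest consumers first.
by_tokens = sorted(teams, key=lambda t: t.tokens, reverse=True)

# What an outcomes-per-token ranking shows: most efficient first.
by_efficiency = sorted(teams, key=outcomes_per_million_tokens, reverse=True)

print([t.name for t in by_tokens])      # ['Team A', 'Team C', 'Team B']
print([t.name for t in by_efficiency])  # ['Team B', 'Team A', 'Team C']
```

Team B sits last on the token leaderboard and first on efficiency, which is the inversion described above.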

Where to go from here

If your AI measurement strategy today is a usage dashboard, you’re tracking consumption and calling it adoption. The fix isn’t complicated, but it does require designing for it: define the Level 2 behaviors you actually want, instrument them, and draw the line to Level 3 impact before your CFO asks you to.

The hardest part of AI isn’t the technology. It’s the transition, and you can’t manage a transition you’re only measuring at the bottom of the pyramid.