← helena.cool

LPS: A metric to rule them all

There is a question that follows me everywhere. It came up at every Learning Design event I went to. It pops in my LinkedIn feed from time to time. It is the kind of question that sounds simple until you actually try to answer it, and then it isn't simple at all.

How do you know if someone is actually learning?

I have been asking myself this for the better part of ten years, across roles as a Learning Designer and then as a Product Manager in edtech. And right now, sitting between jobs with a vocabulary review tool I built myself and some free time to kill, I decided to tackle it.

The type of product you use to assess learning success and motivation matters here. Svenska trainer focuses on lexical items, so flashcard apps like Anki or Quizlet can focus on recall accuracy and perhaps add a layer of confidence (this is what svenska trainer does), but an app like Busuu or Duolingo will need a more complex framework since it assesses all language skills. And lastly in apps like Speak focusing on speaking skills, LPS assessment gets even more nuanced i.e. a speech recognition exercise might say you're wrong but a native speaker still understands what you say. So, how do you measure motivation and language proficiency in a way that is honest? I used to think completion rates were the key. Add some tests, a few engagement signals, et voilà! Your Learning Proficiency Score is ready. But oh no. Far from reality.

I gathered old notes, read extensively, and set my Learning Designer brain on fire. What came out is the framework I want to share today. It is a first draft, not a finished product. But I think it is a more honest attempt than most of what I have seen, and I hope it is useful to others wrestling with the same problem.

Why the usual suspects fall short

Let me be specific about what's wrong with the metrics most of us default to.

Lesson completion is corrupted by content design. A short, easy lesson will have higher completion than a long, difficult one, but the difficult lesson might be doing far more learning work. If you optimise for completion, you end up optimising for easy. That's the opposite of what you want.

Retention and conversion are real business metrics and they matter enormously, but they're lagging indicators that sit too many steps away from any individual learning moment. They're influenced by pricing, marketing, onboarding, competition, seasonality. Attributing a retention shift to a curriculum change is genuinely hard. Using them as a proxy for learning quality is even harder. Story of my edtech PM life…

Streaks are probably the most psychologically interesting failure mode. They work briefly, and for some learner types, but they're extrinsically motivated by design. The goal becomes keeping the streak, not making progress. And when learners break one, the psychological cost often triggers churn/ abandonment entirely.

The deeper problem with all of these metrics is that they measure behaviour around learning rather than learning itself. What we actually want to know is: did something new and durable enter this person's head? Did the app cause it? Do they know it happened?

A different starting point: productive struggle

Learning science has a useful concept here. Researchers in educational psychology, particularly in the work around desirable difficulties by Robert Bjork at UCLA, most accessibly laid out in Bjork & Bjork (2011) and updated in their 2020 paper in the Journal of Applied Research in Memory and Cognition, have long established that the conditions that produce the feeling of learning and the conditions that produce actual learning are often in tension. Easy, fluent practice feels good. Effortful, imperfect practice sticks.

This gives us a more useful frame than "did they complete the lesson?" The frame is: was the learner productively struggling? Not so easy they were sailing through the lesson, not so hard they were lost, but genuinely challenged and succeeding with effort.

This reframe has a practical implication. Instead of measuring outputs (completion, time), we should be measuring the quality of the learning moment, which means tracking both what the learner actually did and how they felt about it. That's the foundation of this framework I worked out.

The two-layer structure

The framework I've been developing has two layers that operate in parallel across every skill type.

The objective layer captures demonstrated performance: what did the learner actually get right or wrong, on items of what difficulty, over what time period? This is the hard evidence of learning.

The subjective layer captures felt progress: did the learner know they knew something? Did they predict their own performance accurately? Do they feel more capable than they did before? This is the emotional evidence of learning, and it matters more than it sounds, because felt progress is what drives intrinsic motivation and return visits.

The key insight is that both layers together give you something neither can give alone. A learner with high objective performance but declining felt confidence is in trouble, they're learning but not knowing they're learning, which is a retention risk. A learner with rising felt confidence but flat objective performance is also in trouble, they think they're learning but they're not, and the gap will eventually surface.

What you're looking for is the learner who scores well and can accurately predict when they will. The gap between prediction and actual, what I call the Learning Confidence Gap, is one of the most sensitive signals in the whole framework.

What this looks like across skill types

One of the things that makes language learning metrically interesting is that it's not one skill, it's four, each with different measurement properties. The canonical four are listening, speaking, reading and writing. Before going through each, one note on vocabulary: it is not a fifth skill so much as the connective tissue underneath all of them. Lexical knowledge feeds comprehension, production, fluency and accuracy simultaneously. In measurement terms it is also the most tractable, spaced repetition gives you a natural daily assessment loop, item by item, difficulty-weighted, with the confidence gap easy to instrument by asking the learner to predict before answering. Think of vocabulary as the baseline signal that updates every day while the four skills update more slowly.

Listening is the skill most products underinvest in metrically, probably because it is harder to assess than reading. The objective layer can draw on comprehension tasks tied to audio input: inference questions, detail recall, gist identification, scored against a rubric. What makes listening distinctive is that the confidence gap here often runs in the direction of overconfidence at lower levels: beginners tend to believe they understood more than they did, because they caught enough words to construct a plausible meaning. Tracking the delta between "I understood this" and what the comprehension task reveals is a sensitive early signal. Signal is weekly, tied to listening tasks rather than passive audio exposure, which tells you almost nothing.
Speaking is where I'd argue the subjective layer matters more than the objective one. Automatic speech recognition can give you phoneme accuracy data, but learners distrust it, and they are not entirely wrong to. A machine can tell you your /r/ is off; it cannot tell you whether a native speaker would understand you in context. What actually produces the motivational payoff is the moment a learner hears their own recording and thinks: that sounded okay. The most powerful instrument here is longitudinal self-comparison, letting learners hear where they were four weeks ago versus where they are now. That is more convincing than any ASR score. Beyond pronunciation, speaking also has a fluency dimension: average turn length, pause frequency, self-repair rate. These are trend metrics rather than absolute targets, because there is no universal "good" turn length. What you are watching for is direction over time, normalised to each learner's own baseline. Signal is bi-weekly.
Reading sits in a more comfortable measurement space than listening or speaking, because the input is stable and the assessment is easier to design. The objective layer draws on comprehension tasks, inference, detail, vocabulary in context, scored by difficulty level and CEFR band. The confidence gap is particularly clean here: ask the learner before they read how well they expect to understand the text, then measure actual comprehension. Calibration improves as learners develop stronger metacognitive awareness of their own reading level, which is itself a learning outcome worth tracking. Signal is weekly.
Writing is the hardest skill to score objectively at scale, and honesty about that is important. The assessment unit is a performance rather than an item, which means you need a rubric: grammar accuracy, lexical range, task completion. That rubric needs to be shared with the learner before the task, not just used to score it afterwards. Why? Because rubric transparency is what makes the confidence prediction meaningful. If a learner does not know what good looks like before they write, their pre-task confidence is measuring anxiety, not skill awareness. Post-task self-evaluation anchored to the same rubric dimensions gives you the confidence gap, and a delta of more than two points in either direction, self-eval versus actual score, flags a calibration problem worth investigating. Signal is weekly.

One composite metric to rule them all: Learning Progress Score

The four skills, listening, speaking, reading and writing, combine into a single composite I've been calling the Learning Progress Score (LPS). Vocabulary underpins all four as a baseline signal rather than a standalone track, feeding into each skill's objective and subjective sub-metrics. The whole thing rolls up with suggested starting weights: listening and reading at 20% each (weekly signal, comprehension-based), speaking at 25% (bi-weekly, but carries both pronunciation and fluency dimensions), writing at 25% (weekly, richest objective scoring via rubric), and vocabulary as a daily cross-cutting signal weighted at 10% that effectively boosts whichever skill it feeds most directly in a given session.

A few things I want to be clear about here.

The weights are a starting point, not a law. A conversation-focused app should weight speaking much higher. A reading or exam-prep product might shift the balance toward reading and writing. The point is to make the weighting decision explicitly, with a hypothesis about what you are optimising for, rather than letting it happen by accident.

The thresholds for each sub-metric need to be level-relative, not absolute. A 65% recall accuracy on vocabulary means very different things at A1 and C1. At A1, it is underperformance. At C1, where items are harder and interference between similar words is greater, it is expected, and in some ways healthy. Holding every learner to the same absolute bar would generate constant false alarms at advanced levels. The same principle applies to listening comprehension: a B2 learner parsing authentic fast speech at 70% is doing well; an A2 learner at the same score on a graded text is not.

For speaking markers specifically: everything is trend-based. There is no universal baseline for turn length or pause rate. What you are tracking is each learner's improvement relative to their own starting point, not against a population average.

The business hypothesis underneath all of this

I want to be honest about why this matters beyond the intellectual satisfaction of a better metric.

The underlying hypothesis is simple: if learners actually learn, they come back. If they come back consistently, some of them eventually pay for your product. The causal chain runs from learning to motivation to retention to conversion, but the chain only works if the first link holds. If the product isn't generating real learning progress, no amount of engagement design or streak mechanics will produce durable retention. Engagement without learning is just noise with a nice UI.

The Learning Progress Score is an attempt to measure whether that first link is holding. Not perfectly, metrics never measure the thing itself, only proxies for it, but more honestly than completion rates and time-on-app.

If it works, it becomes a leading indicator for retention before retention data is available. A cohort with high LPS in their first two weeks should retain better at 60 days than a cohort with low LPS and high completion. That's a testable hypothesis, and testing it is where I'd want to start.

Where I'd go next

Knowing a learner is struggling is only the first step. The real question is what the product does about it, does it slow down, offer extra practice, change the content type, ask the learner how they're feeling? Mapping those responses to each threshold state, for each skill, is the next layer of work. I haven't solved that yet, and it might take some time and lots of thinking (maybe for the next article :))

There is also a question I anticipate from anyone building at scale: this LPS composite metric looks very pretty in theory, but is it feasible for an app with millions of users? I think it is, with the right architecture and a staged rollout.

My thinking is that almost everything in this framework is computed automatically per learner. Vocabulary recall, confidence predictions, comprehension task scoring, the confidence gap calculation, these are database operations and scoring algorithms running on each session. The LPS doesn't recompute from scratch every time either: it updates only the sub-metrics that fired in a given session, rolling up from cached values. A learner who only did vocabulary practice today gets their vocabulary sub-metric updated; everything else stays at its last computed value. This is how most large-scale personalisation systems work.

The genuine bottleneck is writing assessment, which requires either human graders or LLM-based scoring, both of which carry cost. At scale you'd handle this by gating writing assessment behind a specific mode users opt into, or by sampling rather than scoring every attempt.

More importantly: this framework is a design target, not a day-one implementation. You'd ship a simplified version first, vocabulary recall accuracy plus a single post-session confidence question, validate that it predicts retention, and then add skill tracks incrementally. And at millions of users you'd have something a small product doesn't: enough data to derive your "on track" thresholds empirically, from what learners who actually convert looked like in their first few weeks. The scale that makes the problem harder also makes the metric more robust.

If you're working on similar problems, measuring learning outcomes, designing skill-specific assessment loops, trying to close the gap between engagement metrics and actual progress, I'd genuinely love to hear how you're approaching it. The edtech community is small enough that we should probably be sharing this thinking more openly than we do. So leave a comment ;)