Ebisu v3 request for comment (RFC)
Introduction
This issue contains a proposal for a future v3 release of Ebisu, with the goal of inviting feedback to improve the design before it is released.
V2 recap
A quick recap of the current version of Ebisu: v2 models each flashcard's probability of recall as a probability distribution that is valid at a certain time in the future. (Stats nerd note: when you first learn a card, its probability of recall t hours in the future is a Beta distribution with parameters a and b.) At any given time in the future (not just t), you can call ebisu.predictRecall to get the estimated recall probability for this flashcard. Doing this for each flashcard lets you pick which flashcards have the lowest recall probability, and you present one of those to the learner. Then, you call ebisu.updateRecall with the quiz's result (binary pass/fail, binomial passes and fails, noisy pass/fail), which updates the Beta distribution in light of the quiz result.
V3's goal
The main goal of v3 is to address the issue that various folks have raised over the years but that I finally understood thanks to @cyphar in #43 (which has links to the original Reddit thread): Ebisu v2 has a fundamental flaw in how it models your memory, in that it ignores the fact that, in the real world, the act of reviewing a flashcard actually changes the underlying strength of the memory. Instead, Ebisu v2 assumed that your memory of a flashcard was a fixed but unknown probability, and updated its belief about that probability's distribution after each quiz. This made it appear to be strengthening or weakening the memory model, but when we tried to infer the ideal initial halflife from real-world flashcard histories, Ebisu v2 insisted on absurdly high estimates of initial halflife.
In practice, this means that the actual recall probabilities predicted by Ebisu v2 were extremely pessimistic, and flashcard apps that attempted to schedule reviews based on a minimum acceptable recall probability had to set those thresholds unrealistically low. I didn't realize this was a problem because flashcard apps I wrote on top of Ebisu just used its predicted recall to rank flashcards, and ignored their actual values; and since the predicted recall went up or down as I passed or failed quizzes, I didn't notice that the halflife was growing very slowly. Again, thanks to @cyphar and others who patiently explained this to me repeatedly until I finally saw the problem 🙏!
Boost
Ebisu v3 is a total rewrite. It introduces the concept of a "boost" to Ebisu, which is very similar to Anki's "ease factor": each time you review a card, Ebisu will apply a Bayesian update based on the quiz's result, but then for each successful quiz it will boost the resulting probability distribution by a flashcard-specific scalar.
For example, suppose the halflife of a flashcard was a day, you pass a quiz after two days, and Ebisu v2 would simply update the halflife to four days. The Ebisu v3 model for this flashcard would have a probability distribution not only for the halflife (one day) but also for the boost, so if the mean boost value for this flashcard was 1.5, under Ebisu v3, updateRecall would yield a model with halflife 4 * 1.5, or six days. (I'm being very hand-wavy here for purposes of illustration.)
An important part of boosting is that it is applied only for successful quizzes after a big enough delay. This should make sense: imagine you reviewed that flashcard with a one-day halflife after just ten minutes, and Ebisu v2 would update its halflife to, say, 1.01 days. We wouldn't want Ebisu v3 to boost the halflife by 1.5, leaving us with a halflife of 1.01 * 1.5 days! We'd want v3 to keep the halflife around 1.01 days, because it's unlikely that a quiz after ten minutes would significantly alter our neural memory of this fact.
Ebisu v3 therefore has the following nominal pseudocode:
```python
def updateRecallBoost(oldHl: float, elapsedTime: float, quizResult: bool,
                      boost: float, LEFT: float, RIGHT: float) -> float:
  updatedHl = updateRecallSimple(oldHl, elapsedTime, quizResult)
  if quizResult:
    b = lerp([LEFT * oldHl, RIGHT * oldHl], [1, boost], elapsedTime)
    # clamp so that 1.0 <= b <= boost
    if b < 1.0:
      b = 1.0
    elif b > boost:
      b = boost
  else:
    b = 1.0
  boostedHl = b * updatedHl
  return boostedHl


def lerp(xs: list[float], ys: list[float], x: float) -> float:
  mu = (x - xs[0]) / (xs[1] - xs[0])
  return ys[0] * (1 - mu) + ys[1] * mu
```
where `lerp([x1, x2], [y1, y2], x)` is linear interpolation of x along the line between points (x1, y1) and (x2, y2). In words:
- calculate the new no-boost Bayesian-updated halflife based on a quiz result and quiz elapsed time,
- calculate the boost factor:
  - 1.0 for quiz failures and elapsed times short relative to the old halflife,
  - the full boost factor for elapsed times that are long relative to the old halflife, and
  - between 1.0 and the full boost for intermediate elapsed times, linearly interpolating,
- multiply the two values to get your final new halflife.
LEFT and RIGHT are constants that you pick to control what counts as "short" vs "long" relative to the old halflife. In practice, I've been experimenting with LEFT=0.3 and RIGHT=1.0 and these work well.
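To make the interpolation concrete, here's a tiny worked example using the lerp from the pseudocode above, with made-up numbers: a card with a 24-hour halflife, a mean boost of 1.5, and LEFT=0.3, RIGHT=1.0, so the ramp runs from 7.2 to 24 hours of elapsed time.

```python
# made-up values: oldHl = 24 hours, boost = 1.5, LEFT = 0.3, RIGHT = 1.0
lerp([0.3 * 24, 1.0 * 24], [1, 1.5], 15.6)  # -> 1.25, halfway up the ramp
# elapsedTime <= 7.2 hours clamps to b = 1.0 (no boost);
# elapsedTime >= 24 hours clamps to b = 1.5 (the full boost)
```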
This RFC
With two things now to keep track of for each flashcard (its halflife and its boost), the update step is more complicated in v3 than in v2: because of the computational burden, it is broken into two functions that app authors will need to call. However, in v3 the predict step (predictRecall) is a lot faster than in v2 and in fact can be done in SQL. Hopefully the more complex mathematics and API are worth it for more realistic estimates.
In the remainder of this issue, I'll
- briefly sketch out the new probability model (since GitHub doesn't render LaTeX, there will be only very basic equations), but more importantly,
- specify the functions in the API and how they work in considerable detail.
My main goal is to get feedback on the second point above from app authors who use Ebisu. Does the proposed API feel good? Are there flaws in it? Are there other nice-to-haves that I haven't thought of? Do you want things renamed?
V3 statistical model
This section can be skipped by folks who don't care about the math. For those that do, this section will only be a sketch because I can't use LaTeX math in a GitHub issue. The details will be spelled out on the main Ebisu website when v3 is released, but I'm of course happy to answer questions about this now during the RFC phase as well.
Before getting to v3's statistical model, recall that v2's statistical model is, for each flashcard:
(recall probability in t hours in the future) ~ Beta(a, b),
that is, a Beta random variable parameterized by a and b.
In v3, we instead go a level higher and place the prior probability distribution on the halflife directly:
halflife ~ Gamma(a, b).
We also place a probability distribution on boost:
boost ~ Gamma(aBoost, bBoost).
That is, we have two Gamma-distributed random variables.
Predict
The predicted recall probability at time t hours in the future is a very simple arithmetic expression:
logPredictRecall ∝ -t / E[halflife]
predictRecall = exp(logPredictRecall)
The symbol ∝ means "proportional to", so there are some extra factors I'm omitting here that aren't important to the explanation but are needed to line up the units of `t` and `halflife`, and to convert `exp`'s base to 2 (so that `predictRecall` for `t = halflife` is `0.5 = 2**-1` instead of `exp(-1)`).
Note that the predict-recall step is really simple. To find the flashcard with the lowest predicted recall, find the flashcard with the lowest -t / halflife, which can be done in SQL, cached, etc.
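As an illustration (not the actual API), that ranking in Python might look like this, where lastReviewTime and meanHalflife are hypothetical attributes standing in for whatever your app stores:

```python
def lowestRecallCard(cards, now):
  # log-recall ∝ -elapsed / halflife: minimizing this picks the card most
  # at risk of being forgotten, with no need to exponentiate just to rank
  return min(cards, key=lambda c: -(now - c.lastReviewTime) / c.meanHalflife)
```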
In the README for Ebisu v2, I invoke Jensen's inequality to explain why this is technically inaccurate: because `E[f(x)] ≠ f(E[x])`, the exact expected probability of recall `E[2**(-t / halflife)]` will be different from the above, which is `2**(-t / E[halflife])`. However, I've decided for v3 that this inexactness is a good exchange for the computational simplicity and boost-based modeling.
Update step 1: update halflife
When we have a quiz result, after time t has elapsed since the last review, we can apply a Bayesian update on halflife:
- original prior: halflife ~ Gamma(a, b)
- exact posterior: halflife | quiz after t hours ~ SomeComplicatedDistribution (we have moments of this, not the exact density)
- approximated posterior: halflife | quiz after t hours ~ Gamma(a2, b2)
So far this is the exact same architecture as Ebisu v2, except with prior on halflife instead of recall probability (and therefore different integrals).
However, at this stage we apply the boost:
- given a mean posterior halflife = `E[halflife | quiz after t hours]`,
- compute a deterministic value `boostValue = clampLerp([LEFT * E[prior halflife], RIGHT * E[prior halflife]], [1, E[boost]], t)`, where `clampLerp` is the linear interpolation clamped so `1 <= boostValue <= E[boost]`, before finally
- scaling the posterior to yield the final boosted halflife: `boostValue * posterior ~ Gamma(a2, boostValue * b2)`.
That is, we've updated our probability distribution on the halflife in response to a quiz and applied a boost (a little bit of boost for small t, the full boost value for large t).
In this way, we're really maintaining a probability distribution on the initial halflife (before the first quiz), and after each successive quiz we update our belief about that initial halflife. The halflife after several quizzes is a random variable that's fully specified by the initial halflife random variable after it's been scaled by a sequence of these clampLerped boost values.
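As a quick sketch of that scaling property, assuming the shape/scale parameterization of the Gamma (which matches the Gamma(a2, boostValue * b2) notation above; under a rate parameterization you'd divide the rate instead):

```python
from scipy.stats import gamma

def boostedPosterior(a2: float, b2: float, boostValue: float):
  # scaling a Gamma(shape=a2, scale=b2) random variable by the constant
  # boostValue yields Gamma(shape=a2, scale=boostValue * b2)
  return gamma(a=a2, scale=boostValue * b2)
```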
However, this leaves open the question: how do we estimate boost? How do we update our probability distribution from boost ~ Gamma(aBoost, bBoost)?
Update step 2: update halflife and boost
This section is the primary innovation of Ebisu v3 over v2, mathematically. After we have three or more quizzes (though you could do this after just two quizzes), we can update our probability distributions about both (a) the initial halflife and (b) the boost.
To do this, we use two simple techniques in sequence.
First, curve fit. We evaluate the posterior (initial halflife, boost | several quizzes) on a grid of values in the initial halflife × boost plane and curve-fit this two-dimensional posterior surface to two independent Gamma random variables. This is readily doable because we assume each quiz is independent, so the overall posterior is simply the product of the individual likelihoods and priors after each quiz.
We curve fit because, while we have a simple expression for the posterior (up to a scalar constant), we can't analytically evaluate this bivariate distribution's moments. (In Ebisu v2, we had a univariate posterior on recall probability with thankfully tractable integrals yielding analytical moments.)
The curve fit is just weighted least-squares, i.e., a very fast solution to an overdetermined system of equations. It is essentially solving A x = b for a tall and skinny matrix A, and where the unknowns x are the four parameters of the two Gamma random variables (initial halflife and boost).
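Here's a minimal sketch of such a fit (illustrative, not Ebisu v3's actual internals). The unnormalized log-density of two independent Gammas, (aH - 1) * log(h) - rateH * h + (aB - 1) * log(b) - rateB * b + constant, is linear in the unknowns, so the fit is one weighted lstsq call:

```python
import numpy as np

def fitBivariateGamma(hGrid, bGrid, logPosterior, weights):
  # logPosterior and weights are 2D arrays evaluated on the same grid
  H, B = np.meshgrid(hGrid, bGrid, indexing='ij')
  A = np.stack([np.log(H.ravel()), -H.ravel(),
                np.log(B.ravel()), -B.ravel(),
                np.ones(H.size)], axis=1)
  w = weights.ravel()
  sol, *_ = np.linalg.lstsq(A * w[:, None], logPosterior.ravel() * w, rcond=None)
  # unknowns: [aH - 1, rateH, aB - 1, rateB, constant]
  return (sol[0] + 1, sol[1]), (sol[2] + 1, sol[3])  # two (shape, rate) pairs
```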
However, the resulting curve fit often fails to properly capture the behavior of the posterior in its tails, so badly that prediction performance drops if we just use these two Gammas as our updated probability distributions.
Therefore, second, we use importance sampling, a Monte Carlo method for obtaining moments of a probability distribution whose density we can evaluate analytically (up to a constant), given samples from some other distribution (the latter is called the "proposal distribution").
Specifically, we use importance sampling to get accurate estimates of the moments of the bivariate posterior, given samples from the curve-fit bivariate Gamma distribution we estimated. With relatively few samples (on the order of ~1000), we can get good estimates of the posterior's moments and moment-match these to two independent Gammas, on initial halflife and boost respectively.
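In code, that moment estimation might look like the following sketch (self-normalized importance sampling; the names are illustrative, and logPosterior is assumed vectorized and known only up to an additive constant):

```python
import numpy as np
from scipy.stats import gamma

def posteriorMoments(logPosterior, aH, rateH, aB, rateB, size=1000):
  # propose from the curve-fit Gammas on initial halflife and boost
  hs = gamma.rvs(aH, scale=1 / rateH, size=size)
  bs = gamma.rvs(aB, scale=1 / rateB, size=size)
  logw = (logPosterior(hs, bs) - gamma.logpdf(hs, aH, scale=1 / rateH) -
          gamma.logpdf(bs, aB, scale=1 / rateB))
  w = np.exp(logw - logw.max())  # subtract max for numerical stability
  w /= w.sum()  # self-normalizing cancels the unknown constant
  meanH, meanB = w @ hs, w @ bs
  return (meanH, w @ (hs - meanH)**2), (meanB, w @ (bs - meanB)**2)
```

Each (mean, variance) pair can then be moment-matched to a Gamma.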
Therefore, this update step allows us to update our estimate of the boost factor in a data-driven way, unlike Anki's ease factor, which is hardcoded. Given a sequence of quiz results, we can answer the questions, "What was the halflife of this fact when I began learning it? What boost factor has been strengthening it each time I review it?"
N.B. We need both these steps, and in this order. The curve fit alone doesn't give us accurate enough parameter estimates but it gives a great proposal distribution for use with importance sampling. Without that, importance sampling using the original priors on initial halflife and boost would need many more Monte Carlo samples and would not be computationally tractable. (I tried 🥲.)
N.B.2. While I've described this two-stage bivariate update step as relatively efficient (linear system solver, importance sampling with ~thousands of samples), it is much more expensive than the plain univariate Bayesian update described above, which used the mean boost `E[boost]` everywhere. Therefore, the API breaks these up into two separate phases, with two `update` functions in the API: you can run the simpler univariate halflife-only update after each quiz and then, maybe once a day or once a week, run the heavier-weight bivariate halflife-and-boost updater to refresh your estimate of `boost` for each flashcard.
We turn to the API section next. Again, the above mathematics has been terse because I only have ASCII to describe it and I'm hoping primarily to get feedback on the API, but I hope it is sufficiently detailed to give a flavor of the computational burdens involved.
Tutorial on importance sampling
Here's a super quick script to show you the value of importance sampling, plus how it's used.
Suppose you want to estimate the mean of the square of the unit uniform distribution, i.e., if u ~ Unif(0, 1), what is E[u**2]?, and you can't do calculus so you want to do Monte Carlo. Super-easy:
```python
import numpy as np
from scipy.stats import norm as normrv
from scipy.stats import uniform as uniformrv

nsamples = 1_000_000

u = np.random.rand(nsamples)
direct = np.mean(u**2)
directvar = np.var(u**2)
```
You generate a million samples uniformly between 0 and 1, square them, and call np.mean to get your estimate of the mean. Then you can call np.var to get the variance, i.e., accuracy, of that estimate: the lower the variance, the more accurate your estimate, and the more efficient your sampling is (those million samples are doing a good job).
This is not importance sampling, it's just normal Monte Carlo sampling.
Suppose though that you for some reason cannot sample the uniform distribution. Maybe you only have a library to generate Normally-distributed (Gaussian) random samples, so you have randn but no rand. Could you estimate E[u**2] with just randn? Yes. Importance sampling lets you:
```python
n = np.random.randn(nsamples)
indirect = np.mean(n**2 * uniformrv.pdf(n) / normrv.pdf(n))
indirectvar = np.var(n**2 * uniformrv.pdf(n) / normrv.pdf(n))
```
Here we generate a million samples from the unit Normal distribution, n ~ Normal(0, 1), and again square them, but before calling np.mean to get our estimate, we weight each sample by its importance. That's what the crucial uniformrv.pdf(n) / normrv.pdf(n) factor is doing: it gives more weight to samples that are likely to have come from our target distribution (Unif(0, 1)) and less weight to samples that aren't.
This will actually work: your indirect estimate of E[u**2] will be correct but with (much) higher variance than when you could sample from the Uniform distribution directly.
But we know the unit Normal distribution is an obviously bad proposal distribution for the unit Uniform distribution—over half the samples will just be thrown away, since negative samples get an importance factor of 0. We can try one more experiment: let's use Normal(0.5, σ=0.5) as our proposal. This should be a good deal more accurate than the unit Normal:
```python
n2 = normrv.rvs(loc=0.5, scale=0.5, size=nsamples)
indirect2 = np.mean(n2**2 * uniformrv.pdf(n2) / normrv.pdf(x=n2, loc=0.5, scale=0.5))
indirect2var = np.var(n2**2 * uniformrv.pdf(n2) / normrv.pdf(x=n2, loc=0.5, scale=0.5))
```
Printing out the estimated mean and the variance (inaccuracy) of each estimate:
print("| method | estimated mean | estimator variance |")
print('|--------|----------------|--------------------|')
print(f'| direct | {direct:0.4f} | {directvar:0.4f} |')
print(f'| crappy proposal | {indirect:0.4f} | {indirectvar:0.4f} |')
print(f'| better proposal | {indirect2:0.4f} | {indirect2var:0.4f} |')yields
| method | estimated mean | estimator variance |
|---|---|---|
| direct | 0.3328 | 0.0888 |
| crappy proposal | 0.3340 | 0.6122 |
| better proposal | 0.3332 | 0.2182 |
As expected, the estimates in the middle column all agree: 1/3. However, whereas the straightforward Monte Carlo estimator had a low variance of 0.09, which is the best you're going to get given you had access to rand, the crappy importance sampling estimator using just randn (the unit Normal) as its proposal had a much higher variance, almost 7× worse. But by using Normal(0.5, σ=0.5) as the proposal, we reduce the inaccuracy of the estimator, whose variance is just ~2.5× the direct estimator's.
So, of course ideally you'd be able to do the integral to find E[u**2] (zero estimator variance 🙌!), and failing that, hopefully you can use rand to draw uniform samples and calculate the estimate directly. But failing both, importance sampling gives you a way to convert samples from some rando distribution into moment estimates of the distribution you do care about.
(Obviously the proposal distribution can't be too rando, so for example you couldn't switch the role of the Uniform and Normal above: a unit Uniform proposal totally misses out on huge chunks of the target unit Normal's support. There are rules about what makes an allowable proposal distribution, and about what's the best proposal distribution (the target distribution 🙃) that textbooks bore us with, but the rules are quite sensible.)
Hopefully the above convinces you that there's nothing magical about using importance sampling to improve our posterior fits. The curve-fit posterior is like the "better proposal" above. The "crappy proposal" for importance sampling might be the original priors on halflife and boost (except after a dozen quizzes, the variance of the estimates would be enormous without billions and trillions of samples).
V3 API
Model
Ebisu v2 had a very compact data model. Each flashcard was just a 3-tuple, [float, float, float]: two numbers representing the Beta random variable's parameters and the elapsed time at which that Beta applied. It didn't keep track of past quizzes because that 3-tuple was a "sufficient statistic": as far as the math was concerned, everything about the past was encoded in that 3-tuple.
Because the Ebisu v3 algorithm described above needs to know about past quiz times and results to re-estimate boost, the Ebisu v3 data model is much bigger, and now includes past quizzes.
See Python `dataclass`es for Ebisu v3 model
While the key-value structure of the data model is probably an implementation detail and not relevant for the majority of readers of this RFC, for completeness I include it here, with the caveat that it may change.
The top-level model has three sections:
```python
from dataclasses import dataclass

@dataclass
class Model:
  quiz: Quiz
  prob: Probability
  pred: Predict
```
One section for a list of quiz results and times:
```python
@dataclass
class Quiz:
  elapseds: list[list[float]]

  # same length as `elapseds`, and each sub-list has the same length
  results: list[list[Result]]

  # 0 < x <= 1 (reinforcement). Same length/sub-lengths as `elapseds`
  startStrengths: list[list[float]]
```
A second section for the parameters of the probability distributions:
```python
@dataclass
class Probability:
  # priors: fixed at model creation time
  initHlPrior: tuple[float, float]  # alpha and beta
  boostPrior: tuple[float, float]  # alpha and beta

  # posteriors: these change after quizzes
  initHl: tuple[float, float]  # alpha and beta
  boost: tuple[float, float]  # alpha and beta
```
And a final section for computed values that are useful for predictRecall (whether that's done in Python or in SQL or whatever):
```python
@dataclass
class Predict:
  # just for developer ease, these can be stored in SQL, etc.
  # log-recall is proportional to `logStrength - (startTime * CONSTANT) / halflife`
  # where `CONSTANT` converts `startTime` to same units as `halflife`.
  startTime: float  # unix epoch
  currentHalflife: float  # mean (so _currentHalflifePrior works). Same units as `elapseds`
  logStrength: float
```
For logStrength and startStrengths, see the discussion below on reinforcement strength.
Initialize
```python
def initModel(initHlPrior: Union[tuple[float, float], None] = None,
              boostPrior: Union[tuple[float, float], None] = None,
              initHlMean: Union[float, None] = None,
              initHlStd: Union[float, None] = None,
              boostMean: Union[float, None] = None,
              boostStd: Union[float, None] = None) -> Model:
  pass
```
You're expected to provide either initHlPrior (a 2-tuple [α, β] of the Gamma random variable) or both initHlMean and initHlStd, the mean and standard deviation of the Gamma random variable representing the initial halflife.
Similarly for boost.
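For example, with made-up numbers and hours as the time unit:

```python
# a card you guess has a 24 ± 12 hour initial halflife and a boost of 1.5 ± 0.5
model = initModel(initHlMean=24.0, initHlStd=12.0, boostMean=1.5, boostStd=0.5)
```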
Predict
There are actually two functions here:
```python
def predictRecall(model: Model, elapsedHours=None, logDomain=True) -> float:
  pass


def _predictRecallBayesian(model: Model, elapsedHours=None, logDomain=True) -> float:
  pass
```
The official predictRecall returns 2**(-t / (mean halflife)), which is an approximation of the recall probability at time t in the future. It's mathematically inaccurate but very fast and convenient: find the mean of the halflife, then put it through Ebbinghaus' exponential forgetting function. I expect this function to be portable to SQL, etc., because it's just a fast algebraic expression (addition, multiplication, division).
For fun, I plan to include _predictRecallBayesian which computes the exact expected recall probability (like Ebisu v2 does). The leading _ means I don't expect this function to be in ports to other languages. This function computes an expensive Bessel function of the second kind, the analytical solution to the integral that yields the exact recall probability.
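A hypothetical usage sketch, picking the next card to quiz from a list of models (ranking works the same in the log domain, so there's no need to exponentiate):

```python
nextToQuiz = min(models, key=lambda m: predictRecall(m, logDomain=True))
```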
Update halflife
I've decided to reuse the updateRecall name in Ebisu v3 for the "quick" update step that assumes the boost is fixed and just applies a quiz result:
```python
def updateRecall(
    model: Model,
    elapsed: float,
    successes: Union[float, int],
    total: int = 1,
    now: Union[None, float] = None,
    q0: Union[None, float] = None,
    reinforcement: float = 1.0,
    left=0.3,
    right=1.0,
) -> Model:
  pass
```
Note that, like Ebisu v2, the above function supports noisy quizzes and binomial/binary quizzes. Also note two new v3-only parameters: left and right, which control how much boost to apply. Nominally, if elapsed < left * (current halflife), no boost is applied. If elapsed > right * (current halflife), the full boost is applied (whatever it may be; recall it has a probability distribution that you initialized or that was computed from data). For values of elapsed in between, the boost is scaled linearly.
This function needs to be called whenever the student completes a quiz. But as alluded to above, it only updates the halflife and takes the boost as fixed.
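For example, with hypothetical values (hours as the unit of elapsed time):

```python
# the student passed a binary quiz 36 hours after the last review
model = updateRecall(model, elapsed=36.0, successes=1, total=1)
# or: passed 2 of 3 binomial attempts
model = updateRecall(model, elapsed=36.0, successes=2, total=3)
```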
For more on what `reinforcement` means, see the discussion below on reinforcement strength.
Update halflife and boost
The "non-quick" update function will update its belief about both the halflife and the boost.
```python
def updateRecallHistory(
    model: Model,
    left=0.3,
    right=1.0,
    size=10_000,
) -> Model:
  pass
```
If we had unlimited computing power, we'd run updateRecall and then immediately run updateRecallHistory so we always had the most accurate estimates. However, updateRecallHistory is an expensive function, much more so than the old v2 updateRecall, because it loops over all quizzes. Specifically, as described in the math section above, it's
- evaluating probabilities on a two-dimensional grid of halflife and boost for each quiz, then
- solving a linear system of equations, before finally
- an importance sampling step (the number of Monte Carlo samples is governed by the `size` argument).
All this should be totally feasible on mobile/browser, but I do expect apps will opt to run this as a batch function, for example,
- run this daily for flashcards that were reviewed today, or
- run this every 2–4 quizzes.
If you initialized the boost prior reasonably, there shouldn't be a pressing need to rerun this function after each quiz.
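A hypothetical nightly batch job might look like:

```python
# refresh halflife-and-boost estimates for cards quizzed today
for card in cardsReviewedToday:  # however your app tracks these
  card.model = updateRecallHistory(card.model, size=10_000)
```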
In the future, some smart person might enlighten me how we can make this estimation step better (or how we can change the entire model to have a better estimation step), but for now, I feel that this is a reasonable balance between model accuracy, prediction speed, and update complexity.
Reset halflife
Ebisu v2 supports manually rescaling the halflife (rescaleHalflife in v2) of a model, for those situations where you just know the model has over- or underestimated your memory. We'd like to support this in v3 as well.
In Ebisu v3, since a flashcard's data model has all past quiz results, you may think that it'd be straightforward to just tweak the initial halflife's priors and re-run updateRecallHistory, but after you get several quizzes, the initial priors are often relatively powerless, with the posteriors almost entirely driven by the data (the quizzes). This is a well-known outcome in Bayesian statistics.
Therefore, the proposed function is called "reset" because it will literally start a new chapter in your history with this flashcard with a new prior on halflife:
```python
def resetHalflife(
    model: Model,
    initHlMean: float,
    initHlStd: float,
    startTime: Union[float, None] = None,
    strength: float = 1.0,
) -> Model:
  pass
```
The model will still contain all past quizzes, but this function sets those aside and reinitializes a new list of elapsed times and quizzes, so all future quizzes get the initial halflife you ask for here.
This way, if after several quizzes, you need to rescale/reset the halflife of a flashcard, this lets you do that. No hard feelings.
Ideally someday we'll be able to analyze each flashcard's history of quizzes and automatically detect when there's a "new halflife", i.e., when your memory for this fact got much better or much worse. Sort of like the example in Bayesian methods for hackers chapter 1, "Inferring behaviour from text-message data".
That's it. That's the API.
Reinforcement strength
Throughout the discussion above, I've alluded to startStrength and logStrength and reinforcement. This is to support partial reconsolidation or reinforcement, raised in #51 by @jasonsparc. The idea is, Ebbinghaus' exponential decay assumes that, after each review, your memory of some fact jumps to 100%: probability of recall 2**(-t / halflife) = 1.0 when t=0 (right after you've reviewed this fact, probability 1 = 100% recall probability). @jasonsparc's idea is: what if this isn't true? What if you know that the quiz only partially reconsolidated the memory, i.e., that at t=0.0001 (a millisecond after your last quiz), the probability of recall is not .9999 but something much less, maybe 0.5, maybe 0.1?
Ebisu v3 tentatively supports this by allowing you to specify how much reconsolidation you think has happened every time you call updateRecall. This number is taken as deterministic and fixed, just like the quiz result or elapsed time, and it flows through the update process as if it was known. We'll want to experiment with this to make sure it's working. For users who aren't interested in playing with this feature, the API defaults to sane values.
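Tentatively, that might look like this (the 0.5 is a made-up value):

```python
# a passive review you believe only half-reconsolidated the memory
model = updateRecall(model, elapsed=24.0, successes=1, reinforcement=0.5)
```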
Conclusion
The code and tests are written. You can find them (along with a ton of irrelevant code I wrote while getting to this point) at https://github.com/fasiha/ebisu-likelihood-analysis/, specifically ebisu3.py and test_ebisu3.py.
I've prepared a script that can
- load flashcard review history from Anki's `collection.anki2` files (if you export your Anki deck to an `apkg` file and unzip that, this `collection.anki2` file, which is a SQLite database file, will be inside), and
- run a couple of cards through both Ebisu v3 and a Stan model.
The Stan model will likely mostly be useful for those interested in changing the model or putting hyperpriors on left and right and other parameters, or verifying Ebisu v3's correctness (there's a difference in the output between these two that I'm working on tracking down).
But mainly I'm hoping that interested parties provide feedback on the API above. And of course any other feedback (the mathematical model for example) is also most welcome.
Many thanks for your patience with v3. This is several months late and I am grateful for the support and feedback I've received so far 🙇.