OMEN: Fast Password Guessing Using an Ordered Markov Enumerator
Markus Duermuth, Fabian Angelstorf, Claude Castelluccia, Daniele Perito,
Abdelberi Chaabane
Abstract. Passwords are widely used for user authentication, and will likely remain in use for the foreseeable future, despite several weaknesses. One important weakness is that human-generated passwords are far from random, which makes them susceptible to guessing attacks. Understanding an adversary's capabilities for guessing attacks is a fundamental necessity for estimating their impact and devising countermeasures.
This paper presents OMEN, a new Markov model-based password cracker that extends ideas proposed by Narayanan and Shmatikov (CCS 2005). The main novelty of our tool is that it generates password candidates according to their occurrence probabilities, i.e., it outputs the most likely passwords first. As shown by our extensive experiments, OMEN significantly improves guessing speed over existing proposals.
In particular, we compare the performance of OMEN with the Markov mode of John the Ripper, which implements the password indexing function by Narayanan and Shmatikov. OMEN guesses more than 40% of passwords correctly within the first 90 million guesses, while JtR-Markov (for T = 1 billion) needs at least eight times as many guesses to reach the same goal. Moreover, OMEN guesses more than 80% of passwords correctly at 10 billion guesses, more than all probabilistic password crackers we compared against.
1 Introduction
One of the main problems with passwords is that many users choose weak passwords. These passwords typically have a rich structure and thus can be guessed much faster than by brute-force guessing attacks. Best practice mandates that only the hash of a password is stored on the server, not the password itself, in order to prevent leaking plain-text passwords when the database is compromised.
In this work we consider offline guessing attacks, where an attacker has gained access to this hash and tries to recover the password pwd. The hash function is frequently designed for the purpose of slowing down guessing attempts [20]. This means that the cracking effort is strongly dominated by the computation of the hash function, making the cost of generating a new guess relatively small. Therefore, we evaluate all password crackers based on the number of attempts they make to correctly guess passwords.
John the Ripper: John the Ripper (JtR) [17] is one of the most popular password crackers. It offers several methods to generate password guesses. In dictionary mode, a dictionary of words is provided as input, and the tool tests each one of them. Users can also specify various mangling rules. Similarly to [6], we find that for a relatively small number of guesses (less than $10^8$), JtR in dictionary mode produces the best results. In incremental mode (JtR-inc) [17], JtR tries passwords based on a (modified) 3-gram Markov model.
Password Guessing with Markov Models: Markov models have proven very useful for computer security in general and for password security in particular. They are an effective tool for cracking passwords [16], and can likewise be used to accurately estimate the strength of new passwords [5]. Recent independent work [14] compared different forms of probabilistic password models and concluded that Markov models are better suited for estimating password probabilities than probabilistic context-free grammars. The biggest difference from our work is that they only approximate the likelihood of passwords, which does not yield a password guesser that outputs guesses in the correct order, the main contribution of our work.
The underlying idea of Markov models is that adjacent letters in human-generated passwords are not chosen independently, but follow certain regularities (e.g., the 2-gram th is much more likely than tq, and the letter e is very likely to follow the 2-gram th). In an n-gram Markov model, one models the probability of the next character in a string based on a prefix of length n − 1. Hence, for a given string $c_1, \ldots, c_m$, a Markov model estimates its probability as
$$P(c_1, \ldots, c_m) \approx P(c_1, \ldots, c_{n-1}) \cdot \prod_{i=n}^{m} P(c_i \mid c_{i-n+1}, \ldots, c_{i-1}).$$
For password cracking, one basically learns the initial probabilities $P(c_1, \ldots, c_{n-1})$ and the transition probabilities $P(c_n \mid c_1, \ldots, c_{n-1})$ from real-world data (which should be as close as possible to the distribution we expect in the data that we attack), and then enumerates passwords in order of descending probability as estimated by the Markov model. To make this attack efficient, we need to address two challenges: limited data makes learning these probabilities difficult (data sparseness), and enumerating the passwords in the optimal order is non-trivial.
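To make the estimation concrete, the following is a minimal Python sketch (not OMEN's actual implementation) of how initial and transition probabilities can be learned from a training list and combined into the product above; all function and variable names are illustrative, and smoothing for data sparseness is omitted.

    # Sketch of n-gram training and probability estimation (here n = 3).
    # Illustrative only; no smoothing, so unseen n-grams get probability 0.
    from collections import defaultdict

    def train(passwords, n=3):
        """Learn initial (n-1)-gram and n-gram transition probabilities."""
        initial = defaultdict(int)   # counts of the first n-1 characters
        ngram = defaultdict(int)     # counts of each n-gram
        context = defaultdict(int)   # counts of each (n-1)-gram context
        for pwd in passwords:
            if len(pwd) < n:
                continue
            initial[pwd[:n - 1]] += 1
            for i in range(n - 1, len(pwd)):
                ngram[pwd[i - n + 1:i + 1]] += 1
                context[pwd[i - n + 1:i]] += 1
        total = sum(initial.values())
        init_p = {g: c / total for g, c in initial.items()}
        trans_p = {g: c / context[g[:-1]] for g, c in ngram.items()}
        return init_p, trans_p

    def probability(pwd, init_p, trans_p, n=3):
        """P(c_1..c_m) ~ P(c_1..c_{n-1}) * prod_i P(c_i | c_{i-n+1}..c_{i-1})."""
        p = init_p.get(pwd[:n - 1], 0.0)
        for i in range(n - 1, len(pwd)):
            p *= trans_p.get(pwd[i - n + 1:i + 1], 0.0)
        return p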
1.2 Paper organization
In Section 2 we describe the Ordered Markov ENumerator (OMEN) and provide several experiments for selecting adequate parameters. Section 3 gives details about OMEN's cracking performance, including a comparison with other password guessers. We conclude the paper with a brief discussion in Section 4.
Algorithm 1 Enumerating passwords for level η and length ℓ (here for ℓ = 4).
function enumPwd(η, ℓ)
1. for each vector (a_2, …, a_ℓ) with Σ_i a_i = η,
   and for each x_1x_2 ∈ Σ² with L(x_1x_2) = a_2,
   and for each x_3 ∈ Σ with L(x_3 | x_1x_2) = a_3,
   and for each x_4 ∈ Σ with L(x_4 | x_2x_3) = a_4:
   (a) output x_1x_2x_3x_4
The algorithm first iterates over all vectors (a_2, …, a_ℓ) whose entries sum to η. For each such vector, it selects all 2-grams x_1x_2 whose probabilities match level a_2. For each of these 2-grams, it iterates over all x_3 such that the 3-gram x_1x_2x_3 has level a_3. Next, for each of these 3-grams, it iterates over all x_4 such that the 3-gram x_2x_3x_4 has level a_4, and so on, until the desired length is reached. In the end, this process outputs the set of candidate passwords of length ℓ and level (or "strength") η. Algorithm 1 above gives a more formal description for ℓ = 4; the extension to larger ℓ is straightforward.
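To make the enumeration concrete, here is a minimal Python sketch of enumPwd for arbitrary ℓ, under the assumption that levels are non-positive integers and that level tables init_level (for initial 2-grams) and trans_level (for 3-gram transitions, as in Algorithm 1) have been precomputed; all names are illustrative.

    # Sketch of enumPwd; not OMEN's actual implementation.
    def vectors(eta, k, min_level):
        """All vectors of k levels in [min_level, 0] summing to eta."""
        if k == 1:
            if min_level <= eta <= 0:
                yield (eta,)
            return
        for a in range(0, min_level - 1, -1):       # a = 0, -1, ..., min_level
            for rest in vectors(eta - a, k - 1, min_level):
                yield (a,) + rest

    def enum_pwd(eta, length, alphabet, init_level, trans_level, min_level=-10):
        """Output all passwords of the given length whose levels sum to eta."""
        def extend(prefix, a, idx):
            if idx == len(a):                       # all levels consumed
                yield prefix
                return
            for x in alphabet:                      # next char must match level a[idx]
                if trans_level.get((prefix[-2:], x)) == a[idx]:
                    yield from extend(prefix + x, a, idx + 1)
        for a in vectors(eta, length - 1, min_level):
            for gram, lv in init_level.items():
                if lv == a[0]:                      # initial 2-gram matches level a_2
                    yield from extend(gram, a, 1)

Note that the recursion abandons a branch as soon as no character matches the required level, which is what makes enumeration by level efficient.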
Example: We illustrate the algorithm with a brief example. For simplicity, we
consider passwords of length ℓ = 3 over a small alphabet Σ = {a, b}, where the
initial probabilities have levels
L(aa) = 0, L(ab) = −1,
L(ba) = −1, L(bb) = 0,
and transitions have levels
L(a|aa) = −1 L(b|aa) = −1
L(a|ab) = 0 L(b|ab) = −2
L(a|ba) = −1 L(b|ba) = −1
L(a|bb) = 0 L(b|bb) = −2.
– Starting with level η = 0 gives the vector (0, 0), which matches the password bba only (the prefix “aa” matches level 0, but there is no matching transition with level 0).
– Level η = −1 gives the vector (−1, 0), which yields aba (the prefix “ba” has no matching transition for level 0), as well as the vector (0, −1), which yields aaa and aab.
– Level η = −2 gives three vectors: (−2, 0) yields no output (because no initial probability matches the level −2), (−1, −1) yields baa and bab, and (0, −2) yields bbb.
– And so on for all remaining levels.
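This toy example can be checked mechanically with the enum_pwd sketch above:

    # Level tables of the toy example, encoded for the sketch above.
    init_level = {"aa": 0, "ab": -1, "ba": -1, "bb": 0}
    trans_level = {("aa", "a"): -1, ("aa", "b"): -1,
                   ("ab", "a"): 0,  ("ab", "b"): -2,
                   ("ba", "a"): -1, ("ba", "b"): -1,
                   ("bb", "a"): 0,  ("bb", "b"): -2}

    for eta in (0, -1, -2):
        print(eta, sorted(enum_pwd(eta, 3, "ab", init_level, trans_level,
                                   min_level=-2)))
    # Prints: 0 ['bba'], -1 ['aaa', 'aab', 'aba'], -2 ['baa', 'bab', 'bbb']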
Choosing which password length to guess next (i.e., the length of the password to be guessed) is challenging, as the frequency with which a password length appears in the training data is not a good indicator of how often a specific length should be guessed. For example, assume there are as many passwords of length 7 as of length 8; then the success probability of passwords of length 7 is larger, as the search space is smaller. Hence, passwords of length 7 should be guessed first. Therefore, we use an adaptive algorithm that keeps track of the success ratio of each length and schedules more password guesses for those lengths that have been more effective.
More precisely, our adaptive password scheduling algorithm works as follows (a code sketch follows the list):
1. For all n values of the length ℓ (we consider lengths from 3 to 20, i.e., n = 18), execute enumPwd(0, ℓ) and compute the success probability sp_{ℓ,0}. This probability is computed as the ratio of successfully guessed passwords over the number of generated password guesses of length ℓ.
2. Build a list L of size n, ordered by the success probabilities, where each element is a triple (sp, level, length). (The first element L[0] denotes the element with the largest success probability.)
3. Select the length with the highest success probability, i.e., the first element L[0] = (sp_0, level_0, length_0), and remove it from the list.
4. Run enumPwd(level_0 − 1, length_0), compute the new success probability sp*, and add the new element (sp*, level_0 − 1, length_0) to L.
5. Sort L and go to Step 3 until L is empty or enough guesses have been made.
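A minimal sketch of this scheduler, assuming the enum_pwd generator from above and a hypothetical is_hit oracle that checks a guess against the target hashes (both names are illustrative); a heap replaces the explicit re-sorting of L:

    import heapq

    def run_level(candidates, is_hit):
        """Guess every candidate; return (success probability, #guesses made)."""
        hits = tried = 0
        for pwd in candidates:
            tried += 1
            if is_hit(pwd):
                hits += 1
        return (hits / tried if tried else 0.0), tried

    def adaptive_schedule(enum_for, is_hit, lengths=range(3, 21),
                          max_guesses=10**9):
        """enum_for(level, length) yields candidates of that length and level."""
        total = 0
        heap = []   # min-heap over -sp, so the largest sp is popped first
        for ln in lengths:                           # step 1: level 0 per length
            sp, g = run_level(enum_for(0, ln), is_hit)
            total += g
            heapq.heappush(heap, (-sp, 0, ln))       # step 2: list ordered by sp
        while heap and total < max_guesses:          # steps 3-5
            neg_sp, level, ln = heapq.heappop(heap)  # step 3: best length so far
            sp, g = run_level(enum_for(level - 1, ln), is_hit)   # step 4
            total += g
            heapq.heappush(heap, (-sp, level - 1, ln))
            # (a real implementation would stop pushing a length once its
            #  minimum level has been reached)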
[Plots omitted; y-axis: CDF of cracked passwords.]
Fig. 1: Comparing different n-gram sizes (top), alphabet sizes (middle), and different numbers of levels (bottom), for the RockYou dataset.
We tested several alphabet sizes by setting k = 20, 30, 40, 50, 62, 72, 92, where the k most frequent characters of the training set form the alphabet. The results are given in Figure 1 (middle). We clearly see an increase in accuracy as the alphabet size k grows from 20 to 62. Further increasing k does not noticeably increase the cracking rate. This is mainly explained by the alphabet used in the RockYou dataset, where most users favor passwords with mostly alphanumeric characters rather than a large number of special characters. To remain dataset-independent, we opted for the 72-character alphabet. Note that datasets in different languages and/or alphabets, such as Chinese Pinyin [13], will require different OMEN parameters.
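Deriving such an alphabet from a training set is straightforward; a short sketch (names illustrative):

    # The k most frequent characters in the training set form the alphabet.
    from collections import Counter

    def top_k_alphabet(passwords, k=72):
        counts = Counter(ch for pwd in passwords for ch in pwd)
        return "".join(ch for ch, _ in counts.most_common(k))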
Number of levels: A third important parameter is the number of levels that are used to enumerate password candidates. As for the previous parameters, a higher number of levels can potentially increase accuracy, but it also increases runtime. The results are shown in Figure 1 (bottom). We see that increasing the number of levels from 5 to 10 substantially increases accuracy, but further increasing it to 20 or 30 does not make a significant difference.
Selected parameters: Unless otherwise stated, in the following we use OMEN with 4-grams, an alphabet size of 72, and 10 levels.
3.1 Datasets
Algorithm         Training Set   #guesses      Testing Set
                                               RY-e      MS        FB
Omen              RY-t           10 billion    80.40%    77.06%    66.75%
Omen              RY-t           1 billion     68.7%     64.50%    59.67%
PCFG [24]         RY-t           1 billion     32.63%    51.25%    36.4%
JtR-Markov [16]   RY-t           10 billion    64%       53.19%    61%
JtR-Markov [16]   RY-t           1 billion     54.77%    38.57%    49.47%
JtR-Inc           RY-t           10 billion    54%       25.17%    14.8%
These password lists have been used in numerous studies [24, 23, 5] and are already available to the public. Nevertheless, we treat these lists with the necessary precautions and release only aggregated results that reveal next to no information about the actual passwords (cf. [7]).
[Plots omitted; x-axis: guesses, y-axis: CDF of cracked passwords.]
Fig. 2: Comparing OMEN with the JtR Markov mode at 1 billion guesses (top), with the PCFG guesser (middle), and with JtR incremental mode (bottom).
[Plots omitted; x-axis: guesses, y-axis: CDF of cracked passwords.]
Fig. 3: Comparing OMEN with the JtR Markov mode at 1 billion guesses (left), and at 10 billion guesses (right).
[Plot omitted: OMEN with 2-grams (Omen2G) vs. the JtR Markov mode, trained on RY-t and tested on RY-e, for up to 10 billion guesses; y-axis: CDF of cracked passwords.]
This means that OMEN has a better cracking speed. The speed advantage of OMEN can be seen at 1 billion guesses, where OMEN cracks 50% of all passwords while JtR-Markov cracks less than 10%. At the point T, i.e., when JtR-Markov stops, both algorithms perform roughly the same. Note that since not all parameters (e.g., alphabet size, number of levels) of the two models are identical, there is a small difference in the cracking rates at the point T.
References
20. N. Provos and D. Mazières. A future-adaptive password scheme. In Proc. Annual Conference on USENIX Annual Technical Conference, ATEC ’99. USENIX Association, 1999.
21. S. Schechter, C. Herley, and M. Mitzenmacher. Popularity is everything: a new
approach to protecting passwords from statistical-guessing attacks. In Proc. 5th
USENIX conference on Hot topics in security, pages 1–8. USENIX Association,
2010.
22. E. H. Spafford. Observing reusable password choices. In Proc. 3rd Security Symposium, pages 299–312. USENIX, 1992.
23. M. Weir, S. Aggarwal, M. Collins, and H. Stern. Testing metrics for password
creation policies by attacking large sets of revealed passwords. In Proc. 17th ACM
conference on Computer and communications security (CCS 2010), pages 162–175.
ACM, 2010.
24. M. Weir, S. Aggarwal, B. de Medeiros, and B. Glodek. Password cracking using
probabilistic context-free grammars. In Proc. IEEE Symposium on Security and
Privacy, pages 391–405. IEEE Computer Society, 2009.
25. Word list Collection, 2012. http://www.outpost9.com/files/WordLists.html.