Guessing Passwords using an LSTM Network


A while ago I applied NLP strategies to implement an algorithm that is capable to guess a password. Since then a new method called transformer was developed and successfully used in many NLP tasks. Before transformers were the method of choice, LSTM Models dominated the field. In this article we will lay down a setup for such an LSTM Model to guess passwords and derive why a network might be better suited than the Markov chain model presented in this post.

A shot recap of our notation: Let X \in \mathbb{X} \, ,X=x_1 x_2 x_3 \dots x_E be a password and the random variables x_i \in \mathbb{A}, E \in \mathbb{N}^+ the corresponding letters and password length. \mathbb{X} is the space of all passwords and \mathbb{A} corresponds to the used alphabet, for example if we allow only for lower case letters then \mathbb{A} =\{a,b,c, \dots, y,z \}. As usual this forms a discrete probability space given by (\mathbb{X}, P). The task to solve is to derive P(x_i|x_{i-1},\dots,x_{1}).

In the Markov chain example it is assumed that P(x_i|x_{i-1},\dots,x_{1}) = P(x_i|x_{i-1}) which obviously falls short. Our strategy to overcome this problem was to use “n-grams”, therefore one can look at sequences of length n instead of a single letter. However with raising n a major problem is that the possibility to observe a each sequence in the training set decreases which will lead to an underestimation of the probability for such sequences. This is where Recurrent Neuronal Networks (RNN) step in (LSTM is a specific RNN).

The terminology of neuronal networks

To start with, we will first have a brief look at more simple networks. There has been a great deal of hype surrounding
neural networks, making them seem magical and mysterious. However, they are just nonlinear statistical models [1]. In particular, a neuronal network is a class of functions mapping a random variable X f: X \rightarrow Y. However the terminology in the literature is quiet different. [2] gave the following translation:

  • variables are called features
  • independent variables are called inputs
  • predicted values are called outputs
  • dependent variables are called targets or training values
  • residuals are called errors
  • estimation is called training, learning, adaptation, or self-organization
  • an estimation criterion is called an error function, cost function, or Lyapunov function
  • observations are called patterns or training pairs
  • parameter estimates are called (synaptic) weights
  • interactions are called higher-order neurons
  • transformations are called functional links
  • regression and discriminant analysis are called supervised learning or heteroassociation
  • data reduction is called unsupervised learning, encoding, or autoassociation
  • cluster analysis is called competitive learning or adaptive vector quantization

The statistical terms sample and population do not seem to have Neuronal Network equivalents. However, the data are often divided into a training set and test set for cross-validation.

Feed-Forward Neuronal Networks

Our next work guessing task can be understood like as a classification problem where the number of classes corresponds to the size of K=|\mathbb{A}|. Our target in this case is to predict the actual letter x_i based on the previously observed letter x_{i-1}. Since neuronal nets can not deal with letters directly we have to convert them into an input vector \mathbf{x}_{i-1} with K components and a target vector \mathbf{x}_i which has also K components (in general the size of the vectors can differ). Both vectors are dummy variables (one-hot encoding) such that the j-th component of \mathbf{x}_i is 1 if the observed letter x_i is at the j-th position of the alphabet \mathbb{A} and 0 else.
Additionally we construct M derived features Z_m, these features can be compared to a linear basis generated from the data. \mathbf{Z}=Z_1,\dots, Z_M is called the hidden layer, since the outputs of \mathbf{Z} are not directly observed. Adding more then one hidden layer enables the network to capture interactions. Thus each component k=1,\dots,Kfor the estimator \hat{\mathbf{x}}_i of \mathbf{x}_i is modeled by f_k(\mathbf{x}_{i-1}) as a linear combination of Z_m:

    \[\arraycolsep=1.4pt\def\arraystretch{2.2}\begin{array}{l}Z_m= \sigma( \alpha_{0m} + \alpha^T_{m} \mathbf{x}_{i-1}), m=1,\dots, M \\T_k= \beta_{0k}+\beta^T_{k}\mathbf{Z}, k=1,\dots,K \\f_k(\mathbf{x}_{i-1})=g_k(\mathbf{T}),k=1,\dots,K\]

where \mathbf{T}=T_1,\dots,T_K. A popular choice for g_k is the softmax function g_k(\mathbf{T})=\frac{e^{T_k}}{\sum_{l=1}^Ke^{T_l}} which ensures that f_k(\mathbf{x}_{i-1}) is between [0,1]. In a regression setup frequently g_k(\mathbf{T})=T_k is used. \sigma is an “activation function, popular choices are \sigma(x)=tanh(x) or the sigmoid function \sigma(x)=\frac{1}{1+e^{-x}}.
The system of equations has unknown variables \theta=(\alpha, \beta) where \alpha \in \mathbb{R}_{M \times K+1}, \beta \in \mathbb{R}_{K+1 \times M} to be estimated from the data.
Assume that we observe j=1,\dots,N passwords with length E_j resulting in pairs consisting of (\mathbf{x}_{i-1j},\mathbf{x}_{ij}). To fit the network we have to define a suitable loss function, since we deal with a classification problem we use the cross-entropy loss function R(\theta) = - \sum_{j=1}^N \sum_{i=2}^{E_j} \mathbf{x}_{ij}^T log(f(\mathbf{x}_{i-1j}, \theta)) minimized over \theta.

Recurrent Neuronal Networks and LSTM

By construction the Feed-Forward network suffers from the same issue as Markov-Chain attempt, we only look at the previous letter ignoring the entire prior sequence. To model such time dependent sequences RNN steps in. We will therefore extend our network, allowing to use previous letters of the password sequences by altering the hidden layer such that an estimator for x_i is given by

    \[\arraycolsep=1.4pt\def\arraystretch{2.2}\begin{array}{l}Z_{im}= \sigma( \alpha_{0m} + \alpha^T_{m}\mathbf{x}_{i-1}+ w^T_{m} \mathbf{Z}_{i-1}), m=1,\dots, M \\T_{ik}= \beta_{0k}+\beta^T_{k}\mathbf{Z}_i, k=1,\dots,K \\f_k(\mathbf{x}_{1}, \dots, \mathbf{x}_{i-1})=g_k(\mathbf{T}_i),k=1,\dots,K\end{array}\]

This model now consists of unknown variables \alpha \in \mathbb{R}_{M \times K+1}, \beta \in \mathbb{R}_{K+1 \times M}, w \in \mathbb{R}_{M \times M} to be estimated from the data. Assume again that we observe j=1,\dots,N passwords with length E_j. To train the network we use all possible subsequences (\mathbf{x}_{1j}, \mathbf{x}_{2j}), (\mathbf{x}_{1j}\mathbf{x}_{2j}, \mathbf{x}_{3j}), … .

An RNN works well to handle sequential data but runs into problems if influential letters are far away. This is not much a problem of the network itself, but rather of the methods used to train it. Neuronal Networks where fitted using backpropagation, the problem concerning RNN are vanishing/exploding gradients [3] LSTM (to a major part) solves this issue by allowing to connect letters which are far away. This is archived with having a shortcut in C_{im} to connect the LSTM cells C_{jm} of different letters .

    \[\arraycolsep=1.4pt\def\arraystretch{2.2}\begin{array}{l}I_{im} = \sigma_1 ( \alpha^A_{0m} + \alpha^A^T_{m}\mathbf{x}_{i-1}+ w^A^T_m \mathbf{Z}_{i-1} ), m=1,\dots, M\\F_{im} = \sigma_2 ( \alpha^F_{0m} + \alpha^F^T_{m}\mathbf{x}_{i-1}+ w^F^T_m \mathbf{Z}_{i-1} ), m=1,\dots, M\\O_{im} = \sigma_3 ( \alpha^O_{0m} + \alpha^O^T_{m}\mathbf{x}_{i-1}+ w^O^T_m \mathbf{Z}_{i-1} ), m=1,\dots, M\\\tilde{C}_{im} = \sigma_4 ( \alpha^G_{0m} + \alpha^G^T_{m}\mathbf{x}_{i-1}+ w^G^T_m \mathbf{Z}_{i-1} ), m=1,\dots, M\\C_{im}= C_{i-1 m} F_{im} +I_{im} \tilde{C}_{im}, m=1,\dots, M\\Z_{im}=O_{im} \sigma_5(C_{im}) , m=1,\dots, M\\T_{ik}= \beta_{0k}+\beta^T_{k}\mathbf{Z}_i, k=1,\dots,K \\f_k(\mathbf{x}_{1}, \dots, \mathbf{x}_{i-1})=g_k(\mathbf{T}_i),k=1,\dots,K\end{array}\]

In the literature \sigma_4 and \sigma_5 are represented by tanh(x), while the choice of \sigma_1 to \sigma_3 is usually a sigmoid function.


The proposed methods where used to fit a Feedforward, a Recurrent Neuronal net as well as a Long-Short-Term-Memory net. To train the network the 10.000.000 most popular passwords where used. The fitted models can be found in my github repo the corresponding code can be found here. Figure 1-3 represent the probabilities for the next letter if “lov” was observed. We can see that while LSTM and RNN predict “e” as the most probable letter the Feedforward net prefers “eof” which means that the passwords ends. This misjudgment is not surprising, since all passwords have in common that they end at some point. Because the FF network does not know the length of the password, the best guess is the end of the password. RNN and LSTM do not make this mistake. It is noticeable that the probabilities look similar, but in the LSTM network the probabilities concerning “e” are more pronounced.

Figure 1: Probabilities predicted by the FF Net after the sequence “lov” was observed.
Figure 2: Probabilities predicted by the RNN after the sequence “lov” was observed.
Figure 3: Probabilities predicted by the LSTM Net after the sequence “lov” was observed.
[1] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning, New York, NY, USA: Springer new york inc., 2001.
added-at = {2008-05-16T16:17:42.000+0200},
address = {New York, NY, USA},
author = {Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome},
biburl = {},
interhash = {d585aea274f2b9b228fc1629bc273644},
intrahash = {f58afc5c9793fcc8ad8389824e57984c},
keywords = {ml statistics},
publisher = {Springer New York Inc.},
series = {Springer Series in Statistics},
timestamp = {2008-05-16T16:17:43.000+0200},
title = {The Elements of Statistical Learning},
year = 2001
[2] W. Sarle, “Neural networks and statistical models.” 1994.
title={Neural Networks and Statistical Models},
author={Warren Sarle},
[3] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, pp. 1735-1780, 1997.
title={Long Short-Term Memory},
author={Sepp Hochreiter and J{\"u}rgen Schmidhuber},
journal={Neural Computation},

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.