1. Perform the Analysis
To start with, we download the “rockyou.txt.tar.gz” password list using wget. I’m not sure whether it is legal to provide a link to the list, so just ask a search engine 😉 . The next step is to extract the file with sudo tar -zxvf rockyou.txt.tar.gz and to copy the data into the Hadoop file system: we create a new folder with hadoop fs -mkdir /passwords and copy the data into it with hadoop fs -copyFromLocal rockyou.txt /passwords. To test the script I also loaded the rockyou-10.txt file, which contains a sample of the list. We will now compute an estimator for the density of the password length, given by
$$\hat{f}_{\mathrm{length}}(\ell) = \frac{\#\{\text{passwords of length } \ell\}}{\#\{\text{passwords}\}},$$
and accordingly
$$\hat{f}_{\mathrm{letter}}(c) = \frac{\#\{\text{occurrences of the letter } c\}}{\#\{\text{letters over all passwords}\}}$$
as an estimator for the density of a given letter (see this post for notation details). The densities are then saved to CSV files and the 10 elements with the highest probability are printed out.
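To make the two estimators concrete before turning to the Spark version below, here is a minimal plain-Python sketch; the toy password list is made up purely for illustration and is not part of the original script:
from collections import Counter

# Made-up toy list, only to illustrate the two estimators
toy_passwords = ["password", "123456", "iloveyou", "princess"]

# Length density: #{passwords of length l} / #{passwords}
length_counts = Counter(len(pw) for pw in toy_passwords)
length_dens_toy = {l: float(n) / len(toy_passwords) for l, n in length_counts.items()}

# Letter density: #{occurrences of the letter c} / #{letters over all passwords}
letters = [c for pw in toy_passwords for c in pw]
letter_counts = Counter(letters)
letter_dens_toy = {c: float(n) / len(letters) for c, n in letter_counts.items()}

print length_dens_toy    # e.g. {8: 0.75, 6: 0.25}
print letter_dens_toy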
from operator import add            # needed for reduceByKey(add)
from numpy import true_divide       # floating-point division
from pyspark import SparkContext
#sc = SparkContext()                # not needed in the pyspark shell, where sc is predefined

### Load the passwords from Hadoop into an RDD
passwords = sc.textFile("hdfs://master:54310/passwords/rockyou.txt")
### To check our code we first load a subsample
#passwords = sc.textFile("hdfs://master:54310/passwords/rockyou-10.txt")

### Explore the password length
# Compute the length of each password
leng = passwords.map(lambda x: len(x))

## Compute the probabilities
# Count the total number of passwords
pws_cnt = passwords.count()
# Count the number of passwords for each length
absv = leng.map(lambda word: (word, 1)).reduceByKey(add)
# Divide by the total number of passwords to get a density
length_dens = absv.map(lambda (y, x): (y, true_divide(x, pws_cnt))).cache()
# Save the densities to a text file
length_dens.saveAsTextFile("length_dens.csv")
# Print the 10 most common password lengths
print length_dens.takeOrdered(10, lambda (k, v): -v)

## Compute the probabilities for the letters
# Construct one long array of letters
flatpw = passwords.map(lambda line: list(line)).flatMap(lambda x: x)
# Count the total number of letters
word_cnt = flatpw.count()
# Count the number of occurrences of each letter
abswr = flatpw.map(lambda word: (word, 1)).reduceByKey(add)
# Divide by the total number of letters to get a density
word_dens = abswr.map(lambda (y, x): (y, true_divide(x, word_cnt))).cache()
# Save the densities to a text file
word_dens.saveAsTextFile("word_dens.csv")
# Print the 10 most popular letters
print word_dens.takeOrdered(10, lambda (k, v): -v)
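As a quick sanity check (not part of the original script), each estimated density should sum to roughly one:
# Both sums should be close to 1.0
print length_dens.values().sum()
print word_dens.values().sum()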
What do we learn from this toy experiment? For the password length we obtain the following density:
| Length | Density |
| --- | --- |
| 8 | 0.20684851660833842 |
| 7 | 0.17479055053644313 |
| 9 | 0.15278606111615334 |
| 10 | 0.14040993444754818 |
| 6 | 0.13589095556583755 |
| 11 | 0.060365058370201986 |
| 12 | 0.038697146501374652 |
| 13 | 0.025382464825449893 |
| 5 | 0.018100454735234143 |
| 14 | 0.017306904141137815 |
For the letter density we obtain:
| Letter | Density |
| --- | --- |
| a | 0.070450032335993312 |
| e | 0.057498677721715831 |
| 1 | 0.05366393124621479 |
| 0 | 0.04574238186102738 |
| i | 0.044320375895981298 |
| 2 | 0.04173564796751799 |
| o | 0.041303826710886733 |
| n | 0.038522687445766347 |
| r | 0.036524316843870627 |
| l | 0.035604072994829351 |
So if you have no clue about a password, try one of length 8 with a lot of a’s and e’s in it 😉