A Large-Scale Evaluation of High-Impact Password Strength Meters

Abstract
Passwords are ubiquitous in our daily digital lives. They protect various types of assets ranging from a simple account on an online newspaper website to our health information on government websites. However, due to the inherent value they protect, attackers have developed insights into cracking/guessing passwords both offline and online. In many cases, users are forced to choose stronger passwords to comply with password policies; such policies are known to alienate users and do not significantly improve password quality. Another solution is to put in place proactive password-strength meters/checkers to give feedback to users while they create new passwords. Millions of users are now exposed to these meters on highly popular web services that use user-chosen passwords for authentication. More recently, these meters are also being built into popular password managers, which protect several user secrets including passwords. Recent studies have found evidence that some meters actually guide users to choose better passwords—which is a rare bit of good news in password research. However, these meters are mostly based on ad hoc design. At least, as we found, most vendors do not provide any explanation for their design choices, sometimes making them appear as a black box. We analyze password meters deployed in selected popular websites and password managers. We document obfuscated source-available meters, infer the algorithm behind the closed-source ones, and measure the strength labels assigned to common passwords from several password dictionaries. From this empirical analysis with millions of passwords, we shed light on how the server end of some web service meters functions and provide examples of highly inconsistent strength outcomes for the same password in different meters, along with examples of many weak passwords being labeled as strong or even excellent . These weaknesses and inconsistencies may confuse users in choosing a stronger password, and thus may weaken the purpose of these meters. On the other hand, we believe these findings may help improve existing meters and possibly make them an effective tool in the long run.

This publication has 17 references indexed in Scilit: