Attacks on Applications of K-Anonymity — For the Rest of Us

9 min readAug 20, 2019

Three weeks ago I saw a blog post by fellow bug hunter, Jack Cable. The post both inspired and challenged me. The attack vector presented was focused more on reduction in computational security than a binary outcome (e.g. XSS, which either fires or it doesn’t).

Jack’s article presents a theoretical attack by a malicious (or compromised) application using K-Anonymity, similar to that used by Have I Been Pwned (HIBP).

Today, we’ll briefly talk about what K-Anonymity is and why the implementation of K-Anonymity in HIBP matters. Then we’ll break down the attack described by Jack Cable in a way that doesn’t require in-depth technical knowledge.

What is K-Anonymity

The concept of K-Anonymity has been explained elsewhere online so I intend to borrow heavily from other sources here. We really only need a basic understanding of K-Anonymity before we take a look at attacks against it.

From dataprivacylab.org, K-Anonymity is explained in the following way:

Consider a data holder, such as a hospital or a bank, that has a privately held collection of person-specific, field structured data. Suppose the data holder wants to share a version of the data with researchers. How can a data holder release a version of its private data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful? … A release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release…

I’ve bolded the main problem that K-Anonymity attempts to address. K-Anonymity attempts to achieve a “diffusion of identity” by applying various levels of suppression and generalization.

Generalization is the practice of taking a specific piece of information (like a home address) and reducing the precision of the information to a form that is still accurate, but less specific. An example of generalization is transforming a record’s home address to city.

Suppression is the practice of omitting (or suppressing) information entirely.

For example, if I have a collection of data (a database, for example) and I want to share that data with some researchers, it’s very likely that I will infringe on the privacy of the individuals represented in the dataset. To make it more difficult to attribute the data to particular individuals, we can selectively generalize sensitive data fields, or we can omit (or suppress) them.

That’s basically all we’ll need to know about K-Anonymity to understand the rest of this post.

Use Case: Have I Been Pwned

Have I Been Pwned is an awesome service started by Troy Hunt that allows you to determine if one of your accounts has been compromised in a data breach. Troy collects data dumps from breaches, proves the validity of the data, and then uploads them into his service which then notifies anyone subscribing to HIBP whether or not they‘re present in the dump.

It’s an incredibly useful service and I highly recommend everyone subscribe for notifications here.

Have I Been Pwned also offers another service which is going to be the subject of our criticisms today. Troy worked with Junade Ali from Cloudflare to build the HIBP “Pwned Passwords” service. This service allows you to safely query whether or not a particular password is present in Troy’s collection of compromised passwords.

There are a couple of engineering challenges already:

We definitely don’t want to require that users send their plaintext password to this service! (“Just send me your credit card number and I’ll tell you if a hacker has it”)
We probably don’t want to share out a list of plaintext passwords as a response

Troy and Junade came up with an alternative system modeled after K-Anonymity that involves using client-side hashing and partial redaction of queries (see: generalization above) submitted to the Pwned Password API. The service then returns a list of password hashes that share the same prefix as the input so the user can determine if their password is in that list.

If that’s not particularly clear, here’s a diagram explaining it better.

Here we can see that the password we wish to check is first hashed on the client-side producing a digest (F3BBBD66A63D…). Afterwards, we only send the first 5 characters of the digest to the Pwned Passwords API. The service then returns a list of password digests which begin with the same sequence we sent (in reality, they don’t send the 5 character prefixes back.. but that’s just an implementation detail and not important here). Our client then takes the full, unredacted digest of the password we wish to check (F3BBBD66A63D…) and determines if that digest is in the returned list.

This system allows a user to avoid sending over the plaintext password (or a simple digest) and potentially compromising their password if HIBP ever became malicious or was compromised.

While this system appears to provide those guarantees to a reasonable level of “computational security”, Jack Cable’s work showed that if we make some reasonable assumptions on how users might use the service, that a malicious or compromised HIBP could actually greatly reduce the security of your password(s) despite the usage of a partial digest generalization.

Attacks on HIBP’s Model of k-anonymity

Jack describes two attacks against this model. One relies on your passwords already being compromised (and therefore Troy has a copy of the plaintext, compromised passwords). The other attack relies on a relationship, described as being neighboring passwords, between multiple password hashes submitted by the user. As it turns out, both attacks rely on very similar principles therefore to make this easier to understand, I’ve taken the liberty of explaining just the underlying principle of both attacks instead of each of them in succession.

Wanna Play A Game

Let’s imagine that we’re gonna play a game…

Well.. not that kind of game.

We’re going to play a special game of Hangman with a few special changes to the rules.

Instead of one game of Hangman, we’re going to play three concurrent games of Hangman all at once.
Instead of showing you how many letters are in the word, I’m going to show you the first three letters of the word. The words can be any length after that. (app can be apple or application)
Lastly, I will ensure that the words for all three games will be synonyms of each other.

This game might seem daunting considering that you’re trying to win not only 1 but 3 games at the same time; however, if we use the modified rules of this game to our advantage we can probably win all three with a few guesses.

Let’s assume that the hint (the first three letters) for our three games are the following:

HAC
CRA
CYB

The first thing we’re going to do is grab a Dictionary and a Thesaurus. We’re, then, going to want to try and figure out all of the possible answers for Game 1. If we were to build a Venn diagram of all words and all words that start with HAC then we might see this.

Venn Diagram showing crude relationship between “All Words” and “Words beginning with HAC”

The answer for Game 1 lies within that small yellow circle. Unfortunately, that circle still contains too many words for us to guess all of them before we lost game 1.

Thankfully, we can still use this information from the Dictionary to our advantage.

Our next step is to take the list of all words that start with HAC and use our Thesaurus to build another list. A list of every synonym for words that begin with HAC.

If our list of words from the Dictionary looked like the following:

Hack
Hackberry
Hacked
Hacker
Hackers
Hacking
Hacksaw
...

Then our list from the Thesaurus may contain the following synonyms:

While this list is long, it’s certainly a LOT shorter than the “List of All Words”. We know that the answer to Game 2 and Game 3 MUST be in this list as they are synonyms of each other. Therefore, the next time we need to find words that only start with a particular prefix, we can search through this much smaller set of words.

In our particular case, only 2 words start with CRA (cracker and crackers) making the answer to game 3 pretty easy to guess at this point. However, just for the sake of the example, we’ll continue to go through the process.

It might appear that we should just repeat the previous step and build a list of synonyms from the “List of All Words” that start with CRA. But, as we said, we only care about CRA words that are also present in the list of synonyms we created for HAC. To narrow our list even further, words we need for Game 3 (which start with CYB) also need to be in both lists of synonyms of HAC and CRA words. At this point it’s probably best to represent all this in another diagram.

We’re particularly interested in the middle section as these are possible candidates for Game 3 being both synonyms of HAC words and synonyms of CRA words (where the CRA words are synonyms of HAC words).

We’re basically establishing the following relationship.

Eventually, after we’ve narrowed down all of the words that are:

Synonyms of each other
Start with HAC, CRA, and CYB respectively

We’re left with few options for the games:

Game 1: Hacker, Hackers
Game 2: Cracker, Crackers
Game 3: Cyberpunk

The relationship of the words allowed us to significantly reduce the possible answers to these games much more easily than if this relationship didn’t exist.

No More Fun and Games

So how does playing games of Hangman relate to the Pwned Passwords API and HIBP? Well, it’s pretty well known that many people reuse passwords or reuse parts of passwords (hunter2 and hunter3 or june1997 and june1998). Jack Cable described this relationship between similar passwords as neighboring passwords.

We generalize password modification to the notion of neighboring passwords, which encapsulates two passwords that are similar to each other.

In the imaginary game we described above, having the secret words being synonyms of each other was a significant advantage. Likewise, if we assume that the user of the Pwned Password service is checking neighboring passwords with their three queries then an attacker can gain a similar advantage in their guessing.This means that if HIBP was compromised or Troy was secretly malicious, it would be possible for HIBP to significantly increase their ability to guess your passwords even if that password didn’t previously appear in their list of compromised passwords.

Conclusion

I would like to clear up some potentially confusing things:

I think that the work Troy Hunt is doing with HIBP is very awesome and I do recommend that everyone signup for notifications from the service.
I also think at the work Troy is doing with Pwned Passwords is awesome and he’s probably not spending his time trying to break into random people’s Neopet’s account.
The point of Jack’s work and this blog post is to help make sure everyone is aware of the potential risks in using a system like Pwned Passwords and makes an informed decision on how to use it. Do note that we’re taking some liberties here in demonstrating this attack and this attack scenario won’t apply to everyone.

Lastly, if you aren’t already, you should definitely use a password manager like 1Password or LastPass and make sure to use long, unique, randomly generated passwords for all of your services to ensure that your accounts remain safe.

If you’ve made it this far, thanks for taking the time to read this long, technical piece I’ve been working on. I thank Jack Cable for writing up such a cool blog post that I just needed to talk about it. I also thank the reviewers for helping make this post easy for everyone to read.

Keep an eye out for the next piece I’ll be releasing which has everything to do with Regular Expressions.