Anonymizing Data is Really Hard

“On Taxis and Rainbows” is a fascinating piece about how a huge anonymized set of NYC taxi data was released, and how clever folks can reverse the anonymization:

A cryptographically secure hashing function, like MD5, is a one-way function: it always turns the same input to the same output, but given the output, it’s pretty hard to figure out what the input was as long as you don’t know anything about what the input might look like. This is mostly what you’d like out of an anonymization function. The problem, however, is that in this case we know a lot about what the inputs look like.
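To make that last point concrete, here is a minimal sketch of why a known input format defeats hashing. The ID format below (digit, letter, digit, digit) is an assumption chosen for illustration, not necessarily the format used in the released data, and the function names are hypothetical. With only 26,000 possible inputs, you can hash every one of them and simply look up the "anonymized" value:

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(value: str) -> str:
    """Return the MD5 hex digest of an ASCII string."""
    return hashlib.md5(value.encode("ascii")).hexdigest()


def candidate_ids():
    """Yield every ID in an assumed format: digit, letter, digit, digit.

    This format is a stand-in for illustration; the point is only that
    the space of valid inputs is tiny (10 * 26 * 10 * 10 = 26,000).
    """
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        yield f"{d1}{letter}{d2}{d3}"


def build_lookup_table() -> dict:
    """Hash every candidate once and map each digest back to its input."""
    return {md5_hex(candidate): candidate for candidate in candidate_ids()}


# Precompute the table, then "de-anonymize" a hashed ID with one lookup.
table = build_lookup_table()
anonymized = md5_hex("5X55")        # what a hashed data set would contain
recovered = table.get(anonymized)   # the original ID, or None if not in the table
print(recovered)
```

Nothing here breaks MD5 itself. The one-way property only helps when an attacker can't enumerate the inputs, and with a short, rigidly formatted ID they can enumerate all of them in well under a second.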
