Data Obfuscation will not Anonymize You

Jan 10, 2022

Data obfuscation is a key concept in GDPR and in data privacy laws in general, but it doesn’t anonymize you as a person. Here is why, based on science.

I have been in the data domain for more than a decade and currently work as a Principal Data Engineer. I have implemented GDPR compliance from scratch at 3 companies, and I am an expert on data privacy topics.

There is always a discussion about what to do on the technical side. There are several approaches if you would like to build your own privacy tools/procedures, and there are also products that follow their own way. One of the biggest downsides of GDPR is that it is only a rule set. It doesn’t specify which hashing algorithms you need to use, which length they should be, etc. GDPR tries to give companies some flexibility, but that flexibility becomes a weak spot for customer data if data obfuscation (some places call it data masking or data hashing) is not performed properly.

Legal Basis

Recital 26 of the GDPR states that the principles of data protection should apply to any information, “concerning an identified or identifiable natural person.” Hence, the principles do not apply to anonymous information or to personal data through which the subject is not identifiable.

Article 11 of the GDPR addresses processing that does not require identification. If a controller (an entity that determines the purposes and means of processing personal data) does not need the identity of the data subject, the obligations of the controller under the GDPR are significantly minimized.

On paper, the rules seem very cool and protective, right? Yes, but they fall short during the technical implementation.

Technical Implementation

I will not go through every technical architecture and implementation one by one; I will focus on a very popular one, called the rainbow table. In password cracking, a rainbow table is a precomputed table that hackers use to reverse hashed passwords, because reversing a hash by brute force is a time-consuming and heavy process. Regular hash chain tables tend to suffer from chains that collide and then merge. Rainbow tables are not collision-free either, but their chains only merge when they collide at the same position, which reduces the overall number of merges.
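To make the password-cracking sense of the term concrete, here is a toy sketch of a rainbow table over 4-digit PINs. Everything in it is an assumption chosen for illustration (the PIN space, the chain length, the number of chains, the reduction function, and all function names); real tables are far larger and tuned very differently.

```python
import hashlib
from typing import Dict, Optional

PIN_SPACE = 10_000   # toy input space: all 4-digit PINs (assumption)
CHAIN_LEN = 50       # hash/reduce steps per chain (assumption)
NUM_CHAINS = 400     # number of stored chains (assumption)


def sha256_hex(pin: str) -> str:
    """Hash a PIN the way a naive system might store it (no salt)."""
    return hashlib.sha256(pin.encode("utf-8")).hexdigest()


def reduce_to_pin(digest: str, position: int) -> str:
    """Map a digest back into the PIN space.

    The chain position is mixed in, so every column uses a different
    reduction function. That is what keeps chains from merging unless
    they collide at the same position.
    """
    value = (int(digest[:8], 16) + position) % PIN_SPACE
    return f"{value:04d}"


def build_table() -> Dict[str, str]:
    """Store only the end point and start point of each chain."""
    table: Dict[str, str] = {}
    for n in range(NUM_CHAINS):
        start = f"{n * 25:04d}"
        pin = start
        for position in range(CHAIN_LEN):
            pin = reduce_to_pin(sha256_hex(pin), position)
        table[pin] = start
    return table


def crack(target: str, table: Dict[str, str]) -> Optional[str]:
    """Try to find a PIN whose hash equals target; None if not covered."""
    for position in range(CHAIN_LEN - 1, -1, -1):
        # Assume the target hash sits at `position` and walk to the chain end.
        pin = reduce_to_pin(target, position)
        for later in range(position + 1, CHAIN_LEN):
            pin = reduce_to_pin(sha256_hex(pin), later)
        start = table.get(pin)
        if start is None:
            continue
        # Rebuild the chain from its start and look for the real match.
        candidate = start
        for step in range(CHAIN_LEN):
            if sha256_hex(candidate) == target:
                return candidate
            candidate = reduce_to_pin(sha256_hex(candidate), step)
    return None
```

Calling `crack(sha256_hex(some_pin), build_table())` returns the PIN when one of the stored chains covers it and `None` otherwise; coverage is probabilistic, which is the trade-off against storing every hash.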

OK, fine, but what is the relation between password hashing and data obfuscation? Well, in data obfuscation the name is borrowed rather loosely: the “rainbow table” here is a table of hash keys, not the precomputed cracking table that hackers use, and such a key table is a popular way to obfuscate customer data. Each customer has a key, and his/her data is hashed with it. Some companies go one step further and create a hash key for each field of customer data.
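A minimal sketch of the per-customer, per-field key idea, assuming HMAC-SHA256 as the keyed hash. GDPR does not prescribe an algorithm, so the choice of HMAC, the in-memory `key_table`, and the function names are all assumptions for illustration.

```python
import hashlib
import hmac
import secrets
from typing import Dict

# Hypothetical key table: customer_id -> {field_name: secret key}.
# In practice this would live in a protected store, not in memory.
key_table: Dict[str, Dict[str, bytes]] = {}


def get_key(customer_id: str, field: str) -> bytes:
    """Return the key for this customer and field, creating it on first use."""
    fields = key_table.setdefault(customer_id, {})
    if field not in fields:
        fields[field] = secrets.token_bytes(32)
    return fields[field]


def obfuscate(customer_id: str, field: str, value: str) -> str:
    """Replace a plain value with a keyed hash (hex digest)."""
    key = get_key(customer_id, field)
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

With this sketch, the same value hashed for the same customer and field always gives the same token, so joins inside a pipeline still work, while the same value for another customer or another field gives an unrelated token.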

There is another advantage of using rainbow tables too: Article 17 says that a person (a customer, for example) has the right to be forgotten. There are edge cases, but in general, any person may ask to be erased completely. This is a very problematic request because there may be solutions/data pipelines built on your data (for example, personal recommendations), and you have to be removed from all of them. It doesn’t only mean the current data pipelines; it also means deleting historical data. Especially in the big data world, these types of requests drive data engineering teams crazy because it may not be clear what kind of data the pipelines are processing. Are you in there? Somebody has to check billions of lines of logs or whatever to see if your data is being processed. The solution? Again, the rainbow table, which holds your hash key(s): if we remove your hash key(s) from the table, we don’t need to delete your data because you are no longer recognizable.

Well, at least in theory, but not in reality.
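The erase-by-deleting-the-key idea can be sketched as follows. This is only an illustration under assumptions: HMAC-SHA256 as the keyed hash, one key per customer, and made-up names (`keys`, `event_log`, `rows_for`, `forget`) and sample values.

```python
import hashlib
import hmac
import secrets
from typing import Dict, List


def pseudonym(key: bytes, value: str) -> str:
    """Keyed hash of a value; cannot be recomputed without the key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key table and event log (sample data, not real).
keys: Dict[str, bytes] = {
    "alice": secrets.token_bytes(32),
    "bob": secrets.token_bytes(32),
}
event_log: List[Dict[str, str]] = [
    {"user": pseudonym(keys["alice"], "alice@example.com"), "event": "login"},
    {"user": pseudonym(keys["bob"], "bob@example.com"), "event": "login"},
    {"user": pseudonym(keys["alice"], "alice@example.com"), "event": "purchase"},
]


def rows_for(customer_id: str, email: str) -> List[Dict[str, str]]:
    """Find a customer's rows by recomputing the pseudonym with their key."""
    key = keys.get(customer_id)
    if key is None:
        return []  # key is gone: the pseudonym can no longer be recomputed
    token = pseudonym(key, email)
    return [row for row in event_log if row["user"] == token]


def forget(customer_id: str) -> None:
    """Handle an erasure request by deleting only the key."""
    keys.pop(customer_id, None)
```

Before `forget("alice")`, `rows_for("alice", "alice@example.com")` finds two rows; afterwards it finds none, although the log itself is untouched. Note what the sketch does not show: the rows are still there, and whether the remaining fields in them can identify the person is a separate question.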

You are Recognizable even with Data Obfuscation

The worst part of the rainbow table approach is that your data can be de-obfuscated via the hash keys, because the controller already has the hash keys somewhere (database, flat files, whatever). That may be necessary for some edge cases, but there is also another problem: not all data obfuscation gives you full anonymity.

That’s… crazy! Why? This is one of my biggest discussions with data engineering and data security teams because, generally, they believe that data obfuscation is the ultimate solution. Unfortunately, it is not. In 2000, L. Sweeney showed in her academic paper that “Simple Demographics Often Identify People Uniquely”. As it is an academic article, you can read all the details there, but here is the summary for mortals:

It was found that combinations of few characteristics often combine in populations to uniquely or nearly uniquely identify some individuals. Clearly, data released containing such information about these individuals should not be considered anonymous. Yet, health and other person-specific data are publicly available in this form. Here are some surprising results using only three fields of information, even though typical data releases contain many more fields. It was found that 87% (216 million of 248 million) of the population in the United States had reported characteristics that likely made them unique based only on 5-digit ZIP, gender, date of birth. About half of the U.S. population (132 million of 248 million or 53%) are likely to be uniquely identified by only place, gender, date of birth, where place is basically the city, town, or municipality in which the person resides. And even at the county level, county, gender, date of birth are likely to uniquely identify 18% of the U.S. population. In general, few characteristics are needed to uniquely identify a person.
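The kind of measurement described above can be sketched in a few lines: count how many records are the only one with their combination of fields. The six records below are made up for illustration (they are not from the paper), and the function name is an assumption.

```python
from collections import Counter
from typing import Dict, List, Sequence

# Made-up sample records; names and other direct identifiers already removed.
records: List[Dict[str, str]] = [
    {"zip": "02139", "gender": "F", "dob": "1985-03-12"},
    {"zip": "02139", "gender": "F", "dob": "1985-03-12"},
    {"zip": "02139", "gender": "M", "dob": "1985-03-12"},
    {"zip": "02139", "gender": "F", "dob": "1990-07-01"},
    {"zip": "10001", "gender": "F", "dob": "1985-03-12"},
    {"zip": "10001", "gender": "M", "dob": "1972-11-30"},
]


def unique_fraction(rows: List[Dict[str, str]], fields: Sequence[str]) -> float:
    """Fraction of rows whose combination of `fields` appears exactly once."""
    combos = Counter(tuple(row[f] for f in fields) for row in rows)
    unique = sum(1 for row in rows if combos[tuple(row[f] for f in fields)] == 1)
    return unique / len(rows)


print(unique_fraction(records, ["gender"]))                # 0.0
print(unique_fraction(records, ["zip", "gender"]))         # 0.5
print(unique_fraction(records, ["zip", "gender", "dob"]))  # 4 of 6 rows
```

No single field singles anyone out in this toy data, but the three fields together make four of the six people unique, with no name or e-mail anywhere in the records.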

So, What are We Going to do as a Person (Customer)?

Well, this is a very open question because, basically, there is not much you can do other than hope that the data engineering and security people know their jobs. In my 10+ years of career, I have seen hundreds of business intelligence (BI) data pipelines that carry this information (ZIP, gender, date of birth) in plain text. Nowadays, companies are a bit more aware that BI systems do not need it unless they are doing segmentation, but data science has now taken over for intelligent analytics, and it definitely uses this data. No matter where they store the data (cloud, on-premise, spaceship, etc.), you are not fully anonymized as long as you share this information.

Companies should have data governance teams that establish controls to appropriately mask or encrypt sensitive personal data. The data masking standards need to ensure that a person’s identity cannot be reconstructed when multiple fields are combined.

I will talk about Data Governance in later articles.
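One way such a masking standard can be checked is to coarsen the fields and then measure the smallest group of records that share the same combination (the “k” in k-anonymity). The sketch below is only an illustration: the generalization rules (3-digit ZIP prefix, birth year only), the sample records, and the function names are all assumptions, and a real standard would choose them per dataset.

```python
from collections import Counter
from typing import Dict, List, Sequence

FIELDS = ["zip", "gender", "dob"]

# Made-up sample records for illustration.
raw: List[Dict[str, str]] = [
    {"zip": "02139", "gender": "F", "dob": "1985-03-12"},
    {"zip": "02141", "gender": "F", "dob": "1985-09-30"},
    {"zip": "02142", "gender": "F", "dob": "1985-01-05"},
    {"zip": "02139", "gender": "M", "dob": "1990-07-01"},
    {"zip": "02138", "gender": "M", "dob": "1990-12-24"},
]


def generalize(row: Dict[str, str]) -> Dict[str, str]:
    """Coarsen the fields: keep a 3-digit ZIP prefix and the birth year only."""
    return {
        "zip": row["zip"][:3] + "**",
        "gender": row["gender"],
        "dob": row["dob"][:4],
    }


def smallest_group(rows: List[Dict[str, str]], fields: Sequence[str]) -> int:
    """Size of the smallest set of rows sharing one combination of fields."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return min(counts.values())


masked = [generalize(row) for row in raw]
print(smallest_group(raw, FIELDS))     # 1: every person is unique
print(smallest_group(masked, FIELDS))  # 2: each person hides among at least 2
```

A smallest group of 1 means at least one person is unique on the combined fields, however well each field is masked on its own; the cost of raising that number is less precise data for analytics.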

Enough with the Conspiracy

I hear some of you saying that this is a conspiracy theory… OK, let’s look at some real-world cases:

Netflix’s recommendation engine challenge (the Netflix Prize) turned out to be a fiasco for data privacy.
AOL’s search data leak exposed PII data.
Microsoft Data Breach Exposes 38M Records Containing PII
2017 Equifax data breach
Marriott discloses massive data breach affecting up to 500 million guests
eBay urges users to reset passwords after cyberattack
Cathay Pacific stocks plunge after airline reveals mass data breach by hacker
Turkey launches inquiry into leak of 50 million citizens’ data

In the early days, it was only a username and password problem, but now it is bigger. A leaked username and password are easier to manage nowadays, even for bank accounts, but PII data breaches are extremely dangerous because someone can copy your existence in the world (and not like Dolly the sheep). For example, you may not be aware that someone pretending to be you is buying property on credit and not paying it back.

Last Words and Advice

If you believe that a website/service provider doesn’t actually need your information, you should fake it. Of course, that doesn’t work if you would like to order something from the Internet, but if, for example, a game asks for such details, don’t give them.

You can use e-mail relays, but they are not the ultimate solution. The ultimate solution is awareness and not giving away your data without thinking. Some people think: “I am a useless person, someone simple, so they can use my data because they provide a free service.” That may be true from this perspective if you are happy to trade your data for a free service, but if this data leaks, your simple life may be ruined.

Data is power, and power comes with responsibility. If you don’t take responsibility for your data, you cannot expect someone else to do it for you.

/var/log/canartuc
