2019-01-15

Generating pseudonymized 1-character IDs from integers

A client of mine asked me to generate pseudonymized statistics from survey data. Because of PII concerns, they wanted to completely eliminate names and static user IDs from the resulting document. Since the number of users participating in a survey is usually below 25-30, 1-character IDs seemed like a good choice. To give you context, responses to a survey with a scale of 1 to 5 would look like this:


Here's a scalar-valued function that does such a translation:



Points of interest:

  1. The function takes @seed as a parameter, which should be the same for all IDs in the batch. You may, and probably should, get a new random seed every time you start processing a new batch of IDs (in my case, every time I generate stats for a new survey);
  2. The maximum number of distinct characters that can currently be used for encoding is 62: upper- and lower-case Latin letters and digits. To extend this approach, one can either append more encoding characters to the set or modify the logic to encode user IDs with 2-character sequences. Using Latin letters alone, the latter gives 52^2 = 2,704 possible encodings, which is far more than a survey with 25-30 participants needs.
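
To illustrate the two points above outside of SQL, here is a minimal Python sketch. It is not the function from this post: the names `ALPHABET` and `encode_batch` and the `width` parameter are made up for illustration, and it assumes the mapping is built by sampling codes with a seeded generator, which may differ from how the original function works.

```python
import itertools
import random
import string

# 62 encoding characters: upper case, lower case, digits.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits


def encode_batch(user_ids, seed, width=1):
    """Map each distinct integer ID in a batch to a distinct code.

    Hypothetical helper: width=1 yields 1-character codes (up to 62 IDs),
    width=2 yields 2-character codes (up to 62**2 IDs).
    """
    distinct = sorted(set(user_ids))
    # All candidate codes of the requested width, in a fixed order.
    codes = ["".join(p) for p in itertools.product(ALPHABET, repeat=width)]
    if len(distinct) > len(codes):
        raise ValueError("more distinct IDs than available codes")
    # The same seed always reproduces the same mapping for the same batch.
    rng = random.Random(seed)
    chosen = rng.sample(codes, len(distinct))
    return dict(zip(distinct, chosen))
```

Calling `encode_batch([101, 205, 307], seed=42)` returns a dictionary from each ID to its character. Calling it again with the same IDs and seed gives the same dictionary, which is what keeps IDs consistent within one batch.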

Here are two examples of the same stats encoded in two separate batches.
Batch 1:
Same data in batch 2:


As you can see, the encoding characters were randomly changed for the second batch.
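
The per-batch reshuffling can be sketched in Python as follows. This is an illustration under assumptions, not the code used for the examples here: `new_batch_seed` and `shuffled_alphabet` are hypothetical helpers, and the 32-bit seed size is an arbitrary choice.

```python
import random
import secrets
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits


def new_batch_seed():
    """Draw an unpredictable 32-bit seed for one batch (hypothetical helper)."""
    return secrets.randbits(32)


def shuffled_alphabet(seed):
    """Return the 62 characters permuted deterministically by the seed."""
    chars = list(ALPHABET)
    random.Random(seed).shuffle(chars)
    return "".join(chars)


# Each batch gets its own seed, so the same ID position maps to a
# different character from one batch to the next (with high probability).
batch_1 = shuffled_alphabet(new_batch_seed())
batch_2 = shuffled_alphabet(new_batch_seed())
```

Note that Python's `random` module is not a cryptographic generator, so this only shows the mechanics of reseeding per batch.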
