2019-01-29

I was pleasantly surprised to find that I've been using 9 of the 12 apps on Gizmodo's list of the best free apps for a new Windows notebook for years now. I'd add a few more to the list, however:

  • Notepad++ as the default replacement of the stock Notepad 
  • AutoHotkey for keyboard and mouse automation, hotkeys, signatures, etc. It has probably saved me hours of typing since 2005!
  • Far Manager (of course!)
  • Paint.NET, my personal preference over Gizmodo's recommendation of GIMP.
  • ShareX for taking annotated screenshots, with the ability to post to a picture sharing service of your choice
  • Calibre for keeping my e-books organized
And tools and packages for software development make up a whole other universe of their own.

2019-01-15

Generating pseudonymized 1-character IDs from integers

A client of mine asked me to generate pseudonymized statistics from survey data. Because of PII concerns, they wanted to completely eliminate names and static user IDs from the resulting document. Given that surveys usually have fewer than 25-30 participants, 1-character IDs seemed to be a good choice. To give you context, responses to a survey with a scale from 1 to 5 would look like this:


Here's a scalar-valued function that does such a translation:



Points of interest:

  1. The function takes @seed as a parameter, which should be the same for all IDs in the batch. You may, and probably should, get a new random seed every time you start processing a new batch of IDs, in my case generating stats for a new survey;
  2. The maximum number of distinct characters that can currently be used for encoding is 62 – upper- and lower-case Latin letters and digits. To extend this approach, one can either append more encoding characters to the set or modify the logic to encode user IDs with 2-character sequences. The latter would give 52^2 = 2,704 possible encodings using Latin letters alone, which is more than enough for all practical applications.
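
To make the two points above concrete, here is a minimal Python sketch of the idea. It is not the original function from this post, only an illustration under a few assumptions of mine: the names `ALPHABET`, `shuffled_alphabet` and `pseudonymize_batch` are hypothetical, the seed drives a plain shuffle of the 62-character set, and IDs are ranked within the batch so that arbitrary integers fit into the available characters.

```python
import random
import string

# 62 encoding characters: upper- and lower-case Latin letters plus digits.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits


def shuffled_alphabet(seed: int) -> str:
    """Return the alphabet in a pseudo-random order fixed by the seed."""
    chars = list(ALPHABET)
    random.Random(seed).shuffle(chars)
    return "".join(chars)


def pseudonymize_batch(user_ids, seed: int) -> dict:
    """Map each distinct integer ID in one batch to a unique character.

    Assumption: the whole batch is known up front, so IDs can be ranked.
    """
    distinct = sorted(set(user_ids))
    if len(distinct) > len(ALPHABET):
        raise ValueError("more distinct IDs than encoding characters")
    alphabet = shuffled_alphabet(seed)
    # The rank within the batch picks a position in the shuffled alphabet.
    return {uid: alphabet[rank] for rank, uid in enumerate(distinct)}


# Made-up IDs for illustration; the same batch is encoded with two seeds.
batch_ids = [1042, 17, 305, 17, 9981]
first = pseudonymize_batch(batch_ids, seed=2019)
second = pseudonymize_batch(batch_ids, seed=115)
```

Calling `pseudonymize_batch` again with the same seed reproduces the same mapping, while a new seed reshuffles the alphabet for the next survey. Note that Python's `random` module is not cryptographically secure, so a sketch like this only hides IDs from casual readers of the report.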

Here are two examples of the same stats encoded in two separate batches.
Batch 1:
Same data in batch 2:


As you can see, the encoding characters were randomly changed for the second batch.