The EU's General Data Protection Regulation (GDPR) and its German implementation (DSGVO) are very strict when it comes to personal data such as IP addresses. However, this question is not about the GDPR itself, but about how to comply with it in the nginx HTTP access log while keeping the ability to "identify" an anonymous user within a user journey (that is, to tell one user journey apart from the others).
My current implementation is that I do not record the remote IP and port at all. I purged the environment variables for upstreams/proxies/etc., so the access logs simply do not contain any remote IP or port information.
Now I am facing the issue that I need to follow the path of a user journey. I simply do not have any way of "identifying" which requests belong to which user journey. I want to point out that I do not use cookies or anything similar either.
The legacy approach to "identifying" an "anonymous user" is to look at the remote IP and the date. Within the same day, the same remote IP would most likely be the same user. However, as mentioned above, I do not log remote IP and port information, and I still do not want to.
My current thought is to hash the remote IP address together with the remote port and the date of the request. The logs would contain the date but not the remote port, so the remote IP, which is personal data, could not be recovered without heavy brute forcing. This approach would give back some level of user journey identification, which would help me quite a bit.
A general workflow to accomplish this approach would be:
- The request is accepted by nginx,
- nginx performs a hash operation with remote IP, remote port and current date (e.g. md5_hex("$remote_addr $remote_port $current_date")) and stores the hash in a new variable (e.g. $remote_ip_anonymous),
- the log_format would include the $remote_ip_anonymous variable.
The hash would change even when the remote IP and remote port stay the same, because the current date acts as a salt. It would also change whenever the remote port changes. So this should be fine with the GDPR, or at least fall into the lowest data security category, whereas the actual remote IP would fall into a major data security category under the GDPR.
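To make the properties above concrete, here is a minimal Python sketch of the proposed hash, outside of nginx. It is only an illustration of the idea, not an nginx implementation: the function name anonymous_token, the example address 203.0.113.7 and the example ports and dates are assumptions made up for this sketch, and the input layout simply mirrors the md5_hex("$remote_addr $remote_port $current_date") example from the workflow.

```python
import hashlib
from datetime import date


def anonymous_token(remote_addr: str, remote_port: int, day: date) -> str:
    """Hash IP, port and date into a hex token (hypothetical helper).

    Mirrors md5_hex("$remote_addr $remote_port $current_date") from the
    workflow above; the space-separated layout is an assumption.
    """
    material = f"{remote_addr} {remote_port} {day.isoformat()}"
    return hashlib.md5(material.encode("utf-8")).hexdigest()


# Same IP, port and date: the token is stable, so requests can be grouped.
same_1 = anonymous_token("203.0.113.7", 51234, date(2018, 5, 25))
same_2 = anonymous_token("203.0.113.7", 51234, date(2018, 5, 25))

# Different date or different port: the token changes.
next_day = anonymous_token("203.0.113.7", 51234, date(2018, 5, 26))
other_port = anonymous_token("203.0.113.7", 51235, date(2018, 5, 25))

print(same_1 == same_2)      # stable for identical input
print(same_1 != next_day)    # date acts as a salt
print(same_1 != other_port)  # port change yields a new token
```

Note that the token only stays stable as long as the remote port does, which is exactly the behaviour described above.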
Enough with the theory... how would I implement such a remote IP anonymization? Do I have to use the nginx Perl module or Lua module, or is there another (faster) way of computing that hash and storing it in an nginx variable?
The EU's General Data Protection Regulation (GDPR) is about "protection of natural persons with regard to the processing of personal data and rules relating to the free movement of personal data". It's not about how to sabotage IT systems. The best approach is to calm down and see what is OK and what is not OK regarding personal data protection.
It's technically essential that a web server processes an IP address of a browser/client. Without this ability a web server would be unable to send a response back to the browser/client.
Avoiding the processing of personal data is not an option. (Actually, there are of course options. For example, the Tor browser or an anonymization proxy would be an option. But this must be done by the client.)
Regarding your web server and a GDPR-compliant set-up you should:
However, there is a proper way to anonymize IP logging in Nginx. I would not recommend it, but it works.
How to delete log files after 7 days:
With the log-rotation service installed, you have to change the Nginx config file as follows ...