I'm looking to use the iptables hashlimit match to rate-limit abusive web crawlers, much like this question does for SSH brute-force scans.
Every once in a while they hit an inefficient code path on our site. This brings us to our knees because they parallelize so heavily, and come in so fast (e.g., 3-5 incoming connections per second). End users don't hit this too often, and when they do, it's not 10x or 20x in parallel.
I know I'll have some tuning to do: making sure the burst size is adequate for real users on browsers, and making sure the per-IP checking doesn't hurt a handful of users behind NAT. All of that seems doable, though. Tuning it on our live site shouldn't be too big a deal; I'll just log instead of dropping for the first couple of weeks.
That said, I'm a little concerned about the memory usage of hashlimit. Mostly, I want to ensure that the site doesn't go down because this iptables rule doesn't have enough memory.
The Fine Manual for iptables-extensions says:
--hashlimit-htable-size buckets
The number of buckets of the hash table
--hashlimit-htable-max entries
Maximum entries in the hash.
But it's not entirely clear what the buckets are and what the entries are.
Also, what happens when the hash table fills up (maximum entries or buckets)? Hopefully the rule fails and iptables moves on to the next rule, but it doesn't really say.
Here's the rule I'm considering. It works as designed in limited testing, but load-testing with thousands of remote IPs is a little tricky.
iptables -A INPUT -p tcp --dport 80 -m conntrack --ctstate NEW \
-m hashlimit --hashlimit-name=WWW --hashlimit-above 1/sec --hashlimit-burst 50 \
--hashlimit-mode srcip -j LOGACCEPT
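(For reference, LOGACCEPT is just a user-defined chain I use during the log-only phase; roughly something like this, with the log prefix and log rate as placeholders:)

iptables -N LOGACCEPT
# log a sample of the over-limit connections so I can see who's being caught
iptables -A LOGACCEPT -m limit --limit 10/min -j LOG --log-prefix "hashlimit WWW: "
# ...and still accept them while tuning; this becomes a DROP later
iptables -A LOGACCEPT -j ACCEPT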
I suppose you know how hashing generally works: a function is computed over the data (an IP, a pair of IPs, etc.), and the value of that function is used as an index into a table to locate the structure associated with that data. Each cell in the table (corresponding to one possible value of the hash function) is usually called a hash bucket.
Unfortunately, different pieces of data may produce the same hash value and therefore end up in the same hash bucket. That's why a hash bucket may contain several hash entries, which are usually stored as a linked list. So when a lookup is done, the hash function is computed first and a bucket is selected; if that bucket holds several hash entries, they are examined one by one to find the appropriate entry.
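Incidentally, you can look at those entries directly: each named hashlimit table gets a file under /proc/net/ipt_hashlimit/, with one line per active entry (keyed by source IP in srcip mode), so you can watch the table grow and expire on a live machine. For your rule named WWW:

# one line per active entry in the WWW table
cat /proc/net/ipt_hashlimit/WWW
# or just count how many entries are currently held
wc -l < /proc/net/ipt_hashlimit/WWW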
To map that back to the options: hashlimit-htable-size limits the number of hash buckets (the size of the hash table itself), and hashlimit-htable-max limits the total number of hash entries stored across all buckets.
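So if memory is the concern, you can pin both down explicitly rather than relying on the defaults. Here is a sketch based on your rule, with the sizes purely illustrative; --hashlimit-htable-expire (roughly, how many milliseconds an entry for an idle source stays in the table) is also worth setting, since expired entries are what get reclaimed:

# explicit table sizing: 4096 buckets, at most 32768 entries, idle entries expire after 60s
iptables -A INPUT -p tcp --dport 80 -m conntrack --ctstate NEW \
-m hashlimit --hashlimit-name=WWW --hashlimit-above 1/sec --hashlimit-burst 50 \
--hashlimit-mode srcip \
--hashlimit-htable-size 4096 --hashlimit-htable-max 32768 \
--hashlimit-htable-expire 60000 \
-j LOGACCEPT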