I work for a local government that bills for water usage and garbage collection. I received a call today from a customer who said a "New York company" had called him asking for his customer number and PIN so they could access his online account and "scrape" his water usage from the site. They are apparently collecting that information from several apartments for some reason he couldn't recall. Of course this raised red flags with me, and I told him not to give them this information. I also told him that if they call back, he should tell the company to contact us directly if they need this information.
Now, if one of our customers received a call, I feel it's safe to assume that others have gotten the same or a similar call, and they may or may not have given out this information. How can I check our logs to see if there is a bot hitting our site and screen-scraping our data? I also feel we should block that bot and prevent further attempts.
Note: The only information stored on the web server is the name and address of the customer, the water usage and cost for each bill, and the total amount due. They can also pay the bill. We don't store any account information online. So overall, the information on the web server could be considered public information (through the proper channels).
There's no way to detect or block a well-written bot that's only scraping a small number of pages -- its behaviour can be indistinguishable from that of a genuine user.
You could block or rate-limit any single source IP that is accessing more than one account (see the sketch after this list). As mentioned above, this requires knowing that more than one account is being accessed, which might not be trivial to implement. It could also block tenants in an apartment complex that have NAT-ted internet as a "utility", of course.
You could implement a CAPTCHA.
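Here is a minimal sketch of that "one IP, many accounts" check, assuming you can call it from your application with the client IP and the account being viewed; the window and threshold are placeholders you would tune:

    import time
    from collections import defaultdict

    WINDOW_SECONDS = 3600   # look at the last hour (assumption)
    MAX_ACCOUNTS = 3        # distinct accounts one IP may view before we act (assumption)

    _events = defaultdict(list)   # ip -> [(timestamp, account_id), ...]

    def too_many_accounts(ip, account_id):
        """True if this IP has viewed more than MAX_ACCOUNTS distinct accounts recently."""
        now = time.time()
        events = [(t, a) for (t, a) in _events[ip] if now - t < WINDOW_SECONDS]
        events.append((now, account_id))
        _events[ip] = events
        return len({a for (_, a) in events}) > MAX_ACCOUNTS

On a hit you could return an error, slow the response, or fall back to the CAPTCHA idea. Keep the threshold generous so a NAT-ted apartment block doesn't trip it.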
If there is a bot screen-scraping your web site, your only indication of it will be in your web server logs, and it's going to be difficult (at best) to detect. Usually the way you spot bots in logs is through the User-Agent string, but that is really only set properly by well-behaved bots like search crawlers. All of the nastier ones set the user-agent string to match something common, like one of the major browsers, in order to hide themselves.
Most likely you're looking at tracing IP addresses that access specific URLs you can directly tie back to a specific customer. This is further complicated if the requests are POST requests, since the customer-specific information is then in the POST body rather than in the URL, as it would be with a GET request.
Honestly, best of luck with that unfortunately... Not sure you're going to be able to get very far.
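For the GET case, though, a rough first pass over the access log is doable. A minimal sketch, assuming an Apache/nginx combined-format log and account pages at URLs like /account/12345 (both are assumptions about your setup):

    import re
    from collections import defaultdict

    # combined log format: ip - - [time] "METHOD /url HTTP/x" status size "referrer" "user-agent"
    LINE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<url>\S+)[^"]*" '
                      r'\d+ \S+ "[^"]*" "(?P<ua>[^"]*)"')
    ACCOUNT = re.compile(r'^/account/(\d+)')   # hypothetical account-page URL pattern

    accounts_by_ip = defaultdict(set)
    agents_by_ip = defaultdict(set)

    with open('access.log') as f:
        for line in f:
            m = LINE.match(line)
            if not m:
                continue
            acct = ACCOUNT.match(m.group('url'))
            if acct:
                accounts_by_ip[m.group('ip')].add(acct.group(1))
                agents_by_ip[m.group('ip')].add(m.group('ua'))

    # rank IPs by how many distinct accounts they viewed
    for ip, accts in sorted(accounts_by_ip.items(), key=lambda kv: -len(kv[1])):
        print(ip, len(accts), 'accounts,', len(agents_by_ip[ip]), 'user agents')

If the lookups are POSTs, the web log won't show the account at all, and you'd have to log it at the application layer instead.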
If they are screen scraping, they likely fetch only the target page without any related content like CSS, JS, and images. You would need to search your access logs to see if this kind of activity is occurring. You may catch some legitimate browsers that don't re-request data that is already cached.
You might also be able to detect excessive visits to the page in question from a single IP address. This may catch a few ISPs that are NATing their customers' access.
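A minimal sketch covering both of those checks -- IPs that request pages but never any static assets, ranked by hit count -- assuming a standard access log where the request line is the first quoted field:

    from collections import defaultdict

    ASSET_EXTS = ('.css', '.js', '.png', '.gif', '.jpg', '.ico')
    page_hits = defaultdict(int)   # page requests per IP
    asset_ips = set()              # IPs that fetched at least one static asset

    with open('access.log') as f:
        for line in f:
            parts = line.split('"')
            if len(parts) < 2 or not parts[0].split():
                continue
            ip = parts[0].split()[0]
            request = parts[1].split()
            url = request[1].lower() if len(request) > 1 else ''
            if url.endswith(ASSET_EXTS):
                asset_ips.add(ip)
            else:
                page_hits[ip] += 1

    for ip, hits in sorted(page_hits.items(), key=lambda kv: -kv[1]):
        if ip not in asset_ips:
            print(ip, hits, 'page hits, no CSS/JS/image requests')

Expect some noise from caching browsers and NAT, as noted above; treat the output as leads, not proof.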
Running a GeoIP check should quickly tell you if you have accesses from other countries. Some of this may be legitimate customers who are living abroad or traveling.
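A sketch of that lookup, assuming MaxMind's free GeoLite2 country database and the geoip2 Python package (both assumptions):

    import geoip2.database            # pip install geoip2
    from geoip2.errors import AddressNotFoundError

    reader = geoip2.database.Reader('GeoLite2-Country.mmdb')   # downloaded from MaxMind

    def country_of(ip):
        try:
            return reader.country(ip).country.iso_code
        except AddressNotFoundError:
            return None

    # run over the distinct IPs pulled from your access log (placeholders here)
    for ip in ['203.0.113.7', '198.51.100.23']:
        print(ip, country_of(ip))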
If your site has a News feature, it might be worthwhile putting up a posting about this situation. This may get you some more reports.
Log the IP of every access to an account. After a while, go back and query your logs assigning a point to each IP for each account it accesses, then sort to find the IP addresses that have accessed the most accounts.
After you rule out some libraries and such, I bet you'll find your culprit even if they do access the accounts very slowly or rarely. That starts to stick out after a month if they're doing it from the same place -- which they more than likely are.
Some other permutations: first, limit to accounts that have been accessed from more than one IP -- the real user and the scraper. Or start from the account of any one user you know has given out the information.
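A sketch of that scoring pass, assuming you log account accesses to a CSV of timestamp,ip,account_id rows (the file name and layout are hypothetical):

    import csv
    from collections import defaultdict

    accounts_by_ip = defaultdict(set)   # ip -> accounts it accessed
    ips_by_account = defaultdict(set)   # account -> IPs that accessed it

    with open('account_access.csv') as f:
        for ts, ip, account in csv.reader(f):
            accounts_by_ip[ip].add(account)
            ips_by_account[account].add(ip)

    # the "more than one IP" permutation: only score accounts that some
    # other IP (presumably the real user) also accessed
    def score(ip):
        return sum(1 for a in accounts_by_ip[ip] if len(ips_by_account[a]) > 1)

    for ip in sorted(accounts_by_ip, key=score, reverse=True)[:20]:
        print(ip, score(ip), 'shared accounts,', len(accounts_by_ip[ip]), 'total')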
Couldn't you flag, slow down (rate-limit), and/or block a bot if you had one IP accessing multiple accounts? Sure, it is possible for multiple real users to share an IP address (say, at an office), but what is the likelihood of a hundred people all using the service from the same IP? The bot will also probably send the same user agent every time (depending on how it's programmed).
Depending on how you get to the information in question, you can hide data on screen or in web forms. Most bots don't handle JavaScript, so you can use it to modify data on submit.
This stinks, but the way I've beaten simple screen scrapers in the past is to use Silverlight or Flash to put the text on the screen as images rather than text. At that point they at least have to OCR the images rather than simply capture the HTML output and parse it.
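Silverlight and Flash are dead ends today, but the same trick works by rendering the sensitive figures to an image on the server. A minimal sketch using Pillow (my substitution, not the original answer's method; the function name and sizes are placeholders):

    from io import BytesIO
    from PIL import Image, ImageDraw   # pip install Pillow

    def usage_as_png(text):
        """Render a usage figure as PNG bytes so a scraper has to OCR it."""
        img = Image.new('RGB', (200, 40), 'white')
        ImageDraw.Draw(img).text((10, 12), text, fill='black')
        buf = BytesIO()
        img.save(buf, format='PNG')
        return buf.getvalue()   # serve with Content-Type: image/png

    png = usage_as_png('1,234 gal -- $42.10')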
You can add some JavaScript to the login page. When the page loads, set the form action to a decoy URL like http://mydomain.ltd/login/bot; when the user hits the submit (login) button, switch the action to the valid URL:

    // the form initially posts to the decoy:
    // <form id="login_form_id" action="http://mydomain.ltd/login/bot" method="post">
    document.getElementById('login_form_id').addEventListener('submit', function () {
        this.action = 'http://mydomain.ltd/login/human';   // real browsers run this
    });

A bot crawling your data typically doesn't support JS, so it keeps posting to the decoy. Of course they will eventually figure out why their bot isn't working, but in the meantime you can log the failed IP addresses and compare them later with the addresses of successful logins.
If a human is collecting the data, then it is much harder to figure out.
You can also analyze the User-Agent string and IP address. If a user logs out and then logs into another account shortly afterwards (within 5 minutes?), log that too. Even a human collector will work that way: look at one account, then another, and so on.
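A sketch of that check, assuming login events are logged as timestamp,ip,user_agent,account_id rows (hypothetical layout), flagging switches to a different account within five minutes:

    import csv
    from collections import defaultdict

    SWITCH_SECONDS = 300   # the "5 minutes?" window above

    logins = defaultdict(list)   # (ip, user-agent) -> [(timestamp, account), ...]
    with open('logins.csv') as f:
        for ts, ip, ua, account in csv.reader(f):
            logins[(ip, ua)].append((float(ts), account))

    for (ip, ua), events in logins.items():
        events.sort()
        switches = sum(1 for (t1, a1), (t2, a2) in zip(events, events[1:])
                       if a1 != a2 and t2 - t1 < SWITCH_SECONDS)
        if switches:
            print(ip, ua, switches, 'rapid account switches')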