I have scheduled some crontab jobs to scrape a number of websites.
The cron jobs run the scrapers around 1 AM: scraper_1 starts at 1:01, scraper_2 at 1:03 and scraper_3 at 1:05.
Each scraper may take 3 to 6 minutes to complete, so there is some overlap between the running scrapers.
# starts at 1:01
01 01 * * * cd /home/ubuntu/jobscrapers/scraper_1 && scrapy crawl spider_1 >> /tmp/scraper.log 2>&1
# starts at 1:03
03 01 * * * cd /home/ubuntu/jobscrapers/scraper_2 && scrapy crawl spider_2 >> /tmp/scraper.log 2>&1
# starts at 1:05
05 01 * * * cd /home/ubuntu/jobscrapers/scraper_3 && scrapy crawl spider_3 >> /tmp/scraper.log 2>&1
All of these scrapers are written with Scrapy, and they use Selenium with the Chrome WebDriver.
The code runs fine on my development machine (Windows), but recently I have been getting occasional errors on the production machine (Ubuntu).
For example, a scraper runs fine for some time and then crashes with the following error:
selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash from unknown error: cannot determine loading status from tab crashed (Session info: headless chrome=86.0.4240.111) (Driver info: chromedriver=2.41.578700 (2f1ed5f9343c13f73144538f15c00b370eda6706),platform=Linux 5.4.0-1029-aws x86_64)
Is this because two scrapers are running at the same time? Does crontab create a new thread for each scraper (WebDriver)?
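For what it's worth, if the overlap itself turned out to be the problem, I assume I could serialize the jobs by making each cron entry wait on a shared flock lock, something like the sketch below (the lock file path is arbitrary, and this is not what I currently run):

# each entry waits on the same lock file, so the scrapers run one after another
01 01 * * * flock /tmp/jobscrapers.lock -c "cd /home/ubuntu/jobscrapers/scraper_1 && scrapy crawl spider_1" >> /tmp/scraper.log 2>&1
03 01 * * * flock /tmp/jobscrapers.lock -c "cd /home/ubuntu/jobscrapers/scraper_2 && scrapy crawl spider_2" >> /tmp/scraper.log 2>&1
05 01 * * * flock /tmp/jobscrapers.lock -c "cd /home/ubuntu/jobscrapers/scraper_3 && scrapy crawl spider_3" >> /tmp/scraper.log 2>&1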
Updated question
The issue was that there was no space left on the server...
I only realized the problem by accident; the Scrapy log was not helpful. Were there other logs that I should have checked to point me to the actual issue?
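My only guess after the fact is the system logs; a rough sketch of what I could have checked (these are the standard Ubuntu locations, and the grep pattern is just the usual "No space left on device" error string, which I have not gone back to verify actually appeared there):

# kernel messages often record renderer/tab crashes and out-of-memory kills
dmesg | tail -n 100

# the general system log; a full disk usually shows up as "No space left on device"
grep -i "no space left" /var/log/syslog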
The issue was that there was no space left on my server. I used the
df -h
command to check the available space and noticed that the / partition was 100% full. As my server is an AWS EC2 instance, I had to extend the volume. The following two links explain how to extend an EC2 volume:
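In short, after increasing the EBS volume size in the AWS console, the partition and the filesystem on the instance also have to be grown. A rough sketch of those steps (the device and partition names are assumptions for a typical setup, so check lsblk on your own instance first):

# see which block device/partition backs / and what filesystem it uses
lsblk
df -hT /

# grow the partition to fill the resized volume (assuming the root device is /dev/xvda, partition 1)
sudo growpart /dev/xvda 1

# grow the filesystem itself
sudo resize2fs /dev/xvda1   # if / is ext4
# sudo xfs_growfs -d /      # if / is XFS

# confirm the new size
df -h /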