I have scheduled some crontab jobs to scrape a number of websites.
The cron jobs run the scrapers around 1 AM: scraper_1 starts at 1:01, scraper_2 at 1:03 and scraper_3 at 1:05.
Each scraper may take 3 to 6 minutes to complete, so there is some overlap between the running scrapers.
# starts at 1:01
01 01 * * * cd /home/ubuntu/jobscrapers/scraper_1 && scrapy crawl spider_1 >> /tmp/scraper.log 2>&1
# starts at 1:03
03 01 * * * cd /home/ubuntu/jobscrapers/scraper_2 && scrapy crawl spider_2 >> /tmp/scraper.log 2>&1
# starts at 1:05
05 01 * * * cd /home/ubuntu/jobscrapers/scraper_3 && scrapy crawl spider_3 >> /tmp/scraper.log 2>&1
All of these scrapers are written with Scrapy, and they use Selenium with the Chrome WebDriver.
The code runs fine on my development machine (Windows), but recently I have been getting occasional errors on the production machine (Ubuntu).
For example, a scraper runs fine for some time and then crashes with the following error:
selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash from unknown error: cannot determine loading status from tab crashed (Session info: headless chrome=86.0.4240.111) (Driver info: chromedriver=2.41.578700 (2f1ed5f9343c13f73144538f15c00b370eda6706),platform=Linux 5.4.0-1029-aws x86_64)
Is this because two scrapers are running at the same time? Does crontab create a new thread for each scraper (WebDriver)?
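For what it's worth, if the overlap itself turned out to be the problem, I assume I could serialize the jobs by making each cron entry wait on a shared flock lock, something like the sketch below (the lock file path is arbitrary, and this is not what I currently run):

# each entry waits on the same lock file, so the scrapers run one after another
01 01 * * * flock /tmp/jobscrapers.lock -c "cd /home/ubuntu/jobscrapers/scraper_1 && scrapy crawl spider_1" >> /tmp/scraper.log 2>&1
03 01 * * * flock /tmp/jobscrapers.lock -c "cd /home/ubuntu/jobscrapers/scraper_2 && scrapy crawl spider_2" >> /tmp/scraper.log 2>&1
05 01 * * * flock /tmp/jobscrapers.lock -c "cd /home/ubuntu/jobscrapers/scraper_3 && scrapy crawl spider_3" >> /tmp/scraper.log 2>&1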
Updated question
The issue was that there was no space left on the server...
I only realized the problem by accident; the Scrapy log was not helpful. Were there other logs that I should have checked to point me to the actual issue?
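My only guess after the fact is the system logs; a rough sketch of what I could have checked (these are the standard Ubuntu locations, and the grep pattern is just the usual "No space left on device" error string, which I have not gone back to verify actually appeared there):

# kernel messages often record renderer/tab crashes and out-of-memory kills
dmesg | tail -n 100

# the general system log; a full disk usually shows up as "No space left on device"
grep -i "no space left" /var/log/syslog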
The issue was that there was no space left on my server. I used the
df -h
command to check the available space and noticed that the / partition was 100% full. As my server is an AWS EC2 instance, I had to extend the volume. The following two links explain how to extend an EC2 volume:
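In short, after increasing the EBS volume size in the AWS console, the partition and the filesystem on the instance also have to be grown. A rough sketch of those steps (the device and partition names are assumptions for a typical setup, so check lsblk on your own instance first):

# see which block device/partition backs / and what filesystem it uses
lsblk
df -hT /

# grow the partition to fill the resized volume (assuming the root device is /dev/xvda, partition 1)
sudo growpart /dev/xvda 1

# grow the filesystem itself
sudo resize2fs /dev/xvda1   # if / is ext4
# sudo xfs_growfs -d /      # if / is XFS

# confirm the new size
df -h /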