I found some old code on a legacy server that writes a dataset of roughly 20 columns and >60K rows to an Excel file with openpyxl, item by item. It does this by filling each cell on its own (the slow way, but not that slow either, since the whole thing is done after a few minutes, including formatting of the cells):
cell = ws.cell('%s%s' % (col_idx, i + 3))
cell.value = unicode(value).encode("utf-8")
get_style(cell, "content", column_colors[col_id])
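For context, the pattern boils down to something like this self-contained sketch (synthetic data, placeholder styling and file name, not the original script):

# Regular (non-write-only) workbook, one styled cell at a time.
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

wb = Workbook()          # regular mode keeps every Cell object in memory
ws = wb.active

font = Font(name="Calibri", size=10)
fill = PatternFill(fill_type="solid", start_color="DDEEFF")

for i in range(60000):                    # the ~60K rows
    for col in range(1, 21):              # the ~20 columns
        cell = ws.cell(row=i + 3, column=col, value="value %d/%d" % (i, col))
        cell.font = font                  # per-cell formatting, like get_style()
        cell.fill = fill

wb.save("output.xlsx")   # memory has already peaked during the loop above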
At roughly 59K rows it crashes and the console prints:
Killed
The logs show:
Received SIGTERM, shutting down.
The SIGTERM hints at too little memory, so that the server kills the task "from outside". It is a virtual machine. Checking the memory with the command free
shows that all of the free 3.5 GB get used up by writing that small dataset (20 MB on the server) to Excel.
I checked it while it was running and found: about 500 MB just to set up the file, plus about 700 MB for every 10K rows. Writing the whole dataset therefore needs roughly:
6 x 700 + 500 = 4700 MB of free RAM, and the VM has only 3.5 GB, which should anyway be more than enough for such a small output: the resulting workbook is roughly 20 MB. The output Excel format is xlsx (Excel 2007/2010), since this is outdated legacy code running on Python 2.7.3.
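The back-of-the-envelope estimate from those observed figures, assuming the usage grows linearly with the rows:

# 500 MB base cost plus ~700 MB per 10,000 rows written (observed values)
rows = 60000
estimated_mb = 500 + (rows // 10000) * 700
print(estimated_mb)   # 4700 MB, more than the VM's 3.5 GB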
Why does writing a 20 MB workbook with Python's openpyxl module eat gigabytes of RAM?
It is known that openpyxl requires a lot of RAM for reading and writing Excel files:
According to the openpyxl documentation on performance, the expected RAM usage can be estimated from the size of the file.
In your case the value seems to be even higher; maybe your usage is different, or the figures from the documentation are outdated.
But the documentation also contains hints on how to optimize and thus reduce the RAM usage:
- Install the Python package lxml: if it is installed, openpyxl will use it automatically. It is especially recommended when writing "large" Excel files.
- Use write-only mode (see the sketch below).
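A minimal sketch of the write-only approach, assuming per-cell styling is still needed; it streams rows to disk instead of keeping the whole worksheet in memory. The file name, column count and the Font/PatternFill styling are placeholders, not the original get_style() logic:

from openpyxl import Workbook
from openpyxl.cell import WriteOnlyCell
from openpyxl.styles import Font, PatternFill

wb = Workbook(write_only=True)   # write-only workbook: rows are streamed to disk
ws = wb.create_sheet()           # write-only workbooks start without a sheet

font = Font(name="Calibri", size=10)
fill = PatternFill(fill_type="solid", start_color="DDEEFF")

for i in range(60000):           # the ~60K data rows
    row = []
    for col in range(20):        # the ~20 columns
        cell = WriteOnlyCell(ws, value="value %d/%d" % (i, col))
        cell.font = font         # style the cell before it is appended
        cell.fill = fill
        row.append(cell)
    ws.append(row)               # rows can only be appended, never revisited

wb.save("big_file.xlsx")

With lxml installed, openpyxl picks it up automatically, which helps most in exactly this streaming scenario. The trade-off of write-only mode is that rows must be written strictly in order and cells cannot be touched again after append(), so the per-cell formatting has to happen while each row is being built.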