I need to work with data from a huge CSV file (~200 MB). My PC struggles with it, so I would like to split it into smaller files that are easier to handle. Suppose the file has a format like this:
NAME,SURNAME,SEX,CITY,AGE RANK
Tom,Brown,M,New York,20-40
Dick,Clarke,M,Seattle,0-20
Katie,Johnson,F,Boston,40-60
Harry,Smith,M,Washington,40-60
Amy,Davies,F,Chicago,20-40
Emily,Adams,F,New York,20-40
...
I would like to split it as follows:
- one set of files per age rank
- each file no longer than a given number of lines; a longer group is split again
For example:
- 0-20.1.csv (5000 lines)
- 0-20.2.csv (5000 lines)
- 0-20.3.csv (1234 remaining lines)
- 20-40.1.csv (5000 lines)
- 20-40.2.csv (4321 remaining lines)
- ...
I would also like to repeat the first line (header) of the input file at the beginning of each output file and remove some columns I don't need, but that's not essential. So my ideal output for age rank 20-40 would be (supposing I want to remove the NAME and AGE RANK columns):
SURNAME,SEX,CITY
Brown,M,New York
Davies,F,Chicago
Adams,F,New York
...
Is there a way to automatically manipulate the file like that? I can use any tool or script, but I would much prefer to avoid proprietary software.
It would be helpful if you included a mini-example with enough data for splitting at, say, 5 lines instead of 5000. However, you should be able to use something like this, in Awk:
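As a minimal sketch of that approach (not necessarily the exact script meant here), the snippet below writes the six sample rows from the question to a hypothetical people.csv and splits them by age rank. It assumes plain CSV with no quoted commas, that AGE RANK is column 5 and columns 2 to 4 are the ones to keep, and that nsplit counts data lines per file, not including the repeated header.

```shell
# Sample input taken from the question; people.csv is a hypothetical name.
cat > people.csv <<'EOF'
NAME,SURNAME,SEX,CITY,AGE RANK
Tom,Brown,M,New York,20-40
Dick,Clarke,M,Seattle,0-20
Katie,Johnson,F,Boston,40-60
Harry,Smith,M,Washington,40-60
Amy,Davies,F,Chicago,20-40
Emily,Adams,F,New York,20-40
EOF

awk -F, -v nsplit=2 '
# First line: remember the header, keeping only SURNAME,SEX,CITY.
NR == 1 { header = $2 FS $3 FS $4; next }
{
    rank = $5                              # AGE RANK column (assumed to be column 5)
    if (count[rank] % nsplit == 0) {       # current part is full, or this is the first row of the rank
        if (file[rank] != "") close(file[rank])
        part[rank]++
        file[rank] = rank "." part[rank] ".csv"
        out = file[rank]
        print header > out                 # repeat the header in every part
    }
    out = file[rank]
    print $2 FS $3 FS $4 > out             # drop NAME and AGE RANK
    count[rank]++
}
' people.csv
```

Calling close() on each finished part keeps the number of simultaneously open files down to one per age rank, which matters once nsplit is raised to 5000 on the real input. Fields containing quoted commas would need a real CSV parser instead of -F,.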
Testing with nsplit=2 gives: