Question

Ntakwetet

Asked: 2020-07-12 00:36:37 +0800 CST

Split CSV file based on column values and number of lines

I need to work with data from a large CSV file (~200 MB). My PC struggles with it, so I would like to split it into smaller files that are easier to handle. Suppose the file has a format like this:

NAME,SURNAME,SEX,CITY,AGE RANK
Tom,Brown,M,New York,20-40
Dick,Clarke,M,Seattle,0-20
Katie,Johnson,F,Boston,40-60
Harry,Smith,M,Washington,40-60
Amy,Davies,F,Chicago,20-40
Emily,Adams,F,New York,20-40
...

I would like to split it as follows:

one set of files per age rank
each file no longer than a given number of lines; anything longer is split again

For example:

0-20.1.csv (5000 lines)
0-20.2.csv (5000 lines)
0-20.3.csv (1234 remaining lines)
20-40.1.csv (5000 lines)
20-40.2.csv (4321 remaining lines)
...

I would also like to repeat the first line (the header) of the input file at the top of each output file and remove some columns I don't need, though that is not essential. So my ideal output for age rank 20-40 would be (supposing I want to remove the NAME and AGE RANK columns):

SURNAME,SEX,CITY
Brown,M,New York
Davies,F,Chicago
Adams,F,New York
...

Is there a way to automatically manipulate the file like that? I can use any tool or script, but I would much prefer to avoid proprietary software.

1 Answer

steeldriver · Answer 1 · 2020-07-12T04:16:41+08:00

It would be helpful if you included a mini-example with enough data for splitting at, say, 5 lines instead of 5000. However, you should be able to use something like this, in Awk:

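awk -F, -v nsplit=5000 '
  # Header line: remember it so it can be repeated in every output file
  NR==1 {OFS=FS; hdr=$0; next}
  # Rows 1, nsplit+1, 2*nsplit+1, ... of each age rank (column 5)
  # start a new numbered file (this test needs nsplit >= 2)
  (++count[$5] % nsplit == 1) {
    close(fname[$5])
    fname[$5] = $5 "." (++ind[$5]) ".csv"
    print hdr > fname[$5]
  }
  # Write the current row to the current file for its age rank
  {print > fname[$5]}
' file.csv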

Testing with nsplit=2:

$ awk -F, -v nsplit=2 '
  NR==1 {OFS=FS; hdr=$0; next} 
  (++count[$5] % nsplit == 1) {
    close(fname[$5]); 
    fname[$5] = $5 "." (++ind[$5]) ".csv";
    print hdr > fname[$5]
  } 
  {print > fname[$5]}
' file.csv

gives

$ head 0-20.?.csv 20-40.?.csv 40-60.?.csv 
==> 0-20.1.csv <==
NAME,SURNAME,SEX,CITY,AGE RANK
Dick,Clarke,M,Seattle,0-20

==> 20-40.1.csv <==
NAME,SURNAME,SEX,CITY,AGE RANK
Tom,Brown,M,New York,20-40
Amy,Davies,F,Chicago,20-40

==> 20-40.2.csv <==
NAME,SURNAME,SEX,CITY,AGE RANK
Emily,Adams,F,New York,20-40

==> 40-60.1.csv <==
NAME,SURNAME,SEX,CITY,AGE RANK
Katie,Johnson,F,Boston,40-60
Harry,Smith,M,Washington,40-60
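
The question also asks to drop the NAME and AGE RANK columns, which the command above does not do. One way this might be handled is to build the trimmed row once per line and print that instead of the whole record. The following is only a sketch under some assumptions: `sample.csv` and the `out` variable are names made up for this illustration, the wanted columns are assumed to be fields 2 to 4, and no field is assumed to contain a quoted comma (plain `-F,` splitting would break on those).

```shell
# Small sample in the format from the question (illustrative test data)
cat > sample.csv <<'EOF'
NAME,SURNAME,SEX,CITY,AGE RANK
Tom,Brown,M,New York,20-40
Dick,Clarke,M,Seattle,0-20
Katie,Johnson,F,Boston,40-60
Harry,Smith,M,Washington,40-60
Amy,Davies,F,Chicago,20-40
Emily,Adams,F,New York,20-40
EOF

awk -F, -v OFS=, -v nsplit=2 '
  # Keep only SURNAME, SEX and CITY (fields 2-4); runs for every line
  { out = $2 OFS $3 OFS $4 }
  # Trimmed header, repeated at the top of every output file
  NR==1 { hdr = out; next }
  # Same splitting rule as above: start a new file per age rank as needed
  (++count[$5] % nsplit == 1) {
    close(fname[$5])
    fname[$5] = $5 "." (++ind[$5]) ".csv"
    print hdr > fname[$5]
  }
  # Write the trimmed row instead of the whole record
  { print out > fname[$5] }
' sample.csv
```

For the real data, `sample.csv` would be replaced by the actual file and `nsplit=2` by `nsplit=5000`. If the columns to keep differ, only the line that builds `out` should need to change.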
