I have two big text files, checksums_1.txt and checksums_2.txt. I want to parse these files, remove the duplicates between them, and merge the unique lines into one file.
Each line in both files has the following structure:
size, md5, path
Example: Checksums_1.txt
9565, a4fs614as6fas4fa4s46fsaf1, /mnt/app/1tier/2tier/filename.exe
9565, a4fs614as6fas4fa4s46fsaf1, /mnt/app/1tier/2tier/filename2.exe
Example: Checksums_2.txt
9565, a4fs614as6fas4fa4s46fsaf1, /mnt/temp/1tier/2tier/filename.exe
9565, a4fs614as6fas4fa4s46fsaf1, /mnt/temp/1tier/2tier/filename2.exe
9565, a4fs614as6fas4fa4s46fsaf1, /mnt/temp/1tier/2tier/newfile.exe
The part of each line that has to be compared between checksums_1.txt and checksums_2.txt is what comes after the mount point (/mnt/app/ or /mnt/temp/). In other words, everything from the start of each line to the end of the mount point is ignored.
The data inside checksums_1.txt is more important, so if a duplicate is found, the line from checksums_1.txt must go into the merged file.
Part of Checksums_1.txt
1058,b8203a236b4f15316e516165a6546666,/mnt/app/Certificados/ca.crt
2694,8a815adefde4fa0c263e74832b15de64,/mnt/app/Certificados/ca.db.certs/01.pem
136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/app/Certificados/ca.db.index
Part of Checksums_2.txt
1058,b8203a236b4f1531616318284202c9e6,/mnt/temp/Certificados/ca.crt
3,72b2ac90f7f3ff075a937d6be8fc3dc3,/mnt/temp/Certificados/ca.db.serial
2694,8a815adefde4fa0c263e74832b15de64,/mnt/temp/Certificados/ca.db.certs/01.pem
136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/temp/Certificados/ca.db.index
Example of the merged file
1058,b8203a236b4f15316e516165a6546666,/mnt/app/Certificados/ca.crt
3,72b2ac90f7f3ff075a937d6be8fc3dc3,/mnt/temp/Certificados/ca.db.serial
2694,8a815adefde4fa0c263e74832b15de64,/mnt/app/Certificados/ca.db.certs/01.pem
136,77bf2e5313dbaac4df76a4b72df2e2ad,/mnt/app/Certificados/ca.db.index
If you are willing to use Python (that is, if performance is not an issue), what you want can be achieved with the following script:
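A minimal sketch of such a script, assuming comma-separated lines with the path as the third field and the mount points /mnt/app/ and /mnt/temp/ from the question. The names rel_path, merge and merge_files and the three-argument command line are illustrative assumptions, not part of the original answer:

```python
#!/usr/bin/env python3
"""Merge two checksum lists; lines from the first file win on duplicates."""
import sys

# Assumed mount points, taken from the question's examples.
MOUNT_1 = "/mnt/app/"
MOUNT_2 = "/mnt/temp/"


def rel_path(line, mount):
    """Return the path after the mount point, or None for malformed lines."""
    fields = line.split(",", 2)  # size, md5, path (the path may contain commas)
    if len(fields) < 3:
        return None
    path = fields[2].strip()
    return path[len(mount):] if path.startswith(mount) else path


def merge(lines_1, lines_2, mount_1=MOUNT_1, mount_2=MOUNT_2):
    """Return all lines of lines_1 plus the lines of lines_2 whose key is new."""
    merged = {}  # key: path after the mount point, value: full line
    for lines, mount in ((lines_1, mount_1), (lines_2, mount_2)):
        for line in lines:
            line = line.rstrip("\r\n")
            key = rel_path(line, mount)
            if key is not None:
                merged.setdefault(key, line)  # the first occurrence wins
    return list(merged.values())


def merge_files(path_1, path_2, out_path):
    """Read both input files and write the merged lines to out_path."""
    with open(path_1) as f1, open(path_2) as f2:
        result = merge(f1, f2)
    with open(out_path, "w") as out:
        out.writelines(line + "\n" for line in result)


if __name__ == "__main__":
    if len(sys.argv) == 4:
        merge_files(*sys.argv[1:])
    else:
        print("usage: merge-checksums.py FILE_1 FILE_2 MERGED", file=sys.stderr)
```

Because the dictionary is filled from checksums_1.txt first, a line from checksums_2.txt is only kept when its path after the mount point has not been seen yet. The output lists the lines of the first file in their original order, followed by the lines unique to the second file.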
Simply save the script as merge-checksums.py, give it execute permission and run it as:
The bash version (with awk and grep):
Checksums_1.txt
Checksums_2.txt
Run with
Checksums_3.txt
Or with interchanged input files
Checksums_3.txt
Assuming neither file is huge, the Python script below will do the job as well.
How it works
Both files are read by the script. The lines in file_1 (the file that has precedence) are split at the directory you entered for the file in the head section (in your example, /mnt/app/). Subsequently, the lines in file_1 are written to the output file (the merged file). At the same time, lines from file_2 are removed from the line list if the identifying string (the section after the mount point) occurs in the line. Finally, the "remaining" lines of file_2 (of which no dupe exists in file_1) are written to the output file as well. The result:
file_1:
file_2:
merged:
The script
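A minimal sketch of the steps described under "How it works", not the original listing. The head-section values are hypothetical placeholders, and the helper names read_lines, merge_lines and main are assumptions made for illustration:

```python
# Head section (hypothetical values; adjust them to your setup).
f1 = "checksums_1.txt"    # file_1, the file that has precedence
f2 = "checksums_2.txt"    # file_2
merged = "merged.txt"     # path to the merged file
mountpoint = "/mnt/app/"  # mount point as it appears in file_1


def read_lines(path):
    """Read a file into a list of stripped, non-empty lines."""
    with open(path) as src:
        return [line.strip() for line in src.read().splitlines() if line.strip()]


def merge_lines(lines_1, lines_2, mountpoint):
    """Return lines_1 followed by the lines of lines_2 without a dupe in lines_1."""
    # Identifying string: the section after the mount point in file_1.
    ids = [line.split(mountpoint, 1)[-1] for line in lines_1 if mountpoint in line]
    ids = [ident for ident in ids if ident]  # ignore lines that end at the mount point
    # Keep only the file_2 lines that do not end with an identifying string.
    remaining = [
        line for line in lines_2
        if not any(line.endswith("/" + ident) for ident in ids)
    ]
    return lines_1 + remaining


def main():
    """Merge the two files named in the head section."""
    result = merge_lines(read_lines(f1), read_lines(f2), mountpoint)
    with open(merged, "w") as out:
        out.write("\n".join(result) + "\n")
```

Calling main() at the end of the file runs the merge with the head-section values. Matching on the end of the line is an assumption: since only the mount point of file_1 is configured, a file_2 path that merely ends with the same sub-path would be dropped as well. The nested scan compares every file_2 line against every identifying string, which is fine for moderate file sizes but slow for very large ones.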
How to use
Save the script as merge.py.
In the head section, set f1 (file_1), f2, the path to the merging file and the mountpoint as mentioned in file_1.
Run it by the command:
Edit
Or a tiny bit shorter: