In our company, we pull in inventory files from third parties. These files are in a fixed format, containing the 13-digit EAN (think UPC code) as well as other data. I also have a master list of EANs in our database.
I would like to compare the master file with the new file and remove every line from the new file whose EAN is not in the master.
Example:
Master
1234567890123
4567890123456
New file
1234567890123
4567890123456
5678901234567 <- remove this one
The new file contains data other than the EAN. The EAN is in the first column. The data is tab-separated.
I am currently doing this in PHP. The problem is that both files have about 4 million rows each, and my script consumes a ton of memory. I currently load the whole master list into RAM and do isset() lookups.
Are there any smart Linux tricks or programs that could help me?
Rephrasing the question in a more grep-friendly way, you want to print all lines which match an EAN from some master list of EANs.
Assuming that something resembling an EAN won't show up anywhere except in the EAN column, try:
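A minimal sketch of a grep-based filter in that spirit, assuming the master list holds one EAN per line. The file names master.txt and new.tsv and the sample rows are hypothetical, and -w relies on a grep that supports whole-word matching (GNU and BSD grep do):

```shell
#!/bin/sh
# Hypothetical sample data: a master list and a tab-separated new file.
workdir=$(mktemp -d)
printf '1234567890123\n4567890123456\n' > "$workdir/master.txt"
printf '1234567890123\tfoo\n4567890123456\tbar\n5678901234567\tbaz\n' > "$workdir/new.tsv"

# -F: treat each pattern as a fixed string, not a regular expression
# -w: only match whole words, so a longer digit run does not count
# -f: read the patterns (one EAN per line) from the master file
grep -F -w -f "$workdir/master.txt" "$workdir/new.tsv" > "$workdir/filtered.tsv"

cat "$workdir/filtered.tsv"
```

Note that this still matches an EAN anywhere on the line, and grep has to hold all patterns in memory, so it may not be lighter than the PHP script on 4 million rows.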
Assuming the EAN is the first column of master (and that master contains other columns), a pipeline built from awk and sed should come close (you can remove the awk step if master is just an EAN list; the nasty sed at the end removes the trailing | that results from the rest of the pipeline).
The above breaks down if EANs (or EAN-like 13-digit patterns) are present elsewhere in the data; that would require a more complex regular expression to restrict the search to a specific column.
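Restricting the match to one column can also be done without a regular expression. A minimal sketch with awk, assuming the EAN is the first tab-separated field in both files and that the master file is not empty (the file names and sample rows are hypothetical):

```shell
#!/bin/sh
# Hypothetical sample data; note the EAN-like value in the second column.
workdir=$(mktemp -d)
printf '1234567890123\n4567890123456\n' > "$workdir/master.txt"
printf '1234567890123\tfoo\n4567890123456\tbar\n5678901234567\t1234567890123\n' > "$workdir/new.tsv"

# NR == FNR is true only while awk reads the first file (the master):
# store each master EAN as an array key, then skip to the next line.
# For the second file, print a line only if its first field is a stored key.
awk -F '\t' 'NR == FNR { master[$1]; next } $1 in master' \
    "$workdir/master.txt" "$workdir/new.tsv" > "$workdir/filtered.tsv"

cat "$workdir/filtered.tsv"
```

The third sample row is dropped even though a master EAN appears in its second column, because only the first field is compared. The master list is still held in memory as an awk array, so this is the same idea as the isset() approach, just with a different runtime.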
Try something like this:
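One low-memory possibility is a sort-and-join sketch, assuming the master list holds one EAN per line, the EAN is the first tab-separated field of the new file, and the output order does not matter. The file names and sample rows are hypothetical:

```shell
#!/bin/sh
# Hypothetical sample data, deliberately unsorted.
workdir=$(mktemp -d)
printf '4567890123456\n1234567890123\n' > "$workdir/master.txt"
printf '5678901234567\tbaz\n4567890123456\tbar\n1234567890123\tfoo\n' > "$workdir/new.tsv"

tab=$(printf '\t')

# join needs both inputs sorted on the join field under the same collation,
# so LC_ALL=C is set for every step. sort -u also removes duplicate EANs.
LC_ALL=C sort -u "$workdir/master.txt" > "$workdir/master.sorted"
LC_ALL=C sort -t "$tab" -k1,1 "$workdir/new.tsv" > "$workdir/new.sorted"

# Keep only the lines of the new file whose first field appears in the master.
LC_ALL=C join -t "$tab" "$workdir/master.sorted" "$workdir/new.sorted" > "$workdir/filtered.tsv"

cat "$workdir/filtered.tsv"
```

sort spills to temporary files for large inputs and join streams through both files, so neither step needs the full master list in RAM. The trade-off is that the filtered file comes out sorted by EAN rather than in its original order.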