
Question

jotango

Asked: 2010-02-09 12:33:13 +0800 CST

Running (vertical?) diff on columns in a file

772

In our company we pull in inventory files from third parties. These files are in a fixed format, containing the 13-digit EAN (think UPC code) as well as other data. I also have a master list of EANs in our database.

I would like to compare the master file with the new file and remove every line from the new file whose EAN is not in the master.

Example:

Master
1234567890123
4567890123456

New file
1234567890123
4567890123456
5678901234567 <- remove this one

The new file contains data other than the EAN. The EAN is in the first column. The data is tab-separated.

I am currently doing this in PHP. The problem is that both files have about 4 million rows each and my script consumes a ton of memory. I currently load the whole master list into RAM and check each EAN with isset().

Are there any smart Linux tricks or programs that could help me?

2 Answers


voretaq7 · Answer 1 · 2010-02-09T13:44:01+08:00

Best Answer


Rephrasing the question in a more grep-friendly way, you want to print all lines which match an EAN from some master list of EANs.

Assuming that something resembling an EAN won't show up anywhere except in the EAN column, try:

1. Extract all the EANs from master
2. Squish that list of EANs into a regex
3. Feed the regex to egrep

Assuming the EAN is the first column of master (and that master contains other columns):

egrep "($(awk '{print $1}' master | tr '\n' '|' | sed 's/|$//'))" newfile

should come close. You can drop the awk if master is just a list of EANs; the sed at the end strips the trailing | that tr leaves behind.

The above breaks down if EANs (or EAN-like 13-digit patterns) are present elsewhere in the data; in that case it would need a more complex regular expression that restricts the search to a specific column.
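One way to restrict the match to the first column without a more complex regex is to swap grep for an awk lookup. This is a different technique from the pipeline above, offered only as a minimal sketch: it assumes newfile is tab-separated with the EAN in field 1, and that master is either a bare EAN list or tab-separated with the EAN first. The sample rows and the extra columns are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample data; the extra columns are invented for the demo.
workdir=$(mktemp -d)
printf '1234567890123\n4567890123456\n' > "$workdir/master"
printf '1234567890123\twidget\t5\n4567890123456\tgadget\t2\n5678901234567\tgizmo\t9\n' \
    > "$workdir/newfile"

# While reading the first file (NR==FNR), remember each EAN as an array key.
# While reading the second file, print a line only if its first
# tab-separated field is one of the remembered EANs.
filtered=$(awk -F'\t' 'NR==FNR { seen[$1]; next } $1 in seen' \
    "$workdir/master" "$workdir/newfile")

printf '%s\n' "$filtered"
rm -rf "$workdir"
```

Because the comparison is on field 1 only, a 13-digit number elsewhere in a row cannot cause a false match. Note that awk still holds every master EAN in memory, so this addresses the column problem rather than the memory problem.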

1

sntg · Answer 2 · 2010-02-09T12:45:20+08:00


Try something like this:

# Put each code on its own line, and sort them
sed -e 's/\ /\n/g' new | sort > neweans
sed -e 's/\ /\n/g' master | sort > mastereans

# Diff them side by side, then drop every line that diff marks
# as differing (<, > or |), leaving only the codes present in
# both files. Then print them.

diff -y neweans mastereans | grep -v '[<>|]' | awk '{print $1}'
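The commands above print only the codes, not the full rows. If the other columns of the new file need to survive, a sort-then-join variant may be worth trying. The sketch below assumes master is a bare list of EANs and the new file is tab-separated with the EAN in field 1; the sample rows and the names master.sorted and newfile.sorted are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample data, deliberately unsorted.
workdir=$(mktemp -d)
tab=$(printf '\t')
printf '4567890123456\n1234567890123\n' > "$workdir/master"
printf '5678901234567\tgizmo\t9\n1234567890123\twidget\t5\n4567890123456\tgadget\t2\n' \
    > "$workdir/newfile"

# join needs both inputs sorted on the join field with the same collation,
# so force the C locale for every step.
LC_ALL=C sort -u "$workdir/master" > "$workdir/master.sorted"
LC_ALL=C sort -t "$tab" -k1,1 "$workdir/newfile" > "$workdir/newfile.sorted"

# Keep only the rows of newfile whose first field also appears in master.
kept=$(LC_ALL=C join -t "$tab" "$workdir/master.sorted" "$workdir/newfile.sorted")

printf '%s\n' "$kept"
rm -rf "$workdir"
```

Since sort can spill to temporary files and join streams through both inputs, neither file has to fit in RAM at once. The trade-off is that the result comes out ordered by EAN rather than in the new file's original order.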

0
