I have a log file as below:
12-02-2022 15:18:22 +0330 SOCK5.6699 00000 user144 97.251.107.125:38605 1.1.1.1:443 51766 169369 0 CONNECT 1.1.1.1:443
12-02-2022 15:18:27 +0330 SOCK5.6699 00094 user156 32.99.193.2:51242 1.1.1.1:443 715 388 0 CONNECT 1.1.1.1:443
12-02-2022 15:18:56 +0330 SOCK5.6699 00000 user105 191.184.66.98:40048 1.1.1.1:443 18105 29029 0 CONNECT 1.1.1.1:443
12-02-2022 15:18:56 +0330 SOCK5.6699 00000 user105 191.184.66.98:40070 1.1.1.1:443 674 26805 0 CONNECT 1.1.1.1:443
12-02-2022 15:20:24 +0330 SOCK5.6699 00000 user143 112.199.63.119:60682 1.1.1.1:443 475 445 0 CONNECT 1.1.1.1:443
12-02-2022 15:20:37 +0330 SOCK5.6699 00000 user105 191.184.66.98:40102 1.1.1.1:443 12913 18780 0 CONNECT 1.1.1.1:443
12-02-2022 15:20:42 +0330 SOCK5.6699 00000 user143 112.199.63.119:60688 1.1.1.1:443 4530 34717 0 CONNECT 1.1.1.1:443
12-02-2022 15:20:44 +0330 SOCK5.6699 00000 user127 212.167.145.49:2972 1.1.1.1:443 827 267 0 CONNECT 1.1.1.1:443
My goal is to extract two fields from each line of this log file:
- Username
- Source IP address of the user
Below is a sample line; the fields needed from it are the username (user144) and the source IP address (97.251.107.125).
12-02-2022 15:18:22 +0330 SOCK5.6699 00000 user144 97.251.107.125:38605 1.1.1.1:443 51766 169369 0 CONNECT 1.1.1.1:443
So I wrote a Python script to extract both items, store them in separate lists, and then join them with the zip function.
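import pprint
import collections

# data is assumed to be an iterable of log lines (e.g. an open file object).
# The slices below are fixed character offsets and must match the real column layout.
iplist = []
for l in data:
    ip_port = l[53:71]
    iplist.append(ip_port.split(':')[0])  # keep the IP, drop the port

userlist = []
for u in data:
    user = u[42:52]
    userlist.append(user.replace(" ", ""))  # strip padding spaces

a = list(zip(iplist, userlist))
most_ip = collections.Counter(a).most_common(5)  # five most frequent (ip, user) pairs
pprint.pprint(most_ip)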
This code works fine, and I'm able to get the most frequently used IPs with their corresponding usernames. I should also mention that I didn't use the re module, since it was also matching the second IP (the destination IP, 1.1.1.1, which I don't care about).
Question: Is there a neater way to write this code?
Following the suggestion of "shearn89", I have edited my code as below:
It is much simpler, with a single iteration.
There are still several ways to optimize your new code. The two things that stand out most to me:
Do not execute split() more than once for each line of the log. Call split() once and store the result in a variable, because every call to this function takes some time (not much, but it adds up as you process more data).
Why create two lists and then zip them together afterwards? Just store the tuples directly in a list:
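A minimal sketch of both points, not necessarily the exact code originally posted. It assumes `data` is an iterable of log lines in the format shown in the question, and that splitting on whitespace puts the username at index 5 and the source `ip:port` at index 6. The helper name `extract_pairs` and the inline sample lines are only there for illustration.

```python
import collections
import pprint

# Sample lines copied from the question; in practice `data` would be
# the lines read from the log file (assumption).
data = [
    "12-02-2022 15:18:22 +0330 SOCK5.6699 00000 user144 97.251.107.125:38605 1.1.1.1:443 51766 169369 0 CONNECT 1.1.1.1:443",
    "12-02-2022 15:18:56 +0330 SOCK5.6699 00000 user105 191.184.66.98:40048 1.1.1.1:443 18105 29029 0 CONNECT 1.1.1.1:443",
    "12-02-2022 15:18:56 +0330 SOCK5.6699 00000 user105 191.184.66.98:40070 1.1.1.1:443 674 26805 0 CONNECT 1.1.1.1:443",
]


def extract_pairs(lines):
    """Return a list of (source_ip, username) tuples, one per valid line."""
    pairs = []
    for line in lines:
        fields = line.split()  # split once and reuse the result
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        user = fields[5]
        ip = fields[6].split(":")[0]  # drop the source port
        pairs.append((ip, user))  # store the tuple directly, no zip needed
    return pairs


pairs = extract_pairs(data)
most_ip = collections.Counter(pairs).most_common(5)
pprint.pprint(most_ip)
```

Because the fields are taken by position after splitting, the destination address (a later field) is never picked up, and the result does not depend on fixed character offsets.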