We have Oracle running on a Windows server, with Windows clients and a couple of Linux clients. The Windows clients work just fine. However, running tnsping
on a Linux client (RHEL 6.9) shows an interesting issue. Take for example:
tnsping <IP> 100
This will eventually fail on a Linux client with:
TNS-12535: TNS:operation timed out
Note that ping does not show any dropped packets. I captured some traffic with tcpdump, and when I loaded the capture in Wireshark the only thing that looked odd was TCP retransmissions. Every Linux client I try this on exhibits this behavior. I tried tweaking some OS-level TCP timeouts and keepalives, but that didn't resolve it.
I also ran strace on tnsping, but the only thing it showed was a timeout, which didn't help.
The systems in question are VMs running on VMware.
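One way to check whether the problem sits in the TCP layer rather than in Oracle Net is to repeat plain TCP connects to the listener port and time each one. The following is a minimal Python sketch, not something I have run against these systems. It assumes the listener is on the default port 1521 (the port is not stated above), and `probe` and `summarize` are hypothetical helper names:

```python
import socket
import time


def probe(host, port, attempts=100, timeout=5.0):
    """Open and close a TCP connection `attempts` times.

    Returns a list of (elapsed_ms, error) tuples; error is None on success.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        error = None
        try:
            # create_connection completes the TCP three-way handshake,
            # which is the step that appears to stall for tnsping.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError as exc:  # timeouts and refused connections
            error = exc
        elapsed_ms = (time.monotonic() - start) * 1000.0
        results.append((elapsed_ms, error))
    return results


def summarize(results, slow_ms=1000.0):
    """Count failed connects and successful connects slower than slow_ms."""
    failed = sum(1 for _, err in results if err is not None)
    slow = sum(1 for ms, err in results if err is None and ms >= slow_ms)
    return {"attempts": len(results), "failed": failed, "slow": slow}


# Example (assumed port): print(summarize(probe("<IP>", 1521)))
```

If this shows the same intermittent stalls as tnsping, the Oracle client itself is probably not the culprit.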
Edit:
I created a trace file and ran tnsping again. It succeeded for many connections, then eventually timed out:
nttcni: Tcp conn timeout = 60000 (ms)
nttcni: TCP Connect TO enabled. Switching to NB
nttctl: entry
nttctl: Setting connection into non-blocking mode
nttcni: trying to connect to socket 5.
ntt2err: entry
ntt2err: exit
ntctst: size of NTTEST list is 1 - not calling poll
sntpoltst: No of conn to test 1, wait time 60
sntpoltst: fd 5 need 1 readiness events
sntpoltst: exit
nttcni: TImeout or Error on this socket
nttcni: exit
nttcon: exit
nserror: entry
nserror: nsres: id=0, op=65, ns=12535, ns2=12560; nt[0]=505, nt[1]=0, nt[2]=0; ora[0]=0, ora[1]=0, ora[2]=0
nsopen: unable to open transport
nsiocancel: entry
nsiofrrg: entry
nsiofrrg: cur = 24eff18
nsbfr: entry
nsbaddfl: entry
nsbaddfl: normal exit
nsbfr: normal exit
nsiofrrg: exit
nsiocancel: exit
nsvntx_dei: entry
nsvntx_dei: exit
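The nserror line in the trace above packs several error codes into one record. As a reading aid, here is a small Python sketch that pulls out the non-zero codes and labels the ones I recognize: ns=12535 is the TNS-12535 from the tnsping output, ns2=12560 is the generic TNS-12560 protocol adapter error, and nt[0]=505 is TNS-00505, a timeout at the transport layer. `parse_nserror` and the `KNOWN_CODES` table are hypothetical names for illustration, and the table only covers the codes seen here:

```python
import re

# Meanings of the codes that appear in this trace; extend as needed.
KNOWN_CODES = {
    ("ns", 12535): "TNS-12535: TNS:operation timed out",
    ("ns", 12560): "TNS-12560: TNS:protocol adapter error",
    ("nt", 505): "TNS-00505: Operation timed out",
}

# Matches fields such as ns=12535, ns2=12560, nt[0]=505, ora[1]=0.
FIELD = re.compile(r"\b(ns2?|nt\[\d+\]|ora\[\d+\])=(\d+)")


def parse_nserror(line):
    """Return (field, code, meaning) for every non-zero code in the line."""
    found = []
    for field, value in FIELD.findall(line):
        code = int(value)
        if code == 0:
            continue  # zero means no error recorded at that layer
        layer = re.match(r"[a-z]+", field).group(0)  # ns, nt, or ora
        meaning = KNOWN_CODES.get((layer, code), "unknown")
        found.append((field, code, meaning))
    return found
```

All three codes point at a timeout below the Oracle session layer, which matches the "TImeout or Error on this socket" message a few lines earlier.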
I did find this article, which describes a similar but not identical issue. In their example it looks like tnsping was able to connect and send data, whereas my tnsping doesn't even establish a TCP connection.
https://ardentperf.com/2010/09/08/mysterious-oracle-net-errors/
If I add the following options to the connection entry in the tnsnames.ora file:
(RETRY_COUNT=3)
(TRANSPORT_CONNECT_TIMEOUT=10)
Then it works without timing out, but an occasional attempt shows very high latency:
OK (10 msec)
OK (0 msec)
OK (0 msec)
OK (10 msec)
OK (0 msec)
OK (0 msec)
OK (10020 msec)
OK (0 msec)
OK (0 msec)
OK (0 msec)
OK (10 msec)
OK (0 msec)
This "fix" is the equivalent of bubble gum and duct tape but it does work.
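To spot the slow attempts in a long tnsping run, the output can be scanned for outliers. This is a minimal Python sketch assuming the output lines look exactly like the ones above; `parse_latencies` and `find_spikes` are hypothetical names and the 1000 ms threshold is arbitrary:

```python
import re


def parse_latencies(output):
    """Extract the millisecond values from lines like 'OK (10 msec)'."""
    return [int(ms) for ms in re.findall(r"OK \((\d+) msec\)", output)]


def find_spikes(latencies, threshold_ms=1000):
    """Return (attempt_index, latency_ms) for attempts at or above the threshold."""
    return [(i, ms) for i, ms in enumerate(latencies) if ms >= threshold_ms]
```

On the output above this flags only the 10020 msec attempt. That value is about one TRANSPORT_CONNECT_TIMEOUT (10 seconds) plus a normal fast connect, which is consistent with one connection attempt silently timing out and the retry succeeding, although the output alone does not prove that.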