I am wondering why NFS v4 would be so much faster than NFS v3 and if there are any parameters on v3 that could be tweaked.
I mount a file system
sudo mount -o 'rw,bg,hard,nointr,rsize=1048576,wsize=1048576,vers=4' toto:/test /test
and then run
dd if=/test/file of=/dev/null bs=1024k
I can read 200-400MB/s but when I change version to vers=3
, remount and rerun the dd I only get 90MB/s. The file I'm reading from is an in memory file on the NFS server. Both sides of the connection are Solaris and have 10GbE NIC. I avoid any client side caching by remounting between all tests. I used dtrace
to see on the server to measure how fast data is being served via NFS. For both v3 and v4 I changed:
nfs4_bsize
nfs3_bsize
from default 32K to 1M (on v4 I maxed at 150MB/s with 32K) I've tried tweaking
- nfs3_max_threads
- clnt_max_conns
- nfs3_async_clusters
to improve the v3 performance, but no go.
On v3 if I run four parallel dd
's the throughput goes down from 90MB/s to 70-80MBs which leads me to believe the problem is some shared resource and if so, then I'm wondering what it is and if I can increase that resource.
dtrace code to get window sizes:
#!/usr/sbin/dtrace -s
#pragma D option quiet
#pragma D option defaultargs
inline string ADDR=$$1;
dtrace:::BEGIN
{
TITLE = 10;
title = 0;
printf("starting up ...\n");
self->start = 0;
}
tcp:::send, tcp:::receive
/ self->start == 0 /
{
walltime[args[1]->cs_cid]= timestamp;
self->start = 1;
}
tcp:::send, tcp:::receive
/ title == 0 &&
( ADDR == NULL || args[3]->tcps_raddr == ADDR ) /
{
printf("%4s %15s %6s %6s %6s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
"cid",
"ip",
"usend" ,
"urecd" ,
"delta" ,
"send" ,
"recd" ,
"ssz" ,
"sscal" ,
"rsz",
"rscal",
"congw",
"conthr",
"flags",
"retran"
);
title = TITLE ;
}
tcp:::send
/ ( ADDR == NULL || args[3]->tcps_raddr == ADDR ) /
{
nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
this->delta= timestamp-walltime[args[1]->cs_cid];
walltime[args[1]->cs_cid]=timestamp;
this->flags="";
this->flags= strjoin((( args[4]->tcp_flags & TH_FIN ) ? "FIN|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_SYN ) ? "SYN|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_RST ) ? "RST|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_PUSH ) ? "PUSH|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_ACK ) ? "ACK|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_URG ) ? "URG|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_ECE ) ? "ECE|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_CWR ) ? "CWR|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags == 0 ) ? "null " : ""),this->flags);
printf("%5d %14s %6d %6d %6d %8d \ %-8s %8d %6d %8d %8d %8d %12d %s %d \n",
args[1]->cs_cid%1000,
args[3]->tcps_raddr ,
args[3]->tcps_snxt - args[3]->tcps_suna ,
args[3]->tcps_rnxt - args[3]->tcps_rack,
this->delta/1000,
args[2]->ip_plength - args[4]->tcp_offset,
"",
args[3]->tcps_swnd,
args[3]->tcps_snd_ws,
args[3]->tcps_rwnd,
args[3]->tcps_rcv_ws,
args[3]->tcps_cwnd,
args[3]->tcps_cwnd_ssthresh,
this->flags,
args[3]->tcps_retransmit
);
this->flags=0;
title--;
this->delta=0;
}
tcp:::receive
/ nfs[args[1]->cs_cid] && ( ADDR == NULL || args[3]->tcps_raddr == ADDR ) /
{
this->delta= timestamp-walltime[args[1]->cs_cid];
walltime[args[1]->cs_cid]=timestamp;
this->flags="";
this->flags= strjoin((( args[4]->tcp_flags & TH_FIN ) ? "FIN|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_SYN ) ? "SYN|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_RST ) ? "RST|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_PUSH ) ? "PUSH|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_ACK ) ? "ACK|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_URG ) ? "URG|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_ECE ) ? "ECE|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_CWR ) ? "CWR|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags == 0 ) ? "null " : ""),this->flags);
printf("%5d %14s %6d %6d %6d %8s / %-8d %8d %6d %8d %8d %8d %12d %s %d \n",
args[1]->cs_cid%1000,
args[3]->tcps_raddr ,
args[3]->tcps_snxt - args[3]->tcps_suna ,
args[3]->tcps_rnxt - args[3]->tcps_rack,
this->delta/1000,
"",
args[2]->ip_plength - args[4]->tcp_offset,
args[3]->tcps_swnd,
args[3]->tcps_snd_ws,
args[3]->tcps_rwnd,
args[3]->tcps_rcv_ws,
args[3]->tcps_cwnd,
args[3]->tcps_cwnd_ssthresh,
this->flags,
args[3]->tcps_retransmit
);
this->flags=0;
title--;
this->delta=0;
}
Output looks like ( not from this particular situation):
cid ip usend urecd delta send recd ssz sscal rsz rscal congw conthr flags retran
320 192.168.100.186 240 0 272 240 \ 49232 0 1049800 5 1049800 2896 ACK|PUSH| 0
320 192.168.100.186 240 0 196 / 68 49232 0 1049800 5 1049800 2896 ACK|PUSH| 0
320 192.168.100.186 0 0 27445 0 \ 49232 0 1049800 5 1049800 2896 ACK| 0
24 192.168.100.177 0 0 255562 / 52 64060 0 64240 0 91980 2920 ACK|PUSH| 0
24 192.168.100.177 52 0 301 52 \ 64060 0 64240 0 91980 2920 ACK|PUSH| 0
some headers
usend - unacknowledged send bytes
urecd - unacknowledged received bytes
ssz - send window
rsz - receive window
congw - congestion window
planning on taking snoop's of the dd's over v3 and v4 and comparing. Have already done it but there was too much traffic and I used a disk file instead of a cached file which made comparing timings meaningless. Will run other snoop's with cached data and no other traffic between boxes. TBD
Additionally the network guys say there is no traffic shaping or bandwidth limiters on the connections.
NFS 4.1 (minor 1) is designed to be a faster and more efficient protocol and is recommended over previous versions, especially 4.0.
This includes client-side caching, and although not relevant in this scenario, parallel-NFS (pNFS). The major change is that is that the protocol is now stateful.
http://www.netapp.com/us/communities/tech-ontap/nfsv4-0408.html
I think it is the recommended protocol when using NetApps, judging by their performance documentation. The technology is similar to Windows Vista+ opportunistic locking.