I'm on an OpenSolaris box and I want to monitor the VMs on an ESX box. VMware has a remote toolkit in Perl for Linux and a remote esxtop for Linux, but is there a way to collect ESX statistics on a non-Linux box, in this case OpenSolaris x86? I could use ssh, expect, and esxtop (a sketch of that fallback is below), but I want to know if there is a way to do this with a remote API; I'd rather not use ssh and expect.
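For reference, here is a minimal sketch of the ssh-based fallback I'd like to avoid, assuming ssh access to the ESX service console and that esxtop supports the -b (batch), -d (delay), and -n (iterations) flags; the hostname is hypothetical:

#!/bin/sh
# Hypothetical fallback: pull esxtop batch output over ssh from
# OpenSolaris. ESXHOST is a placeholder hostname.
ESXHOST=esx01
ssh root@$ESXHOST "esxtop -b -d 5 -n 12" > esxtop_`date +%Y%m%d%H%M`.csv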
I am wondering why NFS v4 would be so much faster than NFS v3 and if there are any parameters on v3 that could be tweaked.
I mount a file system:
sudo mount -o 'rw,bg,hard,nointr,rsize=1048576,wsize=1048576,vers=4' toto:/test /test
and then run:
dd if=/test/file of=/dev/null bs=1024k
I can read at 200-400 MB/s, but when I change the version to vers=3, remount, and rerun the dd, I only get 90 MB/s. The file I'm reading from is an in-memory file on the NFS server. Both sides of the connection are Solaris and have 10GbE NICs. I avoid any client-side caching by remounting between all tests. I used dtrace on the server to measure how fast data is being served via NFS. For both v3 and v4 I changed:
nfs4_bsize
nfs3_bsize
from the default 32K to 1M (on v4 I maxed out at 150 MB/s with 32K). I've also tried tweaking
- nfs3_max_threads
- clnt_max_conns
- nfs3_async_clusters
to improve the v3 performance, but no luck. (The sketch below shows how I set these tunables.)
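For the record, a minimal sketch of how I set these kinds of kernel tunables on Solaris via the live-kernel mdb route; the tunable names come from the text above, but the module prefix in the /etc/system lines is my assumption, so verify it on your release:

#!/bin/sh
# Bump the NFS client block sizes to 1M on a live kernel
# (affects mounts made after the change). "0t" means decimal in mdb.
echo "nfs3_bsize/W 0t1048576" | mdb -kw
echo "nfs4_bsize/W 0t1048576" | mdb -kw

# Persistent equivalent in /etc/system (module prefix assumed):
#   set nfs:nfs3_bsize=1048576
#   set nfs:nfs4_bsize=1048576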
On v3, if I run four parallel dd's, the throughput goes down from 90 MB/s to 70-80 MB/s, which leads me to believe the problem is some shared resource; if so, I'm wondering what that resource is and whether I can increase it. (A sketch of the test loop I'm using is below.)
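For reproducibility, here is a minimal sketch of the test cycle, assuming the same server, mount point, and options as above; the loop structure and parallel-reader count are mine:

#!/bin/sh
# Usage: ./nfstest.sh [nfs-version] [parallel-readers]  (run as root)
VERS=${1:-3}
PAR=${2:-1}
# Remount between runs to defeat client-side caching.
umount /test 2>/dev/null
mount -o rw,bg,hard,nointr,rsize=1048576,wsize=1048576,vers=$VERS \
    toto:/test /test
i=0
while [ $i -lt $PAR ]; do
    dd if=/test/file of=/dev/null bs=1024k &
    i=`expr $i + 1`
done
wait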
DTrace code to get window sizes:
#!/usr/sbin/dtrace -s
#pragma D option quiet
#pragma D option defaultargs
inline string ADDR=$$1;
dtrace:::BEGIN
{
TITLE = 10;
title = 0;
printf("starting up ...\n");
self->start = 0;
}
tcp:::send, tcp:::receive
/ self->start == 0 /
{
walltime[args[1]->cs_cid]= timestamp;
self->start = 1;
}
tcp:::send, tcp:::receive
/ title == 0 &&
( ADDR == NULL || args[3]->tcps_raddr == ADDR ) /
{
printf("%4s %15s %6s %6s %6s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
"cid",
"ip",
"usend" ,
"urecd" ,
"delta" ,
"send" ,
"recd" ,
"ssz" ,
"sscal" ,
"rsz",
"rscal",
"congw",
"conthr",
"flags",
"retran"
);
title = TITLE ;
}
tcp:::send
/ ( ADDR == NULL || args[3]->tcps_raddr == ADDR ) /
{
nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
this->delta= timestamp-walltime[args[1]->cs_cid];
walltime[args[1]->cs_cid]=timestamp;
this->flags="";
this->flags= strjoin((( args[4]->tcp_flags & TH_FIN ) ? "FIN|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_SYN ) ? "SYN|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_RST ) ? "RST|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_PUSH ) ? "PUSH|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_ACK ) ? "ACK|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_URG ) ? "URG|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_ECE ) ? "ECE|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_CWR ) ? "CWR|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags == 0 ) ? "null " : ""),this->flags);
printf("%5d %14s %6d %6d %6d %8d \ %-8s %8d %6d %8d %8d %8d %12d %s %d \n",
args[1]->cs_cid%1000,
args[3]->tcps_raddr ,
args[3]->tcps_snxt - args[3]->tcps_suna ,
args[3]->tcps_rnxt - args[3]->tcps_rack,
this->delta/1000,
args[2]->ip_plength - args[4]->tcp_offset,
"",
args[3]->tcps_swnd,
args[3]->tcps_snd_ws,
args[3]->tcps_rwnd,
args[3]->tcps_rcv_ws,
args[3]->tcps_cwnd,
args[3]->tcps_cwnd_ssthresh,
this->flags,
args[3]->tcps_retransmit
);
this->flags=0;
title--;
this->delta=0;
}
tcp:::receive
/ nfs[args[1]->cs_cid] && ( ADDR == NULL || args[3]->tcps_raddr == ADDR ) /
{
this->delta= timestamp-walltime[args[1]->cs_cid];
walltime[args[1]->cs_cid]=timestamp;
this->flags="";
this->flags= strjoin((( args[4]->tcp_flags & TH_FIN ) ? "FIN|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_SYN ) ? "SYN|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_RST ) ? "RST|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_PUSH ) ? "PUSH|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_ACK ) ? "ACK|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_URG ) ? "URG|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_ECE ) ? "ECE|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags & TH_CWR ) ? "CWR|" : ""),this->flags);
this->flags= strjoin((( args[4]->tcp_flags == 0 ) ? "null " : ""),this->flags);
printf("%5d %14s %6d %6d %6d %8s / %-8d %8d %6d %8d %8d %8d %12d %s %d \n",
args[1]->cs_cid%1000,
args[3]->tcps_raddr ,
args[3]->tcps_snxt - args[3]->tcps_suna ,
args[3]->tcps_rnxt - args[3]->tcps_rack,
this->delta/1000,
"",
args[2]->ip_plength - args[4]->tcp_offset,
args[3]->tcps_swnd,
args[3]->tcps_snd_ws,
args[3]->tcps_rwnd,
args[3]->tcps_rcv_ws,
args[3]->tcps_cwnd,
args[3]->tcps_cwnd_ssthresh,
this->flags,
args[3]->tcps_retransmit
);
this->flags=0;
title--;
this->delta=0;
}
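To run the script above (the filename is hypothetical; the optional argument is the remote address to filter on, per the ADDR inline at the top):

chmod +x tcpwin.d
./tcpwin.d                  # trace all connections
./tcpwin.d 192.168.100.186  # only this remote address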
Output looks like this (not from this particular situation):
cid ip usend urecd delta send recd ssz sscal rsz rscal congw conthr flags retran
320 192.168.100.186 240 0 272 240 \ 49232 0 1049800 5 1049800 2896 ACK|PUSH| 0
320 192.168.100.186 240 0 196 / 68 49232 0 1049800 5 1049800 2896 ACK|PUSH| 0
320 192.168.100.186 0 0 27445 0 \ 49232 0 1049800 5 1049800 2896 ACK| 0
24 192.168.100.177 0 0 255562 / 52 64060 0 64240 0 91980 2920 ACK|PUSH| 0
24 192.168.100.177 52 0 301 52 \ 64060 0 64240 0 91980 2920 ACK|PUSH| 0
Some column definitions:
usend - unacknowledged send bytes
urecd - unacknowledged received bytes
ssz - send window
rsz - receive window
congw - congestion window
I'm planning on taking snoop captures of the dd's over v3 and v4 and comparing them (a sketch is below). I already tried this once, but there was too much traffic, and I used a disk file instead of a cached file, which made comparing timings meaningless. I'll rerun the snoop captures with cached data and no other traffic between the boxes. TBD.
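A minimal sketch of the capture I have in mind, using snoop's -d (device) and -o (output file) options with a host filter; the 10GbE interface name is hypothetical:

#!/bin/sh
# Capture only traffic to/from the NFS server during one dd run.
snoop -d nxge0 -o /var/tmp/nfs_v3.cap host toto &
SNOOP_PID=$!
dd if=/test/file of=/dev/null bs=1024k
kill $SNOOP_PID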
Additionally, the network guys say there are no traffic shapers or bandwidth limiters on the connections.
Is there any way to limit the output from esxtop in batch mode? I tried running it in batch mode and got 16,000 columns! I could filter this out post-collection, but at that kind of data volume it seems like I'd be wasting resources. The interactive output from esxtop is fairly customizable; here is a pretty good discussion of esxtop: http://www.yellow-bricks.com/esxtop/ If the batch-mode output can't be limited (though saving an interactive configuration and replaying it may help, as sketched below), then I will probably look at parsing the interactive output programmatically. Another option would be using the SDK from VMware, but I haven't found any practical examples. I'm doing the collection from OpenSolaris. There is a Perl SDK for Linux and Windows, but I'd rather do everything from OpenSolaris if possible.
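One approach worth trying, assuming batch mode honors a saved interactive configuration (the W save key and the -c flag exist in the esxtop versions I've seen, but verify on yours; the config path is the ESX 4.x default):

# Interactively remove the fields you don't want, then press W to
# save the configuration (by default to ~/.esxtop4rc on ESX 4.x).
esxtop
# Replay that configuration in batch mode: 12 samples, 5s apart.
esxtop -b -c ~/.esxtop4rc -d 5 -n 12 > esxtop.csv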
I want to read from and write to a ramdisk on OpenSolaris for performance-testing purposes. The tests are aimed at network transmission, and I want to rule out disk performance. I set up the ramdisk on the NFS server, machine A, with
mkfile -nv 1000m `pwd`/ramdisk
in a directory that was mounted via NFS onto machine B. Reading the file went fine, but writing to it just overwrote it as an ordinary file. I then set up a ramdisk with
ramdiskadm -a ramdisk1 1000m
which I can write to fine, but which I can't access over NFS. The ramdisk appears under /dev/ramdisk, which is a link into /devices/pseudo. I added /devices/pseudo to /etc/dfs/sharetab and mounted it on machine B without error, but the contents of the directory on machine B are empty. (The commands I used are sketched below.)
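For reference, a minimal sketch of the two setups described above; the share invocation in the second case is my reconstruction of what I attempted, and the directory path is hypothetical:

#!/bin/sh
# Attempt 1: sparse file in an NFS-shared directory on machine A.
mkfile -nv 1000m /export/test/ramdisk

# Attempt 2: a real ramdisk device.
ramdiskadm -a ramdisk1 1000m   # creates /dev/ramdisk/ramdisk1
# Sharing the device tree directly; this is the part that showed up
# empty on machine B (NFS shares filesystems, not raw device nodes):
share -F nfs /devices/pseudo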
I'm running netio (http://freshmeat.net/projects/netio/) on one machine (OpenSolaris) and contacting two different Linux machines (both on 2.6.18-128.el5), machine A and machine B. Machine A has a network throughput of 10 MB/s with netio and machine B 100 MB/s. On the OpenSolaris box I dtraced the connections, and all the interactions look the same: same window sizes on receive and send, same ssthresh, same congestion window sizes. But the slow machine sends an ACK for every 2 or 3 receives, whereas the fast machine sends an ACK every 12 receives. All three machines are on the same switch. Here is the DTrace output. Fast machine:
 delta   send   recd
  (us)  bytes  bytes      swnd  snd_ws     rwnd  rcv_ws     cwnd    ssthresh
   122   1448 \         195200       7   131768       2   128872  1073725440
    37   1448 \         195200       7   131768       2   128872  1073725440
    20   1448 \         195200       7   131768       2   128872  1073725440
    18   1448 \         195200       7   131768       2   128872  1073725440
    18   1448 \         195200       7   131768       2   128872  1073725440
    18   1448 \         195200       7   131768       2   128872  1073725440
    18   1448 \         195200       7   131768       2   128872  1073725440
    19   1448 \         195200       7   131768       2   128872  1073725440
    18   1448 \         195200       7   131768       2   128872  1073725440
    18   1448 \         195200       7   131768       2   128872  1073725440
    57   1448 \         195200       7   131768       2   128872  1073725440
   171   1448 \         195200       7   131768       2   128872  1073725440
    29    912 \         195200       7   131768       2   128872  1073725440
    30        /      0  195200       7   131768       2   128872  1073725440
Slow machine:
 delta   send   recd
  (us)  bytes  bytes      swnd  snd_ws     rwnd  rcv_ws     cwnd    ssthresh
   161        /      0  195200       7   131768       2   127424  1073725440
    52   1448 \         195200       7   131768       2   128872  1073725440
    33   1448 \         195200       7   131768       2   128872  1073725440
    11   1448 \         195200       7   131768       2   128872  1073725440
   143        /      0  195200       7   131768       2   128872  1073725440
    46   1448 \         195200       7   131768       2   130320  1073725440
    31   1448 \         195200       7   131768       2   130320  1073725440
    11   1448 \         195200       7   131768       2   130320  1073725440
   157        /      0  195200       7   131768       2   130320  1073725440
    46   1448 \         195200       7   131768       2   131768  1073725440
    18   1448 \         195200       7   131768       2   131768  1073725440
DTrace code:
dtrace: 130717 drops on CPU 0

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option defaultargs

inline int TICKS=$1;
inline string ADDR=$$2;

dtrace:::BEGIN
{
    TIMER = ( TICKS != NULL ) ? TICKS : 1;
    ticks = TIMER;
    TITLE = 10;
    title = 0;
    walltime = timestamp;
    printf("starting up ...\n");
}

tcp:::send
/ ( args[2]->ip_daddr == ADDR || ADDR == NULL ) /
{
    nfs[args[1]->cs_cid] = 1; /* this is an NFS thread */
    delta = timestamp - walltime;
    walltime = timestamp;
    printf("%6d %8d \ %8s %8d %8d %8d %8d %8d %12d %12d %12d %8d %8d %d \n",
        delta/1000,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        args[3]->tcps_sack_fack,
        args[3]->tcps_sack_snxt,
        args[3]->tcps_rto,
        args[3]->tcps_mss,
        args[3]->tcps_retransmit
    );
    flag = 0;
    title--;
}

tcp:::receive
/ ( args[2]->ip_saddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
{
    delta = timestamp - walltime;
    walltime = timestamp;
    printf("%6d %8s / %8d %8d %8d %8d %8d %8d %12d %12d %12d %8d %8d %d \n",
        delta/1000,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        args[3]->tcps_sack_fack,
        args[3]->tcps_sack_snxt,
        args[3]->tcps_rto,
        args[3]->tcps_mss,
        args[3]->tcps_retransmit
    );
    flag = 0;
    title--;
}
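To invoke the script above (the filename is hypothetical; the first argument is the tick count per the TICKS inline, the second the remote address to filter on):

./tcpdelta.d 1 192.168.1.5   # hypothetical filename and address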
Followup: I added the number of unacknowledged bytes to the output, and it turns out the slow machine does run up its unacknowledged bytes until it hits the congestion window, whereas the fast machine never hits its congestion window. Here is the output from the slow machine when its unacknowledged bytes hit the congestion window:
 unack   unack  delta  bytes   bytes         send  receive     cong    ssthresh
 bytes   bytes   (us)   sent    recd       window   window   window
  sent    recd
139760       0     31   1448 \             195200   131768   144800  1073725440
139760       0     33   1448 \             195200   131768   144800  1073725440
144104       0     29   1448 \             195200   131768   146248  1073725440
145552       0     31        /      0      195200   131768   144800  1073725440
145552       0     41   1448 \             195200   131768   147696  1073725440
147000       0     30        /      0      195200   131768   144800  1073725440
147000       0     22   1448 \             195200   131768    76744       72400
147000       0     28        /      0      195200   131768    76744       72400
147000       0     18   1448 \             195200   131768    76744       72400
147000       0     26        /      0      195200   131768    76744       72400
147000       0     17   1448 \             195200   131768    76744       72400
147000       0     27        /      0      195200   131768    76744       72400
147000       0     18   1448 \             195200   131768    76744       72400
147000       0     56        /      0      195200   131768    76744       72400
147000       0     22   1448 \             195200   131768    76744       72400
DTrace code:
#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option defaultargs

inline int TICKS=$1;
inline string ADDR=$$2;

tcp:::send, tcp:::receive
/ ( args[2]->ip_daddr == ADDR || ADDR == NULL ) /
{
    nfs[args[1]->cs_cid] = 1; /* this is an NFS thread */
    delta = timestamp - walltime;
    walltime = timestamp;
    printf("%6d %6d %6d %8d \ %8s %8d %8d %8d %8d %8d %12d %12d %12d %8d %8d %d \n",
        args[3]->tcps_snxt - args[3]->tcps_suna,
        args[3]->tcps_rnxt - args[3]->tcps_rack,
        delta/1000,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        args[3]->tcps_sack_fack,
        args[3]->tcps_sack_snxt,
        args[3]->tcps_rto,
        args[3]->tcps_mss,
        args[3]->tcps_retransmit
    );
}

tcp:::receive
/ ( args[2]->ip_saddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
{
    delta = timestamp - walltime;
    walltime = timestamp;
    printf("%6d %6d %6d %8s / %-8d %8d %8d %8d %8d %8d %12d %12d %12d %8d %8d %d \n",
        args[3]->tcps_snxt - args[3]->tcps_suna,
        args[3]->tcps_rnxt - args[3]->tcps_rack,
        delta/1000,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        args[3]->tcps_sack_fack,
        args[3]->tcps_sack_snxt,
        args[3]->tcps_rto,
        args[3]->tcps_mss,
        args[3]->tcps_retransmit
    );
}
The question remains why one machine falls behind and the other doesn't ...
I have two machines on the same subnet, X.Y.Z.1 and X.Y.Z.2, connected directly with a crossover cable. I can
$ ping X.Y.Z.2
from X.Y.Z.1 and the response is that machine 2 is alive, but if I do something like
$ ping -s X.Y.Z.2
it hangs. Machine 1 is OpenSolaris; machine 2 has been HP-UX, Linux, and Solaris SPARC. Second test:
$ ssh X.Y.Z.2
connects and asks about the DSA key, which I accept with "yes", and then it hangs. (A diagnostic sketch is below.)
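One thing I can try, using Solaris ping's data-size form (ping -s host data_size npackets) to see whether only larger packets are being dropped, e.g. by a duplex or MTU mismatch on the crossover link; the sizes here are arbitrary test values:

#!/bin/sh
# Probe with increasing payloads; if small pings answer and large
# ones hang, suspect the link layer rather than IP routing.
for SZ in 8 64 512 1024 1472; do
    echo "payload $SZ bytes:"
    ping -s X.Y.Z.2 $SZ 3
done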