We have a Java application which is significantly slower in Solaris 10 server than in Windows PC.
We have profiled (-Xprof) the application and observed that UnixFileSystem.getBooleanAttributes0 method consumes about 40% CPU time with native calls.
How can we follow our search to identify which is the cause of the slow behaviour?
Update:
We have executed dtrace
and we observe the CPU time is 0.489 sec, while the time
results are
real 0m40.759s
user 0m47.669s
sys 0m2.287s
It seems time is not spent on CPU but in another place. How can we proceed with further investigation having this dtrace
results? Can we use another dtrace
options to show more info?
Thank you very much.
# /dtrace-toolkit/DTraceToolkit-0.99/procsystime -aTp 5695
Tracing... Hit Ctrl-C to end...
dtrace: 1032 dynamic variable drops with non-empty dirty list
^C
Elapsed Times for PID 5695,
SYSCALL TIME (ns)
gtime 7799
sysconfig 8018
lwp_kill 10105
getsockname 10215
systeminfo 10984
getrlimit 11093
lwp_self 20210
schedctl 26361
lwp_continue 27349
priocntlsys 33390
unlink 53050
lwp_sigmask 136303
setcontext 143447
so_socketpair 202427
lwp_create 223626
brk 227141
so_socket 263276
memcntl 461091
ioctl 1094955
lwp_cond_broadcast 1946728
lwp_mutex_wakeup 2579492
lwp_cond_signal 2903277
mmap 3082973
access 4446250
mprotect 4534229
getdents64 5713309
mmap64 6272480
fsat 6925239
stat 6926115
open 8324983
fcntl 8840868
close 11264404
llseek 11678791
munmap 11867401
fstat64 17852129
lwp_park 22098570
open64 23549922
stat64 39673214
read 41876097
resolvepath 81905859
yield 175946072
write 225507878
lwp_mutex_timedlock 887174392
pollsys 20393318412
lwp_cond_wait 345099369278
TOTAL: 367108549202
CPU Times for PID 5695,
SYSCALL TIME (ns)
gtime 1208
sysconfig 1647
lwp_self 2198
systeminfo 2856
lwp_kill 3186
getsockname 3625
getrlimit 4613
lwp_continue 13839
schedctl 14059
priocntlsys 14498
lwp_sigmask 20541
unlink 45801
setcontext 96874
brk 121804
lwp_park 158601
so_socketpair 195178
lwp_create 209017
so_socket 241199
memcntl 365972
ioctl 606397
lwp_cond_broadcast 1367556
lwp_mutex_wakeup 1561103
lwp_cond_signal 1803831
access 1885436
mmap 2655710
llseek 3266410
open 3614795
fsat 4043369
getdents64 4253373
mprotect 4345864
close 4547395
fcntl 4871095
stat 5183012
mmap64 5308789
fstat64 8141116
munmap 10774529
pollsys 11928811
lwp_mutex_timedlock 19212980
open64 19443958
lwp_cond_wait 23144761
read 28043647
stat64 31710269
resolvepath 70641662
yield 77731698
write 137678654
TOTAL: 489282936
Syscall Counts for PID 5695,
SYSCALL COUNT
getrlimit 1
getsockname 1
gtime 1
lwp_kill 1
rexit 1
so_socketpair 1
sysconfig 1
systeminfo 1
unlink 1
lwp_continue 2
lwp_create 2
schedctl 2
lwp_self 3
priocntlsys 3
so_socket 3
lwp_exit 4
setcontext 7
lwp_park 11
memcntl 14
brk 16
lwp_sigmask 19
mprotect 25
mmap 56
access 67
lwp_cond_broadcast 86
ioctl 87
open 100
fsat 129
mmap64 133
lwp_mutex_wakeup 148
munmap 153
lwp_cond_signal 162
getdents64 224
stat 283
pollsys 348
lwp_cond_wait 537
lwp_mutex_timedlock 558
open64 590
fcntl 625
close 777
stat64 1146
llseek 1368
read 1404
fstat64 1559
resolvepath 1644
yield 10049
write 13416
TOTAL: 35769
Update 2:
Here is the source of the dtraced Java application. It just loads several thousand classes from one single .jar file, reproducing the same behaviour observed in a bigger app.
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.Properties;
public class TestCL {
public static void main(String[] args) {
Properties properties = new Properties();
InputStream in = TestCL.class.getClassLoader().getResourceAsStream("TestCL.properties");
if (in == null) {
System.err.println("TestCL.properties not found");
System.exit(-1);
}
try {
properties.load(in);
} catch (IOException e) {
System.err.println("TestCL.properties loading error");
e.printStackTrace();
System.exit(-2);
} finally {
try {
in.close();
} catch (IOException e) {
e.printStackTrace();
}
}
Enumeration e = properties.keys();
while (e.hasMoreElements()) {
String className = (String) e.nextElement();
try {
Class.forName(className);
System.out.println("OK " + className);
} catch (Throwable t) {
System.err.println("KO " + className + " " + t.getClass().getName() + " " + t.getMessage());
}
}
}
}
Update 3:
All classes are packed in a single .jar file. We have also run the test with all classes unpacked and there's no significant improvement on performance.
java -cp library.jar:. TestCL
If your using Solaris, you are lucky to be able to use dtrace. This will allow you to profile to the kernel level and get more clues about how the jvm is interacting with the kernel.
More information here
http://www.devx.com/Java/Article/33943
If you want find out what you jvm is doing then run dtrace with the jvm probes.
http://java.sun.com/javase/6/docs/technotes/guides/vm/dtrace.html
http://www.solarisinternals.com/wiki/index.php/DTrace_Topics_Java#DTrace_Topics:_Java
this will give you much more meaningful output in relation to your program. Have a look at 'Method Times' section.
http://www.princeton.edu/~unix/Solaris/troubleshoot/diskio.html
is a great guide for finding i/o bottle necks.
this may also help http://prefetch.net/blog/index.php/2007/03/03/viewing-busy-code-paths-with-dtrace/
There are no hard and fast rules when tracking problems like this, but information is the key!!. If you follow these guides , You are well on the way to becoming a ninja system engineer.
You may be able to use jprofiler http://www.ej-technologies.com/products/jprofiler/overview.html
it's not open source, however i've had great success in tracking down java performance problems with it.
You should also run your java vm and your application with full debug logging. If you have access to the kernel logs then check them for unusual events.
Good luck.
Does anyone else on server fault have some ninja tips for fault finding problems such as this. I've got my own way of debugging, but it would be interesting to know what other think?
-----------------UPDATE--------------
I've looked at the trace again, seems that you seem to be doing a lot of resolvepaths.
This needs to be fixed. Are you using a very long path's, or running from a directory that is symlinked. Try removing symlinks, or using hardlinks and see if you get better results.
Does the app do lot's of directory traversal , then try using absolute path's , rather than relative paths.
Again, just an idea' I had over eating a cheese cake. Run dtrace with the java probes, that should drill down enough to see what java class/methods are taking the most time.
Good Luck (again). Don't give up now, I think we are very close to solution.
On your solaris system just run
sys_diag -G -I1 -l -v
and it will aggregate all performance info (CPU/Memory/Network/Disk IO/Dtrace/kernel...) and analyze the output with a single color-coded .html report of findings/bottlenecks characterizing the workload by sub-system.. This will show any/all bottlenecks as well as locking that might be occuring (lockstat,..). The latest version is v8.1 HTH.Just to post the final result.
The server (Sun T Series) is just that slow for a single-threaded Java application. Go figure.
Thank you all for the answers and support.
Your dtrace output shows that your applications spends most of the time on writes. You should check (using dtrace of course :-)) where these writes go. If they go to solaris filesystem then you should check if the filesystem is not a bottleneck.
It might be worth taking a look with iostat to see if there's a problem that's causing your disk accesses to be slower than expected. Loading up a few thousand classes shouldn't consume a lot of disk IO, especially if you've run it a few times so that the blocks are in cache.
Try
whilst your test is running, and see if any devices have a particularly high busy/waiting percentage, or if the average service times are especially high. It's just possible you've got a dying disk, or an unresponsive NFS mount killing performance.