The header lists all CPUs/cores and keeps resizing as I go back in time with t and T. I read through the help and tried searching.
How do I hide that header?
The TCP retransmission rate on a host is often a good indicator of network problems. How do I find out the source and destination IPs of the packets that are being retransmitted?
For context, on hosts that have sar installed, one can see the retransmission rates like so:
sar -n ETCP
10:11:02 AM atmptf/s estres/s retrans/s isegerr/s orsts/s
10:12:01 AM 0.07 1.95 0.08 0.00 1.18
10:13:01 AM 0.07 1.30 0.02 0.00 0.83
10:14:01 AM 0.07 1.40 0.02 0.00 0.85
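To actually see which flows the retransmissions belong to, I have been thinking of something along these lines (a rough sketch; eth0 is a placeholder for the busy interface, and depending on the tshark version the display-filter flag is -Y or -R):
# capture live and print only segments that Wireshark's analysis flags as retransmissions
tshark -i eth0 -Y 'tcp.analysis.retransmission' -T fields -e ip.src -e ip.dst -e tcp.srcport -e tcp.dstport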
The current HBase stable release is hbase-0.90.4; what version(s) of HDFS is it compatible with?
On a Solaris / OpenIndiana NFS server, is there a way to get per-client stats?
On our cluster we would sometimes have nodes go down when a new process would request too much memory. I was puzzled why the OOM killer does not just kill the guilty process.
The reason turned out to be that some processes get an oom_adj of -17. That makes them off-limits for the OOM killer (unkillable!).
I can clearly see that with the following script:
#!/bin/bash
# List processes whose oom_adj is non-zero (e.g. -17, which the OOM killer will not touch).
# Matching the value 0 as a whole word avoids dropping PIDs that merely contain a 0.
for i in $(grep -vw 0 /proc/[0-9]*/oom_adj | awk -F/ '{print $3}'); do
    ps -p "$i" | grep -v CMD
done
OK, it makes sense for sshd, udevd, and dhclient, but then I see regular user processes get -17 as well. Once such a user process causes an OOM event it will never get killed. This makes the OOM killer go insane: NFS rpc.statd, cron, everything that happens not to be at -17 gets wiped out. As a result the node goes down.
I have Debian 6.0 (Linux 2.6.32-3-amd64).
Does anyone know where to control the -17 oom_adj assignment behaviour?
Could launching sshd and Torque mom from /etc/rc.local be causing the overprotective behaviour?
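As a stopgap I am considering resetting the score for a user's processes by hand, roughly like this (just a sketch; USER is a placeholder, and this obviously does not explain where the -17 comes from):
# reset oom_adj back to the default of 0 for every process owned by USER
for pid in $(pgrep -u USER); do
    echo 0 > /proc/$pid/oom_adj
done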
On a misconfigured or buggy network filer (NFS NAS), writing a large file can cause the filer to freeze.
For diagnostics I need to be able to temporarily suspend all of a user's processes and later resume them.
Basically, like a kill -s SIGSTOP and kill -s SIGCONT, but for the entire user.
To do that, is there a way to temporarily take away all CPU time from a user in Linux?
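The closest I have come up with is signalling every process owned by the user (a sketch; USER is a placeholder):
pkill -STOP -u USER    # freeze everything the user owns
# ... poke at the filer ...
pkill -CONT -u USER    # let the user's processes continue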
This question is related to NexentaStor vs FreeNAS and Is FreeNAS reliable?
I have been using OpenIndiana / Illumos as the OS for my self-built NAS.
There is nothing much to it:
I also wrote a few Bash scripts, cronned to run every minute, that write down the output of zfs get all to the shared filesystem so that I can monitor things like disk usage, compression ratio, and dedup ratio on the client side.
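For reference, the cron side amounts to something like this (a sketch; the pool name and output path are placeholders):
# crontab entry: dump all ZFS properties to a file on the shared filesystem once a minute
* * * * * /usr/sbin/zfs get all tank > /tank/share/zfs-properties.txt 2>&1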
I don't need any other features.
How will FreeNAS compare to OpenSolaris in terms of speed, driver availability, and robustness?
I have NFS shared among 30 cluster nodes. The nodes run Debian 5 and 6. The NFS server is OpenSolaris 2009. We have good hardware and a 20Gbit InfiniBand network.
On the cluster nodes, fs operations are snappy but not when it comes to:
Rscript <(echo "library(GOstats)")
They all get stuck for a few minutes on one of the following system calls:
fcntl(3, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=1073741824, len=1})
or
fcntl(3, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824, len=1})
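To confirm it is really this lock call that blocks, and for how long, something like the following strace invocation might help (a sketch; strace is assumed to be installed on the Debian nodes):
# follow child processes and print relative timestamps for every fcntl call
strace -f -r -e trace=fcntl Rscript <(echo "library(GOstats)")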
What could be the cause? How do I diagnose and fix it?
Would switching the NFS server to OpenIndiana oi_148 fix it?
I have been exporting NFS from OpenSolaris like this (successfully):
zfs set sharenfs=root=rw=host1:host2:host3 pool1
I'm acting according to the sharefs and share_nfs man pages, but the following does not work:
zfs set sharenfs=root=rw=host1:host2:host3,ro=host4 pool1
All hosts lose access permission.
How can I share to some hosts as read/write and to some as read only?
On a Linux DHCP server I'm getting a bunch of these log lines:
dhcpd: DHCPDISCOVER from 00:30:48:fe:5c:9c via eth1: network 192.168.2.0/24: no free leases
I don't have any machines with 00:30:48:fe:5c:9c and I don't intend to give out an IP to 00:30:48:fe:5c:9c (whatever that could be).
I tracked down the server that this is coming from and killed all the DHCP clients that were running, but the DHCPDISCOVER requests do not stop.
I can prove that this is the sending server by pulling the Ethernet cable - the requests stop.
The strange thing is that the sending server only has two Ethernet interfaces, with MACs 00:30:48:fe:5c:9a and 00:30:48:fe:5c:9b (see the output below).
What could be the cause of the off-by-one MAC address? Who could be sending the requests?
My DHCP client is the default in Debian 6.0 (Squeeze) http://packages.debian.org/squeeze/isc-dhcp-client
On the DHCP client host:
root@n34:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 100
link/ether 00:30:48:fe:5c:9a brd ff:ff:ff:ff:ff:ff
3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN qlen 1000
link/ether 00:30:48:fe:5c:9b brd ff:ff:ff:ff:ff:ff
4: ib0: <BROADCAST,MULTICAST> mtu 2044 qdisc noop state DOWN qlen 256
link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:08:81:9f brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
5: ib1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP qlen 256
link/infiniband 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:08:81:a0 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
On the DHCP client host (same info as above):
root@n34:~# ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:30:48:fe:5c:9a
inet addr:192.168.2.234 Bcast:192.168.2.255 Mask:255.255.255.0
inet6 addr: fe80::230:48ff:fefe:5c9a/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:72544 errors:0 dropped:0 overruns:0 frame:0
TX packets:152773 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:4908592 (4.6 MiB) TX bytes:89815782 (85.6 MiB)
Memory:dfd60000-dfd80000
eth1 Link encap:Ethernet HWaddr 00:30:48:fe:5c:9b
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Memory:dfde0000-dfe00000
ib0 Link encap:UNSPEC HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
BROADCAST MULTICAST MTU:2044 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
ib1 Link encap:UNSPEC HWaddr 80-00-00-49-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:192.168.3.234 Bcast:192.168.3.255 Mask:255.255.255.0
inet6 addr: fe80::202:c903:8:81a0/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:1330 errors:0 dropped:0 overruns:0 frame:0
TX packets:255 errors:0 dropped:5 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:716415 (699.6 KiB) TX bytes:17584 (17.1 KiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:8 errors:0 dropped:0 overruns:0 frame:0
TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:560 (560.0 B) TX bytes:560 (560.0 B)
The nodes were imaged with Perseus which uses kexec instead of rebooting.
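To see the rogue DISCOVERs on the wire and which MAC actually emits them, a capture on the DHCP server along these lines should work (a sketch; eth1 is the interface the dhcpd log line points at):
# print DHCP traffic with link-layer (MAC) headers and no name resolution
tcpdump -i eth1 -e -n 'udp port 67 or udp port 68'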
I have the following NFS-based storage setup:
Compute nodes are Linux. The NFS servers are Solaris.
A not-so-important user runs a bunch of read-intensive jobs on a subset of the compute nodes. As a result, the whole group of compute nodes becomes very slow (ls blocks for 30 seconds). I was able to track this down to the dedicated NFS server hitting the limit of the SAN's read throughput.
How can I implement quality of service (QoS) that limits NFS bandwidth per node, process, or user?
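One direction I have been considering is plain Linux traffic shaping on the compute nodes, roughly like this (a rough sketch; eth0, the rates, and the NFS port are assumptions, and it only shapes traffic sent towards the server, so limiting read throughput would still need policing on ingress or shaping on the server side):
# cap this node's traffic to the NFS port at 100 Mbit/s, everything else gets the full 1 Gbit/s
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 1gbit
tc class add dev eth0 parent 1: classid 1:20 htb rate 100mbit
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dport 2049 0xffff flowid 1:20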
Rsyslog is backwards-compatible with Syslog configuration files.
The syslog.conf man page has:
You may prefix each entry with the minus ``-'' sign to omit syncing the file after every logging. Note that you might lose information if the system crashes right behind a write attempt. Nevertheless this might give you back some performance, especially if you run programs that use logging in a very verbose manner.
but I could not find anything about the - sign in man rsyslog.conf.
What does rsyslog do when it reads a - in the config file?
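For concreteness, this is the kind of entry I mean (the path is just the stock Debian example):
# the leading "-" means: do not sync the file after every write
mail.*    -/var/log/mail.log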
When a server gets rooted (e.g. a situation like this), one of the first things you may decide to do is containment. Some security specialists advise not to start remediation immediately and to keep the server online until forensics are completed. That advice usually applies to APTs; it's different if you have an occasional script-kiddie breach, where you may decide to remediate (fix things) early. One of the steps in remediation is containment of the server. Quoting from Robert Moir's answer: "disconnect the victim from its muggers".
A server can be contained by pulling the network cable or the power cable.
Which method is better?
Taking into consideration the need for:
Edit: 5 assumptions
Assuming:
There is a Xen vs. KVM performance question on ServerFault.
What will be the speed difference if the choice is between Xen and OpenVZ?
Searching for such benchmarks does not show any results newer than 2008.
What would be some important performance measurements to compare OpenVZ against Xen?
Some may say "you're comparing oranges and pineapples", but I have to choose one of the two and it needs to be a wise choice. Performance is most important to us. We may switch away from OpenVZ because Xen is more ubiquitous, but only if the performance overhead is not significant. Next month (January 2011) I'm thinking of doing my own performance comparison - here is the project planning blog.
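To make the planned comparison concrete, these are the kinds of micro-benchmarks I have in mind, run inside an OpenVZ container and a Xen domU on the same hardware (a sketch; sysbench is an assumption, any similar tool would do):
# raw CPU throughput
sysbench --test=cpu --cpu-max-prime=20000 run
# random read/write file I/O on an 8 GB working set
sysbench --test=fileio --file-total-size=8G prepare
sysbench --test=fileio --file-total-size=8G --file-test-mode=rndrw run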
We have a Fibre Channel SAN managed by two OpenSolaris 2009.06 NFS servers.
The slow 50TB server was installed a few months ago and was working fine. Users filled up 2TB. I did a small experiment (created 1000 filesystems, with 24 snapshots on each). Everything went well as far as creating and accessing the filesystems with snapshots, and NFS-mounting a few of them.
When I tried destroying the 1000 filesystems, the first fs took several minutes and then failed, reporting that the fs was in use. I issued a system shutdown, but it took more than 10 minutes. I did not wait longer and shut the power off.
Now when booting, OpenSolaris hangs. The lights on the 32 drives are blinking rapidly. I left it for 24 hours - still blinking but no progress.
I booted into a system snapshot taken before the zpool was created and tried importing the zpool.
pfexec zpool import bigdata
Same situation: LEDs blinking and the import hangs forever.
Dtracing the "zpool import" process shows only the ioctl system call:
dtrace -n 'syscall:::entry /pid == 31337/ { @syscalls[probefunc] = count(); }'
ioctl 2499
Is there a way to fix this?
Edit: Yes. Upgrading OpenSolaris to snv_134b did the trick:
pkg publisher # shows opensolaris.org
beadm create opensolaris-updated-on-2010-12-17
beadm mount opensolaris-updated-on-2010-12-17 /mnt
pkg -R /mnt image-update
beadm unmount opensolaris-updated-on-2010-12-17
beadm activate opensolaris-updated-on-2010-12-17
init 6
Now I have zfs version 3. Bigdata zpool stays at version 14. And it's back in production!
But what was it doing with all that heavy I/O for more than 24 hours (before the software upgrade)?
Looks like Debian 6.0 (Squeeze) will be supporting ZFS via the official GNU/kFreeBSD kernel.
This opens up the possibility of converting our Debian GNU/Linux cluster's dedicated NAS server from OpenSolaris 2009.06 to Debian. The server connects to the SAN via a Fibre Channel HBA and to the LAN via an InfiniBand HBA. It would probably be pretty hard to get the drivers to work on kFreeBSD.
Supposing all the drivers actually work, would this be a stable setup?
For 3 years we have had an LSI SAN with 48 300GB Seagate Cheetah 15K.5 (model ST3300655FC) 3.5-inch drives. About 7 drives have failed in total, the bulk of them recently: six drives since May 2010.
That's a rate of 0.02 (drives failed)/(month)/(drives in array) over the last 6-month period.
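(The arithmetic: 6 failures / (6 months × 48 drives) ≈ 0.02 failures per drive per month, which works out to roughly a 25% annualized failure rate.)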
There is an older SAN from HP running in the same room; I think its drives are 36GB 15K units. Those never failed.
Is it common that 300GB 15K RPM drives start failing at this rate after 3 years?
I have an OpenSolaris 2009.06 server providing ZFS over NFSv3 to a Linux 2.6.26 client. (The problem below does not happen when accessing the files via NFSv4.)
I'm very happy with it: it catches silent data corruption in our LSI SAN, performance is great, and it gives us snapshots, compression, and transaction-log-replay backups. Most importantly, we no longer have the FS caching issues and freezes that occurred on a Linux server.
There is one strange thing: Empty files are inaccessible from the Linux NFS client. When I try to ls, cat, or stat them I get:
stat: cannot stat `/srv/zpools/a/write.lock': Invalid argument
Rsync backups report:
rsync: readlink "/srv/zpools/a/write.lock" failed: Invalid argument (22)
rsync: readlink "/srv/zpools/userX/.netbeans/6.9/var/cache/mavenindex/netbeans/write.lock" failed: Invalid argument (22)
rsync: readlink "/srv/zpools/userX/.netbeans/6.9/var/cache/mavenindex/local/write.lock" failed: Invalid argument (22)
rsync: readlink "/srv/zpools/userX/javaPrograms/mavenProjects/thesis/libbn/target/test-classes/.netbeans_automatic_build" failed: Invalid argument (22)
rsync: readlink "/srv/zpools/userX/javaPrograms/mavenProjects/scalaCommon/target/test-classes/.netbeans_automatic_build" failed: Invalid argument (22)
I cannot reproduce it by creating new empty files; it only happens with some old files.
Can anyone tell what the reason could be?
Edit: On the ZFS server, when stat'ing the strange files I found that the modification time was back in 1927. :) Touching the file on the server fixed the problem on the NFS client.
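To catch any remaining affected files in one pass, something along these lines could work on the server (a sketch; the path is the one from above, and touching resets the mtime to now, which has to be acceptable):
# find files whose mtime is absurdly old (more than ~27 years back) and reset it
find /srv/zpools -mtime +10000 -exec touch {} \;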
What is the fundamental difference between:
InfiniBand Universal I/O Card (e.g. Supermicro AOC-UINF-M2)
and
InfiniBand Host Channel Adapter (e.g. QLogic QLE7240-CK)
Can't both of those do IP-over-IB?
We are looking for a ~32TB external storage solution for two 48-core AMD servers. These will be used for a small Linux OpenVZ cloud running CPU-intensive web servers and data warehousing. Dual-path with automatic failover is pretty much a must. Hopefully the enclosure and the SAS controller would cost around $9k and 16 drives around $4k.
We initially looked at Promise's VTrak E610sD: http://www.promise.com/media_bank/Download%20Bank/Manual/VTrak_E-Class_PM_v3.2.pdf (page 35 shows the topology that we would want)
A colleague suggested Infortrend's EonStor DS S16S-R2240: http://www.infortrend.com/products/models/ESDS%20S16S-R2240
Has anyone had experience with these systems?
What are some alternatives to the above Promise and Infortrend SAS products for a two-server web+db cloud application?
This could be a good option: http://www.raidinc.com/xanadu_230.php
Would something like this also work? https://www.thinkmate.com/System/STX_JE16-0300/14991