I've recently set up a couple of new servers. This time I'm encrypting most of my partitions using dmcrypt+LUKS. However, these new servers crash very often, every few days: full lockups in which the kernel does not respond to the keyboard and the system does not answer pings. According to Munin graphs and atop records, there has been no increase in resource usage. There are no relevant records in the local syslog, none on our remote log host (to which the new servers forward syslog), and no relevant netconsole messages (the new servers forward all kernel messages to a log host using netconsole). The kernel didn't even print anything to the TTY. I asked the hosting company to perform a full hardware test, and they found nothing. I suspect LUKS. Does anybody else experience full lockups with LUKS? The only reference I could find is http://ubuntuforums.org/showthread.php?t=2125287.
Hongli Lai's questions
I'm writing a script which automatically sets up testing environment virtual machines. This script should automatically format a dmcrypt+LUKS partition for me, with a certain passphrase. Because this is a local testing environment, I don't care about the security of the passphrase; I just want the entire VM setup process to be automated and non-interactive.
How can I non-interactively supply a passphrase to cryptsetup luksFormat? I want to use passphrases, not key files, because we use passphrases for LUKS in production as well.
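For context, this is the kind of invocation I'm after (the device path and passphrase are examples; as I understand it, cryptsetup reads the passphrase from stdin when the key file argument is "-", and --batch-mode suppresses the interactive YES confirmation):

```shell
PASS="my-test-passphrase"   # obviously not a secret in a test VM

# printf without a trailing newline matters: key-file input keeps
# trailing newlines verbatim, while interactive input strips them.
printf '%s' "$PASS" | cryptsetup luksFormat --batch-mode /dev/vdb1 -
printf '%s' "$PASS" | cryptsetup luksOpen --key-file=- /dev/vdb1 testvol
```

The same printf-pipe would then have to be used everywhere the volume is opened, so the newline handling stays consistent.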
I am running Ubuntu 12.04 LTS. Yesterday I found a message in my mailbox saying that my server had been shut down. I proceeded to reboot the system, but it still hadn't come up after many minutes, and I didn't have a hardware KVM to see what the kernel was printing to the terminal. So I booted the system into a Linux rescue image and saw that the software RAID 1 array was out of sync. The rescue system also began reconstructing the RAID array.
So far there is no evidence that any of the disks have hardware errors; their SMART statuses look good.
I never received an email notification from mdadm, even though email notification is turned on in /etc/mdadm/mdadm.conf.
This server was also configured to forward all syslog messages to a log host, so I checked my log host. The relevant parts are:
May 20 15:38:40 kernel: [ 1.869825] md0: detected capacity change from 0 to 536858624
May 20 15:38:40 kernel: [ 1.870687] md0: unknown partition table
May 20 15:38:40 kernel: [ 1.877412] md: bind
May 20 15:38:40 kernel: [ 1.878337] md/raid1:md1: not clean -- starting background reconstruction
May 20 15:38:40 kernel: [ 1.878376] md/raid1:md1: active with 2 out of 2 mirrors
May 20 15:38:40 kernel: [ 1.878418] md1: detected capacity change from 0 to 3000052808704
May 20 15:38:40 kernel: [ 1.878575] md: resync of RAID array md1
[snip]
May 20 15:52:33 kernel: Kernel logging (proc) stopped.
May 20 15:52:33 rsyslogd: [origin software="rsyslogd" swVersion="5.8.6" x-pid="845" x-info="http://www.rsyslog.com"] exiting on signal 15.
As you can see, the system (the normal one, not the rescue system) already detected that something was wrong with the RAID array during a system boot. Then, shortly after, something (not me) halted the system.
So my questions are:
- What could cause the disks to suddenly become out of sync?
- Why was I not notified by email?
- Why was the error not properly logged to syslog before the system was halted? Could it be that the system tried to log to syslog, but only after the syslog daemon had already been stopped? If so, what can I do to prevent that?
- What can I do to find out what happened? Or, if there's no way for me now to find out what happened, how can I improve logging and notifications so that next time I can do a better post-mortem?
My question is not about proper backup practice. I already know that RAID is not a backup etc. My question is solely about notifications and diagnosis.
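On the notification point: a sketch of how I understand the mail path is supposed to be wired up, with a hypothetical recipient address (this is what I believe my config already does, condensed for reference):

```
# /etc/mdadm/mdadm.conf -- the address that mdadm --monitor mails alerts to:
MAILADDR [email protected]

# One-shot check of the mail path: sends a TestMessage event for every
# array, exercising the same delivery route a real Fail event would use:
#   mdadm --monitor --scan --oneshot --test
```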
I want to install the OS (Ubuntu 12.04) on the second hard drive, but I'm unable to access the BIOS so I must install Grub on the first hard drive. How do I install Grub on the first hard drive, and have Grub boot the OS from the second hard drive?
Here's what I've tried so far. I installed Ubuntu on the second hdd, which had a /boot partition and a root partition. It didn't boot because the boot loader was on the second hdd but the system insists on booting from the first.
Then I booted from a rescue system, chrooted into my second hdd's root partition, mounted the second hdd's boot partition on /boot, and ran grub-install /dev/sda. grub-install refused to run because it couldn't find any partitions on the first hdd (which indeed had none).
So I made a boot partition on the first hdd and copied the second hdd's boot partition's contents to it. This time grub-install succeeded, and the system booted.
But even though the system booted from the first hdd's boot partition, once booted it mounts the second hdd's boot partition. That can't be good for kernel upgrades, so I edited /etc/fstab, changed /boot's device to /dev/sda, ran update-grub && grub-install /dev/sda, and rebooted. This time it seems to work too, except that grub's timeout is gone. Since this is a server that I usually access remotely, grub not booting automatically is problematic. grub.cfg contains the timeout option, but the timeout doesn't actually take effect, leading me to think that I may be installing grub incorrectly.
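For reference, the rescue-system procedure I followed, condensed (device names are illustrative, my actual partition layout may differ):

```shell
# Root fs on the second disk, boot partition on the first disk:
mount /dev/sdb2 /mnt
mount /dev/sda1 /mnt/boot
# Bind mounts so update-grub and grub-install work inside the chroot:
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
chroot /mnt update-grub
chroot /mnt grub-install /dev/sda   # MBR of the disk the BIOS boots from
```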
Bad sectors will eventually occur, but how should I deal with them? If a bad sector occurs, does that mean that the data in that sector is irrecoverably lost, and I should restore it from backup? Is there any way to automate finding out which file belonged to that sector and at which offset, and to automate that recovery? Is there anything I can do on the filesystem level to make my life easier? (ECC?)
The smartctl tool allows initiating a long self-test (smartctl -t long /dev/sda). However, there's also badblocks, which I can run on a drive. How are the two related? If badblocks detects bad blocks, does the drive automatically update its SMART values (e.g. by updating its reallocated sectors count)? Can badblocks replace smartctl -t long, or vice versa?
I have a mail server "example.com" which forwards all emails with recipient "[email protected]" to "[email protected]". My mail server runs Postfix and it uses the virtual_alias_maps mechanism to perform the forwarding. I also have SPF records installed for "example.com":
v=spf1 a include:aspmx.googlemail.com ~all
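The forwarding setup itself is minimal; a sketch of the relevant Postfix configuration (file path and map type are illustrative, not my literal config):

```
# main.cf
virtual_alias_domains = example.com
virtual_alias_maps = hash:/etc/postfix/virtual

# /etc/postfix/virtual (run "postmap /etc/postfix/virtual" after editing)
[email protected]    [email protected]
```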
The problem is, whenever someone delivers mail to "[email protected]", Gmail validates the example.com SPF record against example.com's IP address! I thought it was supposed to validate against the original sender's IP address.
For example, I'm on my laptop on my home Internet connection. I connect to example.com's mail server as follows:
$ telnet example.com 25
220 example.com ESMTP Postfix (Debian/GNU)
HELO my-laptop.local
250 example.com
MAIL FROM:<[email protected]>
250 2.1.0 Ok
RCPT TO:<[email protected]>
250 2.1.5 Ok
DATA
354 End data with <CR><LF>.<CR><LF>
From: [email protected]
To: [email protected]
Subject: test
test
.
250 2.0.0 Ok: queued as CE5F42200F9
Now when I open that mail in Gmail and view its source, I see the following headers:
Delivered-To: [email protected]
Received: by 10.231.219.195 with SMTP id hv3csp61494ibb;
Sat, 14 Jul 2012 02:15:58 -0700 (PDT)
Received: by 10.229.135.5 with SMTP id l5mr2360326qct.5.1342257358291;
Sat, 14 Jul 2012 02:15:58 -0700 (PDT)
Return-Path: <[email protected]>
Received: from example.com [EXAMPLE.COM's IP ADDRESS HERE]
by mx.google.com with ESMTP id u9si4262071qcv.89.2012.07.14.02.15.58;
Sat, 14 Jul 2012 02:15:58 -0700 (PDT)
Received-SPF: neutral (google.com: [EXAMPLE.COM's IP ADDRESS HERE] is neither permitted nor denied by domain of [email protected]) client-ip=[EXAMPLE.COM's IP ADDRESS HERE];
Authentication-Results: mx.google.com; spf=neutral (google.com: [EXAMPLE.COM's IP ADDRESS HERE] is neither permitted nor denied by domain of [email protected]) [email protected]
Date: Sat, 14 Jul 2012 02:15:58 -0700 (PDT)
Message-Id: <[email protected]>
Received: from my-laptop.local ([LAPTOP's IP ADDRESS HERE])
by example.com (Postfix) with SMTP id CE5F42200F9
for <[email protected]>; Sat, 14 Jul 2012 09:15:44 +0000 (UTC)
From: [email protected]
To: [email protected]
Subject: test
As you can see in Received-SPF and Authentication-Results, the SPF records are being validated against [EXAMPLE.COM's IP ADDRESS] instead of [LAPTOP's IP ADDRESS].
Why does this happen, and how do I fix this problem?
I have a host, let's call it foo.com, on which I'm running Postfix on Debian. Postfix is currently configured to do these things:
- All mail with @foo.com as recipient is handled by this Postfix server. It forwards all such mail to my Gmail account. The firewall thus allows port 25.
- All mail with another domain as recipient is rejected.
- SPF records have been set up for the foo.com domain, saying that foo.com is the sole origin of all mail from @foo.com.
- Applications running on foo.com can connect to localhost:25 to deliver mail, with [email protected] as sender.
However I recently noticed that some spammers are able to send spam to me while passing the SPF checks. Upon further inspection, it looks like they connect to my Postfix server and then say
HELO bar.com
MAIL FROM:<[email protected]> <---- this!
RCPT TO:<[email protected]>
DATA
From: "Buy Viagra" <[email protected]> <--- and this!
...
How do I prevent this? I only want applications running on localhost to be able to say MAIL FROM:<[email protected]>. Here's my current config (main.cf): https://gist.github.com/1283647
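To illustrate the kind of restriction I have in mind, a sketch of what I imagine in main.cf (untested, and the access-map path is hypothetical):

```
# main.cf: check the envelope sender; permit_mynetworks comes first so
# that localhost may still use foo.com sender addresses.
smtpd_sender_restrictions =
    permit_mynetworks,
    check_sender_access hash:/etc/postfix/sender_access

# /etc/postfix/sender_access (run "postmap" on it afterwards):
foo.com    REJECT
```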
My server has 2 hard disks. I've installed smartmontools on Debian with apt-get install smartmontools, enabled it in /etc/default/smartmontools (start_smartd=yes) and started the daemon (/etc/init.d/smartmontools start).
My /etc/smartd.conf contains this:
DEVICESCAN -d removable -n standby -m [email protected] -M exec /usr/share/smartmontools/smartd-runner
Is smartmontools now configured to run regular health checks? If so, how do I see when it does that? I don't see any indication in smartctl -l selftest of health checks being run regularly; that command only showed the two tests I recently ran manually. I also don't see anything in /var/log/messages that indicates regular health checks are being run.
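For comparison, this is how a self-test schedule would be added to that DEVICESCAN line with smartd's -s directive (the times below are arbitrary examples):

```
# Short self-test daily at 02:00, long self-test every Saturday at 03:00.
# -s takes a regex matched against a "T/MM/DD/d/HH" date template.
DEVICESCAN -d removable -n standby -s (S/../.././02|L/../../6/03) -m [email protected] -M exec /usr/share/smartmontools/smartd-runner
```

Without a -s directive, my understanding is that smartd only monitors SMART attributes and does not initiate self-tests on its own.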
I have recently obtained a new dedicated server which I'm now setting up. It's running on 64-bit Debian 6.0. I have cloned a fairly large git repository (177 MB including working files) onto this server. Switching to a different branch is very slow: on my laptop it takes 1-2 seconds, but on this server it can take half a minute. After some investigation it turns out to be some kind of DNS timeout. Here's an exhibit from strace -s 128 git checkout release:
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=132, ...}) = 0
socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 5
connect(5, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("213.133.99.99")}, 16) = 0
poll([{fd=5, events=POLLOUT}], 1, 0) = 1 ([{fd=5, revents=POLLOUT}])
sendto(5, "\235\333\1\0\0\1\0\0\0\0\0\0\35Debian-60-squeeze-64-minimal\n\17happyponies\3com\0\0\1\0\1", 67, MSG_NOSIGNAL, NULL, 0) = 67
poll([{fd=5, events=POLLIN}], 1, 5000) = 0 (Timeout)
This snippet repeats several times per 'git checkout' call.
My server's hostname was originally Debian-60-squeeze-64-minimal. I had changed it to shell.happyponies.com by running hostname shell.happyponies.com, editing /etc/hostname and rebooting the server.
I don't understand the DNS protocol, but it looks like Git is trying to look up the IP for Debian-60-squeeze-64-minimal as well as for happyponies.com. Why does Debian-60-squeeze-64-minimal come back even though I've already changed the host name? Why does Git perform DNS lookups at all? And why are these lookups so slow? I've already verified that all DNS servers in /etc/resolv.conf are up and responding, yet Git's own lookups time out.
Changing the host name back to Debian-60-squeeze-64-minimal seems to fix the slowness.
Basically I just want to fix whatever DNS issues my server has, because I'm sure they will cause more problems than just slowing down git checkout. But I'm not sure what exactly the problem is or what these symptoms mean.
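The checks I've been using to compare what I configured against what the resolver actually does (a diagnostic sketch, nothing here should modify the system):

```shell
hostname                    # the kernel's idea of the host name
cat /etc/hostname           # what I configured
# Can the resolver answer the host name locally (nsswitch: files first),
# or does it have to go out over the network?
getent hosts "$(hostname)" || echo "not resolvable from /etc/hosts"
```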
I'm trying to change the TCP TIME_WAIT timeout on Solaris. According to some Google results I need to run this command:
ndd -set /dev/tcp tcp_time_wait_interval 60000
However I get:
operation failed: Not owner
What am I doing wrong? I'm already running ndd as root. Is there another way to tune TIME_WAIT?
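If this box is Solaris 11, part of the explanation may simply be that ndd has been superseded; as far as I can tell the ipadm equivalent would be something like the following (untested, and the leading-underscore private-property spelling is my assumption):

```shell
# Solaris 11: TCP tunables moved from ndd to ipadm; tcp_time_wait_interval
# is supposedly now the private property _time_wait_interval.
ipadm set-prop -p _time_wait_interval=60000 tcp
ipadm show-prop -p _time_wait_interval tcp   # verify the new value
```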
I have a 64-bit RHEL 5.3 server. There's a piece of server software that's more memory efficient if I compile it as 32-bit. Is there a way to tell GCC to target 32-bit?
I just want a specific piece of software to be 32-bit, everything else should stay 64-bit.
This software is not packaged in the yum repositories so I cannot just do 'yum install 32-bit-version'.
I've written a web application for which the user interface is in Dutch. I use the system's date and time routines to format date strings in the application. However, the system formats date strings in English and I want them in Dutch, so I need to set the system's locale. How do I do that on Debian? I tried setting LC_ALL=nl_NL but it doesn't seem to have any effect:
$ date
Sat Aug 15 14:31:31 UTC 2009
$ LC_ALL=nl_NL date
Sat Aug 15 14:31:36 UTC 2009
I remember that setting LC_ALL on my Ubuntu desktop system works fine. Do I need to install extra packages to make this work, or am I doing it entirely wrong?
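What I've tried to piece together so far: the nl_NL locale apparently has to be generated before LC_ALL can select it. A sketch, assuming Debian's standard "locales" package layout (needs root):

```shell
# Register the locale and generate it:
echo 'nl_NL.UTF-8 UTF-8' >> /etc/locale.gen
locale-gen
locale -a | grep -i nl_nl    # verify it now exists
LC_ALL=nl_NL.UTF-8 date      # should now print a Dutch date string
```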