We are looking into using BtrFS on an array of SSD disks and I have been asked to verify that BtrFS does in fact perform TRIM operations upon deleting a file. So far I have been unable to verify that the TRIM command is sent to the disks.
I know BtrFS is not considered production ready, but we like the bleeding edge, so I'm testing it. The server is running the 64-bit Ubuntu 11.04 server release (mkfs.btrfs version 0.19). I have installed the Linux 3.0.0 kernel, as the BtrFS changelog states that bulk TRIM is not available in the kernel shipped with Ubuntu 11.04 (2.6.38).
Here's my testing methodology (initially adopted from http://andyduffell.com/techblog/?p=852, with modifications to work with BtrFS):
- Manually TRIM the disks before starting:
for i in {0..10} ; do let A="$i * 65536" ; hdparm --trim-sector-ranges $A:65535 --please-destroy-my-drive /dev/sda ; done
- Verify the drive was TRIM'd:
./sectors.pl |grep + | tee sectors-$(date +%s)
- Partition the drive:
fdisk /dev/sda
- Make the file system:
mkfs.btrfs /dev/sda1
- Mount:
sudo mount -t btrfs -o ssd /dev/sda1 /mnt
- Create a file:
dd if=/dev/urandom of=/mnt/testfile bs=1k count=50000 oflag=direct
- Verify the file is on the disk:
./sectors.pl | tee sectors-$(date +%s)
- Delete the test file:
rm /mnt/testfile
- See that the test file is TRIM'd from the disk:
./sectors.pl | tee sectors-$(date +%s)
- Verify the TRIM'd blocks: diff the two most recent sectors-* files
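Since the snapshot files are named with epoch timestamps, the filenames sort chronologically, so the comparison in the last step can be scripted; a small sketch:

diff $(ls sectors-* | tail -n 2)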
At this point, the pre-delete and post-delete verifications still show the same disk blocks in use. I should instead see a reduction in the number of in-use blocks. Waiting an hour after the test file is deleted (in case it takes a while for the TRIM command to be issued) still shows the same blocks in use.
I have also tried mounting with the -o ssd,discard options, but that doesn't seem to help at all.
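For reference, the mount step with realtime TRIM enabled (the variant referred to above) looks like:

sudo mount -t btrfs -o ssd,discard /dev/sda1 /mnt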
The partition that was created from fdisk above (I keep the partition small so the verification can go faster):
root@ubuntu:~# fdisk -l -u /dev/sda
Disk /dev/sda: 512.1 GB, 512110190592 bytes
255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x6bb7542b
Device Boot      Start        End      Blocks   Id  System
/dev/sda1           63     546209     273073+   83  Linux
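For scale: the partition spans sectors 63 through 546209, i.e. 546,147 sectors × 512 bytes ≈ 266 MiB, so the 655,360-sector limit in the script below (320 MiB) covers the whole partition with room to spare.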
My sectors.pl script (I know this is inefficient, but it gets the job done):
#!/usr/bin/perl -w
use strict;
my $device = '/dev/sda';
my $start = 0;
my $limit = 655360;
foreach ($start..$limit) {
    # Print a row label every 50 sectors
    printf "\n%6d ", $_ if !($_ % 50);
    # Read one 512-byte sector as hex dump lines via hdparm
    my @sector = `/sbin/hdparm --read-sector $_ $device`;
    my $status = '.';
    foreach my $line (@sector) {
        chomp $line;
        next if $line eq '';
        next if $line =~ /$device/;
        next if $line =~ /^reading sector/;
        # Any hex line that is not all zeros marks the sector as in use
        if ($line !~ /0000 0000 0000 0000 0000 0000 0000 0000/) {
            $status = '+';
        }
    }
    print $status;    # '.' = all zeros, '+' = contains data
}
print "\n";
Is my testing methodology flawed? Am I missing something here?
Thanks for the help.
Update: After many days working on this, I was able to demonstrate that BtrFS does use TRIM. I was unable to get TRIM working on the server that we will be deploying these SSDs to. However, when testing with the same drive plugged into a laptop, the tests succeed.
Hardware used for all of this testing:
- Crucial m4 SSD (shipped with firmware 0001; later upgraded to 0009)
- the server we plan to deploy to, with the SSD attached through a RAID card
- an old laptop with the SSD attached directly (no RAID card layer)
After many failed attempts at verifying TRIM with BtrFS on the server, I decided to try the same test using an old laptop (removing the RAID card layer). The initial attempts at this test using both Ext4 and BtrFS on the laptop failed (data not TRIM'd).
I then upgraded the SSD drive firmware from version 0001 (as shipped out of the box) to version 0009. The tests were repeated with Ext4 and BtrFS and both filesystems successfully TRIM'd the data.
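If you're repeating this, the firmware revision and TRIM capability of the drive can be checked before testing; a sketch using hdparm:

# 'Firmware Revision' reports the drive firmware; a 'Data Set
# Management TRIM supported' line appears if the drive advertises TRIM
hdparm -I /dev/sda | grep -i -e firmware -e trim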
To ensure the TRIM command had time to run, I did a

rm /mnt/testfile && sync && sleep 120

before performing validation.

One thing to note if you're attempting this same test: SSDs have erase blocks that they operate on (I don't know the size of the Crucial m4's erase blocks). When the file system sends the TRIM command to the drive, the drive will only erase a complete block; if the TRIM command covers only a portion of an erase block, that block will not be TRIM'd because of the remaining valid data within it.
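As a sketch of that alignment (the sector numbers and the 1024-sector erase block below are hypothetical, not the m4's actual geometry):

# Assume the file occupies sectors 2000-99999 and the erase block is
# 1024 sectors; only blocks fully inside the TRIM'd range can be erased
START=2000; END=99999; EB=1024
FIRST=$(( (START + EB - 1) / EB * EB ))   # round start up to a boundary
LAST=$(( (END + 1) / EB * EB - 1 ))       # round end down to a boundary
echo "fully erasable sectors: $FIRST-$LAST"                   # 2048-99327
echo "left intact: $START-$((FIRST-1)) and $((LAST+1))-$END"  # the edges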
To demonstrate what I'm talking about, here is the output of the sectors.pl script above, with the test file on the SSD. Periods are sectors that contain only zeros; pluses are sectors with one or more non-zero bytes.

Test file on drive:
Test file deleted from drive (after a sync && sleep 120):

It appears that the first and last sectors of the file fall within different erase blocks from the rest of the file, so some sectors were left untouched.
A takeaway from this: some Ext4 TRIM testing instructions ask the user to verify only that the first sector of the file was TRIM'd. The tester should view a larger portion of the test file to really see whether the TRIM was successful or not.
Now to figure out why manually issued TRIM commands sent to the SSD through the RAID card work, but automatic TRIM commands do not...
Based on what I've read, there may be a flaw in your methodology.
You are assuming that TRIM will result in your SSD zeroing the blocks that have been deleted. However, this is often not the case.
http://www.redhat.com/archives/linux-lvm/2011-April/msg00048.html
BTW, I was looking for a reliable way to verify TRIM and haven't found one yet. I'd love to know if anyone finds a way.
Here is a testing methodology for Ubuntu 10.10 and Ext4. Maybe it'll help.
https://askubuntu.com/questions/18903/how-to-enable-trim
Oh, and I think you do need the discard parameter on the fstab mount. I'm not sure if the ssd parameter is needed, as I think it should auto-detect SSDs.
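A sketch of the corresponding fstab entry (device and mount point taken from the question):

/dev/sda1  /mnt  btrfs  ssd,discard  0  0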
For btrfs you need the discard mount option to enable TRIM support. A very simple but working test for functional TRIM is here: http://techgage.com/article/enabling_and_testing_ssd_trim_support_under_linux/2
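The gist of that test, sketched from memory (hdparm --fibmap reports where a file sits on disk; the LBA will differ on your system):

echo "trim test data" > /mnt/testfile && sync
hdparm --fibmap /mnt/testfile               # note the begin_LBA of the extent
hdparm --read-sector <begin_LBA> /dev/sda   # should show the file data
rm /mnt/testfile && sync && sleep 120
hdparm --read-sector <begin_LBA> /dev/sda   # all zeros if TRIM worked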
Virtually all SSDs with a SATA interface run some sort of log-structured filesystem that is completely hidden from you. The SATA TRIM command tells the device that a block is no longer in use, so the underlying log-structured filesystem can erase it, but only if the corresponding erase block (which might be substantially larger) contains nothing but blocks marked with TRIM.
I have not read the standard docs, which are here: http://t13.org/Documents/MinutesDefault.aspx?keyword=trim, but I'm not sure there is any standard-level guarantee that you'd be able to see the results of a TRIM command. If you can see something change, like the first few bytes being zeroed out at the start of an erase block, I don't think there's any guarantee that this behavior carries across different devices or perhaps even firmware versions.
If you think about the way the abstraction might be implemented, it should be possible to make the result of the TRIM command completely invisible to anyone just reading and writing blocks. Furthermore, it might be hard to tell which blocks are in the same erase block, since only the flash translation layer knows that, and it may have reordered them logically.
Perhaps there is a SATA command (an OEM command perhaps?) for fetching metadata related to the SSD's flash translation layer?
Some things to think about (to help answer your "am I missing something?" question):
- What exactly is /dev/sda? A single SSD? Or a (hardware?) RAID array of SSDs?
- If the latter, then what kind of RAID controller?
- And does your RAID controller support TRIM?
- And, finally,