Question

tsc_chazz

Asked: 2023-08-09 06:40:45 +0800 CST

File access on SSD array suddenly slowed down; TRIM appears to be unavailable. How to enable, or what else could it be?

We have a system used for a GIS database (with Postgres as the underlying engine). Its database drive is a software RAID 5 array of 4x2TB Samsung 870 EVO SATA SSDs. A nightly backup script dumps the tables to a local temporary directory, gzips them, and transfers them to a separate machine (with mv). Normally the backup starts at 1830 and runs until 0500; yes, it's a big backup.

A month or so ago, the external system fell offline, so the mv step stopped working and the temporary storage area filled up with unmoved files. After the external system was repaired, we noticed that the temp area was full and deleted everything in it, about 3.5TB of files. About two weeks ago, we noticed that the daily backup was not completing until 1000.

My suspicion is that things have slowed down because the blocks freed from the temp directory were never discarded (TRIMmed), so the SSDs still treat them as in use; when the backup writes a new temp file, the drives have to erase those blocks before they can rewrite them.

fstrim -av does not print anything, which suggests that no mounted filesystem reports support for DISCARD.

This system does have LVM on top of the RAID array. The database and temp directories are in an ext4 filesystem (was ext2, but stuff happened) in its own LV that is mounted at /db; fstrim -v /db reports File system does not support DISCARD.
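
Since the stack here is ext4 on an LV on md0 on the SATA disks, one way to see which layer stops DISCARD is to read each block device's discard_max_bytes from sysfs; a value of 0 generally means the kernel will not send discards to that device. The following is only a sketch: the function name discard_report and its optional sysfs-root argument are made up for illustration, and it only reads files.

```shell
# Print discard capability for each block device found under a sysfs root.
# discard_report and its optional argument are hypothetical names.
discard_report() {
    sysroot="${1:-/sys}"
    for f in "$sysroot"/block/*/queue/discard_max_bytes; do
        # Skip the unexpanded glob or unreadable entries.
        [ -r "$f" ] || continue
        # Reduce ".../block/sda/queue/discard_max_bytes" to "sda".
        dev=${f#"$sysroot"/block/}
        dev=${dev%%/*}
        max=$(cat "$f")
        if [ "$max" -gt 0 ]; then
            echo "$dev: discard supported (max $max bytes)"
        else
            echo "$dev: no discard"
        fi
    done
}
```

Calling discard_report with no argument reads /sys. If the sdX disks report a nonzero value while md0 and the dm-N devices report 0, that would point at the RAID layer rather than at the drives or the filesystem.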

OS version: Debian Linux 8 (jessie), Linux 3.16.0-4-amd64 x86_64

RAID information:

root@local-database:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 sda1[7] sdd1[4] sdc1[5] sdb1[6]
      5860147200 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 1/2 pages [4KB], 524288KB chunk

root@local-database:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Sun Dec 27 17:55:35 2015
     Raid Level : raid5
     Array Size : 5860147200 (5588.67 GiB 6000.79 GB)
  Used Dev Size : 1953382400 (1862.89 GiB 2000.26 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Aug  8 14:07:27 2023
          State : clean 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : local-database:0  (local to host local-database)
           UUID : 18d38d9a:daaa0652:8e43a020:133e5a4f
         Events : 53431

    Number   Major   Minor   RaidDevice State
       7       8        1        0      active sync   /dev/sda1
       6       8       17        1      active sync   /dev/sdb1
       5       8       33        2      active sync   /dev/sdc1
       4       8       49        3      active sync   /dev/sdd1

Information about the specific LV used for the database and temp areas:

  --- Logical volume ---
  LV Path                /dev/MainDisk/postgres
  LV Name                postgres
  VG Name                MainDisk
  LV UUID                TpKgGe-oHKS-Y341-029v-jkir-lJn8-jo8xmZ
  LV Write Access        read/write
  LV Creation host, time local-database, 2015-12-27 18:04:04 -0800
  LV Status              available
  # open                 1
  LV Size                4.78 TiB
  Current LE             1251942
  Segments               4
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     6144
  Block device           253:2

PV information:

root@local-database:~# pvdisplay
  --- Physical volume ---
  PV Name               /dev/md0
  VG Name               MainDisk
  PV Size               5.46 TiB / not usable 2.50 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              1430699
  Free PE               121538
  Allocated PE          1309161
  PV UUID               N3tcTa-LBw2-D8gI-6Jg4-9v3T-KWn2-5CDVzK

I would really like to get the backup times back down to 11 hours, so that we're no longer colliding with actual work times. Is there something in the TRIM options that I can do here, or is there something else I've missed? I have checked that the database did not suddenly grow any new tables or grow 50% overnight, and as far as I can see there were no network connection issues and nothing odd happened to the network or the external server just before the backup started taking 16 hours.

Edit due to comments: The actual SSDs are only a year and a half old; they replaced the original 250GB SSDs in April 2022. (We ran out of space, and the RAID array, LV, and filesystem were expanded in place.) We're using software RAID, bone-standard Linux with mdadm.

Edit in response to comments:

root@local-database:~# lsblk -d
NAME MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda    8:0    0  1.8T  0 disk
sdb    8:16   0  1.8T  0 disk
sdc    8:32   0  1.8T  0 disk
sdd    8:48   0  1.8T  0 disk

root@local-database:~# cat /sys/module/raid456/parameters/devices_handle_discard_safely
N

root@local-database:~# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 2
Model name:            AMD FX(tm)-8320 Eight-Core Processor
Stepping:              0
CPU MHz:               1400.000
CPU max MHz:           3500.0000
CPU min MHz:           1400.0000
BogoMIPS:              7023.19
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7

According to an article linked by Nikita Kyprianov in the comments below, Samsung 870 EVOs have serious trouble with AMD hardware, which this clearly is. So that would seem to be that. I guess we'll just have to live with it...

1 Answer

symcbean · Answer 1 · 2023-08-10T00:22:26+08:00

Best Answer

You need to enable discard support in /etc/lvm/lvm.conf (issue_discards = 1, in the devices section).
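
To see what is currently set before editing anything, something like the sketch below can pull the value out of the config file. The function name lvm_issue_discards is made up, and the default path is an assumption about where your distribution keeps lvm.conf. Note also that the comments shipped in lvm.conf describe this setting as covering space released by lvremove and lvreduce, so it is worth re-testing fstrim afterwards rather than assuming this alone fixes it.

```shell
# Print the issue_discards value (0 or 1) from an LVM config file.
# lvm_issue_discards is a hypothetical helper; the default path is an assumption.
lvm_issue_discards() {
    conf="${1:-/etc/lvm/lvm.conf}"
    # Match only uncommented "issue_discards = N" lines; keep the last one found.
    sed -n 's/^[[:space:]]*issue_discards[[:space:]]*=[[:space:]]*\([01]\).*$/\1/p' "$conf" | tail -n 1
}
```

Running lvm_issue_discards with no argument reads the default file; empty output means no uncommented setting was found.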

I can't remember if this also needs to be set in md, but there's no mention in my local man pages.
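
On the md side, the question's own output shows the raid456 module parameter devices_handle_discard_safely set to N, and as far as I know RAID 4/5/6 arrays do not pass discards down while that is the case. A small sketch for checking it follows; the function name raid456_discard_state and its optional path argument are hypothetical, and it only reads the parameter.

```shell
# Report the state of the raid456 discard override.
# raid456_discard_state and its optional argument are hypothetical names.
raid456_discard_state() {
    param="${1:-/sys/module/raid456/parameters/devices_handle_discard_safely}"
    if [ ! -r "$param" ]; then
        echo "raid456 parameter not readable (module not loaded?)"
        return 1
    fi
    case "$(cat "$param")" in
        Y) echo "override enabled: raid456 may pass discards if the members support them" ;;
        N) echo "override disabled: raid456 will not pass discards" ;;
        *) echo "unexpected value in $param" ;;
    esac
}
```

If it reports disabled, the usual ways to change it are raid456.devices_handle_discard_safely=Y on the kernel command line or an "options raid456 devices_handle_discard_safely=Y" line under /etc/modprobe.d/. My understanding is that the value is read when the array is started, so expect to need a reboot or a restart of the array. The kernel only considers this safe when every member drive reliably returns zeroes when discarded regions are read, so check that for your drives first.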
