tsc_chazz's questions -server

tsc_chazz

Asked: 2023-08-09 06:40:45 +0800 CST

File access on SSD array suddenly slowed down; TRIM appears to be unavailable. How to enable, or what else could it be?

7

We have a system that's used for a GIS database (with Postgres as the underlying engine) which is using a software RAID 5 array of 4x2TB Samsung EVO870 SATA SSDs as its database drive. There is a nightly backup script that dumps the tables to a local temporary directory, GZips them, and transfers them to a separate machine (with mv). Normally the backup starts at 1830 and runs until 0500; yes, it's a big backup. A month or so ago, the external system fell off line, and so the mv step stopped working, and the temporary storage area filled up with unmoved files. After the external system was repaired, we noticed that the temp area was full and deleted everything out of it - about 3.5TB of files. About two weeks ago, we noticed that the daily backup was not completing until 1000. My suspicion is that things have slowed down because the temp directory, though erased, is not being purged, so when we have to write a new temp file as part of the backup, we have to clean SSD blocks before we can rewrite them.

fstrim -av does not print anything, which suggests that no filesystems are saying they have support for DISCARD.

This system does have LVM on top of the RAID array. The database and temp directories are in an ext4 filesystem (was ext2, but stuff happened) in its own LV that is mounted at /db; fstrim -v /db reports File system does not support DISCARD.

OS version: Debian Linux 8 (jessie), Linux 3.16.0-4-amd64 x86_64

RAID information:

root@local-database:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 sda1[7] sdd1[4] sdc1[5] sdb1[6]
      5860147200 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 1/2 pages [4KB], 524288KB chunk

root@local-database:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Sun Dec 27 17:55:35 2015
     Raid Level : raid5
     Array Size : 5860147200 (5588.67 GiB 6000.79 GB)
  Used Dev Size : 1953382400 (1862.89 GiB 2000.26 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Aug  8 14:07:27 2023
          State : clean 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : local-database:0  (local to host local-database)
           UUID : 18d38d9a:daaa0652:8e43a020:133e5a4f
         Events : 53431

    Number   Major   Minor   RaidDevice State
       7       8        1        0      active sync   /dev/sda1
       6       8       17        1      active sync   /dev/sdb1
       5       8       33        2      active sync   /dev/sdc1
       4       8       49        3      active sync   /dev/sdd1

Information about the specific LV used for the database and temp areas:

  --- Logical volume ---
  LV Path                /dev/MainDisk/postgres
  LV Name                postgres
  VG Name                MainDisk
  LV UUID                TpKgGe-oHKS-Y341-029v-jkir-lJn8-jo8xmZ
  LV Write Access        read/write
  LV Creation host, time local-database, 2015-12-27 18:04:04 -0800
  LV Status              available
  # open                 1
  LV Size                4.78 TiB
  Current LE             1251942
  Segments               4
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     6144
  Block device           253:2

PV information:

root@local-database:~# pvdisplay
  --- Physical volume ---
  PV Name               /dev/md0
  VG Name               MainDisk
  PV Size               5.46 TiB / not usable 2.50 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              1430699
  Free PE               121538
  Allocated PE          1309161
  PV UUID               N3tcTa-LBw2-D8gI-6Jg4-9v3T-KWn2-5CDVzK

I would really like to get the backup times back down to 11 hours, so that we're no longer colliding with actual work times. Is there something in the TRIM options that I can do here, or is there something else I've missed? I have checked that the database did not suddenly grow any new tables, or grow 50% overnight; there are no network connection issues, there was nothing odd that happened to the network or the external server just before we started taking 16 hours to back up as far as I can see. Is there anything else I'm missing?

Edit due to comments: The actual SSDs are only a year and a half old, replacing the original 250GB SSDs in April 2022. (Ran out of space, and the RAID array, LV, and filesystem were expanded in place.) We're using software RAID, bone-standard Linux with mdadm.

Edit in response to comments:

root@local-database:~# lsblk -d
NAME MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda    8:0    0  1.8T  0 disk
sdb    8:16   0  1.8T  0 disk
sdc    8:32   0  1.8T  0 disk
sdd    8:48   0  1.8T  0 disk

root@local-database:~# cat /sys/module/raid456/parameters/devices_handle_discard_safely
N

root@local-database:~# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 2
Model name:            AMD FX(tm)-8320 Eight-Core Processor
Stepping:              0
CPU MHz:               1400.000
CPU max MHz:           3500.0000
CPU min MHz:           1400.0000
BogoMIPS:              7023.19
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7

According to an article linked by Nikita Kyprianov in the comments below, Samsung EVO 870s have serious trouble with AMD hardware, which this clearly is. So that would seem to be that. I guess we'll just have to live with it...

tsc_chazz

Asked: 2023-04-25 04:39:29 +0800 CST

RAID arrays failed, now will not restart; mdadm --examine shows drive healthy but --assemble fails missing two disks

6

This is a Mint 21.1 x64 Linux system, which has over the years had disks added to RAID arrays until we now have one array of 10 3TB and one array of 5 6TB. Four HDs dropped out of the arrays, two from each, apparently as a result of one controller failing. We've replaced controllers, but that has not restored the arrays to function. mdadm --assemble reports unable to start either array, insufficient disks (with two failed in each, I'm not surprised); mdadm --run reports I/O error (syslog seems to suggest this is because it can't start all the drives, but there is no indication that it tried to start the two apparently unhappy ones), but I can still mdadm --examine failed disks and they look absolutely normal. Here's output from a functional drive:

mdadm --examine /dev/sda
/dev/sda:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 829c0c49:033a810b:7f5bb415:913c91ed
           Name : DataBackup:back  (local to host DataBackup)
  Creation Time : Mon Feb 15 13:43:15 2021
     Raid Level : raid5
   Raid Devices : 10

 Avail Dev Size : 5860268976 sectors (2.73 TiB 3.00 TB)
     Array Size : 26371206144 KiB (24.56 TiB 27.00 TB)
  Used Dev Size : 5860268032 sectors (2.73 TiB 3.00 TB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=944 sectors
          State : clean
    Device UUID : 6e072616:2f7079b0:b336c1a7:f222c711

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Apr  2 04:30:27 2023
  Bad Block Log : 512 entries available at offset 24 sectors
       Checksum : 2faf0b93 - correct
         Events : 21397

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 9
   Array State : AAAAAA..AA ('A' == active, '.' == missing, 'R' == replacing)

And here's output from a failed drive:

mdadm --examine /dev/sdk
/dev/sdk:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 829c0c49:033a810b:7f5bb415:913c91ed
           Name : DataBackup:back  (local to host DataBackup)
  Creation Time : Mon Feb 15 13:43:15 2021
     Raid Level : raid5
   Raid Devices : 10

 Avail Dev Size : 5860268976 sectors (2.73 TiB 3.00 TB)
     Array Size : 26371206144 KiB (24.56 TiB 27.00 TB)
  Used Dev Size : 5860268032 sectors (2.73 TiB 3.00 TB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=944 sectors
          State : clean
    Device UUID : d62b85bc:fb108c56:4710850c:477c0c06

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Apr  2 04:27:31 2023
  Bad Block Log : 512 entries available at offset 24 sectors
       Checksum : d53202fe - correct
         Events : 21392

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 6
   Array State : AAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)

Edit: Here's the --examine report from the second failed drive; as you can see, it failed at the same time the entire array fell off line.

# mdadm --examine /dev/sdl
/dev/sdl:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 829c0c49:033a810b:7f5bb415:913c91ed
           Name : DataBackup:back  (local to host DataBackup)
  Creation Time : Mon Feb 15 13:43:15 2021
     Raid Level : raid5
   Raid Devices : 10

 Avail Dev Size : 5860268976 sectors (2.73 TiB 3.00 TB)
     Array Size : 26371206144 KiB (24.56 TiB 27.00 TB)
  Used Dev Size : 5860268032 sectors (2.73 TiB 3.00 TB)
    Data Offset : 264192 sectors
   Super Offset : 8 sectors
   Unused Space : before=264112 sectors, after=944 sectors
          State : clean
    Device UUID : 35ebf7d9:55148a4a:e190671d:6db1c2cf

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Apr  2 04:27:31 2023
  Bad Block Log : 512 entries available at offset 24 sectors
       Checksum : c13b7b79 - correct
         Events : 21392

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 7
   Array State : AAAAAAAAAA ('A' == active, '.' == missing, 'R' == replacing)

The second array, 5x6TB, fell off line two minutes later when two disks quit. The two failed disks on this array, and the two on the other array, all connected to a single 4-port SATA controller card which of course has now been replaced.

The main thing I find interesting about this is that the failed drive seems to report itself as alive, but mdadm doesn't agree with it. journalctl doesn't seem to go back as far as 2 April, so I may not be able to find out what happened. Anyone have any ideas about what I can do to bring this beast back online?

tsc_chazz

Asked: 2021-12-18 18:00:39 +0800 CST

How does Sendmail figure out where SASL is listening?

0

I'm trying to get SMTP-AUTH working on a Mint Linux 20.2 machine, and I'm having trouble getting authentication to happen. I have installed Cyrus sasl2, and apparently configured it properly - testsaslauthd -u <user> -p <password> -s smtp returns 0: OK "Success." But when I try to send mail from my client using the same credentials, sendmail can't authenticate. What is curious to me is that when I use testsaslauthd entries are generated in my auth log, but when Sendmail tries, there is nothing. Trying to use testsaslauthd from a non-root context also results in no auth entry, which leads me to believe that either sendmail does not have sufficient permission to connect to the sasl daemon, or that it doesn't know where that pipe is and is guessing wrong. So I guess the question is, how does sendmail find that pipe, and who is it being when it does that?

tsc_chazz

Asked: 2021-09-01 14:12:28 +0800 CST

Where does Sendmail get authentication from?

0

For reasons I can't get into at the moment, I'm authenticating to an SMB domain (using Samba 4.9.5 on a Debian host as the DC, if it matters) with a Mint Linux server in the domain with Samba 4.11.6 using Sendmail 8.15.2. I have Thunderbird on a third, Windows machine. The mail server also has Dovecot 2.3.7.2 installed. From Thunderbird, I can view, open and manipulate mailboxes with domain credentials. However, I cannot send mail, the same credentials that work to open the mailbox via Dovecot fail password validation when trying to send to port 587 on Sendmail. I do have a local account for the domain user, I'm told Dovecot needs that in order to keep its data. It seems to me that I somehow have to tell Sendmail to use the domain credentials rather than the local ones, but while I can see how to tell it how to accept credentials, I don't see how to tell it how to authenticate them. Am I missing something?

tsc_chazz

Asked: 2020-12-12 16:52:16 +0800 CST

Promotion Windows Server 2019 to domain controller takes a week?

0

Setup first: I had a Windows Server 2008 domain for testing. The PDC died a horrible death, I was able to seize most FSMO roles (but not all) and drop them on another DC, but missed the RID Master role (if I recall correctly). The machine was shaky, though - it would lock up tight if someone tried to access a file via SMB, though Explorer, FTP, and CLI were fine - so it was not a good option for long-term DC operation. I have a Server 2019 instance that I'm also using for testing, so I moved it into the domain and added the active directory functions to it; on reboot it wanted to be made into a DC as they do. Trying to make it into a DC failed because it couldn't reach the dead PDC to get the RID Master information. I went back and read the articles I'd been following, found that it wasn't enough to be in the admin group, to seize the RID Master role I actually had to be logged in as Administrator, so tried that, seized RID Master onto the 2008 server, rebooted the 2019 server, told it to upgrade itself to DC, and sat back and waited.

And waited.

It has now been nine days and the 2019 server is still "configuring this server to be a domain controller" and spinning its little progress bar. The Cancel button is, of course, grayed out. Can anyone suggest a next thing to do? Or is my entire domain now fatally hooped?

File access on SSD array suddenly slowed down; TRIM appears to be unavailable. How to enable, or what else could it be?

RAID arrays failed, now will not restart; mdadm --examine shows drive healthy but --assemble fails missing two disks

How does Sendmail figure out where SASL is listening?

Where does Sendmail get authentication from?

Promotion Windows Server 2019 to domain controller takes a week?

Can you pass user/pass for HTTP Basic Authentication in URL parameters?

Ping a Specific Port

Check if port is open or closed on a Linux server?

How to automate SSH login with password?

How do I tell Git for Windows where to find my private RSA key?

What's the default superuser username/password for postgres after a new install?

What port does SFTP use?

Command line to list users in a Windows Active Directory group?

What is a Pem file and how does it differ from other OpenSSL Generated Key File Formats?

How to determine if a bash variable is empty?