degraded questions - Page 1

phaidros

Asked: 2022-01-07 02:31:15 +0800 CST

How to fix ZFS pool once spare replacement done or how to correct spare replacement

2

I have a ZFS pool in the current state:

[root@SERVER-abc ~]# zpool status -v DATAPOOL
  pool: DATAPOOL
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 18.5M in 00:00:01 with 0 errors on Wed Jan  5 19:10:50 2022
config:

        NAME                                              STATE     READ WRITE CKSUM
        DATAPOOL                                          DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            gptid/14c707c6-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0    17  too many errors
            spare-1                                       ONLINE       0     0    17
              gptid/168342c5-f16c-11e8-b117-0cc47a2ba44e  ONLINE       0     0     0
              gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e  ONLINE       0     0     0
            gptid/1875501a-f16c-11e8-b117-0cc47a2ba44e    ONLINE       0     0    30
            gptid/1a16d37c-f16c-11e8-b117-0cc47a2ba44e    ONLINE       0     0    29
        spares
          gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e      INUSE     currently in use

errors: Permanent errors have been detected in the following files:

        DATAPOOL/VMS/ubuntu_1804_LTS_ustrich-m6i87@auto-2022-01-04_11-41:<0x1>
        <0x1080a>:<0x1>
        <0x182a>:<0x1>
        DATAPOOL/VMS/ubuntu_1804_LTS_ustrich-m6i87:<0x1>
        <0x16fa>:<0x1>

This is a zpool with 4 + 1 spare drives. Something happened, and suddenly the spare is automatically paired with one of the other drives as spare-1.

This is unexpected to me, and I have several questions:

Why did the spare not replace the degraded drive?
How to find out why the spare jumped to spare-1?
Is it possible (or even recommended) to get the spare back and then replace the degraded drive?

The goal is to rescue the pool without having to restore tons of data from backup, but at its core I want to understand what happened and why, and how to deal with such situations following best practices.

Thanks a bunch! :)

System is: SuperMicro, TrueNAS-12.0-U4.1, zfs-2.0.4-3

Edit: Changed output from zpool status -x to zpool status -v DATAPOOL

Edit2: As of now I understand that 168342c5 seems to have had an error first and the spare (1bfaa607) jumped in. After that, 14c707c6 degraded as well.

Edit3, additional question: all drives (except the one in spare-1) seem to have CKSUM errors. What does that indicate? Cabling? HBA? Are all drives dying simultaneously?

Latest update: after zpool clear and zpool scrub DATAPOOL, it seems clear that a lot has happened and there is no way to rescue the pool:

  pool: DATAPOOL
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jan  6 16:18:05 2022
        1.82T scanned at 1.55G/s, 204G issued at 174M/s, 7.82T total
        40.8G resilvered, 2.55% done, 12:44:33 to go
config:

        NAME                                              STATE     READ WRITE CKSUM
        DATAPOOL                                          DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            gptid/14c707c6-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0   156  too many errors
            spare-1                                       DEGRADED     0     0     0
              gptid/168342c5-f16c-11e8-b117-0cc47a2ba44e  DEGRADED     0     0   236  too many errors
              gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e  ONLINE       0     0     0  (resilvering)
            gptid/1875501a-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0   182  too many errors
            gptid/1a16d37c-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0   179  too many errors
        spares
          gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e      INUSE     currently in use

I'll check all SMART stats now.
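
To make the CKSUM question in Edit3 easier to track across scrubs, here is a minimal Python sketch that pulls the per-device CKSUM column out of `zpool status` text. The function name `cksum_by_device` and the `SAMPLE` constant are hypothetical; the sample rows are copied from the second status output above. It assumes leaf devices are named `gptid/...` and that counters are plain integers (abbreviated counters such as `1.2K` are not handled). It only counts errors and does not diagnose the cause.

```python
SAMPLE = """\
        NAME                                              STATE     READ WRITE CKSUM
        DATAPOOL                                          DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            gptid/14c707c6-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0   156  too many errors
            spare-1                                       DEGRADED     0     0     0
              gptid/168342c5-f16c-11e8-b117-0cc47a2ba44e  DEGRADED     0     0   236  too many errors
              gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e  ONLINE       0     0     0  (resilvering)
            gptid/1875501a-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0   182  too many errors
            gptid/1a16d37c-f16c-11e8-b117-0cc47a2ba44e    DEGRADED     0     0   179  too many errors
        spares
          gptid/1bfaa607-f16c-11e8-b117-0cc47a2ba44e      INUSE     currently in use
"""


def cksum_by_device(status_text):
    """Return {device_name: cksum_count} for gptid leaf devices in the config table."""
    counts = {}
    in_config = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("NAME"):
            in_config = True  # header row of the config table
            continue
        if not in_config:
            continue
        if stripped == "spares" or stripped.startswith("errors:"):
            break  # the spares list and error report are not counter rows
        fields = stripped.split()
        # Counter rows look like: NAME STATE READ WRITE CKSUM [note ...]
        if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            if fields[0].startswith("gptid/"):
                counts[fields[0]] = int(fields[4])
    return counts


if __name__ == "__main__":
    for name, count in sorted(cksum_by_device(SAMPLE).items()):
        print(f"{name}  CKSUM={count}")
```

On the output above, this reports non-zero counts for four of the five disks, with only the spare (1bfaa607) at zero.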

undefine

Asked: 2016-06-04 04:27:25 +0800 CST

How to create degraded raid6 array using megacli

2

There is an old server with a few disks, and a new server with a PERC controller.

I would like to migrate the existing data to the new server, onto a RAID6 array that uses 6 disks (4+2).

Unfortunately, I don't have enough free disks to create the target RAID6 array. I would like to create a degraded RAID6 array using 4 disks (which would behave like a 4-disk RAID0), and then, after migrating the data, add the last 2 disks from the old server and rebuild the array.

Is that possible using megacli? I tried the -Force option with -CfgLdAdd, pointing at the missing slots, but it didn't work. Is there any other way to do this?
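
To spell out the "works like a 4-disk RAID0" arithmetic, here is a small Python sketch. The function `raid6_summary` and the 1 TB disk size are hypothetical and only illustrate the capacity and redundancy math; it says nothing about whether megacli or the PERC firmware allows creating such an array.

```python
def raid6_summary(total_disks, disk_size_tb, missing_disks=0):
    """Usable capacity and remaining fault tolerance of a RAID6 array.

    RAID6 stores two disks' worth of parity, so it survives at most
    two missing members in total.
    """
    if missing_disks > 2:
        raise ValueError("RAID6 cannot run with more than 2 missing disks")
    return {
        "usable_tb": (total_disks - 2) * disk_size_tb,
        "failures_tolerated": 2 - missing_disks,
    }


if __name__ == "__main__":
    # 6-disk (4+2) array built with 2 members absent, 1 TB disks assumed
    print(raid6_summary(6, 1.0, missing_disks=2))
```

With two members missing, the usable capacity is still four disks' worth, but no further disk failure can be tolerated until the rebuild finishes, which is the RAID0-like state described above.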

dalf

Asked: 2016-06-02 21:33:29 +0800 CST

Ganeti disks degraded drbd cs:NetworkFailure

2

I have an instance (with 2 disks) on Ganeti with both disks degraded (probably due to a connection problem?). This instance had been working correctly for many years until this morning.

On my master

$ gnt-instance info myinstance
...
   -disk/0
      on primary:   /dev/drbd4 (147:4) in sync, status *DEGRADED*
      on secondary: /dev/drbd4 (147:4) in sync, status *DEGRADED*
      child devices:
        - child 0: lvm, size 20.0G
          logical_id:   kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data
          on primary:   /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:10)
          on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:8)
        - child 1: lvm, size 128M
          logical_id:   kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta
          on primary:   /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:11)
          on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:9)

...

On primary node

$ cat /proc/drbd
 4: cs:NetworkFailure ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:678399926 nr:0 dw:678315292 dr:25942012 al:22230 bm:16189 lo:0 pe:196 ua:0 ap:195 ep:1 wo:b oos:0

On secondary node

$ cat /proc/drbd
 4: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r----
    ns:0 nr:678340009 dw:678340009 dr:0 al:0 bm:14884 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

I can't reboot nor shutdown the instance (Operation timed out).

I think it is NOT a split-brain issue, because there is no "StandAlone" state: the primary node shows "Primary/Unknown" and the secondary node shows "Secondary/Unknown".
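
The check described above can be sketched in Python as follows. The helper names `parse_drbd_status` and `any_standalone` are hypothetical, the input lines are copied from the two /proc/drbd outputs above, and the parsing assumes the DRBD 8.x /proc/drbd layout shown there. A missing StandAlone state is only a hint, not proof, that no split-brain occurred.

```python
import re

PRIMARY = """\
 4: cs:NetworkFailure ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:678399926 nr:0 dw:678315292 dr:25942012 al:22230 bm:16189 lo:0 pe:196 ua:0 ap:195 ep:1 wo:b oos:0
"""

SECONDARY = """\
 4: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r----
    ns:0 nr:678340009 dw:678340009 dr:0 al:0 bm:14884 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
"""

# "<minor>: cs:<state>" optionally followed by "ro:<roles>" and "ds:<disk states>"
_LINE = re.compile(r"^\s*(\d+):\s+cs:(\S+)(?:\s+ro:(\S+))?(?:\s+ds:(\S+))?")


def parse_drbd_status(proc_drbd_text):
    """Return {minor: {"cs": ..., "ro": ..., "ds": ...}} for each device line."""
    resources = {}
    for line in proc_drbd_text.splitlines():
        match = _LINE.match(line)
        if match:
            minor, cs, ro, ds = match.groups()
            resources[int(minor)] = {"cs": cs, "ro": ro, "ds": ds}
    return resources


def any_standalone(resources):
    """True if any device reports the StandAlone connection state."""
    return any(res["cs"] == "StandAlone" for res in resources.values())


if __name__ == "__main__":
    for label, text in (("primary", PRIMARY), ("secondary", SECONDARY)):
        status = parse_drbd_status(text)
        print(label, status, "standalone:", any_standalone(status))
```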

I tried to run "drbdadm connect all" on the secondary node, but that did nothing.

I tried to replace the disks, but it failed:

gnt-instance replace-disks -s myinstance
Thu Jun  2 11:32:00 2016 Replacing disk(s) 0, 1 for myinstancel
Thu Jun  2 11:36:00 2016  - WARNING: Could not prepare block device disk/1 on node primaryNode (is_primary=False, pass=1): Error while assembling disk: drbd5: cannot activate, unknown or unhandled reason
Thu Jun  2 11:38:01 2016  - WARNING: Could not prepare block device disk/0 on node primaryNode (is_primary=True, pass=2): Error while assembling disk: drbd4: cannot activate, unknown or unhandled reason
Thu Jun  2 11:40:02 2016  - WARNING: Could not prepare block device disk/1 on node primaryNode (is_primary=True, pass=2): Error while assembling disk: drbd5: cannot activate, unknown or unhandled reason
Failure: command execution error:
Disk consistency error

And now it looks like this:

$ gnt-instance info myinstance
...
    -disk/0 
      on primary:   /dev/drbd4 (147:4) in sync, status *DEGRADED*
      (no more secondary)
      child devices:
        - child 0: lvm, size 20.0G
          logical_id:   kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data
          on primary:   /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:10)
          on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_data (254:8)
        - child 1: lvm, size 128M
          logical_id:   kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta
          on primary:   /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:11)
          on secondary: /dev/kvmvg/299a0bdf-1acb-4bcd-ac43-eb02b0928757.disk0_meta (254:9)

On primary node

$ cat /proc/drbd
 4: cs:NetworkFailure ro:Primary/Unknown ds:UpToDate/DUnknown C r----
    ns:678399926 nr:0 dw:678315292 dr:25942012 al:22230 bm:16189 lo:0 pe:196 ua:0 ap:195 ep:1 wo:b oos:0

And on secondary node:

$ cat /proc/drbd
...
4: cs:Unconfigured
5: cs:Unconfigured

Any idea how to solve this?

DRBD version: 8.3.7

Ganeti version: 2.4.5

OS: Debian 6.0

Industrial

Asked: 2011-06-09 07:37:07 +0800 CST

Ubuntu raid - replacing drive?

5

I've set up software RAID1 with the latest Ubuntu version (11.04) using two 250GB hard drives, which initially worked great. I tried unplugging both and the computer still started with degraded RAID status; reconnecting and rebooting makes the resync run automatically. Everything good so far.

Unfortunately, one of the hard drives died this morning, and when I open the disk tool (gnome-disk-utility, as it's called), the main RAID array shows as degraded.

If I run to the store, get a new hard drive and plug it into the computer, will everything work out as intended, or do I need to partition it as I did when installing Ubuntu with the Alternate installer?

mr.b

Asked: 2010-10-31 07:30:11 +0800 CST

Boot Debian while RAID array is degraded

10

Recently, I came across an Ubuntu Server install. During installation, it asked me whether to allow booting the system from a degraded RAID array (probably because I installed the system onto a RAID1 /dev/md0 device). This is a mighty useful option for unattended servers that simply have to come online whether or not their RAID array is degraded (as long as it hasn't completely failed).

After a quick lookup, I found that it works by either reading the /etc/initramfs-tools/conf.d/mdadm configuration file (BOOT_DEGRADED=true option) or reading a kernel boot line argument (bootdegraded=true).
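
As a rough illustration of those two switches, here is a Python sketch that reports whether either one is set. The function `boot_degraded_enabled` is hypothetical and is not the actual Ubuntu initramfs script; it takes the config file text and the kernel command line as plain strings. It treats the two settings as a simple OR, which is an assumption: the real precedence between them is not modeled here.

```python
def boot_degraded_enabled(mdadm_conf_text, kernel_cmdline):
    """True if BOOT_DEGRADED=true is in the config text or
    bootdegraded=true is on the kernel command line."""
    for line in mdadm_conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "BOOT_DEGRADED":
            if value.strip().strip("\"'").lower() == "true":
                return True
    # Kernel arguments are whitespace-separated key=value tokens
    return "bootdegraded=true" in kernel_cmdline.split()


if __name__ == "__main__":
    print(boot_degraded_enabled("BOOT_DEGRADED=true\n", "ro quiet"))
    print(boot_degraded_enabled("", "ro quiet bootdegraded=true"))
    print(boot_degraded_enabled("# BOOT_DEGRADED=true\n", "ro quiet"))
```

On a real system the inputs would come from /etc/initramfs-tools/conf.d/mdadm and /proc/cmdline.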

Question: Is there something similar (a way to boot the system with a degraded array) that would work for Debian? I'm not sure whether this exact method is applicable, or even whether Debian has this specific functionality.

I'm asking because I used to have a RAID5 array in a system that could not boot after an improper shutdown until I manually "fixed" the array. That proved to be a major PITA, since the server was unattended at a remote location, there was no UPS, and power failures did happen. I want to prevent this kind of issue in the future.
