Ryan Jeremiah Freeman's questions -server

Ryan Jeremiah Freeman

Asked: 2024-06-04 01:13:01 +0800 CST

Ceph - Too many objects () are misplaced;

I'm currently running a 3-node hyperconverged Proxmox/Ceph cluster. I'm in the process of transferring a large amount of data (100TB+) from an old unRAID instance to the new cluster. I have to copy data one HDD at a time to the new CephFS pool, then wipe the disk and add it to the cluster as an OSD. I don't have any spare HDDs lying around, or the budget to buy more drives, which would make this process much easier.

Halfway through the process I'm now stuck: "ceph balancer status" reports "Too many objects are misplaced", and the count of "283 active+remapped+backfill_wait" PGs has remained unchanged for over 12 hours. The cluster is idling, but not "self-healing" as I would expect it to.

Before I started this migration, I pushed and pulled Ceph and broke it in a number of ways as part of testing. I was always able to get it back to HEALTH_OK without any data loss or extended downtime, bar a service/server restart. I've read through the docs for this issue and haven't found anything useful on how to kick this into gear.

https://docs.ceph.com/en/latest/rados/operations/health-checks/#object-misplaced

Data migration is currently on hold.

NB: 1. There is a bit of a mismatch between my OSD sizes; the reweights were me trying to get Ceph to spread data to the larger drives instead of constantly filling the small ones. 2. The nearfull OSD is one of the three 4TB drives (there are 16TB drives nearly empty, but data is not balancing across to them).

ceph balancer status

{
    "active": true,
    "last_optimize_duration": "0:00:00.000087",
    "last_optimize_started": "Mon Jun  3 17:56:27 2024",
    "mode": "upmap",
    "no_optimization_needed": false,
    "optimize_result": "Too many objects (0.401282 > 0.050000) are misplaced; try again later",
    "plans": []
}
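For reference, the two numbers in "optimize_result" can be pulled out and compared programmatically. This is only a minimal sketch: the function name is made up for illustration, and my understanding (an assumption, not something the output itself states) is that the second number is the mgr option target_max_misplaced_ratio, which defaults to 0.05.

```python
import json
import re

def misplaced_vs_limit(status_json):
    """Return (misplaced_ratio, limit) parsed from balancer status JSON, or None."""
    status = json.loads(status_json)
    # The message has the form "Too many objects (<ratio> > <limit>) are misplaced; ..."
    match = re.search(r"\(([\d.]+) > ([\d.]+)\)", status.get("optimize_result", ""))
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Trimmed copy of the status output above.
sample = json.dumps({
    "active": True,
    "mode": "upmap",
    "optimize_result": "Too many objects (0.401282 > 0.050000) are misplaced; try again later",
})

ratio, limit = misplaced_vs_limit(sample)
print(f"misplaced {ratio:.1%} of objects, balancer limit {limit:.1%}")
```

Read this way, the balancer is not failing; it is declining to create new plans while roughly 40% of objects are still waiting to be moved.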

ceph -s

  cluster:
    id:     {id}
    health: HEALTH_WARN
            1 nearfull osd(s)
            2 pgs not deep-scrubbed in time
            2 pool(s) nearfull
            1 pools have too many placement groups
 
  services:
    mon: 3 daemons, quorum {node1},{node2},{node3} (age 31h)
    mgr: {node3}(active, since 26h), standbys: {node1}, {node2}
    mds: 2/2 daemons up, 1 standby
    osd: 23 osds: 23 up (since 23h), 23 in (since 2h); 284 remapped pgs
 
  data:
    volumes: 1/1 healthy
    pools:   7 pools, 801 pgs
    objects: 10.19M objects, 37 TiB
    usage:   57 TiB used, 55 TiB / 112 TiB avail
    pgs:     12261409/30555598 objects misplaced (40.128%)
             513 active+clean
             283 active+remapped+backfill_wait
             2   active+clean+scrubbing+deep
             2   active+clean+scrubbing
             1   active+remapped+backfilling
 
  io:
    client:   15 MiB/s wr, 0 op/s rd, 71 op/s wr
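As a sanity check on the numbers above, here is a small sketch that tallies the PG states and recomputes the misplaced fraction. The values are hand-copied from the "ceph -s" output; the helper name is hypothetical.

```python
from collections import Counter

# PG state lines copied from the "pgs:" section above.
PG_LINES = """\
513 active+clean
283 active+remapped+backfill_wait
2   active+clean+scrubbing+deep
2   active+clean+scrubbing
1   active+remapped+backfilling
"""

def tally_pg_states(pg_lines):
    """Return (total_pgs, Counter of PGs per individual state flag)."""
    total = 0
    by_flag = Counter()
    for line in pg_lines.splitlines():
        count, state = line.split()
        total += int(count)
        # A state such as "active+remapped+backfill_wait" is a set of flags.
        for flag in state.split("+"):
            by_flag[flag] += int(count)
    return total, by_flag

total, by_flag = tally_pg_states(PG_LINES)
misplaced_ratio = 12261409 / 30555598  # from the "objects misplaced" line

print("total pgs:", total)
print("remapped:", by_flag["remapped"])
print("waiting for backfill:", by_flag["backfill_wait"])
print("actively backfilling:", by_flag["backfilling"])
print(f"misplaced ratio: {misplaced_ratio:.6f}")
```

The totals line up with the summary (801 PGs, 284 remapped), and only one of the 284 remapped PGs is actually backfilling at the moment; the rest are queued.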

ceph osd df

ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 8    hdd   7.31639   1.00000  7.3 TiB  665 GiB  625 GiB    2 KiB  2.2 GiB  6.7 TiB   8.88  0.17   27      up
10    hdd   9.13480   1.00000  9.1 TiB   40 GiB   30 MiB    1 KiB  1.3 GiB  9.1 TiB   0.43  0.01   16      up
 5    ssd   0.72769   1.00000  745 GiB  248 GiB  246 GiB  189 MiB  2.5 GiB  497 GiB  33.32  0.65  133      up
 6    ssd   0.72769   1.00000  745 GiB  252 GiB  251 GiB  104 MiB  1.1 GiB  493 GiB  33.80  0.66  126      up
 7    hdd   5.49709   1.00000  5.5 TiB  259 GiB  219 GiB    1 KiB  1.6 GiB  5.2 TiB   4.61  0.09    9      up
22    hdd   9.13480   1.00000  9.1 TiB  626 GiB  586 GiB    1 KiB  2.7 GiB  8.5 TiB   6.70  0.13   12      up
15    ssd   0.72769   1.00000  745 GiB  120 GiB  118 GiB   53 MiB  1.3 GiB  626 GiB  16.05  0.31   71      up
16    ssd   0.87329   1.00000  894 GiB  128 GiB  126 GiB   56 MiB  1.9 GiB  766 GiB  14.35  0.28   78      up
17    ssd   0.43660   1.00000  447 GiB   63 GiB   62 GiB   25 MiB  1.2 GiB  384 GiB  14.11  0.28   40      up
18    ssd   0.43660   1.00000  447 GiB   91 GiB   89 GiB   24 MiB  1.8 GiB  357 GiB  20.25  0.40   48      up
19    ssd   0.72769   1.00000  745 GiB  132 GiB  130 GiB   67 MiB  2.1 GiB  613 GiB  17.71  0.35   82      up
20    ssd   0.72769   1.00000  745 GiB  106 GiB  104 GiB   24 MiB  1.9 GiB  639 GiB  14.28  0.28   65      up
21    ssd   0.72769   1.00000  745 GiB  127 GiB  124 GiB   62 MiB  2.2 GiB  619 GiB  17.00  0.33   75      up
 0    hdd  16.40039   1.00000   16 TiB   12 TiB   12 TiB    7 KiB   25 GiB  4.5 TiB  72.65  1.42  241      up
 1    hdd   3.66800   0.50000  3.7 TiB  2.6 TiB  2.6 TiB    6 KiB  6.1 GiB  1.1 TiB  71.08  1.39   56      up
 2    hdd   3.66800   0.09999  3.7 TiB  2.9 TiB  2.9 TiB    6 KiB  7.3 GiB  793 GiB  78.89  1.55   56      up
 3    hdd  14.58199   1.00000   15 TiB   10 TiB   10 TiB    6 KiB   22 GiB  4.4 TiB  69.94  1.37  216      up
 4    hdd   3.66800   0.09999  3.7 TiB  3.2 TiB  3.1 TiB    6 KiB  7.3 GiB  501 GiB  86.66  1.70   63      up
11    hdd  14.58199   0.95001   15 TiB   12 TiB   12 TiB    9 KiB   24 GiB  2.7 TiB  81.23  1.59  233      up
13    hdd  14.58199   0.95001   15 TiB   11 TiB   11 TiB    6 KiB   24 GiB  3.4 TiB  76.77  1.51  223      up
 9    ssd   0.72769   1.00000  745 GiB  139 GiB  137 GiB   63 MiB  1.5 GiB  606 GiB  18.65  0.37   80      up
12    ssd   1.81940   1.00000  1.8 TiB  311 GiB  308 GiB  146 MiB  2.5 GiB  1.5 TiB  16.67  0.33  182      up
14    ssd   1.45549   1.00000  1.5 TiB  247 GiB  245 GiB   88 MiB  2.2 GiB  1.2 TiB  16.57  0.32  143      up
                        TOTAL  112 TiB   57 TiB   57 TiB  902 MiB  145 GiB   55 TiB  51.00
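To make the imbalance easier to see, here is a rough sketch that flags which HDD OSDs cross the nearfull threshold and which are the emptiest. The rows are hand-copied from the table above. The 0.85 threshold is an assumption (it is the usual default nearfull ratio, but a cluster can be configured differently), and the function names are made up for illustration.

```python
# Assumed nearfull ratio; 0.85 is the usual default, but a cluster may override it.
NEARFULL_RATIO = 0.85

# (osd_id, crush_weight, reweight, percent_used), copied from the HDD rows above.
HDD_OSDS = [
    (8, 7.31639, 1.00000, 8.88),
    (10, 9.13480, 1.00000, 0.43),
    (7, 5.49709, 1.00000, 4.61),
    (22, 9.13480, 1.00000, 6.70),
    (0, 16.40039, 1.00000, 72.65),
    (1, 3.66800, 0.50000, 71.08),
    (2, 3.66800, 0.09999, 78.89),
    (3, 14.58199, 1.00000, 69.94),
    (4, 3.66800, 0.09999, 86.66),
    (11, 14.58199, 0.95001, 81.23),
    (13, 14.58199, 0.95001, 76.77),
]

def nearfull_osds(osds, ratio=NEARFULL_RATIO):
    """Return the ids of OSDs whose utilisation is at or above the ratio."""
    return [osd_id for osd_id, _weight, _reweight, pct in osds if pct / 100 >= ratio]

def emptiest_osds(osds, count=3):
    """Return the ids of the `count` least-utilised OSDs."""
    ranked = sorted(osds, key=lambda row: row[3])
    return [osd_id for osd_id, _weight, _reweight, _pct in ranked[:count]]

print("nearfull:", nearfull_osds(HDD_OSDS))
print("emptiest:", emptiest_osds(HDD_OSDS))
```

With these numbers, osd.4 (one of the 3.7 TiB drives) is the only one over the assumed threshold, while osd.10, osd.7 and osd.22 sit below 7% used.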

Ryan Jeremiah Freeman

Asked: 2019-09-20 15:00:43 +0800 CST

Systemd error "Timed out waiting for device dev-sdl2.device"

I'm getting this error showing up in my logs.

Sep 19 23:50:24 MY-SERV systemd[1]: dev-sdj2.device: Job dev-sdj2.device/start timed out.
Sep 19 23:50:24 MY-SERV systemd[1]: Timed out waiting for device dev-sdj2.device.
Sep 19 23:50:24 MY-SERV systemd[1]: Dependency failed for /dev/sdj2.
Sep 19 23:50:24 MY-SERV systemd[1]: dev-sdj2.swap: Job dev-sdj2.swap/start failed with result 'dependency'.
Sep 19 23:50:24 MY-SERV systemd[1]: dev-sdj2.device: Job dev-sdj2.device/start failed with result 'timeout'.

$> ls /dev/sd*
/dev/sda   /dev/sdb1  /dev/sdd1  /dev/sdf1  /dev/sdh1  /dev/sdj1  /dev/sdl1
/dev/sda1  /dev/sdc   /dev/sde   /dev/sdg   /dev/sdi   /dev/sdk   /dev/sdm
/dev/sda2  /dev/sdc1  /dev/sde1  /dev/sdg1  /dev/sdi1  /dev/sdk1  /dev/sdm1
/dev/sdb   /dev/sdd   /dev/sdf   /dev/sdh   /dev/sdj   /dev/sdl

Not a /dev/sd*2 in sight, apart from 'sda2', which is on my boot SSD.
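Since the failing units include dev-sdj2.swap, one possible source is a stale swap entry in /etc/fstab, because systemd generates .swap units from fstab entries. The sketch below is only one way to check for that: the function name and the sample fstab text are hypothetical, and it only looks at entries given as plain /dev/ paths (not UUID= or LABEL= entries).

```python
import os

def missing_swap_devices(fstab_text, exists=os.path.exists):
    """Return /dev/ paths listed as swap in fstab text that do not currently exist."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        # fstab fields: device, mount point, filesystem type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "swap" and fields[0].startswith("/dev/"):
            if not exists(fields[0]):
                missing.append(fields[0])
    return missing

# Hypothetical fstab content for illustration only; not taken from my machine.
sample_fstab = """\
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/sda1  /     ext4  errors=remount-ro  0  1
/dev/sda2  none  swap  sw                 0  0
/dev/sdj2  none  swap  sw                 0  0
"""

# Pretend only /dev/sdj2 is absent, mirroring the ls output above.
print(missing_swap_devices(sample_fstab, exists=lambda path: path != "/dev/sdj2"))
```

On a real system it could be run as missing_swap_devices(open("/etc/fstab").read()).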

Currently, my logfile is 800K and growing with each minute.

Info: 18.04.3, Kernel 5.0.0-29-generic

I'm running GlusterFS on this machine with great success. I prefer it over conventional RAID because, if a disk fails, I can still retrieve the files manually from the disk, rather than being left with just 1's and 0's on a platter.
