I'm trying to figure out the lowest-hassle way to provision 24x locally attached SSDs as a large logical volume holding low-value data. I'm using them as a hot-set cache for data whose master state (about a petabyte) resides in S3, so I care more about performance, maintenance complexity, and downtime than about lost data. Nothing lingers in the hot set for more than a couple of days, and it's all easy to recreate from S3 anyway.
- Medium large instance: 32x vCPUs, 120GB RAM, Skylake
- 24x locally attached SSDs @ 375GB each = 9TB total
- Hosted on Google Cloud (GCP)
- Debian 10 (Buster)
- Access is ~4x heavier on read than write
- High number of concurrent users (human and machine) with pretty random access patterns, and very hungry for I/O.
- 90% of files are larger than 10MB
I'm thinking RAID 5 is out of the question; there's no chance I'm going to wait for manual rebuilds. I'm inclined toward RAID 0 or RAID 10, or... maybe this is actually a case for a simple LVM pool with no RAID at all? Do I really lose anything by going that relatively simpler route in this case?
My ideal solution would have each subdir of / (I have one self-contained dataset per subdir) completely contained on a single disk (I can fit maybe 10 subdirs on each drive). If a drive failed, I'd have a temporary outage of the subdirs/datasets on that drive, but an easy-to-reason-about set of "these datasets are redownloading and not available". Then I'd just rebuild the missing datasets from S3 on a new drive. I suspect LVM JBODs (not sure of exactly the right word for this?) might come closest to replicating this behavior.
You appear to be contradicting your needs: "My ideal solution would have each subdir (I have one self contained dataset per subdir) of / completely contained on a single disk" tells you that you don't want RAID, LVM, or any abstraction technology. Surely the solution here is simply to mount each disk individually. The disadvantage is that you are likely to waste disk space, and if the data set grows you will need to spend more time juggling it. (I expect you know Unix can mount drives at arbitrary places in the filesystem tree, so with a bit of thought it should be easy enough to make the drives visible as a single logical tree structure.)
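For illustration, a minimal sketch of that individual-disk layout, assuming the local SSDs show up as /dev/nvme0n1 through /dev/nvme0n24 (they may be /dev/sd* instead, depending on the interface chosen at instance creation) and using /srv/cache and /srv/datasets as hypothetical paths:

```bash
# One independent XFS filesystem per SSD, mounted under a per-disk directory
for i in $(seq 1 24); do
    dev="/dev/nvme0n${i}"
    mnt="/srv/cache/disk$(printf '%02d' "$i")"
    mkfs.xfs -f "$dev"
    mkdir -p "$mnt"
    mount -o noatime "$dev" "$mnt"
done

# Present every dataset under one logical tree by symlinking (or bind-mounting)
# it to whichever disk it actually lives on
mkdir -p /srv/datasets
ln -s /srv/cache/disk01/dataset-a /srv/datasets/dataset-a
```

Losing a disk then only breaks the symlinks pointing at it, which maps directly onto the "these datasets are redownloading and not available" failure model described in the question.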
You talk about JBOD or RAID0. If you do decide on a combined-disk solution, RAID0 will give you better read performance in most cases, as data is striped across all the disks. RAID10 would buy you redundancy you said you don't need. JBOD is only useful if you have disks of different sizes, and even then you would be better off using LVM, as it can behave the same way while giving you the flexibility to move data around.
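If you do want a pooled setup with that move-data-around flexibility, the LVM equivalent is roughly the following sketch (hypothetical device and volume names; linear allocation, i.e. no striping):

```bash
# One volume group across all 24 SSDs, one big linear ("JBOD-like") volume
pvcreate /dev/nvme0n{1..24}
vgcreate cachevg /dev/nvme0n{1..24}
lvcreate -n datasets -l 90%VG cachevg   # leave headroom so pvmove has somewhere to go
mkfs.xfs /dev/cachevg/datasets

# Later, extents can be migrated off a suspect disk while the volume stays mounted
pvmove /dev/nvme0n7
```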
I can see edge cases where LVM would help over individual disks, but in general any pooling scenario is likely to add more complexity than it gives useful flexibility here, particularly bearing in mind your initial statement about datasets being bound to disks.
Where you might want to spend some effort is looking at the most appropriate file system and tuning parameters.
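For example, a couple of illustrative XFS knobs (the values are assumptions to benchmark against your own workload, not recommendations; /dev/md0 and /srv/cache are hypothetical names):

```bash
# mkfs.xfs detects stripe geometry automatically on md RAID devices; elsewhere
# you can state it explicitly, e.g. for a 512 KiB chunk across 24 data disks:
mkfs.xfs -d su=512k,sw=24 /dev/md0

# Mount options that often suit a read-heavy, large-file workload:
mount -o noatime,nodiratime,logbsize=256k /dev/md0 /srv/cache
```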
Maximizing performance points you toward some form of RAID-0 or RAID10, or LVM striping. Complexity of maintenance rules out segmenting the disks by subdirectory (as another answer mentions, you end up juggling volumes). Minimizing downtime means you need some form of redundancy, since the loss of one drive takes the whole array down and you'd then have to rebuild it; I read that as "downtime". Degraded-mode performance likely also rules out RAID-5.
So I'd say your options are RAID10, or RAID1+LVM. LVM offers some additional ability to manage the size of the volume, but much of that advantage disappears if you're going to mirror it with RAID-1 anyway. According to this article, RAID-0 (mdadm striping) offers better performance than LVM striping: https://www.linuxtoday.com/blog/pick-your-pleasure-raid-0-mdadm-striping-or-lvm-striping.html
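For reference, the RAID10 route is only a few commands with mdadm plus a filesystem on top (hypothetical device names; 24x 375 GB in RAID10 gives roughly 4.5 TB usable):

```bash
# 24-disk mdadm RAID10: survives a single-disk failure, ~4.5 TB usable
mdadm --create /dev/md0 --level=10 --raid-devices=24 /dev/nvme0n{1..24}
mkfs.xfs /dev/md0
mount -o noatime /dev/md0 /srv/cache
```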
If you genuinely don't care about the data, only about its performance and the speed with which you can rebuild the service WHEN it fails (rather than avoiding failure), then, against all my normal better judgement, RAID 0 will be fine.
It doesn't let you choose what data goes where, obviously, but it will be about as fast as anything I can think of. Yes, it will definitely fail at some point, but you can just have a script that removes the RAID 0 array, recreates it, and mounts it; that shouldn't take more than a minute or so. You could even run it automatically when you lose access to the drive.
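A rough sketch of what such a rebuild script could look like, assuming mdadm RAID0 and XFS, with hypothetical device names and mount point:

```bash
#!/bin/bash
# Tear down and recreate a disposable RAID0 scratch array, then let the
# cache repopulate itself from S3. Intended to be triggered by monitoring.
set -euo pipefail

MD=/dev/md0
MNT=/srv/cache
DISKS=(/dev/nvme0n{1..24})      # drop a device from this list if it is truly dead

umount -l "$MNT" 2>/dev/null || true
mdadm --stop "$MD" 2>/dev/null || true
mdadm --zero-superblock "${DISKS[@]}" 2>/dev/null || true

mdadm --create "$MD" --run --level=0 --raid-devices="${#DISKS[@]}" "${DISKS[@]}"
mkfs.xfs -f "$MD"
mount -o noatime "$MD" "$MNT"
```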
One small question: you want a 32-vCPU VM using Skylake cores, but they don't do a single socket that big, so your VM will be split across sockets, and this might not be as fast as you'd expect. Maybe test performance with 32/24/16 cores to see what the impact is; it's worth a quick try at least.
The simpler, hassle-free setup is a software RAID array + XFS. If, and only if, you do not care about data and availability, you can use a RAID0 array; otherwise, I strongly suggest some other RAID layout. I generally suggest RAID10, but it commands a 50% capacity penalty; for a 24x 375GB array you can think about RAID6 or -gasp- even RAID5.
The above solution comes with some strings attached, most importantly that it presents you with a single block device, and skipping LVM means no dynamic partitioning and no snapshot capability. On the other hand, the XFS allocator is very good at balancing load across the individual disks of a RAID0 set.
Other possible solutions:
- use XFS over classical LVM over RAID0/5/6: a classical LVM volume has basically no impact on performance and enables you both to dynamically partition the single block device and to take short-lived snapshots (albeit at a high performance penalty);
- use XFS over thin LVM over RAID0/5/6: thin LVM enables modern snapshots, with a reduced performance penalty, and other goodies; with a big enough chunk size, performance is good (a setup sketch follows this list);
- consider using ZFS (in its ZoL incarnation): especially if your data is compressible, it can provide significant space and performance advantages. Moreover, as your workload seems read-heavy, the ZFS ARC can be more efficient than the traditional Linux pagecache (a pool-creation sketch also follows the list below).
If your data do not compress well but are deduplication-friendly, you can consider inserting VDO between the RAID block device and filesystem.
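As promised above, a minimal sketch of the thin-LVM-over-RAID0 variant (hypothetical names; the chunk size and virtual size are placeholders to tune for your workload, not recommendations):

```bash
# RAID0 across the 24 SSDs, thin pool on top, one thin volume with XFS
mdadm --create /dev/md0 --run --level=0 --raid-devices=24 /dev/nvme0n{1..24}
pvcreate /dev/md0
vgcreate cachevg /dev/md0
lvcreate --type thin-pool --chunksize 1m -l 95%VG -n pool cachevg
lvcreate -n cache -V 8T cachevg/pool
mkfs.xfs /dev/cachevg/cache
```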
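And a pool-creation sketch for the ZFS option (ZoL is packaged for Debian 10 via contrib/backports; device names are again hypothetical, and the pool has no redundancy, matching the disposable-data assumption):

```bash
apt install zfsutils-linux                            # pulls in zfs-dkms on Debian 10
zpool create -o ashift=12 cache /dev/nvme0n{1..24}    # plain stripe across all 24 disks
zfs set compression=lz4 cache
zfs set atime=off cache
zfs set recordsize=1M cache                           # large records suit mostly >10 MB files
```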
Finally, please consider that any sort of LVM, JBOD, or ZFS pooling does not mean that losing a disk will only take offline the directories located on that disk; rather, the entire virtual block device becomes unavailable. To get that sort of isolation, you need to lay down a separate filesystem on each block device: this means you must manage the various mount points and, more importantly, that your storage is not pooled (i.e. you can run out of space on one disk while the others have plenty of free space).
Regarding performance and maintenance complexity, you can use the best practices listed here [1] [2] as a quick reference for what to keep in mind when building an application that uses Cloud Storage and Compute Engine disks.
[1] https://cloud.google.com/storage/docs/best-practices
[2] https://cloud.google.com/compute/docs/disks/performance