Question about mdadm-like alternatives
Added by Daniel Buus about 1 year ago
Hi all :)
Over the weekend I've been recovering data from a degraded RAID-5 mdadm-based array with an XFS FS on my Ubuntu server. For a while I've been wanting to migrate to ZFS with RAID-Z2, and now it seems the data gods have chosen a good time for me to do so ;)
At first I thought I'd just use the FUSE/ZFS port that exists for Ubuntu, but I'd like to maximize performance, so I started looking for OpenSolaris-based alternatives. Enter Nexenta.
The only issue I have with migrating is that my disk setup is a tad inconsistent. Physically, it comes down to this:
- 2 x 2TB disks
- 6 x 1TB disks
- 8 x ½TB disks
Now, with Linux, this is easily assembled into appropriately sized "groups" using mdadm, like so:
- 2 groups of 1x2TB disks, direct access
- 3 groups of 2x1TB disks, accessed via mdadm JBOD arrays
- 2 groups of 4x½TB disks, accessed via mdadm JBOD arrays
But as I'm pretty sure that mdadm doesn't exist for Solaris-based systems, I'm not sure if something like this is at all possible. AFAIU ZFS does not allow you to use pools as the basis for larger pools, right?
So is there any way to do this in Nexenta?
Thanks in advance, Daniel :)
Replies
RE: Question about mdadm-like alternatives - Added by Jérôme Warnier about 1 year ago
Of course. But you should really read some documentation about ZFS. Start here: http://www.sun.com/bigadmin/topics/zfs/
The good news is that it's easier to set up (and manage) than Linux's software RAID + LVM + FS. Plus, there are more features.
RE: Question about mdadm-like alternatives - Added by Daniel Buus about 1 year ago
Ah!
I'd been searching and searching, but all I found was questions about using odd-sized drives in a RAID-Z configuration. I guess I was missing the terminology completely :)
Just to confirm that I'm not mistaken, you're thinking of SVM, right?
Thanks :)
RE: Question about mdadm-like alternatives - Added by Jérôme Warnier about 1 year ago
No, SVM (Solaris Volume Manager, to be sure) is past, forget about it. Future is bright, future is ZFS.
During installation of Nexenta, you can only choose to do mirroring (or single disk, of course). After install, you can start to play with the other disks you have lying around. And ZFS makes it really easy, and dynamic.
RE: Question about mdadm-like alternatives - Added by Christian o about 1 year ago
A zpool can be backed by multiple "vdevs" (virtual devices), such as RAIDZ groups or mirrors.
But you cannot "recurse" vdevs (create a RAIDZ out of 3x2 mirrored devices).
If you need everything to be redundant, your best choice is probably to mirror your 2 2TB disks and build a RAIDZ or RAIDZ2 out of each remaining group of same-sized devices.
Then assign them all to the same zpool - or maintain two different pools; one performance oriented mirror pool and one storage optimized RAIDZ/2 pool.
ZFS doesn't solve every possible "I have drives lying around" problem - but what it supports it supports well.
I've been relying on nexenta+zfs for years. There is nothing like a "zpool scrub" report to ease your data worries. Not to mention the throughput :)
RE: Question about mdadm-like alternatives - Added by Richard Elling about 1 year ago
You can start with a mirror, say the 2x 2TB disks and then grow it (zpool add command) with more mirrors (or even a raidz). This would be the most flexible, high performance option. This sort of option is especially useful when you don't need all of the space right now, but expect to grow over time.
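As a rough illustration of the "start with a mirror, then grow it" idea, here is a small Python sketch that only builds and prints the `zpool create` / `zpool add` command lines; it does not run them. The pool name `tank` and the device names (`c1t0d0` and so on) are hypothetical placeholders, not taken from this thread, so substitute whatever your system reports.

```python
def zpool_create(pool, vdev_type, devices):
    """Build the command line that creates a pool from a single vdev."""
    return ["zpool", "create", pool, vdev_type, *devices]


def zpool_add(pool, vdev_type, devices):
    """Build the command line that grows an existing pool by one more vdev."""
    return ["zpool", "add", pool, vdev_type, *devices]


# Hypothetical names: "tank" and the cXtYd0 devices are placeholders.
POOL = "tank"
commands = [
    # Start with a mirror of the two 2TB disks.
    zpool_create(POOL, "mirror", ["c1t0d0", "c1t1d0"]),
    # Later, grow the pool with another mirror (here: two of the 1TB disks).
    zpool_add(POOL, "mirror", ["c1t2d0", "c1t3d0"]),
]

for argv in commands:
    # Printed only, so the commands can be reviewed before anything is run.
    print(" ".join(argv))
```

Adding a raidz vdev instead would use the same `zpool add` form with `raidz` or `raidz2` in place of `mirror`. Printing rather than executing is deliberate: adding a vdev to a pool is not something you can casually undo, so it is worth reading the command twice.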
-- richard
RE: Question about mdadm-like alternatives - Added by Daniel Buus about 1 year ago
Hi guys, thanks for the replies :)
Getting a bit confused, though. If I may recap, judging from Christian's response in particular, it appears I came to the correct conclusion previously: that you cannot "create a zpool from other zpools" or do anything else WITHIN zfs which will allow you to group smaller drives in "chunks" large enough to become members of one big zpool (as in 2TB + 2TB + 2x1TB + 4x½TB, for instance).
BUT, I can use SVM, right? Even though it's "the past" as pointed out. As in 2TB RAW + 2TB RAW + 2TB SVM + 2TB SVM.
I know that this isn't best practice, because ZFS doesn't have direct access to the hardware, but AFAICT the only issue is with performance, right?
I'm not concerned about performance at all. For one, because this is just my own "stash" drive where space and data integrity are far more important. Secondly, because I still get pretty awesome performance with a similar setup on my current Ubuntu system, only with mdadm JBOD (to group up smaller devices) + mdadm RAID (to assemble large raw drives and the JBOD groups) + XFS. Read performance here ranges between 250 and 300 MBps, which is pleeenty for me :)
I like the idea of being able to use my older ½TB drives for as long as they're healthy, and then replace them with larger drives over time when they fail, obsoleting the SVM groups and as a plus freeing up SATA ports for even more storage to be added to the ZFS pool.
So, best practices aside, would this be the best way to go about it, seeing as this is what I want to do?
Thanks, Daniel :)
RE: Question about mdadm-like alternatives - Added by Christian o about 1 year ago
Just to clarify - you can actually create a single zpool backed by a number of vdevs, e.g.:
- a RAIDZ2 vdev of all your 8 500GB disks => I guess 3000GB very redundant space
- a RAIDZ of your 6 1000GB disks => I guess 5000 GB redundant space
- a mirror of your 2 2000GB disks => 2000GB redundant space
your single zpool would then have around 10TB of secure storage (out of a total of 14TB raw).
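The arithmetic behind that estimate can be sketched in a few lines of Python. This is a back-of-the-envelope calculation only: it assumes a raidz vdev loses one disk's worth of space to parity, raidz2 loses two, a mirror yields the size of one disk, and it ignores metadata and filesystem overhead. The function and variable names are made up for the example.

```python
def usable_gb(kind, disks, size_gb):
    """Approximate usable space of one vdev, ignoring ZFS overhead."""
    if kind == "mirror":
        return size_gb  # every disk holds the same data
    parity_disks = {"raidz": 1, "raidz2": 2}[kind]
    return (disks - parity_disks) * size_gb


# (vdev type, number of disks, size of each disk in GB)
layout = [
    ("raidz2", 8, 500),
    ("raidz", 6, 1000),
    ("mirror", 2, 2000),
]

raw_gb = sum(disks * size for _, disks, size in layout)
usable_total_gb = sum(usable_gb(kind, disks, size) for kind, disks, size in layout)

for kind, disks, size in layout:
    print(f"{kind:7s} {disks} x {size}GB -> {usable_gb(kind, disks, size)}GB usable")
print(f"pool: {usable_total_gb}GB usable out of {raw_gb}GB raw")
```

With the numbers above this gives 3000 + 5000 + 2000 = 10000GB usable out of 14000GB raw. Changing the 1TB group from `raidz` to `raidz2` drops that group to 4000GB and the pool to 9000GB.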
Now for the bold (since you want secure storage it is probably not for you): you could also put all kinds of odd-sized drives in a 14TB no-redundancy zpool and tell your zfs layer to keep multiple copies of each block (see ZFS Copies). But with two copies this would only yield 7TB of storage, so it is probably not the first choice you should look at.
/C
RE: Question about mdadm-like alternatives - Added by Daniel Buus about 1 year ago
Okay, that's pretty much how I figured it would work if you couldn't create a vdev from other vdevs. It's not a bad setup, really - you have the RAIDZ2 on the array with the most disks (the one that's most prone to lose a second disk while recovering from the loss of one disk), and the two 2TB disks as a simple mirror (which could in turn be converted into a RAIDZ and then a RAIDZ2 as you added larger drives, I guess).
The weakest link here would probably be the 6 1TB disks, as only one of them is allowed to fail. If two of these go down, it's all a goner. So that would have to be a RAIDZ2 too. I just had two disks report errors on my RAID5 array, and I just don't think I can take that kind of suspense once more ;)
So that brings it down to 9TB effective, and worst-case scenario, I'd still only really be safeguarded against one dead drive, since if the two 2TB drives went down, so would the pool.
Actually, I'm thinking that should I really abandon the idea of abstracting through SVM, it would be better to use partitions (I believe it's "slices" in Solaris terminology?). As in, slice the 2TB drives into four chunks each, the 1TB drives into 2 each, and then use the 500GB drives "raw" to create 4 RAIDZ2 vdevs of (2 x 1/4 of a 2TB drive + 3 x 1/2 of a 1TB drive + 2 x 500GB drives). That's 2½TB effective in each of four RAIDZ2 vdevs, pool them together, and I would have 10TB of RAIDZ2-guarded pool space.
This would yield the same capacity with (technically) the same redundancy, and still allow any two drives to fail. It would also allow quicker recoveries, as at the most a quarter of the total storage would have to be processed, as opposed to all of it for the SVM approach. The downside here, as I see it, is that this pool would never be able to survive more than three disk failures (one 2TB and 2 1TBs or 1 1TB and 2 500GBs), as each smaller drive plays almost just as large a part in the aggregated pool as each of the largest.
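To double-check the slice counts and the 10TB figure, here is a small Python sketch. It assumes every member of every vdev is a 500GB chunk, that RAIDZ2 costs two members per vdev, and it ignores ZFS overhead. All names in it are made up for the example.

```python
SLICE_GB = 500   # size of every vdev member in this layout
VDEVS = 4        # number of RAIDZ2 vdevs in the pool
PARITY = 2       # RAIDZ2 gives up two members per vdev

# (number of drives, drive size in GB, slices cut from each drive)
drives = [
    (2, 2000, 4),  # each 2TB drive cut into quarters
    (6, 1000, 2),  # each 1TB drive cut into halves
    (8, 500, 1),   # 500GB drives used whole
]

# Every slice must come out at the same size for the vdevs to be uniform.
for count, size_gb, slices in drives:
    assert size_gb // slices == SLICE_GB

total_slices = sum(count * slices for count, _, slices in drives)
members_per_vdev, leftover = divmod(total_slices, VDEVS)

usable_per_vdev_gb = (members_per_vdev - PARITY) * SLICE_GB
pool_gb = usable_per_vdev_gb * VDEVS

print(f"{total_slices} slices -> {members_per_vdev} per vdev, {leftover} left over")
print(f"{usable_per_vdev_gb}GB per vdev, {pool_gb}GB in the pool")
```

That works out to 28 slices, 7 members per vdev with none left over, 2500GB usable per vdev and 10000GB for the pool, matching the figures in the post.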
By grouping the 500GB drives into abstract SVM units of four each, theoretically up to eight drives are allowed to fail, provided they're all 500GB drives. I could also survive 4 1TB drives failing. Though never more than both of the 2TB drives. Actually, these are very appropriate conditions, since all the 500GB drives are from 2007, the 1TB ones are from early 2009, and the 2TB ones are brand-new :) I'm guessing the 500GB drives are gonna go down first ;)
What I like about the slices approach, though, is that I'd be able to use both Linux and Solaris/Nexenta to access them, as there'd be no platform-specific LVM/SVM/mdadm abstraction layer in-between, it'd be just raw disks with ZFS on top.
What do you think about this approach? I guess the slices approach is more best-practice than the SVM abstraction one, right? Is ZFS intelligent enough to distribute its data appropriately across such vdevs? And is there any significant performance loss using slices rather than entire drives?
Thanks for all the help, this is really quite fun :)
RE: Question about mdadm-like alternatives - Added by Daniel Buus about 1 year ago
Hmm... Actually, it would allow up to 5 drives to fail, 1 2TB and four 500GBs. Although, this would have to be almost just as "lucky" failures as with the eight theoretical maximum failures on the SVM approach.
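Failure counts like these are easy to get wrong by hand, so here is a brute-force Python sketch that tries every combination of failed drives for the slices layout. It rests on assumptions not spelled out in the thread: each 2TB drive contributes one quarter to every vdev, 1TB drives 0-2 put their halves in vdevs 0 and 1, 1TB drives 3-5 in vdevs 2 and 3, and each vdev gets two whole 500GB drives. The drive labels are invented, and a vdev is treated as lost once more than two of its members sit on failed drives.

```python
from itertools import combinations


def build_vdevs():
    """Return four sets, each naming the drives that have a member in that vdev."""
    vdevs = [set() for _ in range(4)]
    for members in vdevs:
        members.update({"2TB-0", "2TB-1"})   # one quarter of each 2TB drive
    for i in range(6):
        first = 0 if i < 3 else 2            # assumed placement of the halves
        vdevs[first].add(f"1TB-{i}")
        vdevs[first + 1].add(f"1TB-{i}")
    for i in range(8):
        vdevs[i // 2].add(f"500GB-{i}")      # two whole 500GB drives per vdev
    return vdevs


def survives(vdevs, failed):
    """RAIDZ2: the pool survives if no vdev loses more than two members."""
    return all(len(members & failed) <= 2 for members in vdevs)


vdevs = build_vdevs()
drives = sorted(set().union(*vdevs))

# Does the pool survive every possible pair of failed drives?
any_two_ok = all(survives(vdevs, set(pair)) for pair in combinations(drives, 2))

# Largest number of simultaneous failures that at least one combination survives.
max_survivable = max(
    k for k in range(len(drives) + 1)
    if any(survives(vdevs, set(combo)) for combo in combinations(drives, k))
)
luckiest = next(
    combo for combo in combinations(drives, max_survivable)
    if survives(vdevs, set(combo))
)

print("any two failures survivable:", any_two_ok)
print("best case:", max_survivable, "failed drives:", luckiest)
```

Under these assumptions any two failures are survivable, and the best case comes out at eight rather than five: if all eight 500GB drives fail, each vdev loses exactly two members, which RAIDZ2 tolerates. The "one 2TB plus four 500GB" case from the post is also survivable, as long as the four 500GB drives sit in four different vdevs.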
RE: Question about mdadm-like alternatives - Added by Daniel Buus about 1 year ago
One more thing; by "Is ZFS intelligent enough to distribute its data appropriately across such vdevs?", I'm considering the fact that the larger drives are part of two or all four arrays respectively, and therefore it would be better to not try to distribute data reads and writes across the vdevs but rather treat them as a JBOD to avoid disk thrashing.