Monday 26 January 2009

My Storage/RAID System

So I thought I would document the computer architecture I have in my house, along with its advantages and disadvantages.

The core of the system is the file server, and the core of that is the RAID card: an Adaptec AAR-2820SA, which uses PCI-X for bandwidth into the card. The PCI-X requirement severely limited my choice of motherboard, so I had to go for quite an expensive one. The advantage was that its bus architecture could support a transfer from the GigE interface to memory at the same time as a transfer to the PCI-X bus, without contention. Initially the machine had 1GByte of memory and a 2GHz processor.
I installed the then-current version of Fedora Core, configured 3 * 500G drives as RAID5 and started to do some profiling.

I apologise for the lack of exact numbers ahead, as it was quite some time ago that I did this:
What soon became apparent was that, individually, the performance of the system was close to the theoretical maximum. That is, for writes of data to the RAID the performance limitation was the speed of the 3 hard drives, and with caching effective (i.e. a number of short writes) the only write speed limit was the PCI-X bus. Similarly, for reads I could source data as fast as the drives could spool it. In terms of network performance I could (for non-application data) get up to 80% of the capacity of the GigE interface (it took 5 other computers working together to be able to sink/source this much data). All seemed perfect.
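
For anyone wanting to reproduce that kind of raw disk test, something along these lines is what I mean; the path and sizes here are illustrative rather than my exact test runs:

Sequential write, with a sync at the end so the cache can't flatter the result:
dd if=/dev/zero of=/data/fs/testfile bs=1M count=8192 conv=fdatasync

Sequential read, after dropping the page cache so you measure the drives rather than RAM:
echo 3 > /proc/sys/vm/drop_caches
dd if=/data/fs/testfile of=/dev/null bs=1M
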
However I soon noticed some problems: for Samba transfers of a mixture of files the data throughput was about a quarter of this. It didn't take long to blame the network protocols, or at least Microsoft's implementation of them (NFS was better than Windows networking, but still not perfect).
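
The sort of Samba fiddling I describe below lives in the [global] section of smb.conf; these values are just an illustration of the kind of knob I was turning, not a recommendation (none of them fixed it):

socket options = TCP_NODELAY SO_RCVBUF=65536 SO_SNDBUF=65536
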
After much fiddling with things like packet sizes and various bits of Samba config, the solution I hit on was the dumb approach: dump a whole extra bunch of RAM in the server, bringing it up to 3G. This was where things got weird, because at about the same time I had bought 3 * 750G drives and expanded the array across these extra drives. The RAID card now had twice as many drives to access, so by the numbers it should have been limited only by the PCI-X bus (i.e. able to saturate the GigE interface for large file transfers):

Tests of raw transfers hadn't changed: if you were just streaming TCP traffic without NFS or Windows networking then everything was as fast as it always was. However, any Windows machine I threw at it was unable to get more than 20% GigE network utilisation, and if you threw 2 Windows machines at the file transfers then you got 20% aggregate between the 2 machines (usually 10% on each machine). This pointed the finger of blame squarely at a problem on the file server. Interestingly, at the same time you could happily have an NFS transfer going (getting about 60% = 75MB/sec) without impacting the performance of the Windows transfer. This confused me like nobody's business!
A check of what was going on on the server showed that under the maxed-out Windows condition most of the system memory was being used for file caching. Over 2.5G of memory seemed to be in use by the kernel; as the transfer progressed the usage increased until this plateau was reached, then slowly decreased after the transfer completed. Interestingly, regardless of the state of the memory usage, the transfer rate never changed. This made no sense: if the memory was being used as a write cache then the transfer rate should have dropped when the plateau was reached, but it stayed constant. If it was a read cache, then why did it die off quite so significantly at the end of the transfer?
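
For what it's worth, this is roughly what I was watching to reach those conclusions (reconstructed from memory):

watch -n 5 free -m
grep -E 'Cached|Dirty|Writeback' /proc/meminfo

The Dirty and Writeback lines in /proc/meminfo are what should separate the two cases: a write cache shows up as dirty pages waiting to hit the drives, a read cache as clean Cached pages.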

Frankly, I've never figured this out and it still haunts me why the Windows file transfer performance is so poor. I can only think it is something to do with a poor implementation of the protocols by Samba, because they were reverse engineered; however, the tests I have read online seem to indicate that Samba's performance is superior to native Windows!

So, my performance confusion aside, how do I have it configured?
Simple: with an 8 port card I have 3 drives as one half of the array, 3 more as the other half, and then 1 as a hot spare. When I need to expand the array I buy 3 of the largest hard drives available at the time, e.g. I first bought 3 * 500G, then 3 * 750G, then 3 * 1TB, the 3 * 1TB replacing the original 3 * 500G. Since my controller allows RAID5 with online expansion it's quite easy: first you expand the drive array onto the extra space, then grow the logical drive to the maximum size (my) Linux supports (2TByte). This means that at the moment I have 2 RAID5 arrays across the drives. Each array occupies all 6 drives (the most efficient use of the parity drive); the Linux drive size limit is why multiple arrays are needed. However it does mean that the top 250G on each 1TB drive is unused. In theory I could create a new logical drive in this space, but I am not yet desperate enough for space, given the file system management issues it would cause - I'll explain that later.
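
To put rough numbers on that (decimal gigabytes, ignoring formatting overheads):

usable slice per drive = 750G (the arrays can only use the capacity of the smallest drive in the set)
RAID5 capacity across 6 drives = (6 - 1) * 750G = 3750G
split by the 2TByte limit = one 2TB array plus one ~1.75TB array

Allowing for the decimal-versus-binary difference, that lines up with the 3.4TB file system I mention below.
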
These arrays are stitched together into a single file system in Linux via LVM. I do this because I prefer 1 large file system: I find it easier to manage with quotas than with many individual file systems of different sizes. However I have found by profiling that there is a performance hit from using LVM to stitch too many physical volumes into a single file system, so I try to keep the number stitched together below 4 (after a restore from backup I've recently been able to get it back down to 2 physical volumes for a 3.4TB file system).
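
If you want to see how a file system is spread across the underlying arrays, the stock LVM2 tools will show you (using the volume names from the commands below):

pvs
lvdisplay -m /dev/mainvg/fslv

The -m flag makes lvdisplay list which physical volume each segment of the logical volume lives on.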

So how do I do all this?
Well to create things in the first place:
In fdisk, open the device (our RAID arrays always appear as /dev/sdX; I'll use /dev/sda in the examples below)
fdisk /dev/sda
and add in a partition
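
From memory the keystrokes inside fdisk are along these lines - check the prompts rather than trusting this verbatim:

n (new partition, accepting the defaults for a single full-disk partition)
t (change the partition type)
8e (the type code for Linux LVM)
w (write the table and exit)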

Create a physical volume on the partition you have just created
pvcreate /dev/sda1

If it doesn't already exist then create a volume group called anything (e.g. mainvg) using the partition that now has the pv on it
vgcreate mainvg /dev/sda1

Create a logical volume called fslv of size 550G (for example) using volume group mainvg
lvcreate -n fslv -L 550G mainvg

Then you need to make the ext3 file system on fslv. The resize option is set to what I believe is the maximum, to allow it to be expanded later.
mke2fs -j -E resize=1073741824 -b 4096 -i 16384 -L fs /dev/mainvg/fslv

I personally disable mount-count and interval checks for my large data partitions - then run fsck manually occasionally. Don't forget to actually run them though, as I once forgot to and later regretted it. (The -r 0 additionally sets the reserved block count to zero, so root's usual 5% reserve isn't eating into a multi-terabyte data partition.)
tune2fs -c 0 -i 0 -r 0 /dev/mainvg/fslv
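
Create the mount point if it doesn't already exist
mkdir -p /data/fs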

Create an fstab entry for it (using your favourite editor to edit /etc/fstab)
/dev/mainvg/fslv /data/fs ext3 acl 0 4

then mount it using the fstab entry
mount /data/fs

If you're using SELinux then change the security context
chcon -t mnt_t /data/fs

And you're done.

Let's assume that later you buy more drives and add them. At this point I go into the supplied management tools and use them.
cd /usr/StorMan
./StorMan.sh

Then use the GUI to expand the array onto the new drives, and then expand the logical drive as appropriate. If you're creating a new logical drive instead, what is said above and below should be enough to figure out what you want to do, depending on whether you're expanding the current file system or adding a new one.
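
If you'd rather stay on the command line, the Adaptec package also ships a CLI tool, arcconf; I drive the expansion through the GUI myself, but for checking the controller and array state something like this works:

arcconf getconfig 1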

Resizing is tricky with LVM: at least in the LVM version I set this up with there was no working pvresize command, otherwise you could just re-size the partition in fdisk and then use ext2online. Instead we expand the logical drive, add a new partition and a new physical volume, add that to the volume group, and then expand the file system to use the rest of the volume group.
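
For reference, on LVM2 versions where pvresize does work, the shorter path would be roughly this (I haven't run it on my own setup): grow the partition in fdisk by deleting and recreating it with the same start sector, then

pvresize /dev/sda1
lvextend -l +100%FREE /dev/mainvg/fslv
resize2fs /dev/mainvg/fslv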

In fdisk add a new partition in the space at the end of the current drive. If you want to do this without rebooting, partprobe should do the job; however I've had mixed success with that tool, so these days I just reboot at a convenient time.

Next extend the mainvg volume group by adding sda2 (or whatever identity you're up to) to it
vgextend mainvg /dev/sda2

Extend the logical volume by 10G, or however much you need to (ideally the exact size you just added) - you can iterate this to use every last gig!
lvextend -L +10G /dev/mainvg/fslv
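
To see exactly how much unallocated space is left in the volume group at any point, vgdisplay shows it on the Free PE / Size line:
vgdisplay mainvg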

Finally expand the ext3 file system to use all the space available on the LV.
For old systems:
ext2online /dev/mainvg/fslv
For newer systems:
resize2fs /dev/mainvg/fslv

And that's it.

Now, to do performance monitoring I usually just time a copy of a whole bunch of data. A mix of small and large files is normally good (I find my video collection and my photo collection make good large and small file size datasets), as long as the dataset is appreciably larger than the total of all the caches involved.
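
Concretely it's nothing more sophisticated than this, with the paths standing in for wherever your test data lives:

du -sb /data/fs/videos
time cp -a /data/fs/videos /data/fs/scratch/

du gives the total size in bytes; divide it by the elapsed time from time to get an MB/sec figure.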

For network performance, file copies are a good test; however, to get raw throughput rates:
On the server
nc -u -l 5000

On your other computer (192.168.1.45 in this case is the IP address of the server you are testing):
pv < /dev/zero | nc -u 192.168.1.45 5000
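
To see the rate at the receiving end as well (worth doing, since UDP will silently drop packets and the sender's rate can flatter the network), pv works on the server side too:

nc -u -l 5000 | pv > /dev/null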

I normally try this with a number of computers at once to be sure that the server is my limit, not the other computer. Of course, if you do this then each connection pair needs a different port number (i.e. change 5000 to 5001, 5002 etc). One thing I've just discovered and need to run on my next set of tests is keeping an eye on stuff in /proc for hints, e.g.
watch cat /proc/sys/net/ipv4/tcp_mem

So all that said and done, what next?
Well at the moment the architecture seems to have plenty of life left in it. Were I buying it now there'd probably be an equivalent PCIe RAID card available. I managed to get hold of a full tower case that had 6 * 5 1/4" bays, which meant I could fit 2 hot-swap drive cages, giving me proper hot swap on all my drives. By the way, the cages are absolutely fabulous except for 2 problems: first, the drives get quite hot, and second, don't lose the drive mounting screws that come with them - replacements are next to impossible to find.
However I don't think in the future I'd go to the extra expense of using hardware RAID. Although the performance of the card itself is awesome, and appreciated for the times when I'm re-organising my data, it's really not needed for even the most demanding home applications (which is what mine is). In the future I'd spend less on the RAID card itself and the motherboard and just do software RAID; comparisons of performance with friends who have done this suggest that mine may be the fastest, but I really don't think it's worth it.

As for how do I backup this 4TB behemoth? That's a story for another post...
