After upgrading to snv_121, I had not found any problems with it… well, until 2 days ago.  I have 2 directories on a ZFS RAIDZ volume that I could not access.

The volume lives in a simple 5-bay, port-multiplier-enabled enclosure (Silicon Image, Sil3726 I believe), connected to the NAS box via the USB2 to eSATA converter that I have mentioned before.  Each bay holds a 1TB drive, of various makes.

I performed a scrub and, to my surprise, there were a few errors, but thanks to RAIDZ, nothing was lost.  The scrub finished without any issue, but I still could not access the 2 directories over CIFS.  I could access other directories just fine.  When I ssh'd into the box, the content of the directories was intact… this would point to the CIFS/SMB service being the source of the issue.

I tried to copy the content to another directory within the same volume using a simple cp or rsync.  Strangely, the new directory also could not be accessed over CIFS/SMB.  I suspect that the UTF-8 filenames of some files might have something to do with it.  Since I first observed this problem in snv_121, I thought, why not boot into snv_118 using the wonderful snapshot that each upgrade takes before the installation?

So I chose to boot into the previous snapshot, then tried to access the drives… THE DRIVES WERE NOT ACCESSIBLE!  A zpool status showed that the volume I had the issue with was OFFLINE, and all the hard drives in the enclosure were listed as OFFLINE.  “My God, did I lose everything?!”  I thought to myself.

I rebooted the enclosure and waited for it to power up, then ran zpool status again.  Okay, everything was reported to be fine.  I went to that storage pool… IT WAS EMPTY!!!  “OMG, I really did lose everything!!!”

I rebooted the NAS box to go back to snv_121, thinking that the newer ZFS version might have caused snv_118 not to recognize the pool… again, the storage pool had a good status but no data.  Suddenly, it dawned on me that I had read somewhere about exporting the volume and reimporting it.  So I tried that… IT WORKED!  The storage pool came back to life with everything intact.

So, a lesson learnt… if you run into a problem like this, try doing a zpool export <poolname> followed by a zpool import <poolname>.  I am not sure why ZFS didn’t automatically recognize the data.  I didn’t dare to do a scrub since I had no idea how destructive it could have been…  I hope this helps someone out there who might have a similar issue.
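For anyone who wants to script the export/import step, here is a minimal sketch.  It assumes a modern Python 3 on the box and uses "tank" as a placeholder pool name (not the name of my pool); by default it only prints the two commands instead of running them.

```python
import subprocess

def reimport_commands(pool):
    """Return the export/import command pair for a ZFS pool."""
    return [["zpool", "export", pool], ["zpool", "import", pool]]

def reimport_pool(pool, dry_run=True):
    """Export and re-import a pool; with dry_run=True only print the commands."""
    for cmd in reimport_commands(pool):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises if the export fails, so the import is never attempted
            subprocess.run(cmd, check=True)

# "tank" is a hypothetical pool name used for illustration only
reimport_pool("tank")
```

Calling reimport_pool with dry_run=False would run the real commands, which needs root privileges and a pool that nothing is currently using.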

I am very impressed with the upgrade process of OpenSolaris.  No other operating system that I know of offers this kind of upgrade assurance (well, I am pretty paranoid actually).  On each version upgrade, the Update Manager in OpenSolaris (if you use ZFS as the root file system) will automatically create a snapshot of the existing installation and then install the new version.  If you run into a problem with the new version, as long as the boot manager still comes up, you can simply boot into the previous working snapshot and everything will be as it was.

While I understand the above, the upgrade still made me a bit uneasy… after biting my nails for a while, I decided to go ahead with it.  I fired up the Update Manager and chose to upgrade all packages…  I wasn’t watching the progress, but continued to write to the NAS while the upgrade was in progress.  I think this is important because it means that uptime was maximized even during an upgrade.

After the reboot, the new version came up flawlessly.  I haven’t yet seen any performance improvement per se in my daily use (I have only used it for 2 days)… but will report back if I find anything interesting.

All in all, everything was where it was supposed to be… OpenSolaris is really quite a mature OS.

Now this is interesting.  If you have been following my quest for a perfect self-assembled NAS box, you will have read about the initial performance statistics.  I am not trying to be 100% scientific, just reporting what I have observed.

My main pool of storage is on 6 1TB WD Green drives.  2 of them are on the Atom 330 motherboard’s built-in Intel 945G SATA ports and 4 are on the Sil3124 add-on card.  The 6 disks are configured as a RAIDZ volume.  Performance is around 14MB/s for writes and 6.5MB/s for reads.  The write speed, while averaging 14MB/s, fluctuates a lot.  I wasn’t sure if this was due to the CPU, the disk interface or the network interface.

I also have another Sil4726 storage box with 5 WD 1TB Green drives that I have hooked up to the NAS via a USB2 to eSATA converter.  The 5 drives have also been configured as a RAIDZ volume.  Because it is on USB2, I was not as keen on using it as the main storage pool.  However, I just found some old drives with older data archives that I decided to move to this storage volume.

To my surprise, this pool of storage is able to sustain a 14MB/s write with almost no fluctuation in speed (again, I am just looking at the number reported by Teracopy).

I have 2 theories on this observation:

1) The Atom 330’s max performance for RAIDZ on OpenSolaris is maybe around 14MB/s.

2) It was a surprise to me that the USB2 pool is actually able to sustain that kind of write speed, which is faster than through native SATA.  Therefore, I think that the PCI interface card is so limited in bandwidth that the write speed fluctuates this much.  The USB2 interface, while technically inferior in speed, is linked more closely to the southbridge, so it is actually faster.  You will probably get better performance with a higher-end PCI Express add-on card if you build a NAS on a more powerful motherboard.
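As a rough back-of-envelope check on the two theories, the sketch below compares the observed 14MB/s against the theoretical peak rate of each link in the path.  It assumes the add-on card sits on a conventional 32-bit/33MHz PCI bus, which I have not verified, and the peak figures are nominal maximums that real hardware never reaches.

```python
# Theoretical peak rates in MB/s (decimal megabytes); real throughput is lower.
PEAK_MB_S = {
    "USB 2.0 (480 Mbit/s)": 480 / 8,              # 60.0
    "PCI 32-bit/33 MHz (shared bus)": 133.0,      # assumed bus type, see lead-in
    "Gigabit Ethernet (1000 Mbit/s)": 1000 / 8,   # 125.0
}

def share_of_peak(observed_mb_s, peak_mb_s):
    """Return the observed rate as a percentage of the theoretical peak."""
    return 100.0 * observed_mb_s / peak_mb_s

OBSERVED_WRITE = 14.0  # MB/s, the average write speed reported by Teracopy

for name, peak in PEAK_MB_S.items():
    print(f"{name}: {share_of_peak(OBSERVED_WRITE, peak):.0f}% of peak")
```

The observed rate is a small fraction of every nominal peak, so these numbers alone cannot say which component is the bottleneck; they only show that none of the links is anywhere near its paper limit.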

08.22.2009

What matters most for any system that holds a lot of your data is its stability.  While I haven’t seen any major issues with snv_117 and have been fairly happy with it for some time, OpenSolaris has just been upgraded to snv_118.

For OpenSolaris, upgrading the OS is actually a fairly safe operation… even without any preparation.  Compare that with upgrading Windows Vista from SP1 to SP2: the only safe way to do it is to take an image dump of the partition with a program like Norton Ghost or Clonezilla, upgrade the OS and, if anything bad happens, roll back to the previous version using the image.

With OpenSolaris, if you use ZFS as the root filesystem, the Update Manager actually takes a ZFS snapshot of the OS, then installs the new version of the OS on top.  At boot, you can easily choose to boot into a previous version that was snapshotted (is there such a word?)… no fuss, no hassle…

Even with that in mind, it is still a scary thought to upgrade… Let me think about it a bit more…

I have been away for about a month on various vacation breaks…  During this time, my OpenSolaris NAS box (I call the server Falcon) has been rock stable.  I use it daily and have not needed to reboot it for more than a month.

I have received a few comments asking for some performance numbers.  Purely based on my own observation and memory, I have seen an average of about 12-14MB/s for writing over Gigabit Ethernet.  However, the write speed varies a lot, anywhere from 2-3MB/s to 15MB/s.  I get these stats from Teracopy’s calculation… so I can’t tell you for sure if they are totally accurate.  Considering that the same machine writing to a USB2 hard drive (WDC Essential 1TB) would score about 17MB/s, I think this is a pretty good speed for a 6-disk RAIDZ pool on an Atom 330.
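Since these numbers come only from Teracopy, one way to cross-check them is to time a write yourself.  The following is a minimal sketch, not what I actually ran: it assumes a modern Python 3 on the client, the function name and test file name are made up for illustration, and the demo writes to a temporary local directory where you would substitute the path of the mounted NAS share.

```python
import os
import tempfile
import time

def measure_write_mb_s(directory, total_mb=16, chunk_mb=1):
    """Write total_mb MiB of random data into directory and return the average MiB/s."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)  # random data so nothing compresses it away
    path = os.path.join(directory, "throughput_test.bin")  # hypothetical file name
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(total_mb // chunk_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out before the clock stops
    elapsed = time.perf_counter() - start
    os.remove(path)
    return total_mb / elapsed

# Demo against a local temporary directory; point this at the share to test the NAS.
with tempfile.TemporaryDirectory() as d:
    rate = measure_write_mb_s(d, total_mb=4)
    print(f"average write speed: {rate:.1f} MiB/s")
```

A small test file mostly measures caches, so a realistic run would use a total size of several GB and be repeated a few times to see the fluctuation.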

The strange thing is that reading is only about 6.5MB/s.  It hasn’t been much of a problem for me since I mostly store media on that NAS pool for viewing on the various machines that I have around the house.  However, since I haven’t done any HD streaming, I am not sure if this is going to be a limiting factor or not.

My feeling is that the Atom 330 is underpowered.  However, it depends on what your main objective is.  Mine is to have a low-power NAS device for storing all my media, and the Atom 330 serves this purpose well.

I have been working the NAS hard for a few days… stability has improved a lot, but the machine is still somewhat unstable when moving large amounts of data onto it.

As I am migrating all my data from all sorts of portable drives onto the server, I am often copying 100+GB at a time.  I can’t find any specific pattern yet, but it seems that while the gani driver is hugely better than the default Realtek driver, it can still have stability issues.  The good thing is that the gani driver never totally locks up.  I have yet to need a reboot to regain the network connection.  However, I think that while copying a huge amount of data (unquantified at this time), the gani Realtek driver can still become temporarily unresponsive.

In order to avoid having to recopy everything, I use Microsoft’s SyncToy 2.0 to verify that the directories are in sync.  I could have used rsync, but most of my client machines are on Windows, so this looks to be the best free tool.  However, I found a strange behaviour with SyncToy 2.  If I am syncing a large directory and it needs to copy a lot of data to the destination drive, and the target path becomes unresponsive, SyncToy spits out an error that the target path was not found.  When I rerun it, it still spits out the same error, even though the network share is working just fine.

I think there is a bug in the way SyncToy maintains its ‘work file’ data (I think it saves some checksums in a file in the directory).
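If all you need is to verify that a copy is complete, a small script can stand in for a sync tool.  Here is a minimal sketch, assuming a modern Python 3 on the client; the function names are my own, and it only compares relative paths and file sizes, which is much weaker than comparing checksums.

```python
import os
import tempfile

def list_files(root):
    """Map each file's path relative to root to its size in bytes."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            sizes[os.path.relpath(full, root)] = os.path.getsize(full)
    return sizes

def compare_trees(source, target):
    """Return (missing, mismatched): files absent from target, and files whose sizes differ."""
    src, dst = list_files(source), list_files(target)
    missing = sorted(p for p in src if p not in dst)
    mismatched = sorted(p for p in src if p in dst and src[p] != dst[p])
    return missing, mismatched

# Demo with two temporary directories standing in for the source drive and the NAS share.
with tempfile.TemporaryDirectory() as src_dir, tempfile.TemporaryDirectory() as dst_dir:
    for name, data in [("a.txt", b"alpha"), ("b.txt", b"bravo")]:
        with open(os.path.join(src_dir, name), "wb") as f:
            f.write(data)
    with open(os.path.join(dst_dir, "a.txt"), "wb") as f:
        f.write(b"alp")  # simulate a copy that was cut short
    missing, mismatched = compare_trees(src_dir, dst_dir)
    print("missing:", missing, "size mismatch:", mismatched)
```

Because it keeps no state file between runs, a share that went away mid-copy cannot leave it in a confused state; the price is that every run walks both trees again.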

Maybe it is time to find another sync program…  Can’t wait for OpenSolaris’s ZFS dedupe functionality to arrive…

I have been googling the instability of my OpenSolaris NAS box when copying large amounts of data to see if this is a common issue, and finally found the following link:

http://sigtar.com/2009/02/12/opensolaris-rtl81118168b-issues/#comments

After reading the post, a lot of the symptoms that I saw started to make sense.  The Realtek 8111c/8168 driver is definitely the culprit for the system instability.  At first, I thought that it was the SMB/CIFS service’s fault.  In fact, I think that in snv_111b (2009.06), SMB/CIFS is actually not quite robust.  As reported before, after reading from and writing to it for a while, the directory shares would disappear, and executing “sharemgr show -vp” would hang.

After upgrading to snv_117, the hanging did not occur again, even after extensive reading/writing.  However, the NAS would drop off the network after a random length of time.  I suspected that network load might be the cause.  However, in one instance, I was simply copying a large amount of data within the server and the NAS box went offline.  That would point to some sort of CPU or disk workload causing the instability.

So I followed the advice above, along with some more detailed info below:

http://schlaepfer.nine.ch/twiki/bin/view/Schlaepfer/SelfMadeNas2

and upgraded the network driver to the gani driver…  So far, it has survived half a day with quite a bit of activity without issue.

Will keep testing it.  One thing to note: the gani + CIFS write speed (to a RAIDZ volume with 6 1TB disks) fluctuates quite a bit… it peaks at around 11MB/s but averages about 3.5MB/s…  I think the performance is on par with the regular rge driver… hopefully stability will be much better.

Stay tuned…

I am checking out this new storage server with much interest:

http://www.amazon.com/Acer-Aspire-AH340-UA230N-Home-Server/dp/B001WGX15W/ref=sr_1_1?ie=UTF8&s=electronics&qid=1247022235&sr=8-1

I have heard a lot about Windows Home Server already from Paul Thurrott’s “Windows Weekly”.  Windows Home Server is also capable of providing data redundancy.  It is almost like Drobo, except that it is even more expandable.  You can hook up any storage media to it and it becomes part of the storage pool.  For folders that you want extra protection for, Windows Home Server will make sure that the data is replicated to a different physical device.

The above unit comes with 1TB installed already and has 3 more drive cages for installing drives internally.  You can add more drives to it via USB and eSATA.  With a USB hub, you can add many, many more drives.  Not sure if the eSATA port will support a port multiplier or not… but I guess I can always use the CoolDrive eSATA to USB2 converter.

I guess you can never have too much storage….

07.07.2009

Installed the development build of OpenSolaris so I am now at snv_117.  Overall, it seems that the smb service is slightly more stable, but barely.

After the installation, copying files to the NAS started okay, but then for one set of data files (about 40 files), for some reason, it just refused to go through.  The client would complain of losing the network connection.  However, unlike snv_111b, the smb service never became completely hung.  The encouraging thing was that “sharemgr show -vp” continued to show the share service without issue.  I tried to break the 40 files into batches of 10, but it immediately got stuck on the 1st 10.  The copying process just hung (I use Teracopy… it is a must-have utility).

I then started a scrub on the volume to make sure that there wasn’t any issue with it… and guess what, copying while the scrub was running in the background actually seemed to make the copying process more stable.  I was able to copy a lot of data that had yet to be migrated to the NAS… go figure.

Will report back on the stability of the release with regard to CIFS/SMB.

07.07.2009

Have been using the OpenSolaris NAS box for a while now.  I have about 2TB of stuff already on it.  Mostly photos and videos that I have collected over the years.

OpenSolaris 2009.06, based on snv_111b, had this weird issue with the CIFS (SMB) service.  When copying a large amount of data, after a while (a random length of time), the SMB share would disappear (all my clients are Windows) and could no longer be accessed.

On certain occasions, the host became completely unreachable, but at times when you could still rsh into the machine, you would see that the command “sharemgr show -vp” would hang.  Nothing showed up in dmesg or the svc logs.
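A hang like this can be detected automatically by running the command with a timeout.  The sketch below is only an illustration: it assumes a modern Python 3 is available on the box, the function name is my own, and stand-in commands are used so that it runs anywhere.  On the NAS the command list would be ["sharemgr", "show", "-vp"].

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its stdout, or None if it does not finish within timeout_s seconds."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None  # treated as "the command hung"
    return result.stdout

# Stand-in commands: one that returns immediately and one that sleeps past the timeout.
quick = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=30)
hung = run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=1)
print("quick:", quick, "hung:", hung)
```

One caveat: when the timeout fires, Python kills the child and waits for it to exit, so a process that is stuck inside the kernel and ignores the kill could still block the script.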

Restarting the service doesn’t do anything and, each time, only a reboot would fix the issue.  This is highly frustrating.  I will have to upgrade to a DEV build to see if more recent builds have solved this issue.
