Clustering VS. Mainframe

David Mandala plug-discuss@lists.plug.phoenix.az.us
22 Jan 2003 14:42:45 -0700


Hmm, it does work for some storage clusters. AppleTalk clusters may have
a problem, but other databases are able to cluster and accommodate a dead
node without the need to reboot the clients. A well-designed cluster
built for a purpose can be designed to avoid choke points. Again,
depending upon the devices and the design, it can and should take less
than 1 second to detect a failed hardware point. Anything on the order
of the time limits you are describing (3 minutes) is unacceptable
performance.
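To make that concrete, below is a minimal sketch of the sort of
heartbeat monitor I mean, in Python. The peer addresses, port and 500 ms
timeout are made-up examples rather than any particular cluster product;
the assumption is that each peer sends a small UDP datagram every 100 ms
or so:

    import socket
    import time

    # Made-up peer addresses and timings; a real cluster reads these from
    # its configuration or membership service.
    PEERS = {"10.0.0.2": None, "10.0.0.3": None}  # node -> time of last heartbeat
    TIMEOUT = 0.5                                 # declare a node dead after 500 ms

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 9999))                  # peers send a datagram here ~every 100 ms
    sock.settimeout(0.1)

    while True:
        try:
            _, (addr, _port) = sock.recvfrom(64)
            if addr in PEERS:
                PEERS[addr] = time.time()         # note the heartbeat
        except socket.timeout:
            pass
        now = time.time()
        for node, last in PEERS.items():
            if last is not None and now - last > TIMEOUT:
                print("node %s missed heartbeats, starting takeover" % node)
                PEERS[node] = None                # hand off to the recovery logic here

Detecting silence from a peer is the cheap part; the design work goes
into the takeover logic that runs afterwards.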

With the correct design -- be it a web, database or calculation
cluster -- the customer never knows a node went down, nor should they.
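As a toy illustration of what "the customer never knows" looks like from
the client side, in the stateless web case at least, here is a sketch of
a client helper that simply tries the next node when one stops
answering. The node URLs are hypothetical:

    import urllib.request

    # Hypothetical node list; a real deployment would get this from DNS,
    # a load balancer or the cluster's own membership service.
    NODES = ["http://node1.example.com", "http://node2.example.com",
             "http://node3.example.com"]

    def fetch(path):
        """Try each node in turn so the caller never sees a single dead node."""
        last_error = None
        for base in NODES:
            try:
                with urllib.request.urlopen(base + path, timeout=1) as resp:
                    return resp.read()
            except OSError as err:    # covers refused connections and timeouts
                last_error = err      # this node is down, move on to the next
        raise last_error              # only if every node is unreachable

Stateful protocols need the same retry idea pushed into a layer that can
re-establish session state, which is the harder part.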

Sorry, a Google cluster is not just a web cluster. A Google cluster
consists of approximately 80 machines. Some are database slices (their
database is HUGE), some are logic and some are web. If any machine in
the cluster fails you never see it; it does not matter whether it is a
database, logic or web machine, and the sucker will even hand off to
another cluster if need be. Google had well over a thousand of these
clusters; I'm fairly sure it's 3 or 4 thousand by now, could be even
higher.

Not all problems lend themselves to clusters; those that do can make use
of commodity hardware and Linux, and save big bucks compared to a big
iron machine.

Cheers,

Davidm 

On Wed, 2003-01-22 at 14:24, Eric Lee Green wrote:
> On Wednesday 22 January 2003 01:03 pm, David Mandala wrote:
> > A few more reasons that people go to clusters are:
> >
> > 1) Failover/Down time. If a single unit in a cluster dies the rest keep
> > on working. 
>  
> This works for web clusters because of the non-persistent nature of web 
> connections. It does NOT work for storage clusters. If a member of a 
> clustered Appletalk storage network goes down, any Mac connected to that 
> particular node must be re-booted before it will re-connect, for example. You 
> can believe me or not, I have a Fortune 5 customer using one of my storage 
> clusters and that's what happens when a node fails. 
> 
> SMB is a little more graceful, it just pops up a requester saying that the 
> connection has been broken, and requests that you re-attach. Do note that any 
> writes outstanding at the time that the node goes down are *LOST*.
> 
> Finally, virtually all clusters have a "choke point". In the case of web 
> clusters, that's often the database server. In the case of distributed file 
> servers (such as clusters built using GFS, the Global File System), that's 
> often the lock server or the Fiber Channel bus between the RAID array and the 
> leaf nodes, or the master controller board on the RAID array. This choke 
> point goes down, the whole cluster goes down.
> 
> So let's make it highly available, you say? Fine and dandy. Been there, done 
> that. It takes me approximately 90 seconds to detect that the master node of 
> a highly redundant pair has failed. It then takes me another 60 to 90 seconds 
> to bring up the services on the slave node. So that's approximately 3 minutes 
> that the cluster is not available. During that time, the Macs are frozen. The 
> Windows boxes are popping up their window saying you need to reconnect. Any 
> writes in progress are lost. 3 minutes is a lot better than 3 days, but is 
> nowhere near what a modern mainframe can achieve -- less than three minutes 
> of downtime PER YEAR. 
> 
> In a big iron mainframe, if one CPU goes down, the rest keeps on working -- 
> completely transparently. There is no 3 minute switchover. If one memory card 
> goes down, the rest keeps on working -- completely transparently. And of 
> course with RAID, hard drive failures aren't an issue either. You can 
> hot-swap CPU's or add CPU's on an as-needed basis with a big iron mainframe. 
> Same deal with memory cards. An IBM mainframe has uptime in the nine nines 
> range. 
> 
> > If the cluster is big enough it may even be hard to notice
> > that a single unit dies. (Google uses special clusters for their search
> > engine.)
> 
> Google's search engine is a special case cluster that is enabled by the fact 
> that it's a web cluster. As a web cluster, a failing node results in a short 
> read. You click the refresh button, you get connected again to a non-failed 
> node, and things work again. Secondly, all interactive accesses are reads. 
> Writes to their back-end databases are done as a batch process then 
> distributed in parallel to the various nodes of the cluster, they are not 
> real-time updates. This approach is utterly unsuited for a general purpose 
> storage cluster, whether we are talking about a file storage cluster or a 
> database cluster. I use a similar approach to replicate storage migration 
> data to the multiple nodes of a storage cluster, but I have complete control 
> over the database and all software that accesses it -- if we were offering a 
> database service to the outside world, this approach would not work *at all*. 
> 
> > 2) Cost, for many problems that can be split over a cluster it is
> > usually cheaper to build a big cluster than to buy one machine, and
> > item one becomes a factor too.
> 
> The operative point is "can be split over a cluster". Not all problems can be 
> split over a cluster (or if they can, it is in a very clunky and unreliable 
> way). In reality, if it is a CPU-intensive application, a cluster is cheaper. 
> If it is an IO-intensive application, often a big iron machine is cheaper. I 
> would not attempt to run my corporation's accounting systems off of an Oracle 
> database cluster. I'd want to run it off of a big honkin' database server 
> that had fail-safe characteristics. 
> 
> 
> > Some mainframes look like large clusters to the software running on
> > them. The IBM 390 running Linux can have thousands of Linux instances
> > running on the same machine. The software thinks it's on a cluster but
> > the actual hardware is the mainframe. This is a special purpose item.
> > The mainframe of course costs big $$$ but there are cases where this is
> > cheaper than a cluster of real hardware when you calculate MTBF,
> > floorspace and head costs.
> 
> Indeed. 
-- 
David IS Mandala
gpg fingerprint 8932 E7EF CCF5 1B8C 1B5C A92E C678 795E 45B2 D952
Phoenix, AZ (480) 460-7546 HP, (602) 741-1363 CP
http://www.them.com/~davidm/