Linux as backup (failover) machine

Sat, 4 Nov 2000 18:32:41 -0700

> transactionally correct . . .

Transactionally correct.  That's my new favorite term
of the month.  :)

You said that both boxen are in the same data center,
and you can't install a second NIC in the NT box.  Here's
a thought.  If both boxen have an unused serial port, and
they're within a few hundred feet of each other, how about
hooking up a null modem serial cable between them?  The
Linux box could determine if the NT box was dead based
on, say, a combination of RS-232 signals being in a
specific state.  Or you could get fancy and have the NT
box pulse a "heartbeat" (raise and drop DTR every second
or something like that).
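
Here's a rough sketch of the heartbeat watcher in Python.
Assumptions that are mine, not facts about your setup: the
null modem cable wires the NT box's DTR to the Linux box's
DSR (common, but check your cable), the port is /dev/ttyS0,
and three seconds without a toggle means "dead".  The names
HeartbeatWatcher, read_dsr and watch are made up for this
example.

```python
import fcntl
import os
import struct
import termios
import time


def read_dsr(fd):
    """Return True if the DSR input line is currently asserted (Linux only)."""
    buf = fcntl.ioctl(fd, termios.TIOCMGET, struct.pack("I", 0))
    (bits,) = struct.unpack("I", buf)
    return bool(bits & termios.TIOCM_DSR)


class HeartbeatWatcher:
    """Treat the peer as dead when the line has not toggled for `timeout` seconds."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_level = None
        self.last_change = None

    def sample(self, level, now):
        """Feed one reading; return True while the peer still looks alive."""
        if self.last_change is None or level != self.last_level:
            self.last_level = level
            self.last_change = now
        return (now - self.last_change) < self.timeout


def watch(device="/dev/ttyS0", interval=0.2):
    """Poll the port forever and print when the heartbeat stops or resumes."""
    # O_NONBLOCK so the open does not wait for carrier detect.
    fd = os.open(device, os.O_RDONLY | os.O_NOCTTY | os.O_NONBLOCK)
    watcher = HeartbeatWatcher()
    alive = True
    try:
        while True:
            now_alive = watcher.sample(read_dsr(fd), time.monotonic())
            if now_alive != alive:
                print("peer is", "alive" if now_alive else "DEAD")
                alive = now_alive
            time.sleep(interval)
    finally:
        os.close(fd)
```

A line stuck high or stuck low both count as dead here, which
is the point of pulsing instead of just holding DTR up.  Note
that something on the NT side still has to do the pulsing.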

Trent is correct that if the web server is doing any
transactions (or write operations), that will be a
PITA.  If that is the case, I think I would have the
Linux web server simply serve up a static page that reads
"We are experiencing technical difficulties.  Our fsckin
NT box has blue screened *AGAIN*.  This is the 729th time
this *WEEK*.  Below, please find a mesmerizing flaming logo
and links to Unix uptime statistics.  Please stand by."
until the NT box comes back up.  Well, as "up" as any NT
box can be, that is.
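
For the static page, a minimal sketch using Python's standard
http.server module is below.  It answers every GET with the
same page.  The 503 status, the Retry-After value, the page
text and the names MaintenanceHandler and make_server are my
own choices for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"""<html><head><title>Temporarily unavailable</title></head>
<body><h1>We are experiencing technical difficulties.</h1>
<p>Please stand by.</p></body></html>
"""


class MaintenanceHandler(BaseHTTPRequestHandler):
    """Serve the same static page no matter which path is requested."""

    def do_GET(self):
        self.send_response(503)  # "Service Unavailable"
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.send_header("Retry-After", "300")
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet; log properly in real use


def make_server(host="", port=8080):
    """Build (but do not start) the server; binding port 80 needs root."""
    return HTTPServer((host, port), MaintenanceHandler)
```

To run it: make_server(port=80).serve_forever().  Any real web
server with a catch-all rule would do the same job.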

D

* On Sat, Nov 04, 2000 at 02:57:30PM -0700, Trent Shipley wrote:
> 
> 
> > -----Original Message-----
> > From: plug-discuss-admin@lists.PLUG.phoenix.az.us
> > [mailto:plug-discuss-admin@lists.PLUG.phoenix.az.us]On Behalf Of Kevin
> > Buettner
> > Sent: Friday, November 03, 2000 7:08 PM
> > To: plug-discuss@lists.PLUG.phoenix.az.us
> > Subject: Re: Linux as backup (failover) machine
> >
> >
> > On Nov 4,  8:11am, Ken Bowley wrote:
> >
> > > I've been posed with a question, and I'm a little stumped...  please
> > > bear with me.
> > >
> > > Problem:
> > > Make a Linux machine automatically kick in as a failover machine for
> > > http when the NT machine goes down.
> > >
> > > Restrictions:
> > > Need to be able to monitor the NT box without installing anything
> > > extra on the NT machine.  Linux machine needs to be able to kick in
> > > automatically when the NT box goes down, and give control back to
> > > the NT box when it comes back up.  No access to installing any type
> > > of router/proxy between the NT and Linux box and the rest of the
> > > net.
> > >
> > > Please send your ideas either directly to myself, or to the list if
> > > this problem is of interest to others.
> >
> > First, I'm sure that there's some code already out there somewhere
> > for this, but it doesn't sound terribly difficult to implement from
> > scratch either.  (Maybe about five lines of Perl?)
> >
> > Anyway, the NT box is pingable, right?
> >
> > Set up a script which continuously pings the NT box; when the
> > pings stop coming back, do an ifconfig on your network interface
> > to the NT box's IP address.
> >
> > The relinquishing control part is harder, but could be easily
> > solved if the NT machine had two network adapters; you could ping
> > the second one to know when to give up the NT machine's IP
> > address.
> >
> > So... thinking about this some more, it'd probably be best if
> > both machines had two network cards.  Weird things happen
> > when two machines attempt to use the same IP address.
> >
> > So here's how it'd look:
> >
> > ====+==+==============+==+========= Network
> >     |  |              |  |
> >    A| B|             C| D|
> >     |  |              |  |
> >    -+--+-           --+--+-
> >   |  NT  |         | Linux |
> >   --------         ---------
> >
> > Now suppose that NT is supplying its services via interface A and
> > that you want Linux to use C when it acts as the failover.
> >
> > So...  start out with C disabled ("ifconfig eth0 down", or somesuch).
> > Ping B via D.  When the pings stop coming back, do "ifconfig eth0 up ..."
> > Now, you continue to ping B from D, and when the pings resume, just
> > do "ifconfig eth0 down" again to allow the NT machine to take over
> > again.
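
The ping-B-from-D loop described above could be sketched in
Python roughly like this.  Assumptions: Linux iputils ping
flags, eth0 already configured with the service address so a
bare "up" is enough, and 192.168.1.1 as a placeholder for B.
Requiring a few misses (or hits) in a row before switching is
my addition so one dropped packet doesn't trigger a takeover.
Failover, ping_ok, set_interface and run are made-up names.

```python
import subprocess
import time


def ping_ok(host):
    """Send one ICMP echo and wait up to 1 second (Linux iputils flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def set_interface(name, up):
    """Bring the service interface up or down (needs root)."""
    subprocess.run(["ifconfig", name, "up" if up else "down"], check=True)


class Failover:
    """Take over after `misses` failed pings in a row; give back after `hits` good ones."""

    def __init__(self, set_iface, misses=3, hits=3):
        self.set_iface = set_iface
        self.misses = misses
        self.hits = hits
        self.active = False  # True while Linux holds the service address
        self.streak = 0      # consecutive results that argue for a switch

    def step(self, peer_alive):
        """Feed one ping result; return True while Linux is the active server."""
        wants_change = peer_alive if self.active else not peer_alive
        self.streak = self.streak + 1 if wants_change else 0
        needed = self.hits if self.active else self.misses
        if self.streak >= needed:
            self.active = not self.active
            self.streak = 0
            self.set_iface(self.active)
        return self.active


def run(peer="192.168.1.1", iface="eth0", interval=1.0):
    """Ping B from D forever; `peer` and `iface` are placeholders."""
    fo = Failover(lambda up: set_interface(iface, up))
    set_interface(iface, False)  # start with C disabled
    while True:
        fo.step(ping_ok(peer))
        time.sleep(interval)
```

This only moves the address back and forth.  It does nothing
about in-flight transactions or stale ARP entries on
neighboring hosts.
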
> >
> > It may be possible to make it work with a single NIC on the NT box,
> > but I have doubts about the reliability.  (But someone who knows
> > more about networking than I do might have some ideas.)
> >
> > Note too that you can tighten the whole arrangement up by doing:
> >
> > ====+=================+============ Network
> >     |                 |
> >    A|                C|
> >     |                 |
> >    -+-----         ---+----
> >   |  NT  +----~----+ Linux |
> >   -------- B     D ---------
> >
> > where the cable between B and D is a crossover cable.  That way too
> > you could assign B and D network addresses intended for private
> > networks (192.168.X.Y or 10.X.Y.Z).
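
A small sanity check for the crossover-link addresses, using
Python's standard ipaddress module.  The 172.16.0.0/12 block
is not mentioned above; it is the third RFC 1918 range.  The
function names and the /24 default are assumptions for the
example.

```python
import ipaddress

# The three RFC 1918 private IPv4 blocks.
RFC1918 = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_private_link_address(addr):
    """True if `addr` falls inside one of the RFC 1918 blocks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918)


def same_subnet(a, b, prefix=24):
    """True if B and D would sit on the same subnet with this prefix length."""
    net = ipaddress.ip_network(f"{a}/{prefix}", strict=False)
    return ipaddress.ip_address(b) in net
```

For example, 192.168.7.1 for B and 192.168.7.2 for D pass both
checks.
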
> >
> > Okay, so maybe it's around 25 lines of Perl.  (It sounds interesting
> > enough that I'm tempted to code it myself.)
> >
> 
> If _In Search of Clusters, Second Edition_ by Gregory F. Pfister is any
> indication, you are looking at a lot more than 25 lines of code.  Also, since
> you are going to want to run the failover monitor on the Linux box as a
> background daemon, it brings into question using a scripting language for
> the implementation.
> 
> Not being able to install a proxy or router between the dual failover boxes
> is not much of a limitation.  That is a dead end because it just introduces
> another point of failure.
> 
> Not being able to alter the primary may mean that your boss just
> ordered miracle-ware.  This is particularly true if the failover has to be
> transactionally correct . . . and if the box is mission critical, then the
> accountants are going to INSIST that no data be lost or created during the
> failover.  (Transactional semantics may mean that the project cannot be done
> in-house. . . .)
> 
> Failback is just as problematic, though you will get to recycle a lot of
> code (but not all of it.  The problems are not identical.)
> 
> Unless you can find a canned freeware solution you might want to tell them
> to look at buying another NT license (you might get away with a workstation
> instead of a server version), two MTS licenses, and a proprietary failover
> system.
> 
> Also, Oracle has a feature called "standby database" that is standard.  It
> probably won't help with your problem, but it might be useful as an example.