They're actually far more flexible (the data storage engine enforces far fewer rules), but the cost of that flexibility is that application programming is a couple of orders of magnitude more complicated than with single-server systems and RDBMSs.
The RDBMS provides transactional semantics in the data storage layer; with distributed systems your application has to handle all of that itself, and it has to do so without creating its own n^2 lock contention, which means completely rethinking how the application and the datastore interact.
It's a hugely fun problem, but it's also among the most difficult tasks in all of software engineering.
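
As a rough illustration of what "the application has to handle all of that" can look like, here is a minimal sketch of optimistic concurrency: the application does a read-modify-write and retries if someone else got there first, instead of holding a lock across the round trip. Everything here is an assumption made for illustration: VersionedStore, put_if_version, and increment are hypothetical names, and the in-memory store is a toy stand-in, not the API of Cassandra, AppEngine, or any other real datastore.

```python
import threading


class VersionedStore:
    """Toy in-memory stand-in for a datastore that only offers a
    per-key compare-and-set (hypothetical; not a real product's API)."""

    def __init__(self):
        self._data = {}  # key -> (version, value)
        # This lock only makes the toy store's own operations atomic.
        # Callers never hold it across a read-modify-write cycle.
        self._guard = threading.Lock()

    def get(self, key):
        """Return (version, value); a missing key is (0, None)."""
        with self._guard:
            return self._data.get(key, (0, None))

    def put_if_version(self, key, expected_version, value):
        """Write only if the key is still at expected_version."""
        with self._guard:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False  # someone else wrote first; caller must retry
            self._data[key] = (current_version + 1, value)
            return True


def increment(store, key, retries=1000):
    """Application-level read-modify-write with retry on conflict."""
    for _ in range(retries):
        version, value = store.get(key)
        if store.put_if_version(key, version, (value or 0) + 1):
            return
    raise RuntimeError("too much contention on %r" % (key,))


def run_demo(workers=8, per_worker=100):
    """Hammer one counter from several threads; return (version, value)."""
    store = VersionedStore()

    def work():
        for _ in range(per_worker):
            increment(store, "counter")

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.get("counter")
```

No increment is lost even though no caller ever holds a lock while it "thinks"; the price is that every write path in the application needs this kind of conflict handling, which is exactly the extra complexity described above.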

Unless you actually need huge scale, you're generally better off with the well-understood semantics and simpler programming of a more "typical" system.

If you need to write a small web-based application (prototype, startup, demo, etc.) and want to avoid the cost, time, and effort of setting up your own servers, then something like AppEngine can be worthwhile, since it hides almost all of that distributed-systems complexity from you.  You still have to give up RDBMS-style transactions and accept a very different interaction model, though.

Directly using the hyper-scale distributed systems, on the other hand, only makes sense for very large systems with lots of talented software engineers working on them.  It takes many staff hours each month to maintain even a fairly small (~500 node) Hadoop instance.  Maintaining something at Facebook's scale takes large dedicated teams of very skilled systems engineers just to keep it all running.

Sometime, I'd like to do a presentation to PLUG on distributed systems, but it's most appropriate for Devel, and I can't make Devel meetings more than about once a year...


Trent Shipley wrote:
> Do those massive, distributed, and fast Internet platforms give up flexibility?  
> A RDBMS is designed as a general solution for storing and querying structured 
> data.  If the Internet solutions are general solutions why haven't they 
> displaced the enterprise scale solutions?
> 
> ________________________________
> From: Joseph Sinclair <plug-discussion@stcaz.net>
> To: Main PLUG discussion list <plug-discuss@lists.plug.phoenix.az.us>
> Sent: Wed, July 14, 2010 6:45:26 PM
> Subject: Re: App Engine?
> 
> MySQL IS a single-server environment.  No single MySQL instance spans multiple 
> servers.  Clustering doesn't make software distributed, it makes it clustered 
> (which is COMPLETELY different).
> Cassandra is NOTHING like MySQL.  It actually is a distributed column-oriented 
> datastore (and it's NOT an RDBMS).  Cassandra is not clustered either, it's 
> *distributed*.
> Try this:
>   Cluster 50 MySQL instances; randomly pull power (without warning or shutdown) 
> on 10.  Is the cluster still able to serve all rows?  Did you lose any data or 
> transactions?
>   Run a 50-node Cassandra instance (single instance, 50 machines); randomly pull 
> power (without warning or shutdown) on 10.  Is the instance still able to serve 
> all rows?  Did you lose any data?
> That experiment will show you one of the MANY ways distributed systems are 
> different from clustered (without having to run 2000 machines to see the 
> difference).
> 
> Facebook uses actual distributed software (things like Hadoop, Hive, Cassandra, 
> etc...)  They don't run their site off of MySQL (or Oracle, for that matter).
> Digg uses distributed systems as well, because scaling to their load is 
> "increasingly difficult with MySQL" (http://about.digg.com/node/564).
> There isn't a clustered solution possible that would handle their scale, in fact 
> they haven't been using a cluster, in the traditional sense, for years.
> 
> All of them use things like MySQL for smaller, internal-facing systems, but none 
> of them use *any* RDBMS for a user-facing site.
> I can show you conclusively that MySQL (and any RDBMS) fails at large scale 
> because the n^2 locking problem kills it.
> Clustering is fine for an Enterprise application.  It's death for an Internet 
> application.
> 
> Amazon runs amazingly fast, have you actually used Amazon.com (you do realize 
> that their cloud offerings are the same infrastructure they use to run their own 
> sites?).
> Google.com gets search results in <1 second every time.  Try doing that with 
> MySQL or Oracle.  Neither is capable of even storing a small part of the index; 
> their internal limits won't permit a table that big, much less an indexed table.
> 
> If all you've ever built is enterprise apps with less than 100,000 users, you'll 
> never understand why enterprise solutions don't scale to Internet numbers (100 
> million users or more).  5 years ago, I would have agreed with you; that was 
> before I had to write software that could process more than 40,000,000 
> transactions per day and produce multidimensional analyses of all that data.
> There's a completely different world of scale between 100,000 users and 100 
> million users, and solutions for the smaller scale are completely useless at the 
> larger scale.
> There are lots of people who think dumping their LAMP site on EC2 will make it 
> fast, they're wrong.
> You have to design for scale when you build the software.
> 
> AppEngine, BTW, is also good for small low-volume applications, just because it 
> can be MUCH cheaper to run a small app on AppEngine than to run a hosted server 
> (particularly if you want to write in Java or Python rather than PHP).
> 
> I've developed on Google's and Amazon's platforms.  I've also written (and am 
> writing) the kind of distributed infrastructure those two use to enable their 
> huge sites.  Not many systems require that kind of scale; for those that do 
> there's no alternative to real lock-free/contention-free distributed systems.
> 
> When was the last time your app generated <100ms response times doing multiple 
> PKI operations on 4M-40M files while sustaining >2000 requests/minute on <10 
> commodity servers with no special hardware?
> 
> ==Joseph++
>
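
For anyone who wants a feel for the pull-the-plug experiment quoted above without 50 machines, here is a toy simulation. It is only a sketch under loud assumptions: rows are spread evenly over a ring of 50 nodes, each row is copied to the next nodes on the ring, and "replication factor 1" stands in for a setup where each row lives on exactly one machine. It does not model MySQL, Cassandra, or any real product's placement or recovery behaviour; replica_nodes, unavailable_rows, and pull_the_plug are hypothetical names.

```python
import random

NODES = 50    # machines in the toy deployment
ROWS = 1000   # rows, spread evenly: 20 rows have each node as primary


def replica_nodes(row, replication_factor):
    """Primary node plus the next (replication_factor - 1) nodes on the ring."""
    primary = row % NODES
    return [(primary + i) % NODES for i in range(replication_factor)]


def unavailable_rows(dead_nodes, replication_factor):
    """Count rows whose every replica sits on a dead node."""
    dead = set(dead_nodes)
    return sum(
        1
        for row in range(ROWS)
        if all(node in dead for node in replica_nodes(row, replication_factor))
    )


def pull_the_plug(seed, failures=10):
    """Kill `failures` random nodes; return unavailable rows for RF=1 and RF=3."""
    rng = random.Random(seed)
    dead = rng.sample(range(NODES), failures)
    return unavailable_rows(dead, 1), unavailable_rows(dead, 3)


if __name__ == "__main__":
    for seed in range(5):
        single, triple = pull_the_plug(seed)
        print("seed %d: RF=1 loses %d rows, RF=3 loses %d rows"
              % (seed, single, triple))
```

With one copy per row, killing 10 of 50 nodes always takes out a fifth of the rows in this model; with three copies a row only disappears when three adjacent nodes on the ring die together, which is much rarer. A real system also has to deal with consistency between the surviving copies, which this sketch ignores entirely.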