OT: Dell disks

Lisa Kachold lisakachold at obnosis.com
Tue Jun 19 06:28:08 MST 2012


Hi Mark,

On Mon, Jun 18, 2012 at 10:05 PM, Mark Jarvis <m.jarvis at cox.net> wrote:

>
> I'm considering buying a Dell desktop (Inspiron 620), but a few years ago
> I was warned off them because Dell did something different to their disks
> so that you had to buy replacement/additional disks only from Dell. Any
> chance that it's still true?
>

Unless you have a hardware RAID card, a desktop should not need
enterprise-grade drives, but check with Dell Support for the model you are
interested in.

You are referring to TLER/ERC/CCTL:

Hard drive manufacturers are drawing a distinction between "desktop" grade
and "enterprise" grade drives. The "desktop" grade drives can take a long
time (~2 minutes) to respond when they find an error, which causes most
RAID systems to label them as failed and drop them from the array. The
solution provided by the manufacturers is for us to purchase the
"enterprise" grade drives, at twice the cost, which report errors promptly
enough so that this isn't a problem. This "enterprise" feature is called
TLER, ERC, or CCTL, depending on the vendor.

*The Problem:*

There are three problems with this situation:

The first is that it flies in the face of the word *Inexpensive* in the
acronym *Redundant Arrays of Inexpensive Disks (RAID)*
<http://www-2.cs.cmu.edu/%7Egarth/RAIDpaper/Patterson88.pdf>.

The second is that when a drive starts to fail, you want to know about it,
as Miles Nordin wrote in a long thread
<http://opensolaris.org/jive/thread.jspa?threadID=119639&tstart=0>:

*Possible Solutions:*

For a while, Western Digital released a program (WDTLER.EXE) that made it
possible to enable TLER on desktop grade drives. This no longer works.

*Linux:*

This message <http://marc.info/?l=linux-raid&m=128640221813394&w=2> implies
that it's impossible to tell a drive to cancel its bad read operation:

You can set the ERC values of your drives. Then they'll stop processing
their internal error recovery procedure after the timeout and continue
to react. Without ERC-timeout, the drive tries to correct the error on
its own (not reacting on any requests), mdraid assumes an error after a
while and tries to rewrite the "missing" sector (assembled from the
other disks). But the drive will still not react to the write request
as it is still doing its internal recovery procedure. Now mdraid
assumes the disk to be bad and kicks it.

There's nothing you can do about this vicious circle except either
enabling ERC or using Raid-Edition disks (which have ERC enabled by default).

Evidence that using ATA ERC commands doesn't always work:
Both Linux and FreeBSD can use normal desktop drives without TLER, and in
fact you *would not even want TLER* in such a case, since *TLER can be
dangerous* in some circumstances. Read on.


*What is TLER/CCTL/ERC?*
TLER (Time-Limited Error Recovery)
CCTL (Command Completion Time Limit)
ERC (Error Recovery Control)

These basically mean the same thing: limit the number of seconds the
hard drive spends trying to recover a weak or bad sector. TLER and the
other variants are typically configured to 7 seconds, meaning that if the
drive has not managed to recover that sector within 7 seconds, it will give
up, forfeit recovery, and return an I/O error to the host instead.
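
For what it's worth, on drives that support SCT Error Recovery Control,
smartmontools can query or set this limit with `smartctl -l scterc`. The
values are given in tenths of a second, so 7 seconds is written as 70. Below
is a small Python sketch that only builds the command line. The helper name
`scterc_command` and the device path `/dev/sdX` are made up for illustration,
and whether a given drive accepts the command depends on its firmware.

```python
def scterc_command(device, read_seconds=7.0, write_seconds=7.0):
    """Build a smartctl argv that sets the ERC read/write limits."""
    # smartctl expects the limits in deciseconds (tenths of a second).
    read_ds = round(read_seconds * 10)
    write_ds = round(write_seconds * 10)
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


if __name__ == "__main__":
    # /dev/sdX is a placeholder, not a real device name.
    print(" ".join(scterc_command("/dev/sdX")))
```

The sketch deliberately does not run anything; you would pass the list to
something like `subprocess.run` as root. On many drives the setting
reportedly does not survive a power cycle, so it would have to be reapplied
at boot.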

Without TLER, up to 120 seconds (20-60 is more common) may pass before a
disk gives up on recovery. This wreaks havoc on all Hardware RAID and
Windows-based software/onboard/driver RAIDs. The RAID controller is
typically configured to consider disks that don't respond within 10 seconds
as completely failed, which is bizarre to say the least! This smells like
the vendors have some sort of deal that gets you to buy HDDs at twice the
price just for a simple firmware fix. LOL!! Don't get yourself ripped off;
read on!


*When do I need TLER?*
You need TLER-capable disks when using any Hardware RAID or any
Windows-based software RAID; bummer if you're on the Windows platform! But
this also means Hardware RAID on any OS (FreeBSD/Linux) would also need TLER
disks, even when configured to run as a 'JBOD' array. There may be
controllers with different firmware that allow you to set the timeout limit
for I/O, but I've not yet heard about specific products, except some LSI
1068E in IR mode. Reputable vendors like Areca (FW1.43) certainly require
TLER-enabled disks, or the disks will drop out like candy whenever you
encounter a bad/weak sector that needs longer recovery than 10 seconds.

Basically, if you use a RAID platform that DEMANDS the disks to respond
within 10 seconds, and will KICK OUT disks that do not respond in time,
then you need TLER.
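
To make that rule concrete, here is a toy Python model of the decision. It
is only a sketch of the reasoning in this message, not of any real
controller's firmware; the function name and the 45-second recovery time are
invented for the example, while the 10-second and 7-second figures come from
the text above.

```python
def controller_drops_disk(recovery_seconds, controller_timeout=10.0, erc_limit=None):
    """Return True if the controller would give up on the disk.

    recovery_seconds: how long the drive would need to recover the sector
    controller_timeout: how long the RAID layer waits for an answer
    erc_limit: TLER/ERC/CCTL limit in seconds, or None if the feature is off
    """
    # With ERC the drive answers (possibly with an I/O error) after at most
    # erc_limit seconds; without it the drive stays silent until it is done.
    response_time = recovery_seconds
    if erc_limit is not None:
        response_time = min(recovery_seconds, erc_limit)
    return response_time > controller_timeout


if __name__ == "__main__":
    print(controller_drops_disk(45))               # desktop drive, no ERC
    print(controller_drops_disk(45, erc_limit=7))  # same sector, ERC at 7 s
```

In the second case the disk stays in the array, but the sector comes back as
an I/O error, so the RAID layer has to rebuild it from the other disks.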

*When don't I need TLER?*
When using FreeBSD/Linux software RAID on an HBA controller, which is a
RAID-less controller. Areca HW RAID running in JBOD mode is still a RAID
controller; it, not the OS, controls whether the disks are detached. With a
true HBA like LSI 1068E (Intel SASUC8i) your OS would have control over
whether to detach the disk or not, and Linux/BSD won't, at least not for a
simple bad sector. Not sure about Apple OSX actually, but since it's based
on FreeBSD I could speculate that it would have the same behavior as
FreeBSD, perhaps tuned differently.

*Why don't you want TLER even if your disks are capable?*

If you don't need TLER, then you don't want TLER! Why? Well, because *TLER
is dangerous!* Nonsense? Consider this:

1. You have a nice RAID5 array on Hardware RAID; being a valuable customer,
you spent the premium price on TLER-capable disks.
2. Now one of your disks dies; oh bummer! But hey, I have RAID5; I'm
protected, RIGHT?
3. So I buy a new disk and replace the failed one! So easy.
4. A bad sector turns up on one of the remaining member disks, and it causes
TLER to forfeit; now I get an I/O error while rebuilding my degraded array,
the rebuild stops, and I lose access to my data!

The danger in TLER is that once you have lost your redundancy, if a weak
sector occurs that COULD be recovered, TLER will force the drive to STOP
TRYING after 7 seconds. If the drive didn't fix it by then, and you have
lost your redundancy, then TLER is a harmful property instead of a useful
one.

TLER works best when you have a lot of redundancy and can swap disks easily,
and want disks that show any sign of weakness - even just a fart - to be
kicked out and replaced ASAP, without causing hiccups, which are
unacceptable to a heavy-duty online money transaction server, for example.
So TLER can be useful, but for consumers this is more like an interesting
way for vendors to make some more money from you poor souls!


*What is Bit-Error Rate and how does it relate to TLER?*

The Uncorrectable Bit-Error Rate (BER) has been steady at 10^-14 (about one
unreadable bit per 10^14 bits read), but capacities keep growing while the
BER stays the same. That means that modern high-capacity hard drives are now
more likely to be affected by amnesia; they sometimes really cannot read a
sector. This could be physical damage to the sector itself, or just a weak
sector, meaning no physical damage but still unreadable.

So on a 2TB disk with 512-byte sectors, the chance of hitting an unreadable
sector somewhere on the disk is relatively high. This makes such disks even
more susceptible to dropping out of conventional Windows/Hardware RAIDs, and
is why the TLER feature has become more important. But I consider it to be
rather a curse than a blessing.
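
A back-of-the-envelope calculation shows why capacity matters here. The
Python sketch below assumes that bit errors are independent and that the
drive really performs at the 10^-14 spec-sheet figure; both are
simplifications, and real drives often do better. The function name and the
4-disk RAID5 layout are just examples.

```python
import math


def p_unreadable(bytes_read, ber=1e-14):
    """Probability of at least one unrecoverable bit error in bytes_read."""
    bits = bytes_read * 8
    # 1 - (1 - ber)**bits, computed via log1p/expm1 to keep precision.
    return -math.expm1(bits * math.log1p(-ber))


if __name__ == "__main__":
    TB = 10**12  # decimal terabyte, as used on drive labels
    print(f"one full 2TB read:     {p_unreadable(2 * TB):.1%}")
    # Rebuilding a degraded 4-disk RAID5 reads the 3 surviving disks in full.
    print(f"3 x 2TB RAID5 rebuild: {p_unreadable(3 * 2 * TB):.1%}")
```

Under those assumptions a single end-to-end read of a 2TB disk has roughly a
15% chance of hitting an unreadable sector, and a rebuild that reads three
such disks roughly 38%.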

*So, explain again please: Why don't I need TLER on Linux/BSD?*

Simple: the OS does not detach a disk that times out, but resets the
interface and retries the I/O. Also, when using ZFS with redundancy, it will
rewrite a bad sector from a good copy, causing that bad sector to be
instantly fixed/healed/corrected, since writing to a bad sector makes the
disk perform a sector swap right away. In the SMART data, the "Current
Pending Sector" (active bad sector) would then become "Reallocated Sector
Count" (passive bad sector which no longer causes harm and cannot be seen or
used by the host Operating System anymore).
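
If you want to watch those two counters, `smartctl -A` prints them as
attributes 197 (Current_Pending_Sector) and 5 (Reallocated_Sector_Ct). Here
is a rough Python sketch that pulls the raw values out of that output. The
function name is invented and the sample rows are made up, not taken from a
real drive; attribute names and columns can differ between vendors.

```python
def smart_raw_values(smartctl_output,
                     names=("Current_Pending_Sector", "Reallocated_Sector_Ct")):
    """Pull raw values for the named attributes out of `smartctl -A` text."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in names:
            found[fields[1]] = int(fields[9])
    return found


# Made-up sample rows, not taken from a real drive.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
"""

if __name__ == "__main__":
    print(smart_raw_values(SAMPLE))
```

A pending count that drops while the reallocated count rises would match the
behavior described above.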

*That includes ZFS?*
Yes. ZFS is, of course, the most reliable and advanced filesystem you can
use to store your files, right now. It's free, it's available, it's hot. So
use it whenever you can.

-- 
(503) 754-4452 Android
(623) 239-3392 Skype
(623) 688-3392 Google Voice
**
<http://it-clowns.com>Safeway.com
Automation Engineer


More information about the PLUG-discuss mailing list