Very good points, Paul.
I've seen several RAID 5 failures - something most people don't
expect. Sometimes the cause is a controller failure, sometimes two
or more disks fail within hours or days of each other (people seem to
believe that just because an array is working, everything is
optimal), and twice it was because the wrong disk drive was replaced
(hot-swapped). What a mess!
Backups are seldom validated (fully or even partially), and usually
not validated on different hardware. I've seen a couple of cases
where a tape drive would write without errors and could read its own
tapes, but nothing else could read them. And incremental backups are
almost always a problem in a full recovery scenario.
It absolutely pays to configure your system (from both a hardware and
Ingres perspective) correctly and robustly, and then to utilize best
practices to manage the environment. It also pays to have a
comprehensive DR plan in place - something that is sorely missing at
so many companies.
Below is a link to a white paper on best practices that could be very
useful for many. Enjoy!
http://www.comp-soln.com/BestPractices.pdf
Chip
> Very good points, Paul.
>
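> I've seen several RAID 5 failures - something most people don't
> expect. Sometimes it is caused by a controller failure, sometimes two
> or more disks fail within hours or days of each other (people seem to
> believe that just because an array is working that everything is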
> optimal), and twice because the wrong disk drive was replaced (hot-
> swapped). What a mess!
My favorite was when the HP technician came to replace a
failed drive in a SAN and re-initialized the ENTIRE array. Completely
zeroed it. Backups were a week old.
> Backups are seldom validated (fully or even partially), and usually
> not validated on different hardware. I've seen a couple of cases
> where a tape drive would write without errors and could read its own
> tapes, but nothing else could read it. And, incremental backups are
> almost always a problem in a full recovery scenario.
Never seen this with DLT or LTO. But with DAT, I've seen the little
"tape fairies" do just about anything. But that is what you get for
using a home camcorder technology to protect your enterprise ;-)
> It absolutely pays to configure your system (from both a hardware and
> Ingres perspective) correctly and robustly, and then to utilize best
> practices to manage the environment. It also pays to have a
> comprehensive DR plan in place - something that is sorely missing at
> so many companies.
NOTHING beats an automated weekly restore of your production database to a
development system, using only the artifacts you ship offsite (the
tapes or disks).
You ARE NOT allowed to use the same hardware at any point in the weekly
restore.
You need a separate host, tape drive, and disk array.
Until you do that, you don't really have verified backups.
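
One small piece of that weekly check can be sketched in Python: record checksums of the backup artifacts before they go offsite, then confirm on the separate restore host that everything read back from tape or disk matches. This is only an illustration, not an Ingres tool; the function names and the manifest layout are made up for the example, and it assumes the artifacts are ordinary files once they are read back.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backup images need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(backup_dir: Path) -> dict:
    """Run on the production side, before the artifacts are shipped offsite."""
    return {
        str(p.relative_to(backup_dir)): sha256_of(p)
        for p in sorted(backup_dir.rglob("*"))
        if p.is_file()
    }


def verify_manifest(restored_dir: Path, manifest: dict) -> list:
    """Run on the separate restore host; an empty list means no problems found."""
    problems = []
    for rel_path, expected in manifest.items():
        candidate = restored_dir / rel_path
        if not candidate.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(candidate) != expected:
            problems.append(f"checksum mismatch: {rel_path}")
    return problems
```

A clean result only shows that the media could be read on other hardware. The real test is still the full database restore that follows it.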
But lastly, most of our clients NEVER do a DR test of their network.
This is so often overlooked, especially when complicated SANs are
involved. So often, backups of switches, VPN concentrators, routers,
firewalls, load balancers, and other network appliances are ignored
or, worse, forgotten.
Check out RANCID for a nice open source solution to the network device
configuration backup problem.
http://www.shrubbery.net/rancid/
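
Whatever tool collects the configurations, it is worth checking that every device you depend on actually has a recent copy saved. The Python sketch below does not use RANCID itself; it assumes, purely for illustration, a directory holding one saved configuration file per device, named after the device. The function name, the directory layout, and the seven-day threshold are all hypothetical.

```python
import time
from pathlib import Path


def stale_or_missing(devices, config_dir: Path, max_age_days: float = 7.0, now=None):
    """Return {device: reason} for every device without a fresh saved config."""
    now = time.time() if now is None else now
    report = {}
    for name in devices:
        saved = config_dir / name  # assumed layout: one file per device
        if not saved.is_file():
            report[name] = "no saved configuration"
        elif saved.stat().st_size == 0:
            report[name] = "saved configuration is empty"
        elif now - saved.stat().st_mtime > max_age_days * 86400:
            report[name] = "saved configuration is stale"
    return report
```

Feeding it the device inventory from the DR plan, rather than the list of files already in the directory, is what catches the forgotten appliances.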
Cheers,
Mike Leo
Chip Nickolett - 17 Jan 2008 05:40 GMT
Hi Mikey,
These were all DLT problems.
One case was funny because the customer used an expensive and rare DG
DLT 7000 tape array. It had 4-5 tape drives and could be configured
to have separately addressable drives (our recommendation) or run in a
RAID-3 configuration. The array was so expensive that they only had
one onsite and one at SunGard. The onsite drive was used for weekly
validation, but the first test at SunGard failed miserably because the
tapes could not be read. After that we reconfigured the device to
treat each drive separately and then validated on other hardware - no
problems after that.
Anyway, you can never be too safe when working with backups.
Chip
> Never seen this with DLT or LTO. But with DAT, I've seen the little
> "tape fairies" do just about anything.