[Info-Ingres] Wednesday morning fun

Paul White - 16 Jan 2008 00:19 GMT
Today is a good day to remind yourself about your disaster recovery
procedures.

- where's last night's backup?
- do the backup messages get checked every day?
- when was the last time they tested the DR procedure?
- what are the single points of failure in the hardware / procedures?
- is Ingres configured across multiple drives / controllers?
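
The first question on that list can be automated. Below is a minimal sketch, assuming the checkpoint files land in a directory you can read from a monitoring host. The function names and the 26-hour threshold are illustrative assumptions, not anything Ingres provides.

```python
import time
from pathlib import Path


def newest_file_age_hours(directory, now=None):
    """Return the age in hours of the most recently modified file under
    `directory`, or None if no files are found at all."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(directory).rglob("*") if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def backup_is_fresh(directory, max_age_hours=26.0, now=None):
    """True only if at least one file exists and the newest one is recent.

    The 26-hour default is an assumed allowance for a nightly job that
    runs a little late; pick whatever suits your schedule.
    """
    age = newest_file_age_hours(directory, now=now)
    return age is not None and age <= max_age_hours
```

An empty directory counts as stale rather than fresh, so a backup job that silently writes nothing is reported as a failure.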



One of our clients has lost a raid controller - both drives from the
mirror are gone - the service tech is on the way.
The system failed before the nightly checkpoint and backup ran
overnight.
Thankfully the message logs say Monday night checkpoint is successful.
Fingers crossed they have only lost one day's data.
The customer was trying to restore the data across a wireless LAN from
another building across a busy road... v slow!!

The first indications from the restore are that they have loaded the
wrong tape:
the C: drive is dated December (message logs, scripts)
the D: drive is dated last Wednesday (data, ckp, dmp.. in fact the
entire ingres tree)

To make life a bit more exciting:
The standby machine has a different ingres patch currently under test.
One of the databases on the standby machine is in production.
It turns out the backup messages we've been religiously checking have
been coming from the standby machine.

Anyway, they have found the right restore media and have connected a
laptop to the system so we can carry the data across the road.



So, I'm thinking, if your sysadmin and dba are down at the coffee shop,
go down and join them and ask them a few pointed questions.


Chip Nickolett - 16 Jan 2008 02:41 GMT
Very good points, Paul.

I've seen several RAID 5 failures - something most people don't
expect.  Sometimes it is caused by a controller failure, sometimes two
or more disks fail within hours or days of each other (people seem to
believe that just because an array is working that everything is
optimal), and twice because the wrong disk drive was replaced (hot-
swapped).   What a mess!

Backups are seldom validated (fully or even partially), and usually
not validated on different hardware.  I've seen a couple of cases
where a tape drive would write without errors and could read its own
tapes, but nothing else could read it.  And, incremental backups are
almost always a problem in a full recovery scenario.

It absolutely pays to configure your system (from both a hardware and
Ingres perspective) correctly and robustly, and then to utilize best
practices to manage the environment.  It also pays to have a
comprehensive DR plan in place - something that is sorely missing at
so many companies.

Below is a link to a white paper on best practices that could be very
useful for many.  Enjoy!

   http://www.comp-soln.com/BestPractices.pdf

Chip
Michael Leo - 16 Jan 2008 13:44 GMT
> Very good points, Paul.
>
[quoted text clipped - 4 lines]
> optimal), and twice because the wrong disk drive was replaced (hot-
> swapped).   What a mess!

My favorite was when the HP technician came to replace a
failed drive in a SAN and re-initialized the ENTIRE array.  Completely
zeroed it.  Backups were a week old.

> Backups are seldom validated (fully or even partially), and usually
> not validated on different hardware.  I've seen a couple of cases
> where a tape drive would write without errors and could read its own
> tapes, but nothing else could read it.  And, incremental backups are
> almost always a problem in a full recovery scenario.

Never seen this with DLT or LTO.  But with DAT, I've seen the little
"tape fairies" do just about anything.  But that is what you get for
using a home camcorder technology to protect your enterprise ;-)

> It absolutely pays to configure your system (from both a hardware and
> Ingres perspective) correctly and robustly, and then to utilize best
> practices to manage the environment.  It also pays to have a
> comprehensive DR plan in place - something that is sorely missing at
> so many companies.

NOTHING beats an automated weekly restore of your production database to a
development system sourcing only the artifacts you ship offsite (the
tapes or disks).

You ARE NOT allowed to use the same hardware at any point in the weekly
restore.
You need a separate host, tape drive, and disk array.

Until you do that, you don't really have verified backups.
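
One way to make such a weekly restore check itself is to record checksums on the production side and compare them after the artifacts have been read back on the separate restore host. This is only a sketch under that assumption. The function names are made up for illustration, and it checks the shipped files, not the logical contents of the database.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoint files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path relative to `root` to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_to_manifest(manifest, restored_root):
    """Report files that are missing, unexpected, or differ after restore."""
    restored = build_manifest(restored_root)
    return {
        "missing": sorted(set(manifest) - set(restored)),
        "unexpected": sorted(set(restored) - set(manifest)),
        "corrupt": sorted(
            name
            for name in manifest
            if name in restored and manifest[name] != restored[name]
        ),
    }
```

The manifest would be built at backup time and shipped with the tapes or disks. An all-empty report from the restore host means every shipped file came back byte for byte on different hardware.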

But lastly, most of our clients NEVER do a DR test of their network.
This is so often overlooked, especially when complicated SANs are
involved.  So often, backups of switches, VPN concentrators, routers,
firewalls, load balancers, and other network appliances are ignored or,
worse, forgotten.

Check out RANCID for a nice open source solution to the network device
configuration backup problem.

  http://www.shrubbery.net/rancid/

Cheers,

Mike Leo
Chip Nickolett - 17 Jan 2008 05:40 GMT
Hi Mikey,

These were all DLT problems.

One case was funny because the customer used an expensive and rare DG
DLT 7000 tape array.  It had 4-5 tape drives and could be configured
to have separately addressable drives (our recommendation) or run in a
RAID-3 configuration.  The array was so expensive that they only had
one onsite and one at SunGard.  The onsite drive was used for weekly
validation but the first test at SunGard failed miserably because the
tapes could not be read.  After that we reconfigured the device to
treat each drive separately and then validated on other hardware - no
problems after that.

Anyway, you can never be too safe when working with backups.

Chip

> Never seen this with DLT or LTO.  But with DAT, I've seen the little
> "tape fairies" do just about anything.
Paul White - 16 Jan 2008 05:22 GMT
More fun. More questions to ask:

- Is the standby machine configured in the same way as production?
- Directories for database, checkpoint and dump the same?
- DBMS cache configured the same?
- Is each standby database configured the same way?
- Are all tables journalled?
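
If you collect the answers to those questions into one dictionary per machine, the comparison itself is easy to automate. A minimal sketch follows. How you gather the settings is left open, and the key names in the usage note are placeholders, not real Ingres parameter names.

```python
MISSING = "<missing>"  # marker for a setting present on only one machine


def config_differences(production, standby):
    """Return {setting: (production_value, standby_value)} for every
    setting that differs or exists on only one of the two machines."""
    differences = {}
    for key in sorted(set(production) | set(standby)):
        prod_value = production.get(key, MISSING)
        standby_value = standby.get(key, MISSING)
        if prod_value != standby_value:
            differences[key] = (prod_value, standby_value)
    return differences
```

For example, comparing `{"journalled": True, "cache_pages": 10000}` against `{"journalled": False}` reports both settings: one with differing values and one marked as missing on the standby.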

Luckily for me, the standby machine was previously the production server
so my job has been a little easier.

Anyway, the replacement controller has arrived.  One HDD is not
responding, the other looked promising for a second until the
replacement decided it was all too much and has also gone to heaven.
According to the IBM hardware techie, the drives will be unreadable on
any other controller. Hmmm so much for standards. They are now replacing
the motherboard.

Customer has finally located the correct backup media.  It looks like
our fears are confirmed, they have lost a full day's data.  (not to
mention the factory is still down and trucks are lined up out into the
street)
Very slow recovery time still.

I've restored ckp, dmp and jnl directories to the appropriate locations.
I found all of the databases existed on the standby machine but some
were not journalled and some had not been checkpointed so I first ran
ckpdb +j on each of these to automatically create appropriate
directories. Thankfully each DB has just one location.

Problem.  
The standby machine installation is all in C:\IngresII\ingres\
The prod machine directories are D:\ingres\IngresII\ingres....

So with my trusty hex editor, I've modified the locations in both
dmp\aaaaaaaa.cnf and dmp\c0000123.dmp. Careful here. There is a length
specifier in hex for each field. Then copied the aaaaaaaa.cnf to the
data and ckp locations.
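
To show why the length specifier matters, here is a toy sketch. It assumes a made-up layout (a 2-byte little-endian length followed by the path bytes), which is NOT the real layout of aaaaaaaa.cnf or the .dmp file. The only point is that the length and the text have to change together.

```python
import struct


def pack_field(text):
    """Encode text as <2-byte little-endian length><ASCII bytes> (toy layout)."""
    data = text.encode("ascii")
    return struct.pack("<H", len(data)) + data


def unpack_field(buffer, offset=0):
    """Decode one field at `offset`; return (text, offset_of_next_field)."""
    (length,) = struct.unpack_from("<H", buffer, offset)
    start = offset + 2
    end = start + length
    if end > len(buffer):
        raise ValueError("length prefix runs past the end of the buffer")
    return buffer[start:end].decode("ascii"), end


def replace_field(buffer, offset, new_text):
    """Swap the field at `offset` for `new_text`, rewriting its length prefix
    so that every field after it can still be located."""
    _, end = unpack_field(buffer, offset)
    return buffer[:offset] + pack_field(new_text) + buffer[end:]
```

Swapping `D:\IngresII\ingres\data\default\mstreet` for `C:\IngresII\ingres\data\default\mstreet` keeps the length unchanged, so only the text needs editing. As soon as the new path is longer or shorter, the prefix must be updated too, or whatever reads the file will truncate the path or run on into the next field.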

Rollforwarddb proceeds quite nicely.
Here are some sample error messages you may see if you miscount the path
length or make a typing mistake.

E_CL030F_CK_COMMAND_ERROR    The operating system command used to
perform the database checkpoint returned an error of 1. The command was
'ckxcopy  "C:\IngresII\ingres\data\default\mstreet"
"C:\IngresII\ingres\ckp\default\mstreet \c0073001.ckp" RESTORE'
system() failed with operating system error 0 (The operation completed
successfully.)

E_DM9004_BAD_FILE_OPEN    Disk file open error on database:mstreet
table:Not a table pathname:D:\IngresII\ingres\data\default\mstreet
filename:aaaaaaaa.cnf
open() failed with operating system error 3 (The system cannot find the
path specified.)

- Now almost done: run verifydb, sysmod, optimizedb and ckpdb +j on
every DB and that should be it.
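
With more than a handful of databases, that last step is worth scripting. The sketch below only builds and runs a list of commands. The argument shapes in `STEPS` are assumptions to check against your installation, and verifydb is deliberately left out of `STEPS` because its mode and scope flags are not given above; add a template for it as the first step, with flags that suit your site.

```python
import subprocess

# Command templates run for each database; "{db}" is replaced by its name.
# Assumed shapes only: confirm each against your Ingres version.
STEPS = [
    ["sysmod", "{db}"],
    ["optimizedb", "{db}"],
    ["ckpdb", "+j", "{db}"],
]


def recovery_plan(databases, steps=STEPS):
    """Expand the templates into one concrete command per database and step,
    finishing all steps for one database before moving to the next."""
    return [
        [part.format(db=db) for part in template]
        for db in databases
        for template in steps
    ]


def run_plan(plan, runner=subprocess.run):
    """Run the commands in order; check=True makes subprocess.run raise on
    the first non-zero exit, so a failure stops the rest of the plan."""
    for command in plan:
        runner(command, check=True)
```

Printing `recovery_plan(["mstreet"])` before calling `run_plan` gives a dry run you can read through first.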
 