Unable to start HADR reason code 7
gumby - 31 May 2006 02:02 GMT
I'm having trouble getting HADR to work with the sample databases on two HS20 xSeries blades (Red Hat ES4 up3, DB2 8.2.4). I get the following error:
SQL1768N Unable to start HADR. Reason code = "7" - The primary database failed to establish a connection to its standby database within the HADR timeout interval.
What should I check besides the remote host and remote service parameters on the standby database, which seem to be correct? The servers can see each other via ping etc. I have successfully set up HADR on a single server.
thanks dub
Mark A - 31 May 2006 02:15 GMT
> I'm having trouble getting HADR to work with the sample databases on
> two HS20 xSeries blades, Red Hat ES4 up3, DB2 8.2.4, getting the
[quoted text clipped - 11 lines]
> thanks
> dub
I assume you have already started HADR on the standby database first, before you started HADR on the primary. If that is true, then try logging on the standby database and activating the standby database:
db2 activate database sample
Then retry starting HADR on primary.
gumby - 31 May 2006 07:31 GMT
Yes, I think the Control Center runs the following commands anyway. And if I activate the standby it says it is already activated. Here are the final commands run.
-- Start HADR on standby database
--
DEACTIVATE DATABASE SAMPLE
START HADR ON DATABASE SAMPLE AS STANDBY
--
-- Start HADR on primary database
--
DEACTIVATE DATABASE SAMPLE
START HADR ON DATABASE SAMPLE AS PRIMARY
Just to clarify, I have successfully set up HADR between two different databases on the same server using the Control Center GUI. My problem is between databases on two different servers. I have tried the manual command method and the Control Center, both with the same results.
Using the control center commands
Standby diag file ends with
2006-05-31-16.24.26.101725-240 E476637G362        LEVEL: Event
PID     : 27068                TID  : 3086558912  PROC : db2hadrs (SAMPLE) 0
INSTANCE: sample               NODE : 000         DB   : SAMPLE
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-RemoteCatchupPending (was S-LocalCatchup)

2006-05-31-16.24.25.999932-240 I477000G398        LEVEL: Warning
PID     : 27057                TID  : 3086558912  PROC : db2agent (SAMPLE) 0
INSTANCE: sample               NODE : 000         DB   : SAMPLE
APPHDL  : 0-14                 APPID: *LOCAL.sample.060531202426
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduStartup, probe:21152
MESSAGE : Info: HADR Startup has completed.
Primary diag file ends with
2006-05-31-16.24.32.714718+600 E128512G336        LEVEL: Event
PID     : 9575                 TID  : 3085870784  PROC : db2hadrp (SAMPLE) 0
INSTANCE: sample               NODE : 000         DB   : SAMPLE
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to P-Boot (was None)

2006-05-31-16.24.32.719416+600 I128849G318        LEVEL: Warning
PID     : 9575                 TID  : 3085870784  PROC : db2hadrp (SAMPLE) 0
INSTANCE: sample               NODE : 000         DB   : SAMPLE
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduP, probe:20301
MESSAGE : Info: Primary Started.

2006-05-31-16.26.18.769577+600 I129489G321        LEVEL: Event
PID     : 5376                 TID  : 2947414960  PROC : db2hmon
INSTANCE: sample               NODE : 000
FUNCTION: DB2 UDB, Automatic Table Maintenance, db2HmonEvalStats, probe:900
STOP    : Automatic Runstats: evaluation has finished on database SAMPLE

2006-05-31-16.26.33.712145+600 I129811G571        LEVEL: Error
PID     : 9575                 TID  : 3085870784  PROC : db2hadrp (SAMPLE) 0
INSTANCE: sample               NODE : 000         DB   : SAMPLE
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduP, probe:20390
MESSAGE : HADR primary did not establish connection with standby within timeout and will shut down. BY FORCE option required to start primary without standby. Timeout seconds =
DATA #1 : Hexdump, 4 bytes
0x12C13A3C : 7800 0000                                  x...

2006-05-31-16.26.33.712399+600 I130383G418        LEVEL: Error
PID     : 9575                 TID  : 3085870784  PROC : db2hadrp (SAMPLE) 0
INSTANCE: sample               NODE : 000         DB   : SAMPLE
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduP, probe:20390
RETCODE : ZRC=0x8280001A=-2105540582=HDR_ZRC_NO_STANDBY
          "Comm time-out in unforced HADR primary start, to avoid split-brain"

2006-05-31-16.26.33.712573+600 I130802G319        LEVEL: Warning
PID     : 9575                 TID  : 3085870784  PROC : db2hadrp (SAMPLE) 0
INSTANCE: sample               NODE : 000         DB   : SAMPLE
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduP, probe:20302
MESSAGE : Info: Primary Finished.

2006-05-31-16.26.33.712704+600 I131122G422        LEVEL: Error
PID     : 9575                 TID  : 3085870784  PROC : db2hadrp (SAMPLE) 0
INSTANCE: sample               NODE : 000         DB   : SAMPLE
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduEntry, probe:21100
RETCODE : ZRC=0x8280001A=-2105540582=HDR_ZRC_NO_STANDBY
          "Comm time-out in unforced HADR primary start, to avoid split-brain"
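The 4-byte hexdump that follows "Timeout seconds =" in the primary's log can be decoded by hand. Here is a minimal sketch, assuming the dump prints the bytes in memory order on a little-endian x86 machine (an assumption, though it fits the blades described above). decode_le32 is a made-up helper name, not a DB2 tool.

```shell
#!/bin/sh
# decode_le32: hypothetical helper, not part of DB2.
# Takes four hexdump bytes as printed in db2diag.log (e.g. "7800 0000"),
# assumes little-endian byte order, and prints the value in hex and decimal.
decode_le32() {
    # Strip the space, then reverse the order of the four bytes.
    le32_be=$(printf '%s' "$1" | tr -d ' ' | sed 's/^\(..\)\(..\)\(..\)\(..\)$/\4\3\2\1/')
    printf '0x%08X %d\n' "0x$le32_be" "0x$le32_be"
}

decode_le32 "7800 0000"   # prints: 0x00000078 120
```

Read this way, the primary gave up after 120 seconds, which agrees with the two minutes between "Primary Started" (16.24.32) and the shutdown message (16.26.33) in the log above.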
Any assistance greatly appreciated
cheers dub dub
Mark A - 31 May 2006 08:52 GMT
> Yes, I think the control center runs the following commands anyway. And
> if I activate the standby it says it is already activated. Here are the
[quoted text clipped - 19 lines]
> cheers
> dub
Can you post your db config parms (HADR section only) on both primary and standby databases?
Also, post output from "db2level" and the OS you are using.
Steve Pearson (news only) - 31 May 2006 20:20 GMT
From the snippets of diag log shown, it appears that the standby was not able to establish a socket connection with the primary (primary listens, standby connects). It seems fairly common that this is not correctly configured on the first attempt. We've seen issues with incorrect HADR parameters, DNS problems, failure to properly set up service names, and inability to correctly map across a NAT.
Double check that your HADR comms parameters mesh up correctly (each side properly refers to itself in LOCAL params and to the other in REMOTE params).
HADR_LOCAL_HOST
HADR_LOCAL_SVC
HADR_REMOTE_HOST
HADR_REMOTE_SVC
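The "mesh" check can be done mechanically. The following is only a sketch. It assumes you have saved the output of "db2 get db cfg for sample" from each server into a file, and that each parameter appears on a line of the form "... (HADR_LOCAL_HOST) = value". hadr_param and check_mesh are hypothetical helper names, not DB2 commands.

```shell
#!/bin/sh
# hadr_param / check_mesh: hypothetical helpers, not DB2 commands.
# Each input file is assumed to hold the saved output of
#   db2 get db cfg for sample
# taken on one of the two servers.

# Print the value of one parameter (e.g. HADR_LOCAL_HOST) from a saved cfg file.
hadr_param() {
    awk -v key="($2)" 'index($0, key) {
        sub(/^[^=]*=[ \t]*/, ""); sub(/[ \t]+$/, ""); print; exit
    }' "$1"
}

# Compare a primary and a standby cfg file: each side's LOCAL value must equal
# the other side's REMOTE value. Returns non-zero on any mismatch.
check_mesh() {
    mesh_rc=0
    for kind in HOST SVC; do
        if [ "$(hadr_param "$1" "HADR_LOCAL_$kind")" != "$(hadr_param "$2" "HADR_REMOTE_$kind")" ]; then
            echo "mismatch: primary HADR_LOCAL_$kind vs standby HADR_REMOTE_$kind"
            mesh_rc=1
        fi
        if [ "$(hadr_param "$1" "HADR_REMOTE_$kind")" != "$(hadr_param "$2" "HADR_LOCAL_$kind")" ]; then
            echo "mismatch: primary HADR_REMOTE_$kind vs standby HADR_LOCAL_$kind"
            mesh_rc=1
        fi
    done
    return "$mesh_rc"
}

# Example:
#   check_mesh primary.cfg standby.cfg && echo "HADR comms parameters mesh"
```

The comparison is purely textual, so a short host name on one side and a fully qualified name on the other is reported as a mismatch even if both resolve to the same address.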
Ensure that your service names are registered and/or use IP addresses. Try using fully-specified network naming (a.b.c.d) for host names if you haven't already.
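One way to confirm that a service name is registered is to look it up in the services file on each server. This is a rough sketch: svc_port is a hypothetical helper name, and the service and host names in the example comments are the ones posted later in this thread.

```shell
#!/bin/sh
# svc_port: hypothetical helper. Prints the TCP port registered for a service
# name in a services file ($2, defaulting to /etc/services). Prints nothing
# if the name is not registered for TCP.
svc_port() {
    awk -v svc="$1" '$1 == svc && $2 ~ /\/tcp$/ {
        sub(/\/tcp$/, "", $2); print $2; exit
    }' "${2:-/etc/services}"
}

# Example, run on both servers (the answers should agree):
#   svc_port DB2_HADR_1
#   svc_port DB2_HADR_2
#   getent hosts dozer      # shows which address the host name resolves to
```

An empty result on either server means the name cannot be mapped to a port there, which is one reason to put the port number directly into the HADR service parameters instead.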
HTH.
Regards, - Steve P. -- Steve Pearson, IBM DB2 UDB for LUW Development, IBM Software Group DB2 "Portland" Team, IBM Beaverton Lab, Beaverton, OR, USA
Phil Sherman - 31 May 2006 20:29 GMT
Your ability to get this working on a single system indicates that you have the knowledge to do this from the database perspective.
A common cause of problems when going from one system to two systems, especially with Linux, is the requirement to pass through the firewalls. Make sure they are configured to allow the HADR ports to pass traffic.
Phil Sherman
> I'm having trouble getting HADR to work with the sample databases on
> two HS20 xSeries blades, Red Hat ES4 up3, DB2 8.2.4, getting the
[quoted text clipped - 11 lines]
> thanks
> dub

Mark A - 01 Jun 2006 01:52 GMT
> Your ability to get this working on a single system indicates that you
> have the knowledge to do this from the database perspective.
[quoted text clipped - 4 lines]
>
> Phil Sherman
He is using the GUI interface. I was able to configure HADR on a local Windows box with the GUI, but not with remote Linux boxes. Using command line configuration scripts on remote Linux boxes worked fine.
gumby - 01 Jun 2006 04:09 GMT
I'm running Red Hat ES4 up3
[sample@tank ~]$ uname -r
2.6.9-34.ELsmp

[sample@tank ~]$ db2level
DB21085I  Instance "sample" uses "32" bits and DB2 code release "SQL08024" with level identifier "03050106".
Informational tokens are "DB2 v8.1.2.104", "s060120", "MI00152", and FixPak "11".
Product is installed at "/opt/IBM/db2/V8.1".

STANDBY - tank
HADR database role                                      = STANDARD
HADR local host name                  (HADR_LOCAL_HOST) = tank
HADR local service name                (HADR_LOCAL_SVC) = DB2_HADR_2
HADR remote host name                (HADR_REMOTE_HOST) = dozer
HADR remote service name              (HADR_REMOTE_SVC) = DB2_HADR_1
HADR instance name of remote server  (HADR_REMOTE_INST) = sample
HADR timeout value                       (HADR_TIMEOUT) = 120
HADR log write synchronization mode     (HADR_SYNCMODE) = NEARSYNC

PRIMARY - dozer
HADR database role                                      = STANDARD
HADR local host name                  (HADR_LOCAL_HOST) = dozer
HADR local service name                (HADR_LOCAL_SVC) = DB2_HADR_1
HADR remote host name                (HADR_REMOTE_HOST) = tank
HADR remote service name              (HADR_REMOTE_SVC) = DB2_HADR_2
HADR instance name of remote server  (HADR_REMOTE_INST) = sample
HADR timeout value                       (HADR_TIMEOUT) = 120
HADR log write synchronization mode     (HADR_SYNCMODE) = NEARSYNC

/etc/services
# Local services
DB2_sample      60000/tcp
DB2_sample_1    60001/tcp
DB2_sample_2    60002/tcp
DB2_sample_END  60003/tcp
DB2_HADR_1      55001/tcp
DB2_HADR_2      55002/tcp
Currently doing some more tests with and without the GUI on the Linux boxes.
gumby - 01 Jun 2006 06:49 GMT
Are there any requirements for the servers to be cataloged? I mean, on the primary, should there be a catalog/node entry (not sure of the correct terms) pointing to the standby? And likewise, should there be an entry on the standby pointing to the primary?
Should they be listed when running the command db2 list node directory?
Mark A - 01 Jun 2006 07:08 GMT
> Are there any requirements for the servers to be cataloged. I mean on
> the primary, should there be a catalog/node entry (no sure of the
[quoted text clipped - 3 lines]
> Should they be described by running the command db2 list node
> directory.?
No, the nodes or databases on the other server (standby or primary) do not need to be catalogued in the local node or db directory.
gumby - 02 Jun 2006 02:21 GMT
These are the commands I have run to try to get HADR going (basically cut and pasted from what the GUI displays). Do they look okay? Any suggestions? I have used your (Mark's) suggested ports; the others are the ones that the HADR GUI comes up with.
--
-- Copy backup images from primary to standby system.
--
-- Location on primary system : /home/sample
-- Location on standby system : /home/sample
--
-- Restore database on standby system - TANK - SAMPLET (sample) - SAMPLET (SAMPLE)
--
RESTORE DATABASE SAMPLE FROM "/home/sample" TAKEN AT 20060602104857 REPLACE HISTORY FILE WITHOUT PROMPTING
--
-- Configure databases for client reroute - DOZER - sample - SAMPLE
--
UPDATE ALTERNATE SERVER FOR DATABASE SAMPLE USING HOSTNAME tank PORT 60000
--
-- Configure databases for client reroute - TANK - SAMPLET (sample) - SAMPLET (SAMPLE)
--
UPDATE ALTERNATE SERVER FOR DATABASE SAMPLE USING HOSTNAME dozer PORT 60000
--
-- Update service file on primary system - DOZER
-- Service name : DB2_HADR_1
-- Port number : 18819
-- Service name : DB2_HADR_2
-- Port number : 18820
--
-- Update service file on standby system - TANK
-- Service name : DB2_HADR_1
-- Port number : 18819
-- Service name : DB2_HADR_2
-- Port number : 18820
--
-- Update HADR configuration parameters on primary database - DOZER - sample - SAMPLE
--
UPDATE DB CFG FOR SAMPLE USING HADR_LOCAL_HOST dozer
UPDATE DB CFG FOR SAMPLE USING HADR_LOCAL_SVC DB2_HADR_1
UPDATE DB CFG FOR SAMPLE USING HADR_REMOTE_HOST tank
UPDATE DB CFG FOR SAMPLE USING HADR_REMOTE_SVC DB2_HADR_2
UPDATE DB CFG FOR SAMPLE USING HADR_REMOTE_INST sample
UPDATE DB CFG FOR SAMPLE USING HADR_SYNCMODE NEARSYNC
UPDATE DB CFG FOR SAMPLE USING HADR_TIMEOUT 300
CONNECT TO SAMPLE
QUIESCE DATABASE IMMEDIATE FORCE CONNECTIONS
UNQUIESCE DATABASE
CONNECT RESET
--
-- Update HADR configuration parameters on standby database - TANK - SAMPLET (sample) - SAMPLET (SAMPLE)
--
UPDATE DB CFG FOR SAMPLE USING HADR_LOCAL_HOST tank
UPDATE DB CFG FOR SAMPLE USING HADR_LOCAL_SVC DB2_HADR_2
UPDATE DB CFG FOR SAMPLE USING HADR_REMOTE_HOST dozer
UPDATE DB CFG FOR SAMPLE USING HADR_REMOTE_SVC DB2_HADR_1
UPDATE DB CFG FOR SAMPLE USING HADR_REMOTE_INST sample
UPDATE DB CFG FOR SAMPLE USING HADR_SYNCMODE NEARSYNC
UPDATE DB CFG FOR SAMPLE USING HADR_TIMEOUT 300
--
-- Start HADR on standby database - TANK - SAMPLET (sample) - SAMPLET (SAMPLE)
--
DEACTIVATE DATABASE SAMPLE
START HADR ON DATABASE SAMPLE AS STANDBY
--
-- Start HADR on primary database - DOZER - sample - SAMPLE
--
DEACTIVATE DATABASE SAMPLE
START HADR ON DATABASE SAMPLE AS PRIMARY
;
When run in this order, the standby database ends up in a remote catch-up pending state, as reported by the HADR GUI management tool and the snapshot command.
HADR Status
  Role                   = Standby
  State                  = Remote catchup pending
  Synchronization mode   = Nearsync
  Connection status      = Disconnected, 02-06-2006 11:17:01.627914
  Heartbeats missed      = 0
  Local host             = tank
  Local service          = DB2_HADR_2
  Remote host            = dozer
  Remote service         = DB2_HADR_1
  Remote instance        = sample
  timeout(seconds)       = 300
  Primary log position(file, page, LSN) = S0000000.LOG, 0, 0000000000000000
  Standby log position(file, page, LSN) = S0000001.LOG, 0, 0000000001388000
  Log gap running average(bytes) = 0
diag results.
2006-06-02-11.07.01.737895-240 E79516G362         LEVEL: Event
PID     : 25906                TID  : 3086079680  PROC : db2hadrs (SAMPLE) 0
INSTANCE: sample               NODE : 000         DB   : SAMPLE
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to S-RemoteCatchupPending (was S-LocalCatchup)

2006-06-02-11.07.01.636782-240 I79879G398         LEVEL: Warning
PID     : 25247                TID  : 3086079680  PROC : db2agent (SAMPLE) 0
INSTANCE: sample               NODE : 000         DB   : SAMPLE
APPHDL  : 0-39                 APPID: *LOCAL.sample.060602150702
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduStartup, probe:21152
MESSAGE : Info: HADR Startup has completed.
Is there a command I could run to test the connection that the standby database attempts to make to the primary when HADR starts up?
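A plain TCP probe from the standby to the primary's HADR port is one way to approximate that test outside DB2. This is a minimal sketch, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command are installed. port_open is a hypothetical helper name, and the host and port in the example are the ones from the configuration above.

```shell
#!/bin/sh
# port_open: hypothetical helper, not a DB2 command.
# Returns 0 if a TCP connection to host:port succeeds within the timeout
# (default 3 seconds), non-zero otherwise. Relies on bash's /dev/tcp
# redirection and the coreutils `timeout` command.
port_open() {
    timeout "${3:-3}" bash -c ': >"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example, run on the standby (tank) while START HADR ... AS PRIMARY is
# still waiting on dozer:
#   port_open dozer 18819 && echo "reachable" || echo "refused, filtered, or timed out"
```

Judging by the logs in this thread, the primary shuts HADR down again once the timeout expires, so the probe presumably only tells you something while the primary is still waiting for the standby. A failure does not say whether the cause is a firewall, name resolution, or nothing listening on that address.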
Mark A - 02 Jun 2006 05:38 GMT
> These are the commands I have run to try to get HADR going. (Basically
> cut paste from what the GUI displays). Do they look okay ? any
> suggestions ? I have used your (Mark) suggested ports the others are
> the ones that the HADR GUI comes up with.
The ports don't really matter so long as no one else is using them. Each database must have its own ports (in case you have more than one database on that server using HADR). They should be documented in /etc/services.
Here are the scripts that I use (run in order as indicated):
Assumptions:
server01 - primary db server
server02 - standby db server
db2inst1 - instance on primary server
db2inst2 - DB2 instance on standby server (but can be the same as primary)
database - sample
SCRIPT01 - RUN ON PRIMARY SERVER01
# Activate log retain and set log archive path (not necessary if logretain already enabled some other way)
db2 update db cfg for sample using LOGARCHMETH1 DISK:/db2/archive_logs

# Create offline backup of db to be restored on standby server02 (I am backing up to a shared mount point)
db2 "BACKUP DATABASE sample TO /db_backup/SAMPLE COMPRESS WITHOUT PROMPTING"

db2 update db cfg for sample using HADR_LOCAL_HOST server01
db2 update db cfg for sample using HADR_REMOTE_HOST server02

db2 update db cfg for sample using HADR_LOCAL_SVC 18819
db2 update db cfg for sample using HADR_REMOTE_SVC 18820
db2 update db cfg for sample using HADR_REMOTE_INST db2inst2
db2 update db cfg for sample using HADR_SYNCMODE nearsync
db2 update db cfg for sample using HADR_TIMEOUT 30
db2 update db cfg for sample using LOGINDEXBUILD ON

# Recommended parms for HADR because logs are sent to standby server
db2 update db cfg for sample using DBHEAP 2048
db2 update db cfg for sample using LOGBUFSZ 256

# This is the host name for automatic client re-route:
db2 update alternate server for database sample using hostname server02 port 50000
SCRIPT02 - RUN ON STANDBY SERVER02
# Restore database on standby server02
db2 RESTORE DATABASE sample FROM /db_backup/SAMPLE TAKEN AT 20060204213007 replace history file

# Activate log retain and set log archive path
db2 update db cfg for sample using LOGARCHMETH1 DISK:/db2/archive_logs

db2 update db cfg for sample using HADR_LOCAL_HOST server02
db2 update db cfg for sample using HADR_REMOTE_HOST server01

db2 update db cfg for sample using HADR_LOCAL_SVC 18820
db2 update db cfg for sample using HADR_REMOTE_SVC 18819
db2 update db cfg for sample using HADR_REMOTE_INST db2inst1
db2 update db cfg for sample using HADR_SYNCMODE nearsync
db2 update db cfg for sample using HADR_TIMEOUT 30
db2 update db cfg for sample using LOGINDEXBUILD ON

# Recommended parms for HADR because logs are sent to standby server
db2 update db cfg for sample using DBHEAP 2048
db2 update db cfg for sample using LOGBUFSZ 256

# This is the host name for automatic client re-route:
db2 update alternate server for database sample using hostname server01 port 50000

db2 start hadr on db sample as standby
SCRIPT03 - RUN ON PRIMARY SERVER01
db2 start hadr on db sample as primary
gumby - 05 Jun 2006 01:53 GMT
Thanks Mark,
I used those scripts but still no luck, same error. I confirmed the ports are not used by anything else. I also monitored the port on the primary for activity and confirmed that the standby does attempt to contact it when the HADR commands are run.
I checked the diag logs and get the following:
2006-06-05-10.40.47.894980+600 I406658G426        LEVEL: Severe
PID     : 18468                TID  : 3086255808  PROC : db2hadrs (SAMPLE) 0
INSTANCE: db2inst3             NODE : 000         DB   : SAMPLE
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20280
MESSAGE : Failed to connect to primary. rc:
DATA #1 : Hexdump, 4 bytes
0xBFE10050 : 1900 0F81                                  ....

2006-06-05-10.40.47.895157+600 I407085G370        LEVEL: Severe
PID     : 18468                TID  : 3086255808  PROC : db2hadrs (SAMPLE) 0
INSTANCE: db2inst3             NODE : 000         DB   : SAMPLE
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20280
RETCODE : ZRC=0x810F0019=-2129723367=SQLO_CONN_REFUSED
          "Connection refused"
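The two entries above carry the same return code in different forms. The hexdump bytes 19 00 0F 81 read in reverse order give 0x810F0019, and the negative decimal after it is that value read as a signed 32-bit integer. Here is a small sketch of the second conversion; to_int32 is a hypothetical helper name.

```shell
#!/bin/sh
# to_int32: hypothetical helper. Reinterprets an unsigned 32-bit value
# (hex notation accepted) as a signed two's-complement integer, the form
# shown after the hex ZRC in the log entries.
to_int32() {
    i32_u=$(( $1 ))
    if [ "$i32_u" -ge 2147483648 ]; then
        echo $(( i32_u - 4294967296 ))
    else
        echo "$i32_u"
    fi
}

to_int32 0x810F0019   # prints: -2129723367 (SQLO_CONN_REFUSED above)
to_int32 0x8280001A   # prints: -2105540582 (HDR_ZRC_NO_STANDBY in the earlier log)
```

So the standby is not reporting two different problems here: both lines say the TCP connection to the primary was refused.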
Any other ideas? I think I will attempt a re-install of DB2, fixpacks and sample instances and try again, in case I stuffed up something with the installation or user setup.
cheers dub
Mark A - 05 Jun 2006 03:01 GMT
> Thanks Mark,
>
[quoted text clipped - 31 lines]
> cheers
> dub
If reinstall does not work, you should open a PMR with IBM support (assuming you have a support contract).
Steve Pearson (news only) - 05 Jun 2006 18:08 GMT
Have you tried using fully qualified network naming or IP addresses for the HADR host configuration parameters? We've seen cases where there are unexpected problems with name resolution on the primary. For example, "host" may not work where "host.subnet.domain.com" may work. I'm not a networking guru and can't explain in detail why this occurs, but HADR is fairly picky that the configured host name matches the host that attempts to connect to the primary, as determined via host name resolution.
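To see what a configured host name actually resolves to on each server, something like the following sketch can help. It uses getent, which follows the system resolver order (/etc/hosts, DNS and so on). The loopback check is only one possibility worth ruling out and is an assumption here, since nothing in this thread confirms it as the cause. is_loopback and check_name are hypothetical helper names.

```shell
#!/bin/sh
# is_loopback / check_name: hypothetical helpers, not DB2 commands.

# Succeeds if the given address is a loopback address.
is_loopback() {
    case "$1" in
        127.*|::1) return 0 ;;
        *)         return 1 ;;
    esac
}

# Resolve a name through the system resolver and report the first address.
# Returns 2 if the name does not resolve, 1 if it resolves to loopback.
check_name() {
    cn_ip=$(getent hosts "$1" | awk 'NR == 1 { print $1 }')
    if [ -z "$cn_ip" ]; then
        echo "$1: does not resolve"
        return 2
    elif is_loopback "$cn_ip"; then
        echo "$1: resolves to loopback ($cn_ip)"
        return 1
    fi
    echo "$1: $cn_ip"
}

# Example, run on each server for both host names:
#   check_name dozer; check_name tank
```

If a server's own name maps to a loopback address, or the two servers disagree about an address, configuring the IP addresses directly takes name resolution out of the picture.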
Regards, - Steve P. -- Steve Pearson, IBM DB2 UDB for LUW Development, IBM Software Group DB2 "Portland" Team, IBM Beaverton Lab, Beaverton, OR, USA
gumby - 06 Jun 2006 05:53 GMT
Thanks everyone...
Got this finally resolved by using IP addresses for the host parameters under HADR (thx Steve and Mark). This was a really frustrating problem, as the hosts were all well defined in /etc/hosts and resolved fine at the OS level. Anywho, all is well now...
For example:
Standby

HADR database role                                      = STANDBY
HADR local host name                  (HADR_LOCAL_HOST) = 10.18.78.64
HADR local service name                (HADR_LOCAL_SVC) = 55002
HADR remote host name                (HADR_REMOTE_HOST) = 10.18.78.62
HADR remote service name              (HADR_REMOTE_SVC) = 55001
HADR instance name of remote server  (HADR_REMOTE_INST) = db2inst1
HADR timeout value                       (HADR_TIMEOUT) = 120
HADR log write synchronization mode     (HADR_SYNCMODE) = NEARSYNC

Primary

HADR database role                                      = PRIMARY
HADR local host name                  (HADR_LOCAL_HOST) = 10.18.78.62
HADR local service name                (HADR_LOCAL_SVC) = 55001
HADR remote host name                (HADR_REMOTE_HOST) = 10.18.78.64
HADR remote service name              (HADR_REMOTE_SVC) = 55002
HADR instance name of remote server  (HADR_REMOTE_INST) = db2inst2
HADR timeout value                       (HADR_TIMEOUT) = 120
HADR log write synchronization mode     (HADR_SYNCMODE) = NEARSYNC
thanks again
dub
Dubravko Akmacic Automation Engineer Industrial Markets - Strip & Plate BlueScope Steel Limited
Mark A - 01 Jun 2006 07:06 GMT
> I'm Running Red Hat ES4 up3
>
[quoted text clipped - 41 lines]
> Currently doing some more tests with and without the GUI on the linux
> boxes.
This may not help, but I would use the port number in your db config, and not the service name (but leave the service names in the /etc/services).
I assume the database names are the same on primary and standby (not specified in your post above).
When the HADR database role is STANDARD, then that means that HADR has not been started. So manually "start HADR on db xxxxxxx as standby" (on tank), and then (if successful) "start HADR on db xxxxxxx as primary" (on dozer). You must start HADR on the standby first.
If the above does not work, then you should check the ports (at the OS level) to make sure no one else is using 55001 and 55002. The recommended HADR ports start with 18819 (although I have no reason why, and don't know if this matters).
A useful way to monitor the current HADR status without the GUI is to take a database snapshot (refer to the HADR section):
db2 get snapshot for database on xxxxxxxx
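For scripting, a single field can be pulled out of the HADR section of the snapshot. This is a rough sketch, assuming the snapshot contains a "HADR Status" heading followed by "name = value" lines as in the output posted earlier in this thread. hadr_field is a hypothetical helper name.

```shell
#!/bin/sh
# hadr_field: hypothetical helper. Reads snapshot text on stdin and prints
# the value of the first field with the given name found after the
# "HADR Status" heading (e.g. Role, State, "Connection status").
hadr_field() {
    awk -v want="$1" '
        /HADR Status/ { in_hadr = 1; next }
        in_hadr {
            eq = index($0, "=")
            if (eq == 0) next
            name = substr($0, 1, eq - 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            if (name == want) {
                val = substr($0, eq + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", val)
                print val
                exit
            }
        }'
}

# Example:
#   db2 get snapshot for database on sample | hadr_field State
```

Run in a loop on the standby, this shows whether the state ever moves past "Remote catchup pending" after the primary is started.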
As I said previously, I was not able to get the GUI to work for HADR on Linux, but there are very few commands needed to get it working, so it is easy to script from the command line.