Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
Home
Discussion Groups
Database Servers
DB2InformixIngresMS SQLOraclePervasive.SQLPostgreSQLProgressSybase
Desktop Databases
FileMakerFoxProMS AccessParadox
General
General DB TopicsDatabase Theory
Related Topics
Java Development.NET DevelopmentVB DevelopmentMore Topics ...

Database Forum / General PostgreSQL Topics / April 2006

Tip: Looking for answers? Try searching our database.

Postmaster Problem

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Tony O'Bryan - 02 Mar 2006 19:20 GMT
I am normally the admin for our PostgreSQL servers, but someone else tried
to kill a runaway query while I was out sick.  Unfortunately, he tried
killing the runaway query by killing the postmaster.  Now we are dealing
with a persistent problem that just won't go away.

1) Spontaneous back end deaths.
2) Spontaneous backend disconnects (which are probably because of 1).
3) pg_clog files disappearing.

The first symptom of the problem was a message I received while within psql.
It was something like this: "Some backend process died abnormally, causing
your session to terminate."  I could reconnect to the database afterwards.

The second symptom was another use receiving an error message along the
lines of:

"ERROR:  could not access status of transaction 97201274
DETAIL:  could not open file "/var/lib/pgsql/data/pg_clog/005C": No such
file or directory".

The query was a simple count(*) operation: "select count(*) from
[sometable]".

I read prior Usenet postings for this error and did the following:

1) touch /var/lib/pgsql/data/pg_clog/005C.temp
2) chown postgres.postgres /var/lib/pgsql/data/pg_clog/005C.temp
3) chmod 0600 /var/lib/pgsql/data/pg_clog/005C.temp
4) dd if=/dev/zero of=/var/lib/pgsql/data/pg_clog/005C.temp bs=8192 count=1
5) mv /var/lib/pgsql/data/pg_clog/005C.temp /var/lib/pgsql/data/pg_clog/005C

Then I reloaded PostgreSQL (the prior postings didn't indicated a need to
reload or restart the database, so I chose the least disruptive option):

/etc/init.d/postgresql reload

I tried the query again, and I got an error message saying that PostgreSQL
couldn't read the new 005C file past a certain offset.  So I reran the dd
command above, but used a count of 100 instead of 1.  The count query then
worked.  I crossed my fingers and hoped the problem was solved.

However, the same problem reappeared the next morning.  My suspicion was
that PostgreSQL's metadata tables weren't properly updated since I
performed the above operations on a running database.  I then brought
PostgreSQL down completely (/etc/init.d/postgresql stop), repeated the
procedure, then restarted the database.  Again, the problem seemed to be
fixed.

That was until a user called and reported an error message about a
spontaneous backend disconnection (I viewed the message myself, and it
indeed said that data could not be received from the server).  I'm not
certain that was caused by PostgreSQL, because the developer of that
particular program has had long standing problems caused by his own bugs.

I came into work this morning, and tried the same select count(*) query that
I mentioned above, and I got the same file not found error I posted above.
Following the same procedures I outlined above fixed the problem, but now I
know the fix is only temporary.

My questions are:

1) What kind of database damage can I expect to find?

2) I think my daily pg_dumps are succeeding, so is my only real option to
start a new database and restore from a backup?

3) Why is the clog I create disappearing?  Is it because of vacuuming?

4) Can this problem be fixed without resorting to creating a new database
cluster?  This cluster hosts about a dozen active databases spread across 4
departments.
Tony O'Bryan - 03 Mar 2006 13:02 GMT
> I am normally the admin for our PostgreSQL servers, but someone else tried
> to kill a runaway query while I was out sick.  Unfortunately, he tried
> killing the runaway query by killing the postmaster.  Now we are dealing
> with a persistent problem that just won't go away.

The only quick solution appears to be copying the database cluster.

1) Create a new cluster at a different location using initdb
2) Disable remote access to the old cluster.
3) Dump the cluster.
4) Start a parallel instance of the postmaster at the new location.
5) Import the dump into the new instance.
6) Shut down both instances.
7) Rename the old instance to an archival area.
8) Rename the new instance to that of the old instance.
9) Restart the database.

The new instance is now fixed of all the problems caused by killing the
postmaster.
Schmidty - 02 Apr 2006 01:31 GMT
>>I am normally the admin for our PostgreSQL servers, but someone else tried
>>to kill a runaway query while I was out sick.  Unfortunately, he tried
[quoted text clipped - 15 lines]
> The new instance is now fixed of all the problems caused by killing the
> postmaster.

I'm not an expert with Postgres but do use it a lot and after reading
your problem and before looking at your follow up I would have suggested
 backing up your data with a pg_dump and re-installing the DB cluster.
Interesting how simple and yet effective Postgres is! I love this
database!!!

Schmidty
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.