Discussion:
[Fwd: Re: Question about pgfsck]
(too old to reply)
Dan Hrabarchuk
2004-10-07 16:25:24 UTC
Permalink
Is there a FAQ or a HOWTO on recovering from database corruption? If
not, then I am willing to write it. I will need help getting the
document together because as much as I google, I come up empty handed.

What I need is something that tells you:

A> You get errors in your server log like:
PANIC: open of /psql/pg_clog/0C95 failed: No such file or directory
- or -
PANIC: ZeroFill failed to write /psql/pg_xlog/xlogtemp.27517: No space
left on device :)
- or -
LOG: server process (pid 48146) was terminated by signal 6
- or -
LOG: ReadRecord: record with zero length at 39D/A53F7800
- or -
LOG: ReadRecord: unexpected pageaddr 39A/78D4000 in log file 922,
segment 15, offset 9256960
- or -
ERROR: Invalid page header in block 2879878 of channeldata

etc.

Which are serious? What needs to be done?

B> How to find all corrupt tuples, whether they are in indexes or the
tables themselves.

C> How to eliminate the corrupt tuples. Delete them of course. But how
when the server errors on deleting them??

pgfsck looks like an awesome tool. It allows you to search areas of your
database that are rarely used. So you don't get a 1AM phone call from
the sales guy needing some report for tomorrow at 7AM and he can't get
it.

I want to have something that can be easily digested and doesn't require
advanced knowledge of postgres. What does the beginning PostgreSQL admin
do? A database disaster guide.

Thanks

Dan Hrabarchuk

-----Forwarded Message-----
Subject: Re: Question about pgfsck
Date: Thu, 07 Oct 2004 10:38:33 +0200
The problem is obviously with the structure of the tuples. So you need
to identify it by something then does not require parsing the tuples.
Hence the suggestion of ctid or oid or something else fixed.
Good luck,
3. Finding the file with the error.
And the tablename, ergo, filename
4. Finally fixing the error.
Delete the record is the usual way.
SELECT * FROM channeldata where cd_id=6268 and tstamp<???2004-09-20??? and
tstamp>???2004-09-15???;
Will cause a signal 6 and have the database backend restart.
If I change it to: DELETE FROM channeldata where cd_id=6268 and
tstamp<???2004-09-20??? and tstamp>???2004-09-15???;
I get an error and the transaction does not commit. How do you delete
the records? Or is this not the usual way.
I can recover the data after it's deleted. I just need to delete it.
Thanks again.
Dan
---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
Tom Lane
2004-10-07 16:55:55 UTC
Permalink
Post by Dan Hrabarchuk
Is there a FAQ or a HOWTO on recovering from database corruption? If
not, then I am willing to write it. I will need help getting the
document together because as much as I google, I come up empty handed.
...
I want to have something that can be easily digested and doesn't require
advanced knowledge of postgres.
I think this is an impossible goal. The facts of the matter are that it
generally *isn't* easy to get out of a corrupt-data scenario, and it
certainly isn't something that can be reduced to a cookbook recipe that
a novice can follow. If we had bugs with effects as predictable as
that, we would have found them and fixed them long ago. I think every
corruption situation is unique and has to be approached as a fresh
problem.

Part of that feeling stems from the fact that you usually want to try
to identify *why* the corruption occurred, not just fix its immediate
effects, and that almost always requires examining the entrails at a
pretty low level of detail. If you don't find out why it happened then
you have no confidence that it won't happen again.

Don't get me wrong --- I think a HOWTO on this is a fine idea. I'm just
telling you that it's going to have to be a fairly detailed document
that is going to teach the reader quite a bit about postgres internals.

You can collect a lot of material for it by trawling the PG mailing list
archives for old threads in which people were having corruption
problems. Aside from the obvious of "corruption", try looking for
threads mentioning REINDEX, pg_resetxlog, pg_filedump, locating tuples
by ctid, the zero_damaged_pages parameter, memtest86, and badblocks,
as those are the common tools for this sort of activity.

You can search the archives either directly at
http://archives.postgresql.org/ or via Oleg and Teodor's indexer
at http://www.pgsql.ru/db/pgsearch/. I tend to use both as they
seem to have different searching behavior. Google has a good indexer
but a rather incomplete set of PG list archives, so that's usually my
last choice...

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Loading...