Discussion:
[HACKERS] Point in Time Recovery
Simon Riggs
2004-07-14 07:21:22 UTC
Can you give us some suggestions of what kind of stuff to test? Is
there a way we can artificially kill the backend in all sorts of nasty
spots to see if recovery works? Does kill -9 simulate a 'power off'?
I was hoping some fiendish plans would be presented to me...

But please start with "this feels like typical usage" and we'll go from
there...the important thing is to try the first one.

I've not done power off tests, yet. They need to be done just to
check...actually you don't need to do this to test PITR...

We need exhaustive tests of...
- power off
- scp and cross network copies
- all the permuted recovery options
- archive_mode = off (i.e. current behaviour)
- deliberately incorrectly set options (idiot-proof testing)

I'd love some help assembling a test document with numbered tests...

Best regards, Simon Riggs
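
On the kill -9 question quoted above: SIGKILL on the postmaster approximates
an abrupt crash rather than a true power failure, since the operating system
still flushes its own caches. A minimal sketch of that kind of test follows;
pgbench is just an assumed load generator, and pg_ctl stop -m immediate
(SIGQUIT) is the supported way to force crash recovery without pulling the plug:

$ pg_ctl -D $PGDATA start
$ createdb pgbench && pgbench -i pgbench      # assumes contrib/pgbench is installed
$ pgbench -c 4 -t 10000 pgbench &             # generate some write activity
$ kill -9 `head -1 $PGDATA/postmaster.pid`    # first line of postmaster.pid is the postmaster PID
$                                             # (any surviving backends must exit before restart)
$ pg_ctl -D $PGDATA start                     # restart; crash recovery should replay WAL to the end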


Mark Kirkwood
2004-07-22 01:51:04 UTC
Here is one for the 'idiot proof' category:

1) initdb and set archive_command
2) shutdown
3) do a backup
4) startup and run some transactions
5) shutdown and remove PGDATA
6) restore backup
7) startup

Obviously this does not work as the backup is performed with the
database shutdown.

This got me wondering for 2 reasons:

1) Some alternative database servers *require* a procedure like this to
enable their version of PITR - so the potential foot-gun thing is there.

2) Is it possible to make the recovery kick in even though pg_control
says the database state is shutdown?
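
For concreteness, a rough shell transcript of the numbered procedure above;
the paths, archive directory, and backup location are assumptions, and
archive_command goes into postgresql.conf:

$ initdb -D $PGDATA
$ echo "archive_command = 'cp %p /data1/pgdata/7.5-archive/%f'" >> $PGDATA/postgresql.conf
$ pg_ctl -D $PGDATA start && pg_ctl -D $PGDATA stop     # steps 1-2
$ tar cf /backups/base.tar -C $PGDATA .                 # step 3: cold backup, server stopped
$ pg_ctl -D $PGDATA start                               # step 4
$ psql template1 -c "create table t (i int); insert into t values (1);"
$ pg_ctl -D $PGDATA stop && rm -rf $PGDATA/*            # step 5
$ tar xf /backups/base.tar -C $PGDATA                   # step 6: restore backup
$ pg_ctl -D $PGDATA start                               # step 7

Whether the step-4 transactions come back then depends on what is in the
archive and on a recovery.conf pointing at it, which is where the rest of
the thread picks up.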
Post by Simon Riggs
I was hoping some fiendish plans would be presented to me...
But please start with "this feels like typical usage" and we'll go from
there...the important thing is to try the first one.
I've not done power off tests, yet. They need to be done just to
check...actually you don't need to do this to test PITR...
We need exhaustive tests of...
- power off
- scp and cross network copies
- all the permuted recovery options
- archive_mode = off (i.e. current behaviour)
- deliberately incorrectly set options (idiot-proof testing)
Tom Lane
2004-07-22 02:39:15 UTC
Post by Mark Kirkwood
1) initdb and set archive_command
2) shutdown
3) do a backup
4) startup and run some transactions
5) shutdown and remove PGDATA
6) restore backup
7) startup
Obviously this does not work as the backup is performed with the
database shutdown.
Huh? It works fine.

The bit you may be missing is that if you blow away $PGDATA including
pg_xlog/, you won't be able to recover past whatever you have in your WAL
archive area. The archive is certainly not going to include the current
partially-filled WAL segment, and it might be missing a few earlier
segments if the archival process isn't speedy. So you need to keep
those recent segments in pg_xlog/ if you want to recover to current time
or near-current time.

I'm becoming more and more convinced that we should bite the bullet and
move pg_xlog/ to someplace that is not under $PGDATA. It would just
make things a whole lot more reliable, both for backup and to deal with
scenarios like yours above. I tried to talk Bruce into this on the
phone the other day, but he wouldn't bite. I still think it's a good
idea though. It would
(1) eliminate the problem that a tar backup of $PGDATA would restore
stale copies of xlog segments, because the tar wouldn't include
pg_xlog in the first place.
(2) eliminate the problem that a naive "rm -rf $PGDATA" would blow away
xlog segments that you still need.

A possible compromise is that we should strongly suggest that pg_xlog
be pushed out to another place and symlinked if you are going to use
WAL archiving. That's already considered good practice for performance
if you have a separate disk spindle to put WAL on. It'll just have
to be good practice for WAL archiving too.

regards, tom lane
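
The symlink arrangement described here might look roughly like the following,
done while the server is stopped; the target volume /wal_disk is an assumption:

$ pg_ctl -D $PGDATA stop
$ mv $PGDATA/pg_xlog /wal_disk/pg_xlog
$ ln -s /wal_disk/pg_xlog $PGDATA/pg_xlog
$ pg_ctl -D $PGDATA start
# a tar of $PGDATA (without -h) now stores only the symlink, not the segments,
# and a naive "rm -rf $PGDATA" no longer deletes the live WAL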

Bruce Momjian
2004-07-22 03:06:22 UTC
I think we should push the partially complete WAL file to the archive
location before shutdown. I talked to you or Jan about it and you (or
Jan) wouldn't bite either, but I think when someone shuts down, they
assume they have things fully archived and can recover fully with a
previous backup and the archive files.

When you are running and finally fill up the WAL file it would then
overwrite the one in the archive but I think that is OK. Maybe we would
need to give it a special file extension so we only use it when we don't
have a full version.

---------------------------------------------------------------------------
Post by Tom Lane
Post by Mark Kirkwood
1) initdb and set archive_command
2) shutdown
3) do a backup
4) startup and run some transactions
5) shutdown and remove PGDATA
6) restore backup
7) startup
Obviously this does not work as the backup is performed with the
database shutdown.
Huh? It works fine.
The bit you may be missing is that if you blow away $PGDATA including
pg_xlog/, you won't be able to recover past whatever you have in your WAL
archive area. The archive is certainly not going to include the current
partially-filled WAL segment, and it might be missing a few earlier
segments if the archival process isn't speedy. So you need to keep
those recent segments in pg_xlog/ if you want to recover to current time
or near-current time.
I'm becoming more and more convinced that we should bite the bullet and
move pg_xlog/ to someplace that is not under $PGDATA. It would just
make things a whole lot more reliable, both for backup and to deal with
scenarios like yours above. I tried to talk Bruce into this on the
phone the other day, but he wouldn't bite. I still think it's a good
idea though. It would
(1) eliminate the problem that a tar backup of $PGDATA would restore
stale copies of xlog segments, because the tar wouldn't include
pg_xlog in the first place.
(2) eliminate the problem that a naive "rm -rf $PGDATA" would blow away
xlog segments that you still need.
A possible compromise is that we should strongly suggest that pg_xlog
be pushed out to another place and symlinked if you are going to use
WAL archiving. That's already considered good practice for performance
if you have a separate disk spindle to put WAL on. It'll just have
to be good practice for WAL archiving too.
regards, tom lane
Tom Lane
2004-07-22 03:29:08 UTC
Post by Bruce Momjian
I think we should push the partially complete WAL file to the archive
location before shutdown. ...
When you are running and finally fill up the WAL file it would then
overwrite the one in the archive but I think that is OK.
I don't think this can fly at all. Here are some off-the-top-of-the-head
objections:

1. We don't have the luxury of spending indefinite amounts of time to
do a database shutdown. Commonly we are under a twenty-second sentence
of death from init. I don't want to spend the 20 seconds waiting to see
if the archiver will manage to push 16MB onto a slow tape drive. Also,
if the archiver does fail to push the data in time, it'll likely leave a
broken (partial) xlog file in the archive, which would be really bad
news if the user then relies on that.

2. What if the archiver process entirely fails to push the file? (Maybe
there's not enough disk space, for instance.) In normal operation we'll
just retry every so often. We definitely can't do that during shutdown.

3. You're blithely assuming that the archival process can easily provide
overwrite semantics for multiple pushes of the same xlog filename. Stop
thinking about "cp to some directory" and start thinking "dump to tape"
or "burn onto CD" or something like that. We'll be raising the ante
considerably if we require the archive_command to deal with this.

I think the last one is really the most significant issue. We have to
keep the archiver API as simple as possible.

regards, tom lane

Bruce Momjian
2004-07-22 03:36:05 UTC
Agreed, it might not be possible, but your report does point out a
limitation in our implementation --- that a shutdown database contains
more information than a backup and the archive logs. That is not
intuitive.

In fact, if you shut down your database and want to reproduce it on
another machine, how do you do it? It seems you have to copy the pg_xlog
directory over to the new machine as well.

Moving pg_xlog to a new location doesn't make that any clearer either.
It seems documentation might be the only way to make this clear.

One idea would be to push the partial WAL file to the archive on server
shutdown, not reuse it, and start with a new WAL file on startup. At
least for a normal system shutdown this would give us an archive that
contains all the information that is in pg_xlog.

---------------------------------------------------------------------------
Post by Tom Lane
Post by Bruce Momjian
I think we should push the partially complete WAL file to the archive
location before shutdown. ...
When you are running and finally fill up the WAL file it would then
overwrite the one in the archive but I think that is OK.
I don't think this can fly at all. Here are some off-the-top-of-the-head
1. We don't have the luxury of spending indefinite amounts of time to
do a database shutdown. Commonly we are under a twenty-second sentence
of death from init. I don't want to spend the 20 seconds waiting to see
if the archiver will manage to push 16MB onto a slow tape drive. Also,
if the archiver does fail to push the data in time, it'll likely leave a
broken (partial) xlog file in the archive, which would be really bad
news if the user then relies on that.
2. What if the archiver process entirely fails to push the file? (Maybe
there's not enough disk space, for instance.) In normal operation we'll
just retry every so often. We definitely can't do that during shutdown.
3. You're blithely assuming that the archival process can easily provide
overwrite semantics for multiple pushes of the same xlog filename. Stop
thinking about "cp to some directory" and start thinking "dump to tape"
or "burn onto CD" or something like that. We'll be raising the ante
considerably if we require the archive_command to deal with this.
I think the last one is really the most significant issue. We have to
keep the archiver API as simple as possible.
regards, tom lane
Tom Lane
2004-07-22 03:42:22 UTC
Post by Bruce Momjian
Agreed, it might not be possible, but your report does point out a
limitation in our implementation --- that a shutdown database contains
more information than a backup and the archive logs. That is not
intuitive.
That's only because you are clinging to the broken assumption that
pg_xlog/ is part of the database, rather than part of the logs.
Separate that out as a distinct entity, and all gets better.

regards, tom lane

Bruce Momjian
2004-07-22 03:54:48 UTC
Post by Tom Lane
Post by Bruce Momjian
Agreed, it might not be possible, but your report does point out a
limitation in our implementation --- that a shutdown database contains
more information than a backup and the archive logs. That is not
intuitive.
That's only because you are clinging to the broken assumption that
pg_xlog/ is part of the database, rather than part of the logs.
Separate that out as a distinct entity, and all gets better.
Imagine this. I stop the server. I have a tar backup and a copy of
the archive. I should be able to take them to another machine and
recover the system to the point I stopped.

You are saying I need a copy of the pg_xlog directory too, and I need to
remove pg_xlog after I untar the data directory and put the saved
pg_xlog in there before I recover.

Should we create a server-side function that forces all WAL files to the
archive, including partially written ones? Maybe that would fix the problem
of people deleting pg_xlog before they untar: you tell them to run the
function before recovery. If the system can't be started, it is possible
the WAL files are no good either, but I'm not sure.
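
As a footnote, later PostgreSQL releases did grow a server-side function along
these lines: it forces a switch to a new WAL segment so that the segment just
written becomes complete and eligible for archiving (it does not ship a partial
file). Illustration only, against a modern server:

$ psql -c "SELECT pg_switch_wal();"     # called pg_switch_xlog() in releases before 10
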
Simon Riggs
2004-07-22 07:43:18 UTC
Post by Tom Lane
Post by Bruce Momjian
I think we should push the partially complete WAL file to the archive
location before shutdown. ...
When you are running and finally fill up the WAL file it would then
overwrite the one in the archive but I think that is OK.
I don't think this can fly at all. Here are some off-the-top-of-the-head
1. We don't have the luxury of spending indefinite amounts of time to
do a database shutdown. Commonly we are under a twenty-second sentence
of death from init. I don't want to spend the 20 seconds waiting to see
if the archiver will manage to push 16MB onto a slow tape drive. Also,
if the archiver does fail to push the data in time, it'll likely leave a
broken (partial) xlog file in the archive, which would be really bad
news if the user then relies on that.
2. What if the archiver process entirely fails to push the file? (Maybe
there's not enough disk space, for instance.) In normal operation we'll
just retry every so often. We definitely can't do that during shutdown.
3. You're blithely assuming that the archival process can easily provide
overwrite semantics for multiple pushes of the same xlog filename. Stop
thinking about "cp to some directory" and start thinking "dump to tape"
or "burn onto CD" or something like that. We'll be raising the ante
considerably if we require the archive_command to deal with this.
I think the last one is really the most significant issue. We have to
keep the archiver API as simple as possible.
I've not read the whole chain of conversation...but this idea came up before
and was rejected then. I agree with the three objections to that thought above.

There are already enough copies of full xlogs around to worry about.

If you need more granularity, reduce the size of the xlog files....

(Tom, SUID would be the correct timeline id in that situation? )

More later, Simon Riggs
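
On reducing the xlog file size: in the code of this era that is a compile-time
change, while later releases expose it directly. Shown purely as illustration:

$ ./configure --with-wal-segsize=1      # later releases: WAL segment size in MB (power of 2)
$ initdb --wal-segsize=1 -D $PGDATA     # PostgreSQL 11 and later: chosen per cluster at initdb time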


Bruce Momjian
2004-07-28 16:25:36 UTC
Here is another open PITR issue that I think will have to be addressed
in 7.6. If you do a critical transaction, but do nothing else for eight
hours, that critical transaction hasn't been archived yet. It is still
sitting in pg_xlog until the WAL file fills.

I think we will need to document this behavior and address it in some
way in 7.6. We can't assume that we can send multiple copies of pg_xlog
to the archive (partial and full ones) because we might be going to a
tape drive. However, this is a non-intuitive behavior of our archiver.
We might need to tell people to archive the most recent WAL file every
minute to some other location or something.
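
For the record, later releases address exactly this case with the
archive_timeout setting, which forces a segment switch after a chosen interval
so that a quiet system still gets its recent transactions into the archive; it
was not available in the code under discussion:

# postgresql.conf, later releases only
archive_timeout = 60        # force a WAL segment switch at least every 60 seconds
                            # (the switched-out segment is mostly padding, but it is archived)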

---------------------------------------------------------------------------
Post by Tom Lane
Post by Bruce Momjian
I think we should push the partially complete WAL file to the archive
location before shutdown. ...
When you are running and finally fill up the WAL file it would then
overwrite the one in the archive but I think that is OK.
I don't think this can fly at all. Here are some off-the-top-of-the-head
1. We don't have the luxury of spending indefinite amounts of time to
do a database shutdown. Commonly we are under a twenty-second sentence
of death from init. I don't want to spend the 20 seconds waiting to see
if the archiver will manage to push 16MB onto a slow tape drive. Also,
if the archiver does fail to push the data in time, it'll likely leave a
broken (partial) xlog file in the archive, which would be really bad
news if the user then relies on that.
2. What if the archiver process entirely fails to push the file? (Maybe
there's not enough disk space, for instance.) In normal operation we'll
just retry every so often. We definitely can't do that during shutdown.
3. You're blithely assuming that the archival process can easily provide
overwrite semantics for multiple pushes of the same xlog filename. Stop
thinking about "cp to some directory" and start thinking "dump to tape"
or "burn onto CD" or something like that. We'll be raising the ante
considerably if we require the archive_command to deal with this.
I think the last one is really the most significant issue. We have to
keep the archiver API as simple as possible.
regards, tom lane
JEDIDIAH
2004-09-30 15:54:50 UTC
Post by Bruce Momjian
Here is another open PITR issue that I think will have to be addressed
in 7.6. If you do a critical transaction, but do nothing else for eight
hours, that critical transaction hasn't been archived yet. It is still
sitting in pg_xlog until the WAL file fills.
I think we will need to document this behavior and address it in some
way in 7.6. We can't assume that we can send multiple copies of pg_xlog
to the archive (partial and full ones) because we might be going to a
If a particular transaction is so important that it absolutely
positively needs to be archived offline for PITR, then why not just mark
it that way or allow for the application to trigger archival of this
critical REDO?
Post by Bruce Momjian
tape drive. However, this is a non-intuitive behavior of our archiver.
We might need to tell people to archive the most recent WAL file every
minute to some other location or something.
[deletia]
Mark Kirkwood
2004-07-22 03:12:21 UTC
Well that is interesting :-)


Here is what I am doing on the removal front (I am keeping pg_xlog *now*):

$ cd $PGDATA
$ pg_ctl stop
$ ls | grep -v pg_xlog | xargs rm -rf    # remove everything under $PGDATA except pg_xlog

The contents of the archive directory just before recovery starts:

$ ls -l $PGDATA/../7.5-archive
total 49212
-rw------- 1 postgres postgres 16777216 Jul 22 14:59 000000010000000000000000
-rw------- 1 postgres postgres 16777216 Jul 22 14:59 000000010000000000000001
-rw------- 1 postgres postgres 16777216 Jul 22 14:59 000000010000000000000002

But here is the recovery startup log:

LOG: database system was shut down at 2004-07-22 14:58:57 NZST
LOG: starting archive recovery
LOG: restore_command = "cp /data1/pgdata/7.5-archive/%f %p"
cp: cannot stat `/data1/pgdata/7.5-archive/00000001.history': No such file or directory
LOG: restored log file "000000010000000000000000" from archive
LOG: checkpoint record is at 0/A4D3E8
LOG: redo record is at 0/A4D3E8; undo record is at 0/0; shutdown TRUE
LOG: next transaction ID: 496; next OID: 17229
LOG: archive recovery complete
LOG: database system is ready

regards

Mark
Post by Tom Lane
Huh? It works fine.
The bit you may be missing is that if you blow away $PGDATA including
pg_xlog/, you won't be able to recover past whatever you have in your WAL
archive area. The archive is certainly not going to include the current
partially-filled WAL segment, and it might be missing a few earlier
segments if the archival process isn't speedy. So you need to keep
those recent segments in pg_xlog/ if you want to recover to current time
or near-current time.
Tom Lane
2004-07-22 20:19:53 UTC
Post by Mark Kirkwood
2) Is it possible to make the recovery kick in even though pg_control
says the database state is shutdown?
Yeah, I think you are right: presence of recovery.conf should force a
WAL scan even if pg_control claims it's shut down. Fix committed.

regards, tom lane
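
For reference, the trigger here is simply the presence of a recovery.conf file
in $PGDATA; a minimal one, using the same restore_command that appears in
Mark's log earlier in the thread, would be just:

# $PGDATA/recovery.conf
restore_command = 'cp /data1/pgdata/7.5-archive/%f %p'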

Mark Kirkwood
2004-07-22 23:08:09 UTC
Excellent - Just updated and it is all good!

This change makes the whole "how do I do my backup" business nice and
basic - which is the right way IMHO.

regards

Mark
Post by Tom Lane
Post by Mark Kirkwood
2) Is it possible to make the recovery kick in even though pg_control
says the database state is shutdown?
Yeah, I think you are right: presence of recovery.conf should force a
WAL scan even if pg_control claims it's shut down. Fix committed.
regards, tom lane
Simon Riggs
2004-07-22 23:48:34 UTC
Post by Tom Lane
Post by Mark Kirkwood
2) Is it possible to make the recovery kick in even though pg_control
says the database state is shutdown?
Yeah, I think you are right: presence of recovery.conf should force a
WAL scan even if pg_control claims it's shut down. Fix committed.
This *should* be possible but I haven't tested it.

There is a code path on secondary checkpoints that indicates that crash
recovery can occur even when the database was shut down, since the code
forces recovery whether it was or not. On that basis, this may work, but
it is as yet untested. I didn't mention this because it might interfere
with getting hot backup to work...

Best Regards, Simon Riggs


Mark Kirkwood
2004-07-23 00:05:13 UTC
I have tested the "cold" backup - and retested my previous scenarios
using "hot" backup (just to be sure) . They all work AFAICS!

cheers

Mark
Post by Simon Riggs
Post by Tom Lane
Post by Mark Kirkwood
2) Is it possible to make the recovery kick in even though pg_control
says the database state is shutdown?
Yeah, I think you are right: presence of recovery.conf should force a
WAL scan even if pg_control claims it's shut down. Fix committed.
This *should* be possible but I haven't tested it.
There is a code path on secondary checkpoints that indicates that crash
recovery can occur even when the database was shut down, since the code
forces recovery whether it was or not. On that basis, this may work, but
it is as yet untested. I didn't mention this because it might interfere
with getting hot backup to work...
Best Regards, Simon Riggs
Simon Riggs
2004-07-24 20:35:10 UTC
Post by Mark Kirkwood
I have tested the "cold" backup - and retested my previous scenarios
using "hot" backup (just to be sure) . They all work AFAICS!
cheers
Yes, I'll drink to that! Thanks for your help.

Best Regards, Simon Riggs





Tom Lane
2004-07-23 02:01:24 UTC
Post by Simon Riggs
Post by Tom Lane
Yeah, I think you are right: presence of recovery.conf should force a
WAL scan even if pg_control claims it's shut down. Fix committed.
This *should* be possible but I haven't tested it.
I did.

It's really not risky. The fact that the code doesn't look beyond the
checkpoint record when things seem to be kosher is just a speed
optimization (and probably a rather pointless one...). We have got to be
able to detect the end of WAL in any case, so we'd just find there are
no more records and stop.

regards, tom lane
