Discussion:
Problems dumping database from 7.3.1
(too old to reply)
Gordon Ross
2004-05-12 11:43:08 UTC
Permalink
I've been using a PGSQL 7.3.1 system as my development box for some time, and so far, it worked fine. (This is running on Slackware distro with a 2.2.19 kernel under VMWare 4)

However, yesterday, I added a column to a base table (that is inherited by several other tables). Shortly afterwards, whenever I tried to access either the base table or child tables, I get the message saying that pgsql has lost communication with the server.

Assuming that I had stumbled across a bug, I decided to upgrade to 7.4.2. Before starting, I tried to do a pg_dump, and it failed. Again, it failed partway through, loosing the connection to the server.

Below is the erorr I am getting

***@Vmware1:~$ pg_dump -f /tmp/sw4.sql -U postgres switchboards
pg_dump: connection not open
pg_dump: lost synchronization with server, resetting connection
pg_dump: SQL command to dump the contents of table "StaffMember" failed: PQendcopy() failed.
pg_dump: Error message from server: FATAL: The database system is starting up

pg_dump: The command was: COPY public."StaffMember" ("Id", "ExDirectory", "Name", "JobTitle", "Title", "FirstName", "MiddleInitial", "LastName", "Status", "WelshSpeakingAbility", "JobTitleC", "OrgGroupID", "OtherNumber", "PostNumber", "BandId", "JobDescE", "JobDescW", "EMailAddr", "NDSObject", "SectionId", dnid) TO stdout;
***@Vmware1:~$

The log file says that the server process was terminated by a signal 11, and it then restarts.


Any ideas as to how I can bring my database back to life ?

Thanks,

GTG

Gordon Ross,
Network Manager/Rheolwr Rhydwaith
Countryside Council for Wales/Cyngor Cefn Gwlad Cymru


---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
Tom Lane
2004-05-12 13:10:36 UTC
Permalink
Post by Gordon Ross
I've been using a PGSQL 7.3.1 system as my development box for some time, and so far, it worked fine. (This is running on Slackware distro with a 2.2.19 kernel under VMWare 4)
However, yesterday, I added a column to a base table (that is inherited by several other tables). Shortly afterwards, whenever I tried to access either the base table or child tables, I get the message saying that pgsql has lost communication with the server.
Sounds pretty messy. If you're really lucky, an in-place update to
7.3.6 might fix it, but I can't offhand think of any bug fixes in the
7.3.* branch that had anything to do with inheritance.

Of course it's not necessarily true that this has anything to do with
inheritance, either. What we'd need to know before we could speculate
much further is where exactly the crash is happening. A gdb backtrace
from the SEGV would help a lot. I'd suggest:

1. Start a psql session. Figure out the PID of the connected backend
(pg_stat_activity may help, or just use "ps").

2. In another shell window, attach to the backend process with gdb:
gdb /path/to/postgres-executable PID
...
gdb>
(You can do this either as root or as the postgres user.)

3. Tell gdb to let the backend continue:
gdb> cont

4. Do whatever it takes in the psql session to provoke crash. gdb
should announce that the backend has gotten a signal 11. Now say
gdb> bt
gdb> quit

and send along the output of bt.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match
Gordon Ross
2004-05-12 14:14:55 UTC
Permalink
Post by Tom Lane
4. Do whatever it takes in the psql session to provoke crash. gdb
should announce that the backend has gotten a signal 11. Now say
gdb> bt
gdb> quit
and send along the output of bt.
Program received signal SIGSEGV, Segmentation fault.
0x806ba0a in nocachegetattr () at eval.c:88
88 eval.c: No such file or directory.
(gdb) bt
#0 0x806ba0a in nocachegetattr () at eval.c:88
#1 0x80d41ec in ExecEvalVar () at eval.c:88
#2 0x80d549c in ExecEvalExpr () at eval.c:88
#3 0x80d5809 in ExecTargetList () at eval.c:88
#4 0x80d5a9b in ExecProject () at eval.c:88
#5 0x80d5b79 in ExecScan () at eval.c:88
#6 0x80dbcee in ExecSeqScan () at eval.c:88
#7 0x80d3929 in ExecProcNode () at eval.c:88
#8 0x80d76d0 in ExecProcAppend () at eval.c:88
#9 0x80d3919 in ExecProcNode () at eval.c:88
#10 0x80dbac1 in ExecResult () at eval.c:88
#11 0x80d3909 in ExecProcNode () at eval.c:88
#12 0x80d288e in ExecutePlan () at eval.c:88
#13 0x80d1f80 in ExecutorRun () at eval.c:88
#14 0x8121aab in ProcessQuery () at eval.c:88
#15 0x811ff7d in pg_exec_query_string () at eval.c:88
#16 0x8121093 in PostgresMain () at eval.c:88
#17 0x8108b9a in DoBackend () at eval.c:88
#18 0x810848f in BackendStartup () at eval.c:88
#19 0x810763c in ServerLoop () at eval.c:88
#20 0x810719a in PostmasterMain () at eval.c:88
#21 0x80e46ff in main () at eval.c:88
#22 0x400e92eb in __libc_start_main (main=0x80e4530 <main>, argc=2,
ubp_av=0xbffffd44, init=0x806a7b0 <_init>, fini=0x81785fc <_fini>,
---Type <return> to continue, or q <return> to quit---
rtld_fini=0x4000c130 <_dl_fini>, stack_end=0xbffffd3c)
at ../sysdeps/generic/libc-start.c:129

Gordon Ross,
Network Manager/Rheolwr Rhydwaith
Countryside Council for Wales/Cyngor Cefn Gwlad Cymru


---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ***@postgresql.org so that your
message can get through to the mailing list cleanly
Tom Lane
2004-05-12 16:51:10 UTC
Permalink
Post by Gordon Ross
Post by Tom Lane
4. Do whatever it takes in the psql session to provoke crash
and send along the output of bt.
Program received signal SIGSEGV, Segmentation fault.
0x806ba0a in nocachegetattr () at eval.c:88
88 eval.c: No such file or directory.
(gdb) bt
#0 0x806ba0a in nocachegetattr () at eval.c:88
#1 0x80d41ec in ExecEvalVar () at eval.c:88
#2 0x80d549c in ExecEvalExpr () at eval.c:88
#3 0x80d5809 in ExecTargetList () at eval.c:88
Hmm. The file/line references are obviously wrong, but let's assume
that the function names are right (they at least look plausible).
What exactly was the query that provoked this?
Also, could we see \d output for the table(s) involved?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ***@postgresql.org)
Gordon Ross
2004-05-12 17:24:09 UTC
Permalink
switchboards=# \d "DNOwner";
Table "public.DNOwner"
Column | Type | Modifiers
-------------+---------+----------------------------------------------------
Id | integer | not null default nextval('"DNOwner_id_seq"'::text)
ExDirectory | boolean | default 'f'
Name | text |
dnid | integer |

switchboards=# select * from "DNOwner";
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.


The DNOwner table is the base table. There are other tables that inherit from this table, and doing a simple select on those produce the same result.

GTG
Post by Gordon Ross
Post by Tom Lane
4. Do whatever it takes in the psql session to provoke crash
and send along the output of bt.
Program received signal SIGSEGV, Segmentation fault.
0x806ba0a in nocachegetattr () at eval.c:88
88 eval.c: No such file or directory.
(gdb) bt
#0 0x806ba0a in nocachegetattr () at eval.c:88
#1 0x80d41ec in ExecEvalVar () at eval.c:88
#2 0x80d549c in ExecEvalExpr () at eval.c:88
#3 0x80d5809 in ExecTargetList () at eval.c:88
Hmm. The file/line references are obviously wrong, but let's assume
that the function names are right (they at least look plausible).
What exactly was the query that provoked this?
Also, could we see \d output for the table(s) involved?

regards, tom lane


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
Tom Lane
2004-05-12 19:35:44 UTC
Permalink
Post by Gordon Ross
switchboards=# select * from "DNOwner";
server closed the connection unexpectedly
The DNOwner table is the base table. There are other tables that inherit from this table, and doing a simple select on those produce the same result.
Hmm. Does 'select * from only "DNOwner";' work, by any chance?

I have a suspicion now that you were bit by the bug repaired in this
7.3.2 bug fix:

2003-01-04 19:56 tgl

* src/backend/optimizer/prep/prepunion.c (REL7_3_STABLE): Fix
inherited UPDATE for cases where child column numbering doesn't
match parent table. This used to work, but was broken in 7.3 by
rearrangement of code that handles targetlist sorting. Add a
regression test to catch future breakage.

What I am thinking happened is that you did an 'UPDATE "DNOwner"' to
insert values for the new column, and that because of the aforesaid
bug, the updates of the child tables actually corrupted the data (by
storing the intended new dnid value into the wrong column position).
The crash is probably happening because SELECT is trying to interpret
an integer as a text field, or something along that line.

If that theory is correct, then the parent table should still be okay
(thus the request to try an ONLY select). However, I'm afraid all your
child tables are toast, or at least all the rows you updated this way
are toast. You could maybe get to a dumpable state by deleting all the
child rows, but if the DELETE crashes then you'll have little recourse
but to drop the child tables completely.

Updating Postgres to 7.3.2 or later will prevent fresh occurrences of
the same problem, but unfortunately won't bring back your old data :-(.
I hope you have backups ...

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ***@postgresql.org so that your
message can get through to the mailing list cleanly
Gordon Ross
2004-05-12 20:20:51 UTC
Permalink
*SIGH* Oh, well. I suppose I should have updated up development machine a while ago. ./configure; make here we come.

Thank you for looking at this. At least I can be re-assured that the problem is fixed in later versions of Postgres.

GTG
Post by Gordon Ross
switchboards=# select * from "DNOwner";
server closed the connection unexpectedly
The DNOwner table is the base table. There are other tables that inherit from this table, and doing a simple select on those produce the same result.
Hmm. Does 'select * from only "DNOwner";' work, by any chance?

I have a suspicion now that you were bit by the bug repaired in this
7.3.2 bug fix:

2003-01-04 19:56 tgl

* src/backend/optimizer/prep/prepunion.c (REL7_3_STABLE): Fix
inherited UPDATE for cases where child column numbering doesn't
match parent table. This used to work, but was broken in 7.3 by
rearrangement of code that handles targetlist sorting. Add a
regression test to catch future breakage.

What I am thinking happened is that you did an 'UPDATE "DNOwner"' to
insert values for the new column, and that because of the aforesaid
bug, the updates of the child tables actually corrupted the data (by
storing the intended new dnid value into the wrong column position).
The crash is probably happening because SELECT is trying to interpret
an integer as a text field, or something along that line.

If that theory is correct, then the parent table should still be okay
(thus the request to try an ONLY select). However, I'm afraid all your
child tables are toast, or at least all the rows you updated this way
are toast. You could maybe get to a dumpable state by deleting all the
child rows, but if the DELETE crashes then you'll have little recourse
but to drop the child tables completely.

Updating Postgres to 7.3.2 or later will prevent fresh occurrences of
the same problem, but unfortunately won't bring back your old data :-(.
I hope you have backups ...

regards, tom lane


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Loading...