Discussion: Problems with pg_dump
Stefan Holzheu
2004-01-26 12:50:43 UTC
Two weeks ago we upgraded from PostgreSQL 7.3.2 to 7.4.1.
Now we have the problem that pg_dump does not finish. We get the
following error message:

pg_dump: lost synchronization with server: got message type "5", length
842281016
pg_dump: SQL command to dump the contents of table "aggr_stddev_hour"
failed:
PQendcopy() failed.
pg_dump: Error message from server: lost synchronization with server: got
message type "5", length 842281016
pg_dump: The command was: COPY messungen.aggr_stddev_hour (id, von, wert,
counts) TO stdout;
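
(The failing statement can also be run directly through psql, to check
whether the problem is specific to pg_dump or hits any client running
that COPY; DB stands for the actual database name:

	psql DB -c 'COPY messungen.aggr_stddev_hour (id, von, wert, counts) TO stdout' > /dev/null

)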

The database size is about 10 GB. The OS is Red Hat Linux 7.1.

Help appreciated!

Stefan
--
-----------------------------
Dr. Stefan Holzheu
Tel.: 0921/55-5720
Fax.: 0921/55-5799
BITOeK Wiss. Sekretariat
Universitaet Bayreuth
D-95440 Bayreuth
-----------------------------


Tom Lane
2004-01-26 14:39:58 UTC
Post by Stefan Holzheu
Two weeks ago we upgraded from PostgreSQL 7.3.2 to 7.4.1.
Now we have the problem that pg_dump does not finish. We get the
pg_dump: lost synchronization with server: got message type "5", length
842281016
pg_dump: SQL command to dump the contents of table "aggr_stddev_hour"
That's interesting. What are the column data types in that table? What
character set encoding are you using? Can you do a dump if you select
the dump-data-as-INSERT-commands option?
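
(With 7.4's pg_dump that is the -d/--inserts switch; for a single
table, something like

	pg_dump --inserts -t aggr_stddev_hour DB > aggr_stddev_hour.sql

with DB standing for the database name.)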
Post by Stefan Holzheu
The database size is about 10 GB.
How big is the particular table?

regards, tom lane

Stefan Holzheu
2004-01-26 15:31:12 UTC
Post by Tom Lane
Post by Stefan Holzheu
Two weeks ago we upgraded from PostgreSQL 7.3.2 to 7.4.1.
Now we have the problem that pg_dump does not finish. We get the
pg_dump: lost synchronization with server: got message type "5", length
842281016
pg_dump: SQL command to dump the contents of table "aggr_stddev_hour"
That's interesting. What are the column data types in that table? What
character set encoding are you using? Can you do a dump if you select
the dump-data-as-INSERT-commands option?
Table "messungen.aggr_stddev_hour"
 Column |           Type           | Modifiers
--------+--------------------------+-----------
 id     | integer                  | not null
 von    | timestamp with time zone | not null
 wert   | double precision         |
 counts | integer                  | not null
Indexes:
    "idx_aggr_stddev_hour_id_von" unique, btree (id, von)

Encoding: LATIN9

The error does not always occur, and not always with the same table.
However, the error occurs only on this kind of aggregation table. There
is a cron job keeping the tables up to date, running every 10 minutes.
The job performs deletes and inserts on the table. Could this somehow
block the dump process? Normally it should not, should it?
Post by Tom Lane
Post by Stefan Holzheu
The database size is about 10 GB.
How big is the particular table?
The table has 9,000,000 entries.

Stefan
--
-----------------------------
Dr. Stefan Holzheu
Tel.: 0921/55-5720
Fax.: 0921/55-5799
BITOeK Wiss. Sekretariat
Universitaet Bayreuth
D-95440 Bayreuth
-----------------------------


Tom Lane
2004-01-26 16:19:46 UTC
Post by Stefan Holzheu
pg_dump: lost synchronization with server: got message type "5", length
842281016
The error does not always occur, and not always with the same table.
Oh, that's even more interesting. Is the failure message itself
consistent --- that is, is it always complaining about "message type 5"
and the same bizarre length value? The "length" looks like it's really
ASCII text ("2408" to be specific), so somehow libpq is misinterpreting
a piece of the COPY datastream as the start of a new message.
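
(Easy to check:

	$ printf '%x\n' 842281016
	32343038

and 0x32 0x34 0x30 0x38 are the ASCII codes for "2", "4", "0", "8".)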
Post by Stefan Holzheu
However, the error occurs only on this kind of aggregation table. There
is a cron job keeping the tables up to date, running every 10 minutes.
The job performs deletes and inserts on the table. Could this somehow
block the dump process? Normally it should not, should it?
It's hard to see how another backend would have anything to do with
this, unless perhaps the error is dependent on a particular data value
that is sometimes present in the table and sometimes not. It looks to
me like either libpq or the backend is miscounting the number of data
bytes in the COPY datastream. Would it be possible for you to use a
packet sniffer to capture the communication between pg_dump and the
backend? If we could look at exactly what's going over the wire, it
would help to pin down the blame.

regards, tom lane

Stefan Holzheu
2004-01-26 19:29:33 UTC
Post by Tom Lane
Post by Stefan Holzheu
pg_dump: lost synchronization with server: got message type "5", length
842281016
The error does not always occur, and not always with the same table.
Oh, that's even more interesting. Is the failure message itself
consistent --- that is, is it always complaining about "message type 5"
and the same bizarre length value? The "length" looks like it's really
ASCII text ("2408" to be specific), so somehow libpq is misinterpreting
a piece of the COPY datastream as the start of a new message.
The failure looks similar, but the numbers are not exactly the same:

pg_dump: lost synchronization with server: got message type "4", length
858863113
pg_dump: SQL command to dump the contents of table "aggr_max_hour"
failed: PQendcopy() failed.
pg_dump: Error message from server: lost synchronization with server:
got message type "4", length 858863113
pg_dump: The command was: COPY messungen.aggr_max_hour (id, von, wert,
counts) TO stdout;
Post by Tom Lane
Post by Stefan Holzheu
However, the error occurs only on this kind of aggregation table. There
is a cron job keeping the tables up to date, running every 10 minutes.
The job performs deletes and inserts on the table. Could this somehow
block the dump process? Normally it should not, should it?
It's hard to see how another backend would have anything to do with
this, unless perhaps the error is dependent on a particular data value
that is sometimes present in the table and sometimes not. It looks to
me like either libpq or the backend is miscounting the number of data
bytes in the COPY datastream. Would it be possible for you to use a
packet sniffer to capture the communication between pg_dump and the
backend? If we could look at exactly what's going over the wire, it
would help to pin down the blame.
Well, I have never used a packet sniffer, but if anyone tells me what
to capture, I'd be glad to.

PostgreSQL is really a great piece of software; keep on making it even
better!

Stefan
--
-----------------------------
Dr. Stefan Holzheu
Tel.: 0921/55-5720
Fax.: 0921/55-5799
BITOeK Wiss. Sekretariat
Universitaet Bayreuth
D-95440 Bayreuth
-----------------------------


Tom Lane
2004-01-26 22:48:35 UTC
Post by Stefan Holzheu
Well, I have never used a packet sniffer, but if anyone tells me what
to capture, I'd be glad to.
Since we don't know where exactly in the table the error occurs, I think
you'll just have to capture the whole connection traffic between pg_dump
and the backend. You could help keep it to a reasonable size by doing a
selective dump of just one table, viz "pg_dump -t tablename dbname",
although I guess you might have to try a few times to reproduce the
error that way ...
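
(A sketch of such a capture, assuming a TCP connection on the default
port 5432 and one of the tables from the earlier errors:

	tcpdump -i lo -s 0 -w pgdump.pcap tcp port 5432 &
	pg_dump -h 127.0.0.1 -t aggr_stddev_hour DB > /dev/null
	kill %1

-s 0 makes tcpdump record whole packets rather than its small default
snaplen. Note that tcpdump sees only TCP traffic, so the dump has to be
forced over TCP with -h.)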

regards, tom lane

Stefan Holzheu
2004-02-25 07:58:03 UTC
Post by Stefan Holzheu
pg_dump: lost synchronization with server: got message type "4", length
858863113
pg_dump: SQL command to dump the contents of table "aggr_max_hour"
failed: PQendcopy() failed.
got message type "4", length 858863113
pg_dump: The command was: COPY messungen.aggr_max_hour (id, von, wert,
counts) TO stdout;
Post by Tom Lane
Post by Stefan Holzheu
However, the error occurs only on this kind of aggregation table. There
is a cron job keeping the tables up to date, running every 10 minutes.
The job performs deletes and inserts on the table. Could this somehow
block the dump process? Normally it should not, should it?
It's hard to see how another backend would have anything to do with
this, unless perhaps the error is dependent on a particular data value
that is sometimes present in the table and sometimes not. It looks to
me like either libpq or the backend is miscounting the number of data
bytes in the COPY datastream. Would it be possible for you to use a
packet sniffer to capture the communication between pg_dump and the
backend? If we could look at exactly what's going over the wire, it
would help to pin down the blame.
Well, I have never used a packet sniffer, but if anyone tells me what
to capture, I'd be glad to.
In order to capture the pg_dump traffic with tcpdump, we changed the
pg_dump command from
pg_dump DB > ...
to
pg_dump -h 127.0.0.2 DB > ...
Since then, the dumps have completed without any errors. Before, over
the local socket connection, pg_dump failed about 50% of the time.

Anyway, we can live with that solution. But I'd still be curious to
know what the problem with local connections is. Is there a way to
capture the traffic of local socket connections?


Stefan
--
-----------------------------
Dr. Stefan Holzheu
Tel.: 0921/55-5720
Fax.: 0921/55-5799
BITOeK Wiss. Sekretariat
Universitaet Bayreuth
D-95440 Bayreuth
-----------------------------

Tom Lane
2004-02-25 14:42:43 UTC
[ pg_dump shows data corruption over local socket but not TCP ]
That is really odd.
Post by Stefan Holzheu
Anyway, we can live with that solution. But I'd still be curious to
know what the problem with local connections is. Is there a way to
capture the traffic of local socket connections?
No, not that I know of. You could modify libpq to dump everything it
receives from the socket to a log file, and then compare the results
from local socket and TCP cases.
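
(A cruder check than patching libpq is to diff the finished dumps
themselves:

	pg_dump -t aggr_max_hour DB > via-socket.sql
	pg_dump -h 127.0.0.2 -t aggr_max_hour DB > via-tcp.sql
	diff via-socket.sql via-tcp.sql

though given the roughly 50% failure rate, the socket-side dump may
simply abort rather than produce comparable output.)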

The whole thing is bizarre though. AFAICS it must represent a kernel
bug. What is your platform exactly?

regards, tom lane

scott
2004-02-25 17:53:25 UTC
Take my name and e-mail out of your address list!!!! Wrong Scott!
Stefan Holzheu
2004-03-02 07:36:44 UTC
Post by Tom Lane
[ pg_dump shows data corruption over local socket but not TCP ]
That is really odd.
Post by Stefan Holzheu
Anyway, we can live with that solution. But I'd still be curious to
know what the problem with local connections is. Is there a way to
capture the traffic of local socket connections?
No, not that I know of. You could modify libpq to dump everything it
receives from the socket to a log file, and then compare the results
from local socket and TCP cases.
The whole thing is bizarre though. AFAICS it must represent a kernel
bug. What is your platform exactly?
It's RH 7.1 with the standard RPM kernel, 2.4.2-2. The special thing is
a proprietary Infortrend kernel module, rd124f, necessary to access an
IFT-2101 RAID.
Unfortunately, Infortrend does not supply a module for a more recent
Linux distribution. The machine was originally intended to run Oracle
on Windows...

I will check whether the error occurs on another machine with a
different kernel but an identical database and PostgreSQL version.

Regards, Stefan
--
-----------------------------
Dr. Stefan Holzheu
Tel.: 0921/55-5720
Fax.: 0921/55-5799
BITOeK Wiss. Sekretariat
Universitaet Bayreuth
D-95440 Bayreuth
-----------------------------
