Discussion:
dump/restore results in duplicate key violation with 7.4.6. Bug?
(too old to reply)
danda
2004-11-05 19:37:33 UTC
Permalink
Hi, I am getting an error when attempting to perform a
pg_dump/pg_restore cycle (data only).

Source database is 7.4.3, Target is 7.4.6. Source database was
originally using 'SQL_ASCII' as the encoding, but all data is in UTF-8.
Target database uses 'UNICODE'. The table that is giving me problems
contains over 400,000 rows. (It holds category data from the dmoz
project. ) In both the source and target database, there is a unique
constraint on the topic key. Upon restore I am getting duplicate key
violations on 12 rows. All of the problem rows contain non latin1 data.

What I've tried:
==========

1) SQL Dump / Restore

source machine:
pg_dump -a -t category -U postgres dbname > cats.dump.sql

target machine:
psql
\i cats.dump.sql

which outputs:
\i /tmp/category.dump.sql
SET
SET
SET
SET
SET
UPDATE 1
SET
psql:/tmp/category.dump.sql:24: ERROR: duplicate key violates unique
constraint "category_topic_key"
CONTEXT: COPY category, line 133302: "1227568 503988
Top/Kids_and_Teens/International/Korean/ê²ì 0 0"
SET
UPDATE 1

( I also tried it using inserts instead of copy from, but with similar
results. )



2) Binary (custom) Dump / Restore
source machine:
pg_dump -F c -Z 8 -t category -U postgres dbname > category_dump.custom.gz

target machine:
pg_restore -d dbname -a category_dump.custom.gz

which churns for a while and then gives me this error:

pg_restore: ERROR: duplicate key violates unique constraint
"category_topic_key"
CONTEXT: COPY category, line 133302: "1227568 503988
Top/Kids_and_Teens/International/Korean/ê²ì 0 0"
pg_restore: [archiver (db)] error returned by PQendcopy


3) Remove unique constraints.

I then removed the unique constraint in the target database so I could
at least import the data. After that I was able to view exactly which
rows have been duplicated:

select sub.topic, sub.cnt from (select topic, count(*) as cnt from
category group by topic) sub where cnt > 1;
topic | cnt
-----------------------------------------------------------------------+-----
Top/Adult/World/Japanese/ãªã³ã©ã¤ã³ã·ã§ãã/ã°ã㺠| 2
Top/Adult/World/Japanese/ã¨ã³ã¿ã¼ãã¤ã³ã¡ã³ã/ã²ã¼ã /éçºå/ãè¡ | 10
Top/Adult/World/Korean/ëê±° | 4
Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸/ìì | 2
Top/Adult/World/Korean/ìì /문í | 2
Top/Adult/World/Korean/ì±ì¸ì©í | 2
Top/Adult/World/Korean/ì±ì¸ì¬ì´í¸_X/ë§í | 2
Top/Kids_and_Teens/International/Japanese/ã²ã¼ã | 2
Top/Kids_and_Teens/International/Japanese/ã³ã³ãã¥ã¼ã¿ | 2
Top/Kids_and_Teens/International/Japanese/ã¨ã³ã¿ã¼ãã¤ã³ã¡ã³ã/ã©ã¸ãª | 4
Top/Kids_and_Teens/International/Japanese/趣å³ã»ã¹ãã¼ã/ããã¡ã | 2
Top/Kids_and_Teens/International/Korean/ìì | 5
(12 rows)

Performing this same query on the source database:

select sub.topic, sub.cnt from (select topic, count(*) as cnt from
category group by topic) sub where cnt > 1;
topic | cnt
-------+-----
(0 rows)


4) Attempted to identify category_id of duplicate rows:

select category_id from category where topic =
'Top/Adult/World/Korean/ëê±°';
category_id
-------------
(0 rows)

I believe this failed due to some sort of encoding or font problem
between xterm and psql and DB or even X clipboard. Note that the data
does display correctly when viewed in mozilla. Still it would be nice
to be able to copy/paste psql result string and use it as input and
actually find a match!

5) Manual inspection of one of the rows.

I chose the topic 'Top/Adult/World/Korean/ëê±°' to pursue further. I
executed the following query and looked for multiple instances of that
string. There should be 4 according to our duplicates query above.

select category_id, topic from category where topic like
'Top/Adult/World/Korean%';
category_id | topic
-------------+----------------------------------------------
328048 | Top/Adult/World/Korean/ë¹ëì¤,CD
381025 | Top/Adult/World/Korean/ë¹ëì¤,CD/ë°±ìCD
400131 | Top/Adult/World/Korean/ì±ì¸ì¬ì´í¸_X
400136 | Top/Adult/World/Korean/ì±ì¸ì¬ì´í¸_X/ë§í¬
400133 | Top/Adult/World/Korean/ì±ì¸ì¬ì´í¸_X/íìì
5830581 | Top/Adult/World/Korean/미ëì´
5830906 | Top/Adult/World/Korean/ë¹ì¦ëì¤
589823 | Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸/기ê´,ë¨ì²´
324253 | Top/Adult/World/Korean
367742 | Top/Adult/World/Korean/ëê±°
378503 | Top/Adult/World/Korean/ìì
590650 | Top/Adult/World/Korean/ìì /ë§í
378504 | Top/Adult/World/Korean/ìì /문í
590649 | Top/Adult/World/Korean/ìì /ì ëë©ì´ì
5828700 | Top/Adult/World/Korean/ì±ì¸ì¬ì´í¸_X/ë§í
5812536 | Top/Adult/World/Korean/ì±í
5832542 | Top/Adult/World/Korean/ë¹ì¦ëì¤/ì·¨ì,ì±ì©
364360 | Top/Adult/World/Korean/ê²ì´
324254 | Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸
592044 | Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸/ìì
5852704 | Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸/ë§í
406487 | Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸/ì±ì¸ìí
365940 | Top/Adult/World/Korean/ì±ì¸ì©í
(23 rows)

Yet I only see one row that matches the string exactly. It is the one
with category_id = 367742.


6) Attempt to import same data back into source database ( 7.4.3 )

Acting on the theory that this is possibly a new problem in 7.4.6, I
tried the following in the source DB, (still with SQL_ASCII encoding)
which worked just fine:

create table category_tmp as select * from category;

alter table category_tmp add constraint category_tmp_topic_key unique
(topic);
NOTICE: ALTER TABLE / ADD UNIQUE will create implicit index
"category_tmp_topic_key" for table "category_tmp"
ALTER TABLE

pg_dump -a -t category_tmp -U postgres dbname > category_tmp_dump.sql

psql

delete from category_tmp ;
DELETE 461153

\i category_tmp_dump.sql
SET
SET
SET
SET


7) Experiments with encoding

a) Used GNU recode to recode the sql dump file to UTF-8. Resulting file
was unchanged, meaning data was already UTF-8.

b) Updated the encoding in source database to 'UNICODE' to match target
database.

update pg_database set encoding = 6 where datname = 'dbname';

c) Re-imported the data back into the source database again (as in 6).
Worked fine again.



So at this point I am mostly at a loss. I would have thought that after
changing the source DB to UNICODE encoding it should exhibit the same
behavior as the target. I can think of two explanations:

1: initdb does something with the encoding beyond setting
pg_database(encoding).

2: there is a bug in 7.4.6 that does not exist in 7.4.3

I suppose the next step is to create a new DB in 7.4.3 using UNICODE and
attempt to import the data in the same manner. But right now I need a
break.


Dan Libby



---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
Dan Libby
2004-11-05 20:37:54 UTC
Permalink
Some more info...
Post by danda
So at this point I am mostly at a loss. I would have thought that
after changing the source DB to UNICODE encoding it should exhibit the
1: initdb does something with the encoding beyond setting
pg_database(encoding).
2: there is a bug in 7.4.6 that does not exist in 7.4.3
I suppose the next step is to create a new DB in 7.4.3 using UNICODE
and attempt to import the data in the same manner. But right now I
need a break.
Okay, I tried the following test using the same input file on both
machines. All steps identical.

1) create database test_enc with encoding = 'UNICODE';

2) Create the schema for the 'category' table.

3) \i category_dump.sql

It works on 7.4.3, but I get the duplicate key violation on 7.4.6.

7.4.3 is running on gentoo and was built with emerge postgresql.

7.4.6 is running on Redhat 9 and was built with
./configure --prefix=/var/lib/pgsql --with-python --with-perl


I can probably come up with a clean test case using simplified data.
I'm still hoping someone has the magical solution though...

Dan Libby

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org
Tom Lane
2004-11-14 00:12:28 UTC
Permalink
1) I don't understand why a difference in locale should cause a
duplicate key error, especially when both databases were created using
'UNICODE' encoding. Is this valid behavior or a postgres bug?
The past reports I've seen of such misbehavior appeared to result
from strcoll() failing to cope well with multibyte sequences that
were illegal according to the selected locale's idea of the character
set in use. You can get burnt by this quite easily if you set the
database encoding to something not compatible with the locale.
If you didn't make that mistake, then I'd bet on Postgres having a
more liberal idea of what are valid characters in the encoding than
the locale definition does. Whether that is Postgres' bug or the
locale definition's is impossible to say without more data.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Loading...