dump/restore results in duplicate key violation with 7.4.6. Bug?

danda

2004-11-05 19:37:33 UTC

Hi, I am getting an error when attempting to perform a
pg_dump/pg_restore cycle (data only).

Source database is 7.4.3, Target is 7.4.6. Source database was
originally using 'SQL_ASCII' as the encoding, but all data is in UTF-8.
Target database uses 'UNICODE'. The table that is giving me problems
contains over 400,000 rows. (It holds category data from the dmoz
project. ) In both the source and target database, there is a unique
constraint on the topic key. Upon restore I am getting duplicate key
violations on 12 rows. All of the problem rows contain non latin1 data.

What I've tried:
==========

1) SQL Dump / Restore

source machine:
pg_dump -a -t category -U postgres dbname > cats.dump.sql

target machine:
psql
\i cats.dump.sql

which outputs:
\i /tmp/category.dump.sql
SET
SET
SET
SET
SET
UPDATE 1
SET
psql:/tmp/category.dump.sql:24: ERROR: duplicate key violates unique
constraint "category_topic_key"
CONTEXT: COPY category, line 133302: "1227568 503988
Top/Kids_and_Teens/International/Korean/ê²ì 0 0"
SET
UPDATE 1

( I also tried it using inserts instead of copy from, but with similar
results. )

2) Binary (custom) Dump / Restore
source machine:
pg_dump -F c -Z 8 -t category -U postgres dbname > category_dump.custom.gz

target machine:
pg_restore -d dbname -a category_dump.custom.gz

which churns for a while and then gives me this error:

pg_restore: ERROR: duplicate key violates unique constraint
"category_topic_key"
CONTEXT: COPY category, line 133302: "1227568 503988
Top/Kids_and_Teens/International/Korean/ê²ì 0 0"
pg_restore: [archiver (db)] error returned by PQendcopy

3) Remove unique constraints.

I then removed the unique constraint in the target database so I could
at least import the data. After that I was able to view exactly which
rows have been duplicated:

select sub.topic, sub.cnt from (select topic, count(*) as cnt from
category group by topic) sub where cnt > 1;
topic | cnt
-----------------------------------------------------------------------+-----
Top/Adult/World/Japanese/ãªã³ã©ã¤ã³ã·ã§ãã/ã°ããº | 2
Top/Adult/World/Japanese/ã¨ã³ã¿ã¼ãã¤ã³ã¡ã³ã/ã²ã¼ã /éçºå/ãè¡ | 10
Top/Adult/World/Korean/ëê±° | 4
Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸/ìì | 2
Top/Adult/World/Korean/ìì /ë¬¸í | 2
Top/Adult/World/Korean/ì±ì¸ì©í | 2
Top/Adult/World/Korean/ì±ì¸ì¬ì´í¸_X/ë§í | 2
Top/Kids_and_Teens/International/Japanese/ã²ã¼ã | 2
Top/Kids_and_Teens/International/Japanese/ã³ã³ãã¥ã¼ã¿ | 2
Top/Kids_and_Teens/International/Japanese/ã¨ã³ã¿ã¼ãã¤ã³ã¡ã³ã/ã©ã¸ãª | 4
Top/Kids_and_Teens/International/Japanese/è¶£å³ã»ã¹ãã¼ã/ããã¡ã | 2
Top/Kids_and_Teens/International/Korean/ìì | 5
(12 rows)

Performing this same query on the source database:

select sub.topic, sub.cnt from (select topic, count(*) as cnt from
category group by topic) sub where cnt > 1;
topic | cnt
-------+-----
(0 rows)

4) Attempted to identify category_id of duplicate rows:

select category_id from category where topic =
'Top/Adult/World/Korean/ëê±°';
category_id
-------------
(0 rows)

I believe this failed due to some sort of encoding or font problem
between xterm and psql and DB or even X clipboard. Note that the data
does display correctly when viewed in mozilla. Still it would be nice
to be able to copy/paste psql result string and use it as input and
actually find a match!

5) Manual inspection of one of the rows.

I chose the topic 'Top/Adult/World/Korean/ëê±°' to pursue further. I
executed the following query and looked for multiple instances of that
string. There should be 4 according to our duplicates query above.

select category_id, topic from category where topic like
'Top/Adult/World/Korean%';
category_id | topic
-------------+----------------------------------------------
328048 | Top/Adult/World/Korean/ë¹ëì¤,CD
381025 | Top/Adult/World/Korean/ë¹ëì¤,CD/ë°±ìCD
400131 | Top/Adult/World/Korean/ì±ì¸ì¬ì´í¸_X
400136 | Top/Adult/World/Korean/ì±ì¸ì¬ì´í¸_X/ë§í¬
400133 | Top/Adult/World/Korean/ì±ì¸ì¬ì´í¸_X/íìì
5830581 | Top/Adult/World/Korean/ë¯¸ëì´
5830906 | Top/Adult/World/Korean/ë¹ì¦ëì¤
589823 | Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸/ê¸°ê´,ë¨ì²´
324253 | Top/Adult/World/Korean
367742 | Top/Adult/World/Korean/ëê±°
378503 | Top/Adult/World/Korean/ìì
590650 | Top/Adult/World/Korean/ìì /ë§í
378504 | Top/Adult/World/Korean/ìì /ë¬¸í
590649 | Top/Adult/World/Korean/ìì /ì ëë©ì´ì
5828700 | Top/Adult/World/Korean/ì±ì¸ì¬ì´í¸_X/ë§í
5812536 | Top/Adult/World/Korean/ì±í
5832542 | Top/Adult/World/Korean/ë¹ì¦ëì¤/ì·¨ì,ì±ì©
364360 | Top/Adult/World/Korean/ê²ì´
324254 | Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸
592044 | Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸/ìì
5852704 | Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸/ë§í
406487 | Top/Adult/World/Korean/ì±ì¸_ì¬ì´í¸/ì±ì¸ìí
365940 | Top/Adult/World/Korean/ì±ì¸ì©í
(23 rows)

Yet I only see one row that matches the string exactly. It is the one
with category_id = 367742.

6) Attempt to import same data back into source database ( 7.4.3 )

Acting on the theory that this is possibly a new problem in 7.4.6, I
tried the following in the source DB, (still with SQL_ASCII encoding)
which worked just fine:

create table category_tmp as select * from category;

alter table category_tmp add constraint category_tmp_topic_key unique
(topic);
NOTICE: ALTER TABLE / ADD UNIQUE will create implicit index
"category_tmp_topic_key" for table "category_tmp"
ALTER TABLE

pg_dump -a -t category_tmp -U postgres dbname > category_tmp_dump.sql

psql

delete from category_tmp ;
DELETE 461153

\i category_tmp_dump.sql
SET
SET
SET
SET

7) Experiments with encoding

a) Used GNU recode to recode the sql dump file to UTF-8. Resulting file
was unchanged, meaning data was already UTF-8.

b) Updated the encoding in source database to 'UNICODE' to match target
database.

update pg_database set encoding = 6 where datname = 'dbname';

c) Re-imported the data back into the source database again (as in 6).
Worked fine again.

So at this point I am mostly at a loss. I would have thought that after
changing the source DB to UNICODE encoding it should exhibit the same
behavior as the target. I can think of two explanations:

1: initdb does something with the encoding beyond setting
pg_database(encoding).

2: there is a bug in 7.4.6 that does not exist in 7.4.3

I suppose the next step is to create a new DB in 7.4.3 using UNICODE and
attempt to import the data in the same manner. But right now I need a
break.

Dan Libby

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings