Discussion:
evil characters #bfef cause dump failure
(too old to reply)
Christian Fowler
2004-11-15 17:46:19 UTC
Permalink
I have been trying to track down the source of why my 7.4.5 database won't
reimport it's own dump ( http://archives.postgresql.org/pgsql-admin/2004-10/msg00213.php )

After much wrestling, it appears the hex byte sequence #bfef in a VARCHAR
column causes a truncated COPY line to be written (and thus the *entire*
COPY block fails). Exporting as inserts did not fix the problem either.

Any thoughts on why this might be so or how it can be avoided? Evil
thought of the day is if someone were to go around and paste this
multi-byte character in various websites' html forms it could cause a lot
of trouble.

Also, the behavior of the restore / psql import to complete the COPY
fields from the *following* line seems not good. It would be nice if the
missing columns could just be written as NULL's. 6 bad rows makes a 6 gig
dump worthless. Or perhaps an option to import each copy row in it's own
transaction so 5+ million copied rows don't fail for 6 bogus ones. Perhaps a
--this_is_an_emergency_so_please_do_everything_you_can_to_restore_as_much_as_possible
option.

If any of the core dev's want some small debug dumps I created, I'd be
happy to pass them on.

[ \ /
[ >X< Christian Fowler | spider AT viovio.com
[ / \ http://www.viovio.com | http://www.tikipro.org

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html
Tom Lane
2004-11-15 19:44:49 UTC
Permalink
Post by Christian Fowler
After much wrestling, it appears the hex byte sequence #bfef in a VARCHAR
column causes a truncated COPY line to be written (and thus the *entire*
COPY block fails).
What database encoding and locale are you using?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
Christian Fowler
2004-11-15 20:29:38 UTC
Permalink
[shell]$ env
PGCLIENTENCODING=UNICODE
LANG=en_US.UTF-8

db=# \encoding
UNICODE
Post by Tom Lane
Post by Christian Fowler
After much wrestling, it appears the hex byte sequence #bfef in a VARCHAR
column causes a truncated COPY line to be written (and thus the *entire*
COPY block fails).
What database encoding and locale are you using?
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
[ \ /
[ >X< Christian Fowler | spider AT viovio.com
[ / \ http://www.viovio.com | http://www.tikipro.org

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
Tom Lane
2004-11-15 20:35:46 UTC
Permalink
Post by Christian Fowler
[shell]$ env
PGCLIENTENCODING=UNICODE
LANG=en_US.UTF-8
db=# \encoding
UNICODE
I was more concerned about the database encoding, which the above
doesn't prove. Try "SHOW server_encoding"

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
Christian Fowler
2004-11-15 20:44:20 UTC
Permalink
db-# ;
server_encoding
-----------------
SQL_ASCII
(1 row)

whoa! yikes, I bet this has a lot to do with it? I really wanted to keep
everything UNICODE end-to-end. I must have forgotten --encoding on my
initdb? Anything I can do at this point?
Post by Tom Lane
Post by Christian Fowler
[shell]$ env
PGCLIENTENCODING=UNICODE
LANG=en_US.UTF-8
db=# \encoding
UNICODE
I was more concerned about the database encoding, which the above
doesn't prove. Try "SHOW server_encoding"
regards, tom lane
[ \ /
[ >X< Christian Fowler | spider AT viovio.com
[ / \ http://www.viovio.com | http://www.tikipro.org

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
Tom Lane
2004-11-15 21:00:22 UTC
Permalink
Post by Tom Lane
server_encoding
-----------------
SQL_ASCII
whoa! yikes, I bet this has a lot to do with it? I really wanted to keep
everything UNICODE end-to-end. I must have forgotten --encoding on my
initdb? Anything I can do at this point?
Hmm ... the safe way would be dump-n-reload but that's not working for
you. What you can try is to alter the pg_database.encoding value for
that database, then start fresh backends (any existing ones won't notice
the change). Worst case if that doesn't make life good is to change it
back.

The real problem is that you've got invalid unicode data in the database
(I'm not an expert, but I think that #bf is a 1-byte UTF8 sequence and
then #ef starts a 3-byte sequence, so if this comes within 2 characters
of end-of-line that would explain your dump problem). You had better
fix the data first before trying to lock down the encoding. Once you
change the encoding, backend internal operations will start spitting up
on any stored bad data, whereas right now it's just passing it through
unchanged.

The safest way might be a dump-n-reload in any case, since reloading
into a fresh UNICODE database will catch bad data. If you try manual
repairs you're likely to miss some places :-(

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html
Markus Bertheau
2004-11-15 23:13:21 UTC
Permalink
Post by Tom Lane
The real problem is that you've got invalid unicode data in the database
(I'm not an expert, but I think that #bf is a 1-byte UTF8 sequence and
then #ef starts a 3-byte sequence, so if this comes within 2 characters
of end-of-line that would explain your dump problem).
FWIW, 1-byte UTF-8 sequences are always < 128. BF can only appear
inside, not at the beginning of, a UTF-8 byte sequence with more than 1
byte.

Compare

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

It has a table that gives anyone who can tell bits from bytes a quick
understanding of how the UTF-8 encoding works.
--
Markus Bertheau <***@bluetwanger.de>


---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
Iain
2004-11-16 01:17:54 UTC
Permalink
It seems that this kind of thing pops up from time to time. I don't have v8
available right now to check, but is SQL_ASCII still the default DB
encoding? I'm wondering is unicode wouldn't be a better choice these days.

regards
Iain
----- Original Message -----
From: "Tom Lane" <***@sss.pgh.pa.us>
To: "Christian Fowler" <***@viovio.com>
Cc: "pgsql-admin list" <pgsql-***@postgresql.org>
Sent: Tuesday, November 16, 2004 6:00 AM
Subject: Re: [ADMIN] evil characters #bfef cause dump failure
Post by Tom Lane
Post by Tom Lane
server_encoding
-----------------
SQL_ASCII
whoa! yikes, I bet this has a lot to do with it? I really wanted to keep
everything UNICODE end-to-end. I must have forgotten --encoding on my
initdb? Anything I can do at this point?
Hmm ... the safe way would be dump-n-reload but that's not working for
you. What you can try is to alter the pg_database.encoding value for
that database, then start fresh backends (any existing ones won't notice
the change). Worst case if that doesn't make life good is to change it
back.
The real problem is that you've got invalid unicode data in the database
(I'm not an expert, but I think that #bf is a 1-byte UTF8 sequence and
then #ef starts a 3-byte sequence, so if this comes within 2 characters
of end-of-line that would explain your dump problem). You had better
fix the data first before trying to lock down the encoding. Once you
change the encoding, backend internal operations will start spitting up
on any stored bad data, whereas right now it's just passing it through
unchanged.
The safest way might be a dump-n-reload in any case, since reloading
into a fresh UNICODE database will catch bad data. If you try manual
repairs you're likely to miss some places :-(
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faqs/FAQ.html
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match
Tom Lane
2004-11-16 01:34:51 UTC
Permalink
Post by Iain
It seems that this kind of thing pops up from time to time. I don't have v8
available right now to check, but is SQL_ASCII still the default DB
encoding? I'm wondering is unicode wouldn't be a better choice these days.
IIRC you can select the default encoding at build time, so this is
really a question for packagers not the development team.

You make a good point though --- I'm a bit tempted to make it default to
UNICODE for the Red Hat build, since Red Hat is pretty gung-ho on UTF8
support these days.

BTW, SQL_ASCII is not so much an encoding as the absence of any encoding
choice; it just passes 8-bit data with no interpretation. So it's not
*that* unreasonable a default. You can store UTF8 data in it without
any problem, you just won't have the niceties like detection of bad
character sequences.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html
Peter Eisentraut
2004-11-16 07:18:53 UTC
Permalink
Post by Tom Lane
You make a good point though --- I'm a bit tempted to make it default
to UNICODE for the Red Hat build, since Red Hat is pretty gung-ho on
UTF8 support these days.
Recall that in 8.0 the default encoding will be derived from the locale.
So if the postgres account has a reasonable locale set (presumably
chosen somewhere during the system installation), everything will work
out.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/


---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org
Iain
2004-11-16 08:45:09 UTC
Permalink
Hi,
Post by Peter Eisentraut
Recall that in 8.0 the default encoding will be derived from the locale.
So if the postgres account has a reasonable locale set (presumably
chosen somewhere during the system installation), everything will work
out.
That's seems pretty reasonable, though I think that standardizing on unicode
(and I guess that means UTF-8) is really the way to go. It was designed as
the universal standard after all.

Thanks for the feedback,
Iain









---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org
Peter Eisentraut
2004-11-16 09:43:52 UTC
Permalink
Post by Iain
That's seems pretty reasonable, though I think that standardizing on
unicode (and I guess that means UTF-8) is really the way to go. It was
designed as the universal standard after all.
It may have been designed that way, but it is a failure in practice.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ***@postgresql.org
Iain
2004-11-16 10:27:22 UTC
Permalink
Hi Peter,

I know that the unicode standards are far from perfect, but I'm wondering
why you consider it a failure.

Is it technical, or just an acceptance thing?

From my personal perspective, I never had any interest in such things as
encodings or internationlization until I started working in Japan, then I
realized what a nightmare it is. I expect you can imagine, but most people
(like me a year ago) couldn't. If everyone was already using unicode, I
don't think we'd have anything to worry about.

regards
Iain
----- Original Message -----
From: "Peter Eisentraut" <***@gmx.net>
To: "Iain" <***@mst.co.jp>
Cc: "Tom Lane" <***@sss.pgh.pa.us>; "pgsql-admin list"
<pgsql-***@postgresql.org>
Sent: Tuesday, November 16, 2004 6:43 PM
Subject: Re: [ADMIN] evil characters #bfef cause dump failure
Post by Peter Eisentraut
Post by Iain
That's seems pretty reasonable, though I think that standardizing on
unicode (and I guess that means UTF-8) is really the way to go. It was
designed as the universal standard after all.
It may have been designed that way, but it is a failure in practice.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
---------------------------(end of broadcast)---------------------------
---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
Peter Eisentraut
2004-11-16 12:08:18 UTC
Permalink
Post by Iain
I know that the unicode standards are far from perfect, but I'm wondering
why you consider it a failure.
If it were a success, then everyone would be using it and we wouldn't have
this discussion. Certainly, Unicode is reasonable and the best option in
many cases. But the original idea of replacing all other character sets
won't really happen anytime soon. Since you work in Japan, I would think you
know about the problems, since much of the technical opposition comes from
there.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
Markus Bertheau
2004-11-16 08:37:01 UTC
Permalink
Post by Tom Lane
Post by Iain
It seems that this kind of thing pops up from time to time. I don't have v8
available right now to check, but is SQL_ASCII still the default DB
encoding? I'm wondering is unicode wouldn't be a better choice these days.
IIRC you can select the default encoding at build time, so this is
really a question for packagers not the development team.
You make a good point though --- I'm a bit tempted to make it default to
UNICODE for the Red Hat build, since Red Hat is pretty gung-ho on UTF8
support these days.
BTW, SQL_ASCII is not so much an encoding as the absence of any encoding
choice; it just passes 8-bit data with no interpretation. So it's not
*that* unreasonable a default. You can store UTF8 data in it without
any problem, you just won't have the niceties like detection of bad
character sequences.
This is, by the way, a reason why this encoding should be renamed to
SQL_8BIT (or something along these lines) and UNICODE to UTF-8.
--
Markus Bertheau <***@bluetwanger.de>


---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
Christian Fowler
2004-11-16 16:06:50 UTC
Permalink
I strongly agree with this. I have always been uncomfortable selecting
"UNICODE" and never quite sure if it is the UTF8, UTF16, or UTF32
encoding.

SQL_8BIT or SQL_RAW make much more sense than SQL_ASCII given that Tom
said this is a lack of encoding. I fear I might have high-bits chopped off
or something.

However, back to my problem... if a #bfef character is shoved into a
VARCHAR, one's dump is hosed. If I went to various websites and entered
this in, I could cause a lot of pain. I believe I noticed some characters
(like new line and tab) are converted to <80> or similar. Could/should
this be extended to more character ranges - particularly high byte chars
for people with the SQL_ASCII (lackof) encoding?
Post by Markus Bertheau
This is, by the way, a reason why this encoding should be renamed to
SQL_8BIT (or something along these lines) and UNICODE to UTF-8.
--
---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
[ \ /
[ >X< Christian Fowler | spider AT viovio.com
[ / \ http://www.viovio.com | http://www.tikipro.org

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Loading...