Mis-interpreted extended character

Discussion:

(too old to reply)

Andrew Biagioni

2003-12-13 15:18:18 UTC

Our database ( (PostgreSQL) 7.3.5 ) uses Unicode encoding:

egreek=# \l
List of databases
Name | Owner | Encoding
-------------+----------+----------
egreek | andrew | UNICODE

For some reason, If I try to use an extended character (ASCII code >
127) in a string, I get this peculiar result:

egreek=# SELECT LOWER ( 'Niño' );
egreek'#

(note: it fails to see the closing single-quote character; I hit
CTRL-C to quit the operation)

and:

egreek=# SELECT LOWER ( 'Niño ' );
ERROR: Unicode >= 0x10000 is not supported

(note: I added a space before the closing quote).

It looks to me like the 'ñ' character (spanish n with tilde - ASCII 164,
as far as I can tell) is messing with the parser somehow.

Furthermore -- this is data coming from a form in a web site used
globally, so I have little control over what is passed to me (me =
PostgreSQL DB). I can't (without lotsa hassles) "demand" that
higher-ASCII characters be encoded on the way in, and decoded on the way
out.

Any suggestions/info/hints?

Thanks,

Andrew

Chris

2003-12-13 15:38:46 UTC

Permalink

Post by Andrew Biagioni
[...]
For some reason, If I try to use an extended character (ASCII code >
[...]

Probably your terminal is set to ISO-8859-1 ("latin 1") or something
like that, while your database is set to unicode as you showed.

Hence the mismatch. In unicode (for example UTF-8) non-US-ASCII
characters are encoded with two bytes (as opposed to one byte > 127
as happens with ISO-8859-1). Solution is to have everything agree on the
encoding. Terminal + DB or Web Browser + DB.

Btw. you _do_ actually have an influence on what encoding a web browser
uses by setting the "encoding" HTTP header.

According to my experience, if you have to deal with only western
european encodings, you're better off (still) with ISO-8859-1 (or
ISO-8859-15 to have the EUR symbol too).

Short answer: not PostgreSQL's fault.

Bye, Chris.

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Andrew Biagioni

2003-12-16 01:25:37 UTC

Permalink

Chris,

Post by Chris

Post by Andrew Biagioni
[...]
For some reason, If I try to use an extended character (ASCII code >
[...]

Probably your terminal is set to ISO-8859-1 ("latin 1") or something
like that, while your database is set to unicode as you showed.
Hence the mismatch. In unicode (for example UTF-8) non-US-ASCII
characters are encoded with two bytes (as opposed to one byte > 127
as happens with ISO-8859-1). Solution is to have everything agree on the
encoding. Terminal + DB or Web Browser + DB.
Btw. you _do_ actually have an influence on what encoding a web browser
uses by setting the "encoding" HTTP header.
According to my experience, if you have to deal with only western
european encodings, you're better off (still) with ISO-8859-1 (or
ISO-8859-15 to have the EUR symbol too).
Short answer: not PostgreSQL's fault.

Agreed -- and thanks for the above info.

It ALSO turns out that Java has its own issues with >127 characters,
which I'm going to look into -- but it was nice to prove you right (and
solve half my problem!) by setting the web page encoding...

Post by Chris
Bye, Chris.
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Thanks again,

Andrew