Discussion:
weired behavior... after pg_resetxlog-> dump->initdb-->reload.
(too old to reply)
Rajesh Kumar Mallah
2004-06-16 07:53:12 UTC
Permalink
I recovered my database by pg_resetxlog and then did a dump , initdb ,
reload.
one of the tables exhibited this phenomenon. I thought it was worth
mentioning
to the developers.

Regds
mallah.


rt2=# SELECT name from users group by name having count(*) > 1 limit 20;

+----------------------+
| name |
+----------------------+
| �˾ |
| ����163.com |
| �@hotmail.com |
| p� |
+----------------------+
(4 rows)

Time: 1961.199 ms
rt2=#
rt2=# SELECT name from users group by name having count(*) > 1 ;
+------+
| name |
+------+
+------+
(0 rows)

Time: 789.184 ms


---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
Tom Lane
2004-06-16 13:55:09 UTC
Permalink
Post by Rajesh Kumar Mallah
rt2=# SELECT name from users group by name having count(*) > 1 limit 20;
(4 rows)
rt2=# SELECT name from users group by name having count(*) > 1 ;
(0 rows)
Weird. What PG version is this? What does EXPLAIN show for each of
these queries (maybe EXPLAIN ANALYZE too)? Can you provide a
self-contained test case?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ***@postgresql.org)
Rajesh Kumar Mallah
2004-06-16 15:00:38 UTC
Permalink
Dear Tom,

Sorry i could not replicate that issue . But
here is something which was the original problem
and is replicable.


Regds
Mallah.



rt2=# CREATE UNIQUE INDEX users_1 on users_sample(name);
ERROR: could not create unique index
DETAIL: Table contains duplicated values.
rt2=#
rt2=# SELECT name from users_sample group by name having count(*) > 1
limit 20;
+------+
| name |
+------+
+------+
(0 rows)

Time: 766.725 ms
rt2=# SELECT version();
rt2=# SELECT name from users_sample group by name having count(*) > 1 ;
+------+
| name |
+------+
+------+
(0 rows)

Time: 905.586 ms
rt2=# SELECT version();
+----------------------------------------------------------------------------------------------------------+
|
version |
+----------------------------------------------------------------------------------------------------------+
| PostgreSQL 7.4.3 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.2.3
20030502 (Red Hat Linux 3.2.3-34) |
+----------------------------------------------------------------------------------------------------------+
(1 row)

Time: 338.191 ms
Post by Tom Lane
Post by Rajesh Kumar Mallah
rt2=# SELECT name from users group by name having count(*) > 1 limit 20;
(4 rows)
rt2=# SELECT name from users group by name having count(*) > 1 ;
(0 rows)
Weird. What PG version is this? What does EXPLAIN show for each of
these queries (maybe EXPLAIN ANALYZE too)? Can you provide a
self-contained test case?
regards, tom lane
Tom Lane
2004-06-16 15:35:10 UTC
Permalink
Post by Rajesh Kumar Mallah
here is something which was the original problem
and is replicable.
Hm. Does EXPLAIN show that the GROUP BY query is using hash
aggregation? Does its behavior change if you turn off enable_hashagg?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
Rajesh Kumar Mallah
2004-06-16 15:54:06 UTC
Permalink
Dear Tom,

It does.


rt2=# explain SELECT name from users_sample group by name having
count(*) > 1 ;
+-----------------------------------------------------------------------+
| QUERY PLAN |
+-----------------------------------------------------------------------+
| HashAggregate (cost=25.00..26.00 rows=200 width=78) |
| Filter: (count(*) > 1) |
| -> Seq Scan on users_sample (cost=0.00..20.00 rows=1000 width=78) |
+-----------------------------------------------------------------------+
(3 rows)

Time: 375.619 ms
rt2=# explain analyze SELECT name from users_sample group by name having
count(*) > 1 ;
+------------------------------------------------------------------------------------------------------------------------+
| QUERY
PLAN |
+------------------------------------------------------------------------------------------------------------------------+
| HashAggregate (cost=25.00..26.00 rows=200 width=78) (actual
time=567.981..567.981 rows=0 loops=1) |
| Filter: (count(*) >
1)
|
| -> Seq Scan on users_sample (cost=0.00..20.00 rows=1000 width=78)
(actual time=0.046..160.706 rows=100489 loops=1) |
| Total runtime: 577.212
ms
|
+------------------------------------------------------------------------------------------------------------------------+
(4 rows)
Post by Tom Lane
Post by Rajesh Kumar Mallah
here is something which was the original problem
and is replicable.
Hm. Does EXPLAIN show that the GROUP BY query is using hash
aggregation? Does its behavior change if you turn off enable_hashagg?
regards, tom lane
Tom Lane
2004-06-16 18:17:33 UTC
Permalink
Post by Rajesh Kumar Mallah
It does.
And?
Post by Rajesh Kumar Mallah
Post by Tom Lane
Hm. Does EXPLAIN show that the GROUP BY query is using hash
aggregation? Does its behavior change if you turn off enable_hashagg?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
Rajesh Kumar Mallah
2004-06-17 03:43:43 UTC
Permalink
Post by Tom Lane
Post by Rajesh Kumar Mallah
It does.
And?
Yep the problem of original posting could be replicated on
disabling hash aggregates. On disabling i could get the repeated rows.


Regds
Mallah.


rt2=# SET enable_hashagg TO off;
SET
Time: 329.533 ms
rt2=# explain analyze SELECT name from users_sample group by name having
count(*) > 1 ;
+------------------------------------------------------------------------------------------------------------------------------+
| QUERY
PLAN |
+------------------------------------------------------------------------------------------------------------------------------+
| GroupAggregate (cost=69.83..78.33 rows=200 width=78) (actual
time=2411.411..2749.733 rows=4 loops=1) |
| Filter: (count(*) >
1)
|
| -> Sort (cost=69.83..72.33 rows=1000 width=78) (actual
time=2411.300..2532.821 rows=100489 loops=1) |
| Sort Key:
name
|
| -> Seq Scan on users_sample (cost=0.00..20.00 rows=1000
width=78) (actual time=0.024..170.258 rows=100489 loops=1) |
| Total runtime: 2915.439
ms
|
+------------------------------------------------------------------------------------------------------------------------------+
(6 rows)

Time: 3265.102 ms
rt2=# SELECT name from users_sample group by name having count(*) > 1 ;
+----------------------+
| name |
+----------------------+
| ??? |
| ????163.com |
| ?@hotmail.com |
| p? |
+----------------------+
(4 rows)

Time: 3358.030 ms
rt2=# SET enable_hashagg TO on;
SET
Time: 330.148 ms
rt2=# SELECT name from users_sample group by name having count(*) > 1 ;
+------+
| name |
+------+
+------+
(0 rows)

Time: 752.395 ms
rt2=#
Post by Tom Lane
Post by Rajesh Kumar Mallah
Post by Tom Lane
Hm. Does EXPLAIN show that the GROUP BY query is using hash
aggregation? Does its behavior change if you turn off enable_hashagg?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
regards, tom lane
Tom Lane
2004-06-17 13:36:01 UTC
Permalink
Post by Rajesh Kumar Mallah
Yep the problem of original posting could be replicated on
disabling hash aggregates. On disabling i could get the repeated rows.
Okay. What I suspect is happening is that there are entries in the
column that are equal according to the datatype's comparison function,
but are not bitwise equal and therefore yield different hash codes.
This makes it a crapshoot whether they are seen to be equal or not
when hash aggregation is used. We identified a similar bug in the
inet/cidr datatypes just a few weeks ago.

What exactly is the datatype of the "name" column? If it's a text
type, what database encoding and locale settings (LC_COLLATE/LC_CTYPE)
are you using? Can you investigate exactly what's stored within each
of these groups of matching names?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
Rajesh Kumar Mallah
2004-06-18 05:16:32 UTC
Permalink
Post by Tom Lane
Post by Rajesh Kumar Mallah
Yep the problem of original posting could be replicated on
disabling hash aggregates. On disabling i could get the repeated rows.
Okay. What I suspect is happening is that there are entries in the
column that are equal according to the datatype's comparison function,
but are not bitwise equal and therefore yield different hash codes.
This makes it a crapshoot whether they are seen to be equal or not
when hash aggregation is used. We identified a similar bug in the
inet/cidr datatypes just a few weeks ago.
What exactly is the datatype of the "name" column?
name | character varying(120) | not null
Post by Tom Lane
If it's a text
type, what database encoding and locale settings (LC_COLLATE/LC_CTYPE)
are you using?
List of databases
+-----------------+----------+-----------+
| Name | Owner | Encoding |
+-----------------+----------+-----------+
| bric | postgres | UNICODE |

+--------------------------------+-----------------+
| name | setting |
+--------------------------------+-----------------+
| lc_collate | en_US.UTF-8 |
| lc_ctype | en_US.UTF-8 |
| lc_messages | en_US.iso885915 |
| lc_monetary | en_US.iso885915 |
| lc_numeric | en_US.iso885915 |
| lc_time | en_US.iso885915 |
Post by Tom Lane
Can you investigate exactly what's stored within each
of these groups of matching names?
Can you tell me how to do it please?
Post by Tom Lane
regards, tom lane
Regds
Mallah.

Loading...