Incremental clustering?

Discussion:

(too old to reply)

John Siracusa

2004-01-04 22:57:07 UTC

The docs say:

"Clustering is a one-time operation: when the table is subsequently updated,
the changes are not clustered. That is, no attempt is made to store new or
updated rows according to their index order. If one wishes, one can
periodically recluster by issuing the command again."

and

"During the cluster operation, a temporary copy of the table is created that
contains the table data in the index order. Temporary copies of each index
on the table are created as well. Therefore, you need free space on disk at
least equal to the sum of the table size and the index sizes."

I don't relish totally locking and making a temporary copy of my biggest
table, but that's exactly the table that seems to need clustering the most.
Will subsequent cluster command also make a complete copy? Some form of
"incremental clustering" would be nice...

-John

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ***@postgresql.org)

Christopher Browne

2004-01-04 23:24:53 UTC

Permalink

Post by John Siracusa
"Clustering is a one-time operation: when the table is subsequently
updated, the changes are not clustered. That is, no attempt is made
to store new or updated rows according to their index order. If one
wishes, one can periodically recluster by issuing the command
again."
and
"During the cluster operation, a temporary copy of the table is
created that contains the table data in the index order. Temporary
copies of each index on the table are created as well. Therefore,
you need free space on disk at least equal to the sum of the table
size and the index sizes."
I don't relish totally locking and making a temporary copy of my
biggest table, but that's exactly the table that seems to need
clustering the most. Will subsequent cluster command also make a
complete copy? Some form of "incremental clustering" would be
nice...

Unfortunately, making this work in a manner that allows concurrency is
likely to be really troublesome.

The cluster operation potentially has to reorder all the tuples, and
the fact that the table is already _partially_ organized only
diminishes the potential. If the new data, generally added "at the
end," has values that are fairly uniformly distributed across the
index, then the operation really will have to reorder all of the
tuples...

It would be pretty sweet to have a process analagous to 'non-blocking
VACUUM' as opposed to 'VACUUM FULL.' But there's no particularly easy
way, and, to do so, you'd essentially have to throw away a fair chunk
of the benefits of the "clustered" properties.

--
"cbbrowne","@","acm.org"
http://www.ntlug.org/~cbbrowne/rdbms.html
"When you have eliminated the impossible, whatever remains, however
improbable, must be the truth." -- Sir Arthur Conan Doyle (1859-1930),
English author. Sherlock Holmes, in The Sign of Four, ch. 6 (1889).
[...but see the Holmesian Fallacy, due to Bob Frankston...
<http://www.frankston.com/public/Essays/Holmesian%20Fallacy.asp>]

John Siracusa

2004-01-05 18:54:16 UTC

Permalink

Post by Christopher Browne
The cluster operation potentially has to reorder all the tuples, and
the fact that the table is already _partially_ organized only
diminishes the potential. If the new data, generally added "at the
end," has values that are fairly uniformly distributed across the
index, then the operation really will have to reorder all of the
tuples...

What about the special case of a table that is clustered on a column and all
subsequent inserts will add rows with ever-increasing values of that column?
This would be the case for creation dates or even a column created from a
sequence. Basically, after clustering, it would be nice if you could tell
the system to "only add to the end" and to "add in clustered order."

Programming for special cases is annoying, but sometimes it really helps.

-John

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Ray Ontko

2004-01-05 19:18:30 UTC

Permalink

John, et al,

We too have an interest in reclustering large tables, but in our
case most of the transactions are spread throughout the table
(though in some cases not uniformly).

I have been pondering a program that selects all the rows in
the table in cluster order and then, as a single transaction,
deletes a block-full of rows and then re-inserts them. The program
would then move on to the next block-full and repeat the operation.
Note that if there are triggers on the table, this may have unintended
side-effects.

This would require having a pretty clear idea of the space required
for each row, and would probably require frequent vacuums during
the process. If there were a way to tell the block address of each
row, I suppose you could leave some rows where they are. In the
end, you might end up with the same space requirements (a full
copy as workspace), but I'm not sure.

Depending on the data, it may be possible to add new rows to the
table while this process is going on. Any new rows added by other
processes will certainly not be in order, and may interfere with
new rows being added in a contiguous fashion (I'm not sure of the
allocation algorithm used by PG).

Thoughts? Comments?

Ray

Post by John Siracusa

What about the special case of a table that is clustered on a column and all
subsequent inserts will add rows with ever-increasing values of that column?
This would be the case for creation dates or even a column created from a
sequence. Basically, after clustering, it would be nice if you could tell
the system to "only add to the end" and to "add in clustered order."
Programming for special cases is annoying, but sometimes it really helps.
-John
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?
http://archives.postgresql.org

----------------------------------------------------------------------
Ray Ontko ***@ontko.com Phone 1.765.935.4283 Fax 1.765.962.9788
Ray Ontko & Co. Software Consulting Services http://www.ontko.com/

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match