3

I have some tables which benefit greatly from CLUSTER ON/CLUSTER USING in Postgres SQL:

# CLUSTER table USING index_name;
# ANALYZE VERBOSE table;
# CLUSTER VERBOSE;

A maintenance task periodically runs CLUSTER VERBOSE to keep things fresh. But is there a test I can run to see how fragmented the table is, prior to running CLUSTER VERBOSE? Maybe something like:

# CLUSTER ANALYZE
table 40000 records. 4000 observed clusters, 5000 potential clusters (20% fragmentation)

Note that I use CLUSTER so data accessed at the same time is "defragmented" into a small number of disk blocks. For example I have thousands of attributes that go with each page. a CLUSTER page_attribute USING page_id; puts all the attributes next to each other, greatly reducing disk load.

dezso
  • 28,436
  • 12
  • 89
  • 132
Bryce
  • 135
  • 6
  • I edited back your title as `psql` is a client for PostgreSQL, it has no tables at all. – dezso Jun 01 '14 at 07:29

1 Answers1

6

I am not entirely sure how much this information helps, but the system table pg_stats contains a correlation column:

select schemaname,tablename,attname,correlation 
from pg_stats 
where schemaname='public' 
order by correlation;

From the manual

Statistical correlation between physical row ordering and logical ordering of the column values. This ranges from -1 to +1. When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it is near zero, due to reduction of random access to the disk. (This column is null if the column data type does not have a < operator.)

This value is on a per-column basis, so my guess is, that if the correlation for the first index column is near zero, the table is "un-clustered", if near -1 or 1 it is closer to the order of the index and thus "more clustered". This if course can only work for B-Tree indexes that are defined as ASC (I guess).

a_horse_with_no_name
  • 69,148
  • 13
  • 135
  • 173
  • It's an excellent thought, however... say I have two indexes: one by page the other by author. I may cluster on either one depending on my load. There are different answers for how clustered my table is by author compared to how clustered it is by page. I experimented with CLUSTER and found no difference in the stats for correlation. – Bryce Jul 30 '13 at 20:01
  • I used ``select schemaname,tablename,attname,correlation from pg_stats where schemaname='public' order by correlation;`` on ``PostgreSQL 8.4.14`` – Bryce Jul 30 '13 at 20:06
  • Turns out one must ANALYZE even after the CLUSTER. I have added an example to the @a_horse_with_no_name answer. DESC columns have the opposite correlation. – Bryce Jul 30 '13 at 20:25