3

I am modelling an application where events for factories can be registered. There are usually a small amount of factories (around 500), but each factory can register a large set of events (millions). It's impossible that two events overlap for the same factory.

My idea was to introduce an exclusion constraint based on a GIST index:

CREATE TABLE event
(
   identifier SERIAL NOT NULL PRIMARY KEY
   factory_identifier INTEGER NOT NULL REFERENCES factory,
   period TSTZRANGE NOT NULL,

   EXCLUDE USING gist
   (
      factory_identifier WITH =,
      period             WITH &&
   )
);

In the documentation I read:

A GiST index will be relatively ineffective if its first column has only a few distinct values, even if there are many distinct values in additional columns.

I am not sure how to interpret this. Given that there are only a small amount of factories but a large number of events, should I define the exclusions constraint as following instead?

EXCLUDE USING gist
(
   period             WITH &&
   factory_identifier WITH =,
)

Now, the period is the first column, for which there are a lot of distinct values (almost all values are unique, actually), in contrast to factory_identifier, for which there only a few distinct values.

Is this better now? In short, I have difficulties to grasp the quoted documentation above.


additional

In the slides on a presentation of exclusion constraints on page 16 and a blog post, both coming from the original author of exclusion constraints, I found the following example:

EXCLUDE USING GIST(room WITH =, during WITH &&)

Given the (reasonable) assumption that there much more distinct reservation periods than distinct rooms, I wonder if this should have been (during WITH &&, room WITH =) instead, given the quote regarding GIST column ordering above.

This makes me believe that I am not really understanding the quoted documented above.

ItIsJustMe
  • 33
  • 3
  • I didn't remember this feature. Thanks for reminding me about it. Your question boils down to a performance dilemma. I think that the only way to decide which column ordering is better is to run pgbench on both scenarios. Could well be that the author of the slides chose a non optimal column ordering, because it's somehow more easy to comprehend. If you're going to run a benchmark, I'd love to learn what you found out. I'd also test the effect on unique validation when having an additional index, only on the `period` column. – Jonathan Jacobson Aug 22 '20 at 10:09
  • 1
    Ideally, you should try it both ways and see. I too think that that sentence of the docs is not particularly helpful, although I don't immediately have an improvement to offer, other than perhaps just deleting it. – jjanes Aug 22 '20 at 19:33
  • I plan to do some benchmarking with real data, for sure. My main purpose of this question was understanding the remark in the docs and getting a better grasp on how to reason about column ordering in a GIST index. – ItIsJustMe Aug 22 '20 at 20:50

0 Answers0