8

I know I have to write SUM twice, if I wish to use it in a HAVING clause (or use a derived table otherwise):

SELECT  id,
  sum(hours) AS totalhours
  FROM mytable
  GROUP BY id
  HAVING sum(hours) > 50;

My question now is, whether or not this is suboptimal. As a programmer, this query looks like the DB will calculate the sum twice. Is that so, or should I rely on optimizations the DB engine will do for me?

Update: an explain of a comparable query:

postgres=> explain select sum(counttodo) from orderline group by orderlineid having sum(counttodo) > 100;
                             QUERY PLAN                             
--------------------------------------------------------------------
 HashAggregate  (cost=1.31..1.54 rows=18 width=8)
   Filter: (sum(counttodo) > 100)
   ->  Seq Scan on orderline  (cost=0.00..1.18 rows=18 width=8)
(3 rows)
David Aldridge
  • 2,074
  • 9
  • 13
  • can you post the explain? – Ste May 28 '13 at 14:11
  • I won't answer this as there is a nice, concise explanation [here](http://stackoverflow.com/questions/12034849/accessing-column-alias-in-postgresql#comment16067732_12034878): " In case you're wondering why you can't refer to the SELECT aliases later in the query, like in the WHERE clause, it's all to do with order of evaluation. SQL isn't evaluated in the order it's written. The SELECT-list is actually evaluated almost last, and its contents effectively don't exist yet when HAVING etc are processed. This becomes important when the select-list includes functions with side-effects [...]" – dezso May 28 '13 at 15:02
  • ... this is why you can't just reference the aggregated column in the `HAVING` clause - but, to my understanding, internally this is rather done the other way around. – dezso May 28 '13 at 15:05
  • @dezso: I understand why it is, I was just wondering how optimal it is. It just "feels" like the DB is calculating it twice (apparently it doesn't and I should let the DB engine worry about optimizations here). Also, it creates redundancy in code, which doesn't feel right. – Bart Friederichs May 28 '13 at 15:14
  • 2
    @BartFriederichs well, many people complain alongside these lines (I had complained as well until I did not get used to it...) It is not executed twice and probably could be done by using an alias in `HAVING` (and then pulling the column definition from the `SELECT` clause) - for some reason they just don't do that. – dezso May 28 '13 at 15:24
  • @dezso: You complaints should go to the **[ISO/IEC 9075 committee](http://www.iso.org/iso/iso_technical_committee.html?commid=45342)** – ypercubeᵀᴹ May 28 '13 at 15:46
  • @ypercube well, we have some differences from the standard already, don't we? :) – dezso May 28 '13 at 16:08
  • 3
    I think that letting the DB engine worry about optimisations ought to be second nature to an RDBMS practitioner. SQL is a 4GL, so we're defining the result set we want, not the means by which it is achieved. There are many other issues out there that we do not worry over for the most part -- join order, or transformation of EXISTS into a join for example. This particular issue is more a problem from the "Don't Repeat Yourself" perspective for complex expression, but sensible workarounds (in-line views, CTEs) can help with those. – David Aldridge May 28 '13 at 16:54

3 Answers3

3

The sum is only computed once.

I verified this using

create table mytable (id int, hours int);
insert into mytable values (1, 60);
select sum(hours) from mytable group by id having sum(hours) > 50;

and then used a debugger to check how many times int4_sum (the transition function behind the sum aggregate) was called: once.

Peter Eisentraut
  • 9,473
  • 1
  • 29
  • 33
0

Compare your query

explain
select sum(counttodo)
from orderline
group by orderlineid
having sum(counttodo) > 100

To this equivalent one an check in what they differ

explain
select *
from (
    select sum(counttodo) counttodo
    from orderline
    group by orderlineid
) s
where counttodo > 100
Clodoaldo
  • 1,035
  • 2
  • 7
  • 21
  • 1
    I can see what you're getting at but in its current form this doesn't make for a "good" answer. Post the explain for each with a little more commentary and you're good for some upvotes. – Mark Storey-Smith May 28 '13 at 21:32
0

You don't have to write SUM twice if you don't need to retrieve it; if you're only interested in the ids having a SUM(hours) > 50 then the following is perfectly valid:

SELECT id,
FROM mytable
GROUP BY id
HAVING sum(hours) > 50;
Colin 't Hart
  • 8,747
  • 15
  • 33
  • 40