‘select distinct’ vs. ‘group by’ revisited

Seeing ‘select distinct’ in a query still makes me cringe.  I used to use ‘select distinct’ many years ago as a way to remove duplicate records in my result sets and time and experience taught me that this was not a good practice and what was needed was a better understanding of grouping and a more detailed knowledge of the underlying data itself.

I still see ‘select distinct’ in code these days and wanted go back to what made me think that its use was such a bad idea.  To that end, I put the question to my colleagues at the Accenture Enkitec Group who happen to be some of the most talented people in the world working with Oracle products.  I hope to try to relay some of this wisdom in the following points.

CJ Date, the famous relational database theorist, advocates that SQL’s default ‘select’ behavior should be ‘distinct’. A ‘select’ statement that can return more than one copy of the same tuple is, by definition, not relational. Such behavior turns SQL into a bag processing language instead of a set processing language, which has lots of implications on program complexity and correctness. His workaround advice for Oracle and other SQL languages that behave this way is to always use ‘distinct’.

Some ad-hoc SQL generators like IBM’s Cognos default to include DISTINCT on every SQL  generated, likely because it removes duplicates. Developers love this since they can get lazy with join predicates. If the SQL performs poorly, have the DBA to figure out why, or get more hardware.  Based in this particular reasoning, I am starting to remember why I got on the no ‘select distinct’ band wagon.

Using GROUP BY or DISTINCT as a shortcut to get a result set ordered was a widely used practice, even in products like Oracle EBS. That is why EBS requires the use of the hidden parameter _gby_hash_aggregation_enabled to disable the use of HASH for GROUP BY or DISTINCT. I guess that is cheaper than fixing all the code. People Soft also recommends disabling more efficient HASH algorithms for GROUP BY and DISTINCT since some of their code also uses DISTINCT and GROUP BY expecting a particular order and without using ORDER BY.

Oracle has a new function called APPROX_COUNT_DISTINCT introduced in Oracle version 12c (https://docs.oracle.com/database/121/SQLRF/functions013.htm#SQLRF56900).  This function quickly gathers the number of distinct values in a target column and should be used in place of a ‘select count(distinct col1)’ statement.  As always, test the results and performance before introducing this new function in place of every ‘distinct’ clause.

Still, seeing ‘select distinct’ should raise a red flag when reviewing code as it can indicate bad coding practice.  There are few situations where it is required in ALL select statements.  In these environments, you should be wary of the use of UNION and UNION ALL as well.

Based on these and many other opinions on both sides – pro and con, I will summarize by saying that I disagree with the universal use of ‘select distinct’ in all queries, but vow to not let seeing this clause cause me to bias my thinking that all the particular code is poorly written and poorly performing.  In my role, it is best to evaluate each statement on its own merits.  Don’t get me started about nested views, though.  That is a subject for another blog.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s