Nested views

I recently blogged on the use of ‘select distinct’ as a lazy coding practice to remove duplicate rows from a result set.  I solicited opinions from my knowledgeable coworkers and was able to modify my thinking a bit on the subject.  However, nested views are a problem that I think we can all agree on.

Just last week I received a report concerning a query that had been running at an acceptable 39 seconds the day before, but now was taking over two minutes.  A cursory examination of the explain plan displayed multiple nested loops on the same handful of tables.  I relayed this information to the developer reporting the issue, but was told over and over that it ran fine yesterday.  However, I did not think that 39 seconds was all that great.  I was not able to focus on it in greater depth at the time due to other performance issues I was working, all of which were caused by an underlying storage tier issue.  Once that was resolved, all involved systems returned to normal.

The next day, I circled back around on the report query and discovered the reason for the multiple nested loops.  This is the response I sent to the developer with redacted table names for security reasons:

“After yesterday’s crisis had passed, I was able to further review the report query and have enclosed the object mapping notes with the following observations.  Nested views are used in this query.  Nested views refer to views that include other views in their makeup.  The problem with these types of views is that tables in one or more of the child views can sometimes be found in the parent views.  This causes the same tables to be revisited over and over for similar data.  Even if the tables where accessed for different data on subsequent revisits, the same data could have been retrieved during previous visits.  Referring to the enclosed notes, some of the tables being visited numerous times include TAB1, TAB2, TAB3, and TAB4.  Care should be taken when using views that contain other views so that table revisits or nested loops can be avoided.  Even though the query’s performance is acceptable at this time, increased volume of information may not allow this query to continue to perform at acceptable levels.  Also,  two of the subsequent views use one or more UNION statements on the same objects.  The use of UNION more than once may indicate another condition of unnecessary table revisits which could also negatively impact query performance.  Thanks.”

The problem with nested views lies not only with the use of nested views, but with their creation as well.  In this regard, the responsible DBA should resist the creation of views composed of other views unless determined that the same tables are not in the makeup of more than one view.

The way to avoid nested views in my opinion is to select from tables first, then from views and to avoid the joining of views unless it has been determined that they do not contain the same tables.

‘select distinct’ vs. ‘group by’ revisited

Seeing ‘select distinct’ in a query still makes me cringe.  I used to use ‘select distinct’ many years ago as a way to remove duplicate records in my result sets and time and experience taught me that this was not a good practice and what was needed was a better understanding of grouping and a more detailed knowledge of the underlying data itself.

I still see ‘select distinct’ in code these days and wanted go back to what made me think that its use was such a bad idea.  To that end, I put the question to my colleagues at the Accenture Enkitec Group who happen to be some of the most talented people in the world working with Oracle products.  I hope to try to relay some of this wisdom in the following points.

CJ Date, the famous relational database theorist, advocates that SQL’s default ‘select’ behavior should be ‘distinct’. A ‘select’ statement that can return more than one copy of the same tuple is, by definition, not relational. Such behavior turns SQL into a bag processing language instead of a set processing language, which has lots of implications on program complexity and correctness. His workaround advice for Oracle and other SQL languages that behave this way is to always use ‘distinct’.

Some ad-hoc SQL generators like IBM’s Cognos default to include DISTINCT on every SQL  generated, likely because it removes duplicates. Developers love this since they can get lazy with join predicates. If the SQL performs poorly, have the DBA to figure out why, or get more hardware.  Based in this particular reasoning, I am starting to remember why I got on the no ‘select distinct’ band wagon.

Using GROUP BY or DISTINCT as a shortcut to get a result set ordered was a widely used practice, even in products like Oracle EBS. That is why EBS requires the use of the hidden parameter _gby_hash_aggregation_enabled to disable the use of HASH for GROUP BY or DISTINCT. I guess that is cheaper than fixing all the code. People Soft also recommends disabling more efficient HASH algorithms for GROUP BY and DISTINCT since some of their code also uses DISTINCT and GROUP BY expecting a particular order and without using ORDER BY.

Oracle has a new function called APPROX_COUNT_DISTINCT introduced in Oracle version 12c (https://docs.oracle.com/database/121/SQLRF/functions013.htm#SQLRF56900).  This function quickly gathers the number of distinct values in a target column and should be used in place of a ‘select count(distinct col1)’ statement.  As always, test the results and performance before introducing this new function in place of every ‘distinct’ clause.

Still, seeing ‘select distinct’ should raise a red flag when reviewing code as it can indicate bad coding practice.  There are few situations where it is required in ALL select statements.  In these environments, you should be wary of the use of UNION and UNION ALL as well.

Based on these and many other opinions on both sides – pro and con, I will summarize by saying that I disagree with the universal use of ‘select distinct’ in all queries, but vow to not let seeing this clause cause me to bias my thinking that all the particular code is poorly written and poorly performing.  In my role, it is best to evaluate each statement on its own merits.  Don’t get me started about nested views, though.  That is a subject for another blog.