Evaluating a SQL Server environment

This is a topic with more than one meaning, so I will narrow the focus to looking at an existing SQL Server environment for the first time. If you are experiencing performance issues with an existing environment, I recommend you start with the Microsoft (MS) SQL Server performance monitor to concentrate your effort, rather than examining the SQL Server environment as a whole. This is especially true if only a few processes are problematic.

It is good to start by asking basic questions to better understand the environment you are about to evaluate. A short checklist of items to inquire about before proceeding includes:

  • Is this a 24×7 environment?
  • What are the peak times of usage and critical availability?
  • Are there any current performance concerns?
  • What is the current backup configuration?
  • When was the last time a recovery was performed? Practice recovery? Cloning using backups?

This information can help you determine the best approach for your evaluation, such as the best times to execute any of the more resource-intensive examinations. There are many tools to choose from when evaluating a SQL Server environment; some are MS native tools and some are third-party. As a precaution, verify the presence of a recent backup, or even execute a one-off full backup, before the evaluation.
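For example, here is a quick sketch of both checks: the first query lists the most recent full backup recorded in msdb for each database, and the second takes a one-off COPY_ONLY full backup that will not disturb the existing backup chain (the target path is hypothetical).

-- Most recent full backup recorded for each database
SELECT database_name, MAX(backup_finish_date) AS last_full_backup
FROM msdb.dbo.backupset
WHERE type = 'D'
GROUP BY database_name
ORDER BY database_name;
GO

-- One-off full backup before the evaluation (path is hypothetical)
BACKUP DATABASE <db_name>
TO DISK = N'D:\Backups\db_name_pre_evaluation.bak'
WITH COPY_ONLY, CHECKSUM;
GO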

The most recognized MS built-in tool for evaluation is DBCC CHECKDB, which is included with every SQL Server version. DBCC CHECKDB is a Transact-SQL utility that checks the logical and physical integrity of all the objects in the specified database. It also calls other DBCC commands and performs these steps while executing:

  • Runs DBCC CHECKALLOC on the database to check the consistency of disk space allocation structures.
  • Runs DBCC CHECKTABLE on every table and view in the database to check the integrity of all the pages and structures that make up the table or indexed view.
  • Runs DBCC CHECKCATALOG on the database to check for catalog consistency within the database.
  • Validates the contents of every indexed view in the database.
  • Validates link-level consistency between table metadata and file system directories and files when storing varbinary(max) data in the file system using FILESTREAM.
  • Validates the Service Broker data in the database.

The DBCC CHECKDB utility includes options to repair issues it finds, but I do not recommend using these options until any existing issues have been identified and understood. All of these checks and validations can take a long time to execute against a sizeable database, so there are options to exclude one or more of the above steps for time and resource considerations. SQL Server versions 2014 through 2017 even offer the ability to execute this utility with a specified maximum degree of parallelism. Consult the MS DBCC CHECKDB documentation for the version you are evaluating for more specific information.

Basic execution (from a SQL Server Management Studio prompt):

USE <db_name>
GO
DBCC CHECKDB
GO

Note – For larger databases, use the “Results To” option under the Query menu in SQL Server Management Studio to send the results to a file for later review.
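If run time is a concern, the steps above can be trimmed with options on the command. The following is a sketch of some commonly used variations; availability of MAXDOP varies by version and service pack, so check the documentation for your release first.

-- Physical structure checks only, suppressing informational messages
DBCC CHECKDB (N'<db_name>') WITH PHYSICAL_ONLY, NO_INFOMSGS
GO

-- Skip nonclustered index checks
DBCC CHECKDB (N'<db_name>', NOINDEX)
GO

-- Limit the degree of parallelism on versions that support it
DBCC CHECKDB (N'<db_name>') WITH MAXDOP = 4
GO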

Brent Ozar, a well-regarded SQL Server expert, created a free tool set called the First Responder Kit, best known by the name of its main procedure, sp_Blitz. It is available from https://www.brentozar.com/first-aid/ and provides a list of findings prioritized for immediate action or informational purposes; the lower the number a finding is given, the higher the priority. The tool does install stored procedures and an output table in the database; however, it is free, and you can use parameters at execution time to avoid checking certain objects or to ignore findings over a certain priority level.

Basic execution (from a SQL Server Management Studio prompt):

After downloading the First Responder Kit compressed archive and placing it in an accessible location, execute the sp_Blitz.sql file to install the database objects. Then, from a SQL Server Management Studio query window, execute:

EXEC sp_Blitz
GO

Note – By default, this procedure examines ALL databases present on the current SQL Server instance and groups the findings in priority order, sorted by database name.
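For example, parameters can be used to skip the object-level checks inside user databases and to hide lower-priority findings. This is a sketch; the parameter names below are as documented in recent versions of the kit.

EXEC sp_Blitz
    @CheckUserDatabaseObjects = 0,  /* skip object-level checks inside user databases */
    @IgnorePrioritiesAbove = 50     /* hide findings with a priority number above 50 */
GO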

My own SQL Server Health Check script (https://wordpress.com/post/dbajonblog.wordpress.com/178), which is also free, may not have the maturity of some other tools, but it does not require any objects to be installed and it provides a good overall picture of a database environment. It is a single script that can be executed like any standard SQL script.

 


Backups triggered automatically

Database availability is the number one goal of the database administrator.  Backup and recovery should be the second goal.  An internet search on establishing database backup and recovery processes will produce many examples.  I won’t go into my personal preferences, but I will emphasize the points I think are important:

  • Backup and recovery processes MUST be documented for reference by personnel responsible for maintenance and execution. I find that the lack of backup and recovery documentation is fairly common out in the world. No DBA likes to try to do an undocumented recovery at 3AM.
  • Backup processes that do not include a check for successful completion are INVALID. If the backup has been completing with an error for a month, then your last valid backup was a month ago, and if you had to do a recovery today you would lose a month’s worth of data (see the sketch after this list for one way to add such a check).
  • Recovery processes that have not been tested by performing a practice recovery or a clone are INVALID. You may have a proven backup method, but it has not been proven until you do a recovery in your own environment.
  • If there is available time and resources, add an export to your backup process. Exports are good for granular recovery of metadata and objects such as indexes and PL/SQL code.
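As one example of a completion check in an Oracle environment, a backup wrapper script can scan the log produced by the RMAN run and raise an alert on any failure. This is a minimal sketch; the log path and mail recipient are placeholders for site-specific values.

#!/bin/sh
# Scan the RMAN log from the most recent backup run for errors
RMAN_LOG=/home/oracle/logs/rman_backup.log

if egrep "RMAN-|ORA-" $RMAN_LOG > /dev/null; then
   mailx -s "RMAN backup FAILED for ${ORACLE_SID}" dba@yourcompany.com < $RMAN_LOG
else
   echo "Backup completed successfully"
fi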

That said, even solid backup processes can get blind-sided by unexpected activity in the database that produces vastly more redo than is normally seen.  It only takes one instance of the dreaded “ORA-00257: archiver error” to send everyone from support staff through management looking for, or demanding, answers and remediation steps.

The workaround for these situations is usually either increasing the amount of space in the archived log destination or increasing the frequency of backups.  While either of these is fairly easy to implement, each has areas of concern that should be considered prior to implementation.

Increasing the amount of space in the archived log location could be costly, depending on the amount and type of storage used.  Also, if the surge of redo is not a common occurrence, all the additional space will sit unused, essentially wasted.  Increasing the backup frequency may not cost additional storage, but it does cost the database resources needed to perform the additional backups, especially when they occur during normal business hours.

When a database is configured to use the fast recovery area (FRA) for backups and flashback, Oracle automatically monitors and alerts on its space usage.  The Oracle 11.2 and 12.2 documentation both contain the same information on these alerts:

The database issues a warning alert when reclaimable space is 
less than 15% and a critical alert when reclaimable space is 
less than 3%. To warn the DBA of this condition, an entry is 
added to the alert log and to the DBA_OUTSTANDING_ALERTS table 
(used by Enterprise Manager). Nevertheless, the database 
continues to consume space in the fast recovery area until 
there is no reclaimable space left.

When the recovery area is completely full, the error displayed 
is as follows, where nnnnn is the number of bytes required and 
mmmmm is the disk quota:

ORA-19809: limit exceeded for recovery files
ORA-19804: cannot reclaim nnnnn bytes disk space from mmmmm limit

From <https://docs.oracle.com/cd/E11882_01/backup.112/e10642/rcmmaint.htm> and <https://docs.oracle.com/en/database/oracle/oracle-database/12.2/bradv/maintaining-rman-backups.html>

A practical database administrator should already be monitoring the alert log for errors and warnings.  Why not configure these monitoring processes to automatically trigger a backup when this error is encountered?  One of my favorite alert log monitor scripts sends an email when an ORA- error is found in the alert log and sends a page or text message to designated parties if the error involves some type of corruption:

#!/bin/sh
# Alert log monitor: checks the alert log for ORA- errors and corruption,
# mails a summary, and pages the on-call group if corruption is found.
# Usage: alert_log_monitor.sh <ORACLE_SID> <alert_log_directory>

ERRMSG="Missing command line parameters"

if [ "$1" ]; then
        export ORACLE_SID=$1
else
        echo $ERRMSG
        exit 1
fi

if [ "$2" ]; then
        export ALERT_LOG_DIR=$2
else
        echo $ERRMSG
        exit 1
fi

LOGFILE="${ALERT_LOG_DIR}/alert_${ORACLE_SID}.log"
ERRFILE="${ALERT_LOG_DIR}/alert_error.log"
CHECK_TIMEB="`date +%m/%d/%y-%H:%M`"
SAVFILE="${ALERT_LOG_DIR}/alert_${ORACLE_SID}_`date '+%y%m%d-%H%M'`.log"
# Site-specific paging settings - adjust or remove as needed
GROUPID="SAPADMIN1.BASIS_ON_CALL"
RECIP="youremail@sendwordnow.com"

echo " "
echo "**"
echo "**** $ORACLE_SID"
echo "**"
echo " "

if [ -f $LOGFILE ]; then
   if [ `egrep -i "ORA-|corrupt" "${LOGFILE}" | wc -l` -eq 0 ]; then
      echo "No errors found in alert log"
   else
      # Save the matching lines plus a short summary and mail them
      egrep -i "ORA-|corrupt" $LOGFILE > $ERRFILE
      echo " " >> $ERRFILE
      echo "  Found the above error(s) in the alert log for ${ORACLE_SID}" >> $ERRFILE
      echo "  Alert Log with error(s) saved as ${SAVFILE}" >> $ERRFILE
      echo " " >> $ERRFILE
      echo " " >> $ERRFILE
      mailx -s "ALERT - Errors found in ${ORACLE_SID} Alert Log" youremail@yourcompany.com < $ERRFILE
      # Page the on-call group if any of the errors involve corruption
      if [ `egrep -i "corrupt" "${ERRFILE}" | wc -l` -gt 0 ]; then
         mailx -s "ALERT - CORRUPTION DETECTED IN ${ORACLE_SID} ALERT LOG" YourPager@yourcompany.com < $ERRFILE
      fi
      # Rename the alert log so the same errors are not reported again
      mv $LOGFILE $SAVFILE
   fi
else
   echo "No alert log found for ${ORACLE_SID}"
fi
exit 0

You could easily add a line to this script to call a separate backup script.  I usually set my alert log monitor to execute once an hour, but increasing the frequency of execution will help catch redo space problems sooner without costing much in resources or space.  Also, having an automatically-triggered backup will save diagnosis time when an archiver ‘out of space’ problem occurs.
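For example, a call like the following (the backup script path and name are hypothetical) could be added to the error branch of the monitor, right after the mail is sent:

# Hypothetical call: trigger an archived log backup when errors are found
/home/oracle/scripts/backup_arch_logs.sh $ORACLE_SID >> ${ALERT_LOG_DIR}/backup_arch_logs.log 2>&1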

The drawback here is that you must have the FRA enabled to use this alert functionality.  What if use of the FRA is not preferred?  You could use the alert log monitor script above to trigger a backup when an archiver error is seen, but by that point the database is already in the process of halting activity.  It is better to be pre-emptive with these types of alerts by setting up monitoring that warns of a problem before an alert log message is ever generated.

For example, here is a snippet of code for a Linux/Unix shell script that determines how much space, in GB, is available in a certain filesystem.  If that amount falls below a certain value, a message is generated:

#!/bin/sh
# Available space (in KB) on the filesystem holding the archived logs
space_avail="`df -k | grep /archive | awk '{print $4}'`"

# Convert KB to GB; if less than 40 GB remain, flag that a backup is needed
if [ `expr $space_avail / 1048576` -lt 40 ]; then
   echo "Backup needed"
fi

You could easily replace the echo command with a call to a backup script.  After testing, this script could be called from the crontab at an interval of your choosing.
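A minimal sketch of such a crontab entry, assuming the script above was saved as /home/oracle/scripts/check_arch_space.sh:

# Run the archive space check every 15 minutes and keep a simple log
*/15 * * * * /home/oracle/scripts/check_arch_space.sh >> /home/oracle/scripts/check_arch_space.log 2>&1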

If you have your backup location in ASM, you can use this next piece of code to accomplish the same thing by querying the desired diskgroup:

#!/bin/sh
# Query the ASM instance for free space (in MB) in the DATA diskgroup.
# Assumes the environment (CRS_HOME, ORACLE_SID) points to the ASM instance.
free_mb=`$CRS_HOME/bin/sqlplus -s "/ as sysasm"<<EOF
set heading off feedback off pagesize 0
select free_mb
from v\\$asm_diskgroup
where name = 'DATA';
exit
EOF`

# Convert MB to GB; if less than 40 GB remain, flag that a backup is needed
if [ `expr $free_mb / 1024` -lt 40 ]; then
   echo "Backup needed"
fi

With a few simple steps, you can be proactive about excessive redo activity in your database.

To output in real time or not?

Just yesterday I encountered a wrong conclusion based on a misunderstanding about the use of the dbms_output.put_line procedure.  The writer of a SQL script that spooled its output to a log file used it to generate ‘start’ and ‘end’ messages within a PL/SQL block and was confused when the ‘start’ message did not appear in the log file while the main section of the PL/SQL block was executing.  He was under the assumption that output from this procedure is written immediately.

This is a common misconception about dbms_output.put_line: its output is buffered and not displayed until the PL/SQL block completes.  The following example of the functionality they were attempting demonstrates this.

set serveroutput on
spool ./output.log

begin
  dbms_output.put_line('Start - '||to_char(sysdate,'MM/DD/YYYY HH24:MI:SS'));
  execute immediate 'select * from tab';
  dbms_lock.sleep(30);   -- simulate roughly 30 seconds of work
  dbms_output.put_line('End - '||to_char(sysdate,'MM/DD/YYYY HH24:MI:SS'));
end;
/

spool off

In this example, spooling of the output is started before the PL/SQL block and the ‘start’ and ‘end’ messages are generated inside it.  The code will run for about 30 seconds because of the dbms_lock.sleep call.  During execution, the output.log file will remain empty until the code finishes, either successfully or unsuccessfully, and ‘spool off’ is executed.  You can verify this with a ‘tail -f’ command in Linux/Unix or the ‘type’ command in Windows.

What happens if you place the ‘start’ and ‘end’ messages outside the PL/SQL block?

set serveroutput on
spool ./output.log

select 'Start - '||to_char(sysdate,'MM/DD/YYYY HH24:MI:SS')
from dual;

begin
  execute immediate 'select * from tab';
  dbms_lock.sleep(30);
end;
/

select 'End - '||to_char(sysdate,'MM/DD/YYYY HH24:MI:SS')
from dual;

spool off

Even though you see the ‘start’ message right away in your SQL*Plus session, the log file will still not be populated until the PL/SQL block finishes and the ‘spool off’ command runs.

If you truly want to see the ‘start’ message before ANY code executes, then using OS commands to create the log file and write the ‘start’ message is probably your only alternative.  Here is an example for a Linux-based system:

echo "Start - $(date)" > ./output.log

This command, executed before the script is called, will create the log file in the current directory and write a ‘start’ message with the current date and time.  This information shows up in the log file immediately.  Next, call the script to be executed:

$ORACLE_HOME/bin/sqlplus -S username/password @script.sql >> ./output.log

This command starts a SQL*Plus session in silent mode (-S), which suppresses the banner and prompt output, calls the script.sql script, and sends the output to the log file created earlier.  Using this method, you don’t need any ‘spool’ commands in the script.  Notice that I used “>>” to redirect the output of this command, not the “>” I used when I created the log file.  The “>>” redirect appends to an existing file, creating it if it does not exist.  If you use “>”, an existing file will be overwritten and you will lose any information previously written to it.

For a Windows-based system, the command is almost the same:

echo Start - %date% %time% > ./output.log

This command, executed before the script is called, will create the log file in the current directory and write a ‘start’ message with the date and time.  This information shows up in the log file immediately.  The “>>” and “>” redirect options work the same as in a Linux/Unix environment.

From here, there are a few different ways to execute a SQL script from a Windows command prompt.  If this is going to be an ongoing process in Windows, I recommend writing the scripts in Perl; it will save time, especially if you have to port the code between Windows and Linux/Unix environments.

One last point: you should not rely on this ‘start’ and ‘end’ message method in a production system.  While it is fine for time tracking while testing a script, it should not be relied on outside of a test environment.  Even when these messages are sent to a log file, that file still needs to be parsed for error messages at the conclusion of the process.  A better solution is to write the messages to a table used for auditing and monitoring.  Just don’t forget to commit each time you write a message to the table, and remember to purge the information from the table based on your audit requirements.
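A minimal sketch of that approach, assuming a hypothetical script_audit table and using an autonomous transaction so the audit commit does not interfere with the calling transaction:

create table script_audit (
  log_time  date          default sysdate,
  script    varchar2(100),
  message   varchar2(4000)
);

create or replace procedure log_message (p_script in varchar2, p_message in varchar2) as
  pragma autonomous_transaction;  -- commits the audit row without committing the caller's work
begin
  insert into script_audit (script, message) values (p_script, p_message);
  commit;
end;
/

-- Usage inside the PL/SQL block:
-- log_message('my_script', 'Start - '||to_char(sysdate,'MM/DD/YYYY HH24:MI:SS'));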