Working with Oracle trace files

magnifying glassBeing a contractor sometimes means working with clients in situations where you can’t access their databases and servers directly. Fortunately Oracle has utilities such as AWR and scripts like SQLTXPLAIN to help diagnose the system in your absence. If you are very fortunate, you may even get them to execute eDB360 for you.

However, if you really want to know where the database is spending all of its time, nothing beats a good old-fashioned trace. Trace (.trc) files are the ever-present files in the Oracle diagnostic destination that can tell you what processes are doing and how long they are doing it for. Most ignore these files and even delete them when they build up on the database server, but you can’t beat them for their wealth of information.  If I can’t touch a client’s system, the first thing I ask for is a trace of the session executing the problem code.   If they are unfamiliar with the tracing process, I offer to run it for them.

Creating a trace file

For a proper trace complete with timing information, the statistics_level parameter needs to be set to ALL. Fortunately, if it is not set at the system level, it can be set at the session level.

alter session set statistics_level=ALL;

To make identifying the trace file easier, you can set a trace file identifier at the session level. This identifier will become part of the name(s) of the trace file(s) generated:

alter session set tracefile_identifier = ‘my_trace_id’;

Tracing a session is simple, even if that session is not yours. To trace your own session, you can use any of these commands:

alter session set sql_trace=TRUE;
exec dbms_monitor.session_trace_enable;
exec dbms_support.start_trace;
alter session set events = ‘10046 TRACE NAME CONTEXT FOREVER, LEVEL 12’;

The last command provides the most detailed tracing with a maximum level setting of 12. Lower settings are also available, but I would not recommend a lower setting than 8.

If you need to trace a session other than your own, you just need the sid and serial number of the session.  You start tracing on the current session with any of this commands:

execute dbms_monitor.session_trace_enable(sid,serial#,TRUE, TRUE); (The last two arguments are to enable reporting on WAITS and BIND variables)

If this is a client server environment and there are several sessions connected with the same database account and you are not sure which one will handle the code execution, you can generate dynamic statements to trace them all:

set lines 200 pages 999
select ‘execute dbms_monitor.session_trace_enable(‘||sid||’,’||serial#||’,TRUE,TRUE);’
from v$session
where username = ‘SCOTT’
order by sid;

When the process or code you are tracing has completed, stop the tracing in the current session with any of these commands:

alter session set sql_trace = false;
exec dbms_monitor.session_trace_disable;
exec dbms_support.stop_trace;
alter session set events = ‘10046 TRACE NAME CONTEXT OFF’;

Or you can use this command if the session is not the current one (based on the earlier sid and serial number values):

exec dbms_monitor.session_trace_disable(sid, serial#);

If you were tracing multiple sessions, use this code to generate the stop tracing commands:

set lines 200 pages 999
select ‘execute dbms_monitor.session_trace_disable(‘||sid||’,’||serial#||’);’
from v$session
where username = ‘SCOTT’
order by sid;

Locating and processing a trace file

The trace file location on the database server is designated by the diagnostic_dest parameter (11g and above). From there, go to the rdbms\<sid>\<instance>\trace directory to see the actual trace files. In a pre-11g system, just go to the location designated by the parameter user_dump_dest. Use a list command with a sort such as ‘ls -lart’ to see the latest trace files. The trace files end in the .trc extension. The one(s) you are interested in should be the latest and probably largest of the trace files and you can usually see them dynamically grow as the code executes. Of course, if you set a trace file identifier, the file will be even easier to find.

Processing a trace file

Looking at a raw trace is a bit confusing and takes some experience to decipher. You can use the TKPROF utility to process it into something a little more readable. I recommend using the EXEELA and FCHELA options during the process to sort the SQL code in order of elapsed time and fetches descending. That way, the SQL taking the longest time will be at the top for easy access. Here is an example of the command to create this output:

tkprof <raw_trace_file.trc> <processed_file.log> sort=EXEELA,FCHELA

You can use any extension for the output file, but it is best to use one that is not foreign to the editor you are going to use to open the resulting file.

Reviewing a trace file

Opening the file, go straight to the bottom section that starts with ‘OVERALL TOTALS’. If you are unfamiliar with looking at trace file output, spend a minute reviewing the legend at the top of the file:

Trace file: dummy_ora_1369.trc
Sort options: exeela fchela
********************************************************************************
count = number of times OCI procedure was executed
cpu = cpu time in seconds executing
elapsed = elapsed time in seconds executing
disk = number of physical reads of buffers from disk
query = number of buffers gotten for consistent read
current = number of buffers gotten in current mode (usually for update)
rows = number of rows processed by the fetch or execute call
********************************************************************************

Then proceed to the bottom:

********************************************************************************

OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS

call     count       cpu    elapsed       disk      query    current        rows

------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        0      0.00       0.00          0          0          0           0
Execute 209486    822.98     823.79       2526      33961    1607238      209486
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   209486    822.98     823.79       2526      33961    1607238      209486

Misses in library cache during parse: 0

Elapsed times include waiting on following events:
Event waited on                             Times   Max. Wait  Total Waited
----------------------------------------   Waited  ----------  ------------
log file sync                              154081        6.41      19315.04
SQL*Net message to client                  314407        0.00          1.51
SQL*Net message from client                314407        0.02         59.12
Disk file operations I/O                       22        0.00          0.00
db file sequential read                      2526        0.02          0.37
library cache: mutex X                        154        0.00          0.00
log buffer space                                8        0.17          0.96
buffer busy waits                               2        0.10          0.10
latch: In memory undo latch                     5        0.00          0.00
latch: shared pool                              3        0.00          0.00
undo segment extension                          1        0.00          0.00


OVERALL TOTALS FOR ALL RECURSIVE STATEMENTS

call     count       cpu    elapsed       disk      query    current        rows

------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse   628459     13.06      13.03          0          0          0           0
Execute 1256917    598.32     611.29       1889     425754     653042      418972
Fetch   418973     10.35      10.45          1          4          0      418973
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   2304349    621.73     634.78       1890     425758     653042      837945

Misses in library cache during parse: 0

Elapsed times include waiting on following events:
Event waited on                             Times   Max. Wait  Total Waited
----------------------------------------   Waited  ----------  ------------
Disk file operations I/O                       23        0.00          0.00
db file sequential read                      1890        0.04          0.30
log buffer space                               22        0.63          3.65
log file switch (private strand flush incomplete)
12        1.56          7.50
cursor: pin S                                   1        0.00          0.00
buffer busy waits                               1        0.00          0.00
latch: In memory undo latch                     2        0.00          0.00
latch free                                      3        0.00          0.00
resmgr:cpu quantum                             36        0.06          0.29
log file switch completion                      1        1.57          1.57

418978  user  SQL statements in session.
2  internal SQL statements in session.
418980  SQL statements in session.
6  statements EXPLAINed in this session.
********************************************************************************
Trace file: dummy_ora_1369.trc
Trace file compatibility: 11.1.0.7
Sort options: exeela  fchela
1  session in tracefile.
418978  user  SQL statements in trace file.
2  internal SQL statements in trace file.
418980  SQL statements in trace file.
23  unique SQL statements in trace file.
6  SQL statements EXPLAINed using schema:
APPS.prof$plan_table
Default table was used.
Table was created.
Table was dropped.
36460282  lines in trace file.
20970  elapsed seconds in trace file.

There are two sections for overall totals – one for Oracle-generated non-recursive statements and one for user code known as recursive statements. At the very bottom is the number of elapsed seconds covered in the trace file. In this case, it is 20970 seconds or about 5.8 hours. Each of these sections has a total elapsed time and wait events information. The first thing to notice is that the total for the non-recursive statements is 823.79 seconds and the total for the recursive statements is 634.78. The sum of these two values is drastically short of the 20,970 grand total. Where did the other time go? Now start looking at the wait events under each section and stop when you get to the event that has a total time waited of 19315.04. To put this in perspective, a total of 1458.57 seconds (823.79 + 634.78) or 24 minutes was spent during the trace time on actual database work and 19315.04 seconds or 5.3 hours was spent doing log file syncs. Now we can start looking for the sources of all these log file sync waits. Since this output file is already sorted by elapsed time descending, we just start at the top:

SQL ID: 9dtjfluf6rcct Plan Hash: 2730624675

INSERT INTO "BAD"."SIMP_STG" (REC_ID,INSTANCE_ID,
ORIGINAL_ID,REPROCESS_ID,TRANSACTION_ID,REPROCESS_CNT)
VALUES
(SIMP_STG_S.nextval, :INSTANCE_ID, :ORIGINAL_BPEL_ID,
:REPROCESS_BPEL_ID, :TRANSACTION_ID, :REPROCESS_COUNTER)

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        0      0.00       0.00          0          0          0           0
Execute 104743    462.81     463.34       1414      18029     752169      104743
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   104743    462.81     463.34       1414      18029     752169      104743

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 44  (APPS)
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
0          0          0  LOAD TABLE CONVENTIONAL  (cr=2 pr=0 pw=0 time=5657 us)
1          1          1   SEQUENCE  SIMP_STG_S (cr=2 pr=0 pw=0 time=2488 us)

Rows     Execution Plan
-------  ---------------------------------------------------
0  INSERT STATEMENT   MODE: ALL_ROWS
0   LOAD TABLE CONVENTIONAL OF 'SIMP_STG'
1    SEQUENCE OF 'SIMP_STG_S' (SEQUENCE)

Elapsed times include waiting on following events:
Event waited on                             Times   Max. Wait  Total Waited
----------------------------------------   Waited  ----------  ------------
log file sync                               49337        6.41       9105.43
SQL*Net message to client                  104743        0.00          0.51
SQL*Net message from client                104743        0.02         17.95
db file sequential read                      1414        0.01          0.17
library cache: mutex X                         61        0.00          0.00
Disk file operations I/O                       16        0.00          0.00
log buffer space                                1        0.17          0.17
buffer busy waits                               2        0.10          0.10
latch: In memory undo latch                     2        0.00          0.00
latch: shared pool                              1        0.00          0.00
undo segment extension                          1        0.00          0.00
********************************************************************************

This first statement looks like it accounts for about half of the log file sync waits. This insert statement uses a sequence to generate a unique ID for every one of the 104,743 rows inserted. A review of this sequence revealed that it had the default cache setting of 20. Any sequence that is called thousands of times during the course of one process needs a cache value that will drastically reduce the number of times it actually needs to generate a value. A cache value of 10000 or more would be needed here so that every time the sequence is called, the next 10000 numbers of the sequence are cached in memory, speeding up the obtaining of the sequence values. Once these cached values are exhausted, only then is the sequence physically called again.

Sequence caching removed about 9000 log file sync waits, but a review of the rest of the output revealed no more log file sync waits in any of the other statements. Where were the other 10000 waits? One of the problems with TKPROF is that it does not report on commits and rollbacks that are recorded in the trace file. It just so happens that one source of log file sync waits is excessive commits. The process generating these insert statements was a SQL Loader process that had options(rows=1) in the control file. It was performing a commit for every row inserted. Removing this clause and switching from conventional path to direct path loading removed the remaining log file sync waits.

Sometimes the cause of a problem is not always readily apparent. Take this statement for example:

SQL ID: 1b8xbxk8jc712 Plan Hash: 2504779452
SELECT L.JON
FROM
SIMP_HDR_STG H, SIMP_LINES_STG L WHERE
L.RECORD_HEADER_ID = H.RECORD_HEADER_ID AND H.INSTANCE_ID = :B1 AND ROWNUM = 1

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute     12      0.00       0.00          0          0          0           0
Fetch       12     81.56      81.56          0   26025612          0          12
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total       25     81.56      81.56          0   26025612          0          12

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 44     (recursive depth: 2)
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
1          1          1  COUNT STOPKEY (cr=2168801 pr=0 pw=0 time=6878671 us)
1          1          1   NESTED LOOPS  (cr=2168801 pr=0 pw=0 time=6878653 us cost=9 size=126 card=7)
1072923    1072923    1072923    NESTED LOOPS  (cr=1095878 pr=0 pw=0 time=4041943 us cost=9 size=126 card=7)
1072923    1072923    1072923     TABLE ACCESS FULL SIMP_LINES_STG (cr=23852 pr=0 pw=0 time=535161 us cost=2 size=49 card=7)
1072923    1072923    1072923     INDEX UNIQUE SCAN SIMP_HDR_STG_PK (cr=1072026 pr=0 pw=0 time=2585580 us cost=0 size=0 card=1)(object id 3379354)
1          1          1    TABLE ACCESS BY INDEX ROWID SIMP_HDR_STG (cr=1072923 pr=0 pw=0 time=2309258 us cost=1 size=11 card=1)

You need to dig a little bit with this one. Execution time is 81.56 seconds for 12 executions of this statement, which is not too bad at 6.7 seconds per execution. However, 26 million consistent reads total or 2.1 million per execution for only one row per execution as a result seems excessive. Looking further into the details, there are more scans of the primary key index of the header table than are performed in a full table scan of the lines table. If you assume a one to many relationship between the header and lines tables, then you should be doing fewer scans on the headers table than the lines table. This may be a simple case of having the wrong table driving the query.

Finally, there are times when you see things in the TKPROF output that just make you shake your head:

********************************************************************************

OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        0      0.00       0.00          0          0          0           0
Execute      0      0.00       0.00          0          0          0           0
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        0      0.00       0.00          0          0          0           0

Misses in library cache during parse: 0

Elapsed times include waiting on following events:
Event waited on                             Times   Max. Wait  Total Waited
----------------------------------------   Waited  ----------  ------------
db file sequential read                         1        0.00          0.00
utl_file I/O                                61671        0.00          1.12
PL/SQL lock timer                            2373       10.09      23731.10

OVERALL TOTALS FOR ALL RECURSIVE STATEMENTS

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse   314346      6.04       6.46          0          0         10           0
Execute 588073    667.85     678.04      35191    4498924     293728       50209
Fetch   496358    102.61     105.00       1768    5489750       6015      492324
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   1398777    776.51     789.51      36959    9988674     299753      542533

Misses in library cache during parse: 346
Misses in library cache during execute: 1331

Elapsed times include waiting on following events:
Event waited on                             Times   Max. Wait  Total Waited
----------------------------------------   Waited  ----------  ------------
db file sequential read                      7713        0.06         10.85
Disk file operations I/O                       93        0.00          0.00
db file scattered read                       2067        0.07          2.26
enq: TM - contention                            9        0.17          0.96
latch: cache buffers chains                     1        0.00          0.00
latch: In memory undo latch                     1        0.00          0.00

24039  user  SQL statements in session.
138  internal SQL statements in session.
24177  SQL statements in session.
********************************************************************************
Trace file: DUMMY_ora_22860.trc
Trace file compatibility: 11.1.0.7
Sort options: exeela  fchela
1  session in tracefile.
24039  user  SQL statements in trace file.
138  internal SQL statements in trace file.
24177  SQL statements in trace file.
357  unique SQL statements in trace file.
14449283  lines in trace file.
29448  elapsed seconds in trace file.
 

This is only the bottom portion of a TKPROF output file, but what is interesting here is the presence of the PL/SQL lock timer waits under the non-recursive section. The only cause of these waits are calls to the dbms_lock.sleep procedure. To summarize, there are 29448 total seconds in this TKPROF output and 23731 seconds or 80% of the total time is taken up by code forcing the database to do nothing or “sleep”.

I hope you find this as useful as I did when it was first revealed to me. There is a wealth of information that the Oracle database provides automatically. Sometimes we have to put on our Indiana Jones hats to dig it up.

Update 1/10/2017

I recently had a situation where I was only given a raw trace file with no means to process it through TKPROF.  I this situation, I only had a restricted access virtual machine (VM) with which to investigate its contents.

Knowing that I could install basic tools on this VM, I installed Cygwin64 Terminal (cygwin.com) which is a set of Unix-like tools for Windows environments.  After installation, I was able to use the following commands to help me process the wait events out of the trace file.

(Tested on Oracle version 11.2.0.4.0)

Use this command to group waits in a raw trace file:

cat PROD1_ora_24787.trc | grep WAIT | awk -F”ela=”‘{a[$3];}END{for (i in a)print i;}’

cat PROD1_ora_24787.trc | grep WAIT | awk ‘{a[$3″ “$4” “$5];}END{for (i in a)print i;}’

Note – Since the fields of a wait category are often separated by spaces, it may be
necessary to grab multiple fields ($3, $4, $5) to get the entire wait event name

To count the number of waits per category;
cat PROD1_ora_24787.trc | grep WAIT | awk ‘{a[$3]++;}END{for (i in a)print i, a[i];}’

To summarize elapsed microseconds for a particular wait:
cat PROD1_ora_24787.trc | grep WAIT | grep ‘utl_file I/O’ | awk ‘{sum += $6}END{print sum}’

The above command will summarize all the ‘utl_file I/O’ waits in the example trace file.
Note that the value will be give in microseconds and should be multiplied by .000001 to
produce the summarized value in seconds.