Oracle 12c audit file problem

This post illustrates a situation with Oracle 12c clusterware default settings that needs to be addressed for every new installation. Connecting to a 12.1.0.1.0 database on RHEL 5 as sysdba yesterday resulted in an error: the connection failed because an audit file could not be created, citing no space left on device. Connections as system or other non-sysdba users succeeded. At first glance this would seem to be a pretty straightforward problem, except that there was plenty of space left in the filesystem in question:

oracle@server1 [/u01/app/oracle/admin/DB04/adump]
# df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vgora-u01 99G   60G   35G  64% /u01

Unsurprisingly, trying to create a file in this filesystem fails the same way:

oracle@server1 [/u01/app/oracle/admin/DB04/adump]
# touch test.txt
touch: cannot touch `test.txt': No space left on device
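An ENOSPC failure despite free blocks usually points at inode exhaustion, and df -i reports inode usage the same way df -h reports block usage. A quick diagnostic sketch (the -i flag is standard in GNU coreutils):

```shell
# Report inode (file-count) capacity for the filesystem holding
# the current directory; an IUse% of 100% explains "No space left
# on device" even when df -h shows plenty of free space.
df -i .
```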

At this point my first thought was that something was wrong with the filesystem and a full filesystem check (fsck) was needed, but a system engineer looking into the issue with me pointed out that the ASM audit file location contained over 6.5 million audit (.aud) files. That explains the symptom: every file consumes an inode, and once the filesystem's inode table is exhausted, file creation fails with "no space left on device" even though plenty of data blocks remain free. This entire system had been online for less than three months, so this many audit files seemed a bit excessive. I first tried to clear out any of them older than 30 days:

oracle@server1 [/u01/app/12.1.0/grid/rdbms/audit]
# find . -name *.aud -mtime +30 -exec rm -f {} \;
-bash: /bin/find: Argument list too long
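The failure actually comes from the shell, not from find: unquoted, the *.aud pattern is expanded by bash into millions of arguments before /bin/find ever runs. Quoting the pattern hands it to find literally, and GNU find's -delete removes matches itself without building an argument list at all. A sketch against a scratch directory so the result is reproducible (GNU touch and find assumed):

```shell
# Build a scratch directory with one stale and one fresh file.
dir=$(mktemp -d)
cd "$dir"
touch -d '2020-01-01' old.aud   # well past the 30-day cutoff (GNU touch)
touch new.aud keep.txt
# Quoted pattern: the shell passes '*.aud' through untouched and
# find matches it internally, one directory entry at a time.
find . -name '*.aud' -mtime +30 -delete
ls                              # old.aud is gone; new.aud and keep.txt remain
```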

This error is common in Linux environments when commands are run against a very large number of files, but the culprit here is the shell: because the *.aud pattern is unquoted, bash expands it to every matching filename before /bin/find is even invoked, and the resulting argument list exceeds the kernel's limit (quoting the pattern as '*.aud' avoids the expansion). Fortunately, I had an alternate routine to clear them out:

for i in $(ls -1 | grep '\.aud'); do
  rm -f "$i"
done

This process removes ALL the audit files, not just those older than 30 days. After it had been running for several minutes, I could once again connect to ASM and the database as sysdba. Now we need to find out why so many audit files are being generated. Let's check the cluster resource parameters for clues (only relevant output is shown):

[oracle@server1 ~]$ crsctl stat res -p

NAME=ora.asm
TYPE=ora.asm.type
ACL=owner:oracle:rwx,pgrp:dba:r-x,other::r--
...
CHECK_INTERVAL=1
CHECK_TIMEOUT=30
CLEAN_TIMEOUT=60

NAME=ora.db04.db
TYPE=ora.database.type
ACL=owner:oracle:rwx,pgrp:dba:r-x,other::r--,group:dba:r-x,user:oracle:rwx
...
CHECK_INTERVAL=1
CHECK_TIMEOUT=30
CLEAN_TIMEOUT=60

The CHECK_INTERVAL parameter for both the ASM and database resources is set to 1, which means the clusterware checks these resources every second. Each check connects to the instance as sysdba and generates an audit file, so a single resource produces 86,400 files per day, roughly 7.8 million in three months, which lines up with the 6.5 million files we found. We need to change this interval to something more practical. In this example I use a value of 60 (seconds), but 300 (5 minutes) would likely be better still.
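The arithmetic behind the choice of interval is simple; assuming one audit file per check, files per resource per day is just 86,400 divided by CHECK_INTERVAL:

```shell
# Audit files generated per resource per day for a given
# CHECK_INTERVAL (in seconds), assuming one file per check.
for interval in 1 60 300; do
  echo "CHECK_INTERVAL=$interval -> $(( 86400 / interval )) files/day"
done
```

At the default of 1, the two monitored resources here account for over 170,000 files per day between them.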

[oracle@server1 ~]$ crsctl modify resource ora.asm -attr "CHECK_INTERVAL=60"
CRS-4995: The command 'Modify resource' is invalid in crsctl. Use srvctl for this command.

If you get this error, use the ‘-unsupported’ option as a workaround:

[oracle@server1 ~]$ crsctl modify resource ora.asm -attr "CHECK_INTERVAL=60" -unsupported
[oracle@server1 ~]$ crsctl modify resource ora.db04.db -attr "CHECK_INTERVAL=60" -unsupported

Circling back: the file deletion loop was not making what I considered an acceptable level of progress, so my ultimate removal option was to rename the audit file directory, create a new empty audit directory in its place, and perform an ‘rm -rf’ on the old (renamed) directory.
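The rename approach works because moving a directory is a single metadata operation, instantaneous no matter how many files it contains; audit writes can resume the moment the empty replacement exists, and the slow bulk delete proceeds off the critical path. A sketch in a scratch location (on a real system the target would be the grid home's rdbms/audit directory, with ownership and permissions copied from the original):

```shell
# Stand-in for the real audit directory (adjust path, owner, and
# mode for a real system).
base=$(mktemp -d)
mkdir "$base/audit"
touch "$base/audit/one.aud" "$base/audit/two.aud"  # stand-ins for millions of files
mv "$base/audit" "$base/audit.old"   # instant rename, regardless of file count
mkdir "$base/audit"                  # empty directory: audit writes resume here
chmod 750 "$base/audit"              # match the original directory's mode (assumption)
rm -rf "$base/audit.old" &           # bulk delete runs in the background
wait                                 # demo only: wait so cleanup finishes before we look
```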

Update

Wanting to give the newer versions of 12c a fair shake, I created a two-node 12.1.0.1.0 grid infrastructure and default database in a lab environment to re-verify my initial findings. I was surprised to find the CHECK_INTERVAL for the ASM resource set to 60 (seconds), but not surprised to find the CHECK_INTERVAL for the database resource still set to 1 (second). I then created a 12.1.0.2.0 grid infrastructure and default database on an identical two-node RAC test environment and was surprised to find the same CHECK_INTERVAL values as in the 12.1.0.1.0 environment. I may post another update once I can set up a 12c Release 2 environment.

Update 10/23/17

I finally remembered to update this blog post with a 12c Release 2 (12.2.0.1) environment. My two-node test environment now has the new Oracle Cluster Management Database (-MGMTDB) instance installed as well. The CHECK_INTERVAL setting for this service is 1, but it does not appear to generate any audit files that I can see. The CHECK_INTERVAL for the cluster database I created during installation is 900, which is very reasonable. CHECK_INTERVAL for the ASM resource is still 1, yet audit file generation is about one per minute on an idle system, so the ‘1’ value for this setting may now refer to minutes instead of seconds.
