That demned elusive archive log!

Introduction

With apologies to Emma Orczy again for stealing a line from “The Scarlet Pimpernel” … 🙂

Over the past few days, I’ve been battling a number of Oracle products to resolve a situation for a customer. They have a database refresh process that refreshes a non-Production environment with a copy of Production (no comments about the security issues, please!). For some time, that refresh had been failing. This post explains how we worked out what was going on, as well as what we did to ensure it doesn’t recur.

The Problem

For this particular customer, the refresh from the Production environment is scripted, and the last step of that script emails us the log of the run. It had been working well for quite some time, but then a few days back it failed partway through with an error in the log.

Anyone who’s been around Oracle databases for a while would recognize it straight away: we were missing an archived redo log, thread# 2, sequence# 13857.
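
A quick query against v$archived_log (using the thread# and sequence# from the error above) is an easy way to confirm whether a log like that is still on disk or has already gone:

-- STATUS = 'D' / DELETED = 'YES' means Oracle has already removed the file
SELECT thread#, sequence#, name, status, deleted, completion_time
FROM   v$archived_log
WHERE  thread#   = 2
AND    sequence# = 13857;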

Resolving the Problem

When we checked on disk, the archive file that was missing was over 7 days old and had been copied off to tape. Before we checked the tapes, we went to the standby site and looked there, but it had already been applied and removed from the standby as well. So the next step was to get the archived log file back from tape.
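
In an environment where RMAN drives the tape backups itself (through a media management layer it knows about), getting the file back would normally be a one-liner along these lines:

RMAN> RESTORE ARCHIVELOG SEQUENCE 13857 THREAD 2;

As you’ll see below, that wasn’t how it played out here.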

The problem here was that for this particular customer, the backups were performed by Oracle Secure Backup (OSB). For those of you who aren’t familiar with OSB, let me just say its documentation ranks up there with the worst Oracle documentation ever created, and that’s saying something! All we could work out from the logs was the date of the backup that should have included that archive log, so my colleague restored that backup to the machine where the refresh was taking place, and I simply added the file to RMAN’s catalog using the catalog start with '/filespec'; command. That should resolve it, I thought, and left it until the next refresh.
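
For reference, the cataloging step is just the standard RMAN command pointed at the directory the files were restored into (the path below is only an example), followed by a quick check that RMAN now knows about the log:

RMAN> CATALOG START WITH '/restore_area/osb/' NOPROMPT;
RMAN> LIST ARCHIVELOG SEQUENCE 13857 THREAD 2;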

Well, of course, it didn’t. I checked the log for the next refresh and, sure enough, found the same error.

So I got my colleague to restore the backups from the day on either side, as we weren’t 100% sure which backup had the right file, went through the same process of cataloging the extra files, and waited for the refresh to complete. And believe it or not, the damn thing still came back with the same failure! This time I mounted the database and did a list backupset for each of the backups we’d restored. In the first one we’d restored (the middle of the three days), I found log sequence numbers 13854, 13855 and 13856. The second one started with log sequence number 13858. In other words, the archive log had never made it into a backup at all (I did check the third backup just to be sure, but it was from earlier, so it didn’t include the file either).
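
If you want to narrow that check down rather than wading through the full list backupset output, something along these lines does the job once the backup pieces have been cataloged:

RMAN> LIST BACKUP OF ARCHIVELOG SEQUENCE BETWEEN 13854 AND 13858 THREAD 2;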

So where had that “demned elusive archive log” gone? Well, digging into the alert log, another colleague found the answer. Oracle had automatically deleted the damn thing because the fast recovery area had reached 80% full. This is a known Oracle “feature” (though to me, deleting archived logs before they’ve been backed up should be a known Oracle bug!). So the file wasn’t on disk, and it wasn’t on tape. Thankfully, we could work around it in this scenario by pointing the refresh at a more recent full backup.
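
If you want to see how close a database is getting to that sort of space pressure in the FRA, the usual views tell the story:

-- Overall FRA usage, and how much of it Oracle considers reclaimable
SELECT name,
       ROUND(space_used / space_limit * 100, 1)        AS pct_used,
       ROUND(space_reclaimable / space_limit * 100, 1) AS pct_reclaimable
FROM   v$recovery_file_dest;

-- Breakdown by file type (archived logs, backup pieces, flashback logs, etc.)
SELECT file_type, percent_space_used, percent_space_reclaimable
FROM   v$recovery_area_usage;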

Ensuring It Doesn’t Happen Again

So how do we go about ensuring that situation doesn’t arise again, particularly where we don’t have a simple workaround like pointing the refresh at a more recent backup? Well, for any of you who know my background with Enterprise Manager, you’ll be totally unsurprised to hear my first thought was to set up an alert in EM. The problem is that if you are using the Fast Recovery Area (FRA) for the archive destination, i.e. the destination is set to USE_DB_RECOVERY_FILE_DEST, the Archive Area % Used metric no longer triggers. Instead, the Recovery Area % Used metric triggers when it hits a Warning threshold of 85% full and a Critical threshold of 97% full. These are server-side defaults and can’t be modified (see support note 428473.1 for details).

The way to get around that, of course, is to create a metric extension. One of my previous colleagues at Oracle, Courtney Llamas, posted a blog on how to do that a few years back. While it’s somewhat dated, it’s easy enough to update for 13.2, so I’ll leave that as an exercise for the reader. 🙂
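
At its heart, a metric extension like that is just a SQL query EM runs against each target, with Warning and Critical thresholds set on the returned columns. A minimal sketch of the sort of query you might use (the column aliases and the idea of alerting on the non-reclaimable percentage are mine, not necessarily what Courtney’s post does) would be:

-- Candidate metric extension query: overall FRA usage, plus the portion
-- that is NOT reclaimable (the number that really matters)
SELECT ROUND(SUM(percent_space_used), 1)                             AS pct_used,
       ROUND(SUM(percent_space_used - percent_space_reclaimable), 1) AS pct_not_reclaimable
FROM   v$recovery_area_usage;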

So now hopefully we should be good and shouldn’t see this problem arising again, unless of course we don’t respond to the metric extension alerts! 😉

Pete

After 22 years of working at Oracle in just about every role except Marketing and Support, I am now working as a Senior Managed Services Consultant with Data Intensity, specializing in Oracle Database technology, High Availability and Disaster Recovery solutions. I am also a member of the OakTable Network, and have presented at RMOUG Training Days, Hotsos Symposia, Oracle OpenWorld conferences, and other user group events. I have co-authored the Expert Oracle Enterprise Manager 12c and Practical Oracle Database Appliance books published by Apress, and am one of the authors of the Building Database Clouds in Oracle Database 12c book published by Addison Wesley.
