With apologies to Emma Orczy for stealing a line from “The Scarlet Pimpernel” … 🙂
We ran into a strange situation at a customer site this morning. The customer asked us to kill two stuck sessions on one of their RAC Exadata environments – normally something they have sufficient knowledge and access to handle themselves. This time round, though, every time they issued an “ALTER SYSTEM KILL SESSION ‘xxx,yyyyy'” command they got the error “User session ID does not exist”. Normally, you would expect this error for one of two reasons:
- They were connected to the wrong instance (remember this is Exadata and RAC) or had forgotten to add the inst_id parameter to the command as well – not the case this time.
- The “stuck” session was actually progressing, just slowly, and had since finished of its own accord – again, not the case this time.
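For reference, a kill attempt on RAC usually looks something like this (the serial number and instance number below are illustrative, not the customer’s real values):

```sql
-- Look the sessions up across all instances (the GV$ views are RAC-aware)
SELECT inst_id, sid, serial#, status, program
FROM   gv$session
WHERE  sid IN (165, 529);

-- From 11g onwards you can target a session on another instance
-- with the @inst_id suffix inside the quoted string
ALTER SYSTEM KILL SESSION '165,4321,@2' IMMEDIATE;
```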
The two stuck sessions (SIDs 165 and 529) were still viewable in V$SESSION, but we couldn’t kill them because the serial# (the second parameter to the ALTER SYSTEM KILL SESSION command) was actively changing, faster than we could select the information from V$SESSION and kill the session. This post walks through how we resolved the problem.
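In hindsight, one way to narrow that race is to look up the serial# and issue the kill inside a single PL/SQL block, so there’s no human delay between the SELECT and the KILL. A sketch (the SID and instance number are illustrative, and as the rest of this post shows, it wouldn’t have saved us here anyway):

```sql
DECLARE
  v_serial gv$session.serial#%TYPE;
BEGIN
  -- Grab the current serial# and kill immediately, in one block
  SELECT serial# INTO v_serial
  FROM   gv$session
  WHERE  sid = 165 AND inst_id = 2;

  EXECUTE IMMEDIATE
    'ALTER SYSTEM KILL SESSION ''165,' || v_serial || ',@2'' IMMEDIATE';
END;
/
```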
Investigating the Issue
The first step to finding out what was going on was identifying the relevant database / instance information. The customer had provided us with the connection string (customer information obfuscated for obvious reasons!):
(DESCRIPTION =
  (ADDRESS = (PROTOCOL = TCP)(HOST = host05-scan.domain.com)(PORT = 1521))
  (CONNECT_DATA =
    (SERVER = DEDICATED)
    (SID = instance3)
  )
)
But there are 8 nodes in this Exadata environment behind that “host05-scan” SCAN listener, so the connection string alone doesn’t tell us where the sessions live. Our records showed that particular instance was running on the 3rd node of the cluster, so we connected to that instance first to find the DB_UNIQUE_NAME:
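Once connected, pulling the DB_UNIQUE_NAME is a one-liner, something along these lines:

```sql
SELECT db_unique_name FROM v$database;

-- or, in SQL*Plus:
SHOW PARAMETER db_unique_name
```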
Now we have the DB_UNIQUE_NAME, we can determine where the different instances are:
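With Grid Infrastructure, srvctl maps instances to nodes. Roughly (the database name and node names below are placeholders, not the customer’s real output):

```
$ srvctl status database -d MYDB
Instance instance1 is running on node hostdb0501
Instance instance2 is running on node hostdb0502
...
```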
Now, the customer had told us which SIDs were problematic, so, stealing a script from Tim Hall’s forever-useful website and modifying it to show just those sessions, we could see this:
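Tim’s actual script is on his site; a cut-down stand-in with the columns that mattered to us looks roughly like this:

```sql
-- Simplified stand-in, not Tim Hall's script:
-- session details plus the matching OS process ID (SPID)
SELECT s.inst_id, s.sid, s.serial#, s.status,
       s.username, s.program, p.spid AS os_pid
FROM   gv$session s
       JOIN gv$process p
         ON  p.addr    = s.paddr
         AND p.inst_id = s.inst_id
WHERE  s.sid IN (165, 529);
```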
Fixing the Issue
So now we know a couple more things about these mysterious sessions:
- They’re on instance2, which we can tell by comparing the program name (oracle@hostdb0502) with the output of the srvctl command above
- They’re actually both parallel query slave processes (P095 and P288)
Well, that should make it easy. We can just connect to the second node and kill those processes off at the OS level, right?
Unfortunately, wrong. When we went and looked at the second node, neither of those two processes existed at the OS level. We could see processes P094 and P096, as well as P287 and P289, but the processes in between were just nowhere to be found. The situation just gets weirder and weirder!
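The OS-level check itself is just a process listing, something like this (the exact process names embed the local ORACLE_SID):

```
$ ps -ef | grep "ora_p[0-9]" | grep -v grep
```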
Next, we decided to delve deeper into the PQ side, using another carefully stolen script, this time from Amit Bansal, and it showed us a way out:
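Amit’s script is worth reading in full; the core idea is joining GV$PX_SESSION (which maps each PX slave to its query coordinator via QCSID) back to GV$SESSION. A simplified stand-in:

```sql
-- Simplified stand-in, not Amit Bansal's script:
-- map each PX slave session to its query coordinator (QC)
SELECT px.qcinst_id, px.qcsid, px.inst_id, px.sid, px.serial#,
       s.program, s.status
FROM   gv$px_session px
       JOIN gv$session s
         ON  s.sid     = px.sid
         AND s.inst_id = px.inst_id
ORDER  BY px.qcsid, px.sid;
```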
Can you see the way out?
If you look carefully, you can see our sessions with SIDs 529 and 165, both still active. More importantly, you can see two already-killed SQL*Plus sessions, SIDs 871 and 1786. If those sessions are already marked as killed in the database, is it possible that one of them was the query coordinator – the master – of our two missing PQ slaves? Well, since they’re already killed off in the database, we could kill the matching OS processes (if they still existed, of course!) without impacting any users. We checked those two SIDs using Tim’s script above, found the two matching SPIDs, and found those processes DID still exist at the OS level. We killed off one – nothing happened. We killed off the other, and lo and behold, the sessions with SIDs 529 and 165 magically disappeared. Hey presto! Issue solved!
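The final step, for completeness, was nothing fancier than this (the SPID shown is illustrative):

```
$ ps -p 12345        # confirm the SPID from the session script still exists
$ kill -9 12345
```

Since the database had already marked those coordinator sessions as killed, removing the orphaned OS processes simply let the instance finish its cleanup.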