It has been quite a long time since I posted here. However, all I can say is that I’ve been very busy, and I’m happy that work is so plentiful that I barely have time for blogging!
Yesterday I got a call from a client who was having some trouble. Their run queue was through the roof with no explanation as to why. Inserts especially were causing them grief, hanging for quite a long time.
Identifying the Problem
The system is 10.2.0.4 64-bit on Solaris 10 with 16 CPU cores and 32GB RAM. To locate the problem, the first step was a little fact finding. What was going on when things started going bad? What happened in the past 24 hours that I should know about? Those questions, along with some basic queries, exposed the issue.
It turns out that the previous night a huge update had been run in a loop by mistake. The update ran for a while before it was finally killed. The following morning things seemed to run fine, but then massive waits started piling up on inserts and deletes against a large table used for tracking purposes (and therefore used constantly). I used the following query to find the bottlenecks currently contributing to the issue:
select event, count(*) from v$session_wait group by event;
This simple query is extremely powerful. For anyone who has not used the v$session_wait view, let me give you an introduction. It contains one row per SID (session logged into the instance), the session's current wait event, and supporting information in the p1, p2, and p3 columns (on 10g you can look up the meaning of these columns in v$event_name).
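To make that concrete, here is a hedged sketch of how you can decode the p1/p2/p3 columns for a wait event you find interesting. The event name used below is just an illustration; substitute whatever event your first query surfaces:

```sql
-- What do p1/p2/p3 mean for this event? (10g data dictionary)
select name, parameter1, parameter2, parameter3
  from v$event_name
 where name = 'db file sequential read';

-- Then match the decoded parameters against the live waits:
select sid, event, p1, p2, p3, seconds_in_wait
  from v$session_wait
 where event = 'db file sequential read';
```

For a buffer-read event like this one, the parameters resolve to file#, block#, and blocks, which lets you track a wait straight down to a data block.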
The results of this query showed a great many waits: library cache pins, cache buffers chains latches, and a few others. From experience on this system I knew those waits usually surface as symptoms of the real issue, so I looked for any waits that were less common and found these: 'enq: US - contention', 'wait for stopper event to be increased', and 'wait for a undo record'.
Those waits, along with the knowledge of the killed update from the previous night and the fact that mostly DML was suffering, were enough to form the hypothesis that undo was the cause of these issues. A quick check of 'top' showed the SMON process consuming a fair amount of CPU along with some PX processes, which definitely looked like fast-start parallel recovery. A quick search on Metalink turned up Note 464246.1 - DATABASE HANG DUE TO PARALLEL TRANSACTION RECOVERY. Hypothesis confirmed, for the most part.
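If you want to confirm from inside the database (rather than eyeballing 'top') that those PX processes really are recovery slaves, a quick sketch against V$FAST_START_SERVERS does the trick; each row is one parallel rollback slave working for SMON:

```sql
-- One row per fast-start parallel rollback slave;
-- STATE shows RECOVERING or IDLE, XID is the transaction being recovered.
select state, undoblocksdone, pid, xid
  from v$fast_start_servers;
```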
Parallel Transaction Recovery
When you perform DML in Oracle, your changes will go to UNDO (among other places). Specifically, your changes are placed into an UNDO segment inside your UNDO tablespace. Please note that on RAC each node has its own active UNDO tablespace, while on single-instance systems only one UNDO tablespace may be active at a time (unless you're switching to a new one). But what does UNDO actually accomplish?
- Transaction rollback. UNDO is the mechanism by which you can undo your changes before a commit with the ROLLBACK command. Also, if a process is abnormally terminated, UNDO will be used to clean up any uncommitted transactions.
- Flashback. Oracle 10g is sometimes called the Flashback Database. Flashback Query and Flashback Table (but not flashback of a dropped table, which uses the recycle bin) utilize UNDO.
- Read consistency. If you start a query at 9AM and it takes 3 hours, are the results you see from 9AM or noon? The answer is 9AM, the moment you pressed the enter key. Oracle accomplishes this through read consistency, which pulls the data you require from current data blocks or from consistent-read blocks built out of UNDO. If you do not have enough UNDO to satisfy your long-running queries, you may receive an 'ORA-01555: snapshot too old' error.
- Recovery. If your instance is abnormally terminated (kill -9 pmon, shutdown abort, janitor trips over a cable), SMON will handle recovery on instance startup. This involves four phases: mount the database, roll forward all redo since the last checkpoint into the datafiles, open the database, and roll back all uncommitted transactions.
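As an aside, you can watch how much UNDO your currently open transactions are holding, which is handy for spotting the session that will be expensive to roll back. A minimal sketch joining v$transaction to v$session (used_ublk is undo blocks held, used_urec is undo records):

```sql
-- Undo held by active transactions, biggest offender first.
select s.sid, s.username, t.used_ublk, t.used_urec
  from v$transaction t
  join v$session s on s.taddr = t.addr
 order by t.used_ublk desc;
```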
Undo Recovery
That last bullet point is the reason things went bad for my client. When Oracle rolls back all uncommitted transactions during an instance startup, SMON will spawn parallel processes to do the work. In addition, the server processes of connected users will also try to roll back uncommitted transactions if they stumble upon any while doing their own work. During this time, if any processes are generating excessive amounts of UNDO, things will get worse and worse. For instance, inserts to a table by many connected users, while backend jobs try to keep the data clean with deletes, can cause a lot of UNDO contention.
The FAST_START_PARALLEL_ROLLBACK parameter controls how much juice SMON will take. The three possible values are FALSE, LOW, and HIGH. A setting of FALSE tells SMON to do the recovery on its own without parallel assistance. This could work to allow your app to remain online while SMON does its job, but could still cause contention issues. In the end, it is better to let SMON finish the work while other processes lay off UNDO as much as possible. A setting of LOW tells SMON to use 2 * CPU_COUNT parallel processes to do the work, and a setting of HIGH uses 4 * CPU_COUNT. To allow SMON to finish its work as quickly as possible, we shut down their app and set FAST_START_PARALLEL_ROLLBACK to HIGH, giving SMON all the resources it needed to finish. To determine how long it would take, we consulted the V$FAST_START_TRANSACTIONS view (hey, a V$ view that's pluralized!).
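For reference, the parameter change looked roughly like this. It is a dynamic parameter, so ALTER SYSTEM takes effect without a restart:

```sql
-- Throttle SMON's parallel rollback up; valid values are FALSE, LOW, HIGH.
alter system set fast_start_parallel_rollback = 'HIGH';

-- Or back it off entirely if the recovery slaves are starving the app:
alter system set fast_start_parallel_rollback = 'FALSE';
```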
select state,
       undoblocksdone,
       undoblockstotal,
       undoblocksdone / undoblockstotal * 100
  from v$fast_start_transactions;
STATE UNDOBLOCKSDONE UNDOBLOCKSTOTAL UNDOBLOCKSDONE/UNDOBLOCKSTOTAL
---------------- -------------- --------------- ------------------------------
RECOVERING 94160 95432 98.6671138
The result you see there was, of course, near the end, with only 1,272 blocks left to go. Once the work completed, we were able to bring the app back up.
Undone by Undo
But then, confusion struck. The run queue started creeping up again, and all the while one question kept being asked: "Why didn't this problem affect us until the afternoon?" Soon contention was back up on the system, with the 'enq: US - contention' wait in the forefront. Even though fast-start parallel rollback had finished, contention was still high.
A small amount of further investigation provided the answer: background jobs overlapping while deleting from high-concurrency tables. Multiple jobs were attempting the same deletes over and over while the app was still trying to insert into those tables, causing locks and UNDO contention. That also answered the question of the delayed onset. The system could live with the amount of UNDO being generated during the quieter morning hours; once things ramped up, and with fast-start parallel rollback taking so much horsepower, the system started breaking down at all levels. Before calling me the client had also done a shutdown abort and a Data Guard failover, which unfortunately made the problem worse (more UNDO to recover). The result was a catastrophic UNDO pile-up.
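When you suspect this kind of lock pile-up, a quick sketch against v$session shows who is stuck behind whom. The BLOCKING_SESSION column exists as of 10g, so it applied here:

```sql
-- Who is blocking whom right now: each row is a stuck session
-- and the SID of the session holding it up.
select sid, blocking_session, event, seconds_in_wait
  from v$session
 where blocking_session is not null;
```

Seeing the same handful of blocking SIDs over and over, all owned by the background jobs, is what pointed at the overlapping deletes.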
The Solution
In the end though, we can’t blame UNDO for the problem. It was only doing its job to preserve the integrity of the system. However, we can take away two things from this problem:
- In the case of excessive CPU usage due to UNDO following an instance failure, SMON can be throttled with the FAST_START_PARALLEL_ROLLBACK parameter. You can monitor the progress using the V$FAST_START_TRANSACTIONS view.
- Take care of your apps! Any scripts that perform DML should include a portion at the top that checks to see if it is already running (ps -ef | grep [p]urge.sh for instance). This will take care of overlapping process issues that are bound to cause contention. In addition, avoid unnecessary DML. Do you really need to insert/update/delete as much as you are? Is there another way that perhaps uses memory, or even flat files to avoid a constant barrage on the database layer?
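The overlap check above can be sketched in a few lines of shell. This is a minimal illustration, not their actual script: the lock path and the purge.sh name are assumptions, and mkdir is used instead of grepping ps because directory creation is atomic, so two copies starting at the same instant cannot both win.

```shell
#!/bin/sh
# Overlap guard for a hypothetical purge.sh: only one copy may run at a time.
# The lock path is an assumption for illustration.
LOCKDIR=/tmp/purge_demo.lock

if mkdir "$LOCKDIR" 2>/dev/null; then
  # We won the lock; release it when the script exits, however it exits.
  trap 'rmdir "$LOCKDIR"' EXIT
  echo "lock acquired; running purge"
  # ... the actual DML work would go here ...
else
  echo "another purge is already running; exiting"
  exit 1
fi
```

A production version would also handle a stale lock left behind by a crashed run (for example by recording the PID inside the lock directory and checking whether that process still exists).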
In the end, the big conclusion is this: anything in excess can be problematic, especially when there are waits involved.
Dear Steven,
Your explanation of the problem is crystal clear. I solved our database problem with your blog's notes. I hope others will also get help out of this. Keep posting more articles.
Rgds,
Siva
Awesome blog. I had the same situation yesterday in one of my databases; recovery had been running for 10 hours. When I set fast_start_parallel_rollback=HIGH, recovery completed in 30 minutes. I can't believe it.
Thank you once again for sharing wonderful information.
Thanks,
-Vijay
Hi,
Thanks for the wonderful article.
I faced just the same problem with Oracle 10g R1 on Windows 2003 32-bit. While the database is starting, SMON hangs other processes because it is recovering a large amount of UNDO, and I can see it waiting on "wait for a undo record".
I have already set the parameter fast_start_parallel_rollback=high.
Still it takes ages to complete. It’s been one hour now and here’s the output:
SYS @ dawa> select UNDOBLOCKSDONE , UNDOBLOCKSTOTAL
2 from V$FAST_START_TRANSACTIONS
3 /
UNDOBLOCKSDONE UNDOBLOCKSTOTAL
-------------- ---------------
98 659741
Oracle Support is saying I have to upgrade to 11.1,
but how can I upgrade if I cannot cleanly shut down the database!
Do you have any idea how to solve this issue? It's severely hurting our business.
Yup, great explanation of the real time issue.
Thanks
Shakir Mustafa
Thanks, my issue too! I saw a lot of buffer-related waits ("buffer exterminate" waits, "wait for stopper event to be increased", and so on); they all vanished when I created another undo tablespace of the same size and switched the undo_tablespace parameter to the new one.
Note: many times Oracle shows undo as 99% full even though the sessions using it are no longer active or logged in. It won't release the space even if you restart the instance, so the best way is to create a standby undo tablespace of similar size and switch the instance to it whenever the existing undo free space falls below 5 to 10%. This avoids downtime as well.
i.e.
create undo tablespace undo_2 datafile '/mnt1/your_undo2.dbf' size 20g autoextend off;
alter system set undo_tablespace='UNDO_2';
Whenever this one fills up, drop and recreate the earlier one (assuming undotbs1).
First confirm the older undo tablespace is not active: check its file_id in dba_data_files
and make sure no session currently has an active transaction on it
(select ubafil from v$transaction -- this shows the file# of active undo datafiles).
Once you are sure no one is active on it:
drop undo tablespace undotbs1;
create undo tablespace undotbs1 datafile '/your_dropped_undotbs1.dbf' reuse autoextend off;
And then repeat:
alter system set undo_tablespace=undotbs1;
Thanks, It worked for me.
Is Burleson a colleague of yours??
http://www.dba-oracle.com/t_parallel_transaction_recovery_undo_cpu.htm