Spikes in CPU starting at 3am?

We noticed that between 3am and 5am every morning, the CPU usage of our data warehouse reached its peak and at times came close to saturating all CPU on the server.

A check using dba_hist_active_sess_history (diagnostics pack required!) showed that the responsible sessions were running similar SQL:

select * from (
select session_type,sql_id,nvl(event,'cpu') event, module
 ,count(*) "COUNT(*)"
from dba_hist_active_sess_history ash
join dba_hist_snapshot ss
 on ss.snap_id = ash.snap_id
 and ss.dbid = ash.dbid
where ash.sample_time between to_date(:startdate_1 ,'dd/mm/yyyy hh24:mi:ss') and to_date(:enddate_1 ,'dd/mm/yyyy hh24:mi:ss')
and ss.begin_interval_time <= to_date(:enddate_1 ,'dd/mm/yyyy hh24:mi:ss')
and ss.end_interval_time >= to_date(:startdate_1 ,'dd/mm/yyyy hh24:mi:ss')
group by session_type,sql_id,nvl(event,'cpu'),module
order by count(*) desc)
where rownum <=5 order by "COUNT(*)"

SESSION_TYPE  SQL_ID        EVENT                          MODULE                         COUNT(*)
_____________ _____________ ______________________________ ______________________________ ________________________
FOREGROUND    1kz16yhs993h2 latch: undo global data        DBMS_SCHEDULER                      459
FOREGROUND    1k5c5twx2xr01 cpu                            DBMS_SCHEDULER                    2,347
BACKGROUND                  cpu                            backup incr datafile              2,970
FOREGROUND    1kz16yhs993h2 cpu                            DBMS_SCHEDULER                    3,798

5 rows selected.

select sql_id, sql_text from v$sql where sql_id in ('1kz16yhs993h2','1k5c5twx2xr01') and child_number = 0;

SQL_ID        SQL_TEXT
_____________ ____________________________________________________________________________________________________
1kz16yhs993h2 insert into sys.scheduler$_event_log (log_id, log_date, type#, name, owner, operation, status, user
 _name, client_id, guid, dbid, additional_info, destination, credential, class_id, flags) value
 s (:1, SYSTIMESTAMP, :2, :3, :4, :5, :6, :7, :8, :9, NULL, :10, :11, :12, :13, :14)

1k5c5twx2xr01 insert into sys.scheduler$_job_run_details (log_id, log_date, req_start_date, start_date, run_durat
 ion, instance_id, session_id, slave_pid, cpu_used, error#, additional_info) values (:1, SYSTIME
 STAMP, :2, :3, :4, :5, :6, :7, numtodsinterval(:8/100, 'second'), :9, :10)


These are the SQLs that get run as part of logging a dbms_scheduler job, so we must be attempting to run a lot more scheduled jobs around these times, right? We have an incremental ETL that is constantly running via scheduler jobs (the many dependencies are managed by scheduler chains), so we were already aware that we fire off a lot of jobs (our job_queue_processes is 24, so there are often many concurrent job sessions). In September last year, we upgraded our CDC processing from using archived redo logs to online redo logs – meaning that our incremental loads became much more frequent (from every 5 minutes to every few seconds, limited by processing time). We checked what gets logged to dba_scheduler_job_log between these hours and compared it to other hours in the day; nothing really stood out as being different. If anything, we executed fewer scheduled jobs.
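The hour-by-hour comparison can be sketched with a simple aggregation against dba_scheduler_job_log (the one-day window here is just illustrative):

select trunc(log_date, 'hh24') log_hour
 , count(*) log_entries
from dba_scheduler_job_log
where log_date >= systimestamp - interval '1' day
group by trunc(log_date, 'hh24')
order by log_hour;

If the 3am–5am hours had been the busy ones for job logging, they would stand out here – in our case they didn't.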

I decided to think about what else was going on in the system between these times. Yes, there were backup operations running, but those don't happen every day, and they are active for a couple of hours either side of this window – they couldn't explain the regular hits.

Our second biggest wait between these times was ‘latch: undo global data’. Like a lot of wait events, I had no idea what it was beyond it having something to do with undo. I googled it and found a few different things. Contention could come about when new transactions wish to reuse unexpired undo because there is no expired undo available (I immediately discarded this, as we have guaranteed undo retention – nothing will reuse unexpired undo without erroring). All of the causes came down to a session needing to read information about undo. Latches are different to locks: instead of forming an orderly queue and waiting in line, sessions spin on CPU while trying to obtain the latch – potentially getting it before another process that has been trying for much longer.
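For reference, whether undo retention is guaranteed can be confirmed from dba_tablespaces – a quick sanity check rather than anything clever:

select tablespace_name, retention
from dba_tablespaces
where contents = 'UNDO';

A RETENTION value of GUARANTEE is what rules out the unexpired-undo-reuse theory above.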

Something must be happening between 3am and 5am that causes the insert statements to read from undo – potentially lots of it. A lightbulb went off in my head: Oracle must purge its scheduler logs at some point to stop them from growing; perhaps the purge kicks off at 3am and generates a lot of undo from all the deleting. A quick check of dba_scheduler_jobs found the PURGE_LOG job and, yes, it's scheduled to start at 3am and usually completes a few minutes before 5am. So there's our smoking gun.
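The check itself is a one-liner against dba_scheduler_jobs, something like:

select job_name, repeat_interval, last_start_date, last_run_duration
from dba_scheduler_jobs
where job_name = 'PURGE_LOG';

The repeat_interval shows when it fires and last_run_duration shows how long the previous purge took – which in our case lined up neatly with the 3am–5am window.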

The next step was to decide how to address the problem; the obvious option was to reduce the amount of logging the scheduler does. There are already built-in job classes that will only log failed runs:

JOB_CLASS_NAME                 LOGGING_LEV LOG_HISTORY COMMENTS
------------------------------ ----------- ----------- ------------------------------------------------------------
SCHED$_LOG_ON_ERRORS_CLASS     FAILED RUNS             This is the default job class if you want minimal logging.
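Output like the above can be produced with a query along these lines:

select job_class_name, logging_level, log_history, comments
from dba_scheduler_job_classes
where logging_level = 'FAILED RUNS';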

Our code instruments itself in another table, so it was decided that logging failed runs was all we really needed anyway. I then identified the jobs that would benefit most from this – the top loggers. Three stood out miles from the competition, so we assigned this job class to them in DEV and made sure everything was fine.
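Finding the top loggers is another aggregation against dba_scheduler_job_log; a sketch, using the same top-N idiom as the earlier query:

select * from (
select owner, job_name, count(*) log_entries
from dba_scheduler_job_log
group by owner, job_name
order by count(*) desc)
where rownum <= 10;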

Assigning the job class was as simple as:

grant execute on SCHED$_LOG_ON_ERRORS_CLASS to job_owner;
exec dbms_scheduler.set_attribute('<job_owner>.<job_name>', 'job_class', 'SCHED$_LOG_ON_ERRORS_CLASS')

It had the desired effect in DEV: we kicked off a manual purge whilst jobs were running and things looked much better. We then followed our release process to get the change onto production, where the problem was also relieved. Hurray.
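The manual purge can be kicked off with dbms_scheduler.purge_log; a sketch (the job name is a placeholder):

exec dbms_scheduler.purge_log(log_history => 0, which_log => 'JOB_LOG', job_name => '<job_owner>.<job_name>')

With log_history => 0 everything for that job is purged immediately, which is what let us reproduce the contention on demand in DEV.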

There’s something not quite satisfying about how I decided that it must have been the purge job that caused the problem. I always say that Oracle is so well instrumented that there is never a need to guess, but that’s exactly how I got from the latch event around undo to the purge job. I decided to keep looking to see if I was missing anything that would have made the connection concrete and obvious.

I decided that understanding the latch event was crucial. I had a search of MOS and found the article “LATCH: UNDO GLOBAL DATA” In The Top Wait Events (Doc ID 1451536.1), which describes a probable cause and solution if the latch miss was caused by ktusm_stealext_2. I hadn't previously seen anywhere that told me which part of the Oracle kernel produced a latch miss, but a quick google suggested it should be easy to find in an AWR report. If it's in an AWR report, it should be in a dba_hist view, so I searched the documentation and found DBA_HIST_LATCH_MISSES_SUMMARY. A quick-and-dirty sample query revealed that the view stores (like many others do) a running total of the number of latch misses that have happened up to each snapshot; to find out how many happened in a given period, you have to take the difference between snapshots. I knocked up a quick script to get the differences in each snapshot and easily identify the offending location in the code:

select start_time, end_time, parent_name, where_in_code
 , nwfail, sleep_count, wtr_slp_count, total_sleep_count
from (
select cast(ss.begin_interval_time as date) start_time
 , cast(ss.end_interval_time as date) end_time
 , parent_name, where_in_code
 , nwfail_count -lag(nwfail_count) over (partition by parent_name, where_in_code order by lms.snap_id) nwfail
 , sleep_count -lag(sleep_count) over (partition by parent_name, where_in_code order by lms.snap_id) sleep_count
 , wtr_slp_count-lag(wtr_slp_count) over (partition by parent_name, where_in_code order by lms.snap_id) wtr_slp_count
 , max(sleep_count)over (partition by parent_name, where_in_code) -min(sleep_count) over (partition by parent_name, where_in_code) total_sleep_count
from dba_hist_latch_misses_summary lms
JOIN dba_hist_snapshot ss
 ON ss.snap_id = lms.snap_id
 and ss.dbid = lms.dbid
where ss.begin_interval_time <= TO_DATE(:enddate_1 ,'dd/mm/yyyy hh24:mi:ss')
and ss.end_interval_time >= TO_DATE(:startdate_1 ,'dd/mm/yyyy hh24:mi:ss')-interval'1'hour
and lms.parent_name like :parent_name
) where total_sleep_count > 0;

For this particular latch parent and for the early hours of the morning I saw this:


This made it incredibly clear: “ktudba: KSLBEGIN” was the offending where. This was not the “where” mentioned in the MOS article, so I had to keep looking. I found Bug 17532836 – LATCH: UNDO GLOBAL DATA ON KTUDBA: KSLBEGIN – but that didn't really get me anywhere. Back to google! I ended up having to use a translated page, https://translate.google.co.uk/translate?hl=en&sl=zh-CN&u=http://www.anbob.com/archives/2690.html, which mentions that the latch exists so that a session can identify the location of undo data, and that it is usually acquired very quickly. The author then goes on to identify large transactions active on the system at the time (by looking at the number of undo blocks used), which linked up to a session deleting an entire table. This is very much like our scenario, but I can't quite tell whether the problem was really due to the huge amount of undo our purge would have generated, or due to that undo being required by so many sessions at the same time.
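The equivalent check for large active transactions (ranked by undo blocks used) can be sketched against v$transaction, joined to v$session to see who is responsible:

select s.sid, s.serial#, s.module, t.used_ublk, t.used_urec
from v$transaction t
join v$session s
 on s.taddr = t.addr
order by t.used_ublk desc;

During the purge window, a row with a very large used_ublk belonging to the purge session would be the live-system counterpart of what the article found.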


It seems this problem stems from one session purging a table that many other sessions are trying to insert into. The achievable fix was to insert less by changing the job class of the frequently executing jobs.

A textbook way to implement this scenario would be to partition the logging table by a date column and then truncate/drop the partitions that are due to be purged. This would make the purge super fast and require very little undo. Queries against the logging table would also be faster if they filtered on the log date, as they could benefit from partition pruning. The complexity here comes from some job classes needing their logs kept for longer. Maybe Oracle should partition the logging tables by a purge date instead – general queries against the table would not really be affected performance-wise (unless you knew what the purge date would be), and it might allow more concurrency for inserts if they come from different job classes with different retentions (although I'm not sure that would be something to write home about). Hopefully, in the future, Oracle will implement this sort of solution, but for now we either need to log less or put up with the CPU spikes.
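As a sketch of the textbook approach – on a hypothetical demo table, not Oracle's real scheduler log DDL – interval partitioning by log date turns the purge into a cheap partition drop:

create table job_log_demo (
  log_id   number,
  log_date date not null,
  details  varchar2(200)
)
partition by range (log_date)
interval (numtodsinterval(1, 'day'))
(partition p0 values less than (date '2020-01-01'));

-- purging a day's logs becomes near-instant DDL with minimal undo,
-- instead of a big delete that other sessions must read around:
alter table job_log_demo drop partition for (date '2020-06-01') update indexes;

The update indexes clause keeps any global indexes usable, at the cost of some extra work during the drop.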