4 EASY Lessons to enhance your performance diagnostics

This is just a quick post that covers everything I believe anyone should know about their Oracle performance problems. Most problems seen on the OTN are trivial to solve as soon as you can see this further information, and this information really is EASY to get.

1) Use dbms_application_info

This is a small but powerful package, all it does it is tell Oracle what your session is doing. It means that information is recorded against your v$session row, so when you notice some session is burning through CPU or writing tons of redo
or doing tons of physical IO, you can easily associate that session with the application process so you know exactly what code is being called. Think about it the other way, some end user complains that their critical process is taking a long time
you have hundreds of sessions connected, how on earth do you associate what the end user is doing and their session on the database? Their client identifier, because the application uses this package to set it.
Seriously, start instrumenting your code with it everywhere. Oracle already does this in dbms_scheduler jobs – by default a job will have the module of dbms_scheduler and the action will be the job_name.

Here’s a basic query that I can use to see what sessions exist for a module and what are they running and what event are they currently waiting on (or most recently waited on).

ANDY@pdb1>select sid, sql_id, sql_child_number, event, state from v$session where module ='SillyModule';

       SID SQL_ID        SQL_CHILD_NUMBER EVENT                          STATE
---------- ------------- ---------------- ------------------------------ -------------------
       259 74ah5jrb24sq5                0 latch: shared pool             WAITED SHORT TIME

2) You can’t get aggregate from detail

That last query tells us very little, you can’t assume that what a session was doing when you sampled it was what it’s been doing all day. Luckily there are many views aggregated over the life of the session that can help us.

v$session_event is one of the views I use all the time, here’s a quick and dumb script using it:

get session_event
var sid number
exec :sid := &1
select event, total_waits, time_waited, time_waited_micro, average_wait, max_wait
from v$session_event where sid = :sid
order by time_waited_micro;

If we call @session_event 259 (the sid from before) then we get the following:

ANDY@pdb1>@session_event 259

EVENT                          TOTAL_WAITS TIME_WAITED TIME_WAITED_MICRO AVERAGE_WAIT   MAX_WAIT
------------------------------ ----------- ----------- ----------------- ------------ ----------
buffer busy waits                        1           0                14            0          0
SQL*Net break/reset to client            4           0               630          .02          0
events in waitclass Other              278           0              3369            0          0
SQL*Net message to client               63           0              3495          .01          0
latch: shared pool                      33           2             17696          .05          0
Disk file operations I/O                 4           2             19183          .48          1
db file scattered read                 108          23            225372          .21          1
db file sequential read                181          24            239792          .13          0
SQL*Net message from client             62      354755        3547550028      5721.85     191541

This tells us that most of the wait events are down to SQLNet message from client, that’s because the client is SQLPlus and the session is mainly inactive until I tell it to do something. This is an idle event, nothing is really happening on the DB.
The second largest time waited is for db file sequential read, this is a wait event associated with index reads of tables. We can see we do 181 of these and on average they take 0.13 hundredths of a second (yes, that is the unit).
In (very close) third place is the db file scattered read event, this is associated with full table scans and.

It’s important to note that just because these wait events exist, doesn’t mean we can do much about them. It may be that it is much easier (which means cheaper in human cost) to solve whatever was causing the Disk file operations I/O and that might be okay.
Maybe your critical process only needs to be a 0.02 seconds quicker (okay maybe it’d be a little more in real life, it’s hard to demonstrate performance problems on a tablet computer!).

If a session isn’t waiting, it’s working, we also should know how much time it actually spent working – that is on the CPU. Luckily, that’s something else that’s captured for us, the v$sess_time_model view gives us this information and even helps explain why:
Here’s the output of a query against that same session using v$sess_time_model

ANDY@pdb1>select stat_name, value from v$sess_time_model where value > 0 and sid = :sid order by value
  2  /

STAT_NAME                                                             VALUE
---------------------------------------------------------------- ----------
repeated bind elapsed time                                             6802
failed parse elapsed time                                              8812
sequence load elapsed time                                            12471
connection management call elapsed time                              174237
hard parse (sharing criteria) elapsed time                           212315
PL/SQL compilation elapsed time                                      315875
PL/SQL execution elapsed time                                        888086
hard parse elapsed time                                             1815114
parse time elapsed                                                  1854917
DB CPU                                                            100468750
sql execute elapsed time                                          102401294
DB time                                                           102934270

We need to appreciate that some of these statistics overlap, the most obvious one being DB CPU time is part of DB time and part of it will be sql execute elapsed time too.
This output clearly tells us that our session was mainly working and not waiting, our DB CPU is a huge percentage of our DB time.

3) Read the REAL execution plans being executed

Stop using explain plan to tell you what’s going on, explain plan is flawed. Luckily, you can just ask Uncle Oracle to give you the real plan that has been used from memory. All you need to know is the sql_id, which you can identify
using v$session if a known session is executing it currently (remember if your session is instrumented with dbms_application_info this is EASY) or you can try and search v$sql – I have a script I use all the time for this, here’s a simplified version:

get sql_find
select * from (select sql_id, last_active_time, executions, substr(sql_text,1,100) sql_text from v$sql where sql_text not like '%v$sql%' and upper(sql_text) like UPPER('%&1%') order by last_active_time desc)
where rownum <=15
/

Another thing to remember is that the first session to parse a child cursor will have it’s module and action reportable in v$sql – another win for instrumentation. Once you have the sql_id, it’s just a matter of calling dbms_xplan.display_cursor

ANDY@pdb1>@sql_find "select object_type, count(*) from big_table group by object_type"

SQL_ID        LAST_ACTIVE_TIME    EXECUTIONS
------------- ------------------- ----------
SQL_TEXT
-------------------------------------------------------------------------------------------------------
22tq3p0gsd63k 20/03/2017 21:39:53          1
select object_type, count(*) from big_table group by object_type

ANDY@pdb1>select * from table(dbms_xplan.display_cursor('22tq3p0gsd63k'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------
SQL_ID  22tq3p0gsd63k, child number 0
-------------------------------------
select object_type, count(*) from big_table group by object_type

Plan hash value: 3721451536

-------------------------------------------------------------------------------------------------
| Id  | Operation           | Name      | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |           |       |       |   377 (100)|          |       |       |
|   1 |  PARTITION LIST ALL |           |    35 |   350 |   377   (2)| 00:00:01 |     1 |    36 |
|   2 |   HASH GROUP BY     |           |    35 |   350 |   377   (2)| 00:00:01 |       |       |
|   3 |    TABLE ACCESS FULL| BIG_TABLE | 68112 |   665K|   374   (1)| 00:00:01 |     1 |    36 |
-------------------------------------------------------------------------------------------------


15 rows selected.

See how I just used the sql_id as the first argument? Sometimes a sql_id is responsible for many child cursors which could mean multiple plans here, if you know which child cursor you’re using (it’s v$session.sql_child_number) then you can use it as the second parameter

ANDY@pdb1>select * from table(dbms_xplan.display_cursor('22tq3p0gsd63k',0));

Would give you the same plan as earlier (you can see the child number in the previous output).

4) Row source execution statistics

But what if I told you, you could get Oracle to not only tell you the plan, but actually the REAL work done on each line of the plan? And you don’t even have to pay for the diagnostics pack? Yes, I’m not kidding. All you need to do is
tell Oracle to collect these extra statistics for your execution, that can be done in two ways

Either
set statistics_level=all
or
hint gather_plan_statistics

And then we just ask Oracle to share the collected information, this is just an extra parameter to the dbms_xplan call – the format parameter. I’ll demonstrate using the statistics_level parameter:

alter session set statistics_level=all;
select object_type, count(*) from big_table group by object_type;
select * from table(dbms_xplan.display_cursor('22tq3p0gsd63k',NULL,'allstats typical last'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  22tq3p0gsd63k, child number 0
-------------------------------------
select object_type, count(*) from big_table group by object_type

Plan hash value: 3721451536

-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name      | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |           |        |       |   377 (100)|          |       |       |       |       |          |
|   1 |  PARTITION LIST ALL |           |     35 |   350 |   377   (2)| 00:00:01 |     1 |    36 |       |       |          |
|   2 |   HASH GROUP BY     |           |     35 |   350 |   377   (2)| 00:00:01 |       |       |  1394K|  1394K|  508K (0)|
|   3 |    TABLE ACCESS FULL| BIG_TABLE |  68112 |   665K|   374   (1)| 00:00:01 |     1 |    36 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------

Note
-----
   - Warning: basic plan statistics not available. These are only collected when:
       * hint 'gather_plan_statistics' is used for the statement or
       * parameter 'statistics_level' is set to 'ALL', at session or system level

SQL_ID  22tq3p0gsd63k, child number 1
-------------------------------------
select object_type, count(*) from big_table group by object_type

Plan hash value: 3721451536

----------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name      | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |           |      1 |        |       |   377 (100)|          |       |       |     35 |00:00:00.02 |    1703 |       |       |          |
|   1 |  PARTITION LIST ALL |           |      1 |     35 |   350 |   377   (2)| 00:00:01 |     1 |    36 |     35 |00:00:00.02 |    1703 |       |       |          |
|   2 |   HASH GROUP BY     |           |     36 |     35 |   350 |   377   (2)| 00:00:01 |       |       |     35 |00:00:00.02 |    1703 |  1394K|  1394K|  497K (0)|
|   3 |    TABLE ACCESS FULL| BIG_TABLE |     36 |  68112 |   665K|   374   (1)| 00:00:01 |     1 |    36 |  68112 |00:00:00.01 |    1703 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------


36 rows selected.

You may notice that I used NULL for my child number, that’s because in changing this parameter I won’t be able to reuse other child cursor I’ve already created because I’ve changed my optimizer settings. I choose the format parameter of

format=>'allstats typical last'

That means:
include both memory and IO stats (who knows what could be important until you look)
include typical plan options (otherwise you won’t see the cost column and others)
only show statistics for the last execution (if you wanted to aggregate many executions then don’t use this, also this means that parallel slaves won’t have their statistics included so you wouldn’t use last there either)

The output tells us that we did 36 full table scans against big_table (starts column on line 3) this is because it has 36 partitions (it was partitioned when I was playing around with the auto list partitioning feature of 12.2 so has number of object_types + 1).
The total number of rows selected in those 36 full table scans is 68112 (A-Rows on line 3) and it took 1703 buffers to do this work (but no physical reads, this would be shown in the reads column which is missing because we did none).
Next the row source moves up to the hash group by, we can tell that this line started 36 times too, it happened once per partition – each time we only did a hash group by on the rows of one partition.
The total rows returned from our hash group by on line 2 is 35, it takes 497K in memory in order to execute these hash group bys.

dbms_xplan.display_cursor also has a very neat default option, it can use the most recently executed sql for that session without you knowing the sql_id. The thing to remember is that when you have serveroutput turned on, sqlplus (and other clients)
will automatically fire off it’s own queries after yours to check for lines in the dbms_output buffer – so turn it off first! Here’s a demo where I use the gather_plan_statistics hint to enable the row source execution statistics gathering

alter session set statistics_level=typical;
set serverout off 
select /*+gather_plan_statistics*/object_id from big_table where object_type='TABLE' and object_name like 'BIG_TABLE%';
select * from table(dbms_xplan.display_cursor(NULL,NULL,'allstats typical last'));

SQL_ID  g91kfkbn2m9a4, child number 0
-------------------------------------
select /*+gather_plan_statistics*/object_id from big_table where
object_type='TABLE' and object_name like 'BIG_TABLE%'

Plan hash value: 4051153611

---------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name      | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |           |      1 |        |       |    14 (100)|          |       |       |      0 |00:00:00.01 |      52 |
|   1 |  PARTITION LIST SINGLE|           |      1 |      1 |    29 |    14   (0)| 00:00:01 |   KEY |   KEY |      0 |00:00:00.01 |      52 |
|*  2 |   TABLE ACCESS FULL   | BIG_TABLE |      1 |      1 |    29 |    14   (0)| 00:00:01 |     3 |     3 |      0 |00:00:00.01 |      52 |
---------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("OBJECT_NAME" LIKE 'BIG_TABLE%')

Not only do we have our plan, we have our execution statistics and we have the sql_id. EASY!

But how does this help you? Well in a complex real life plan, you may have no idea where to start looking. By looking at the row source execution statistics you can see exactly where your time is going, you can choose exactly where to start your battle.

Summary

Hopefully, you’ll agree that these are really easy to do. You can’t start to solve a performance problem until you know what’s causing it, and these 4 things will help you identify that cause very quickly with little effort.

Advertisements