Improve performance of complex Top-N queries by adding an additional order by clause

I’ve been looking at a lot of application SQL this week with the hope of getting some easy changes done to boost performance. I go about this by looking at where the time is going and seeing what can be done to fix that problem. A few of the problematic SQLs follow the pattern of

-- optimizer_mode = first_rows_10
select  *
from    main_table m
join    lookup_table l1
  on    ...
join    lookup_table l2
  on    ...
join    lookup_table l3
  on    ...
where   m.slightly_selective = :blah1
and     m.not_at_all_selective_but_necessary = :blah2
order by m.some_date , m.primary_key

i.e. find the rows you care about, join them to a lot of other tables to fill out some data, and return the earliest 10 rows (ordering against the main table columns only, and fetching only 10 results).

If you have an index that covers both your main filter columns and the order by clause, you can end up with a nice first rows plan that loops through the rows in the desired order and stops once it has 10 results. If you don’t have that index, you end up visiting every row that matches the predicates, joining them all to the lookup tables, sorting them, and only then taking the first 10 rows. I aim to improve this scenario without having to add indexes to cover every order by condition.
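For reference, the “first 10 rows” behaviour in that pattern comes either from the session optimizer mode or from explicit Top-N syntax. A minimal sketch of both options, reusing the made-up tables from the pattern above:

alter session set optimizer_mode = first_rows_10;

-- or, from 12c onwards, ask for the Top-N rows explicitly:
select *
from   main_table m
where  m.slightly_selective = :blah1
order by m.some_date, m.primary_key
fetch first 10 rows only;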

Problem Demo

First we set up some tables

create table big_table as select * from all_objects;
create table big_table_2 as select * from big_table;

create index big_idx_perfect on big_table (object_type,object_name,object_id);
create unique index big2_idx1 on big_table_2 (object_id);

Our query will be

select *  
from big_table t1 
join big_table_2 t2 
on t1.object_id = t2.object_id 
where t1.object_type ='TABLE' 
order by t1.object_name, t1.object_id 

We’ve got an index which covers both our filter and the order by columns in the right order, so the explain plan we get is

-------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                 |    11 |  2948 |    19   (0)| 00:00:01 |
|   1 |  NESTED LOOPS                 |                 |    11 |  2948 |    19   (0)| 00:00:01 |
|   2 |   NESTED LOOPS                |                 |    11 |  2948 |    19   (0)| 00:00:01 |
|   3 |    TABLE ACCESS BY INDEX ROWID| BIG_TABLE       |  1918 |   250K|     8   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN          | BIG_IDX_PERFECT |    11 |       |     3   (0)| 00:00:01 |
|*  5 |    INDEX UNIQUE SCAN          | BIG2_IDX1       |     1 |       |     0   (0)| 00:00:01 |
|   6 |   TABLE ACCESS BY INDEX ROWID | BIG_TABLE_2     |     1 |   134 |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("T1"."OBJECT_TYPE"='TABLE')
   5 - access("T1"."OBJECT_ID"="T2"."OBJECT_ID")

Note how there is no sort operation in the plan, because we access everything in the desired order using the index.

What happens when we replace the perfectly crafted index with one that only allows us to filter?

drop index big_idx_perfect;
create index big_idx_filter  on big_table (object_type,object_name);
-------------------------------------------------------------------------------------------
| Id  | Operation           | Name        | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |             |  1918 |   501K|       |   489   (5)| 00:00:01 |
|   1 |  SORT ORDER BY      |             |  1918 |   501K|   624K|   489   (5)| 00:00:01 |
|*  2 |   HASH JOIN         |             |  1918 |   501K|       |   388   (6)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| BIG_TABLE   |  1918 |   250K|       |   191   (4)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| BIG_TABLE_2 | 72892 |  9538K|       |   196   (7)| 00:00:01 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."OBJECT_ID"="T2"."OBJECT_ID")
   3 - filter("T1"."OBJECT_TYPE"='TABLE')

It turns out that my index isn’t even good enough to provide selectivity (no surprise if you look at what it’s filtering on). By losing the ordered index access, we now have to fetch all the matching rows and sort them: full table scans and a hash join just to return 10 rows.

Problem Fixing

What if we were to tell Oracle to apply an additional order by once it has read from our first table? Once the rows are in order, it can use a first rows approach to join from the ordered rows until it reaches the right number of results (or it finishes). This is easy to do if you can rewrite the SQL: you just need to add a subquery

select *  
from (select * from big_table order by object_name, object_id ) t1 
join big_table_2 t2 
on t1.object_id = t2.object_id 
where t1.object_type ='TABLE' 
order by t1.object_name, t1.object_id 
-------------------------------------------------------------------------------------------
| Id  | Operation           | Name        | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |             |  1918 |   501K|       |   489   (5)| 00:00:01 |
|   1 |  SORT ORDER BY      |             |  1918 |   501K|   624K|   489   (5)| 00:00:01 |
|*  2 |   HASH JOIN         |             |  1918 |   501K|       |   388   (6)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| BIG_TABLE   |  1918 |   250K|       |   191   (4)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| BIG_TABLE_2 | 72892 |  9538K|       |   196   (7)| 00:00:01 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("BIG_TABLE"."OBJECT_ID"="T2"."OBJECT_ID")
   3 - filter("BIG_TABLE"."OBJECT_TYPE"='TABLE')

Oh, it’s the same plan that sorts all the results before returning anything. What happened? Well, Oracle thinks it knows better than us and has rewritten the query back to how it was before. If we grab the outline from the plan, we can see whether we can control the behaviour.

select * from dbms_xplan.display(format=>'outline')
...
Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      FULL(@"SEL$73523A42" "T1"@"SEL$2")
      USE_HASH(@"SEL$F1D6E378" "T2"@"SEL$1")
      LEADING(@"SEL$F1D6E378" "T1"@"SEL$1" "T2"@"SEL$1")
      FULL(@"SEL$F1D6E378" "T2"@"SEL$1")
      NO_ACCESS(@"SEL$F1D6E378" "T1"@"SEL$1")
      OUTLINE(@"SEL$1")
      OUTLINE(@"SEL$3")
      OUTLINE(@"SEL$2")
      MERGE(@"SEL$1" >"SEL$3")
      OUTLINE_LEAF(@"SEL$F1D6E378")
      ELIMINATE_OBY(@"SEL$2")
      OUTLINE_LEAF(@"SEL$73523A42")
      FIRST_ROWS(10)
      DB_VERSION('18.1.0')
      OPTIMIZER_FEATURES_ENABLE('18.1.0')
      IGNORE_OPTIM_EMBEDDED_HINTS
      END_OUTLINE_DATA
  */

It’s the ELIMINATE_OBY which looks suspicious to me, so we want to tell Oracle not to do that. Let’s look up the counter hint

select inverse from v$sql_hint where name ='ELIMINATE_OBY';

INVERSE
----------------------------------------------------------------
NO_ELIMINATE_OBY

Easy! So we need to stick that in our subquery too

Solution – Order By Placement

select *  
from (select /*+ no_eliminate_oby */ * from big_table t1 order by t1.object_name, t1.object_id ) t1 
join big_table_2 t2 
on t1.object_id = t2.object_id 
where t1.object_type ='TABLE' 
order by t1.object_name, t1.object_id 
--------------------------------------------------------------------------------------------
| Id  | Operation                    | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |             |    10 |  6070 |    11   (0)| 00:00:01 |
|   1 |  NESTED LOOPS                |             |    10 |  6070 |    11   (0)| 00:00:01 |
|   2 |   NESTED LOOPS               |             |    10 |  6070 |    11   (0)| 00:00:01 |
|   3 |    VIEW                      |             |  1918 |   885K|     1   (0)| 00:00:01 |
|   4 |     SORT ORDER BY            |             |  1918 |   250K|   192   (5)| 00:00:01 |
|*  5 |      TABLE ACCESS FULL       | BIG_TABLE   |  1918 |   250K|   191   (4)| 00:00:01 |
|*  6 |    INDEX UNIQUE SCAN         | BIG2_IDX1   |     1 |       |     0   (0)| 00:00:01 |
|   7 |   TABLE ACCESS BY INDEX ROWID| BIG_TABLE_2 |     1 |   134 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - filter("T1"."OBJECT_TYPE"='TABLE')
   6 - access("T1"."OBJECT_ID"="T2"."OBJECT_ID")

Success! And we’ve even got a lower cost, suggesting this transformation was not costed.

How do we know whether this rewrite would be beneficial?

This has an easy answer: benchmark it.

In reality, first rows style plans are still going to be difficult to get. TABLE ACCESS BY INDEX ROWID BATCHED and adaptive joins are a couple of features that have been introduced which prevent ordered results from keeping their order as you join. You would have to give these up (and probably a few other performance improvements) in order to benefit. In my opinion, the balance is still going to be heavily weighted towards my query rewrite being faster – better to not read rows at all than to read a lot of rows with faster reads.

If Oracle could cost this transformation then it would be a very easy win for complex Top-N queries. In the meantime, we will have to identify cases that look like they would benefit and do the rewrite ourselves.

Identifying Candidate SQLs

In theory, we should be able to look for query plans that could benefit. The following criteria could be considered:

  • Order by in the select statement
  • First_rows optimizer mode/rownum filter after the order by/fetch first rows syntax
  • All columns in the order by are coming from one table
  • This table is driving the query
  • Optimizer expects many more rows to be found in the driving table than it expects to want to fetch

I will try to write up a query which can identify these cases in bulk in the future.

For those queries, all we need to do is take the order by clause, replicate it in a subquery against the driving table, and add our NO_ELIMINATE_OBY hint.
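As a very rough starting point (an assumption-heavy sketch that checks only a fraction of the criteria above), something like this can surface first rows cursors that still contain a blocking sort with a large estimated row count:

select s.sql_id, s.child_number, s.optimizer_mode, p.cardinality
from   v$sql s
join   v$sql_plan p
  on   p.sql_id = s.sql_id
 and   p.child_number = s.child_number
where  s.optimizer_mode like 'FIRST_ROWS%'
and    p.operation = 'SORT ORDER BY'
and    p.cardinality > 1000;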

The real solution would be for this query transformation to be done by the optimizer and costed. The Oracle cost-based optimizer has been able to do a similar transformation with group by clauses since 11g. If the optimizer is able to do the transformation for you, then tuning this in live applications can be as simple as creating a SQL Plan Baseline or a SQL Patch to inject a hint.
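To illustrate the SQL Patch route, here is a sketch using dbms_sqldiag.create_sql_patch (I’m assuming the 12.2+ function form; the sql_id is a placeholder and the query block name for the hint would come from the outline, as we saw earlier):

declare
  l_patch varchar2(128);
begin
  l_patch := dbms_sqldiag.create_sql_patch(
               sql_id    => 'abcd1234efgh5',          -- placeholder sql_id
               hint_text => 'NO_ELIMINATE_OBY(@"SEL$2")',
               name      => 'order_by_placement_patch');
end;
/

If it misbehaves, it can be removed again with dbms_sqldiag.drop_sql_patch.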

Myth Busting: Order by cardinality in a composite index

There’s a lot of advice floating around about how to order the columns in a composite index. The suggestions are usually decent enough as a starting point (so long as they remember the Golden Rule of Indexing) but I often see something similar to:

For a query with equality conditions, columns which have more distinct values should appear first

To be clear, this is not a bad rule of thumb. The problem is that it comes from a misunderstanding of how indexes work, and a reader may absorb the same misunderstanding, leading to further confusion. The misunderstanding is that the index traversal is done by searching the index for the first column and then applying another search for the second column – this is not the case. I’m going to discount the situation where you have multiple queries and some of those queries only filter on one of your columns (if you’re in that situation, leading with that column is what you’re going to do).

A composite B-tree index is just an ordered list of row pointers; the order is given by the column list in the create index statement. This ordering allows it to be searched efficiently. If you have only equality conditions against the columns in the index, then you know exactly where to start reading the index and you know when you’ve read all the data you need. The order of these columns makes no difference to how you find the starting point or how much data you need to read from the index.

Example

Maybe you need convincing – good! Let’s create a table of 1,000,000 rows, consisting of a primary key, a column with 100,000 distinct values (code_100000) and a column with 2 distinct values (code_2). The two non-primary key columns are the columns we’re filtering on – they have a massive difference in selectivity, so it should be easy to show whether a difference in performance exists.

The query that we want to optimize looks like:

select id, code_2, code_100000 from myth_table where code_2=1 and code_100000=1;

If we were to follow the mentioned rule of thumb, we would create the index with code_100000 as the leading column and code_2 as the next column:

create index myth_table_rule on myth_table (code_100000, code_2) ;

And if we were to ignore that rule and go in the opposite order we would have:

create index myth_table_andy on myth_table (code_2, code_100000) ;

What really happens

Oracle

I’m using the mod function to get a known distribution of values; you can run this on your own system with whatever distribution you like if you don’t trust me 🙂

drop table myth_table;
create table myth_table (
  id number,
  code_2 number,
  code_100000 number
);
insert into myth_table (
    id, code_2,code_100000
)
select rownum,
    mod(rownum,2),
    mod(rownum,100000)
from dual connect by rownum <= 1000000;
select count(*), count(distinct code_2), count(distinct code_100000)
from myth_table;
commit;

create index myth_table_rule on myth_table (code_100000, code_2) ;
create index myth_table_andy on myth_table (code_2, code_100000) ;

alter session set statistics_level=all;
set serverout off

set feedback only
select /*+ index (myth_table myth_table_rule) */* from myth_table where code_100000 = 10 and code_2 = 0;
set feedback on
select * from dbms_xplan.display_cursor(sql_id=>null,format=>'typical allstats last');
set feedback only
select /*+ index (myth_table myth_table_andy) */* from myth_table where code_100000 = 10 and code_2 = 0;
set feedback on
select * from dbms_xplan.display_cursor(sql_id=>null,format=>'typical allstats last');

And the results:

  COUNT(*) COUNT(DISTINCTCODE_2) COUNT(DISTINCTCODE_100000)
---------- --------------------- --------------------------
   1000000                     2                     100000

----------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name            | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers | Reads  |
----------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                 |      1 |        |       |    49 (100)|          |     10 |00:00:00.01 |      14 |      2 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| MYTH_TABLE      |      1 |     40 |  1560 |    49   (0)| 00:00:01 |     10 |00:00:00.01 |      14 |      2 |
|*  2 |   INDEX RANGE SCAN                  | MYTH_TABLE_RULE |      1 |     40 |       |     3   (0)| 00:00:01 |     10 |00:00:00.01 |       4 |      2 |
----------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("CODE_100000"=10 AND "CODE_2"=0)


----------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name            | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers | Reads  |
----------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                 |      1 |        |       |    49 (100)|          |     10 |00:00:00.01 |      14 |      2 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| MYTH_TABLE      |      1 |     40 |  1560 |    49   (0)| 00:00:01 |     10 |00:00:00.01 |      14 |      2 |
|*  2 |   INDEX RANGE SCAN                  | MYTH_TABLE_ANDY |      1 |     40 |       |     3   (0)| 00:00:01 |     10 |00:00:00.01 |       4 |      2 |
----------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("CODE_2"=0 AND "CODE_100000"=10)

The number of buffers (logical IOs) against the two indexes was exactly the same – 4. Both found exactly the same 10 ROWIDs, so they ended up reading the same rows from the table. The order of columns made no difference.

SQL Server

We’ll do the same on SQL Server. I’ll create a heap table for simplicity (you can try it yourself with a clustered index). I’m using similar functionality to ensure the same data distribution as in my Oracle case, but it’s important to note that this is not about comparing performance between the two engines – it’s about performance within each engine using different indexing techniques. I’m using set statistics io on to get information about the IO we’re doing here.

drop table myth_table;
create table myth_table (
  id int,
  code_2 int,
  code_100000 int
);
with rows_10 as (select * from (values(1), (1), (1), (1), (1), (1), (1), (1), (1), (1)) as t(i))
  ,cte as (select row_number() over(order by r.i) rn
           from   rows_10 r
                  cross join rows_10 r1 -- 100  
                  cross join rows_10 r2 -- 1,000
                  cross join rows_10 r3-- 10,000  
                  cross join rows_10 r4-- 100,000  
                  cross join rows_10 r5 -- 1,000,000  
         )
insert into myth_table (
    id, code_2,code_100000
)
select  rn,
        rn % 2,
        rn % 100000
from   cte
where  rn <= 1000000;

select count(*), count(distinct code_2), count(distinct code_100000)
from myth_table;

create index myth_table_rule on myth_table (code_100000, code_2) ;
create index myth_table_andy on myth_table (code_2, code_100000) ;

set statistics io on
select * from myth_table with (index (myth_table_rule))  where code_100000 = 10 and code_2 = 0;
select * from myth_table with (index (myth_table_andy))  where code_100000 = 10 and code_2 = 0;
set statistics io off

And the results:

                        
----------- ----------- -----------
1000000     2           100000
...
Table 'myth_table'. Scan count 1, logical reads 13, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
...
Table 'myth_table'. Scan count 1, logical reads 13, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

Here we can see we’re doing 13 logical reads no matter which index is used. If you want to read the actual execution plans, I’ve shared them here: https://www.brentozar.com/pastetheplan/?id=BJz2LhVEc

PostgreSQL

PostgreSQL doesn’t support hinting, so if we want to force the optimizer we need to be a bit more dramatic – I’ve decided that dropping indexes is probably easiest for the demo. I’ll use EXPLAIN (ANALYZE, BUFFERS) to get output similar to what Oracle gives us.

CREATE TABLE myth_table (
  id SERIAL UNIQUE NOT NULL,
  code_2 int NOT NULL,
  code_100000 int NOT NULL
);
insert into myth_table (
    code_2,code_100000
)
select
    mod(i,2),
    mod(i,100000)
from generate_series(1, 1000000) s(i);
select count(*), count(distinct code_2), count(distinct code_100000)
from myth_table;
create index myth_table_ix01 on myth_table (code_2, code_100000);
EXPLAIN (ANALYZE, BUFFERS)
select * from myth_table where code_100000 = 10 and code_2 = 0;
drop index myth_table_ix01;
create index myth_table_ix02 on myth_table (code_100000, code_2);
EXPLAIN (ANALYZE, BUFFERS)
select * from myth_table where code_100000 = 10 and code_2 = 0;
drop index myth_table_ix02;

And the results:

count	count	count
1000000	2	100000

Bitmap Heap Scan on myth_table  (cost=4.68..99.96 rows=25 width=12) (actual time=0.057..0.174 rows=10 loops=1)
  Recheck Cond: ((code_2 = 0) AND (code_100000 = 10))
  Heap Blocks: exact=10
  Buffers: shared read=13
  ->  Bitmap Index Scan on myth_table_ix01  (cost=0.00..4.67 rows=25 width=0) (actual time=0.040..0.040 rows=10 loops=1)
        Index Cond: ((code_2 = 0) AND (code_100000 = 10))
        Buffers: shared read=3
Planning:
  Buffers: shared hit=19 read=7
...
Bitmap Heap Scan on myth_table  (cost=4.68..99.96 rows=25 width=12) (actual time=0.052..0.069 rows=10 loops=1)
  Recheck Cond: ((code_100000 = 10) AND (code_2 = 0))
  Heap Blocks: exact=10
  Buffers: shared hit=10 read=3
  ->  Bitmap Index Scan on myth_table_ix02  (cost=0.00..4.67 rows=25 width=0) (actual time=0.043..0.043 rows=10 loops=1)
        Index Cond: ((code_100000 = 10) AND (code_2 = 0))
        Buffers: shared read=3
Planning:
  Buffers: shared hit=20 read=1 dirtied=1
...

The lines underneath the Bitmap Index Scan line tell us how much work we did against the index: both times we did 3 reads, and those led to an additional 10 reads from the table. If you would like to try tweaking the demo, you can do so with DBFiddle

Does it ever matter?

Of course it matters! Once you’ve included all the columns with equality conditions (that are worth including), you have one final column that can provide index selectivity. This is where you choose based on how selective each of your range filters is going to be – it’s your last piece of index selectivity. E.g. if you had the query

select * 
from  queue_table 
where queue = 'MYQUEUE' 
and   status='READY' 
and   expire_date > sysdate 
and   start_date < sysdate 
and   owner <> 'BLOCKED'

The two equality conditions here are obvious, so we can start with those (again, assuming they’re decent filters). That leaves our filters against expire_date, start_date and owner. The owner filter can’t be used as a range scan1, so we can discount it first. We just need to decide which filter reduces the number of rows from our already-filtered set the most. It’s simple enough to run a small query to figure this out:

select count(case when expire_date > sysdate then 1 else null end) expire_date
      ,count(case when start_date  < sysdate then 1 else null end) start_date  
      ,count(*)
from  queue_table 
where queue = 'MYQUEUE' 
and   status='READY' 

This will tell us what we need to know – and if the count(*) result is sufficiently small we might decide the index is already good enough. Let’s imagine the result came out to be

EXPIRE_DATE  START_DATE   COUNT(*)
----------- ----------- ----------
       3986         200       4000

The choice is clear here: the start_date filter reduces our index selectivity to 200 rows, so including start_date after queue and status is a decent idea. At this point, our index selectivity can no longer be improved. Adding further columns can reduce the amount of data read from the table, but you’re growing the index (and therefore having to do more work against the index). It may still be beneficial overall to add the columns – it’s a balancing act.

1We can technically split owner <> 'BLOCKED' into (owner < 'BLOCKED' or owner > 'BLOCKED'), which could be expanded to give us two index range scans. In this case it’s also a filter that is not going to provide great selectivity, so we don’t need to worry. The reader is welcome to see if their DBMS of choice can make this transformation when it is worthwhile.

Summary

When the columns in the index are subject to equality conditions, it doesn’t matter which order they appear in, so long as they are not placed after columns which are not subject to equality conditions (see the Golden Rule of Indexing). Like I said in the intro, ordering them by cardinality is not a bad rule of thumb – it’s not going to give you bad performance – it just doesn’t matter.

I haven’t mentioned a few things that could be interesting to think about:

  • Does the order of columns affect the clustering factor statistic? (a quick check is sketched after this list)
  • Does the real clustering factor impact the performance?
  • Does having a highly repetitive leading column give benefits to prefix compression – does that impact performance?
  • Does having access to index skip scans make one ordering more forgiving?
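For the first question, a quick way to start checking is to compare the statistics of the two demo indexes (assuming statistics have been gathered):

exec dbms_stats.gather_table_stats(null,'MYTH_TABLE')

select index_name, blevel, leaf_blocks, clustering_factor
from   user_indexes
where  table_name = 'MYTH_TABLE';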

Driving an outer join query from the coalesce of the outer joined table and the main table

I had a hard time coming up with a snappy title for this; I don’t think it’s going to be particularly google-able, but never mind.

The problem I’m tackling here is a query which outer joins two tables, where the main filter is achieved with a coalesce (or NVL) of a column from the outer joined table and a column from the main table. I’m going to go over how I investigated why this was slow and how I would solve the problem with a query rewrite (both when I can touch the application code, and when I can’t).

To put this into some context and make the problem more understandable, let’s create some sample data.

We have two tables: as_main, which contains names of things, and as_optional, which contains alternative names for the same things that should be used if they exist. They are joined together on their primary keys. I’ve added indexes on the name columns in both.

create table as_main 
  (pkcol number primary key
  ,padding_col varchar2(200 char) 
  ,name varchar2(100 char)
  );
create index as_main_name on as_main (name);
create table as_optional 
  (pkcol number primary key
  ,padding_col varchar2(200 char)
  ,alternative_name varchar2(100 char)
  );
create index as_optional_name on as_optional (alternative_name);
insert into as_main (pkcol, padding_col, name)
select rownum, rpad('x',200,'x'), dbms_random.string('a',30) from dual connect by rownum <=10000;
insert into as_optional (pkcol, padding_col, alternative_name)
select pkcol, rpad('x',200,'x'), dbms_random.string('a',30) from as_main where mod(pkCol, 5) = 0;
commit;
exec dbms_stats.gather_table_stats(null,'as_main')
exec dbms_stats.gather_table_stats(null,'as_optional')

The business logic for this data model requires that, for each row in as_main, if a matching row exists in as_optional then the alternative_name from there is displayed instead. E.g. if I wanted to look up the data with pkcol=100, I would run:

select m.pkcol, coalesce(o.alternative_name, m.name) name
from   as_main m
left join as_optional o 
  on   m.pkcol = o.pkcol
where  m.pkcol = 100;

And we get a very sensible and cheap looking plan

Plan hash value: 1756420787
 
------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                 |     1 |    70 |     3   (0)| 00:00:01 |
|   1 |  NESTED LOOPS OUTER          |                 |     1 |    70 |     3   (0)| 00:00:01 |
|   2 |   TABLE ACCESS BY INDEX ROWID| AS_MAIN         |     1 |    35 |     2   (0)| 00:00:01 |
|*  3 |    INDEX UNIQUE SCAN         | SYS_C0033781918 |     1 |       |     1   (0)| 00:00:01 |
|   4 |   TABLE ACCESS BY INDEX ROWID| AS_OPTIONAL     |     1 |    35 |     1   (0)| 00:00:01 |
|*  5 |    INDEX UNIQUE SCAN         | SYS_C0033781919 |     1 |       |     0   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("M"."PKCOL"=100)
   5 - access("O"."PKCOL"(+)=100)

But what happens if I want to search on that computed name column? Let’s filter on coalesce(m.name, o.alternative_name)

select m.pkcol, coalesce(o.alternative_name, m.name) name
from   as_main m
left join as_optional o 
  on   m.pkcol = o.pkcol
where  coalesce(m.name, o.alternative_name) = 'fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj';
Plan hash value: 202792823
 
--------------------------------------------------------------------------------------
| Id  | Operation              | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |             | 10000 |   683K|   145   (0)| 00:00:01 |
|*  1 |  FILTER                |             |       |       |            |          |
|*  2 |   HASH JOIN RIGHT OUTER|             | 10000 |   683K|   145   (0)| 00:00:01 |
|   3 |    TABLE ACCESS FULL   | AS_OPTIONAL |  2000 | 70000 |    25   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL   | AS_MAIN     | 10000 |   341K|   120   (0)| 00:00:01 |
--------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter(COALESCE("M"."NAME","O"."ALTERNATIVE_NAME")='fvzXQcNfOuXnBgMwYlz
              zMkFPKYxDrj')
   2 - access("M"."PKCOL"="O"."PKCOL"(+))

Yikes! The driving filter is being executed on plan line 1, after a full table scan of each table. We have to read and join every row (note the lack of predicates applied to lines 3 and 4) before we can start applying our filter – that’s going to be nasty when these tables grow.

You might be wondering whether there’s any way for this query to be executed quickly. There is: you could find the rows in as_main that match the name filter and have no joining row in as_optional, then find the rows in as_optional that match the name filter and join them back to as_main. I’ll share the SQL version of this later on, but first let’s see if we can get the optimizer to do the work for us.

We used coalesce, but the optimizer has some tricks with NVL, so let’s see if changing that helps.

select m.pkcol, coalesce(o.alternative_name, m.name) name
from   as_main m
left join as_optional o 
  on   m.pkcol = o.pkcol
where  nvl(o.alternative_name, m.name) = 'fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj';
Plan hash value: 202792823
 
--------------------------------------------------------------------------------------
| Id  | Operation              | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |             | 10000 |   683K|   145   (0)| 00:00:01 |
|*  1 |  FILTER                |             |       |       |            |          |
|*  2 |   HASH JOIN RIGHT OUTER|             | 10000 |   683K|   145   (0)| 00:00:01 |
|   3 |    TABLE ACCESS FULL   | AS_OPTIONAL |  2000 | 70000 |    25   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL   | AS_MAIN     | 10000 |   341K|   120   (0)| 00:00:01 |
--------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter(NVL("O"."ALTERNATIVE_NAME","M"."NAME")='fvzXQcNfOuXnBgMwYlzzMkFP
              KYxDrj')
   2 - access("M"."PKCOL"="O"."PKCOL"(+))

Nope! Same plan. What if we expand the NVL filter a bit:

select m.pkcol, coalesce(o.alternative_name, m.name) name
from   as_main m
left join as_optional o 
  on   m.pkcol = o.pkcol
where   (o.alternative_name = 'fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj' or (o.alternative_name is null and m.name ='fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj'));
Plan hash value: 202792823
 
--------------------------------------------------------------------------------------
| Id  | Operation              | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |             |     5 |   350 |   145   (0)| 00:00:01 |
|*  1 |  FILTER                |             |       |       |            |          |
|*  2 |   HASH JOIN RIGHT OUTER|             |     5 |   350 |   145   (0)| 00:00:01 |
|   3 |    TABLE ACCESS FULL   | AS_OPTIONAL |  2000 | 70000 |    25   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL   | AS_MAIN     | 10000 |   341K|   120   (0)| 00:00:01 |
--------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("O"."ALTERNATIVE_NAME"='fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj' OR 
              "O"."ALTERNATIVE_NAME" IS NULL AND "M"."NAME"='fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj'
              )
   2 - access("M"."PKCOL"="O"."PKCOL"(+))

Hmm, no change again. What if we force the or expansion with a use_concat hint?

select /*+use_concat*/m.pkcol, coalesce(o.alternative_name, m.name) name
from   as_main m
left join as_optional o 
  on   m.pkcol = o.pkcol
where   (o.alternative_name = 'fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj' or (o.alternative_name is null and m.name ='fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj'));
Plan hash value: 2005024459
 
----------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                 |     6 |   420 |   148   (0)| 00:00:01 |
|   1 |  CONCATENATION                         |                 |       |       |            |          |
|*  2 |   FILTER                               |                 |       |       |            |          |
|   3 |    NESTED LOOPS OUTER                  |                 |     1 |    70 |     3   (0)| 00:00:01 |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| AS_MAIN         |     1 |    35 |     2   (0)| 00:00:01 |
|*  5 |      INDEX RANGE SCAN                  | AS_MAIN_NAME    |     1 |       |     1   (0)| 00:00:01 |
|   6 |     TABLE ACCESS BY INDEX ROWID        | AS_OPTIONAL     |     1 |    35 |     1   (0)| 00:00:01 |
|*  7 |      INDEX UNIQUE SCAN                 | SYS_C0033779786 |     1 |       |     0   (0)| 00:00:01 |
|*  8 |   FILTER                               |                 |       |       |            |          |
|*  9 |    HASH JOIN RIGHT OUTER               |                 |     5 |   350 |   145   (0)| 00:00:01 |
|  10 |     TABLE ACCESS FULL                  | AS_OPTIONAL     |  2000 | 70000 |    25   (0)| 00:00:01 |
|  11 |     TABLE ACCESS FULL                  | AS_MAIN         | 10000 |   341K|   120   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter("O"."ALTERNATIVE_NAME" IS NULL)
   5 - access("M"."NAME"='fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj')
   7 - access("M"."PKCOL"="O"."PKCOL"(+))
   8 - filter("O"."ALTERNATIVE_NAME"='fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj' AND 
              (LNNVL("O"."ALTERNATIVE_NAME" IS NULL) OR LNNVL("M"."NAME"='fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj')))
   9 - access("M"."PKCOL"="O"."PKCOL"(+))

We’re sort of there: part of the query leads with the name index on as_main, but the other part does not. We can take a closer look at the predicates against the second part of the concatenation for clues.

8 - filter("O"."ALTERNATIVE_NAME"='fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj' AND 
           (LNNVL("O"."ALTERNATIVE_NAME" IS NULL) OR LNNVL("M"."NAME"='fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj')))
9 - access("M"."PKCOL"="O"."PKCOL"(+))

We can see that Oracle is still choosing to execute this as an outer join (from the operation in the plan, and the presence of the (+)s in the predicates). It’s then deciding that it can only filter on

"O"."ALTERNATIVE_NAME"='fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj'

after executing the join. We know that the filter could legally be applied on line 10 if the join on line 9 was an inner join (which we know it can be, as we’re filtering on the column being not null). The optimizer doesn’t seem to spot this. Perhaps it wants to apply the LNNVL predicate at the same time (which must be done after the join because it uses both sides of the join). We know the LNNVL isn’t necessary because the two sides of the OR condition are mutually exclusive – you can’t have a column equal to a value and also null – but it always gets added as a fail-safe when an OR is expanded to union all. Note: LNNVL(condition) is the equivalent of (condition = false or condition is unknown), i.e. the three-valued logic form of anything but true.
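If you want to see that three-valued behaviour for yourself, a toy query makes it clear (LNNVL kept in the where clause, where it is allowed):

-- returns 2 and NULL: the rows where "x = 1" is false or unknown
select x
from  (select 1 x from dual
       union all select 2 from dual
       union all select cast(null as number) from dual)
where lnnvl(x = 1);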

Now that we’ve done some digging into whether we can help Oracle do the query transformation itself, it might be time to give in and just rewrite the query ourselves. This is my manually rewritten query

select m.pkcol, coalesce(o.alternative_name, m.name) name
from   as_main m
left join as_optional o 
  on   m.pkcol = o.pkcol
where  o.alternative_name = 'fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj'
union all
select m.pkcol, coalesce(o.alternative_name, m.name) name
from   as_main m
left join as_optional o 
  on   m.pkcol = o.pkcol
where  m.name = 'fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj'
and    o.alternative_name is null;

All I’ve done here is copy and paste the OR version of the query, putting one side of the OR condition in one subquery and the other side in the other. It is logically identical to the first version (remember, the sides of the OR are mutually exclusive, so we can get away with the union all) – but let’s see how the optimizer treats it.

Plan hash value: 236470842
 
-----------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                  |     2 |   140 |     6   (0)| 00:00:01 |
|   1 |  UNION-ALL                             |                  |       |       |            |          |
|   2 |   NESTED LOOPS                         |                  |     1 |    70 |     3   (0)| 00:00:01 |
|   3 |    NESTED LOOPS                        |                  |     1 |    70 |     3   (0)| 00:00:01 |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| AS_OPTIONAL      |     1 |    35 |     2   (0)| 00:00:01 |
|*  5 |      INDEX RANGE SCAN                  | AS_OPTIONAL_NAME |     1 |       |     1   (0)| 00:00:01 |
|*  6 |     INDEX UNIQUE SCAN                  | SYS_C0033779785  |     1 |       |     0   (0)| 00:00:01 |
|   7 |    TABLE ACCESS BY INDEX ROWID         | AS_MAIN          |     1 |    35 |     1   (0)| 00:00:01 |
|*  8 |   FILTER                               |                  |       |       |            |          |
|   9 |    NESTED LOOPS OUTER                  |                  |     1 |    70 |     3   (0)| 00:00:01 |
|  10 |     TABLE ACCESS BY INDEX ROWID BATCHED| AS_MAIN          |     1 |    35 |     2   (0)| 00:00:01 |
|* 11 |      INDEX RANGE SCAN                  | AS_MAIN_NAME     |     1 |       |     1   (0)| 00:00:01 |
|  12 |     TABLE ACCESS BY INDEX ROWID        | AS_OPTIONAL      |     1 |    35 |     1   (0)| 00:00:01 |
|* 13 |      INDEX UNIQUE SCAN                 | SYS_C0033779786  |     1 |       |     0   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - access("O"."ALTERNATIVE_NAME"='fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj')
   6 - access("M"."PKCOL"="O"."PKCOL")
   8 - filter("O"."ALTERNATIVE_NAME" IS NULL)
  11 - access("M"."NAME"='fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj')
  13 - access("M"."PKCOL"="O"."PKCOL"(+))
 
Note
-----
   - this is an adaptive plan

We now do index lookups against both tables on our driving filter, compute the join from either side, and then combine the results. This should be much faster than the full table scans – but we had to do the transformation for the optimizer here.

SQL Translation

What if we didn’t have the luxury of being able to modify the code? We can’t use SQL Plan Baselines or SQL Patches because there’s no way Oracle would produce our desired plan (the closest you’d get would still include that LNNVL filter), but perhaps Oracle has another trick up its sleeve?

Enter the SQL Translation Framework. I was reminded about this in a recent Twitter exchange; it allows you (from 12.1 onwards) to modify the SQL being sent by an application before it gets parsed. The main piece of documentation relevant to us is https://docs.oracle.com/en/database/oracle/oracle-database/12.2/drdas/SQLtranslation-and-examples.html#GUID-E519283E-9BD8-4354-B612-72A07D7A5016. I had a look for a demo and found a thread started by Bob Bryla, answered by Solomon Yakobson with some code I found very useful: https://community.oracle.com/thread/3996327. You should read the thread, as I won’t include all the code here.
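For orientation, the moving parts look roughly like this sketch (profile name is just an example, matching the u1_translator package shown later; the 10601 event is the commonly used switch to apply translation to ordinary SQL rather than just foreign SQL):

exec dbms_sql_translator.create_profile('U1_PROFILE')
exec dbms_sql_translator.set_attribute('U1_PROFILE', dbms_sql_translator.attr_translator, 'U1_TRANSLATOR')

alter session set sql_translation_profile = U1_PROFILE;
alter session set events = '10601 trace name context forever, level 32';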

I’ll note that you can also use dbms_advanced_rewrite to rewrite this exact query, but I need to be able to accept any literal value for the string.
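For comparison, a sketch of what the dbms_advanced_rewrite version would look like for the one literal – the exact-match requirement is precisely why it doesn’t fit here (and with validate => false the session needs query_rewrite_integrity set to trusted):

begin
  sys.dbms_advanced_rewrite.declare_rewrite_equivalence(
    name             => 'AS_COALESCE_RW',
    source_stmt      => q'#select m.pkcol, coalesce(o.alternative_name, m.name) name
from as_main m left join as_optional o on m.pkcol = o.pkcol
where coalesce(m.name, o.alternative_name) = 'fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj'#',
    destination_stmt => q'#select m.pkcol, coalesce(o.alternative_name, m.name) name
from as_main m left join as_optional o on m.pkcol = o.pkcol
where o.alternative_name = 'fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj'
union all
select m.pkcol, coalesce(o.alternative_name, m.name) name
from as_main m left join as_optional o on m.pkcol = o.pkcol
where m.name = 'fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj'
and o.alternative_name is null#',
    validate         => false,
    rewrite_mode     => 'text_match');
end;
/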

To use Solomon’s sample code, we just need to figure out the regexp_replace arguments that will convert our badly performing query into the transformed query. Regular expressions are tricky to understand at first, but we’ll be using them very simplistically, so hopefully you can follow along (I recommend using online demos to see what does what – https://regexr.com/ is usually my go-to)

select m.pkcol, coalesce(o.alternative_name, m.name) name
from   as_main m
left join as_optional o 
  on   m.pkcol = o.pkcol
where  coalesce(m.name, o.alternative_name) = 'fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj'

First, we need to escape anything that means something special in regular expressions – parentheses and full stops (periods).

select m\.pkcol, coalesce\(o\.alternative_name, m\.name\) name
from   as_main m
left join as_optional o 
  on   m\.pkcol = o\.pkcol
where  coalesce\(m\.name, o\.alternative_name\) = 'fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj'

Next, we need to chunk up the query

(select m\.pkcol, coalesce\(o\.alternative_name, m\.name\) name
from   as_main m
left join as_optional o 
  on   m\.pkcol = o\.pkcol
where  coalesce\(m\.name, o\.alternative_name\) = )('fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj')

This gives us two groups: \1, the static part of the query, and \2, the (currently static) value we’re comparing the coalesce against. The next step is to allow for other values in our string and to ensure that we are matching the complete query.

^(select m\.pkcol, coalesce\(o\.alternative_name, m\.name\) name
from   as_main m
left join as_optional o 
  on   m\.pkcol = o\.pkcol
where  coalesce\(m\.name, o\.alternative_name\) = )('\w+')$

\w means a word character (an alphabetic character, a number, or an underscore). If we apply this regular expression to our query, then \2 gives us ‘fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj’.
^ means the start of the string and $ means the end, so we can wrap our query in those to ensure the regular expression won’t be triggered if there is anything on either side of our query.

Now we need to form the query that this will be replaced by: first, the manual translation we came up with before

select m.pkcol, coalesce(o.alternative_name, m.name) name
from   as_main m
left join as_optional o 
  on   m.pkcol = o.pkcol
where  o.alternative_name = 'fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj'
union all
select m.pkcol, coalesce(o.alternative_name, m.name) name
from   as_main m
left join as_optional o 
  on   m.pkcol = o.pkcol
where  m.name = 'fvzXQcNfOuXnBgMwYlzzMkFPKYxDrj'
and    o.alternative_name is null

We need to replace the known variables with our regular expression groups:

select m.pkcol, coalesce(o.alternative_name, m.name) name
from   as_main m
left join as_optional o 
  on   m.pkcol = o.pkcol
where  o.alternative_name = \2
union all
select m.pkcol, coalesce(o.alternative_name, m.name) name
from   as_main m
left join as_optional o 
  on   m.pkcol = o.pkcol
where  m.name = \2
and    o.alternative_name is null
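Before wiring the full pattern into the translator, it’s worth sanity-checking the capture-group mechanics with a toy regexp_replace (the pattern here is a stand-in for the real one):

select regexp_replace(
         q'#x = 'hello'#',
         q'#^(x = )('\w+')$#',
         q'#y = \2 union all z = \2#') rewritten
from dual;

-- gives: y = 'hello' union all z = 'hello'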

Now let’s put it all together, using q-strings (like Solomon) so we don’t need to worry about escaping our quotes. We only need to modify the user-defined PL/SQL package (the one that’s been registered as the attr_translator):

create or replace
  package body u1_translator
    is
      procedure translate_sql(
                              sql_text        in  clob,
                              translated_text out clob
                            )
        is
        begin
           translated_text := regexp_replace(
                                             sql_text,
                                             q'#^(select m\.pkcol, coalesce\(o\.alternative_name, m\.name\) name
from   as_main m
left join as_optional o 
  on   m\.pkcol = o\.pkcol
where  coalesce\(m\.name, o\.alternative_name\) = )('\w+')$#',
                                             'select m.pkcol, coalesce(o.alternative_name, m.name) name
from   as_main m
left join as_optional o 
  on   m.pkcol = o.pkcol
where  o.alternative_name = \2
union all
select m.pkcol, coalesce(o.alternative_name, m.name) name
from   as_main m
left join as_optional o 
  on   m.pkcol = o.pkcol
where  m.name = \2
and    o.alternative_name is null'
                                           );
     end;
     procedure translate_error(
                               error_code          in  binary_integer,
                               translated_code    out binary_integer,
                               translated_sqlstate out varchar2
                             )
       is
       begin
           null;
     end;
End;
/

We’ve mimicked the whitespace, so the formatting of the script is a little funny. We could concatenate the string together with carriage returns and line feeds (yes, Oracle runs on Windows just fine), or we could do it on one line and use \s+ for any whitespace (which includes new lines and carriage returns). Ultimately, I don’t think that would add readability here.

To test this, we just need to set up a session to use the SQL Translation Profile (see the linked code) and run our query

select m.pkcol, coalesce(o.alternative_name, m.name) name
from   as_main m
left join as_optional o 
  on   m.pkcol = o.pkcol
where  coalesce(m.name, o.alternative_name) = 'someOtherValue'

It now uses:

-----------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                  |     2 |   140 |     6   (0)| 00:00:01 |
|   1 |  UNION-ALL                             |                  |       |       |            |          |
|   2 |   NESTED LOOPS                         |                  |     1 |    70 |     3   (0)| 00:00:01 |
|   3 |    NESTED LOOPS                        |                  |     1 |    70 |     3   (0)| 00:00:01 |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| AS_OPTIONAL      |     1 |    35 |     2   (0)| 00:00:01 |
|*  5 |      INDEX RANGE SCAN                  | AS_OPTIONAL_NAME |     1 |       |     1   (0)| 00:00:01 |
|*  6 |     INDEX UNIQUE SCAN                  | SYS_C0033779785  |     1 |       |     0   (0)| 00:00:01 |
|   7 |    TABLE ACCESS BY INDEX ROWID         | AS_MAIN          |     1 |    35 |     1   (0)| 00:00:01 |
|*  8 |   FILTER                               |                  |       |       |            |          |
|   9 |    NESTED LOOPS OUTER                  |                  |     1 |    70 |     3   (0)| 00:00:01 |
|  10 |     TABLE ACCESS BY INDEX ROWID BATCHED| AS_MAIN          |     1 |    35 |     2   (0)| 00:00:01 |
|* 11 |      INDEX RANGE SCAN                  | AS_MAIN_NAME     |     1 |       |     1   (0)| 00:00:01 |
|  12 |     TABLE ACCESS BY INDEX ROWID        | AS_OPTIONAL      |     1 |    35 |     1   (0)| 00:00:01 |
|* 13 |      INDEX UNIQUE SCAN                 | SYS_C0033779786  |     1 |       |     0   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - access("O"."ALTERNATIVE_NAME"='someOtherValue')
   6 - access("M"."PKCOL"="O"."PKCOL")
   8 - filter("O"."ALTERNATIVE_NAME" IS NULL)
  11 - access("M"."NAME"='someOtherValue')
  13 - access("M"."PKCOL"="O"."PKCOL"(+))

Excellent!

A challenge for the reader would be to make this as dynamic as possible: what if we want it to support different columns being selected? What if we wanted to support additional predicates? What if we wanted to support this as a subquery? What if we want to support different outer join conditions (would that even be valid to rewrite)? What if we want to support different tables altogether?

Obviously, the best solution would be for the optimizer to do it for us (and cost the transformation)!

Oops! I forgot to split the MAXVALUE partition…

Partitioning your data by a date column with a plan to archive off old data is quite standard these days. If you set this up prior to 11.2 (or just don’t use new features) then you will probably have some process to split a MAXVALUE partition into a bunch of partitions in advance. Let’s pretend that somehow you’ve forgotten to run that process for quite some time and now you’ve got a 700GB table with 690GB in the MAXVALUE partition. You really don’t want to end up in this situation again, so you want to use interval partitioning, but uh oh:

alter table as_totally_fictional_table set interval (numtoyminterval(1,'month'));
ORA-14759: SET INTERVAL is not legal on this table.

We’re going to have to do something about our MAXVALUE partition first. Let’s look at what problems can arise and how to mitigate them.

First let’s set up some sample data.

create table as_totally_fictional_table 
  (pk_col number primary key
  ,date_col date not null
  ,filler_col varchar2(400)
  ) 
partition by range (date_col) 
  (partition p01012015 values less than (to_date('01012015','ddmmyyyy'))
  ,partition p01022015 values less than (to_date('01022015','ddmmyyyy'))
  ,partition pmax values less than (maxvalue)
  );

insert /*+append*/ into as_totally_fictional_table
select rownum, to_date('14082019','ddmmyyyy')-rownum/1000, rpad('x',400,'x')
from (select 1 a from dual connect by rownum <= 365*5 ) a
cross join (select 1 from dual connect by rownum <= 100) b;
commit;

select partition_name, dbms_xplan.format_size(bytes) partition_size 
from   user_segments 
where  segment_name = 'AS_TOTALLY_FICTIONAL_TABLE'
# col partition_name for a30
# col partition_size for a30
/

PARTITION_NAME                 PARTITION_SIZE
------------------------------ ------------------------------
P01012015                      8192K
P01022015                      8192K
PMAX                           80M

We could just go in and split the partition at a value higher than the known maximum (usually the column is populated by sysdate, so it’s not hard to pick a higher value). If we do this, Oracle is able to use the fast-split method (see https://docs.oracle.com/database/121/VLDBG/GUID-5AEB1A3A-89F1-489B-824E-CCD955F94BD7.htm). We can trace the DDL statement to see what’s going on.

exec dbms_monitor.session_trace_enable(waits=>true, binds=>false);
alter table as_totally_fictional_table split partition pmax
at (to_date('01092019','ddmmyyyy')) into (partition pBIG, partition pNewMax);
exec dbms_monitor.session_trace_disable
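If you want to follow along, you can locate the trace file for your own session with the usual query:

select value
from   v$diag_info
where  name = 'Default Trace File';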

Opening up the trace file and searching for where the time is taken:

PARSING IN CURSOR #2142721212672 len=294 dep=1 uid=103 oct=3 lid=103 tim=956511516267 hv=1856181448 ad='7ffb9f2c1d80' sqlid='f1j7q2dra6568'
select /*+ FIRST_ROWS(1) PARALLEL("AS_TOTALLY_FICTIONAL_TABLE", 1) */ 1 from "ANDY"."AS_TOTALLY_FICTIONAL_TABLE" PARTITION ("PMAX")  where ( (  (  ( "DATE_COL" >= TO_DATE(' 2019-09-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN') OR "DATE_COL" IS NULL  )  )  ) ) and rownum < 2
END OF STMT
PARSE #2142721212672:c=15625,e=25115,p=0,cr=244,cu=0,mis=1,r=0,dep=1,og=2,plh=827481864,tim=956511516266
EXEC #2142721212672:c=0,e=29,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=2,plh=827481864,tim=956511516385
WAIT #2142721212672: nam='db file scattered read' ela= 1826 file#=12 block#=112952 blocks=61 obj#=91479 tim=956511518495
WAIT #2142721212672: nam='db file scattered read' ela= 1419 file#=12 block#=113013 blocks=61 obj#=91479 tim=956511520556
WAIT #2142721212672: nam='db file scattered read' ela= 1360 file#=12 block#=113074 blocks=43 obj#=91479 tim=956511522564
...
FETCH #2142721212672:c=140625,e=3372937,p=9244,cr=19480,cu=0,mis=0,r=0,dep=1,og=2,plh=827481864,tim=956514889345
STAT #2142721212672 id=1 cnt=0 pid=0 pos=1 obj=0 op='COUNT STOPKEY (cr=19480 pr=9244 pw=0 str=1 time=3372946 us)'
STAT #2142721212672 id=2 cnt=0 pid=1 pos=1 obj=0 op='VIEW  VW_ORE_AAA83388 (cr=19480 pr=9244 pw=0 str=1 time=3372943 us cost=29 size=0 card=8966)'
STAT #2142721212672 id=3 cnt=0 pid=2 pos=1 obj=0 op='UNION-ALL  (cr=19480 pr=9244 pw=0 str=1 time=3372941 us)'
STAT #2142721212672 id=4 cnt=0 pid=3 pos=1 obj=0 op='PARTITION RANGE SINGLE PARTITION: 3 3 (cr=9740 pr=9244 pw=0 str=1 time=3354008 us cost=26 size=981 card=109)'
STAT #2142721212672 id=5 cnt=0 pid=4 pos=1 obj=91476 op='TABLE ACCESS FULL AS_TOTALLY_FICTIONAL_TABLE PARTITION: 3 3 (cr=9740 pr=9244 pw=0 str=1 time=3354000 us cost=26 size=981 card=109)'
STAT #2142721212672 id=6 cnt=0 pid=3 pos=2 obj=0 op='PARTITION RANGE SINGLE PARTITION: 3 3 (cr=9740 pr=0 pw=0 str=1 time=18923 us cost=3 size=79713 card=8857)'
STAT #2142721212672 id=7 cnt=0 pid=6 pos=1 obj=91476 op='TABLE ACCESS FULL AS_TOTALLY_FICTIONAL_TABLE PARTITION: 3 3 (cr=9740 pr=0 pw=0 str=1 time=18915 us cost=3 size=79713 card=8857)'
CLOSE #2142721212672:c=0,e=9,dep=1,type=0,tim=956514889987

Looking at the trace file, we can see that Oracle ended up doing a full table scan of our partition to confirm there was no data above our split value. This makes sense: the fast-split is so desirable that we might be able to afford a one-off scan. While it’s doing that full table scan, it has probably already taken out locks on the table. If we time a query against v$lock just right, we can see:

select * from v$lock where sid = 17
ADDR             KADDR                   SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK     CON_ID
---------------- ---------------- ---------- -- ---------- ---------- ---------- ---------- ---------- ---------- ----------
000001F2E36B7270 000001F2E36B72E0         17 TM         18          0          3          0          2          0          3
000001F2E36B7270 000001F2E36B72E0         17 TM      91476          0          3          0          2          0          3
00007FFB88744C08 00007FFB88744C90         17 TX     524313       2880          6          0          2          0          3
00007FFB8BC1DAE0 00007FFB8BC1DB60         17 AE        134 1744511710          4          0         49          0          0
000001F2E36B7270 000001F2E36B72E0         17 TM      91479          0          6          0          2          0          3

We can look up the relevant objects that the locks point to using id1 = object_id (I found that a flashback query was necessary to identify the partition object, as we have since split it). It has taken out an exclusive lock against the partition being split (TM mode 6) – meaning no one can modify any data within our big partition (bad) – and a TM lock against the table in mode 3 (probably less of a worry). My demo split isn’t too bad because only an 80MB partition needs to be read; if we have a huge partition to split, we will soon run into major problems.
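For reference, mapping the id1 values back to objects looks something like the sketch below (the AS OF clause is the flashback query mentioned above; the timestamp offset is arbitrary):

select object_id, object_name, subobject_name, object_type
from   dba_objects as of timestamp (systimestamp - interval '15' minute)
where  object_id in (18, 91476, 91479);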

What if we could tell Oracle that we know there is no data in our table with a date value above our split? We can, with a check constraint! So what happens if we reset our test case and add a check constraint? I’m going to increase the amount of data to make it easier to capture locks.
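Resetting the test case is just the earlier setup with a bigger multiplier on the generator (the exact row counts aren’t critical); something along these lines:

insert /*+append*/ into as_totally_fictional_table
select rownum, to_date('14082019','ddmmyyyy')-rownum/1000, rpad('x',400,'x')
from (select 1 a from dual connect by rownum <= 365*5 ) a
cross join (select 1 from dual connect by rownum <= 1000) b;
commit;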

PARTITION_NAME                 PARTITION_SIZE
------------------------------ ------------------------------
P01012015                      68M
P01022015                      16M
PMAX                           768M

alter table as_totally_fictional_table add constraint
as_totally_fictional_table_cc check (date_col < to_date('15082019','ddmmyyyy'));

The locks owned by the session during the call:

ADDR             KADDR                   SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK     CON_ID
---------------- ---------------- ---------- -- ---------- ---------- ---------- ---------- ---------- ---------- ----------
000001F2DC977288 000001F2DC9772F8         17 TM      91488          0          4          0          2          0          3
00007FFB8BC1DAE0 00007FFB8BC1DB60         17 AE        134 1744511710          4          0       1067          0          0

We still have an exclusive lock, only now it affects the entire table while the whole table is being read… bad news!

What if we can limit this lock to a short period? We can split the constraint creation into a few parts:

alter table as_totally_fictional_table drop constraint as_totally_fictional_table_cc;
alter table as_totally_fictional_table add constraint
as_totally_fictional_table_cc check (date_col < to_date('15082019','ddmmyyyy'))
enable novalidate;

We’ve enabled the constraint without validating it; this was super quick as it didn’t require reading anything. It did still have to take out the exclusive table lock though, as we can see if we get in its way:

alter table as_totally_fictional_table drop constraint as_totally_fictional_table_cc;
other session> select * from as_totally_fictional_table where rownum <2 for update; 
alter table as_totally_fictional_table add constraint
as_totally_fictional_table_cc check (date_col < to_date('15082019','ddmmyyyy'))
enable novalidate;
ADDR             KADDR                   SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK     CON_ID
---------------- ---------------- ---------- -- ---------- ---------- ---------- ---------- ---------- ---------- ----------
00007FFB886F6BF8 00007FFB886F6C80         17 TX     196631       3167          6          0          2          0          3
000001F2DC95B368 000001F2DC95B3D8         17 TM      91488          0          2          0          2          0          3
00007FFB8BC1DD10 00007FFB8BC1DD90         17 OD      91488          0          4          0          2          0          3
00007FFB8BC1DAE0 00007FFB8BC1DB60         17 AE        134 1744511710          4          0       2350          0          0
00007FFB8BC1DE28 00007FFB8BC1DEA8         17 TX     458765       3380          0          4          2          0          3

The next step is to validate the constraint. This will require reading the data, but will it lock?

other session> select * from as_totally_fictional_table where rownum <2 for update; 
alter table as_totally_fictional_table modify constraint
as_totally_fictional_table_cc validate;

Table altered.

Elapsed: 00:00:03.98

No locking at all!

Now we need to double check that this effort was worth it, by confirming that our split partition avoids the full table scan.

exec dbms_monitor.session_trace_enable(waits=>true, binds=>false);
alter table as_totally_fictional_table split partition pmax
at (to_date('01092019','ddmmyyyy')) into (partition pBIG, partition pNewMax);
exec dbms_monitor.session_trace_disable
PARSING IN CURSOR #2142805602888 len=294 dep=1 uid=103 oct=3 lid=103 tim=961274498406 hv=1856181448 ad='7ffba076ed10' sqlid='f1j7q2dra6568'
select /*+ FIRST_ROWS(1) PARALLEL("AS_TOTALLY_FICTIONAL_TABLE", 1) */ 1 from "ANDY"."AS_TOTALLY_FICTIONAL_TABLE" PARTITION ("PMAX")  where ( (  (  ( "DATE_COL" >= TO_DATE(' 2019-09-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN') OR "DATE_COL" IS NULL  )  )  ) ) and rownum < 2
END OF STMT
PARSE #2142805602888:c=0,e=5817,p=0,cr=8,cu=0,mis=1,r=0,dep=1,og=2,plh=693532361,tim=961274498405
EXEC #2142805602888:c=0,e=23,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=2,plh=693532361,tim=961274498543
FETCH #2142805602888:c=0,e=4,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=2,plh=693532361,tim=961274498570
STAT #2142805602888 id=1 cnt=0 pid=0 pos=1 obj=0 op='COUNT STOPKEY (cr=0 pr=0 pw=0 str=1 time=4 us)'
STAT #2142805602888 id=2 cnt=0 pid=1 pos=1 obj=0 op='FILTER  (cr=0 pr=0 pw=0 str=1 time=1 us)'
STAT #2142805602888 id=3 cnt=0 pid=2 pos=1 obj=0 op='PARTITION RANGE SINGLE PARTITION: 3 3 (cr=0 pr=0 pw=0 str=0 time=0 us cost=26577 size=9684 card=1076)'
STAT #2142805602888 id=4 cnt=0 pid=3 pos=1 obj=91495 op='TABLE ACCESS FULL AS_TOTALLY_FICTIONAL_TABLE PARTITION: 3 3 (cr=0 pr=0 pw=0 str=0 time=0 us cost=26577 size=9684 card=1076)'
CLOSE #2142805602888:c=0,e=4,dep=1,type=0,tim=961274498843

The absence of WAIT lines suggests that no real work was done. We can also dive into the data to confirm this (it’s always handy to keep https://method-r.com/wp-content/uploads/2017/11/SQL-Trace-Data-paper.pdf open on the side). The EXEC and FETCH rows tell us that we did no consistent or current mode gets (cr=0, cu=0). The STAT rows give us statistics per line of the plan; we can see that the TABLE ACCESS FULL on line id=4 didn’t execute at all (str=0).

Now we can drop our empty new MAXVALUE partition, pNewMax (we still have the constraint, so we know we’re not going to lose data)

alter table as_totally_fictional_table drop partition pNewMax update indexes;

And finally we can enable interval partitioning:

alter table as_totally_fictional_table set interval (numtoyminterval(1,'month'));

Just need to tidy up now

alter table as_totally_fictional_table drop constraint as_totally_fictional_table_cc;

Now we shouldn’t have this problem again, and we didn’t have to organize massive downtime or temporarily double our storage. We might still have our huge partition, but we will eventually be able to ignore it completely.