
Writing Scalable Queries with T-SQL Ranking Functions

More frequently, I am coming across code that incorporates the various ranking functions now available in T-SQL. These functions have been around for a while and provide ways to include explicit ranking columns in query results. For example, in a race with three runners you could return their rank (1, 2, 3) with the following:

create table [race]
(
   [competitor] varchar(10),
   [speed (sec)] int
);
go

insert into [race]
select 'James', 14
union select 'Harry', 10
union select 'Arnold', 11;
go

select rank() over (order by [speed (sec)]) as [Ranking]
     , [competitor]
     , [speed (sec)]
from race;

This query produces the following results:

Ranking   competitor   speed (sec)
-------   ----------   -----------
1         Harry        10
2         Arnold       11
3         James        14


Unfortunately, I am also beginning to see queries using these ranking functions to select “top x” results, or the most recent results. Not only do I find this a little contrived, it is also not a scalable solution. The issue with using the ranking functions for this purpose is that doing so forces a sort of the full dataset, something that can be done much more efficiently with a well-thought-out GROUP BY and subquery. Below is an example:

Imagine that you run a very simple bank. You have 3 clients, James, Harry and Arnold, who each deposit money into their bank accounts. Every deposit is recorded in a ledger table:

create table [person]
(
    id int primary key,
    name varchar(10) not null
);
go

create table [ledger]
(
    transactionid int primary key,
    personid int foreign key references [person](id),
    amount float not null,
    ledgerdate date not null
);
go

[Image: the ledger table populated with deposits for each client]
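
The populated tables appear in the original post only as an image, so to make the examples below runnable here is a hypothetical set of deposits. The ids, amounts and dates are invented for illustration; they are not the post's original data.

insert into [person] (id, name)
values (1, 'James'), (2, 'Harry'), (3, 'Arnold');
go

-- ten sample deposits, a few per client (values invented)
insert into [ledger] (transactionid, personid, amount, ledgerdate)
values (1, 1,  50.00, '2015-01-05'),
       (2, 2,  20.00, '2015-01-07'),
       (3, 3, 100.00, '2015-01-09'),
       (4, 1,  35.50, '2015-02-01'),
       (5, 2,  80.00, '2015-02-14'),
       (6, 3,  12.25, '2015-03-03'),
       (7, 1,  60.00, '2015-03-20'),
       (8, 2,  45.75, '2015-04-11'),
       (9, 3,  90.00, '2015-04-30'),
       (10, 1, 25.00, '2015-05-15');
go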

Your job is to write a report that returns the most recent deposit made by each client. There are two methods for this. Query 1, below, is the rather verbose GROUP BY with a subquery. Query 2 is the much more succinct (and, I would argue, self-explanatory) query using row_number():


-- Query 1
select personid, amount
from ledger as a
where ledgerdate = (
                      select max(ledgerdate)
                      from ledger as b
                      where b.personid = a.personid
                      group by personid
                   )
group by personid, amount
order by personid;

-- Query 2
select personid, amount
from (
       select row_number() over (partition by personid order by ledgerdate desc) as tx
            , personid
            , amount
       from ledger
 ) as x
where x.tx = 1;     -- i.e. the most recent deposit
go

Both queries return the same results:

[Image: identical result sets from Query 1 and Query 2, one row per client showing their most recent deposit]


And, surprisingly (to me at least), Query 2 actually outperforms Query 1:

[Image: relative query cost at 10 rows, with Query 2 coming out cheaper than Query 1]
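
The percentages in these comparisons are relative query costs. The post doesn't spell out exactly how they were captured, so the batch below is only a sketch of how I would reproduce the comparison: run both queries together with the actual execution plan enabled in SSMS, where each plan reports its cost relative to the batch, and optionally switch on I/O and timing statistics.

set statistics io on;
set statistics time on;
go

-- Query 1 and Query 2 run as a single batch; with "Include Actual Execution
-- Plan" enabled, each statement reports its cost relative to the whole batch.
select personid, amount
from ledger as a
where ledgerdate = (
                      select max(ledgerdate)
                      from ledger as b
                      where b.personid = a.personid
                      group by personid
                   )
group by personid, amount
order by personid;

select personid, amount
from (
       select row_number() over (partition by personid order by ledgerdate desc) as tx
            , personid
            , amount
       from ledger
     ) as x
where x.tx = 1;
go

set statistics io off;
set statistics time off;
go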

Now this seems counter-intuitive to me. I had hoped that SQL Server would be able to satisfy the GROUP BY and subquery version much more quickly than the row_number() solution. Even after the addition of a sensible index, the relative performances remain the same, but let's add one anyway:


create index idx_depositDate on ledger (ledgerdate desc) include (personid, amount);
go

Right now we only have 10 rows in our ledger. Let's increase this to about 1,000 rows and run the same two queries again.
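
The post doesn't include the script used to bulk the table up, so the following is only a sketch; the amounts, dates and the transaction id offset are invented, and it assumes person ids 1 to 3 exist as above.

-- Sketch: add roughly 1,000 synthetic deposits (invented data).
-- Assumes person ids 1-3 exist and transaction ids below 1000 are already used.
insert into [ledger] (transactionid, personid, amount, ledgerdate)
select top (1000)
       1000 + row_number() over (order by (select null))
     , 1 + abs(checksum(newid())) % 3
     , round(rand(checksum(newid())) * 500, 2)
     , cast(dateadd(day, -(abs(checksum(newid())) % 730), getdate()) as date)
from sys.all_objects as a
cross join sys.all_objects as b;
go

With roughly 1,000 rows in place, the relative costs look like this: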

[Image: relative query cost at roughly 1,000 rows]

This time we see that the relative performance is 0.54 : 0.46, which is much closer. So as the size of the data increases, Query 1 starts to look better performance-wise. Notice that as the data size grows we get increasingly better relative performance from Query 1:

Data size = 100,000 rows

[Image: relative query cost at 100,000 rows]

Data size = 1.6 million rows:

[Image: relative query cost at 1.6 million rows]

This supports my assertion that using T-SQL ranking functions to effect “top x” or “most recent”-type queries is cute, but inefficient as the scale of the problem increases.

There is one thing that I do like about the use of row_number() in Query 2: it makes the intention of the query much more obvious. It is immediately clear what the query is attempting to do, rather than having to understand both the outer and inner queries of Query 1. Simply for readability and maintainability, I am more likely to lean towards a CTE for this kind of problem:


with maxdate
as
(
   select personid, max(ledgerdate) as x
   from ledger
   group by personid
)
select l.personid, amount
from ledger as l
 inner join maxdate as m on m.x = l.ledgerdate and m.personid = l.personid
order by l.personid;
go

And it turns out, this solution performs as well as Query 1:

[Image: relative query cost showing the CTE query on a par with Query 1]
