It’s interesting how a lot of the work I’ve been doing lately has in some way involved a kind of performance tuning. Previously I’ve talked about how to increase the performance of .NET applications by using delegates instead of reflection in code that runs frequently.

This time it is all about performance in database processing.

The scenario

Imagine an application that manages a wait list. Users of this application put themselves in line and wait for their turn to gain access to some kind of shared resource. Here are the basic rules of this system:

  • The same user can appear in the wait list multiple times, once for every resource she is queuing for.
  • The users’ position in the wait list at any given time is decided by a score.
  • This score is calculated based on the number of credits each user has in the system compared to the amount required by the resource they wish to access.

Let’s say that this wait list is modeled in a Microsoft SQL Server database with the following schema:

WaitListSchema

The position of the different users in the wait list is periodically updated by a Stored Procedure that calculates the current score for each and every row in the WaitList table.

So far so good. Now, imagine that this WaitList table contains somewhere around 30 millions rows, and the Stored Procedure that updates all of the scores takes about 9 hours to complete. And now we have problem.

 The imperative SQL approach

Before going into all kinds of general database optimization techniques, let’s start off by looking at how that Stored Procedure is implemented.
Here is a slightly simplified version of it:

CREATE PROCEDURE CalculateWaitListScores_Imperative
AS
BEGIN

DECLARE @rowsToCalculate INT

SELECT @rowsToCalcualte = COUNT(*)
FROM WaitList
AND Score IS NULL

WHILE ( @rowsToCalculate > 0 )
BEGIN

  DECLARE @userID INT
  DECLARE @resourceID INT
  DECLARE @score INT

  SELECT TOP 1 @userID = UserID, @resourceID = ResourceID
  FROM WaitList
  AND Score IS NULL

  -- The actual calculation of the score is omitted for clarity.
  -- Let's just say that it involves a SELECT query that joins
  -- the [WaitList] table with the [User] and [Resource] tables
  -- and applies a formula that associates the values
  -- of the [Credit] columns in each of them.
  -- For the sake of this example we just set it to a constant value
  SET @score = 150

  UPDATE WaitList
  SET Score = @score
  WHERE UserID = @userID
  AND ResourceID = @resourceID
  AND Score IS NULL

  SELECT @rowsToCalcualte = COUNT(*)
  FROM WaitList
  AND Score IS NULL

END

END

If you aren’t into the Transact-SQL language syntax, let me spell out the algorithm for you:

  1. Get the number of rows in the WaitList table where the score has never been calculated
  2. If there are any such rows, get the user and the resource IDs for the first row in the WaitList table where the score has never been calculated
  3. Calculate the score for that user and resource
  4. Update the score with the newly calculated value
  5. Go to Step 1

In the worst case, this set of operations will be repeated 30 millions times, that is once for every row in the WaitList table. Think about it for a moment.

While looking at this code, I immediately imagined this dialogue taking place between SQL Server and the developer(s) who wrote the Stored Procedure:

Developer: Listen up, SQL Server. I want you to calculate a new score and update all of those 3o millions rows, but do it one row at a time.

SQL Server: That’s easy enough, but I’m pretty sure I can find a faster way to do this, if you’ll let me.

Developer: No, no. I want you to do exactly what I said. That way it’s easier for me to understand what’s going on and debug if any problem occurs.

SQL Server: Alright, you’re the boss.

Jokes aside, the bottom line here is this:

infoBy implementing database operations in an imperative manner, you effectively tie up the hands of the query execution engine, thus preventing it from performing a number of optimizations at runtime in order to speed things up.

And that basically means trading performance and scalability for more fine-grained control.

 The declarative SQL approach

Let’s see if we can make this Stored Procedure run any faster, by changing our approach to the problem altogether.

infoThis time, we’ll tell the database what we want done in a declarative manner, and we’ll let the query execution engine figure out the best way to get the job done.

Here is a rewritten version of the original Stored Procedure:

CREATE PROCEDURE CalculateWaitListScores_Declarative
AS
BEGIN

UPDATE WaitList
SET Score = dbo.CalculateScore(UserID, ResourceID)
WHERE Score IS NULL
	
END

What we did is basically removing the explicit loop and merging all operations into a single UPDATE statement executed on the WaitList table, which invokes a custom a scalar function (CalculateScore) to calculate the score with the value of the current row.

Now, let’s look at some performance comparison:

WaitListPerfChart

That’s a pretty significant leap in speed. How is that possible? A look at the CPU usage on the database server while running the two versions of the Stored Procedure pretty much explains it all:

CPU usage while executing CalculateWaitListScores_Imperative:

CpuUsageWithImperativeSql

CPU usage while executing CalculateWaitListScores_Declarative:

CpuUsageWithDeclarativeSql 

As you see, in the first picture the CPU is steadily at 9-10% and is basically using only one out of four available cores. This is because SQL Server is forced to do its work sequentially and has to wait until the score for the current row has been calculated and updated before proceeding to the next.

In the second picture, we are simply telling SQL Server our intent, rather than dictating exactly how it should be done. This allows SQL Server to parallelize the workload than can now be executed on multiple CPU/Cores at once leveraging the full power of the hardware.

Lessons learned

Here are a couple of getaways I learned from this exercise:

  1. SQL is a declarative language at its core, designed to work with sets of rows. That’s what it does best and that’s how you should use it.
  2. Whenever possible, try to avoid applying an imperative programming mindset when implementing database operations, even if  the constructs available in SQL-derived languages like T-SQL make it easy to do so
  3. Don’t be afraid to give up some control over what happens at runtime when your database code runs. Let the database find out the best way to do things, and get ready to experience some great performance improvements.

Hope this helps.

/Enrico

The other day I was reading about one of the most talked about future Microsoft technologies code-named “Oslo”. The details of what “Oslo” is are still pretty vague, at  least until PDC 2008. In short it’s a set of technologies aimed at creating Domain Specific Languages (DSL) that can then be used to build executable programs. Yes, I know that still sounds vague, but that’s really all Microsoft has reveled about the whole project so far.dsl

According to a recent interview on .NET Rocks! with Don Box and Doug Purdy, two of the main architects behind “Oslo”, Microsoft’s goal is to:

enable people with knowledge on a specific domain, to create software in order to solve a problem in that domain, without having to know the technical details of software construction.

To me, “Oslo” represents the next big step in the evolution of programming styles.

Imperative Programming

Traditionally computer programming has always been about “telling” the machine what to do by specifying a set of instructions to execute in a particular order. This is known as imperative programming. The way programmers expressed these instructions has evolved over time from being commands directly interpreted by CPU to higher level structured languages that used abstracted concepts instead of processor-specific instructions, and let another program, the compiler, produce the appropriate code executable by the machine. The goal of programming languages has always been to give programmers new metaphors to interact with the computer in a more intuitive and natural way. Here are a few important milestones in this evolution:

  • Structured Programming: introduced natural language-like constructs to control the execution flow of program, like selection (“IF“, “ELSE“) and iteration (“WHILE“, “FOR“). Before that programmers used to construct programs by explicitly pointing to the next instruction to execute (“GOTO“).
  • Procedural Programming: introduced the concept of a “routine“, an ordered set of instructions that could be executed as a group by referring to them with a name (procedure call). Routines could take input data through parameters and output results through return values. Related routines could be grouped into modules to better organize them.
  • Object-Oriented Programming: focuses on the shape of the data being processed in a program. Introduced the concepts of “objects“, data structures that contain both data and the functions (methods) that operate on that data in a single unit. Objects and the interactions among them are modeled to represent real-world entities inside of a program, which allows programmers to think about a problem in a more natural way.

Declarative Programming

At the same time another style of programming has evolved over the years, known as declarative programming. Instead of telling the computer what to do, declarative programming focuses on telling what results are expected, and letting the computer figure out which steps it has to go through to obtain those results. Declarative programming expresses intent without issuing commands. Key milestones in the evolution include:

  • Functional Programming: describes a program as a set of mathematical functions that operate on immutable data. Functions can have data or other functions as their input and return the result of the computation.
  • Logic Programming: describes a program as a set of mathematical logic expressions. These expressions often take the form of premises and conclusions composed by logical statements (for example, “IF A AND B THEN C” where A, B, and C are logical statements). The program output is produced by evaluating these expressions on set of data called facts.
  • Language Oriented Programming: focuses on constructing a domain-specific language to describe the problem in the terms of its domain, and then using that language to describe a program to solve that problem.

Microsoft “Oslo”

The technologies delivered with “Oslo” fall clearly in this last category. In “Oslo” a compiler will translate a program expressed with a domain-specific language into an imperative general-purpose programming language such as C# or Visual Basic, which in its turn gets compiled into executable machine code. This way programmers can think using even more natural metaphors, and let the computer take care of the details of how to translate their intent into running software.

/Enrico

Follow

Get every new post delivered to your Inbox.

Join 185 other followers