It’s interesting how a lot of the work I’ve been doing lately has in some way involved a kind of performance tuning. Previously I’ve talked about how to increase the performance of .NET applications by using delegates instead of reflection in code that runs frequently.

This time it is all about performance in database processing.

The scenario

Imagine an application that manages a wait list. Users of this application put themselves in line and wait for their turn to gain access to some kind of shared resource. Here are the basic rules of this system:

  • The same user can appear in the wait list multiple times, once for every resource she is queuing for.
  • The users’ position in the wait list at any given time is decided by a score.
  • This score is calculated based on the number of credits each user has in the system compared to the amount required by the resource they wish to access.

Let’s say that this wait list is modeled in a Microsoft SQL Server database with the following schema:

WaitListSchema

The position of the different users in the wait list is periodically updated by a Stored Procedure that calculates the current score for each and every row in the WaitList table.

So far so good. Now, imagine that this WaitList table contains somewhere around 30 millions rows, and the Stored Procedure that updates all of the scores takes about 9 hours to complete. And now we have problem.

 The imperative SQL approach

Before going into all kinds of general database optimization techniques, let’s start off by looking at how that Stored Procedure is implemented.
Here is a slightly simplified version of it:

CREATE PROCEDURE CalculateWaitListScores_Imperative
AS
BEGIN

DECLARE @rowsToCalculate INT

SELECT @rowsToCalcualte = COUNT(*)
FROM WaitList
AND Score IS NULL

WHILE ( @rowsToCalculate > 0 )
BEGIN

  DECLARE @userID INT
  DECLARE @resourceID INT
  DECLARE @score INT

  SELECT TOP 1 @userID = UserID, @resourceID = ResourceID
  FROM WaitList
  AND Score IS NULL

  -- The actual calculation of the score is omitted for clarity.
  -- Let's just say that it involves a SELECT query that joins
  -- the [WaitList] table with the [User] and [Resource] tables
  -- and applies a formula that associates the values
  -- of the [Credit] columns in each of them.
  -- For the sake of this example we just set it to a constant value
  SET @score = 150

  UPDATE WaitList
  SET Score = @score
  WHERE UserID = @userID
  AND ResourceID = @resourceID
  AND Score IS NULL

  SELECT @rowsToCalcualte = COUNT(*)
  FROM WaitList
  AND Score IS NULL

END

END

If you aren’t into the Transact-SQL language syntax, let me spell out the algorithm for you:

  1. Get the number of rows in the WaitList table where the score has never been calculated
  2. If there are any such rows, get the user and the resource IDs for the first row in the WaitList table where the score has never been calculated
  3. Calculate the score for that user and resource
  4. Update the score with the newly calculated value
  5. Go to Step 1

In the worst case, this set of operations will be repeated 30 millions times, that is once for every row in the WaitList table. Think about it for a moment.

While looking at this code, I immediately imagined this dialogue taking place between SQL Server and the developer(s) who wrote the Stored Procedure:

Developer: Listen up, SQL Server. I want you to calculate a new score and update all of those 3o millions rows, but do it one row at a time.

SQL Server: That’s easy enough, but I’m pretty sure I can find a faster way to do this, if you’ll let me.

Developer: No, no. I want you to do exactly what I said. That way it’s easier for me to understand what’s going on and debug if any problem occurs.

SQL Server: Alright, you’re the boss.

Jokes aside, the bottom line here is this:

infoBy implementing database operations in an imperative manner, you effectively tie up the hands of the query execution engine, thus preventing it from performing a number of optimizations at runtime in order to speed things up.

And that basically means trading performance and scalability for more fine-grained control.

 The declarative SQL approach

Let’s see if we can make this Stored Procedure run any faster, by changing our approach to the problem altogether.

infoThis time, we’ll tell the database what we want done in a declarative manner, and we’ll let the query execution engine figure out the best way to get the job done.

Here is a rewritten version of the original Stored Procedure:

CREATE PROCEDURE CalculateWaitListScores_Declarative
AS
BEGIN

UPDATE WaitList
SET Score = dbo.CalculateScore(UserID, ResourceID)
WHERE Score IS NULL
	
END

What we did is basically removing the explicit loop and merging all operations into a single UPDATE statement executed on the WaitList table, which invokes a custom a scalar function (CalculateScore) to calculate the score with the value of the current row.

Now, let’s look at some performance comparison:

WaitListPerfChart

That’s a pretty significant leap in speed. How is that possible? A look at the CPU usage on the database server while running the two versions of the Stored Procedure pretty much explains it all:

CPU usage while executing CalculateWaitListScores_Imperative:

CpuUsageWithImperativeSql

CPU usage while executing CalculateWaitListScores_Declarative:

CpuUsageWithDeclarativeSql 

As you see, in the first picture the CPU is steadily at 9-10% and is basically using only one out of four available cores. This is because SQL Server is forced to do its work sequentially and has to wait until the score for the current row has been calculated and updated before proceeding to the next.

In the second picture, we are simply telling SQL Server our intent, rather than dictating exactly how it should be done. This allows SQL Server to parallelize the workload than can now be executed on multiple CPU/Cores at once leveraging the full power of the hardware.

Lessons learned

Here are a couple of getaways I learned from this exercise:

  1. SQL is a declarative language at its core, designed to work with sets of rows. That’s what it does best and that’s how you should use it.
  2. Whenever possible, try to avoid applying an imperative programming mindset when implementing database operations, even if  the constructs available in SQL-derived languages like T-SQL make it easy to do so
  3. Don’t be afraid to give up some control over what happens at runtime when your database code runs. Let the database find out the best way to do things, and get ready to experience some great performance improvements.

Hope this helps.

/Enrico

SQL Injections

12/11/2008

When discussing web development, the subject of security inevitably comes up. SecurityLast week while I was teaching a class in ASP.NET, we had a sample login page that,   although being quite trivial in nature, was exposing a serious vulnerability to SQL injections attacks.

Does it hurt?

Well, it might. But let’s first start off by defining the concepts. An SQL Injection is a term used to define a type of attack that exploits a vulnerability in the data access layer (DAL) of an application to embed arbitrary SQL statements in properly crafted user input, that will get executed at runtime. Let me show you how this works with an example.

Vulnerability and exploit

Suppose we have a web site that prompts users to enter their username and password credentials in order to authenticate their identity. Typically, somewhere in the DAL of the application, there will be a chunk of code looking not to different than this:

using (SqlConnection connection = new SqlConnection("ConnectionString"))
{
    string query = "SELECT EXISTS FROM Users WHERE Username='"
        + username + "' AND Password='" + password + "'";
    SqlCommand command = new SqlCommand(connection, query);
    bool isAuthenticated = (bool)command.ExecuteScalar();
}
 
This code exposes a serious vulnerability whenever the content of the ‘username’ and ‘password’ strings coming from the UI are not properly validated. Now, a malicious user could exploit this vulnerability by entering the following value in the username input field:
 
SomeUser' OR 1=1 --
 
What we have done is effectively extended the SQL query that will be run against the database by appending an additional condition to it that will cause it to always evaluate to ‘true’ and commented out the rest of the original query. Here is how the ‘query’ variable will look like when executed at runtime:
 
SELECT EXISTS FROM Users WHERE username='SomeUser' OR 1=1 --'AND Password=''
 

Yes folks, this will let us login to the web site without actually possessing a valid user account.

In other scenarios SQL Injections can be used to steal information. Suppose our web site has a page displaying a list of products. The URL then could more then likely look like this:

http://unsafewebsite.com/products.aspx?category=books

Then the page would run the following piece of code to retrieve and return the appropriate rows from the database table:

using (SqlConnection connection = new SqlConnection("ConnectionString"))
{
    string query = "SELECT * FROM Products WHERE Category='"
        + category + "'";
    SqlCommand command = new SqlCommand(connection, query);
    return command.ExecuteReader();
}

An attacker could exploit the fact that the URL parameter is not properly validated and enter the following value in the browser’s address bar:

http://unsafewebsite.com/products.aspx?category=Something';SELECT * FROM Users --

Here we embedded a new SQL statement that will run after the first one during the same database connection. Again, without proper validation of the user input, the final command that will get executed at runtime will look like this:

SELECT * FROM Products WHERE Category='Something';SELECT * FROM Users --'

Which will display all information about the users of the web site on the products page. Ouch!

The solution

So, how can we prevent this from happening? Well, there are a couple of security golden rules you should keep in mind:

  1. Never trust user input: this means you should always assert that every single piece of data coming from the user interface is what you expect it to be, and nothing else. In some places like for example web site URLs it is a good idea to sanitize all strings by encoding them, which will replace all special characters with HTML or ASCII values. This will prevent maliciously crafted strings from being evaluated at runtime and executed as code.
  2. Always use parameters in SQL statements: avoid dynamically generating SQL statements in code by concatenating hard-coded strings and variables, like I showed you in the examples above. Instead you should use strongly-typed parameters that will be assigned a value at runtime. For example the following code will prevent SQL injections by only accepting numeric values for the parameter:
using (SqlConnection connection = new SqlConnection("ConnectionString"))
{
    string query = "SELECT * FROM Products WHERE ProductId=@id";
    SqlCommand command = new SqlCommand(connection, query);
    // Valid parameter values are only integers
    command.Parameters.Add("id", SqlDbType.Int, productId);
    return command.ExecuteReader();
}
 

Bringing it together

SQL injections are a relatively simple way of attacking an application by embedding SQL statements in the user input and having them execute unexpectedly. When in the wrong hands they can indeed be used for great evil. Fortunately it’s not too hard to guard ourselves from this kind of vulnerability by always validating incoming data and  using parameters in all our SQL commands.

/Enrico

Follow

Get every new post delivered to your Inbox.

Join 186 other followers