Thursday, December 22, 2011

MSSQL non-clustered indexes INCLUDE feature explained

Today I received a question from someone about the nature of the INCLUDE feature when creating a non-clustered (secondary) index on a table. My response was a bit long, and I haven't posted in a while -- ergo blogpost! Here's the question:

What is your opinion about using the include statement when building your indexes? I’ve never really used the included functionality and I’m curious if there are upsides or downsides to them. My reading lead me to believe that I should use the include when I have a non clustered index and space appears to be a concern. I can use the ‘include’ as part of the index with the columns that may not be used as frequently. We’re using standard datatypes and nothing with very large column widths. So is there a benefit to using include?



A clustered index is stored in a B+-tree data structure and the actual data the whole row of data is in the leaf. So if you think of a b-tree structure (simplified) as something like:
           A
         /   \
        B     C
       / \   / \
      M   N O   P
And you have records like
[ Id =1, FirstName = Steve, LastName = Ash ]
[ Id = 2, FirstName = Neil, LastName=Gafter ]
Then for the clustered index, the id data (id=1 and id=2) will exist at every node (A, B, C, M, N, O, P), but the bytes for “steve” “ash” “neil” “gafter” will only exist in leaf nodes (either M N O or P). The id is used to locate which leaf holds the whole record. (note this is a simplification, see Wikipedia for more info about b+-trees). The noteworthy fact is being a clustered index means that the whole record is in a leaf (i.e. M N O P).

Now let’s think of a non-clustered index on lastName that corresponds to the clustered index above.
           D
         /   \
        E     F
       / \   
      Q   R         
In this case the “last name” is used to get to the leaf, and the leaf holds the primary key of the corresponding row in the clustered index. So D E or F will have things like lastName=Ash and lastName=Gafter and the leaves Q and R will have both lastnames and IDs. So an entry in Q or R might look like [lastname=Ash, id=1] (again simplifying).

So if you issue a query like
SELECT firstName FROM ThisTable WHERE lastName = ‘Ash’
(and the optimizer chooses to use the non-clustered index) then the database engine will do something like:
  1. Seek the non-clustered index for ‘Ash’
    1. Look in D to decide which direction to go E or F (lets say E is the right choice)
    2. Look in E to decide which direction to go Q or R (lets say Q is the right choice)
  2. Find the primary key for ‘Ash’
    1. In Q find id for Ash – which is id=1
  3. Seek the clustered index for id = 1
    1. Look in A to decide which direction to go B or C (lets say B is right)
    2. Look in B to decide which direction to go M or N (lets say N is right)
  4. Find the value of firstName for id=1
    1. In N find firstName for id=1 which is ‘Steve’
  5. return ‘Steve’ as the query result
Notice that we created a non-clustered indexed on lastName and we can use that index to quickly locate things by last name, but if we need any additional info, then we have to go back to the clustered index to get the other info in the SELECT list.

The “include” provides a way for you to shove additional information in the leaf nodes of non-clustered indexes (Q and R) to alleviate this "going back" to the clustered index.

So had I created the non-clustered index above on LastName INCLUDE FirstName then the engine would only need to do:

  1. Seek the non-clustered index for ‘Ash’
    1. Look in D to decide which direction to go E or F (lets say E is the right choice)
    2. Look in E to decide which direction to go Q or R (lets say Q is the right choice)
  2. Find the firstName for ‘Ash’
    1. In Q due to the include there is id=1 AND firstName=’Steve’ so we have the first name right here!
  3. return Steve
So you get rid of that entire other seek into the clustered index. This additional seek shows up as a “bookmark lookup” operation in the query plan in MSSQL 2008 & MSSQL 2000 and just another join in MSSQL 2005 query plans. Bookmark lookup is a join -- just with a different name to indicate its semantic role in the query plan.

When you have an index such that they query can be completely served from the index without needing to go back to the clustered index – such an index is called a covering index. The index covers the needs of the query completely. And it’s a performance boost as you don’t need the other join.

So this means a few things:
  1. You can obviously only “include” fields in non-clustered indexes. Clustered indexes already have all the fields in the leaf...so it doesn’t mean anything to include more.
  2. You can only have one clustered-index for a table, but you can simulate have multiple clustered indexes by INCLUDing the rest of the columns on your secondary indexes
  3. By duplicating the data in the secondary index, if you UPDATE firstName – you now have to update both the clustered index and the nonclustered index. (Main trade-off consideration)
    • This is also a huge deadlock opportunity if you’re not using read committed snapshot isolation (RCSI) level. Think of two queries: (1)
      UPDATE MyTable SET firstName = ‘Steve2’ where id = 1
      and (2)
      SELECT shoeSize FROM MyTable WHERE lastName = ‘Ash’
      (pretend shoe size is a new field that is in the clustered index but NOT in the non-clustered, i.e. NOT in the INCLUDE). Then the SELECT will seek the non-clustered index, grab shared (S) locks, then (while holding S locks) traverse the clustered index. Whereas (in the opposite order) the UPDATE will seek the clustered index, hold an exclusive (X) lock, then seek the non-clustered index to update firstName. The fact that these two queries are holding incompatible locks in opposite directions, is a deadlock waiting to happen. If I had a dollar for every time I diagnosed this deadlock scenario...
  4. By duplicating the data in the secondary index, each page in the leaf (Q and R) now has fewer rows per page and thus there is a greater memory demand on the buffer cache (and more IO to get the same number of records) (Second trade off consideration)



  5. Some databases (even MSSQL < 2005) don’t support this feature, but you can approximate it by creating non-clustered indexes with compound keys. I.e. if I created the non-clustered index on
    (lastName, firstName)
    then the index still covers queries like
    SELECT firstName FROM myTable WHERE lastName = ‘Ash’
    Note that this is not as good as the INCLUDE solution (for this particular query) as now the bytes of firstName=’Steve’ take up some space in non-leaf nodes D E F.
So deciding to use an INCLUDE is (like everything) a trade off. If you have a performance critical query that is being executed frequently, then you can usually use INCLUDEs to reduce the number of joins and increase performance. In an environment where CPU is more precious than memory this can be a big win (we can talk about cache locality benefits of includes later). However, if you INCLUDE a column that will be UPDATEd later – then you often are shooting yourself in the foot as the cost can easily outweigh the benefit.

Last tidbit about INCLUDEs that I’ll mention. The sql index analyzer is REALLY aggressive about recommending you add indexes with lots of INCLUDEs. This is because the index analyzer usually doesn’t know how many UPDATEs you’re doing. Usually you just tell it what SELECT you want to speed up and it naively says “oh well of course if you add these three covering indexes, this SELECT will be faster.” And while that’s true it doesn’t take into account the _total_ workload (UPDATEs DELETEs etc) so just be skeptical when you see this if you use the index analyzer.

I can’t give you a hard and fast rule to say INCLUDE is GOOD or EVIL as – like everything with database performance tuning – it depends ;)

Steve