Thursday, December 22, 2011

MSSQL non-clustered indexes INCLUDE feature explained

Today I received a question from someone about the nature of the INCLUDE feature when creating a non-clustered (secondary) index on a table. My response was a bit long, and I haven't posted in a while -- ergo blogpost! Here's the question:

What is your opinion about using the INCLUDE statement when building your indexes? I’ve never really used the INCLUDE functionality and I’m curious if there are upsides or downsides to it. My reading led me to believe that I should use INCLUDE when I have a non-clustered index and space appears to be a concern. I can use the INCLUDE as part of the index with the columns that may not be used as frequently. We’re using standard datatypes and nothing with very large column widths. So is there a benefit to using INCLUDE?

A clustered index is stored in a B+-tree data structure, and the actual data (the whole row) is stored in the leaf. So if you think of a b-tree structure (simplified) as something like:
        A
      /   \
     B     C
    / \   / \
   M   N O   P

And you have records like
[ Id = 1, FirstName = Steve, LastName = Ash ]
[ Id = 2, FirstName = Neil, LastName = Gafter ]

Then for the clustered index, the id data (id=1 and id=2) will exist at every node (A, B, C, M, N, O, P), but the bytes for “Steve”, “Ash”, “Neil”, and “Gafter” will only exist in the leaf nodes (M, N, O, or P). The id is used to locate which leaf holds the whole record. (Note this is a simplification; see Wikipedia for more info about b+-trees.) The noteworthy fact is that being a clustered index means the whole record is in a leaf (i.e. M, N, O, P).
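To make the example concrete, here is a sketch of the kind of table these records assume (the table name matches the query later in the post; the column types and constraint name are my own guesses for illustration):

```sql
-- Hypothetical schema for the example rows above. In MSSQL a PRIMARY KEY
-- is clustered by default, so the whole row lives in the leaf level of
-- the Id b+-tree (nodes M, N, O, P above).
CREATE TABLE ThisTable (
    Id        INT         NOT NULL,
    FirstName VARCHAR(50) NOT NULL,
    LastName  VARCHAR(50) NOT NULL,
    CONSTRAINT PK_ThisTable PRIMARY KEY CLUSTERED (Id)
);

INSERT INTO ThisTable (Id, FirstName, LastName) VALUES (1, 'Steve', 'Ash');
INSERT INTO ThisTable (Id, FirstName, LastName) VALUES (2, 'Neil', 'Gafter');
```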

Now let’s think of a non-clustered index on lastName that corresponds to the clustered index above.
        D
      /   \
     E     F
    / \
   Q   R

In this case the “last name” is used to get to the leaf, and the leaf holds the primary key of the corresponding row in the clustered index. So D E or F will have things like lastName=Ash and lastName=Gafter and the leaves Q and R will have both lastnames and IDs. So an entry in Q or R might look like [lastname=Ash, id=1] (again simplifying).
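In DDL terms, the second tree corresponds to an index like the following (the index name is just an illustration):

```sql
-- Non-clustered index on LastName: the key is LastName, and each leaf
-- entry carries the clustering key (Id) as the row locator.
CREATE NONCLUSTERED INDEX IX_ThisTable_LastName
    ON ThisTable (LastName);
```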

So if you issue a query like
SELECT firstName FROM ThisTable WHERE lastName = ‘Ash’
(and the optimizer chooses to use the non-clustered index) then the database engine will do something like:
1. Seek the non-clustered index for ‘Ash’
   a. Look in D to decide which direction to go, E or F (let’s say E is the right choice)
   b. Look in E to decide which direction to go, Q or R (let’s say Q is the right choice)
2. Find the primary key for ‘Ash’
   a. In Q, find the id for Ash, which is id=1
3. Seek the clustered index for id=1
   a. Look in A to decide which direction to go, B or C (let’s say B is right)
   b. Look in B to decide which direction to go, M or N (let’s say N is right)
4. Find the value of firstName for id=1
   a. In N, find firstName for id=1, which is ‘Steve’
5. Return ‘Steve’ as the query result
Notice that we created a non-clustered index on lastName, and we can use that index to quickly locate rows by last name, but if we need any additional info, then we have to go back to the clustered index to get the other columns in the SELECT list.

The “include” provides a way for you to shove additional information in the leaf nodes of non-clustered indexes (Q and R) to alleviate this "going back" to the clustered index.

So had I created the non-clustered index above on LastName INCLUDE FirstName, then the engine would only need to do:

1. Seek the non-clustered index for ‘Ash’
   a. Look in D to decide which direction to go, E or F (let’s say E is the right choice)
   b. Look in E to decide which direction to go, Q or R (let’s say Q is the right choice)
2. Find the firstName for ‘Ash’
   a. In Q, thanks to the INCLUDE, the leaf entry holds both id=1 AND firstName=’Steve’, so we have the first name right here!
3. Return ‘Steve’ as the query result
So you get rid of that entire second seek into the clustered index. This additional seek shows up as a “bookmark lookup” operation in MSSQL 2008 & MSSQL 2000 query plans, and as just another join in MSSQL 2005 query plans. A bookmark lookup is a join -- just with a different name to indicate its semantic role in the query plan.
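For reference, a covering version of the index might look like this (the index name is made up; on a real system you would drop or alter the old index rather than keep both):

```sql
-- FirstName is stored only in the leaf pages (Q and R), not in the
-- non-leaf nodes (D, E, F), so seeks stay as cheap as before but the
-- query above no longer needs the bookmark lookup.
CREATE NONCLUSTERED INDEX IX_ThisTable_LastName_Covering
    ON ThisTable (LastName)
    INCLUDE (FirstName);
```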

When a query can be completely served from an index without needing to go back to the clustered index, that index is called a covering index: the index covers the needs of the query completely. And it’s a performance boost, as you don’t need the extra join.

So this means a few things:
1. You can obviously only “include” fields in non-clustered indexes. Clustered indexes already have all the fields in the leaf...so it doesn’t mean anything to include more.
2. You can only have one clustered index per table, but you can simulate having multiple clustered indexes by INCLUDE-ing the rest of the columns in your secondary indexes
3. By duplicating the data in the secondary index, if you UPDATE firstName – you now have to update both the clustered index and the nonclustered index. (Main trade-off consideration)
• This is also a huge deadlock opportunity if you’re not using read committed snapshot isolation (RCSI) level. Think of two queries: (1)
UPDATE MyTable SET firstName = ‘Steve2’ where id = 1
and (2)
SELECT shoeSize FROM MyTable WHERE lastName = ‘Ash’
(pretend shoeSize is a new field that is in the clustered index but NOT in the non-clustered index, i.e. NOT in the INCLUDE). Then the SELECT will seek the non-clustered index, grab shared (S) locks, then (while holding S locks) traverse the clustered index. Whereas (in the opposite order) the UPDATE will seek the clustered index, hold an exclusive (X) lock, then seek the non-clustered index to update firstName. The fact that these two queries hold incompatible locks in opposite orders is a deadlock waiting to happen. If I had a dollar for every time I diagnosed this deadlock scenario...
4. By duplicating the data in the secondary index, each leaf page (Q and R) now holds fewer rows per page, and thus there is a greater memory demand on the buffer cache (and more IO to read the same number of records) (Second trade-off consideration)

5. Some databases (even MSSQL < 2005) don’t support this feature, but you can approximate it by creating non-clustered indexes with compound keys. I.e. if I created the non-clustered index on
(lastName, firstName)
then the index still covers queries like
SELECT firstName FROM myTable WHERE lastName = ‘Ash’
Note that this is not as good as the INCLUDE solution (for this particular query) as now the bytes of firstName=’Steve’ take up some space in non-leaf nodes D E F.
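Sketched out, the compound-key workaround looks like this (illustrative index name again):

```sql
-- FirstName is part of the key here, so its bytes also take up space in
-- the non-leaf nodes (D, E, F), unlike the INCLUDE version -- but the
-- index still covers the SELECT above.
CREATE NONCLUSTERED INDEX IX_ThisTable_LastName_FirstName
    ON ThisTable (LastName, FirstName);
```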
So deciding to use an INCLUDE is (like everything) a trade off. If you have a performance critical query that is being executed frequently, then you can usually use INCLUDEs to reduce the number of joins and increase performance. In an environment where CPU is more precious than memory this can be a big win (we can talk about cache locality benefits of includes later). However, if you INCLUDE a column that will be UPDATEd later – then you often are shooting yourself in the foot as the cost can easily outweigh the benefit.

Last tidbit about INCLUDEs that I’ll mention. The SQL index analyzer is REALLY aggressive about recommending that you add indexes with lots of INCLUDEs. This is because the index analyzer usually doesn’t know how many UPDATEs you’re doing. Usually you just tell it what SELECT you want to speed up, and it naively says “oh well of course if you add these three covering indexes, this SELECT will be faster.” And while that’s true, it doesn’t take into account the _total_ workload (UPDATEs, DELETEs, etc.), so just be skeptical of its recommendations if you use the index analyzer.

I can’t give you a hard and fast rule to say INCLUDE is GOOD or EVIL as – like everything with database performance tuning – it depends ;)

Steve

Sunday, July 31, 2011

Delegates or interfaces? Functional and OO Dualism

I have a mixed background: doing C#/.NET for ~4 years, then switching to Java (switched jobs). I have been in the Java enterprise ecosystem for the last 4 years. I do mostly Java, but enjoy doing a little C# every now and again. C# is really a nice language. Shame it’s in such a horribly Microsoft-centric ecosystem.

In any case, I've been writing a little thing in C# and needed a type with a single method to "doWork". So coming from a Java bias I created a:
public interface IWorker {
    void DoWork();
}

Later, however, I wanted to offer the users an API option of just using a lambda to "doWork". Unfortunately, there is no type conversion from a delegate to the matching "single method interface" (at least that I could find, if someone knows the answer, please share!). So as a shim, I created:
public delegate void WorkerDelegate();

public class WorkerWrapper : IWorker {
    private readonly WorkerDelegate workerDelegate;

    public WorkerWrapper(WorkerDelegate workerDelegate) {
        this.workerDelegate = workerDelegate;
    }

    public void DoWork() {
        workerDelegate();
    }
}

So I wrap the lambda in a little wrapper and don't need to change my entire IWorker-based infrastructure. It works, but I'm not happy with this. I know that in the Java lambda mailing list, they are planning to include a "lambda conversion" so that lambdas can be converted to compatible single abstract method (SAM) types. This would've alleviated my need for the shim above as I could've assigned the lambda directly to an IWorker and all would've been well.

I believe the "C# way" would've been for me to use delegates all the way through to begin with. Had someone come along with an IWorker interface then that would've been assignable to my WorkerDelegate.

But is this the right answer? Conceptually, how should I think of these? What is an IWorker in the above case? Is it really just a chunk of code that should be passed around as such? Or is it a member in the space of collaborating types that make up my system...

This is an example of the conceptual problems reconciling a Functional view of the world with an object oriented view of the world (dualism). I know that many people smarter than me have thought about these problems, and I'm hoping that I can find some good articles discussing them.

It feels like we're describing the same concept: a "chunk of code" that is defined by an interface (call it a delegate or a SAM, same thing). I don't think that I would have any dissonance if both were assignable to each other, and thus can be treated as different expressions of the same concept. If this were the case, then maybe I would view delegates as just SAM types -- so my "world view" is still object oriented, I just have an additional, concise lambda syntax to create SAMs. Actually, if this were the case, then you could probably invoke the Liskov substitution principle and call functional-OO dualism reconciled...

But something still seems amiss. There is more to the identity of an IWorker than the fact that it takes no arguments and returns nothing. I suppose the same questions are true of reconciling structural typing to static typing. Hmm.. I have a lot of reading to do.

I imagine there are a number of philosophical problems between functional and OO. This is just the one I ran across and felt dissonance with in C#’s implementation. Maybe they truly are different things and should be treated as such. I hope (despite my extremely small readership) to get some links to articles on this topic.

Steve

Saturday, July 9, 2011

I <3 Robert C. Martin

I’m reading Clean Code by Uncle Bob. I read sections of this book when it came out and actually had the pleasure of watching Robert Martin present it at SDWest in 2008. I’ve decided to read the whole thing.

A while ago, I read a blog post from someone arguing that software wasn’t a craft but a trade. I believe the author’s intention was to say that we software developers should recognize that the value of software is the business value, and thus we shouldn’t wax philosophic about “elegance in design” or software aesthetics, as that is all time wasted in getting to the goal. I may be misrepresenting the author’s intention; I couldn’t find the post to link it.

In any case, I disagreed entirely with this opinion. While I agree that business value is the motivator, the craft aspects such as aesthetics, conceptual purity, elegance, etc. all contribute to the solution and its extensibility and maintainability. Maybe we’re just arguing over the definitions of craft, trade, or art, but in any case I feel there is value in recognizing the challenge of good engineering for today and tomorrow. The masters do it almost effortlessly-- almost accidentally. That feels like art to me and thus should be labelled appropriately as craft.

To this point, clean code is more art than science and Mr. Martin has something to say about it that I really enjoyed:
Every system is built from a domain specific language designed by the programmers to describe their system. Functions are the verbs, classes are the nouns. This is not some throwback to the hideous old notion that the nouns and verbs in a requirements document are the first guess of the classes and functions of a system. Rather, this is a much older truth. The art of programming is, and always has been, the art of language design.

Master programmers think of systems as stories to be told rather than programs to be written. They use facilities of their chosen programming language to construct a much richer and more expressive language that can be used to tell that story. Part of that domain-specific language is the hierarchy of functions that describe all the actions that take place within that system. In an artful act of recursion those actions are written to use the very domain specific language they define to tell their own small part of the story.
So to argue that software is not art is to naively ignore the reality that language is hard and has a dramatic effect on the bottom line of your code base. How many software systems never change or never need to be understood after they are written? Such systems must not be very interesting or do anything important.

Let's recognize the art of good software engineering! It will motivate us to continue to improve if we recognize these things have a value.

Monday, June 27, 2011

Maven -_-

I have spent the last few days mucking about with POM files. Anyone that has done this understands where I'm going with this. So I'll just leave these two quotes that fit nicely:
But there has to be something fundamentally wrong with any tool that, whenever I use it, seems to have at least a 50% chance of completely fucking up my day.

-Charles Miller

The people who love Maven love the theory. The people who hate Maven hate the reality.

-Zutubi

Frustration.

EDIT: For anyone running across this: now that I am well over the "learning curve" hump, I <3 maven. Seriously. Yes, there are some rough edges-- I really should be able to delete folders simply without having to do ant-runs, etc. But its benefit drastically outweighs its cost in our environment.

Wednesday, April 13, 2011

One of my favorite parts of Pragmatic Thinking is the description of the Dreyfus model of skill acquisition. This describes the phenomenon of how people are frequently distributed along some non-trivial "skill" (like programming), and defines metrics about what differentiates each skill level. Overall, as people move up to higher skill levels, they increase their intuition about the problem space. Intuition is something for which we must train our brain through experience and knowledge. To that end, there are a few books that have helped me in increasing my intuition, which I would like to catalog. If you have others that are missing, please leave a comment! My appetite for the Amazon marketplace is insatiable ;-)

OO and Design Patterns

• Uncle Bob's essay on SOLID - This is Robert C Martin's article outlining the principles of Object Oriented Design. If SOLID doesn't ring a bell, then start with this article. Note that the now-famous acronym SOLID is not actually mentioned in the article, but Robert C Martin is still credited as its inventor.
• Agile Software Development Principles, Patterns, and Practices - lovingly referred to as the PPP book (not to be confused with the protocol). This describes much of the why of OOD, and explains the Agile mindset.
• Object Design - another often referenced book describing the why and various design decisions that go into object oriented design. Thinking about objects isn't hard-- category theory and abstraction is something that our brain does quite naturally (hence the appeal of this design methodology). However, the ergonomics of OOD can sometimes lead to an undeserved sense of self-confidence in one's design. It may feel like OOD, because "oh look-- there are objects there! inheritance! oh, my!", but within the context of software engineering there are many factors that make some designs good and some bad.
• Design Patterns: Elements of Reusable Object Oriented Software - Canonical book by the "gang of four". Has good description of patterns and why they are useful. If you prefer something with prettier pictures, starting with the Head First Design Patterns book is nice too
• Essays on OO Software Engineering - Maybe not terribly well-known, but well written theoretical overview of OO, the design forces in OOD, and the motivations behind them. I had the pleasure to work with Ed.
• Patterns of Enterprise Application Architecture - Canonical book on patterns by the man himself
• Refactoring - Another Fowler book. I haven't read this one cover to cover, but read many chapters. Good examples to get your head around particular refactoring patterns
• Domain Driven Design - Eric Evans' now-famous book on how to turn business problems into rich object models. Great read.
• Real World Java EE Patterns - while this is a "java" book, the principles and design trade-off analysis that Adam Bien does for each pattern is universal. Good read.

Code

• Beautiful Code - Essays where programmers reflect on what is beauty in code. As the authors wax philosophic about their "beautiful" examples, you glean insight into their thought process. It's a nice read.
• Implementation Patterns - Kent Beck's book about the low-level decisions we make as we actually type the code and create APIs (knowingly or unknowingly). One of my top recommendations. I really love this book.
• Clean Code - guide to code readability, writing code at appropriate abstractions, etc. A nice companion to Implementation Patterns. I had the pleasure of meeting Robert C Martin at a conference once. I'm sure I acted appropriately star-struck.

Other "meta" and philosophic musings

• Notes on the Synthesis of Form - this is not a computer science book, but deals with the theory of what is design, and describes a methodology for decomposition. It's a quick read, and at least the first half is worth the exploration just to get some ideas in your head about methodologies for design.
• Pragmatic Programmer - what list would be complete without this one! Good overview of the pragmatic aspects of becoming a good programmer. From problem solving skills to tools, this is required reading. I think if you read this, then skip or at most skim Productive Programmer
• Pragmatic Thinking: Refactoring your wetware - I mentioned this at the front of the post. I enjoyed most of this book-- in particular the front half. Some of the topics towards the end were less interesting, because they were more familiar. In any case at least the first three chapters are a great read.
• The Little Schemer - you can run through this in an afternoon. This is an introduction to recursion, via Scheme/LISP. This book is fun because of its simple and unique format, and it takes you through a journey that helps shape your mind to "think" of solutions recursively.
• Beautiful Architecture - This one wasn't as successful to me as Beautiful Code, but still worth a read. There are a few chapters you can skip, but a few (especially the first chapter) that are great.
• Beautiful Data - I'm only about half way through this one, and need to spend more time with it, as I'm sure there are some gems in there I have yet to run across.

These are books that are on my shelf to read next (that are relevant to this list at least)
• The Design of Design - Fred Brooks (of Mythical Man Month fame) waxes about design philosophies and goes through a few case studies. I'm really excited to make time for this.
• Masterminds of Programming - interviews with the creators of many of the major languages used over the last 30 years. Looks promising.
• Thoughtworks Anthology - I am going to be selective in which essays I read. I will start with those that are most applicable to my world, and then move out to the more "exotic" (relatively). I know that there are essays in here that will be more useful to my current world than others.

I feel like I'm missing some...

Steve

Sunday, March 6, 2011

Spring 2011 Reading List

I gave last Spring's reading list (which was more what I had read in the previous year) in this post. This spring, I am going to write what I am currently reading or hope to finish this Spring.
• Seven Languages in Seven Weeks - this book looks interesting, and certainly the challenge of trying to meaningfully cover seven languages and language philosophies in a short book will be interesting if nothing else.
• Programming in Scala - I have "played" with Scala, but not done more than toy programs. However, from everything I know about it- I believe I will really enjoy it. I buy into a lot of the functional vs object oriented dualism, and certainly enjoy the terse syntax. I think there are still interesting questions with regards to software development economics (i.e. how do we integrate Scala in a team of mixed skill levels, cost trade offs, etc.)
• Thinking Forth - I'm not particularly interested in Forth, but I am interested in language design and problem solving philosophies.
• Domain Driven Design - this was recommended by a commenter in the previous post, and I picked it up. I've made it through quite a bit of the book, and have enjoyed it so far. I had always heard this book referenced by Fowler et al, and really I should've read it years ago...
• Introduction to Reliable Distributed Programming - it's a Springer book, 'nuff said. I've been through a few distributed computing books and have a fairly broad knowledge of the space. I am hoping to get a more mature, theoretically rigorous view of the problems now.
• The Art of Multiprocessor Programming - this book is great. I enjoyed the nice mix of intuitive description and theoretical rigor. It's a nice blend of theory and pragmatics. I really recommend reading this for anyone that wants in depth knowledge of parallel computing.
• Introduction to Neural Networks for Java - this was an interesting book to skim. I don't really want to recommend it, because most of it explains the author's neural network code base instead of the underlying concepts. In any case, it's a nice thing to skim over an afternoon.
• Neural Networks and Learning Machines - Great textbook for all the theoretical background and mathematical proofs regarding Neural Nets and machine learning. I've had to crack open my calc books to freshen up while going through this...its dense.
• Fuzzy Models and Genetic Algorithms for Data Mining and Exploration - I can't recommend this book (see my Amazon review if you're curious why), but it does provide a decent vocab overview for the topics it covers.

Well, that's it for the moment! This spring is a bit more theory heavy than the previous list... that probably reflects my work and school change, which has been more research oriented in the last year. I'm not moving away from the real world, just trying to get a more comprehensive grasp on both worlds in the areas in which I am interested. I've always thought this was one of my strengths-- knowing enough theory to shape my thinking skills and be knowledgeable-- but continuously, deliberately enriching my implementation and practical skills to actually put the theory to good use! The intersection of the two is the more rewarding area for me, I think.

Steve

Saturday, March 5, 2011

Building Derby with Eclipse gotcha

This is partially a reference for myself and for anyone else trying to build Derby 10 from source using Eclipse. My thesis project involves modifying the on-disk page layout for a database (long story for another post). I am currently using Eclipse for my IDE. I just wanted to document a quick pain point in case anyone else is Googling for the answer:

I wanted to do a clean build of all of Derby. So I set up an ANT External Tool launch configuration:
• Launch the build.xml from the root Derby directory
• Working directory is the root Derby directory
• Refresh resources upon completion
• Uncheck the "build before launch" option
• Choose the clobber and all targets (clobber is a total clean)
• Defaults for everything else

I started the build and it failed with the following errors (I'm omitting some of the ones in the middle):

[javac] C:\DerbyDev\java\build\org\apache\derbyBuild\javadoc\DiskLayoutTaglet.java:24: package com.sun.tools.doclets does not exist
[javac] import com.sun.tools.doclets.Taglet;
[javac] ^
[javac] C:\DerbyDev\java\build\org\apache\derbyBuild\javadoc\DiskLayoutTaglet.java:25: package com.sun.javadoc does not exist
[javac] ^

...

[javac] symbol : class Taglet
[javac] Taglet t = (Taglet) tagletMap.get(tag.getName());
[javac] ^
[javac] 40 errors

So as you can see from the first two errors, it's missing the dependency for the Javadoc pieces. As it turns out, Derby defines its own Taglet for Javadoc processing. All of these classes are defined in the JDK's tools.jar file, which is in the JDK's lib directory (not included in the JRE).

Derby's build script includes tools.jar by:
<pathelement path="${java15compile.classpath};${java.home}/../lib/tools.jar"/>

Well, I went to my command prompt and looked at my environment variables, and JAVA_HOME was set to the JDK's location. So this should have worked. I added a new launch configuration to execute the showenv target, and added an echo statement to print java.home.

I ran this and, lo and behold, it was pointing to the JRE path, which doesn't have a tools.jar. So the world made sense now. In my Eclipse installation, I have both the JDK and JRE installations defined in my preferences. For some reason (subconscious desire to shoot myself in the foot?), the JRE was chosen for this workspace, and thus the launch configuration used it.

So to resolve the problem, you can change the external tools launch configuration for the ANT script: go to the JRE tab, select "Separate JRE", and choose the JDK. Or change the default JRE for the workspace and leave the launch configuration set to run in the same JRE as the workspace.

This isn't a tricky problem, but can be a little misleading...plus I haven't posted in a while ;-)