Thursday, April 29, 2010

Wow...erm...your database is leaking

It's no secret that some of the traits that go into making good software developers are: problem solving, analytical, and abstraction skills. As we write code and design a software system, we exercise all of these skills -- in particular: abstraction. Abstractions allow us to build larger and more complex systems without needing bigger and bigger brains. We can reason about the interactions of components at various levels, and only focus on the details germane to the topic at the time.

Operating systems provide various abstractions that we use everyday: various models of how to "do work" (processes and threads), how to exchange information (multi-threaded programming primitives, etc.). We leverage complex data structures from libraries. We wouldn't want to re-write a hash table every time we need dictionary lookup semantics, or drop into assembly to write CAS instructions to re-implement mutexes and spinlocks every time we write a multi-threaded program. So *high five* to abstraction-- you're a good friend to us dumb humans.

Yet... there is a problem. In the perfect world in our cerebral cortex, things fit together nicely, and we gloss over real world needs and various "cross-cutting concerns". As such, there is the famous principal of leaky abstraction, which states that there are no perfect abstractions: all abstractions leak-- i.e. all abstractions expose some part of their underlying implementation. The amount of "leakiness" can be a quality metric. I would suspect that "leakiness" is correlated with complexity, and it is certainly correlated with how many resources are involved in the implementation.

For example, how leaky are mutex synchronization primitives? On the surface they sound simple! You try to acquire the mutex and get blocked until you get what you want. A nice abstraction, very useful. But then there are these pesky details of lock fairness, thread affinity, etc. that complicate the elegant abstraction. They typically rear their head through "cross-cutting concerns" such as performance. And this is key to the issue-- why can't there be a perfect abstraction? Because at some point there is a *finite* set of resources that are going to give a quick dose of "reality" to these "head in the clouds", "flowers and kittens" abstractions. And where abstract, pure concepts meet reality-- finite constraints start to paint a different picture.

So as developers, we deal with this by straddling the fence: we design in the world of abstractions, but maintain some "implementation" knowledge, and then TEST TEST TEST (for performance and other non-functional requirements). And we pick up little best practices about thread affinity, processor cache coherence, scheduling fairness, IO access patterns, etc. that let us meet these goals.

But in my experience with different teams and colleagues, something funny happens when databases are brought into the pictures. They are almost always treated as a black box. "Oh thats the databases job", "oh well the database just sucks", "databases suck". I have encountered a number of people who choose to write their own little mini-databases instead. They have no problem "breaking the abstraction" to think about file locks, optimizing reads/writes to reduce disk head seeks, etc.-- but they hold database technology to some higher standard; they pretend databases are a perfect abstraction!

So yes, databases are very leaky. They are leakier than many other programming tools, and this seems natural given the complexity. But they bring a slew of features, a slew of research, and a lot of production time with them. Thus, we as developers should be interested in understanding this tool, its leakiness, and how to effectively use it, just as we would anything else (O/S, hard disks, etc.).

Where do databases leak the most? Here's a brief list of various technical nuggets about popular database implementations that are important to at least have in the back of your mind when you're designing something with a database:

  • Concurrency is hard!

    • Its multi-threaded programming! Databases aren't magic-- they have to employ the same synchronization as the rest of the world to use a shared memory model of parallelism. Thus read up on isolation levels, read up on optimistic and pessimistic locking, and don't rely on "magic" as an answer for how databases do what they do.

  • Transactional consistency is hard!

    • Understand why ACID properties are important, how they make reasoning about complex interactions simple, and why it is almost always economically impossible to create a robust, complex system without some transaction abstraction

  • Searching for data is hard!

    • Searching problems in computer science make up a major chunk of the literature for the last thirty years. Oh, you put your data into a table, didn't understand what you should index, and now you blame the database for performing slowly!? Understand the difference in clustered and non-clustered indexes. Understand the concept of selectivity, and learn how to evaluate which queries are sucking the life out of your application.

At some point, I'll actually put something technical on this blog :-)

*kicks soap box to the corner of the room*