Wednesday, December 1, 2010

Defender Methods: Neal Gafter is *not* painting a bike shed

I follow the lambda-dev mailing list, and there is a discussion going on regarding defender methods and automatic disambiguation. Defender methods are a language feature that allows some interesting things:
  1. Adding methods to existing interfaces without breaking backwards compatibility
  2. Multiple inheritance of behavior (but not state-- think interfaces with code)
The second item is profound (well, both are), and it will be interesting to see how the community adopts it. The approach taken with defender methods is described as "declaration-site" extension methods, because the syntax appears in the interface declaration itself. For example, imagine adding a "randomElement" method to java.util.Collection. Today, you can't do that without breaking every class that implements Collection. Using a defender method you can write:

interface Collection<T> {
T randomElement() default Collections.randomElement;
...
}

// add the default implementation to Collections
public class Collections {
...
public static <T> T randomElement(Collection<T> c) {
...
}
}
Other languages (e.g. C#) implement use-site extension methods, in which case the owner of Collections doesn't have to change anything. You could write (in C#):

namespace SteveCode.Extensions {
static class MyCollections {
public static T randomElement<T>(this ICollection<T> c) {
...
}
}
}
Then anywhere you state "using SteveCode.Extensions" (i.e. "import" the namespace), you can invoke someCollection.randomElement() and the compiler will translate it into a call to the static method. Ergo, the "user" of the extension method has to indicate that he is using it.

There are some great articles out there discussing the pros/cons of declaration-site vs use-site approaches, so I won't rehash them. See Remi Forax's post and another.

Java is going with the declaration-site style, and one of the benefits is that inheritance still works. I.e., if you augment an interface to add a new method with a "default", and later in a concrete class you wish to specialize this new method, you can override it and consumers will call the specialized method via polymorphism (as expected). This doesn't work with use-site extension methods, as they are just a compiler trick.
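As a rough sketch (the class name is hypothetical, and this uses the same straw-man syntax as the snippet above): a concrete class can specialize the defender, and any caller holding a Collection<T> reference reaches the override through ordinary virtual dispatch.

class UniformList<T> implements Collection<T> {
// Overrides the defender declared on Collection<T>; no "default" clause needed here.
public T randomElement() {
... // e.g. exploit random access instead of the generic default
}
...
}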

So on the lambda-dev list, there has been a discussion over a particular feature in the current specification. Take the following:

public class Impls {
public static void one(Object o) { ... }
public static void two(A a) { ... }
public static void two(B b) { ... }
}

interface A {
void execute() default Impls.one;
}
interface B {
void execute() default Impls.one;
}
If you defined a class MyAB implements A, B, what would you expect to happen? Well, currently the spec allows this via "automatic disambiguation", because both interfaces delegate to the same method/overload. This becomes a problem if the owner of B later changes B's default to something else. Now when you recompile MyAB, it suddenly breaks and you must manually disambiguate by implementing execute inside MyAB.
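Manual disambiguation would look something like this (a hypothetical sketch-- the point is simply that MyAB now has to choose, or write, the behavior itself):

class MyAB implements A, B {
// Resolve the conflict explicitly by picking an implementation (or writing a new one).
public void execute() {
Impls.one(this);
}
}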

Note that if the interfaces took a different overload of the default, for example:

interface A {
void execute() default Impls.two;
}
interface B {
void execute() default Impls.two;
}
Now class MyAB implements A, B would fail right off the bat, the very first time-- because there is no overload in common (A's default resolves to two(A a), while B's resolves to two(B b)). Thus, automatic disambiguation would not apply in this case either (nor would you want it to). Lastly, imagine having:

public class Impls {
public static void one(Object o) { ... }
public static void two(Object o) { ... }
public static void two(A a) { ... }
public static void two(B b) { ... }
}
In this case, I think (and someone please correct me if I'm wrong) that normal overload resolution rules apply and this will compile, invoking two(Object o) in MyAB, but in another class that implements only A, it would call two(A a).

There is also some additional complexity that must live somewhere (there is talk of changing the .class file format to include linking information) so that, in the presence of changing versions of the interface, code compiled against the old version will still work (remain binary compatible).

All in all, there is some complexity to this "automatic disambiguation". Neal Gafter, of Java Puzzlers, Sun Java spec, etc. fame (not in that order), has raised the question of whether this complexity is worth it. One of his proposed alternatives is simply to allow method bodies in the interface itself. This would remove overload resolution on the "default" entirely, forcing the client to disambiguate up front, instead of letting a later change to a defender's default become a (probably unexpected) source-incompatible change. This seems like an entirely reasonable conversation to have on the mailing list, as it's a language detail with potentially expensive impacts. The discussion had been going on for a few days-- and then Reinier Zwitserloot (of Project Lombok, etc. fame) chimed in with:

When is the currently proposed "default method is part of the signature" aspect going to result in a problem? If this is about ideological purity, then I vote we stop painting this bikeshed.

If you don't get the bikeshed reference, check here. In classic Neal fashion, he had a precise (albeit somewhat arrogant) retort, which gave me a good chuckle when I read it:

An interesting reference to Parkinson's *law of triviality*. The basis of Parkinson's law is that people seem to care less about issues they don't understand, no matter how important those issues might be relative to the issues for which they express a preference. "painting this bikeshed" is an *ad hominem* suggestion that the participants in this discussion are really only commenting on those issues simple enough for them to understand.

In this case, however, you also appear to be saying that you don't fully appreciate the issue, and therefore don't see any point in continuing the discussion. Which ironically reinforces the reference to Parkinson's law.

I like Neal Gafter and Reinier Zwitserloot, and am happy that everyone on the dev lists cares enough to take the time to discuss these things for (I believe) the betterment of the Java community. It's not to say there should be unbounded discussion, but I personally found Neal's points useful. Automatic disambiguation does seem to require complexity that outweighs its benefit-- and that's worth discussing.

Steve

Friday, November 19, 2010

Notes on the Synthesis of Form

I just got in Notes on the Synthesis of Form by Christopher Alexander. This is an academic, theoretical treatise on design. It deals with "what is design", etc. It was written in 1962, but the ideas are obviously still relevant today--even though some of the motivating examples now seem a little dated. While it has nothing to do with software architecture (the term didn't even exist back then!), it is a fascinating read for those of us who enjoy some philosophical pondering every once in a while.

In the introduction "The Need for Rationality", the author makes clear his opinion of self-proclaimed "artists":


The modern designer relies more and more on his position as an "artist", on catchwords, personal idiom and intuition--for all of these relieve him of some of the burden of decision, and make his cognitive problems manageable. Driven on his own resources, unable to cope with the complicated information he is supposed to organize, he hides his incompetence in a frenzy of artistic individuality. As his capacity to invent clearly conceived, well-fitting forms is exhausted further, the emphasis on intuition and individuality only grows wilder.

In this atmosphere, the designer's greatest gift, his intuitive ability to organize physical form, is being reduced to nothing by the size of the tasks in front of him, and mocked by the efforts of "artists". What is worse, in an era that badly needs designers with the synthetic grasp of organization of the physical world, the real work has to be done by less gifted engineers, because the designers hide their gift in irresponsible pretension to genius.

(pg. 11)

I always enjoy well-written calls to humility. I wonder who in particular the author had in mind when he wrote this? What's funny is this was written in 1962 and seems no less valid in 2010 :-)

Steve

Tuesday, October 26, 2010

Deadlock two ways (Microsoft SQL Server)

Think of "Duck two-ways" or some other delicious meal. This is as tasty ;-)

If you have done any development with Microsoft SQL Server with even a moderate amount of concurrent processing, then you have probably witnessed the dreaded:
Transaction (Process ID 42) was deadlocked on lock resources with another process and has been chosen as the deadlock victim
This indicates that Microsoft SQL Server has detected a deadlock and has killed one of the processes to eliminate the stalemate. Let's go through two very different deadlock conditions that share the same central theme. I have found both to be extremely common in production.

Intuitively, a deadlock is a "deadly embrace" that occurs when two processes each hold something the other needs. Thus, people are often surprised when they get deadlocks and "they weren't touching the same records!". Unfortunately, in traditional pessimistic locking approaches to concurrency control, other records must be locked to guarantee ACID properties. There are other approaches to concurrency control (MVCC is the one everyone reaches for) that avoid most or all of this. However, MSSQL by default employs a pessimistic approach. Also, I don't mean to sound as if I'm picking on MSSQL-- the principles below hold in a number of database implementations; DB/2 is also particularly nasty with pessimistic-locking-style deadlocks. The two deadlocks presented below were reported to the jTDS forums, and come from real production systems.

Deadlock I


This is the common Unicode "mismatch" problem. It has been covered on blogs a lot, so I am going to use it as a way to describe how to research deadlocks. But if you're already bored, go ahead and skip to Deadlock II.
Here is a table definition (slightly changed for brevity):

CREATE TABLE [PM1](
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[KEYNAME] [varchar](50) NOT NULL,
[VALUE] [varchar](255) NULL,
CONSTRAINT [PK_PM1] PRIMARY KEY CLUSTERED
(
[ID] ASC
))
CREATE NONCLUSTERED INDEX [pmkv_idx] ON [PM1]
(
[KEYNAME] ASC
)
Now we execute multiple, concurrent:
DELETE FROM PM1 WITH (ROWLOCK) WHERE KEYNAME = @P0

And we get a deadlock! Let's familiarize ourselves with some technical details and terms:

Clustered and Non-Clustered Indexes


There are two database structures for this DDL: a primary, clustered index keyed on [ID], and a secondary, non-clustered index keyed on [KEYNAME]. A "clustered" index means that all of the data, the entire record, is stored in the leaf of the index. It is still an index--it is still searchable by [ID]--but when you get to a leaf, the entire data record is physically stored there. This is different from a secondary, or non-clustered, index, in which the leaf contains some kind of reference to the whole data record. In our secondary index, the index is "searchable" by [KEYNAME], but in its leaves the "reference" is actually the primary key value (ID), which can then be used to "search" the primary, clustered index. This is how databases use secondary indexes-- they find matching data with the index, then do a bookmark lookup, which is a join into the clustered index to get the rest of the data.
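For example, with the schema above, a query like the following would typically seek pmkv_idx on KEYNAME and then do a bookmark lookup into the clustered index (PK_PM1) to fetch the VALUE column:

SELECT [VALUE] FROM PM1 WHERE KEYNAME = 'some-key'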

Seeking vs Scanning


There are different actions the engine can perform on these structures. The engine can scan the index, in which case it will start somewhere (often the beginning) and read each record in order. As these indexes are B+-trees, the leaf level is a doubly linked list; thus, from a single leaf node, you can read sibling records in order without having to go "up and down" the index levels. Scanning is expensive, as it is an O(n) operation that degrades as the number of records increases.

The engine can also seek the index, in which case it navigates from the root down to the leaf node (record) containing the key. This is much preferable, as it is an O(log n) operation-- and the base of that log is very high (the index's fan-out, i.e. the number of entries per page). Seeking is why we put indexes on tables-- to take advantage of this "quick" searching.

Getting Information about your deadlock


In Microsoft SQL Server (2005+) there is a useful report that gives you good details about your deadlock. Without this information, it's very difficult to understand what's really happening. You can get this information in two ways:
  • SQL Server Profiler, capturing the Deadlock event
  • Enable Trace Flag 1222
I find it generally useful to enable T1222 as a startup parameter (-T1222) for SQL Server, so that whenever a deadlock occurs, the details are dumped to the SQL Server error log. This is great, and incurs negligible performance overhead.
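If you can't restart the instance with the startup parameter right away, the flag can also be turned on for a running instance (the -1 makes it global):

DBCC TRACEON (1222, -1)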

Trace 1222 Output


Let's look at the 1222 output and see what actually happened. There are two sections to the 1222 output: (1) the processes involved, and (2) the resource (wait-for) graph, which shows which locks are held and which are being requested. Here is an abbreviated 1222 for this deadlock:

<deadlock victim="process3dcab08">
 <process-list>
  <process id="process3dcab08" ...>
   <executionStack>
    <frame ... >DELETE FROM PM1 WITH (ROWLOCK) WHERE KEYNAME = @P0 </frame>
   </executionStack>
   <inputbuf>(@P0 nvarchar(4000))DELETE FROM PM1 WITH (ROWLOCK) WHERE KEYNAME = @P0 </inputbuf>
  </process>
  <process id="process3de1198" >
   ... (same as previous process) ...
  </process>
 </process-list>
 <resource-list>
  <keylock ... objectname="DB1.dbo.PM1" indexname="PM_IDidx" ... mode="X" ... >
   <owner-list>
    <owner id="process3de1198" mode="X"/>
   </owner-list>
   <waiter-list>
    <waiter id="process3dcab08" mode="U" requestType="wait"/>
   </waiter-list>
  </keylock>
  <keylock ... objectname="DB1.dbo.PM1" indexname="PM_IDidx" ... mode="X" ... >
   <owner-list>
    <owner id="process3dcab08" mode="X"/>
   </owner-list>
   <waiter-list>
    <waiter id="process3de1198" mode="U" requestType="wait"/>
   </waiter-list>
  </keylock>
 </resource-list>
</deadlock>

The process-list section describes all of the processes (concurrent executions) involved. Looking at the resource-list, we see that each process is holding an X (exclusive) lock on a key that the other is trying to acquire a U (update) lock on.

The database must hold an X lock before actually modifying a record, so we expect X locks to be involved. However, a U lock is unexpected. Update locks are an "in between" lock, used by the database when scanning for candidate records that may need to be modified. A U lock is acquired before checking a value against the WHERE clause; if the row qualifies, the U is upgraded to an X lock, and if not, it is released. But our WHERE clause has KEYNAME = @P0, which is indexed. We shouldn't be scanning anything-- we should be seeking on the index. When seeking, the database goes exactly to the records that qualify, so X locks are requested to begin with. Thus, we expect to see only X locks from the statements above.

So what happened?


The answer is subtle: look at the input buffer for the delete statement:

<inputbuf>(@P0 nvarchar(4000))DELETE FROM PM1 WITH (ROWLOCK) WHERE KEYNAME = @P0 </inputbuf>

The parameter for the delete statement was declared by the driver as nvarchar, but if you look back at the DDL, you will see that the KEYNAME column is declared as varchar.

The NVARCHAR datatype is a Unicode text datatype, whereas VARCHAR is a non-Unicode, 8-bit character datatype. The engine cannot directly compare an nvarchar value to a varchar value-- it must apply a conversion. That conversion makes the WHERE clause predicate non-sargable, which prevents the optimizer from seeking on it. Instead it scans the whole index, one record at a time, and for each one requests a U lock, converts and checks the value, and releases the U lock. Thus, to delete a single record, every record must be checked under an expensive, conflicting lock. In our case, two concurrent deletes were scanning at the same time and ran into each other.

The solution


So why did the driver do that? Well, this was a Java application, and Java strings are Unicode. By default, the driver binds string parameters to the Unicode type (nvarchar). However, there is a setting on the driver's connection URL that instructs it to bind to the non-Unicode type instead. The key is to ensure that the driver sends parameters with the same types the underlying schema defines. For jTDS the setting is sendStringParametersAsUnicode, and it defaults to TRUE. Thus, if your schema is defined with VARCHAR columns and you use them in indexes, you want this property set to FALSE in your connection URL. Here is an example of a connection URL that forces the driver to send strings as varchar instead of nvarchar:

jdbc:jtds:sqlserver://mssql01:1433/DB1;sendStringParametersAsUnicode=false
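Here is a minimal sketch of using it from Java (host, database, credentials, and the key value are placeholders): with the flag set, setString binds the parameter as varchar, the predicate stays sargable, and the delete seeks rather than scans.

import java.sql.*;

public class DeleteByKeyExample {
    public static void main(String[] args) throws Exception {
        Class.forName("net.sourceforge.jtds.jdbc.Driver");
        Connection con = DriverManager.getConnection(
            "jdbc:jtds:sqlserver://mssql01:1433/DB1;sendStringParametersAsUnicode=false",
            "user", "password");
        // Bound as varchar, matching KEYNAME's declared type in the schema.
        PreparedStatement ps = con.prepareStatement(
            "DELETE FROM PM1 WITH (ROWLOCK) WHERE KEYNAME = ?");
        ps.setString(1, "some-key");
        ps.executeUpdate();
        ps.close();
        con.close();
    }
}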

Deadlock II


This deadlock is a little less common, but interesting, because it also has an "unexpected" wait-for graph in the 1222 output. The schema was jBPM's, which may be familiar. I need to see if someone has already fixed this in jBPM's delivered schema.

Trace 1222 Output


First, let's look at the 1222 trace:

<deadlock victim="process3de0868">
 <process-list>
  <process id="process3deb08" ...>
   ...
   <inputbuf>(@P0 bigint,@P1 int)delete from JBPM4_SWIMLANE where DBID_= @P0 and DBVERSION_= @P1</inputbuf>
  </process>
  <process id="process3de0868" ...>
   ... same as other process ...
  </process>
 </process-list>
 <resource-list>
  <keylock ... objectname="dbo.JBPM4_TASK" indexname="PK__JBPM4_TASK__2BFE89A6" mode="X" ...>
   <owner-list>
    <owner id="process3deb08" mode="X"/>
   </owner-list>
   <waiter-list>
    <waiter id="process3de0868" mode="S" requestType="wait"/>
   </waiter-list>
  </keylock>
  <keylock ... objectname="dbo.JBPM4_TASK" indexname="PK__JBPM4_TASK__2BFE89A6" mode="X" ...>
   <owner-list>
    <owner id="process3de0868" mode="X"/>
   </owner-list>
   <waiter-list>
    <waiter id="process3deb08" mode="S" requestType="wait"/>
   </waiter-list>
  </keylock>
 </resource-list>
</deadlock>

So we have two workers executing a delete by primary key... and they're deadlocking on S locks!? An S (shared) lock is used when the database needs a consistent read on something. Thus, by default, when you execute SELECT statements, shared locks are issued under the covers to ensure read consistency (w.r.t. the isolation level you're running at).

So what happened?


There is a hint in the wait-for graph that shows the problem: look at the keylock that is being contended-- the indexname is PK__JBPM4_TASK__2BFE89A6. This is an auto-generated name for the primary, clustered index of the JBPM4_TASK table. We're deleting from the SWIMLANE table, so why would there be any locks on the TASK table? The TASK table has a foreign key reference (the swimlane_ column) to the SWIMLANE table. Thus, when you delete a SWIMLANE record, the database must ensure that there are no records referring to it; if there are, the delete cannot proceed, because it would violate the foreign key constraint. To do this check, the database needs to find records in the JBPM4_TASK table WHERE swimlane_ = the id of the swimlane being deleted. This read is done using S locks to ensure consistency. Unfortunately, the JBPM4_TASK table does NOT have a secondary index on the swimlane_ column. Thus, the only option the database has is to scan the entire table via the primary, clustered index. For each record, an S lock must be acquired, the record evaluated against the WHERE clause, and the S lock released. Each one of these is a deadlock opportunity if there are concurrent operations touching the TASK table.

The Solution


Adding an index on the swimlane_ column of the TASK table reduces those n deadlock opportunities to 1, which is a significant improvement-- not to mention the performance improvement.
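A minimal sketch of the fix (the index name is my own invention; the column name follows the jBPM schema discussed above):

CREATE NONCLUSTERED INDEX [IDX_JBPM4_TASK_SWIMLANE] ON [JBPM4_TASK] ([SWIMLANE_] ASC)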

The Overall Theme


While these are very different deadlocks, the theme is the same: unintended scans. Scans are not only a performance problem but a deadlock problem, as they increase the number of times you request conflicting locks while already holding locks. So take care when considering which indexes to create-- it's a tricky problem with many factors.

Steve

Monday, October 11, 2010

Some JIRA / GreenHopper Love

This will be another brief, non-technical, uninformative post to rant... but in a positive way! I actually will post something informative and useful one day ;-)

A team member and I have been using JIRA and recently started using the GreenHopper plug-in. The plug-in advertises that it is "agile" focused. I have always been extremely satisfied with Atlassian products, but this is something special. They really have done a great job in usability with GreenHopper.
  • The "boards" present a nice, intuitive, usable visualization of a part of the project.
  • Every aspect of the "boards" visualization (color coding, progress bars, in-place editing) reinforces with high affordance the functionality of that view.
  • The tool gets out of the way as much as possible in terms of navigation and functionality. It's intuitive. I find that I spend 95% of my time just on the Task board. I never need to leave.
  • Using the "boards" is fun!


For example, from the "task" board, I can open bugs, assign them to other people, begin progress, etc. This is the "specialized" view that I, a developer, use to see my tasks and get an at-a-glance view of the whole sprint.



Here are some noteworthy features of this visualization:
  • There are three columns for this sprint: To Do, In Progress, Done. You just drag a card between the columns to start progress or resolve as done. Just as you would with a card on a cork board (intuitive metaphor).
  • You can switch projects and sprints/versions by clicking the down arrows next to the headers
  • You can add bugs, features, etc. for this sprint just by clicking the button for that issue (right under the filter icon).
  • You can add comments to an issue quickly, or bring up a lightbox with the entire JIRA issue (see below).
  • There is a nice progress bar that shows the overall progress for the entire sprint.




When a new sprint comes around you can check out the "planning" board, where you can drag and drop cards to different releases or sprints. Also, there is a "chart" board which summarizes various charts (burndown rate, etc.), and a "released" board which provides metrics on completed sprints and releases.

This is just a little bit of why I like GreenHopper. I have just been so pleased with it that I wanted to share. I have used Bugzilla, plain JIRA, and Team Foundation in the past. Team Foundation, I know, has much of the same (if not more) reporting as GreenHopper, but I never saw a visualization for me, the developer, that was as useful (defined by power-to-weight ratio) as what GreenHopper provides in the planning boards. It's the subtle simplicity and usability that I think make GreenHopper so great-- not just sheer functionality.

If anyone actually reads this, I would be interested to hear others' opinions about tools-- in particular other "killer features" like GreenHopper's planning boards.

Steve

Monday, September 20, 2010

Naming Matters

This title is kind of a two-for: "Naming matters" as in matters pertaining to naming and naming matters as gee golly-- names really do make a difference! So hurray for homonyms and making a neat opening to what is likely to be a boring post :-) (Have I hooked you to read further? dratz...)

API design "theory" is curious, because it seems to take a back seat in the academic literature, and yet the economic impact of bad API design is so great to the "problem" of software development. I guess the Software Engineering folk are fighting the good fight with code quality metrics, etc., but this is a hard topic, that I had never considered throughout my entire undergraduate career.

Here's an example of an API design mistake in my opinion (albeit a simple one). In Java, there are the following constants defined:
  • Integer.MIN_VALUE
  • Long.MIN_VALUE
  • etc.

These define the lowest value that can be represented by the datatype. Thus, for example, Integer.MIN_VALUE is -2147483648, given that an int is 4 bytes and uses a two's complement representation.

These are used all over the place as sentinels. For example, a typical implementation of Dijkstra's shortest path algorithm initializes every node to a shortest distance of Integer.MAX_VALUE, and as the algorithm progresses, any value will (likely) be less than MAX_VALUE. Thus, it's a natural way to construct the algorithm without having to code a bunch of fragile "special cases" (adding to your cyclomatic complexity).
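A minimal sketch of that initialization step (variable names are mine):

// Every node starts at the MAX_VALUE sentinel ("infinity"); the source starts at 0.
// When relaxing edges, guard the addition so the sentinel doesn't overflow.
int[] dist = new int[nodeCount];
java.util.Arrays.fill(dist, Integer.MAX_VALUE);
dist[source] = 0;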

So what do you think Double.MIN_VALUE returns?

Well, as a bug I coded last week demonstrates, it does *not* return the minimum value of the range that can be represented by the double data type. That would be ludicrous (this is me being sarcastic).

No, instead it returns the smallest positive value. You need to use Double.NEGATIVE_INFINITY for the smallest, *cough* minimum *cough* value (or -Double.MAX_VALUE if you want the smallest finite one).
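A quick demonstration of the pitfall:

System.out.println(Double.MIN_VALUE);          // 4.9E-324 -- the smallest POSITIVE double
System.out.println(Double.MIN_VALUE > 0);      // true
System.out.println(Double.NEGATIVE_INFINITY);  // -Infinity
System.out.println(-Double.MAX_VALUE);         // -1.7976931348623157E308, the most negative finite double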

Note that in C#/.NET Double.MinValue is the minimum value (i.e. negative number). I wonder what it is in limits.h...

So what's up, Java?! I wonder how many man-hours have been wasted on this pitfall since Java's inception? Can we ask Guy Steele and Bill Joy for a check? Just kidding...

Steve

Saturday, May 8, 2010

Save the commentary, please

I am working on a thesis right now on optimizing data structures and algorithms in database systems. I am doing my implementation for the first part in MySQL, and will be augmenting one of the storage engines (maybe InnoDB, maybe something else) to test out some of the ideas in my paper.

Anyway, I picked up Understanding MySQL Internals to use as a reference for some of the major components of the software. The technical information is pretty good, and it's nice to get a discourse on such a large code base.

However, something bothers me about the book. Riddled throughout the technical details are little dollops of over-embellished praise for Monty Widenius, the creator of MySQL. Now, don't get me wrong-- I go a bit ga-ga for rockstar developers like Monty, Rod Johnson, Crazy Bob Lee, etc., but not in a technical book. It's just awkward to be reading about architectural decisions for the threading model in MySQL and then read this:

As far as the programming skills were concerned, they were not lacking at all. Just as a good rider becomes one with the horse, Monty had become one with the computer. It pained him to see system resources wasted. He felt confident enough to be able to write virtually bug-free code.
(p. 109)

Hyperbole such as "Monty had become one with the computer" can be excluded from technical books in my opinion.

Here's another example of the Monty "love-fest":

...perhaps building on his God-given talent, Monty developed a habit and ability to write very efficient code naturally. He also developed, or perhaps was gifted from the start, with an unusually acute vision of what needed to be done to the code to make it useful in future development--without knowing in advance much detail about what that future development would be
(p. 1)

He could've removed this entire paragraph and we would know nothing less, except that the author clearly wants to be on Monty's Christmas card list. So, technical book authors: save the commentary, please. ...or maybe just keep it out of the main text and include an optional, opinion-based companion workbook!

For example, accompanying the section on Threads vs Forked Processes, there could be a multiple choice question:

Why did Monty Widenius choose a thread model instead of a forked process model for MySQL?
  1. Large amount of shared data between workers
  2. Greater overhead in context switching processes over threads
  3. Greater overhead in creating a forked process over creating a thread
  4. Monty is a man-god, who is incapable of writing bugs in source code


Steve

Thursday, April 29, 2010

Wow...erm...your database is leaking

It's no secret that some of the traits that go into making good software developers are problem-solving, analytical, and abstraction skills. As we write code and design a software system, we exercise all of these skills-- in particular, abstraction. Abstractions allow us to build larger and more complex systems without needing bigger and bigger brains. We can reason about the interactions of components at various levels, and focus only on the details germane to the topic at hand.

Operating systems provide various abstractions that we use everyday: various models of how to "do work" (processes and threads), how to exchange information (multi-threaded programming primitives, etc.). We leverage complex data structures from libraries. We wouldn't want to re-write a hash table every time we need dictionary lookup semantics, or drop into assembly to write CAS instructions to re-implement mutexes and spinlocks every time we write a multi-threaded program. So *high five* to abstraction-- you're a good friend to us dumb humans.

Yet... there is a problem. In the perfect world inside our cerebral cortex, things fit together nicely, and we gloss over real-world needs and various "cross-cutting concerns". Hence the famous principle of leaky abstractions, which states that there are no perfect abstractions: all abstractions leak-- i.e., all abstractions expose some part of their underlying implementation. The amount of "leakiness" can be a quality metric. I would suspect that "leakiness" is correlated with complexity, and it is certainly correlated with how many resources are involved in the implementation.

For example, how leaky are mutex synchronization primitives? On the surface they sound simple! You try to acquire the mutex and get blocked until you get what you want. A nice abstraction, very useful. But then there are these pesky details of lock fairness, thread affinity, etc. that complicate the elegant abstraction. They typically rear their heads through "cross-cutting concerns" such as performance. And this is key to the issue-- why can't there be a perfect abstraction? Because at some point there is a *finite* set of resources that will give a quick dose of "reality" to these "head in the clouds", "flowers and kittens" abstractions. And where abstract, pure concepts meet reality, finite constraints start to paint a different picture.

So as developers, we deal with this by straddling the fence: we design in the world of abstractions, but maintain some "implementation" knowledge, and then TEST TEST TEST (for performance and other non-functional requirements). And we pick up little best practices about thread affinity, processor cache coherence, scheduling fairness, IO access patterns, etc. that let us meet these goals.

But in my experience with different teams and colleagues, something funny happens when databases are brought into the picture. They are almost always treated as a black box. "Oh, that's the database's job", "oh well, the database just sucks", "databases suck". I have encountered a number of people who choose to write their own little mini-databases instead. They have no problem "breaking the abstraction" to think about file locks, optimizing reads/writes to reduce disk head seeks, etc.-- but they hold database technology to some higher standard; they pretend databases are a perfect abstraction!

So yes, databases are very leaky. They are leakier than many other programming tools, and this seems natural given the complexity. But they bring a slew of features, a slew of research, and a lot of production time with them. Thus, we as developers should be interested in understanding this tool, its leakiness, and how to effectively use it, just as we would anything else (O/S, hard disks, etc.).

Where do databases leak the most? Here's a brief list of various technical nuggets about popular database implementations that are important to at least have in the back of your mind when you're designing something with a database:

  • Concurrency is hard!

    • It's multi-threaded programming! Databases aren't magic-- they have to employ the same synchronization as the rest of the world to use a shared memory model of parallelism. So read up on isolation levels, read up on optimistic and pessimistic locking, and don't rely on "magic" as an answer for how databases do what they do.

  • Transactional consistency is hard!

    • Understand why ACID properties are important, how they make reasoning about complex interactions simple, and why it is almost always economically impossible to create a robust, complex system without some transaction abstraction.

  • Searching for data is hard!

    • Searching problems make up a major chunk of the computer science literature of the last thirty years. Oh, you put your data into a table, didn't understand what you should index, and now you blame the database for performing slowly!? Understand the difference between clustered and non-clustered indexes. Understand the concept of selectivity, and learn how to evaluate which queries are sucking the life out of your application.



At some point, I'll actually put something technical on this blog :-)

*kicks soap box to the corner of the room*

Steve

Tuesday, January 5, 2010

Spring Reading List

Wow...I really haven't posted regularly, have I? I'd like to catalog some things for my own reference, and to come back in years and chuckle at my musings and naiveté. This leads naturally into a topic: knowledge and learning.

I am an avid reader. I don't really enjoy much fiction, but I love computer science and technology books. I have a bit of an ...erm... addiction (some would say) to Amazon Marketplace. If we take a trip in the 'way back' machine to 2001ish, we will see a younger (still in high school) Steve, delighted to receive his first Amazon Marketplace shipment: Advanced Visual Basic 6: Power Techniques for Everyday Programs and Effective Visual Basic: How to Improve your VB/COM+ Applications. Once you have stopped laughing at the thought of "Effective" or "Advanced" Visual Basic, please continue.

Here are two of my bookshelves now. I can't honestly claim that I have read all of them cover to cover, but I always read at least a few chapters of each, and note which sections I want to come back to. Many, though, I have read all the way through-- for example, if you haven't read Brian Goetz's Java Concurrency in Practice, then you are missing out!

Here are some of my favorite books:

  • Effective Java by Josh Bloch. A collection of pragmatic patterns, and more importantly sound reasoning as to why they are recommended.
  • Java Concurrency in Practice - excellent overview of the two most important aspects of multi-threaded programming: memory consistency and mutual exclusion. People seem to forget about the first one. Once you finish this book, please go read William Pugh's information on JSR 133. After that if you still don't understand what volatile really is, then repeat.
  • Implementation Patterns by Kent Beck. He describes the book as a companion to Fowler's canonical "Refactoring" text. Implementation Patterns focuses on the 'small' decisions that we make every day as they pertain to design, code organization, and getting all of the '-ilities' working for us. Reading this was like re-living the first few years of professional Java development-- Kent just did a good job 'formalizing' a vocabulary with which to discuss and evaluate the various 'design pressures' that enter our minds as we code.
  • Real World Java EE Patterns by JEE guru Adam Bien. I found this book wonderful. In particular, his trade off analysis that accompanies each pattern is an excellent way to learn how he reasons about design.
  • The Pragmatic Programmer - this is a famous book and certainly worth the read. I think it's a good litmus test for oneself. I read this once before I started my career and found that I learned a ton; I thought all of the little gems of advice were new and useful. I re-read (or skimmed) it recently, and found myself going "yep, I remember learning that the hard way". I kind of view this as a way to gauge the efficiency of the environment within which you work. I have been lucky to work in an environment that properly incentivized writing good software, so I learned these lessons quickly and efficiently as part of my day-to-day work. If you work in a lazy environment, you may never learn these lessons.
  • Beautiful Code - this collection of essays is a great read. Its goal (and it succeeds) is to impart how great developers think. These lessons help 'shape' the way in which we approach problem solving. The entire "Beautiful" series is wonderful. I have read them all, except the new "Beautiful Security".
  • Java Generics - I am not recommending this book, unless you are (a) curious about some of the 'exotic' features of generics, (b) enjoy the theoretical aspects of language design (reification, contra/covariance, etc.) , or (c) want to understand the reasoning behind the design decisions that were made by the expert group.
  • SQL Server 2008 Query Performance Tuning Distilled - this is my favorite pick for SQL Server optimization. Of course Delaney's Inside SQL Server volume on the matter is great too, but Distilled is a little more bang for your buck-- I found more useful information crammed into a smaller text. Once you read this (and all of the Inside SQL Server series), you can graduate to Craig Freedman's blog.
Whew... well I know I am forgetting some. I should probably edit this later after I return home to check any that I may have missed. Another neat fact I learned in doing this is that I have had 137 separate orders with Amazon since 2004... most of which had multiple books :/ I need to seek group help or something...

Steve