Friday, December 14, 2012

Why and how we use Spring

We brought on some new team members recently and I couldn't find a good intro to Spring that went through the motivation of why one might use Spring. Thus, I created this presentation; I don't know if anyone else will find it valuable.

I wanted something that went through the trade-off discussion of the various options for dependency management (service locator, dependency injection) and then showed some practical uses of Spring and mentioned some pitfalls that I've encountered. This was meant to be a practical onboarding to Spring in Java.
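
If you have never seen the two styles side by side, here is a rough sketch of the trade-off. The OrderService, PaymentGateway, Order, and ServiceLocator names below are made up for illustration -- they are not from our codebase or from the presentation:

    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Service;

    // Minimal stubs so the sketch compiles -- the real types would live elsewhere.
    interface PaymentGateway { void charge(double amount); }
    class Order { double total; }
    class ServiceLocator {
        static <T> T lookup(Class<T> type) { /* fetch from some global registry */ return null; }
    }

    // Service-locator style: the class reaches out and finds its own dependency.
    // The lookup is buried inside the class, so a unit test has to bootstrap (or
    // stub out) the locator before it can touch this class at all.
    class LocatorOrderService {
        private final PaymentGateway gateway = ServiceLocator.lookup(PaymentGateway.class);

        void checkout(Order order) {
            gateway.charge(order.total);
        }
    }

    // Dependency-injection style: the dependency is handed in and Spring does the
    // wiring. A unit test just passes a fake PaymentGateway to the constructor.
    @Service
    class InjectedOrderService {
        private final PaymentGateway gateway;

        @Autowired
        InjectedOrderService(PaymentGateway gateway) {
            this.gateway = gateway;
        }

        void checkout(Order order) {
            gateway.charge(order.total);
        }
    }
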
Steve

Monday, June 25, 2012

The Manufacturing of Software

Among non-technologists, I think there is a common belief -- a wishful thought -- that building software is like any other manufacturing process. You just need to design the process, then get the widgets on the assembly line to execute it -- and voilà! If a widget quits, you just buy a new widget at the widget store (Monster.com).

Thankfully, most of the truly remarkable software projects and software teams of the last decade have debunked this myth and exposed the lie. The Agile Manifesto itself values individuals and interactions over processes and tools, and that is really a statement against this manufacturing projection.

I ran across this wonderful quote while reading Eric Evans's Domain-Driven Design:

Manufacturing is a popular metaphor for software development. One inference from this metaphor: highly skilled engineers design; less skilled laborers assemble the products. This metaphor has messed up a lot of projects for one simple reason -- software development is all design.

Sunday, April 8, 2012

Bye bye Eclipse -- my disorganized, irresponsible little brother

I have been an Eclipse fan-boy for the last 7 years. I have participated in a number of open source Eclipse plugin projects. I have evangelized it to others. I also have a personal connection to it, as my father was one of the project principals when it was still an IBM project. I have apologized for Eclipse a lot:
  • The auto-completion is slow and clunky and just doesn't "feel" right.
  • The static imports and favorites always seem to get in my way.
  • The Maven support (while dramatically improved) is still sometimes clunky.
  • The mismatch between hierarchical Maven multi-module projects and Eclipse's flat, non-hierarchical workspace is awkward.
  • The plugins, while extensive, are often not polished. I wanted to literally break my laptop in half after trying to write some Groovy code -- null pointers flowing up to popups, etc.
  • The EGit support, while improving, is just developing slowly and still has lots of bugs.
  • There is a lack of consistency across plugins -- they are obviously developed in isolation and lack a cohesive overall vision.
  • At least once a week I still have to hear about or help a team member who is stuck in some kind of Eclipse hell where builds are firing continuously or something else is going wrong.

I've always loved the dream of Eclipse, as it is a simulacrum of why I love open source software. I like the bazaar over the cathedral. But at the end of the day, I need to get work done -- not fight with making Eclipse's plugins all play nicely.

An anecdote of my point: last project we were trying to do some Tapestry5 development using Eclipse's WTP + Tomcat. Tapestry has some fancy classloaders so that it can hot-load Java class file changes without restarting/redeploying. We also wanted to use Tomcat7. Well, WTP has the ability to host the Tomcat instance from the workspace, which works well with Tapestry's classloader tricks. Unfortunately, between the last version of WTP and Tomcat7 it was broken. The only person who works on the Tomcat connector for WTP is a guy named Larry Issacs who has a full-time job at SAP. So it might get fixed when he has time...some day. I ended up patching the jar and distributing it to our team myself as a hack.

At the end of the day, I just need to get work done. And so far, IntelliJ has been an improvement. I still love Eclipse, but it needs more resources or more vision or more something to compete well. It's not the big things -- it's the little things. The overall cohesiveness of IntelliJ is what I find refreshing.

But hey, it could be worse -- it could be Visual Studio *giggle*. I love talking with these Visual Studio "fan boys" who have never used any other IDE -- or even ReSharper. I was talking with a "chief architect evangelist" from Microsoft (a comical title) and asked if the utterly absurd feature anemia in Visual Studio was indicative of:

  • a lack of imagination or motivation by the development staff to investigate more productive ways to write software or
  • a cultural/systemic problem at Microsoft where the developers using the tools could not communicate with the product management who drives the feature sets.
He asked me for specific examples of features. I said "go look at the ReSharper feature list -- specifically Refactorings". He literally said: "Why would you want to refactor anything?".

In the most patronizing tone I could muster I suggested that he go read the chapter on refactoring in the Microsoft Press book titled "Code Complete".

*sigh*

Steve

Monday, April 2, 2012

Hibernate4 SessionFactoryBuilder gotcha

Just upgraded one of our projects to Spring 3.1.1 and Hibernate 4.1.1 today and ran into a small problem that ate up some time.

Symptom:

It appeared as if all of my hibernate <property>'s in hibernate.cfg.xml were suddenly not being picked up -- they were just silently being ignored.

So, for example, for our unit tests we run Derby with hibernate.hbm2ddl.auto=create so that the schema is generated dynamically each time. In Hibernate 3.6 we used two files: hibernate.cfg.xml (packaged in the jar, not for customers) and hibernate.local.cfg.xml (placed in the customer's etc directory on the classpath for support/production-time overrides, etc.).

Well in migrating to Hibernate4 I took advantage of Spring's new hibernate4.LocalSessionFactoryBuilder class to help build my metadata and I got rid of my hibernate.cfg.xml entirely.

I used it like so:

    @Bean
    public SessionFactory sessionFactory() {
        return new LocalSessionFactoryBuilder(dataConfig.dataSource())
            .scanPackages(DatabasePackage)
            .addPackage(DatabasePackage)
            .addProperties(dataConfig.hibernateProperties())
            .addResource("hibernate.local.cfg.xml")
            .buildSessionFactory();
    }
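
As an aside, dataConfig.hibernateProperties() in that snippet is just a bean that builds a java.util.Properties with our base Hibernate settings -- something roughly like the following (the property values here are illustrative, not our real config):

    @Bean
    public Properties hibernateProperties() {
        Properties props = new Properties();
        // Base settings shared everywhere; per-environment overrides (like
        // hibernate.hbm2ddl.auto=create for the Derby-backed tests) were meant to
        // come from hibernate.local.cfg.xml.
        props.setProperty("hibernate.dialect", "org.hibernate.dialect.DerbyTenSevenDialect");
        props.setProperty("hibernate.show_sql", "false");
        return props;
    }
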
And I had a hibernate.local.cfg.xml in my src/test/resources that looked like:
<hibernate-configuration>
  <session-factory>
    <property name="hibernate.hbm2ddl.auto">create</property>
  </session-factory>
</hibernate-configuration>

But it wasn't being picked up! I tried adding mappings into it and those were picked up. Finally, after digging through the Hibernate source, I realized the problem. Adding hibernate.cfg.xml's through addResource only processes class mappings -- it ignores properties, listeners, etc. However, you can add hibernate.cfg.xml's through the .configure method, and those are still processed as they were before Hibernate4.

    @Bean
    public SessionFactory sessionFactory() {
        return new LocalSessionFactoryBuilder(dataConfig.dataSource())
            .scanPackages(DatabasePackage)
            .addPackage(DatabasePackage)
            .addProperties(dataConfig.hibernateProperties())
            .configure("hibernate.local.cfg.xml")
            .buildSessionFactory();
    }

I couldn't find this well documented anywhere, nor was it specifically mentioned in the migration guide as a consideration. They do mention that event listeners are no longer honored, but don't mention properties.

No big deal -- just lost some time :(



Steve

Thursday, January 12, 2012

Multi-tier application + database deadlock or why databases aren't queues (part 1)

Databases aren't queues.

And despite the ubiquitous presence of queuing technology out there (ActiveMQ, MSMQ, MSSQL Service Broker, Oracle Advanced Queuing) there are plenty of times when we ask our relational brethren to pretend to be queues.  This is the story of one such folly, and along the way, we'll delve into some interesting sub-plots of deadlocks, lock escalation, execution plans, and covering indexes, oh my!  Hopefully we'll laugh, we'll cry, and get the bad guy in the end (turned out I was the bad guy).

This is part one of a multi-part series describing the whole saga. In this part, I lay out the problem, the initial symptom, and the tools and commands I used to figure out what was going wrong.

And so it starts...

I'm going to set the stage for our discussion, introduce you to the problem, and establish the characters involved in our tragedy. Let's say that this system organizes music CDs into labeled buckets. A CD can only be in one bucket at a time, and the bucket tracks at an aggregate level how many CDs are contained within it (e.g. bucket "size"). You can visualize having a stack of CDs and two buckets: "good CDs" and "bad CDs". Every once in a while you decide that you don't like your bucket choices, and you want to redistribute the CDs into new buckets -- perhaps by decade: "1980s music", "1990s music", "all other (inferior) music". Later you might change your mind again and come up with a new way to organize your CDs, and so on. We will call each "set" of buckets a "generation". So at generation zero you had two buckets, "good CDs" and "bad CDs"; at generation one you had "1980s music", "1990s music", and so on. The generation always increases over time as you redistribute your CDs from the previous generation's buckets to the next generation's buckets.

Lastly, while I might have my music collection organized in some bucket scheme, perhaps my friend Jerry has his own collection and his own bucket scheme. So entire sets of buckets over generations can be grouped into music collections. Collections are completely independent: Jerry and I don't share CDs nor do we share buckets.  We just happen to be able to use the same system to manage our music collections.

So we have:
  • CDs -- the things which are contained within buckets, that we redistribute for every new generation
  • Buckets -- the organizational units, grouped by generation, which contain CDs. Each bucket has a sticky note on it with the number of CDs currently in the bucket.
  • Generation -- the set of buckets at a particular point in time.
  • Collection -- independent set of CDs and buckets
Even while we're redistributing the CDs from one generation's buckets to the next, a CD is only in one bucket at a time. Visualize physically moving the CD from "good CDs" (generation 0) to "1980s music" (generation 1).

NOTE: Our actual system has nothing to do with CDs and buckets-- I just found it easier to map the system into this easy to visualize metaphor.

In this system we have millions of CDs, thousands of buckets, and lots of CPUs moving CDs from the buckets of one generation to the next (parallel, but not distributed). The size of each bucket must be consistent at any point in time.

So assume the database model looks something like:
  • Buckets Table
    • bucketId (e.g. 1,2,3,4) - PRIMARY KEY CLUSTERED
    • name (e.g. 80s music, 90s music)
    • generation (e.g. 0, 1, 2)
    • size (e.g. 4323, 122)
    • collectionId (e.g. "Steves Music Collection") - NON-CLUSTERED INDEX
  • Cds Table
    • cdId (e.g. 1,2,3,4) - PRIMARY KEY CLUSTERED
    • name (e.g. "Modest Mouse - Moon and Antarctica", "Interpol - Antics")
    • bucketId (e.g. 1,2, etc. foreign key to the Bucket table) - NON-CLUSTERED INDEX
Note that both tables are clustered by their primary keys-- this means that the actual record data itself is stored in the leaf nodes of the primary index.  I.e. the table itself is an index.  In addition, Buckets can be looked up by "music collection" without scanning (see the secondary, non-clustered index on collectionId), and Cds can be looked up by bucketId without scanning (see the secondary, non-clustered index on Cds.bucketId).
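
Since the redistribution code goes through Hibernate (you'll see that in the stack traces later), the mapping for these two tables would look roughly like the sketch below. This is purely illustrative and from memory -- the names and annotations are not our real entity classes:

    import javax.persistence.*;

    @Entity
    @Table(name = "Buckets")
    class Bucket {
        @Id
        @Column(name = "bucketId")
        private long bucketId;          // clustered primary key

        private String name;            // e.g. "80s music"
        private int generation;         // e.g. 0, 1, 2
        private long size;              // running count of CDs in this bucket

        @Column(name = "collectionId")
        private String collectionId;    // covered by the non-clustered index
    }

    @Entity
    @Table(name = "Cds")
    class Cd {
        @Id
        @Column(name = "cdId")
        private long cdId;              // clustered primary key

        private String name;            // e.g. "Interpol - Antics"

        @ManyToOne(fetch = FetchType.LAZY)
        @JoinColumn(name = "bucketId")  // FK to Buckets; non-clustered index
        private Bucket bucket;
    }
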

The algorithm

So I wrote the redistribution process with a few design goals: (1) it needed to work online -- I could concurrently add new CDs into the current generation's buckets while redistributing. (2) I could always locate a CD -- i.e. I could never falsely report that some CD was missing just because I happened to search during a redistribution phase. (3) If we interrupt the redistribution process, we can resume it later. (4) It needed to be parallel. I wanted to accomplish (1) and (2) with bounded blocking time: whatever blocking work I needed to do should be as short as possible to increase concurrency.

I used a simple concurrency abstraction that hosted a pool of workers who shared a supplier of work. The supplier of work would keep giving "chunks" of items to move from one bucket to another. We only redistribute a single music collection at a time. The supplier was shared by all of the workers, but it was synchronized for safe multi-threaded access.


The algorithm for each worker is like:
(I)   Get next chunk of work
(II)  For each CD decide the new generation bucket in which it belongs
        (accumulating the size deltas for old buckets and new buckets)
(III) Begin database transaction
(IV)   Flush accumulated size deltas for buckets
(V)    Flush foreign key updates for CDs to put them in new buckets
(VI)  Commit database transaction
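
In Java terms, the shape of it is roughly the following. This is a stripped-down sketch -- the real classes have different names and considerably more going on (error handling, metrics, resumability):

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical wrapper around the single reader connection / "work queue" SELECT.
    interface ChunkReader {
        List<Long> readNextChunk();   // null when there is nothing left to move
    }

    // Hands out chunks of CD ids to move. It is shared by every worker, so access
    // to the underlying reader is synchronized.
    class ChunkSupplier {
        private final ChunkReader reader;

        ChunkSupplier(ChunkReader reader) { this.reader = reader; }

        // Called from worker threads; returns null when the collection is exhausted.
        synchronized List<Long> nextChunk() {
            return reader.readNextChunk();
        }
    }

    class BucketRedistributor {
        void redistribute(final ChunkSupplier supplier, int workerCount) {
            ExecutorService pool = Executors.newFixedThreadPool(workerCount);
            for (int i = 0; i < workerCount; i++) {
                pool.submit(new Runnable() {
                    public void run() {
                        List<Long> chunk;
                        while ((chunk = supplier.nextChunk()) != null) {
                            processChunk(chunk);   // steps (II) through (VI) above
                        }
                    }
                });
            }
            pool.shutdown();
        }

        void processChunk(List<Long> cdIds) { /* decide new buckets, flush in one transaction */ }
    }
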


Each worker would be given a chunk of CDs for the current music collection being redistributed (I). The worker would do some work to decide which bucket in the new generation should get each CD (II), accumulating the deltas for the counts: decrementing the count for the original bucket and incrementing the count for the new bucket. Then the worker would flush (IV) these deltas in updates like UPDATE Buckets SET size = size + 123 WHERE bucketId = 1. After the size updates were flushed, it would then flush (V) all of the individual updates to the CDs' foreign key fields so they refer to the new generation's buckets, like UPDATE Cds SET bucketId = 123 WHERE cdId = 101. These two operations happen in the same database transaction.
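
Concretely, the flush step looks something like the following over plain JDBC. We actually go through Hibernate, so this is just an illustration of the statements that end up on the wire; the method and map names are made up:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.Map;

    // Flush one worker's accumulated changes in a single transaction (steps III-VI).
    // bucketSizeDeltas: bucketId -> net change in size (may be negative).
    // newBucketByCd:    cdId -> the bucketId it should move to.
    void flushChunk(Connection conn, Map<Long, Long> bucketSizeDeltas,
                    Map<Long, Long> newBucketByCd) throws Exception {
        conn.setAutoCommit(false);                                         // (III) begin transaction
        try {
            PreparedStatement sizeUpdate =
                conn.prepareStatement("UPDATE Buckets SET size = size + ? WHERE bucketId = ?");
            for (Map.Entry<Long, Long> e : bucketSizeDeltas.entrySet()) {  // (IV) bucket sizes
                sizeUpdate.setLong(1, e.getValue());
                sizeUpdate.setLong(2, e.getKey());
                sizeUpdate.addBatch();
            }
            sizeUpdate.executeBatch();

            PreparedStatement cdUpdate =
                conn.prepareStatement("UPDATE Cds SET bucketId = ? WHERE cdId = ?");
            for (Map.Entry<Long, Long> e : newBucketByCd.entrySet()) {     // (V) move the CDs
                cdUpdate.setLong(1, e.getValue());
                cdUpdate.setLong(2, e.getKey());
                cdUpdate.addBatch();
            }
            cdUpdate.executeBatch();

            conn.commit();                                                 // (VI) commit
        } catch (Exception ex) {
            conn.rollback();
            throw ex;
        }
    }
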

The supplier that gives work to the workers is a typical "queue"-like SELECT query -- we want to iterate over all of the items in the music collection in the old generation. This happens in a separate connection and a separate database transaction from the workers (discussed later). Each next chunk is read on whichever worker thread asks for it (with thread-safe synchronization); this separate "reader" connection doesn't have its own thread or anything.

First sign of trouble - complete stand still

So we were doing some large-volume testing on not-so-fast hardware when suddenly...the system just came to a halt. We seemed to be in the middle of moving CDs to new buckets, and the workers just stopped making progress.
Finding out what the Java application was doing
So the first step was to see what the Java application was doing:

c:\>jps
1234 BucketRedistributionMain
3456 jps
c:\>jstack 1234 > threaddump.out


jps finds the Java process id, and then jstack outputs a stack trace for each of the threads in the Java program. Both jps and jstack are included in the JDK.

The resulting stack trace showed that all workers were waiting in socketRead to complete the database update to flush the bucket size updates (step IV above).

Here is the partial stack trace for one of the workers (some uninteresting frames omitted for brevity):

"container-lowpool-3" daemon prio=2 tid=0x0000000007b0e000 nid=0x4fb0 runnable [0x000000000950e000]
   java.lang.Thread.State: RUNNABLE
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.read(Unknown Source)
 ...
 at net.sourceforge.jtds.jdbc.SharedSocket.readPacket(Unknown)
 at net.sourceforge.jtds.jdbc.SharedSocket.getNetPacket(Unknown)
 ...
 at org.hibernate.jdbc.BatchingBatcher.doExecuteBatch(Unknown)
 ...
 at org.hibernate.impl.SessionImpl.flush(SessionImpl.java:1216)
 ...
 at com.mycompany.BucketRedistributor$Worker.updateBucketSizeDeltas()
 ...
 at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
 ...
 at java.lang.Thread.run(Unknown Source)

As you can see, our worker was updating the bucket sizes, which resulted in a Hibernate "flush" (which actually pushes the UPDATE statements across the wire to the database engine), and then we were waiting on the response packet from MSSQL for the statement to complete. Note that we are using the jTDS MSSQL driver (as evidenced by the net.sourceforge.jtds frames in the stack trace).

So the next question is why is the database just hanging out doing nothing?
Finding out what the database was doing...
MSSQL provides a lot of simple ways to get insight into what the database is doing. First, let's see the state of all of the connections. Open SQL Server Management Studio (SSMS), click New Query, and run exec sp_who2.

This will return one row per session (spid) connected to the database.

You can see all of the spids for sessions to the database. There are five that we're interested in: one for the "work queue" query and four for the workers performing updates. The sp_who2 output includes a BlkBy column, which shows the spid that is blocking a given spid when that spid is SUSPENDED.

In our output, spid 56 is the "work queue" SELECT query. Notice that no one is blocking it... then we see spids 53, 54, 60, and 61, which are all waiting on 56 (or on each other). Disregard 58 -- its application source shows it is just Management Studio itself.

How curious! The reader query is blocking all of the update workers and preventing them from pushing their size updates. The reader "work queue" query looks like:

SELECT c.cdId, c.bucketId FROM Buckets b INNER JOIN Cds c ON b.bucketId = c.bucketId WHERE b.collectionId = 'Steves Collection' AND b.generation = 23
Investigating blocking and locking problems...
I see that spid 56 is blocking everyone else. So what locks is 56 holding? In a new query window, I ran exec sp_lock 56 and exec sp_lock 53 to see which locks each was holding and who was waiting on what.


You can see that 56 was holding an S (shared) lock on a KEY resource (i.e. a row lock on an index) of object 1349579846, which corresponds to the Buckets table.

I wanted to see the engine's execution plan for the reader "work queue" query. To get this, I executed a query that I created a while ago to dump as many details as possible about the current sessions in the system -- think of it as a "super who2":

select es.session_id, es.host_name, es.status as session_status, sr.blocking_session_id as req_blocked_by,
datediff(ss, es.last_request_start_time, getdate()) as last_req_submit_secs,
st.transaction_id as current_xaction_id, 
datediff(ss, dt.transaction_begin_time, getdate()) as xaction_start_secs,
case dt.transaction_type 
 when 1 then 'read_write' 
 when 2 then 'read_only'
 when 3 then 'system'
 when 4 then 'distributed'
 else 'unknown'
end as trx_type,
sr.status as current_req_status, 
sr.wait_type as current_req_wait, 
sr.wait_time as current_req_wait_time, sr.last_wait_type as current_req_last_wait, 
sr.wait_resource as current_req_wait_rsc, 
es.cpu_time as session_cpu, es.reads as session_reads, es.writes as session_writes, 
es.logical_reads as session_logical_reads, es.memory_usage as session_mem_usage,
es.last_request_start_time, es.last_request_end_time, es.transaction_isolation_level,
sc.text as last_cnx_sql, sr.text as current_sql, sr.query_plan as current_plan
from sys.dm_exec_sessions es
left outer join sys.dm_tran_session_transactions st on es.session_id = st.session_id
left outer join sys.dm_tran_active_transactions dt on st.transaction_id = dt.transaction_id
left outer join 
 (select srr.session_id, srr.start_time, srr.status, srr.blocking_session_id, 
 srr.wait_type, srr.wait_time, srr.last_wait_type, srr.wait_resource, stt.text, qp.query_plan
 from sys.dm_exec_requests srr
 cross apply sys.dm_exec_sql_text(srr.sql_handle) as stt
 cross apply sys.dm_exec_query_plan(srr.plan_handle) as qp) as sr on es.session_id = sr.session_id

left outer join 
 (select scc.session_id, sct.text
 from sys.dm_exec_connections scc
 cross apply sys.dm_exec_sql_text(scc.most_recent_sql_handle) as sct) as sc on sc.session_id = es.session_id

where 
es.session_id >= 50

In the above output, the last column is the XML execution plan. Viewing that for spid 56, I confirmed my suspicion: the plan to serve the "work queue" query was to seek the "music collection" index on the Buckets table for 'Steves Collection', then seek into the clustered index to confirm generation = 23, then seek into the bucketId index on the Cds table. So to serve the WHERE clause in the "work queue" query, the engine had to use both the non-clustered index on Buckets and the clustered index (for the generation predicate).

When joining and reading rows at READ COMMITTED isolation level, the engine will acquire locks as it traverses from index to index in order to ensure consistent reads. Thus, to read the value of the generation in the Buckets table, it must acquire a shared lock. And it has!

The problem comes when a competing session tries to update the size on that same record of the Buckets table. It needs an X (exclusive) lock on that row, and eek! it can't get it, because the reader query already holds a conflicting S lock.

Ok, so that all makes sense, but why is the S lock being held? At READ COMMITTED you usually only hold a lock while the record is being read (there are exceptions, and we'll get to those in Part 2); it is released as soon as the value is read. So if you read 10 rows in a single statement execution, the engine will: acquire the lock on row 1, read row 1, release the lock on row 1, acquire the lock on row 2, read row 2, release the lock on row 2, and so on. And note that none of the four workers are currently reading -- they are writing, or at least they would be if that pesky reader connection weren't blocking them.

To dig into that, I wanted to know why the reader query was sitting in a SUSPENDED state (see the original sp_who2 output above). In the "super who2" output, the current_req_wait value for the "work queue" read query is ASYNC_NETWORK_IO.
ASYNC_NETWORK_IO wait and how databases return results
ASYNC_NETWORK_IO is an interesting wait. Let's discuss how remote applications execute and consume SELECT queries from databases.
Over-simplifying quite a bit, within the database there are two chunks of memory to discuss: the buffer cache and the connection network buffer. The buffer cache is a shared chunk of memory where the actual pages of the tables and indexes are kept to serve queries, so parts of the Buckets and Cds tables will be in memory while this "work queue" query executes. The execution engine executes the plan, working out of the buffer cache, acquiring locks, and producing output records to send to the client. As it prepares these output records, it puts them in a connection-specific network buffer. When the application reads records from the result set, it's actually being served from that network buffer in the database. The application driver typically has its own buffer as well.

When you just execute a simple SQL SELECT query and don't explicitly declare a database cursor, MSSQL gives you what it calls the "default result set" -- which is still a cursor of sorts -- you can think of it as a cursor over a bunch of records that you can only iterate over once in the forward direction. As your application threads iterate over the result set, the driver requests more chunks of rows from the database on your behalf, which in turn depletes the network buffer.
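
On the application side, the reader looks roughly like the following JDBC sketch. The real code goes through our ORM layer, and the query and fetch size here are just illustrative (setFetchSize is only a hint to the driver):

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Iterate the "default result set": forward-only, read-once. The driver pulls
    // rows from the connection's network buffer as we call next(); the engine keeps
    // producing rows on its side until that buffer fills up.
    void readWorkQueue(Connection conn) throws Exception {
        Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(500);   // a hint for how many rows to buffer per round trip
        ResultSet rs = stmt.executeQuery(
            "SELECT c.cdId, c.bucketId FROM Buckets b " +
            "INNER JOIN Cds c ON b.bucketId = c.bucketId " +
            "WHERE b.collectionId = 'Steves Collection' AND b.generation = 23");
        while (rs.next()) {
            long cdId = rs.getLong("cdId");
            // ...hand the row off to a worker chunk. If we stall here, the network
            // buffer fills and the spid goes SUSPENDED with ASYNC_NETWORK_IO,
            // still holding whatever locks it has.
        }
    }
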

However, with very large result sets, the entire result cannot fit in the connection's network buffer. If the application doesn't read rows fast enough, then eventually the network buffer fills up, and the execution engine must stop producing new result records to send to the client application. When this happens the spid is suspended, and it is suspended with the wait event ASYNC_NETWORK_IO. It's a slightly misleading wait name, because it makes you think there might be a network performance problem, but it's more often an application design or performance problem. Note that when the spid is suspended -- just like any other suspension -- the currently held locks remain held until the spid is resumed.

In our case, we know that we have millions of CDs and we can't fit them all in application memory at one time. We, by design, want to take advantage of the fact that we can stream results from the database and work on them in chunks. Unfortunately, if we happen to be holding a conflicting lock (S lock on Bucket record) when the reader query is suspended, then we create a multi-layer application deadlock, as we observed, and the whole system screeches to a halt.

So what to do for a solution? I will discuss some options and our eventual decision in Parts 2 and 3. Note that I gave one hint at our first attempt when I talked about "covering indexes", and then there is another hint above that we didn't get to in this post about "lock escalation".

Steve