Many Cups of Coffee: discussion of software development pragmatics in the space of java enterprise development, the confluence of errors that inevitably plague such projects, and the many cups of coffee required to make it through the day.

<h3>Guavate: tiny library bridging Guava and Java8 (2016-07-04)</h3>
<p>
Java8 is great and adds to the JDK some useful abstractions that had already found popularity in the Java community via the wonderful <a href="https://github.com/google/guava">Guava commons library</a> from Google. Group discussion indicates that there will be a Guava version <a href="https://groups.google.com/d/msg/guava-discuss/ZRmDJnAq9T0/-HExv44eCAAJ">soon</a> that requires Java 8 and closes the gap between Guava and Java8. Until then, however, the rest of us using Guava+Java8 need a tiny shim library for things like Collector implementations that produce Guava immutable collections.
</p>
<p>
As always <a href="https://groups.google.com/d/msg/guava-discuss/oWv4ee0BCHc/2HIyXjf7JSwJ">Stephen Colebourne threw together</a> exactly such a tiny utility class: <a href="https://github.com/OpenGamma/Strata/blob/master/modules/collect/src/main/java/com/opengamma/strata/collect/Guavate.java">Guavate</a>. Unfortunately, it's buried inside of Strata, and for all of my projects I don't want to depend on Strata just for this tiny shim. Also, I have a few Java8 shim methods myself that could use a home. Therefore, <a href="https://github.com/steveash/guavate">I forked Colebourne's Guavate</a> and have released it to Maven Central for anyone else that wants to add this tiny shim library to their Java8 projects:
</p>
<script type="syntaxhighlighter" class="brush: xml"><![CDATA[
<dependency>
<groupId>com.github.steveash.guavate</groupId>
<artifactId>guavate</artifactId>
<version>1.0.0</version>
</dependency>
]]></script>
<p>
There are Collector implementations for each of the Immutable collections:
</p>
<script type="syntaxhighlighter" class="brush: java"><![CDATA[
List<String> inputs = Lists.newArrayList("a", "b", "c");
ImmutableSet<String> outputs = inputs.stream()
    .filter(it -> !it.startsWith("b"))
    .map(String::toUpperCase)
    .collect(Guavate.toImmutableSet());
// outputs is an immutable set of "A" and "C"
]]></script>
<p>
There are also some convenient methods for collecting to maps from Map.Entry (and Commons-Lang3's Pair, as it implements Entry):
</p>
<script type="syntaxhighlighter" class="brush: java"><![CDATA[
Map<String, Integer> inputs = ImmutableMap.of(
    "bob", 1,
    "jon", 2,
    "mary", 3
);
Map<String, Integer> outputs = inputs.entrySet().stream()
    .map(e -> Pair.of(e.getKey().toUpperCase(), e.getValue()))
    .collect(Guavate.entriesToMap());
// outputs is a map of BOB:1, JON:2, MARY:3
]]></script>
<p>
Converting an arbitrary iterable into a stream (which should've been in the JDK to begin with):
</p>
<script type="syntaxhighlighter" class="brush: java"><![CDATA[
Iterable<String> values = // ...
Stream<String> streamVals = Guavate.stream(values);
]]></script>
<p>
and converting an Optional into a stream of zero or one element:
</p>
<script type="syntaxhighlighter" class="brush: java"><![CDATA[
Optional<String> maybe = // ...
Stream<String> stream = Guavate.stream(maybe);
]]></script>
<br/>
Check out the GitHub project page to follow updates or submit pull requests with your own Java8 additions: <br/>
<a href="https://github.com/steveash/guavate">https://github.com/steveash/guavate</a>Anonymoushttp://www.blogger.com/profile/06360294285988063025noreply@blogger.comtag:blogger.com,1999:blog-4471109663192942674.post-10805011168710832202015-12-19T10:43:00.000-06:002015-12-19T10:43:50.129-06:00Some neat Spring4 and Spring Boot featuresHere's a few slides on some of the Spring4 and Spring Boot features that I used on a recent project that are interesting. This is not really an introduction to Spring Boot -- but just highlighting a few features. This talk was an internal talk for our team - thus it is informal and brief. But in case anyone else has any interest:
<p/>
<iframe src="https://docs.google.com/presentation/d/1Dlamdl868CXAh301rH-AI0ugaWCUqv_ld6Eu9TvPX_A/embed?start=false&loop=false&delayms=3000" frameborder="0" width="480" height="389" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>Anonymoushttp://www.blogger.com/profile/06360294285988063025noreply@blogger.comtag:blogger.com,1999:blog-4471109663192942674.post-73042195730178566272015-09-01T02:34:00.000-05:002015-09-01T23:23:49.580-05:00Whirlwind tour of data scienceI'm giving a talk to the local <a href="http://www.meetup.com/memphis-technology-user-groups/events/223473111/">Data Science / Machine Learning meetup</a>. The topic is a shallow tour of data science topics including tasks and methods. This is a new group and many of the attendees do not have a data science background but are interested. Therefore, I thought a talk that went through the highlights would be useful. Not only is the tour a whirlwind, but creating the slides was a bit of a rush job in the wee hours of the night, so please excuse typos, factual errors, slander, blatent malicious lies, etc.
<br/>
<br/>
Presentation is: <a href="http://slides.com/steveash/datascience/">http://slides.com/steveash/datascience/</a>

<h3>Streaming standard deviation and replacing values (2015-08-06)</h3>
<p>I'm working on a problem right now where I need to calculate the mean and standard deviation of a value in a streaming fashion. That is, I receive a stream of events that generate data points, and I need to calculate the standard deviation without retaining all of the data points in memory. This has been covered ad nauseam in the literature -- I will just briefly summarize.</p>
<p>Ignoring the whole sample vs population correction, variance can be naively computed as the <i>mean</i> of the squares minus the <i>square</i> of the mean:</p>
$$\sigma^2 = \frac{\sum\nolimits_{i=1}^n x_{i}^2}{n} - \left( \frac{\sum\nolimits_{i=1}^n x_i}{n} \right)^2$$
<p>But this approach is plagued with <a href="http://www.johndcook.com/blog/standard_deviation/">well-documented stability and overflow problems</a>. There are two other methods that I've run across: <a href="http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/">Welford's</a>, which is by far the most published, and <a href="http://blogs.sas.com/content/iml/2012/01/18/compute-a-running-mean-and-variance.html">Ross's</a>, which appears to be an independent derivation.</p>
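<p>For contrast, here is that naive formula as a streaming computation (a minimal sketch); it works on paper, but it suffers from exactly the stability problems described in the links above:</p>
<pre class="brush: java">
// Naive streaming variance straight from the formula: keep running sums of x and x^2.
// Simple, but prone to cancellation/overflow when values are large or tightly clustered.
public static double naiveStddev(DoubleStream values) {
    OfDouble iter = values.iterator();
    double sum = 0;
    double sumSq = 0;
    long count = 0;
    while (iter.hasNext()) {
        double x = iter.nextDouble();
        sum += x;
        sumSq += x * x;
        count += 1;
    }
    if (count == 0) return 0;
    double mean = sum / count;
    return Math.sqrt((sumSq / count) - (mean * mean));
}
</pre>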
<p>Welford's method keeps track of the count, the mean, and a sum of the squared differences from the mean as it goes like this:</p>
<pre class="brush: java">
public static double welford(DoubleStream values) {
    OfDouble iter = values.iterator();
    if (!iter.hasNext()) return 0;

    // m is the running mean, s is the running sum of squared deltas from the mean
    double m = iter.nextDouble();
    double s = 0;
    int count = 1;
    while (iter.hasNext()) {
        double x = iter.nextDouble();
        count += 1;
        double prevM = m;
        double delta = x - prevM;
        m = prevM + (delta / count);
        s += (delta * (x - m));
    }
    return Math.sqrt(s / (count - 1));
}
</pre>
<p>This works, but what happens if you need to replace a value in the data population with another? Here's a scenario: you want to track the number of times people click on your ads. You store some counter per person and increment it every time someone clicks something. If you group those people into segments like "people from USA" and "people from Canada", you might want to keep track of the average number of ad clicks per day for someone from Canada, and the standard deviation of that average. The data population that you are averaging and calculating the stddev for is the number of clicks per day -- but you don't want to run big expensive queries over all of your people records.</p>
<p>A solution would be: when you get an ad click for Steve, you notice that his previous counter value for today was 3 and you're about to make it four. So you previously accounted for a 3 in the group's mean/stddev calculation. Now you just need to remove the 3 and add the 4.</p>
<p>I didn't find a solution to this at first glance...though I did as I was writing this blog post >:| (more on that later). Here's one approach to replacing a value:</p>
<pre class="brush: java">
public void replace(double oldValue, double newValue) {
    if (count == 0) {
        add(newValue); // just add it new
        return;
    }
    // precisely update the mean
    double prevM = m;
    double sum = m * count;
    sum -= oldValue;
    sum += newValue;
    m = sum / count;

    // back out the old value's contribution to s and add in the new value's
    s -= ((oldValue - prevM) * (oldValue - m));
    s += ((newValue - prevM) * (newValue - m));
}
</pre>
<p>Since we have the mean and count, we can <i>precisely</i> update the mean. Now we just back out the previous contribution to the sum of squared deltas and add in the new one. The mean at the point of removal will be slightly different, so it makes sense that this method would introduce a slight error. To measure the amount of error, I ran some simulations. I tried many variations on population size and modification count, and drifted the mean over the modifications to simulate underlying data drift. Generally the error was very low -- keeping 7+ digits in agreement, which in most cases was an error < 10^-7.</p>
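<p>For completeness, here is a sketch of the incremental add that replace falls back to for an empty population -- it is just the loop body of the welford method above, applied to the same count/m/s fields:</p>
<pre class="brush: java">
// Standard Welford update; also handles the very first value (assuming m starts at 0).
public void add(double x) {
    count += 1;
    double prevM = m;
    double delta = x - prevM;
    m = prevM + (delta / count);
    s += (delta * (x - m));
}
</pre>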
<p>Here is the output of a simulation with a data population of size 1000 that went through 1 million random replacements. The mean was originally 50 and drifted up to 100,000 -- so a significant change in the underlying population. The final value calculated by the above "replacement" method was 102.7715945 and the exact actual value was 102.7715947, which agrees to 9 digits. Here is the output every 100,000 modifications. You can see that the error increases, but very slowly.</p>
<pre>
Real 101.77630883973457 appx 101.77630883877222 err 9.455489999879566E-12
Real 108.95998062305821 appx 108.95998062134508 err 1.5722586742180294E-11
Real 111.47312659988472 appx 111.47312659954882 err 3.0133000046629816E-12
Real 104.24140320017268 appx 104.2414031714818 err 2.7523494838677614E-10
Real 108.11933587379178 appx 108.11933580405407 err 6.450068193382661E-10
Real 109.42545081143268 appx 109.42545070096071 err 1.0095637905406985E-9
Real 108.65152218111096 appx 108.65152203715265 err 1.3249544449723884E-9
Real 103.34165956961682 appx 103.3416593951618 err 1.6881384078315523E-9
Real 102.77159471215656 appx 102.77159451350815 err 1.9329116036550036E-9
</pre>
<p>As I was publishing this I found <a href="http://lingpipe-blog.com/2009/07/07/welford-s-algorithm-delete-online-mean-variance-deviation/">another approach</a> for deleting values from Welford's standard deviation. This doesn't update the mean precisely, and I'm guessing that that introduces just a little more error. Running the exact same data through his approach shows just slightly more error:
<pre>
Real 101.77630883973457 appx 101.77630883769623 err 2.0027587650921094E-11
Real 108.95998062305821 appx 108.95998062710102 err 3.710356226429184E-11
Real 111.47312659988472 appx 111.4731267005022 err 9.026165066977801E-10
Real 104.24140320017268 appx 104.2414035377799 err 3.2387056223459957E-9
Real 108.11933587379178 appx 108.11933681146603 err 8.672586049942834E-9
Real 109.42545081143268 appx 109.42545279287886 err 1.8107726862840628E-8
Real 108.65152218111096 appx 108.65152556310522 err 3.112698463527759E-8
Real 103.34165956961682 appx 103.34166620916582 err 6.424852307519515E-8
Real 102.77159471215656 appx 102.77160676463879 err 1.1727444988401424E-7
</pre>
<p>I put my <a href="https://gist.github.com/steveash/f9fabd193f19400f063c">code in a gist</a> in case anyone wants to snag it.</p>

<h3>Using ELK (elasticsearch + logstash + kibana) to aggregate cassandra and spark logs (2015-06-07)</h3>
On one of my current projects we are using Cassandra and Spark Streaming to do some somewhat-close-to-real-time analytics. The good folks at <a href="http://www.datastax.com/" target="_blank">Datastax</a> have built a commercial packaging (Datastax Enterprise, aka DSE) of Cassandra and Spark that allows you to get this stack up and running with relatively few headaches. One thing the Datastax offering does not include is a way to aggregate logs across all of these components. There are quite a few processes running across the cluster, each producing log files. Additionally, Spark creates log directories for each application and driver program, each with their own logs. Between the high count of log files and the fact that work happens on different nodes each time depending on how the data gets partitioned, it can become a huge time sink to hunt around and grep for the log messages that you care about.<br />
<br />
Enter the ELK stack. ELK is made up of three products:
<br />
<br />
<ul>
<li><b><a href="https://www.elastic.co/products/elasticsearch" target="_blank">Elasticsearch</a></b> - a distributed indexing and search platform. It provides a REST interface on top of a fancy distributed, highly available full text and semi-structured text searching system. It uses Lucene internally for the inverted index and searching.</li>
<li><b><a href="https://www.elastic.co/products/logstash" target="_blank">Logstash</a></b> - log aggregator and processor. This can take log feeds from lots of different sources, filter and mutate them, and output them somewhere else (elasticsearch in this case). Logstash is the piece that needs to know about whatever structure you have in your log messages so that it can pull the structure out and send that semi-structured data to elasticsearch. </li>
<li><b><a href="https://www.elastic.co/products/kibana" target="_blank">Kibana</a></b> - web based interactive visualization and query tool. You can explore the raw data and turn it into fancy aggregate charts and build dashboards.</li>
</ul>
<div>
All three of these are open-source, Apache 2 licensed projects built by <a href="https://www.elastic.co/" target="_blank">Elastic</a>, a company founded by the folks that wrote Elasticsearch and Lucene. They have all of the training, professional services, and production support subscriptions, and a stable of products with confusing names that you aren't quite sure if you need or not...</div>
<div>
<br /></div>
<div>
So how does this look at a high level? Spark and Cassandra run co-located on the same boxes. This is by design so that your Spark jobs can use RDDs that use a partitioning scheme that is aware of Cassandra's ring topology. This can minimize over-the-wire data shuffles, improving performance. This diagram shows at a high level where each of these processes sit in this distributed environment:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKgYIdbb-wYCAS5W62wjtfwcx-dzJjiO5ZKf2ANVoBorEHozd3hyxpBhT-1SxOjRQYJReanrOd3grDiul9dOCFjtVELkTJG-N6_KF5jSNlcYtne3cOpMwLQUBGudnmwhuzpMaeEn-RcDQ/s1600/spark_cass_elk.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="120" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKgYIdbb-wYCAS5W62wjtfwcx-dzJjiO5ZKf2ANVoBorEHozd3hyxpBhT-1SxOjRQYJReanrOd3grDiul9dOCFjtVELkTJG-N6_KF5jSNlcYtne3cOpMwLQUBGudnmwhuzpMaeEn-RcDQ/s400/spark_cass_elk.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
This only depicts two "analytics" nodes and one ELK node, but obviously you will have more of each. Each analytics node will be producing lots of logs. Spark writes logs to:</div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<ul>
<li>/var/log/spark/worker</li>
<li>/var/lib/spark/worker/worker-0/app-{somelongtimestamp}/{taskslot#}</li>
<li>/var/lib/spark/worker/worker-0/driver-{somelongtimestamp}</li>
</ul>
<br />
<div class="separator" style="clear: both; text-align: left;">
To collect and forward all of these logs, there is an agent process on each node called Logstash-Forwarder that monitors user-specified folders for new log files and ships them over TCP to the actual logstash server process running on the ELK node. Logstash receives these incoming feeds, parses them, and sends them to elasticsearch. Kibana responds to my interactive queries and delegates all of the search work to elasticsearch. Kibana doesn't store any results internally or have its own indexes or anything.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Others have already done a good job <a href="https://www.digitalocean.com/community/tutorials/how-to-install-elasticsearch-logstash-and-kibana-4-on-centos-7" target="_blank">explaining how to setup ELK</a> and <a href="https://www.timroes.de/2015/02/07/kibana-4-tutorial-part-1-introduction/" target="_blank">how to use Kibana</a>, and therefore I won't repeat all of that here. I will only highlight some of the differences, and share my Logstash configuration files that I had to create to handle the out-of-the-box log file formats for Cassandra and Spark (as packaged in DSE 4.7 at least).</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
I installed elasticsearch from the repo, which already created the systemd entries to run it as a service. Following the ELK setup link above, I created systemd entries for logstash and kibana. I also created a <a href="https://gist.github.com/steveash/82f5f3bd10b1a6eb7bcc#file-logstash-forwarder-service" target="_blank">systemd unit file for the logstash-forwarder</a> running on each analytics node. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The logstash-forwarder needs to be configured with all of the locations that spark and cassandra will put logfiles. It supports glob syntax, including recursive folder searches like "whatever/**/*.log", but I ended up not using that because it was duplicating some of the entries due to a wonky subfolder being created under the spark driver program log folder called cassandra_logging_unconfigured. My forwarder configuration picks up all of the output logs for the workers, the applications, and the driver and creates a different <b>type</b> for each: <b>spark-worker</b> for generic /var/log spark output, <b>spark-app</b> for app-* log folder, <b>spark-driver</b> for the driver programs (where most of the interesting logging happens). <a href="https://gist.github.com/steveash/82f5f3bd10b1a6eb7bcc#file-logstash-forwarder-conf" target="_blank">My logstash-forwarder.conf</a> is in the gist.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
For logstash I setup a few files as a pipeline to handle incoming log feeds:</div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<ul>
<li><a href="https://gist.github.com/steveash/82f5f3bd10b1a6eb7bcc#file-00_input-conf" target="_blank">00_input.conf</a> - sets up the lumberjack (protocol the forwarder uses) port and configures certs </li>
<li><a href="https://gist.github.com/steveash/82f5f3bd10b1a6eb7bcc#file-01_cassandra_filter-conf" target="_blank">01_cassandra_filter.conf</a> - parses the logback formats that DSE delivers for cassandra. Im not sure if vanilla open-source cassandra uses the same by defualt or not. There are two formats between sometimes there is an extra value in here -- possibly from the logback equivalent of NDC.</li>
<li><a href="https://gist.github.com/steveash/82f5f3bd10b1a6eb7bcc#file-02_spark_filter-conf" target="_blank">02_spark_filter.conf</a> - parses the logback formats that DSE delivers for spark. I see there are two formats I get here as well. Sometimes with a line number, sometimes without.</li>
<li><a href="https://gist.github.com/steveash/82f5f3bd10b1a6eb7bcc#file-07_last_filter-conf" target="_blank">07_last_filter.conf</a> - this is a multiline filter that recognizes java stacktraces and causes them to stay with the original message</li>
<li><a href="https://gist.github.com/steveash/82f5f3bd10b1a6eb7bcc#file-10_output-conf" target="_blank">10_output.conf</a> - sends everything to elasticsearch</li>
</ul>
<br />
<div class="separator" style="clear: both; text-align: left;">
All of my configuration files are available through the links above <a href="https://gist.github.com/steveash/82f5f3bd10b1a6eb7bcc" target="_blank">in this gist</a>. Between the linked guides above and the configuration here that should get you going if you need to monitor cassandra and spark logs with ELK!</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<b>Quick tip:</b> While you're getting things configured and working, you might need to kill the currently indexed data and resend everything (so that it can reparse and re-index). The logstash-forwarder keeps a metadata file called <span style="font-family: Courier New, Courier, monospace;">.logstash-forwarder</span> in the working directory where you started the forwarder. If you want to kill all of the indexed data and resend everything, follow these steps:</div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<ol>
<li>Kill the logstash-forwarder: <br /><span style="font-family: Courier New, Courier, monospace;">sudo systemctl stop logstash-forwarder</span></li>
<li>Delete the logstash-forwarder metadata so it starts from scratch next time: <br /><span style="font-family: Courier New, Courier, monospace;">sudo rm /var/lib/logstash-forwarder/.logstash-forwarder</span></li>
<li>If you need to change any logstash configuration files, do that and then restart logstash: <br /><span style="font-family: Courier New, Courier, monospace;">sudo systemctl restart logstash</span></li>
<li>Delete your existing elasticsearch indexes (be careful with this!): <br /><span style="font-family: Courier New, Courier, monospace;">curl -XDELETE 'http://&lt;elasticsearch-host&gt;:9200/logstash-*'</span></li>
<li>Start the logstash-forwarder again:<br /><span style="font-family: Courier New, Courier, monospace;">sudo systemctl start logstash-forwarder</span></li>
</ol>
Also note that by default the logstash-forwarder will only pick up files less than 24 hours old. So if you're configuring ELK and playing with filters on stagnant data, make sure at least some files have been touched recently so it will pick them up. Check out the log file in /var/log/logstash-forwarder to see if it's skipping particular entries. You can also run the logstash-forwarder with -verbose to see additional debug information.<div>
<br /></div>
<div>
<b>Quick tip:</b> use the wonderfully useful <a href="http://grokdebug.herokuapp.com/">http://grokdebug.herokuapp.com/</a> to test out your grok patterns to make sure they match and pull out the fields you want.<br /><br />
<div>
<br /></div>
</div>
<h3>Bushwhacker to help team members debug unit test failures (2014-08-24)</h3>
Right now I'm working on a project with a few people who have never worked in Java before. There have been times where they'll hit some problem that they can't easily figure out. They send me the stack trace, and instantly I know what the problem is. I wanted a simple way to automate this.<br />
<br />
So I created <b><a href="https://github.com/steveash/bushwhacker">Bushwhacker</a></b>, a simple library to help provide more helpful messages at development time. Senior members of the team edit and maintain a simple xml file that describes rules to detect particular exceptions and either replace or enhance the exception message in some way. The rules are fairly sophisticated -- allowing you to, for example, only detect IllegalArgumentException if it was thrown from a stack frame in MyClassWhatever. Since the xml rules file is read off of the classpath, you can version the XML file along with your source code (in src/test/resources for example). And you can even maintain multiple rules files for different modules and compose them all together. So if there are common pitfalls in your corporate "commons" module you can keep the bushwhacker rules for it there, and put your team's additional rules in your module.<br />
<br />
It's on the Maven Central Repo so it's easy to add to your project. Check out the usage instructions here: <a href="https://github.com/steveash/bushwhacker">https://github.com/steveash/bushwhacker</a>
<br />
<br />
Here's a few of the rules that we're using right now:<br />
<br />
<br />
<script class="brush: xml" type="syntaxhighlighter"><![CDATA[
<exception>
<matches>
<exceptionClass>org.springframework.beans.factory.NoSuchBeanDefinitionException</exceptionClass>
<calledFrom>org.springframework.transaction.aspectj.AbstractTransactionAspect*</calledFrom>
<messageMatches>.*No qualifying bean of type.*PlatformTransactionManager.*</messageMatches>
</matches>
<action>
<addHint>
This usually happens when your unit test extends from the wrong test fixture. Since you
are using something that needs the database, extend from ExternalDbFraudTestFixture. You
might not even realize that you're trying to use the database, but some code is going over an
@Transactional which is causing it to try and use the database.
</addHint>
</action>
</exception>
<exception>
<matches>
<exceptionClass>java.lang.IllegalArgumentException</exceptionClass>
<calledFrom>org.jooq.impl.Utils*</calledFrom>
<messageMatches>.*Field.*is not contained in Row.*</messageMatches>
</matches>
<action>
<addHint>
Check the jooq query's select list- you probably forgot to add a field to the SELECT. A jooq query is just
like a SQL query you select certain columns and you get a result set of rows. If you ask that resultset for
a column that isn't included in the SELECT list then you will see this exception.
</addHint>
</action>
</exception>
]]></script>

<h3>Single record UPSERT in MSSQL 2008+ (2014-06-24)</h3>
I guess I just don't use SQL's MERGE statement enough to remember its syntax off the top of my head. When I look at the <a href="http://msdn.microsoft.com/en-us/library/bb510625.aspx">Microsoft documentation</a> for it, I can't find just a simple example to do a single record UPSERT. When I google, I don't get any brief blog posts that show it. So here's a brief one to add to the pile:
<br />
<br />
Table:<br />
<br />
<b>DATA_VERSION</b><br />
<b>datakey varchar(16) not null primary key,</b><br />
<b>seqno bigint not null,</b><br />
<b>last_updated datetime not null</b><br />
<b><br /></b>
The table just keeps track of sequence numbers for things identified by 'datakey'. The first time we want to increment a value for a key, we need to insert a new record starting with seqno = 1; every time after that we just want to increment the existing row.<br />
<br />
MSSQL:
<br />
<script type="syntaxhighlighter" class="brush: sql"><![CDATA[
-- the particular key you want to update is 'somekey'
MERGE DATA_VERSION t
USING (SELECT 'somekey' as datakey) as s
ON (t.datakey = s.datakey)
WHEN MATCHED THEN
    UPDATE SET t.seqno = t.seqno + 1, t.last_updated = getdate()
WHEN NOT MATCHED THEN
    INSERT (datakey, seqno, last_updated) VALUES ('somekey', 1, getdate())
;
]]></script>
<h3>Spring Test Context Caching + AspectJ @Transactional + Ehcache pain (2014-03-30)</h3>
Are you using AspectJ @Transactionals and Spring? Do you have multiple SessionFactory instances, maybe one for an embedded database for unit testing and one for the real database for integration testing? Are you getting one of these exceptions?
<br />
<blockquote>
org.springframework.transaction.CannotCreateTransactionException: Could not open Hibernate Session for transaction; nested exception is org.hibernate.service.UnknownServiceException: Unknown service requested </blockquote>
or
<br />
<blockquote>
java.lang.NullPointerException
at net.sf.ehcache.Cache.isKeyInCache(Cache.java:3068)
at org.hibernate.cache.ehcache.internal.regions.EhcacheDataRegion.contains(EhcacheDataRegion.java:223)</blockquote>
Then you are running into a problem where multiple, cached application contexts are stepping on each other. This blog post will describe some strategies to deal with the problems that we have encountered.
<br /><br />
<h3>Background</h3>
<br />
Spring's Test Context framework by default tries to minimize the number of times the spring container has to start by caching the containers. If you are running multiple tests that all use the same configuration, then you will only have to create the container once for all of the tests instead of creating it before each test. If you have thousands of tests and the container takes 10-15 seconds to start up, this makes a real difference in build/test time.<br />
<br />
This only works if everyone (you and all of the libraries that you use) avoids static fields (global state), and unfortunately there are places where this is difficult or impossible to avoid -- even spring violates this! A few places that have caused us problems:
<br />
<ul>
<li>Spring AspectJ @Transactional support</li>
<li>EhCache cache managers</li>
</ul>
Aspects are singletons by design. Spring uses this to place a reference to the <span style="font-family: Courier New, Courier, monospace;">BeanFactory</span> as well as the <span style="font-family: Courier New, Courier, monospace;">PlatformTransactionManager</span> inside the aspect. If you have multiple containers, each with their "own" AnnotationTransactionAspect, they are in fact sharing the same AnnotationTransactionAspect, and whichever container starts up last is the "winner", causing all kinds of unexpected, hard-to-debug problems.
<br />
<br />
Ehcache is also a pain here. The ehcache library maintains a static list of all of the cache managers that it has created in the VM. So if you want to use multiple containers, they will all share a reference to the same cache.
<br />
<br />
Spring Test gives you a mechanism to indicate that a test has "dirtied" the container and that it needs to be recreated. This translates to destroying the container after the test class is finished. This is fine, but if your container has objects that are shared by the other containers, then destroying that shared object breaks the other containers.
<br /><br />
<h3>Solutions</h3>
<br />
The easiest solution is to basically disable the application context caching entirely. This can be done simply by placing <span style="font-family: Courier New, Courier, monospace;">@DirtiesContext</span> on every test, or (better) -- since you probably should use super classes ("abstract test fixtures") to organize your tests anyways -- just add <span style="font-family: Courier New, Courier, monospace;">@DirtiesContext</span> on the base class. Unfortunately you also lose all of the benefit of caching, and your build times will increase.
<br />
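For that first option, here is a minimal sketch of such a base fixture (the class names are illustrative, not from a real project):
<br/>
<pre class="brush: java">
// Placing @DirtiesContext on the shared base fixture marks the container dirty after
// each test class, so Spring rebuilds it instead of reusing a cached (and possibly
// broken) one. Slow, but simple.
@DirtiesContext
@ContextConfiguration(classes = AppTestConfig.class)
@RunWith(SpringJUnit4ClassRunner.class)
public abstract class AbstractContainerTestFixture {
}
</pre>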
<br />
There is no general mechanism for the spring container to "clean itself up", because this sharing of state across containers is certainly an anti-pattern. The fact that they themselves do it (<span style="font-family: Courier New, Courier, monospace;">AnnotationTransactionAspect</span>, <span style="font-family: Courier New, Courier, monospace;">EhCacheManagerFactoryBean.setShared(true)</span>, etc.) is an indication that perhaps they should add some support. If you want to keep caching, then step 1 is making sure that you don't use any "static field" singletons in your own code. Also make sure that any external resources that you write to are separated so that multiple containers can co-exist in the same JVM.<br />
<br />
To address the AspectJ problem, the best solution that I have found is to create a TestExecutionListener that "resets" the AnnotationTransactionAspect to point to the correct bean factory and PTM before test execution. The code for such a listener is <a href="https://gist.github.com/steveash/9878621">in this gist</a>.<br />
<br />
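Here is a condensed sketch of the idea (the gist linked above has the real listener); it assumes the AspectJ-compiled aspect exposes the usual aspectOf() accessor and uses Spring's AbstractTestExecutionListener:
<br/>
<pre class="brush: java">
// Sketch only: before each test instance is prepared, re-point the singleton
// transaction aspect at the current test's container.
public class TransactionAspectResetListener extends AbstractTestExecutionListener {
    @Override
    public void prepareTestInstance(TestContext testContext) throws Exception {
        ApplicationContext ctx = testContext.getApplicationContext();
        AnnotationTransactionAspect aspect = AnnotationTransactionAspect.aspectOf();
        aspect.setTransactionManager(ctx.getBean(PlatformTransactionManager.class));
        aspect.setBeanFactory(ctx.getAutowireCapableBeanFactory());
    }
}
</pre>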
To then use the listener you put <span style="font-family: Courier New, Courier, monospace;">@TestExecutionListeners</span> on your base class test fixture so that all tests run with the new listener. Note that when you use the <span style="font-family: Courier New, Courier, monospace;">@TestExecutionListeners</span> annotation you then have to specify all of the execution listeners, including the existing Spring ones. There is an example <a href="https://gist.github.com/steveash/9878621#file-sampletestfixture-java">in the gist</a>.<br />
<br />
The workaround for Ehcache is to not allow CacheManager instances to be shared between containers. To do this, you have to ensure that the cache managers all have unique names. This is actually pretty easy to configure.
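The essence of it, as a sketch (assuming Spring's EhCacheManagerFactoryBean; the embedded gist below shows the actual configuration), is to give each container's CacheManager a unique name and not mark it as shared:
<br/>
<pre class="brush: java">
// Each application context gets its own uniquely named, non-shared CacheManager, so
// destroying one test container cannot break the caches another container is using.
@Bean
public EhCacheManagerFactoryBean cacheManager() {
    EhCacheManagerFactoryBean factory = new EhCacheManagerFactoryBean();
    factory.setCacheManagerName("cacheManager-" + UUID.randomUUID());
    factory.setShared(false);
    return factory;
}
</pre>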
<br/>
<script src="https://gist.github.com/steveash/9879696.js"></script>
<br/>
<br />
<h3>Related Issues</h3>
<br />
Here are some links to spring jira issues covering this problem:<br />
<a href="https://jira.spring.io/browse/SPR-6121">https://jira.spring.io/browse/SPR-6121</a><br />
<a href="https://jira.spring.io/browse/SPR-6353">https://jira.spring.io/browse/SPR-6353 </a><br />
<br />

<h3>Java final fields on x86 a no-op? (2013-09-05)</h3>
I have always enjoyed digging into the details of multi-threaded programming, and I always enjoy that despite reading for years about CPU memory consistency models, wait-free and lock-free algorithms, the Java memory model, Java Concurrency in Practice, etc. -- I still create multi-threaded programming bugs. It's always a wonderfully humbling experience that reminds me how complicated a problem this is.
<br/><br/>
If you've read the JMM, then you might remember that one of the areas they strengthened was the guarantee of visibility of <i>final</i> fields after the constructor completes. For example,
<br/>
<pre>
public class ClassA {
    public final String b;

    public ClassA(String b) {
        this.b = b;
    }
}
...
ClassA x = new ClassA("hello");
</pre>
<br/>
The JMM states that every thread (even threads other than the one that constructed the instance, x, of ClassA) will <i>always</i> observe x.b as "hello" and would never see a value of null (the default value for a reference field).
<br/><br/>
This is really great! That means that we can create immutable objects just by marking the fields as final, and any constructed instance is automatically able to be shared amongst threads with no other work to guarantee memory visibility. Woot! The flip-side of this is that if ClassA.b were <b>not</b> marked as final, then you would have no such guarantee, and other threads could observe an x.b == null result (if no other "safe publication" mechanisms were employed to get visibility).
<br/><br/>
Well when they created the new JMM, everyone's favorite JCP member, Doug Lea, created a <a href="http://g.oswego.edu/dl/jmm/cookbook.html">cookbook</a> to help JVM developers implement the new memory model rules. If you read this, then you will see that the "rules" state that JIT compilers should emit a <i>StoreStore</i> memory barrier, right before the constructor returns. This <i>StoreStore</i> barrier is a kind of "memory fence". When emitted in the assembly instructions, it means that no memory writes (stores) <i>after</i> the fence can be re-ordered before memory writes that appear <i>before</i> the fence. Note that it doesn't say anything about reads -- they can "hop" the fence in either direction.
<br/><br/>
So what does this mean? Well, if you think about what the compiler does when you call a constructor:
<pre>
ClassA x = new ClassA("hello");

gets broken down into pseudo-code steps of:
1. pointer_to_A = allocate memory for ClassA
   (mark word, class object pointer, one reference field for String b)
2. pointer_to_A.whatever class meta data = ...
3. pointer_to_A.b = address of "hello" string
4. emit a StoreStore memory barrier per the JMM
5. x = pointer_to_A
</pre>
The StoreStore barrier at step 4 ensures that any writes (such as the class meta-data and field b) are <i>not</i> re-ordered with the write to x in step 5. This is what makes sure that if x is visible to any other thread, that other thread can't see x without seeing x.b as well. Without the StoreStore memory barrier, steps 3 and 5 <i>could</i> be re-ordered, the write to main memory for x could show up before the write to x.b, and another cpu core could observe pointer_to_A.b to be 0 (null), which would violate the JMM.
<br/><br/>
Great news! However, if you look at that cookbook you'll see some interesting things: (1) a lot of people are writing JVMs on lots of processor architectures! (2) <b>*all*</b> of the memory barriers on x86 are <b>no-ops</b> except the StoreLoad barrier! This means that on x86 the <i>StoreStore</i> memory barrier above is a no-op and thus no assembly is emitted for it. It does nothing! This is because x86's memory model is a <i>strong</i> "total store ordering" (TSO) model. X86 makes sure that all memory writes are observed as if they were all made in the same order. Thus, the write in step 5 would never appear before the write in step 3 to any other thread anyways due to TSO, and there is no need to emit a memory fence. Other cpu architectures have weaker memory models which do not make such guarantees, and thus the StoreStore memory fence is necessary. Note that weaker memory models, while perhaps harder or less intuitive to program against, are generally much faster, as the cpu can re-order things to make more efficient use of cache writes and reduce cache coherency work.
<br/><br/>
<b>Obviously you should continue to write <i>correct</i> code that follows the JMM.</b> BUT it also means (unfortunately or fortunately) that forgetting this will not lead to bugs if you're running on x86...like I do at work.
<br/><br/>
To really drill this home and ensure that there are no other side effects that maybe aren't being described in the cookbook, I ran the x86 assembly outputter as described <a href="http://mechanical-sympathy.blogspot.com/2013/06/printing-generated-assembly-code-from.html">here</a> and captured the output of calling the constructor for ClassA (with the final on the reference type field) and the constructor for a ClassB, which was identical to ClassA except without the final keyword on the class member. The x86 assembly output is <b>identical</b>. So from a JIT perspective, on x86 (not itanium, not arm, etc), the final keyword has no impact.
<br/><br/>
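For reference, the comparison class was just ClassA without the final keyword:
<pre>
public class ClassB {
    public String b;   // same as ClassA, minus final

    public ClassB(String b) {
        this.b = b;
    }
}
</pre>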
If you're curious what the assembly code looks like, here it is below. Note the lack of any <b>lock</b>ed instructions. When Oracle's 7u25 JRE emits an x86 StoreLoad memory fence, it is done by emitting <b>lock addl $0x0,(%rsp)</b>, which just adds zero to the stack pointer -- a no-op, <i>but</i> since it's locked it has the effect of a full fence (which meets the criteria of a StoreLoad fence). There are a few different ways in x86 to cause the effect of a full fence, and they are discussed on the OpenJDK mailing list. They observed that, at least on Intel Nehalem, the locked add of 0 was the most compact/efficient.
<pre>
0x00007f152c020c60: mov %eax,-0x14000(%rsp)
0x00007f152c020c67: push %rbp
0x00007f152c020c68: sub $0x20,%rsp ;*synchronization entry
; - com.argodata.match.profiling.FinalConstructorMain::callA@-1 (line 60)
0x00007f152c020c6c: mov %rdx,(%rsp)
0x00007f152c020c70: mov %esi,%ebp
0x00007f152c020c72: mov 0x60(%r15),%rax
0x00007f152c020c76: mov %rax,%r10
0x00007f152c020c79: add $0x18,%r10
0x00007f152c020c7d: cmp 0x70(%r15),%r10
0x00007f152c020c81: jae 0x00007f152c020cd6
0x00007f152c020c83: mov %r10,0x60(%r15)
0x00007f152c020c87: prefetchnta 0xc0(%r10)
0x00007f152c020c8f: mov $0x8356f3d0,%r11d ; {oop('com/argodata/match/profiling/FinalConstructorMain$ClassA')}
0x00007f152c020c95: mov 0xb0(%r11),%r10
0x00007f152c020c9c: mov %r10,(%rax)
0x00007f152c020c9f: movl $0x8356f3d0,0x8(%rax) ; {oop('com/argodata/match/profiling/FinalConstructorMain$ClassA')}
0x00007f152c020ca6: mov %r12d,0x14(%rax) ;*new ; - com.argodata.match.profiling.FinalConstructorMain::callA@0 (line 60)
0x00007f152c020caa: mov %ebp,0xc(%rax) ;*putfield a
; - com.argodata.match.profiling.FinalConstructorMain$ClassA::<init>@6 (line 17)
; - com.argodata.match.profiling.FinalConstructorMain::callA@6 (line 60)
0x00007f152c020cad: mov (%rsp),%r10
0x00007f152c020cb1: mov %r10d,0x10(%rax) ;*new ; - com.argodata.match.profiling.FinalConstructorMain::callA@0 (line 60)
0x00007f152c020cb5: mov %rax,%r10
0x00007f152c020cb8: shr $0x9,%r10
0x00007f152c020cbc: mov $0x7f152b765000,%r11
0x00007f152c020cc6: mov %r12b,(%r11,%r10,1) ;*synchronization entry
; - com.argodata.match.profiling.FinalConstructorMain::callA@-1 (line 60)
0x00007f152c020cca: add $0x20,%rsp
0x00007f152c020cce: pop %rbp
0x00007f152c020ccf: test %eax,0x9fb932b(%rip) # 0x00007f1535fda000
; {poll_return}
0x00007f152c020cd5: retq
0x00007f152c020cd6: mov $0x8356f3d0,%rsi ; {oop('com/argodata/match/profiling/FinalConstructorMain$ClassA')}
0x00007f152c020ce0: xchg %ax,%ax
0x00007f152c020ce3: callq 0x00007f152bfc51e0 ; OopMap{[0]=Oop off=136}
;*new ; - com.argodata.match.profiling.FinalConstructorMain::callA@0 (line 60)
; {runtime_call}
0x00007f152c020ce8: jmp 0x00007f152c020caa ;*new
; - com.argodata.match.profiling.FinalConstructorMain::callA@0 (line 60)
0x00007f152c020cea: mov %rax,%rsi
0x00007f152c020ced: add $0x20,%rsp
0x00007f152c020cf1: pop %rbp
0x00007f152c020cf2: jmpq 0x00007f152bfc8920 ; {runtime_call}
</pre>
Steve

<h3>Database intro for developers (2013-07-08)</h3>
Most of my academic background focused on low-level database research. I wrote a dynamic-programming-based join-order optimizer for the open-source SQLite database as a project in school. I have submitted a paper for publication in an academic conference that deals with optimizing low-level page structures and optimizer heuristics to improve throughput for certain kinds of workloads on solid-state drives. All in all, I've been a "nuts and bolts" database geek for a while -- more on the systems-level side than on the data mining and data warehousing side. I can talk about exotic fractal trees, various B*-tree variants for efficient indexing, and lock-free versions ad nauseam ;)
<br/>
<br/>
In any case, the level of willful ignorance about the basics of how databases work among typical enterprise developers has always surprised me. They are the workhorses of most enterprise apps, and getting them to perform well requires some knowledge of how they work. At least a basic understanding of how indexing works is useful in building up your mental intuition about what will perform and what won't in production. Everyone should also at least know the basic processing pipeline: parsing, optimization, execution.
<br/>
<br/>
This talk was meant to be a simple introduction to some of the basics of how databases process your queries for you. I'd like to do more follow-on talks to add more depth in the future. It is geared towards developers -- not DBAs -- so there is nothing in here about best practices for backups or transaction log shipping.
<br/>
<br/>
<iframe src="https://docs.google.com/presentation/d/1NzUrtH9qrhW8Gr0x3C_eLbyoz2U_M2GGDUYOGTw8-dc/embed?start=false&loop=false&delayms=3000" frameborder="0" width="480" height="389" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<br/>
<br/>
Steve

<h3>Introduction to multi-threaded java programming (2013-07-08)</h3>
I gave a talk for our group about multi-threaded Java programming for some of our junior members. I lifted the bulk of the slides from the interwebz, but added a fair amount of new content and commentary. It's not really that good of an introduction, but it's something.
<br/>
<iframe src="https://docs.google.com/presentation/d/1xh353FURocH6JsIsex6Xe4IJaESShz95Hv4cg0VCW5w/embed?start=false&loop=false&delayms=3000" frameborder="0" width="480" height="389" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<br/>
Steve

<h3>Java implementation of Optimal String Alignment (2013-04-20)</h3>
<span style="font-family: inherit;">For a while, I've used the Apache Commons Lang StringUtils implementation of <a href="http://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a>. It implements a few well-known tricks to use less memory by only hanging on to two arrays instead of allocating a huge n x m memoisation table. It also only checks a "stripe" of width 2 * k + 1, where k is the maximum number of edits.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">In most practical usages of levenshtein you just care if a string is within some small number (1, 2, 3) of edits from another string. This avoid much of the n * m computation that makes levenstein "expensive". We found that with a k <= 3, levenshtein with these tricks was faster than </span><a href="http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance">Jaro-Winkler distance</a>, which is an approximate edit distance calculation that was created to be a faster approximate (well there were many reasons).<br />
<br />
Unfortunately, the Apache Commons Lang implementation only calculates Levenshtein and not the possibly more useful <a href="http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance">Damerau-Levenshtein distance</a>. Levenshtein defines the edit operations insert, delete, and substitute. The Damerau variant adds *transposition* to the list, which is pretty useful for most of the places I use edit distance. Unfortunately the Optimal String Alignment (OSA) variant is not a true metric in that it doesn't respect the triangle inequality, but there are plenty of applications that are unaffected by this. As you can see from that wikipedia page, there is often confusion between Optimal String Alignment and DL distance. In practice OSA is a simpler algorithm and requires less book-keeping, so the runtime is probably marginally faster.<br />
<br />
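To make the difference concrete, here is a minimal full-matrix OSA sketch (none of the memory or stripe tricks); it is the usual Levenshtein recurrence plus one extra transposition case:
<br />
<pre>
static int osaDistance(CharSequence s, CharSequence t) {
    int n = s.length();
    int m = t.length();
    int[][] d = new int[n + 1][m + 1];
    for (int i = 0; i <= n; i++) d[i][0] = i;
    for (int j = 0; j <= m; j++) d[0][j] = j;
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
            d[i][j] = Math.min(Math.min(
                    d[i - 1][j] + 1,            // deletion
                    d[i][j - 1] + 1),           // insertion
                    d[i - 1][j - 1] + cost);    // substitution
            if (i > 1 && j > 1
                    && s.charAt(i - 1) == t.charAt(j - 2)
                    && s.charAt(i - 2) == t.charAt(j - 1)) {
                d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // transposition
            }
        }
    }
    return d[n][m];
}
</pre>
<br />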
I could not find any implementations of OSA or DL that used the memory tricks and "stripe" tricks that I saw in Apache Commons Lang. So I implemented my own OSA using those tricks. At some point I'll also implement DL with the tricks and see what the performance differences are:<br />
<br />
Here's OSA in Java. It's public domain; feel free to use it as you like. The unit tests are below. The only dependency is on Guava -- but it's just the Preconditions class and an annotation for documentation, so it's easy to remove that dependency if you like:<br />
<br />
<br /><script src="https://gist.github.com/steveash/5426191.js"></script>Anonymoushttp://www.blogger.com/profile/06360294285988063025noreply@blogger.comtag:blogger.com,1999:blog-4471109663192942674.post-54713282814560147352013-04-19T15:32:00.001-05:002013-04-20T09:09:22.272-05:00Some Java Puzzlers in a powerpointEvery week our team gets together on friday afternoons and have a presentation. The presenter rotates each week, and the topic can be programming language related, cool new libraries, theoretical -- or whatever related to our work.<br />
<br />
I love doing Java Puzzlers since they do a great job of enriching your understanding of the language in a fun way.
I've already done all of the Java Puzzlers that I could find from slides by Josh Bloch himself. So this time I created my own, lifting the examples from the book.<br />
<br />
In case this benefits anyone else-- here is mine:
<a href="https://docs.google.com/file/d/0B0MfVUaXs0kNaURxU1pLRW40NDQ/edit?usp=sharing">Java Puzzlers</a>Anonymoushttp://www.blogger.com/profile/06360294285988063025noreply@blogger.comtag:blogger.com,1999:blog-4471109663192942674.post-28627534719539238542012-12-14T17:27:00.000-06:002012-12-14T17:29:31.513-06:00Why and how we use SpringWe brought on some new team members recently and I couldn't find a good intro to Spring that went through the motivation of why one might use Spring. Thus, I created this presentation. Don't know if anyone else will find it valuable:<br />
<br />
I wanted something that went through the trade-off discussion of various options for dependency management (service locator, dependency injection) and then showed some practical uses of Spring and mentioned some pitfalls that I've encountered. This was meant to be a practical onboarding to the technology of Spring in Java.
<br />
<iframe src="https://docs.google.com/presentation/embed?id=1_B68o-1_FYE422bGYzqHqhEJnP532IbxPmWjNgHvKY4&start=false&loop=false&delayms=3000" frameborder="0" width="480" height="389" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
Steve

<h3>The Manufacturing of Software (2012-06-25)</h3>
<p>
Among non-technologists, I think there is a common belief -- a wishful thought -- that building software is like any manufacturing process. You just need to design the process, and then get the widgets on the assembly line to execute the process -- and voila! If a widget quits, you just buy a new widget at <strike>the widget store</strike> Monster.com.
</p>
<p>
Thankfully, most of the truly remarkable software projects and software teams over the last decade have debunked this myth and exposed the lie. The agile manifesto itself says to value <i>people</i> over <i>processes</i>, and this is really a statement against this manufacturing metaphor.
</p>
<p>
I ran across this wonderful quote while reading Eric Evans's Domain-Driven Design:
</p>
<blockquote>
Manufacturing is a popular metaphor for software development. One inference from this metaphor: highly skilled engineers design; less skilled laborers assemble the products. This metaphor has messed up a lot of projects for one simple reason -- software development is <i>all</i> design.
</blockquote>

<h3>Bye bye Eclipse -- my disorganized, irresponsible little brother (2012-04-08)</h3>
I have been an Eclipse fan-boy for the last 7 years. I have participated in a number of open source Eclipse plugin projects. I have evangelized it to others. I also have a personal connection to it, as my father was one of the project principals when it was still an IBM project. I have apologized for Eclipse a lot:
<ul>
<li>The auto-completion is slow and clunky and just doesn't "feel" right.</li>
<li>The static imports and favorites always seem to get in my way.</li>
<li>The Maven support (while dramatically improved) is still sometimes clunky.</li>
<li>The mismatch of hierarchical projects to non-hierarchical is awkward (maven multi-module projects)</li>
<li>The plugins, while extensive, are often not polished. I wanted to literally break my laptop in half after trying to write some Groovy code -- null pointers flowing up to popups, etc.</li>
<li>The EGit support, while improving, is just developing slowly and still has lots of bugs.</li>
<li>There is a lack of consistency across plugins -- they are obviously developed in isolation and lack a cohesive overall vision.</li>
<li>I still at least once a week have to hear about or help a team member who is stuck in some kind of Eclipse hell where builds are firing continuously or something else is going wrong.</li>
</ul>
<p>I've always loved the dream of Eclipse, as it is a simulacrum of why I love open source software. I like the bazaar over the cathedral. But at the end of the day, I need to get work done -- not fight with making Eclipse's plugins all play nicely.</p>
<p>An anecdote to illustrate my point: on my last project we were trying to do some Tapestry5 development using Eclipse's WTP + Tomcat. Tapestry has some fancy classloaders so that it can hot-load java class file changes without restarting/redeploying. We also wanted to use Tomcat7. Well, WTP has the ability to host the tomcat instance from the workspace, which works well with Tapestry's class loader stuff. Unfortunately, between the last version of WTP and Tomcat7 it was broken. The only person who works on the Tomcat connector for WTP is a guy named Larry Isaacs who has a full-time job at SAP. So it might get fixed when he has time...some day. I ended up patching the jar and distributing it to our team myself as a hack.</p>
<p>At the end of the day, I just need to get work done. And so far, <b>IntelliJ has been an improvement</b>. I still love Eclipse, but it needs more resources or more vision or more something to compete well. It's not the big things -- it's the little things. The overall cohesiveness of IntelliJ is what I find refreshing.</p>
<p><b>But hey it could be worse -- it could be Visual Studio</b> *giggle*. I love talking with these Visual Studio "fan boys" who have never used any other IDE or at least ReSharper. I was talking with a "chief architect evangelist" from Microsoft (a comical title) and asked if the utterly absurd feature anemia in Visual Studio was indicative of:
<ul>
<li>a lack of imagination or motivation by the development staff to investigate more productive ways to write software <b>or</b></li>
<li>a cultural/systemic problem at Microsoft where the developers using the tools could not communicate with the product management who drives the feature sets.</li>
</ul>
He asked me for specific examples of features. I said "go look at the ReSharper feature list-- specifically Refactorings". He said <i>literally</i>: <b>"Why would you want to refactor anything?"</b>.</p>
<p>In the most patronizing tone I could muster I suggested that he go read the chapter on refactoring in the Microsoft Press book titled "Code Complete".</p>
<p>*sigh* </p>
Steve

<h3>Hibernate4 SessionFactoryBuilder gotcha (2012-04-02)</h3>
<p>Just upgraded one of our projects to Spring 3.1.1 and Hibernate 4.1.1 today and ran into a small problem that ate up some time.</p>
<h3>Symptom:</h3>
<p>
It appeared as if all of my hibernate &lt;property&gt; entries in hibernate.cfg.xml were suddenly not being picked up -- they were just silently being ignored.</p>
<p>So for example, for our unit tests, we run Derby with hibernate.hbm2ddl.auto = create so that the schema is generated dynamically each time. In Hibernate 3.6 we used two files: hibernate.cfg.xml (packaged in the jar, not for customers) and hibernate.local.cfg.xml (placed in the customer's etc directory on the class path for support/production-time overrides, etc.).</p>
<p>Well in migrating to Hibernate4 I took advantage of Spring's new hibernate4.LocalSessionFactoryBuilder class to help build my metadata and I got rid of my hibernate.cfg.xml entirely.</p>
<p>I used it like so:</p>
<pre class="brush: java">
@Bean
public SessionFactory sessionFactory() {
    return new LocalSessionFactoryBuilder(dataConfig.dataSource())
            .scanPackages(DatabasePackage)
            .addPackage(DatabasePackage)
            .addProperties(dataConfig.hibernateProperties())
            .addResource("hibernate.local.cfg.xml")
            .buildSessionFactory();
}
</pre>
And I had a hibernate.local.cfg.xml in my src/test/resources that looked like:
<pre class="brush: xml">
<hibernate-configuration>
    <session-factory>
        <property name="hibernate.hbm2ddl.auto">create</property>
    </session-factory>
</hibernate-configuration>
</pre>
<p>But it wasn't being picked up! I tried adding mappings into it and those were picked up. Finally, after digging through the Hibernate source, I realized the problem. Adding hibernate.cfg.xml's through <b>addResource</b> <i>only</i> processes class mappings -- it ignores properties and listeners etc. However, you can add hibernate.cfg.xml's through the .configure method and those are still processed as they were before Hibernate4.</p>
<pre class="brush: java">
@Bean
public SessionFactory sessionFactory() {
    return new LocalSessionFactoryBuilder(dataConfig.dataSource())
            .scanPackages(DatabasePackage)
            .addPackage(DatabasePackage)
            .addProperties(dataConfig.hibernateProperties())
            .configure("hibernate.local.cfg.xml")
            .buildSessionFactory();
}
</pre>
<p>I couldn't find this well documented anywhere, nor was it specifically mentioned in the <a href="https://community.jboss.org/wiki/HibernateCoreMigrationGuide40">migration guide</a> as a consideration. They do mention not paying attention to event listeners anymore, but they don't mention properties.</p>
<p>No big deal -- just lost some time :(</p>
<br/>
<br/>
SteveAnonymoushttp://www.blogger.com/profile/06360294285988063025noreply@blogger.comtag:blogger.com,1999:blog-4471109663192942674.post-36232149242389477362012-01-12T01:41:00.000-06:002012-01-12T01:42:06.602-06:00Multi-tier application + database deadlock or why databases aren't queues (part1)Databases aren't queues.<br />
<br />
And despite the ubiquitous presence of queuing technology out there (ActiveMQ, MSMQ, MSSQL Service Broker, Oracle Advanced Queuing) there are plenty of times when we ask our relational brethren to pretend to be queues. This is the story of one such folly, and along the way, we'll delve into some interesting sub-plots of deadlocks, lock escalation, execution plans, and covering indexes, oh my! Hopefully we'll laugh, we'll cry, and get the bad guy in the end (turned out I was the bad guy).
<br />
<br />
This is <b>part one</b> of a multi-part series describing the whole saga. In this part, I lay out the problem, the initial symptom, and the tools and commands I used to figure out what was going wrong.
<br />
<h3>
And so it starts...</h3>
I'm going to set the stage for our discussion, to introduce you to the problem, and establish the characters involved in our tragedy. Let's say that this system organizes music CDs into labeled buckets. A CD can only be in one bucket at a time, and the bucket tracks at an aggregate level how many CDs are contained within it (e.g. bucket "size"). You can visualize having a stack of CDs and two buckets: "good CDs" and "bad CDs". Every once in a while you decide that you don't like your bucket choices, and you want to redistribute the CDs into new buckets--perhaps by decade: "1980s music", "1990s music", "all other (inferior) music". Later you might change your mind again and come up with a new way to organize your CDs, etc. We will call each "set" of buckets a "generation". So at generation zero you had 2 buckets "good CDs" and "bad CDs", at generation one you had "1980s CDs", etc, and so on and so on. The generation always increases over time as you redistribute your CDs from a previous generation's buckets to the next generation's buckets.<br />
<br />
Lastly, while I might have my music collection organized in some bucket scheme, perhaps my friend Jerry has his own collection and his own bucket scheme. So entire sets of buckets over generations can be grouped into music <i>collections</i>. Collections are completely independent: Jerry and I don't share CDs nor do we share buckets. We just happen to be able to use the same system to manage our music collections.<br />
<br />
So we have:
<br />
<ul>
<li>CDs -- the things which are contained within buckets, that we redistribute for every new generation</li>
<li>Buckets -- the organizational units, grouped by generation, which contain CDs. Each bucket has a sticky note on it with the number of CDs currently in the bucket.</li>
<li>Generation -- the set of buckets at a particular point in time.</li>
<li>Collection -- independent set of CDs and buckets</li>
</ul>
Even while we're redistributing the CDs from one generation's buckets to the next, a CD is only in one bucket at a time. Visualize physically moving the CD from "good CDs" (generation 0) to "1980s music" (generation 1).<br />
<br />
<i>NOTE: Our actual system has nothing to do with CDs and buckets-- I just found it easier to map the system into this easy to visualize metaphor.
</i><br />
<br />
In this system we have millions of CDs, thousands of buckets, and lots of CPUs moving CDs from buckets in one generation to the next (parallel, but not distributed). The size of each bucket <i>must</i> be consistent at any point in time.
<br />
<br />
So assume the database model looks something like:
<br />
<ul>
<li><b>Buckets</b> Table</li>
<ul>
<li>bucketId <i>(e.g. 1,2,3,4)</i> - <b>PRIMARY KEY CLUSTERED</b></li>
<li>name <i>(e.g. 80s music, 90s music)</i></li>
<li>generation <i>(e.g. 0, 1, 2)</i></li>
<li>size <i>(e.g. 4323, 122)</i></li>
<li>collectionId <i>(e.g. "Steves Music Collection")</i> - <b>NON-CLUSTERED INDEX</b></li>
</ul>
<li><b>Cds</b> Table</li>
<ul>
<li>cdId <i>(e.g. 1,2,3,4)</i> - <b>PRIMARY KEY CLUSTERED</b></li>
<li>name <i>(e.g. "Modest Mouse - Moon and Antarctica", "Interpol - Antics")</i></li>
<li>bucketId <i>(e.g. 1,2, etc. foreign key to the Bucket table)</i> - <b>NON-CLUSTERED INDEX</b></li>
</ul>
</ul>
Note that both tables are clustered by their primary keys-- this means that the actual record data itself is stored in the leaf nodes of the primary index. I.e. the table itself is an index. In addition, Buckets can be looked up by "music collection" without scanning (see the secondary, non-clustered index on collectionId), and Cds can be looked up by bucketId without scanning (see the secondary, non-clustered index on Cds.bucketId).
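<br />
<br />
To make the model concrete, here is a hypothetical rendering of that schema as T-SQL issued over JDBC (the column types and index names are my own guesses for illustration, not the actual schema):<br />
<pre>import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class ExampleSchema {
    // hypothetical DDL matching the model above; types and index names are illustrative only
    static void create(Connection cxn) throws SQLException {
        try (Statement stmt = cxn.createStatement()) {
            stmt.execute("CREATE TABLE Buckets ("
                    + " bucketId INT NOT NULL PRIMARY KEY CLUSTERED,"
                    + " name VARCHAR(200) NOT NULL,"
                    + " generation INT NOT NULL,"
                    + " size INT NOT NULL,"
                    + " collectionId VARCHAR(200) NOT NULL)");
            stmt.execute("CREATE NONCLUSTERED INDEX IX_Buckets_collectionId ON Buckets (collectionId)");
            stmt.execute("CREATE TABLE Cds ("
                    + " cdId INT NOT NULL PRIMARY KEY CLUSTERED,"
                    + " name VARCHAR(400) NOT NULL,"
                    + " bucketId INT NOT NULL REFERENCES Buckets (bucketId))");
            stmt.execute("CREATE NONCLUSTERED INDEX IX_Cds_bucketId ON Cds (bucketId)");
        }
    }
}
</pre>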
<br />
<h3>
The algorithm</h3>
So I wrote the redistribution process with a few design goals: (1) it needed to work online -- I could concurrently add new CDs into the current generation's buckets while redistributing; (2) I could always locate a CD -- i.e. I could never falsely report that some CD was missing just because I happened to search during a redistribution phase; (3) if we interrupted the redistribution process, we could resume it later; (4) it needed to be parallel. I wanted to accomplish (1) and (2) with bounded blocking time, so whatever blocking work I needed to do, I wanted it to be as short as possible to increase concurrency.<br />
<br />
I used a simple concurrency abstraction that hosted a pool of workers who shared a supplier of work. The supplier of work would keep giving "chunks" of items to move from one bucket to another. We only redistribute a single <i>music collection</i> at a time. The supplier was shared by all of the workers, but it was synchronized for safe multi-threaded access.
<br />
<br />
<br />
The algorithm for each worker is like:<br />
<pre>(I) Get next chunk of work
(II) For each CD decide the new generation bucket in which it belongs
(accumulating the size deltas for old buckets and new buckets)
(III) Begin database transaction
(IV) Flush accumulated size deltas for buckets
(V) Flush foreign key updates for CDs to put them in new buckets
(VI) Commit database transaction</pre>
<br />
Each worker would be given a chunk of CDs for the current music collection that was being redistributed (I). The worker would do some work to decide which bucket in the <i>new</i> generation should get the CD (II). The worker would accumulate the deltas for counts: decrementing the count for the original bucket and incrementing the count for the new bucket. Then the worker would flush (IV) these deltas in a correlated update like <code>UPDATE Buckets SET size = size + 123 WHERE bucketId = 1</code>. After the size updates were flushed, it would then flush (V) all of the individual updates to the foreign key fields to refer to the new generation's buckets like <code>UPDATE Cds SET bucketId = 123 WHERE cdId = 101</code>. These two operations happen in the same database transaction.<br />
<br />
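Sketched as plain JDBC, steps (III) through (VI) look roughly like this (a simplified, hypothetical version: the real code went through Hibernate, and this sketch assumes the whole chunk lands in a single new bucket):<br />
<pre>import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class ChunkFlusher {
    // hypothetical sketch of steps (III)-(VI); the real worker used Hibernate entities
    void flushChunk(Connection cxn, int oldBucketId, int newBucketId, long[] chunkCdIds) throws SQLException {
        cxn.setAutoCommit(false);                             // (III) begin transaction
        try (PreparedStatement sizeUpdate = cxn.prepareStatement(
                "UPDATE Buckets SET size = size + ? WHERE bucketId = ?");
             PreparedStatement moveCd = cxn.prepareStatement(
                "UPDATE Cds SET bucketId = ? WHERE cdId = ?")) {
            sizeUpdate.setInt(1, -chunkCdIds.length);         // (IV) decrement the old bucket's size...
            sizeUpdate.setInt(2, oldBucketId);
            sizeUpdate.addBatch();
            sizeUpdate.setInt(1, chunkCdIds.length);          // ...and increment the new bucket's size
            sizeUpdate.setInt(2, newBucketId);
            sizeUpdate.addBatch();
            sizeUpdate.executeBatch();
            for (long cdId : chunkCdIds) {                    // (V) point each CD at its new bucket
                moveCd.setInt(1, newBucketId);
                moveCd.setLong(2, cdId);
                moveCd.addBatch();
            }
            moveCd.executeBatch();
            cxn.commit();                                     // (VI) commit
        } catch (SQLException e) {
            cxn.rollback();
            throw e;
        }
    }
}
</pre>
<br />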
The supplier that <i>gives</i> work to the workers is a typical "queue"-like SELECT query -- we want to iterate over all of the items in the music collection in the old generation. This happens in a <i>separate</i> connection and a separate database transaction from the workers (discussed later). The next chunk is read on a worker thread (with thread-safe synchronization); this separate "reader" connection doesn't have a thread of its own.
<br />
<br />
<h3>
First sign of trouble - complete stand still</h3>
So we were doing some large volume testing on not-so-fast hardware, when suddenly...the system just came to a halt. We seemed to be in the middle of moving CDs to new buckets, and the workers just stopped making progress.
<br />
<h5>
Finding out what the Java application was doing</h5>
So the first step was to see what the Java application was doing:
<br />
<br />
<pre>c:\>jps
1234 BucketRedistributionMain
3456 jps
c:\>jstack 1234 > threaddump.out
</pre>
<b><br /></b><br />
<b>jps</b> finds the Java process id, and then we run <b>jstack</b> to output a stack trace for each of the threads in the Java program. Both jps and jstack are included in the JDK.
<br />
<br />
The resulting stack trace showed that all workers were waiting in <b>socketRead</b> to complete the database update to flush the bucket size updates (step IV above).
<br />
<br />
Here is the partial stack trace for one of the workers (some uninteresting frames omitted for brevity):
<br />
<br />
<pre>"container-lowpool-3" daemon prio=2 tid=0x0000000007b0e000 nid=0x4fb0 runnable [0x000000000950e000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(Unknown Source)
...
at net.sourceforge.jtds.jdbc.SharedSocket.readPacket(Unknown)
at net.sourceforge.jtds.jdbc.SharedSocket.getNetPacket(Unknown)
...
at org.hibernate.jdbc.BatchingBatcher.doExecuteBatch(Unknown)
...
at org.hibernate.impl.SessionImpl.flush(SessionImpl.java:1216)
...
at com.mycompany.BucketRedistributor$Worker.updateBucketSizeDeltas()
...
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
...
at java.lang.Thread.run(Unknown Source)
</pre>
<br />
As you can see, our worker was updating the bucket sizes, which resulted in a Hibernate "flush" (which actually pushes the database query across the wire to the database engine), and then we await the response packet from MSSQL once the statement is complete. Note that we are using the jtds MSSQL driver (as evidenced by the net.sourceforge.jtds in the stack trace).
<br />
<br />
So the next question is <i>why</i> is the database just hanging out doing nothing?
<br />
<h5>
Finding out what the database was doing...</h5>
MSSQL provides a lot of simple ways to get insight into what the database is doing. First let's see the state of all of the connections. Open SQL Server Management Studio (SSMS), click New Query, and type <code>exec sp_who2</code>
<br />
<br />
This will return output that looks like:
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgI_jupJBFuPl5wa3dUGQ9pLRRp8l2Y4J2Do9mrPJykxXmDTjC4Vgvq6VkcsoDcVRfN-sH_x251Mnm7Qo1RdACwJGItPKtT2P09SOi8O3Nj6sQiJ8FztXmL5Znmd_U7vVhnQQY8JmHzR3M/s1600/blog_sp_who2_1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="133" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgI_jupJBFuPl5wa3dUGQ9pLRRp8l2Y4J2Do9mrPJykxXmDTjC4Vgvq6VkcsoDcVRfN-sH_x251Mnm7Qo1RdACwJGItPKtT2P09SOi8O3Nj6sQiJ8FztXmL5Znmd_U7vVhnQQY8JmHzR3M/s400/blog_sp_who2_1.png" width="400" /></a></div>
<br />
You can see all of the <b>spids</b> for sessions to the database. There should be five that we're interested in: one for the "work queue" query and four for the workers to perform updates. The sp_who2 output includes a <b>blkBy</b> column which shows the spid that is <i>blocking</i> the given spid in the case that the given spid is SUSPENDED.
<br />
<br />
We can see that spid 56 is the "work queue" SELECT query (highlighted in red). Notice that no one is blocking it... then we see spids 53, 54, 60, and 61 (highlighted in yellow) that are all waiting on 56 (or each other). Disregard 58 -- its application source is the management studio, as you can see.
<br />
<br />
So how curious! The reader query is blocking all of the update workers and preventing them from pushing their size updates. The reader "work queue" query looks like:
<br />
<br />
<pre>SELECT c.cdId, c.bucketId FROM Buckets b INNER JOIN Cds c ON b.bucketId = c.bucketId WHERE b.collectionId = 'Steves Collection' and b.generation = 23
</pre>
<h5>
Investigating blocking and locking problems...</h5>
I see that spid 56 is blocking everyone else. So what locks is 56 holding? In a new query window, I ran <code>exec sp_lock 56</code> and <code>exec sp_lock 53</code> to see which locks each was holding and who was waiting on what.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgMzzqem0WY4QTm5mL4vjzpR_UkoDddwl24wQ9XDQs_eJIrtNKoSI8cBIq_pM9eez5QHAlOSFARwf8zEYSiDaaC-dXPdpun3k8AagVQWsLRDTNw2DZIUm4MIZJ3ZqQORYVsT3xWKoNXiY/s1600/blog_sp_lock_1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="205" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgMzzqem0WY4QTm5mL4vjzpR_UkoDddwl24wQ9XDQs_eJIrtNKoSI8cBIq_pM9eez5QHAlOSFARwf8zEYSiDaaC-dXPdpun3k8AagVQWsLRDTNw2DZIUm4MIZJ3ZqQORYVsT3xWKoNXiY/s400/blog_sp_lock_1.png" width="400" /></a></div>
<br />
You can see that 56 was holding an <b>S (shared) lock</b> on a <b>key</b> resource (key = row lock on an index) of object 1349579846, which corresponds to the Buckets table.
<br />
<br />
I wanted to see the engine's execution plan for the reader "work queue" query. To get this, I executed a query that I created a while ago to dump as many details about current sessions in the system as possible -- think of it as a "super who2":
<br />
<br />
<pre>select es.session_id, es.host_name, es.status as session_status, sr.blocking_session_id as req_blocked_by,
datediff(ss, es.last_request_start_time, getdate()) as last_req_submit_secs,
st.transaction_id as current_xaction_id,
datediff(ss, dt.transaction_begin_time, getdate()) as xaction_start_secs,
case dt.transaction_type
when 1 then 'read_write'
when 2 then 'read_only'
when 3 then 'system'
when 4 then 'distributed'
else 'unknown'
end as trx_type,
sr.status as current_req_status,
sr.wait_type as current_req_wait,
sr.wait_time as current_req_wait_time, sr.last_wait_type as current_req_last_wait,
sr.wait_resource as current_req_wait_rsc,
es.cpu_time as session_cpu, es.reads as session_reads, es.writes as session_writes,
es.logical_reads as session_logical_reads, es.memory_usage as session_mem_usage,
es.last_request_start_time, es.last_request_end_time, es.transaction_isolation_level,
sc.text as last_cnx_sql, sr.text as current_sql, sr.query_plan as current_plan
from sys.dm_exec_sessions es
left outer join sys.dm_tran_session_transactions st on es.session_id = st.session_id
left outer join sys.dm_tran_active_transactions dt on st.transaction_id = dt.transaction_id
left outer join
(select srr.session_id, srr.start_time, srr.status, srr.blocking_session_id,
srr.wait_type, srr.wait_time, srr.last_wait_type, srr.wait_resource, stt.text, qp.query_plan
from sys.dm_exec_requests srr
cross apply sys.dm_exec_sql_text(srr.sql_handle) as stt
cross apply sys.dm_exec_query_plan(srr.plan_handle) as qp) as sr on es.session_id = sr.session_id
left outer join
(select scc.session_id, sct.text
from sys.dm_exec_connections scc
cross apply sys.dm_exec_sql_text(scc.most_recent_sql_handle) as sct) as sc on sc.session_id = es.session_id
where
es.session_id >= 50
</pre>
<br />
In the above output, the last column is the SQL XML execution plan. Viewing that for spid 56, I confirmed my suspicion: the plan to serve the "work queue" query was to seek the "music collection" index on the Buckets table for 'Steves collection', then seek into the clustered index to confirm 'generation = 23', then seek into the bucketId index on the Cds table. So to serve the WHERE clause in the "work queue" query, the engine had to use both the non-clustered index on Buckets and the clustered index (for the generation predicate).
<br />
<br />
When joining and reading rows at <a href="http://msdn.microsoft.com/en-us/library/ms173763.aspx">READ COMMITTED isolation level</a>, the engine will acquire locks as it traverses from index to index in order to ensure consistent reads. Thus, to read the value of the generation in the Buckets table, it must acquire a shared lock. And it has!
<br />
<br />
The problem comes in with the competing sessions that are trying to update the size on that same record of the Buckets table. Each needs an <b>X (exclusive)</b> lock on that row (highlighted in red), and eek! it can't get it, because that reader query has a conflicting S lock <i>already granted</i> (highlighted in green).
<br />
<br />
Ok, so that all makes sense, but why is the S lock being held? At READ COMMITTED you usually only hold the locks while the record is being read (there are exceptions and we'll get to that in Part 2). They are released as soon as the value is read. So if you read 10 rows in a single statement execution, the engine will: acquire lock on row 1, read row 1, release lock on row 1, acquire lock on row 2, read row 2, release lock on row 2, acquire lock on row 3, etc. And none of the four workers are currently reading -- they are writing -- or at least they would be writing, if that pesky reader connection weren't blocking them.
<br />
<br />
To dig into this, I was curious <i>why</i> the reader query was in a SUSPENDED state (see the original sp_who2 output above). In the above "super who2" output, the <b>current_req_wait</b> value for the "work queue" read query is <code>ASYNC_Network_IO</code>.
<br />
<h5>
ASYNC_Network_IO wait and how databases return results</h5>
ASYNC_Network_IO is an interesting wait. Let's discuss how remote applications execute and consume SELECT queries from databases.
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgG0pGpNIorgmGiQosDcTRsuQGhtV8uYLIjXZtKkMDv8PCiLtivCSP_iN8YvX1dZ9ak3b8Csuj_MZBcUP3CWncvXv82Tpb8CDujKqP25O8H3CIfQYGBrgO8zL4O8HUz9ty0czTlr4JKjwQ/s1600/db_arch.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="207" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgG0pGpNIorgmGiQosDcTRsuQGhtV8uYLIjXZtKkMDv8PCiLtivCSP_iN8YvX1dZ9ak3b8Csuj_MZBcUP3CWncvXv82Tpb8CDujKqP25O8H3CIfQYGBrgO8zL4O8HUz9ty0czTlr4JKjwQ/s400/db_arch.png" width="400" /></a></div>
That diagram is over-simplified, but within the database there are two chunks of memory to discuss: the <b>buffer cache</b> and the <b>connection network buffer</b>. The buffer cache is a shared chunk of memory, where the actual pages of the tables and indexes are kept to serve queries. So parts of the Buckets and Cds tables will be in memory while this "work queue" query executes. The execution engine executes the plan: it works out of the buffer cache, acquiring locks and producing output records to send to the client. As it prepares these output records, it puts them in a connection-specific <b>network buffer</b>. When the application reads records from the result set, it's actually being served from the network buffer in the database. The application driver typically has its own buffer as well.<br />
<br />
When you just execute a simple SQL SELECT query and don't explicitly declare a database cursor, MSSQL gives you what it calls the "default result set" -- which is still a cursor of sorts -- you can think of it as a cursor over a bunch of records that you can only iterate over once in the forward direction. As your application threads iterate over the result set, the driver requests more chunks of rows from the database on your behalf, which in turn depletes the network buffer.
<br />
<br />
However, with very large result sets, the entire result set cannot fit in the connection's network buffer. If the application doesn't read rows fast enough, then eventually the network buffer fills up, and the execution engine must stop producing new result records to send to the client application. When this happens the spid must be suspended, and it is suspended with the wait event <b>ASYNC_Network_IO</b>. It's a slightly misleading wait name, because it makes you think there might be a network performance problem, but it's more often an application design or performance problem. Note that when the spid is suspended -- just like any other suspension -- the <b>currently held locks will remain held until the spid is resumed</b>.
<br />
<br />
In our case, we know that we have millions of CDs and we can't fit them all in application memory at one time. We, by design, want to take advantage of the fact that we can stream results from the database and work on them in chunks. Unfortunately, if we happen to be holding a conflicting lock (an S lock on a Buckets record) when the reader query is suspended, then we create a multi-layer application deadlock, as we observed, and the whole system screeches to a halt.<br />
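<br />
A minimal sketch of that reader side in JDBC (the connection URL and the dispatch() hand-off are placeholders; the query mirrors the "work queue" query above):<br />
<pre>import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

class WorkQueueReader {
    void readWork(String jdbcUrl) throws Exception {
        try (Connection cxn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = cxn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT c.cdId, c.bucketId FROM Buckets b"
               + " INNER JOIN Cds c ON b.bucketId = c.bucketId"
               + " WHERE b.collectionId = 'Steves Collection' AND b.generation = 23")) {
            // rows are pulled out of the server's network buffer as we iterate; if the consumers
            // of these rows block, this SELECT's spid sits SUSPENDED on ASYNC_Network_IO while
            // still holding whatever locks it has been granted
            while (rs.next()) {
                dispatch(rs.getLong("cdId"), rs.getLong("bucketId"));
            }
        }
    }

    void dispatch(long cdId, long bucketId) {
        // placeholder: hand the row off to a worker
    }
}
</pre>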
<br />
So what to do for a solution? I will discuss some options and our eventual decision in Parts 2 and 3. Note that I gave one hint about our first attempt when I talked about "covering indexes", and there is another hint above, "lock escalation", that we didn't get to in this post.<br />
<br />
SteveAnonymoushttp://www.blogger.com/profile/06360294285988063025noreply@blogger.comtag:blogger.com,1999:blog-4471109663192942674.post-75526261024708022342011-12-22T12:17:00.002-06:002011-12-22T12:27:58.631-06:00MSSQL non-clustered indexes INCLUDE feature explainedToday I received a question from someone about the nature of the INCLUDE feature when creating a non-clustered (secondary) index on a table. My response was a bit long, and I haven't posted in a while -- ergo blogpost! Here's the question:<br />
<br />
<blockquote>What is your opinion about using the include statement when building your indexes? I’ve never really used the included functionality and I’m curious if there are upsides or downsides to them. My reading lead me to believe that I should use the include when I have a non clustered index and space appears to be a concern. I can use the ‘include’ as part of the index with the columns that may not be used as frequently. We’re using standard datatypes and nothing with very large column widths. So is there a benefit to using include?</blockquote><br />
<hr /><br />
<br />
A <b>clustered index</b> is stored in a B+-tree data structure, and the <i>actual data</i> -- the whole row of data -- is in the leaf. So if you think of a b-tree structure (simplified) as something like:<br />
<pre> A
/ \
B C
/ \ / \
M N O P
</pre>And you have records like <br />
<pre>[ Id =1, FirstName = Steve, LastName = Ash ]
[ Id = 2, FirstName = Neil, LastName=Gafter ]
</pre>Then for the <i>clustered</i> index, the id data (id=1 and id=2) will exist at every node (A, B, C, M, N, O, P), but the bytes for “steve” “ash” “neil” “gafter” will only exist in leaf nodes (either M, N, O, or P). The id is used to <i>locate</i> which leaf holds the whole record. (Note this is a simplification; see <a href="http://en.wikipedia.org/wiki/B%2B_tree">Wikipedia</a> for more info about B+-trees.) The noteworthy fact is that being a <i>clustered</i> index means the whole record is in a leaf (i.e. M, N, O, or P).<br />
<br />
Now let’s think of a non-clustered index on lastName that corresponds to the clustered index above.<br />
<pre> D
/ \
E F
/ \
Q R
</pre>In this case the “last name” is used to <i>get to</i> the leaf, and the leaf holds the <i>primary key</i> of the corresponding row in the clustered index. So D E or F will have things like lastName=Ash and lastName=Gafter and the leaves Q and R will have both lastnames and IDs. So an entry in Q or R might look like [lastname=Ash, id=1] (again simplifying). <br />
<br />
So if you issue a query like <pre>SELECT firstName FROM ThisTable WHERE lastName = ‘Ash’</pre>(and the optimizer chooses to use the non-clustered index) then the database engine will do something like:<br />
<ol><li>Seek the <b>non-clustered</b> index for ‘Ash’</li>
<ol><li>Look in D to decide which direction to go E or F (let's say E is the right choice)</li>
<li>Look in E to decide which direction to go Q or R (let's say Q is the right choice)</li>
</ol><li>Find the primary key for ‘Ash’</li>
<ol><li>In Q find id for Ash – which is id=1</li>
</ol><li>Seek the <b>clustered index</b> for id = 1</li>
<ol><li>Look in A to decide which direction to go B or C (let's say B is right)</li>
<li>Look in B to decide which direction to go M or N (let's say N is right)</li>
</ol><li>Find the value of firstName for id=1</li>
<ol><li>In N find firstName for id=1 which is ‘Steve’</li>
</ol><li>return ‘Steve’ as the query result</li>
</ol>Notice that we created a non-clustered index on lastName and we can use that index to quickly locate things by last name, but if we need any additional info, then we have to go <i>back</i> to the clustered index to get the other info in the SELECT list.<br />
<br />
The “include” provides a way for you to shove additional information in the leaf nodes of non-clustered indexes (Q and R) to alleviate this "going back" to the clustered index.<br />
<br />
So had I created the non-clustered index above on LastName INCLUDE FirstName then the engine would only need to do:<br />
<br />
<ol><li>Seek the <b>non-clustered</b> index for ‘Ash’</li>
<ol><li>Look in D to decide which direction to go E or F (let's say E is the right choice)</li>
<li>Look in E to decide which direction to go Q or R (let's say Q is the right choice)</li>
</ol><li>Find the firstName for ‘Ash’</li>
<ol><li>In Q due to the include there is id=1 AND firstName=’Steve’ so we have the first name right here!</li>
</ol><li>return Steve</li>
</ol>So you get rid of that entire other seek into the clustered index. This additional seek shows up as a “bookmark lookup” operation in the query plan in MSSQL 2008 & MSSQL 2000 and just another join in MSSQL 2005 query plans. Bookmark lookup is a join -- just with a different name to indicate its semantic role in the query plan.<br />
<br />
When you have an index such that the query can be completely served from the index without needing to go back to the clustered index – such an index is called a <b>covering index</b>. The index <i>covers</i> the needs of the query completely. And it’s a performance boost as you don’t need the other join.<br />
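<br />
In MSSQL the syntax for this is the INCLUDE clause on CREATE INDEX. A hypothetical version of the covering index above, created over JDBC (the index name and connection handling are made up for illustration):<br />
<pre>import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class CreateCoveringIndex {
    // hypothetical: the covering index described above on the example MyTable
    static void create(Connection cxn) throws SQLException {
        try (Statement stmt = cxn.createStatement()) {
            stmt.execute("CREATE NONCLUSTERED INDEX IX_MyTable_lastName"
                       + " ON MyTable (lastName) INCLUDE (firstName)");
        }
    }
}
</pre>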
<br />
So this means a few things:<br />
<ol><li>You can obviously only “include” fields in non-clustered indexes. Clustered indexes already have all the fields in the leaf...so it doesn’t mean anything to include more.</li>
<li>You can only have one clustered index for a table, but you can simulate having multiple clustered indexes by INCLUDing the rest of the columns on your secondary indexes</li>
<li>By duplicating the data in the secondary index, if you UPDATE firstName – you now have to update both the clustered index and the nonclustered index. (<b>Main trade-off consideration</b>)<br />
<ul><li>This is also a <i>huge</i> deadlock opportunity if you’re not using read committed snapshot isolation (RCSI) level. Think of two queries: (1) <pre>UPDATE MyTable SET firstName = ‘Steve2’ where id = 1</pre>and (2) <pre>SELECT shoeSize FROM MyTable WHERE lastName = ‘Ash’</pre>(pretend shoe size is a new field that is in the clustered index but NOT in the non-clustered, i.e. NOT in the INCLUDE). Then the SELECT will seek the non-clustered index, grab shared (S) locks, then (while holding S locks) traverse the clustered index. Whereas (in the opposite order) the UPDATE will seek the clustered index, hold an exclusive (X) lock, then seek the non-clustered index to update firstName. The fact that these two queries are holding incompatible locks in <i>opposite</i> directions, is a deadlock waiting to happen. If I had a dollar for every time I diagnosed this deadlock scenario...</li>
</ul><li>By duplicating the data in the secondary index, each page in the leaf (Q and R) now has <b>fewer</b> rows per page and thus there is a greater memory demand on the buffer cache (and more IO to get the same number of records) (<b>Second trade off consideration</b>)</li><br />
<br />
<br />
<li>Some databases (even MSSQL < 2005) don’t support this feature, but you can approximate it by creating non-clustered indexes with compound keys. I.e. if I created the non-clustered index on <pre>(lastName, firstName)</pre>then the index still covers queries like <pre>SELECT firstName FROM myTable WHERE lastName = ‘Ash’</pre>Note that this is not as good as the INCLUDE solution (for this particular query) as now the bytes of firstName=’Steve’ take up some space in non-leaf nodes D E F.<br />
</ol>So deciding to use an INCLUDE is (like everything) a trade off. If you have a performance critical query that is being executed frequently, then you can usually use INCLUDEs to reduce the number of joins and increase performance. In an environment where CPU is more precious than memory this can be a big win (we can talk about cache locality benefits of includes later). However, if you INCLUDE a column that will be UPDATEd later – then you often are shooting yourself in the foot as the cost can easily outweigh the benefit.<br />
<br />
Last tidbit about INCLUDEs that I’ll mention. The sql index analyzer is REALLY aggressive about recommending you add indexes with lots of INCLUDEs. This is because the index analyzer usually doesn’t know how many UPDATEs you’re doing. Usually you just tell it what SELECT you want to speed up and it naively says “oh well of course if you add these three covering indexes, this SELECT will be faster.” And while that’s true it doesn’t take into account the _total_ workload (UPDATEs DELETEs etc) so just be skeptical when you see this if you use the index analyzer.<br />
<br />
I can’t give you a hard and fast rule to say INCLUDE is GOOD or EVIL as – like everything with database performance tuning – it depends ;)<br />
<br />
SteveAnonymoushttp://www.blogger.com/profile/06360294285988063025noreply@blogger.comtag:blogger.com,1999:blog-4471109663192942674.post-23422189091500215882011-07-31T20:14:00.001-05:002011-07-31T20:24:17.340-05:00Delegates or interfaces? Functional and OO DualismI have a mixed background: doing C#/.NET for ~4 years then switching to Java (switched jobs). I have been in the Java enterprise ecosystem for the last 4 years. I do mostly Java, but enjoy doing a little C# every now and again. C# is really a nice language. Shame its in such a horrible Microsoft-centric ecosystem.<br />
<br />
In any case, I've been writing a little thing in C# and needed a type with a single method to "doWork". So coming from a Java bias I created a:<br />
<pre>public interface IWorker {
void DoWork();
}
</pre>Later, however, I wanted to offer the users an API option of just using a lambda to "doWork". Unfortunately, there is no type conversion from a delegate to the matching "single method interface" (at least that I could find, if someone knows the answer, please share!). So as a shim, I created:<br />
<pre>public delegate void WorkerDelegate();
public class WorkerWrapper : IWorker {
readonly WorkerDelegate workerDelegate;
public WorkerWrapper(WorkerDelegate workerDelegate) {
this.workerDelegate = workerDelegate;
}
public void DoWork() {
workerDelegate();
}
}
</pre>So I wrap the lambda in a little wrapper and don't need to change my entire IWorker-based infrastructure. It works, but I'm not happy with this. I know that in the Java lambda mailing list, they are planning to include a "lambda conversion" so that lambdas can be converted to compatible single abstract method (SAM) types. This would've alleviated my need for the shim above as I could've assigned the lambda directly to an IWorker and all would've been well.<br />
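<br />
For reference, here is a minimal Java sketch of what that lambda conversion looks like (as it later shipped in Java 8) -- the names are mine, mirroring the C# example above:<br />
<pre>public class WorkerDemo {
    interface Worker {                 // SAM type, analogous to IWorker above
        void doWork();
    }

    static void runTwice(Worker w) {
        w.doWork();
        w.doWork();
    }

    public static void main(String[] args) {
        Worker w = () -> System.out.println("doing work");   // the lambda converts to the SAM type
        runTwice(w);
        runTwice(() -> System.out.println("inline work"));   // or pass the lambda directly
    }
}
</pre>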
<br />
I believe the "C# way" would've been for me to use delegates all the way through to begin with. Had someone come along with an IWorker implementation, its DoWork method group would've been assignable to my WorkerDelegate.<br />
<br />
But is this the right answer? Conceptually, how should I think of these? What <i>is</i> an IWorker in the above case? Is it <i>really</i> just a chunk of code that should be passed around as such? Or is it a member in the space of collaborating types that make up my system...<br />
<br />
This is an example of the conceptual problems reconciling a Functional view of the world with an object oriented view of the world (dualism). I know that many people smarter than me have thought about these problems, and I'm hoping that I can find some good articles discussing them. <br />
<br />
It <i>feels</i> like we're describing the same concept: a "chunk of code" that is defined by an interface (call it a delegate or a SAM, same thing). I don't think that I would have any dissonance if both were assignable to each other, and thus can be treated as different expressions of the same concept. If this were the case, then maybe I would view delegates as just SAM types -- so my "world view" is still object oriented, I just have an additional, concise lambda syntax to create SAMs. Actually, if this were the case, then you could probably invoke the Liskov substitution principle and call functional-OO dualism reconciled... <br />
<br />
But something still seems amiss. There is more to the identity of an IWorker than the fact that it takes no arguments and returns nothing. I suppose the same questions are true of reconciling structural typing to static typing. Hmm.. I have a lot of reading to do.<br />
<br />
I imagine there are a number of philosophical problems between functional and OO. This is just the one I ran across and felt dissonance with C#s implementation. Maybe they truly are different things and should be treated as such. I hope (despite my extremely small readership) to get some links to articles on this topic.<br />
<br />
SteveAnonymoushttp://www.blogger.com/profile/06360294285988063025noreply@blogger.comtag:blogger.com,1999:blog-4471109663192942674.post-35407568290610300922011-07-09T13:59:00.001-05:002011-07-09T14:37:59.405-05:00I <3 Robert C. Martin<div>Im reading Clean Code by Uncle Bob. I have read sections of this book before when it came out and actually had the pleasure of watching Robert Martin present it at SDWest in 2008. I've decided to read the whole thing.
<br/> <br/>
A while ago, I read a blog post from someone who was arguing that software wasn't a craft but a trade. I believe the author's intention was to say that we software developers should recognize that the value of the software is the business value and thus we shouldn't wax philosophic about "elegance in design" or software aesthetics as that was all wasting time trying to get to the goal. I may be misrepresenting the author's intention. I couldn't find the post to link it.
<br/> <br/>
In any case, I disagreed entirely with this opinion. While I agree that business value is the motivator, the craft aspects such as aesthetics, conceptual purity, elegance, etc. all contribute to the solution and its extensibility and maintainability. Maybe we're just arguing over the definition of craft, trade, or art, but in any case I feel there is value in <i>recognizing the challenge</i> of good engineering for today and tomorrow. The masters do it almost effortlessly -- almost accidentally. That <i>feels</i> like art to me and thus should be labelled appropriately as craft.
<br/> <br/>
To this point, clean code is more art than science and Mr. Martin has something to say about it that I really enjoyed:
<blockquote>
Every system is built from a domain specific language designed by the programmers to describe their system. Functions are the verbs, classes are the nouns. This is not some throwback to the hideous old notion that the nouns and verbs in a requirements document are the first guess of the classes and functions of a system. Rather, this is a much older truth. The art of programming is, and always has been, the art of language design.
<br/> <br/>
Master programmers think of systems as stories to be told rather than programs to be written. They use facilities of their chosen programming language to construct a much richer and more expressive language that can be used to tell that story. Part of that domain-specific language is the hierarchy of functions that describe all the actions that take place within that system. In an artful act of recursion those actions are written to use the very domain specific language they define to tell their own small part of the story.
</blockquote>
So to argue that software is not art is to <i>naively</i> ignore the reality that language is hard and has a dramatic effect on the bottom line of your code base. How many software systems never change or never need to be understood after they are written? Such systems must not be very interesting or do anything important.
<br/> <br/>
Let's recognize the art of good software engineering! It will motivate us to continue to improve if we recognize these things have a value.</div>Anonymoushttp://www.blogger.com/profile/06360294285988063025noreply@blogger.comtag:blogger.com,1999:blog-4471109663192942674.post-73987479222651095162011-06-27T22:54:00.001-05:002011-12-22T12:33:01.607-06:00Maven -_-I have spent the last few days mucking about with POM files. Anyone that has done this understands where I'm going with this. So I'll just leave these two quotes that fit nicely:<br />
<blockquote>But there has to be something fundamentally wrong with any tool that, whenever I use it, seems to have at least a 50% chance of completely fucking up my day.<br />
<br />
-<a href="http://fishbowl.pastiche.org/2007/12/20/maven_broken_by_design/">Charles Miller</a><br />
</blockquote><br />
<blockquote>The people who <b>love</b> Maven <b>love the theory</b>. The people who <b>hate</b> Maven <b>hate the reality</b>.<br />
<br />
-<a href="http://www.alittlemadness.com/2007/08/01/the-problem-with-maven/">Zutubi</a><br />
</blockquote><br />
Frustration.<br />
<br />
<b>EDIT:</b> For anyone running across this. Now that I am well over the "learning curve" hump. I <3 maven. Seriously. Yes there are some rough edges-- I really should be able to delete folders simply without having to do ant-runs, etc. But its benefit drastically outweighs its cost in our environment.Anonymoushttp://www.blogger.com/profile/06360294285988063025noreply@blogger.comtag:blogger.com,1999:blog-4471109663192942674.post-56366830944267683152011-04-13T10:18:00.002-05:002011-04-13T18:56:27.733-05:00OO Reading ListOne of my favorite parts of <a href="http://www.amazon.com/Pragmatic-Thinking-Learning-Refactor-Programmers/dp/1934356050/ref=sr_1_1?ie=UTF8&qid=1302699408&sr=8-1">Pragmatic Thinking</a> is the description of the Dreyfuss model of skills acquisition. This describes the phenomena of how people are frequently distributed along some non-trivial "skill" (like programming), and defines metrics about what differentiates each skill level. Overall, as one moves up to higher skill levels, they are increasing their <i>intuition</i> about the problem space. Intuition is something for which we must <i>train</i> our brain through experience and knowledge. To that end, there are a few books that have helped me in increasing my <i>intuition</i>, which I would like to catalog. If you have others that are missing, please leave a comment! My appetite for the amazon marketplace is insatiable ;-)<br />
<br />
<h2>OO and Design Patterns</h2><ul><li><a href="http://butunclebob.com/ArticleS.UncleBob.PrinciplesOfOod">But Uncle Bob Essay on SOLID</a> - This is Robert C Martin's article outlining the principles of Object Oriented Design. If SOLID doesn't ring a bell, then start with this article. Note that the <a href="http://en.wikipedia.org/wiki/Solid_%28object-oriented_design%29">now-famous acronym: SOLID</a> is not actually mentioned in this post, but Robert C Martin is still credited as the inventor.</li>
<li><a href="http://www.amazon.com/Software-Development-Principles-Patterns-Practices/dp/0135974445/ref=sr_1_1?ie=UTF8&qid=1302708855&sr=8-1">Agile Software Development Principles, Patterns, and Practices</a> - lovingly referred to as the PPP book (not to be confused with the protocol). This describes much of the <i>why</i> of OOD, and explains the Agile mindset.</li>
<li><a href="http://www.amazon.com/gp/product/0201379430/sr=1-1/qid=1302709079/ref=olp_product_details?ie=UTF8&me=&qid=1302709079&sr=1-1&seller=">Object Design</a> - another often referenced book describing the <i>why</i> and various design decisions that go into object oriented design. Thinking about objects isn't hard-- category theory and abstraction is something that our brain does quite naturally (hence the appeal of this design methodology). However, the ergonomics of OOD can sometimes lead to an undeserved sense of self-confidence in one's design. It may <i>feel</i> like OOD, because "oh look-- there are objects there! inheritance! oh, my!", but within the context of software engineering there are many factors that make some designs good and some bad.</li>
<li><a href="http://www.amazon.com/Design-Patterns-Elements-Reusable-Object-Oriented/dp/0201633612/ref=sr_1_1?s=books&ie=UTF8&qid=1302705445&sr=1-1">Design Patterns: Elements of Reusable Object Oriented Software</a> - Canonical book by the "gang of four". Has good description of patterns and <i>why</i> they are useful. If you prefer something with prettier pictures, starting with the <a href="http://www.amazon.com/First-Design-Patterns-Elisabeth-Freeman/dp/0596007124/ref=sr_1_2?s=books&ie=UTF8&qid=1302705445&sr=1-2">Head First Design Patterns</a> book is nice too</li>
<li><a href="http://www.amazon.com/Essays-Object-Oriented-Software-Engineering-Oriented/dp/0132888955/ref=sr_1_1?s=books&ie=UTF8&qid=1302705325&sr=1-1">Essays on OO Software Engineering</a> - Maybe not terribly well-known, but well written theoretical overview of OO, the design forces in OOD, and the motivations behind them. I had the pleasure to work with Ed.</li>
<li><a href="http://www.amazon.com/Patterns-Enterprise-Application-Architecture-Martin/dp/0321127420/ref=sr_1_1?s=books&ie=UTF8&qid=1302705155&sr=1-1">Patterns of Enterprise Application Architecture</a> - Canonical book on patterns by the man himself</li>
<li><a href="http://www.amazon.com/Refactoring-Improving-Design-Existing-Code/dp/0201485672/ref=sr_1_1?s=books&ie=UTF8&qid=1302705613&sr=1-1">Refactoring</a> - Another Fowler book. I haven't read this one cover to cover, but read many chapters. Good examples to get your head around particular refactoring patterns</li>
<li><a href="http://www.amazon.com/Domain-Driven-Design-Tackling-Complexity-Software/dp/0321125215/ref=sr_1_1?s=books&ie=UTF8&qid=1302705711&sr=1-1">Domain Driven Design</a> - Eric Evan's now-famous book on how to turn business problems into rich object models. Great read.</li>
<li><a href="http://www.amazon.com/Real-World-Patterns-Rethinking-Practices/dp/0557078326/ref=sr_1_1?s=books&ie=UTF8&qid=1302706432&sr=1-1">Real World Java EE Patterns</a> - while this is a "java" book, the principles and design trade-off analysis that Adam Bien does for each pattern is universal. Good read.</li>
</ul><br />
<h2>Code</h2><ul><li><a href="http://www.amazon.com/Beautiful-Code-Leading-Programmers-Practice/dp/0596510047/ref=sr_1_1?s=books&ie=UTF8&qid=1302705967&sr=1-1">Beautiful Code</a> - Essays where programmers reflect on what is beauty in code. As the authors wax philosophic about their "beautiful" examples, you glean insight into their thought process. It's a nice read.</li>
<li><a href="http://www.amazon.com/Implementation-Patterns-Kent-Beck/dp/0321413091/ref=sr_1_1?ie=UTF8&qid=1302706192&sr=1-1-spell">Implementation Patterns</a> - Kent Beck's book about the low-level decisions we make as we actually type the code and create APIs (knowingly or unknowingly). One of my <b>top recommendations</b>. I really love this book.</li>
<li><a href="http://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882/ref=sr_1_1?s=books&ie=UTF8&qid=1302706270&sr=1-1">Clean Code</a> - guide to code readability, writing code at appropriate abstractions, etc. A nice companion to Implementation Patterns. I had the pleasure of meeting Robert C Martin at a conference once. I'm sure I acted appropriately star-struck.</li>
</ul><br />
<h2>Other "meta" and philosophic musings</h2><ul><li><a href="http://www.amazon.com/Notes-Synthesis-Form-Harvard-Paperbacks/dp/0674627512/ref=sr_1_1?ie=UTF8&s=books&qid=1302706811&sr=1-1">Notes on the Synthesis of Form</a> - this is not a computer science book, but deals with the theory of what is design, and describes a methodology for decomposition. It's a quick read, and at least the first half is worth the exploration just to get some ideas in your head about methodologies for design.</li>
<li><a href="http://www.amazon.com/Pragmatic-Programmer-Journeyman-Master/dp/020161622X/ref=sr_1_1?s=books&ie=UTF8&qid=1302706921&sr=1-1">Pragmatic Programmer</a> - what list would be complete without this one! Good overview of the pragmatic aspects of becoming a good programmer. From problem solving skills to tools, this is required reading. I think if you read this, then skip or at most skim <a href="http://www.amazon.com/Productive-Programmer-Theory-Practice-OReilly/dp/0596519788/ref=sr_1_1?s=books&ie=UTF8&qid=1302707582&sr=1-1">Productive Programmer</a></li>
<li><a href="http://www.amazon.com/Pragmatic-Thinking-Learning-Refactor-Programmers/dp/1934356050/ref=sr_1_3?s=books&ie=UTF8&qid=1302706921&sr=1-3">Pragmatic Thinking: Refactoring your wetware</a> - I mentioned this at the front of the post. I enjoyed most of this book-- in particular the front half. Some of the topics towards the end were less interesting, because they were more familiar. In any case at least the first three chapters are a great read.</li>
<li><a href="http://www.amazon.com/Little-Schemer-Daniel-P-Friedman/dp/0262560992/ref=sr_1_1?s=books&ie=UTF8&qid=1302707349&sr=1-1">The Little Schemer</a> - you can run through this in an afternoon. This is an introduction to recursion, via Scheme/LISP. This book is fun, because its a simple and unique format, but takes you through a journey that helps shape your mind to "think" of solutions recursively.</li>
<li><a href="http://www.amazon.com/Beautiful-Architecture-Leading-Thinkers-Software/dp/059651798X/ref=sr_1_1?s=books&ie=UTF8&qid=1302707446&sr=1-1">Beautiful Architecture</a> - This one wasn't as successful to me as Beautiful Code, but still worth a read. There are a few chapters you can skip, but a few (especially the first chapter) that are great.</li>
<li><a href="http://www.amazon.com/Beautiful-Data-Stories-Elegant-Solutions/dp/0596157118/ref=sr_1_2?s=books&ie=UTF8&qid=1302707446&sr=1-2">Beautiful Data</a> - I'm only about half way through this one, and need to spend more time with it, as I'm sure there are some gems in there I have yet to run across.</li>
</ul><br />
<h2>Reading TO-DOs</h2>These are books that are on my shelf to read next (that are relevant to this list at least)<br />
<ul><li><a href="http://www.amazon.com/Design-Essays-Computer-Scientist/dp/0201362988/ref=sr_1_1?s=books&ie=UTF8&qid=1302706584&sr=1-1">The Design of Design</a> - Fred Brooks (of Mythical Man Month fame) waxes about design philosophies and goes through a few case studies. Im really excited to make time for this.</li>
<li><a href="http://www.amazon.com/Masterminds-Programming-Conversations-Creators-Languages/dp/0596515170/ref=sr_1_1?ie=UTF8&s=books&qid=1302707225&sr=1-1">Masterminds of Programming</a> - interviews with the creators of many of the major languages used over the last 30 years. Looks promising.</li>
<li><a href="http://www.amazon.com/Thoughtworks-Anthology-Technology-Innovation-Programmers/dp/193435614X/ref=sr_1_1?ie=UTF8&s=books&qid=1302707626&sr=1-1">Thoughtworks Anthology</a> - I am going to be selective in which essays I read. I will start with those that are most applicable to my world, and then move out to the more "exotic" (relatively). I know that there are essays in here that will be more useful to my current world than others.</li>
</ul><br />
I feel like I'm missing some...<br />
<br />
SteveAnonymoushttp://www.blogger.com/profile/06360294285988063025noreply@blogger.comtag:blogger.com,1999:blog-4471109663192942674.post-52579500049978755392011-03-06T14:15:00.003-06:002011-04-13T10:19:51.103-05:00Spring 2011 Reading ListI gave last Spring's reading list (which was more what I had read in the previous year) in <a href="http://manycupsofcoffee.blogspot.com/2010/01/spring-reading-list.html">this post</a>. This spring, I am going to write what I am currently reading or hope to finish this Spring.<br /><ul><li><a href="http://www.amazon.com/gp/product/193435659X">Seven Languages in Seven Weeks</a> - this book looks interesting, and certainly the challenge of trying to <i>meaningfully</i> cover seven languages and language philosophies in a short book will be interesting if nothing else.</li><li><a href="http://www.amazon.com/gp/product/0981531601">Programming in Scala</a> - I have "played" with Scala, but not done more than toy programs. However, from everything I know about it- I believe I will really enjoy it. I buy into a lot of the functional vs object oriented dualism, and certainly enjoy the terse syntax. I think there are still interesting questions with regards to software development economics (i.e. how do we integrate Scala in a team of mixed skill levels, cost trade offs, etc.)</li><li><a href="http://www.amazon.com/gp/product/0976458705">Thinking Forth</a> - I'm not particularly interested in Forth, but I am interested in language design and problem solving philosophies.</li><li><a href="http://www.amazon.com/gp/product/0321125215">Domain Driven Design</a> - this was recommended by a commenter in the previous post, and I picked it up. I've made it through quite a bit of the book, and have enjoyed it so far. I had always heard this book referenced by Fowler et al, and really I should've read it years ago...</li><li><a href="http://www.amazon.com/gp/product/3540288457">Introduction to Reliable Distributed Programming</a> - its a Springer book, 'nuff said. I've been through a few distributed computing books, and have a fairly broad knowledge of the space. I am hoping to get a more mature, theoretically rigorous view of the problems now.</li><li><a href="http://www.amazon.com/gp/product/0123705916 ">The Art of Multiprocessor Programming</a> - this book is great. I enjoyed the nice mix of intuitive description and theoretical rigor. It's a nice blend of theory and pragmatics. I really recommend reading this for anyone that wants in depth knowledge of parallel computing.</li><li><a href="http://www.amazon.com/gp/product/1604390085">Introductions to Neural Networks for Java</a> - this was an interesting book to <i>skim</i> I don't really want to recommend it, because most of it is explaining his neural network code base instead of the underlying concepts. In any case, its a nice thing to skim over an afternoon.</li><li><a href="http://www.amazon.com/gp/product/0131471392 ">Neural Networks and Learning Machines</a> - Great textbook for all the theoretical background and mathematical proofs regarding Neural Nets and machine learning. I've had to crack open my calc books to freshen up while going through this...its dense.</li><li><a href="http://www.amazon.com/gp/product/0121942759">Fuzzy Models and Genetic Algorithms for Data Mining and Exploration</a> - I can't recommend this book (see my Amazon review if you're curious why), but it does provide a decent vocab overview for the topics it covers.</li></ul><br />Well that's it for the moment! 
This spring is a bit more theory heavy than the previous list... that probably reflects my work and school change, which has been more research oriented in the last year. I'm not moving away from the <i>real world</i> just trying to get a more comprehensive grasp on both worlds in the areas which I am interested. I've always thought this was one of my strengths-- knowing enough theory to shape my thinking skills and be knowledgeable-- but continuously, deliberately enriching my implementation and practical skills to actually put the theory to good use! The intersection of the two is the more rewarding area for me, I think.<br /><br />SteveAnonymoushttp://www.blogger.com/profile/06360294285988063025noreply@blogger.comtag:blogger.com,1999:blog-4471109663192942674.post-87329176776097735652011-03-05T13:46:00.003-06:002011-03-05T14:05:07.132-06:00Building Derby with Eclipse gotchaThis is partially a reference for myself and for anyone else trying to build Derby 10 from source using Eclipse. My thesis project involves modifying the on-disk page layout for a database (long story for another post). I am currently using Eclipse for my IDE. I just wanted to document a quick pain point in case anyone else is Googling for the answer:<br /><br />I wanted to do a <b>clean build</b> of all of Derby. So I set up an ANT External Tool launch configuration:<ul><li>Launch the build.xml from the root Derby directory</li><li>Working directory is the root Derby directory</li><li>Refresh resources upon completion</li><li>Uncheck the "build before launch</li><li>Choose the <b>clobber, all</b> targets (clobber is a total clean)</li><li>Defaults for everything else</li></ul><br />I started to build and failed with the follow errors (I'm omitting some of the ones in the middle):<br /><blockquote><br /> [javac] C:\DerbyDev\java\build\org\apache\derbyBuild\javadoc\DiskLayoutTaglet.java:24: package com.sun.tools.doclets does not exist<br /> [javac] import com.sun.tools.doclets.Taglet;<br /> [javac] ^<br /> [javac] C:\DerbyDev\java\build\org\apache\derbyBuild\javadoc\DiskLayoutTaglet.java:25: package com.sun.javadoc does not exist<br /> [javac] import com.sun.javadoc.*;<br /> [javac] ^<br /><br />...<br /><br /> [javac] C:\DerbyDev\java\build\org\apache\derbyBuild\javadoc\UpgradeTaglet.java:102: cannot find symbol<br /> [javac] symbol : class Taglet<br /> [javac] location: class org.apache.derbyBuild.javadoc.UpgradeTaglet<br /> [javac] Taglet t = (Taglet) tagletMap.get(tag.getName());<br /> [javac] ^<br /> [javac] 40 errors<br /></blockquote><br />So as you can see from the first two errors -- its missing the dependency for Javadoc things. As it turns out Derby defines its own Taglet for Javadoc processing. All of these classes are defined in the JDKs tools.jar file, which is in the JDK's/lib directory (not included in the JRE).<br /><br />Derby's build script includes tools.jar by: <br /><pathelement path="${java15compile.classpath};${java.home}/../lib/tools.jar"/><br /><br />Well I went to my command prompt and looked at my environment variable and java_home was set to the JDKs location. So this should've worked. I added a new launch configuration to execute the <b>showenv</b> target, and added an echo statement to include java.home.<br /><br />I ran this and low and behold it was pointing to the JRE path, which doesn't have a tools.jar. So the world made sense now. In my eclipse installation, I have both the JDK and JRE installations defined in my preferences. 
For some reason (subconscious desire to shoot myself in the foot?), the JRE was chosen for this workspace, and thus the launch configuration used that.<br /><br />So to resolve the problem, you can change the external tools launch configuration for the ANT script. Go to the JRE tab and select <b>separate JRE</b> and choose the JDK. Or change the default JRE for the workspace, and leave the launch configuration set to <b>run in the same JRE as the workspace</b>.<br /><br />This isn't a tricky problem, but can be a little misleading...plus I haven't posted in a while ;-)Anonymoushttp://www.blogger.com/profile/06360294285988063025noreply@blogger.com