11/11/2013: 12:12 am: Reviews

Nothing to Envy: Ordinary Lives in North Korea is a fascinating book on a topic that I previously knew almost nothing about. Barbara Demick very much deserved her selection as a National Book Award finalist.

I first found out about this book when I went to the Goodreads office in San Francisco for a talk on their recommendation engine. On top of a great explanation of recommendation engines and some cool anecdotes about how they tuned their recommender algorithms, I walked away with a recommendation for this great book. The engineer giving the presentation said it was his goal to get as many people as possible to read Nothing to Envy.

Demick’s early reporting on North Korea was thwarted when she realized that she wouldn’t be able to have any meaningful conversations with citizens, due to the oversight of her North Korean minders. So, she painstakingly interviewed North Korean defectors and fact-checked their stories. Demick did an amazing job of turning those interviews into riveting stories of the soul-crushing poverty, totalitarianism, and terror that still reign in North Korea. The North Korean famine in the mid-1990s very likely resulted in more deaths than the Great Famine in Ireland.

On a related note, one of my cousins very recently traveled to South Korea and was able to visit the Joint Security Area. She wrote that, “to visit that area, we had to board a special military bus, ride on roads surrounded by mine fields and anti-tank obstacles rigged with C-4 explosives, wear visitor ID badges, and line up single file when walking about.” While they stood near armed South Korean guards, a North Korean soldier only a few hundred feet away stared at them through binoculars. Another North Korean peeked through a window and photographed them. Given the sad stories of poverty in Nothing to Envy, even among North Korean soldiers, I wouldn’t be surprised if it were a film camera whose film has long since stopped being available.

8/23/2013: 9:10 pm: Conference, MongoDB

Yesterday I presented at NoSQL Now! 2013 in San Jose, CA, on how Castlight Health uses MongoDB to support low-latency geospatial searches on huge amounts of health care pricing data. Given the number of parallel tracks and the total number of people attending sessions, I was pleased to get about 50 engaged people in my session.

When I presented a very similar talk at MongoDB Days SF in May, we had about 600 million prices. In August we went over 1 billion. That’s a lot of JSON documents.

The main update in this version of the presentation is that I described my process of switching from the haystack index to the 2dsphere index. What has changed is our data: it is now more common for us to have very large collections of prices in which the majority of rates in an area are for provider networks that are not in-network for the user. The 2dsphere index allows me to add the provider network id as a third part of the compound index, so those out-of-network rates can be filtered by the index itself. The performance I’m now seeing ranges from 10% slower to 5 times faster, and my admittedly statistically insignificant testing so far suggests an average speedup of about 50%.
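As a rough illustration, here’s how such a compound index could be declared with the modern MongoDB Java driver. This is a sketch, not our actual code; the field names (location, network_id) and connection details are hypothetical placeholders:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class PriceGeoIndexSketch {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> prices =
                client.getDatabase("demo").getCollection("prices");

        // Compound index: the geo field first, then the provider network id,
        // so a geo query can also narrow by network_id using the same index.
        prices.createIndex(new Document("location", "2dsphere")
                .append("network_id", 1));

        client.close();
    }
}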

5/12/2013: 9:21 pm: Conference, MongoDB

My presentation at MongoDB Days SF on Friday went very well, though I got to the last slide of my 40-minute talk in just 30 minutes. Hopefully, I didn’t speak too fast for a lot of people. Fortunately, I remembered at the end that I had also wanted to talk about how I saved space by omitting keys with a very common default value. While that required adding a very small amount of app logic to fill in default values for missing keys, it enabled me to save quite a few bytes per document. Every byte counts when you are storing 600+ million documents.
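The trick is simple enough to sketch in a few lines. This is a hypothetical example, assuming a field name and default value that aren’t our real schema:

import org.bson.Document;

public class DefaultValueSketch {
    // Assumed field name and common default; not the real schema.
    static final String UNIT = "unit";
    static final String DEFAULT_UNIT = "USD";

    // Before storing: drop the key when it holds the common default.
    static Document forStorage(Document doc) {
        if (DEFAULT_UNIT.equals(doc.getString(UNIT))) {
            doc.remove(UNIT);
        }
        return doc;
    }

    // After loading: fill the default back in if the key is missing.
    static String unitOf(Document doc) {
        String unit = doc.getString(UNIT);
        return (unit != null) ? unit : DEFAULT_UNIT;
    }
}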

The extra time at the end of my talk was also nicely filled by a lot of good questions. I ended up having to rush off the stage to make way for the next talk.

It was also cool to be speaking to about 100 people in the Gold Room at the beautiful Palace Hotel. The Castlight Xmas party was in the same room two years ago.

As always, 10gen put on another excellent show. Meghan and her team are awesome. I’ve attended several of these events and they get better each time.

Here are the slides I presented:

3/29/2013: 9:32 pm: Conference, MongoDB

I’m very excited that my talk proposal “Geo Searches for Healthcare Pricing Data” was accepted for MongoDB SF 2013. I’ll be speaking at 11:55 am on May 10th in the Palace Hotel Gold Ballroom about why I chose to deploy MongoDB at Castlight Health to serve large amounts of healthcare pricing data with very low latency. I’ll go into as much detail as forty minutes allows on how we use it, how we deployed it, and the excellent results we’ve gotten.

I previously spoke about logging to MongoDB at Mongo SV a few years ago in Mountain View and have attended several other MongoDB conferences and events. The 10gen team has built a fantastic product and has an amazing events team. Their conferences are well run, economical, and educational, and I highly recommend attending them.

3/21/2013: 10:32 pm: Reviews

After finishing Facebook COO Sheryl Sandberg’s new book Lean In, I was surprised by the disconnect with some of the early criticism I had read. I actually thought Sandberg wrote a lot about her personal life, and she clearly appreciated her experience and co-workers at Google. She also openly admitted that she is in a place of privilege. I don’t understand why people think that disqualifies her from advising women about what is possible in their careers. Obviously, not every woman will achieve what she has achieved. But neither will every man. I have to wonder if those reviewers just read other reviews by reviewers who only lightly skimmed the book.

Lean In also has a lot to offer male readers. In addition to advice for men in senior positions on how to ensure women and men have equal opportunities, there’s just a lot of great career advice that applies to almost everyone.

I had initially wondered about the amazing detail of the footnotes and how she had time to do that much research. Then I discovered, as I should have expected, that she had a lead researcher. However, Sandberg did provide very detailed acknowledgements giving credit where credit was due. And I think her presentations and Q&A sessions show that she’s not just putting her name on someone else’s work.

4/7/2012: 9:53 pm: Java

At work we have a Java-based service that caches a very large amount of data. I spend a lot of time optimizing performance and memory usage for this service. The amount of memory it uses at runtime to cache a sufficient amount of data for performance reasons is now approaching the 32 GB boundary.

One downside of using a 64-bit JVM is that the object pointers used by the JVM to reference objects would normally need to grow from 4 bytes (32 bits) to 8 bytes (64 bits). But if the heap is under 32 GB, the JVM can take a shortcut: objects are aligned on 8-byte boundaries, so a 32-bit pointer shifted left by 3 bits can address 8 × 4 GB = 32 GB of heap.

The JVM argument -XX:+UseCompressedOops was added to force the JVM to use compressed ordinary object pointers (oops) whenever possible. In Java 6 Update 23, the JVM was updated to enable Compressed Oops by default. Compressed Oops work very well for heaps up to about 26 GB, but can still be advantageous for larger heaps.

However, I found that with Java 6 Update 24, Compressed Oops are not used at times when they could be, even if you specify -XX:+UseCompressedOops.

Specifically, I was using a 31.7 GB heap and my service was unexpectedly running out of memory. On one of our other systems, the heap for the same service reached only 26 GB after its large internal cache was fully loaded. After a lot of investigation that weekend, I discovered that the working system was on Java 6 Update 26. I downgraded it to Update 24 and easily reproduced the problem.
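If you want to check what a running JVM thinks the flag is set to, the HotSpot diagnostic MBean will report it. Here’s a quick sketch; note that it reports the flag’s value and origin, not whether compressed oops are actually in effect, which is exactly what made this bug sneaky:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class CheckCompressedOops {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean hotspot = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // Prints the current value and origin of the UseCompressedOops flag.
        System.out.println(hotspot.getVMOption("UseCompressedOops"));
    }
}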

I had previously done a lot of testing to see what our penalty would be for crossing the 32 GB heap boundary and found that it was about 8 GB (roughly 30% in our case, on top of a 26 GB compressed heap). That’s actually not too bad, as general estimates I had heard from others ranged from 30-50%. This is probably because the objects we are caching mostly have only primitive data types as fields.

I highly recommend the Everything I Ever Learned About JVM Performance Tuning @Twitter slides and presentation for more info on JVM tuning. Or read Andrew’s summary of Attila’s talk.

By reducing the cache size on the production server enough to not blow out the heap, I confirmed that the heap size when using 1.6.0_24 was about 8 GB higher than when using 1.6.0_26. JVM, meet smoking gun.

The release notes for 1.6.0_26 reference a CompressedOops-related bug fix that may have resulted in this feature now working as described. I read through all of the many bugs fixed in that release, and it seems to be the most likely candidate.

3/6/2012: 11:20 pm: MySQL

At work today I ran into a new reason for not keeping open MySQL connections for a long time. It involves dynamic session variables like long_query_time.

I wanted to capture a couple of hours worth of all queries in the slow query log so I could analyze them with pt-query-digest from the excellent Percona Toolkit. So, I used

set global long_query_time=0;

After a couple of hours I had a lot of data. I discovered later that I was actually missing some of the early queries.

Then I set long_query_time back to 1 so only queries longer than 1 second would be logged. To my initial amazement, lots of very short queries continued to be logged to the slow query log.

A little research turned up the fact that the long_query_time session variable for a connection is initialized from the global variable only when a connection is opened. So, any connections that were open when I set long_query_time to 0 continued on as if it were still set to 1. Therefore I missed capturing those queries.
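You can see this behavior for yourself with a couple of JDBC connections. A minimal sketch, assuming a local test server and throwaway credentials:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SessionVariableDemo {
    // Assumed local test server and credentials.
    static final String URL =
        "jdbc:mysql://localhost:3306/test?user=root&password=secret";

    public static void main(String[] args) throws Exception {
        Connection before = DriverManager.getConnection(URL); // opened before the change

        try (Connection admin = DriverManager.getConnection(URL);
             Statement st = admin.createStatement()) {
            st.execute("SET GLOBAL long_query_time = 0");
        }

        Connection after = DriverManager.getConnection(URL); // opened after the change

        // The old connection keeps the value it copied at connect time.
        System.out.println("before: " + sessionLongQueryTime(before)); // e.g. 1.000000
        System.out.println("after:  " + sessionLongQueryTime(after));  // 0.000000

        before.close();
        after.close();
    }

    static String sessionLongQueryTime(Connection c) throws Exception {
        try (Statement st = c.createStatement();
             ResultSet rs = st.executeQuery("SELECT @@session.long_query_time")) {
            rs.next();
            return rs.getString(1);
        }
    }
}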

Worse, though, is that some of our code uses connection pools that can keep connections alive for a few hours. Short queries on those connections continued to be written to the slow query log after I set long_query_time back to 1. Fortunately, I was tailing the slow query log, so I noticed this before it got too big. At that point, you can temporarily disable slow query logging, let it go and hope those connections don’t last too much longer, or go all club-and-hammer and kill the connections, hoping your code handles it gracefully. The last option generally isn’t so great an idea for a production app.

As I mentioned above, my slow query log capture didn’t include sub-second queries from connections that were open when I set long_query_time to 0. That would obviously affect the results of my analysis by leaving out the queries from long-lived connections.

So, if you’re doing something like this, you should capture data for a long enough time to offset this factor or at least throw out the data from the earlier parts of the log. You should also check out the open connections with show processlist; before changing long_query_time.

1/30/2012: 12:07 am: Food and Drink, The Unusual and the Weird

Whenever I’m looking for a drink recipe, I first seek out my trusty Mixology pamphlet, courtesy of the Southern Comfort Corporation, ca. 1974. What better source could there be for cocktail recipes than a pamphlet that mixes astrology with photos of swinging dudes in gaudy polyester leisure suits accompanied by smiling Stepford wives? Sure, this guy is in a pretty reasonable-looking sweater, but steel yourself now for what is to come.

Back in the ’70s, subliminal messaging was a controversial topic. I remember running across a book from that same year called Subliminal Persuasion that had a lot of images from advertising and movies with supposedly embedded suggestive words and images, primarily of a sexual nature. Check out the Subliminal Manipulation blog if you don’t sex believe me.

Take a closer look at the hair of the guy on the Mixology cover. Now, think about other parts of a man’s body. Good luck getting this image out of your head (no pun intended) anytime soon. I’m really, really sorry.

Next up we’ve got a dude confidently sporting a pink suit. The previous sentence is the only known sentence on the internet including the words dude and pink suit, but not the word pimp. Ignore the faint yellow polka-dots. I double-checked the pamphlet and they must be a scanning artifact. I was so hoping they weren’t, though. I think my scanner understandably puked on the image.

“Almost everyone knows his Zodiac sign today. But few have any real knowledge of astrology. Intent of astrology data herein is simply to inform, not to advise. Therefore any personal application is the individual’s responsibility.”

Check out Mr. Quilted Pants on the left. What grandmother wouldn’t want to see her granddaughter coming home with a nice boy wearing a handmade quilt? OK, besides any grandmother with something against hobos. Those pants are so appalling that I almost didn’t notice the crazy blue plaid suit in the back. He’s channeling Rodney Dangerfield from Caddyshack, but coming up well short. Powder blue cardigan boy looks positively normal in this photo.

If your party kilt is at the drycleaner, a full tartan suit is always a great substitute. It’s a little hard to see, but, yes, those are matching pants. Fortunately, I don’t think it’s the royal Stewart tartan. Too bad his promiscuous plaid partner up front isn’t in a matching tartan. I can’t identify the tartans for certain due to the cumulative retinal scarring, but I’m suspecting they’re variants of the Montgomery Ward tartan.

If you’re daring and desperate for the full pamphlet in a high enough resolution to actually read it, download the 2 MB zipfile.

10/22/2011: 12:23 pm: The Unusual and the Weird

The competition for toilet supremacy is heating up. The NY Times has a great review of Kohler’s Numi, which opens up like a Transformer to accept your tributes. Someone should hack the opening chime to play a recording of Optimus Prime saying “No sacrifice is too great in the service of freedom.” And I would love to see it do battle with Toto’s Megatron, I mean Neorest.

Kohler Numi transforming

When I first glanced at the image of the remote control, I thought the bottom left button said “Lasers”. Now, that would be freaking awesome. Whether as a laser light show to accompany the event or as a modern alternative to the incineration of your contributions, I’m all for it. And surely a couple frickin’ lasers would come in handy when kicking some Neorest butt.

Kohler Numi remote control

6/3/2011: 5:00 pm: Java, MySQL

When you use the MySQL JDBC driver to select rows from a table, the connection will block until the entire ResultSet has been pulled over to the client. In most cases this makes sense, especially if the server is on a different host. Retrieving the entire ResultSet will minimize the number of TCP packets that must be sent from the server.

However, if you are returning a very large ResultSet, the client will have to allocate a lot of memory on the heap. If you end up accessing each row to create an object from the data, then you will need enough heap space for the entire ResultSet plus all of the objects you instantiate.

The driver documentation explains how to force the driver to stream the ResultSet row-by-row.

The first catch is that you must be using a regular Statement object, not a PreparedStatement.

The documentation says you need to add the following non-intuitive code before executing the query:

stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                            java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

though you can actually just use conn.createStatement() since TYPE_FORWARD_ONLY and CONCUR_READ_ONLY are the defaults.

There are a couple of caveats in the documentation, though they are fairly obvious. You should process the ResultSet as quickly as possible, since locks will be held for as long as the statement (and any transaction it is part of) remains open.

In addition to being non-intuitive, setting the fetch size to Integer.MIN_VALUE might cause unexpected results if you run your code against a database server other than MySQL.

If you’re willing to go all out in committing to MySQL, you can cast the return value of createStatement() to com.mysql.jdbc.Statement and then call enableStreamingResults(). That will, at least, make the behavior of your code more obvious.
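Putting it all together, here’s a minimal sketch of streaming a big query row by row. The connection URL, table, and column names are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingQuerySketch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/test?user=app&password=secret");
        Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                              ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(Integer.MIN_VALUE); // stream rows one at a time

        ResultSet rs = stmt.executeQuery("SELECT amount FROM prices");
        try {
            while (rs.next()) {
                // Consume each row promptly; the server holds resources
                // until the ResultSet is fully read or closed.
                process(rs.getLong("amount"));
            }
        } finally {
            rs.close();
            stmt.close();
            conn.close();
        }
    }

    static void process(long amount) {
        // Build a cache entry, aggregate, etc.
    }
}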

At work I needed to cache a lot of data from a couple of tables. Using the default behavior caused the heap to grow to over 12.5 GB. That made for trouble when running on my 8 GB laptop. By switching to streaming the ResultSet, the heap maxed out at only 5 GB.

