Thursday, October 20, 2011

Quick and easy way to split lots of strings in Java

Recently, I had to split strings in a method which would be called potentially couple of billion times. The strings were delimited by only one character e.g. '#'. To be as efficient as possible, I just used String.indexOf(delimiter) and then String.substring(i, j) to get the splits. For this case, I think this certainly is the most efficient solution because the String.substring() method does not allocate new memory, it just returns a new String object which has some meta information to get the string from the original string's buffer (this is a nice optimization but it has its own pitfall). I was happy with it but the code looked messy with all those String.indexOf() calls and lots of index maintenance. Upon getting a review from a colleague I looked for ways to make it look better and easier to maintain. I found couple of alternatives to do the this:

1. StringTokenizer
2. Scanner
3. String.split

StringTokenizer had to be discarded immediately as it is getting old and is only there for backward compatibility. The Javadoc itself says, "StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code." The Scanner is a great utility but seems to be bit heavy for that method. I had to keep in mind that the method would be called potentially billions of times in a row and it has to be as quick as possible. This left me with only the last option. String.split works in a simple way. It returns an array of strings. So, I could just use something like this:

String splits [] = str.split("#");
for (String split : splits) {
    // Do something ...
}

However, as noted here, there is a performance issue with this. Each String.split() call compiles a regular expression pattern and then uses the pattern to split the string. The pattern compilation is indeed very expensive and can slow the method down significantly. So, essentially what the String.split() method does is this:

public String[] split(String regex) {
     return Pattern.compile(regex).split(this);
}

By looking at the implementation, it becomes immediately obvious what one can do if lots of strings need to be split by the same regular expression. One can just have a pre-compiled pattern and call its split method repeatedly:

Patter pattern = Pattern.compile(regex);
...
for (String str : strings) {
    String splits [] = pattern.split(str);
    // Do something with the splits ...
}


The only thing unknown to me is whether the returned strings allocate new memory or the Pattern.split() method just behaves as String.substring(). But from running the code with more than 2.5 billion strings, I did not see any performance difference between the old and the new approach. It might mean that copies of strings are actually not made.

Tuesday, October 11, 2011

Netflix does not support Linux!

After getting annoyed by Netflix because of its recent dramas, I got pissed off today to see it does not support Linux. Its a shame for a new tech company like Netflix to not support a popular Linux distro like Ubuntu. May be the number of users is small compared to other OS users, but in this world of open source and open standards supporting Ubuntu is just a gesture of goodwill. I would say that the disliking that started to grow in me after the price hike just got bigger after discovering this.