Thursday, October 20, 2011

Quick and easy way to split lots of strings in Java

Recently, I had to split strings in a method which would be called potentially couple of billion times. The strings were delimited by only one character e.g. '#'. To be as efficient as possible, I just used String.indexOf(delimiter) and then String.substring(i, j) to get the splits. For this case, I think this certainly is the most efficient solution because the String.substring() method does not allocate new memory, it just returns a new String object which has some meta information to get the string from the original string's buffer (this is a nice optimization but it has its own pitfall). I was happy with it but the code looked messy with all those String.indexOf() calls and lots of index maintenance. Upon getting a review from a colleague I looked for ways to make it look better and easier to maintain. I found couple of alternatives to do the this:

1. StringTokenizer
2. Scanner
3. String.split

StringTokenizer had to be discarded immediately as it is getting old and is only there for backward compatibility. The Javadoc itself says, "StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code." The Scanner is a great utility but seems to be bit heavy for that method. I had to keep in mind that the method would be called potentially billions of times in a row and it has to be as quick as possible. This left me with only the last option. String.split works in a simple way. It returns an array of strings. So, I could just use something like this:

String splits [] = str.split("#");
for (String split : splits) {
    // Do something ...
}

However, as noted here, there is a performance issue with this. Each String.split() call compiles a regular expression pattern and then uses the pattern to split the string. The pattern compilation is indeed very expensive and can slow the method down significantly. So, essentially what the String.split() method does is this:

public String[] split(String regex) {
     return Pattern.compile(regex).split(this);
}

By looking at the implementation, it becomes immediately obvious what one can do if lots of strings need to be split by the same regular expression. One can just have a pre-compiled pattern and call its split method repeatedly:

Patter pattern = Pattern.compile(regex);
...
for (String str : strings) {
    String splits [] = pattern.split(str);
    // Do something with the splits ...
}


The only thing unknown to me is whether the returned strings allocate new memory or the Pattern.split() method just behaves as String.substring(). But from running the code with more than 2.5 billion strings, I did not see any performance difference between the old and the new approach. It might mean that copies of strings are actually not made.

2 comments:

Omar Faruk said...

It would be interesting to post the performance benchmarking results comparing different options that you presented

Russoue said...

Its a good idea. I think I saw some performance comparison in a blog between two of the methods I mentioned here. I will try to look it up and post here.