Leveraging Multicore CPU architecture with fork-join framework in Java

Introduction

Hardware trends drive programming idioms. The Java language has had support for threads and concurrency from day 1; the language includes synchronization primitives such as synchronized and volatile, and the class library includes classes such as Thread. However, the concurrency primitives that were sensible in 1995 reflected the hardware reality of the time: most commercially available systems provided no parallelism at all, and even the most expensive systems provided only limited parallelism. In those days, threads were used primarily for expressing asynchrony, not concurrency, and as a result, these mechanisms were generally adequate to the demands of the time.

Going forward, the hardware trend is clear; Moore's Law will not be delivering higher clock rates, but instead delivering more cores per chip. As we enter the multi-core era, we will need to find finer-grained parallelism or risk keeping processors idle even though there is plenty of work to do. As the dominant hardware platform shifts, so too must the software platform if we wish to keep up. To this end, Java 7 includes a framework for representing a certain class of finer-grained parallel algorithms: the fork-join framework.

Fork-Join

The fork-join framework is an implementation of the ExecutorService interface that helps speed up parallel processing by attempting to use all available processor cores. This is accomplished through a divide and conquer approach. The fork/join framework is distinct because it uses a work-stealing algorithm. Worker threads that run out of things to do can steal tasks from other threads that are still busy. [1]

The center of the fork-join framework is the ForkJoinPool class which implements the core work-stealing algorithm and can execute ForkJoinTask processes. [1]

Divide and conquer

The standard technique for attacking exploitable parallelism is called recursive decomposition, or divide and conquer. In this approach, we repeatedly divide the problem into subproblems until the subproblems are small enough that it makes more sense to solve them sequentially; we then solve the subproblems in parallel, and combine the partial results of the subproblems to derive a total result.

Listing 1 illustrates a typical divide-and-conquer solution in pseudocode, using a hypothetical CONCURRENT primitive for concurrent execution.

Listing 1. Recursive decomposition

R solve(Problem<R> problem) {

    if (problem.isSmall())

        return problem.solveSequentially();

    R leftResult, rightResult;

    CONCURRENT {

        leftResult = solve(problem.leftHalf());

        rightResult = solve(problem.rightHalf());

    }

    return problem.combine(leftResult, rightResult);

}

Wrap this code in a ForkJoinTask subclass, typically using one of its more specialized types, either RecursiveTask (which can return a result) or RecursiveAction [1].

Workshop

Now let us implement Merge Sort algorithm using this approach. Multithreaded Merge Sort algorithm is given below.

The array is divided into two parts initially, and then the two sub arrays are sorted. This can be made in parallel. Once the two parallel tasks completed, merge is performed. Notice the sync point here. Now let us see the code using the fork-join framework to implement the Multi threaded Merge Sort algorithm.

Our client code should now look like this.
Two additional algorithmic steps are introduced by divide-and-conquer — dividing the problem, and combining the partial results.

Our problem Merge sort, admits decomposition nicely. The partitioning step of a problem like this is O(1)— "find the average of the top and bottom indices." Moreover since the two sub problems are almost equal in size, we can explore more exploitable parallelism. Merge sort has O(n log n) worst-case and average-case performance. But because the merging is difficult to do in place, generally, it has higher memory requirements than in-place sort algorithms, such as quick-sort. But because the sorting of the subproblems can be done in parallel, it parallelizes better than quick-sort. [3]

Given a fixed number of processors, parallelization still can't turn an O(n log n) problem into an O(n) one, but the more amenable the problem to parallelization, the closer to a factor of n_cpus by which parallelization can reduce the total runtime. Reducing the total runtime means that the user gets the result sooner — even if it takes more total CPU cycles to do the work in parallel than it does sequentially.

The principal benefit of using the fork-join technique is that it affords a portable means of coding algorithms for parallel execution. The programmer does not have to be aware of how many CPUs will be available in deployment; the runtime can do a good job of balancing work across available workers, yielding reasonable results across a wide range of hardware. [4]

When the source data is split into two, each node in the computation tree corresponds to a binary split, forming a binary tree. Ideally, this tree would be balanced — with each leaf node representing exactly the same amount of work. This ideal is not achievable in practice, but some sources come much closer than others. Arrays are the best case. We can describe a segment of an array by using a reference to the array base and integral offsets of the start and end of the segment. The cost of splitting this segment into two equal segments is cheap: We compute the midpoint of the segment, create a new descriptor for the first half, and move the starting index for the current segment to the first element in the in the second half [5].

The invokeAll() method is the most convenient way to submit a sequence of ForkJoinTasks to the ForkJoinPool. It takes tasks as parameters (two tasks, var args, or a collection), forks them returns a collection of Future objects in the order in which they were produced [2].

Benchmark Results

Here’s the benchmark of Multi threaded merge sort implemented using fork-join approach against the imperative naive approach suitable for mono core CPUs. I have chosen a large array of size 10⁸ elements. To sort such an array Multi threaded fork-join based approach takes 2828 milliseconds on my machine. The same input array takes 9339 milliseconds to be sorted using the sequential counterpart on my machine. Also notice that here on my machine refers to a core i7 which has 4 physical cores and 4 logical cores.

Conclusion

Use fork-join approach only if you need to solve extremely large computational problems while leveraging multi core CPU architecture. If you have several thousands of elements to sort, it is counter intuitive to use this approach. For an instance Oracle tutorial [1], gives a nice example of blurring an image while representing it as an array of pixels. For a smaller problem, imperative approach may be suitable and sometimes outperforms the fork-join approach because of the startup costs imposed by parallel execution.

References

[1] https://docs.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html
[2] http://www.baeldung.com/java-fork-join
[3] https://www.ibm.com/developerworks/java/library/j-java-streams-4-brian-goetz/index.html?ca=drs-
[4] https://www.ibm.com/developerworks/java/library/j-jtp03048/
[5] https://www.ibm.com/developerworks/java/library/j-java-streams-5-brian-goetz/index.html?ca=drs-

Search This Blog

Ravindra Ranwala's Blog