When does the use of parallel stream processing make sense
Although the implementation of parallel data processing isn’t a difficult task anymore, the use of it is not always profitable. The main reason is that the processing time isn’t always lower when implementing data processing parallelly. According to small examples in ,  and experiments in the context of this work, many situation can lead even to a much higher execution time when using parallel streams, than using them in a sequential way. It has to be noted that parallel streams have a much higher overhead compared to sequential ones. Coordinating, splitting and combining are mandatory and time consuming jobs when dealing with parallel streams. But these are not the only reasons why parallel stream processing could behave worse than sequential ones. So the question is: when to use parallel data processing, and when to use streams with sequential access or even use external iteration? The following list should give some indicators which influence the execution time when using parallel stream  .
1. Number of CPUs respectively cores
2. Amount of data (N – Number of items)
3. Complexity of the task (Q – Cost): A higher value of N as well as a higher value of Q leads to a higher chance of a better performance when using a parallel stream.
4. Kind of operation: Some methods will naturally perform worse in parallel mode than on a sequential stream. Such operations like findFirst() or limit often rely on the order of the stream and so they are expensive in parallel mode. Others like findAny() or map() will perform much better with parallel streams because it isn’t constrained to operate in the encountered order.
5. Underlying data structure: For example ArrayList can be split with much less effort than a LinkedList because the ArrayList can be divided without traversing the whole collection, as it is necessary for linked data structures (see performance experiment in the next section).
6. Boxing and unboxing: Boxing (or Autoboxing) is the automatic conversion that the Java compiler accomplished between a primitive type and the corresponding object wrapper class, for example between the primitive type int and its corresponding class Integer. This operation can influence the costs of processing of streams in a very bad way. The solution for this problem when working with primitive values is to use also primitive streams which comes with Java 8. These primitive streams are represented in Java 8 by the classes IntStream, DoubleStream and LongStream. By using these classes instead of using for example Stream, the additional and expensive step of boxing can be omitted.
The above enumeration outlines how many aspects can influence the performance of parallel stream processing. They all have to be considered when making decisions between sequential and parallel processing. Although they give good indicators only more experience in dealing with streams and parallel streams can prevent the programmer of picking the wrong one. So in case of having doubt about the best solution it is advisable to measure the time-effort of the two approaches and compare them. Since the change between the sequential and the parallel version of stream processing is an easy task by only substitute a method name like stream() to parallelStream() the effort of testing which processing method behaves faster, is low.
Quantitative Comparison between external iteration, sequential and parallel stream processing
This section should give an idea of how the difference factors (compare with the enumeration of the previous section) like amount of data, complexity of task/operation, kind of operation, Boxing and the the underlying data structure influence the execution time. Therefore four different test scenarios should measure how the execution time changes by influencing these factors. So the execution time in milliseconds acts as the benchmark. Table 1.1 summarizes the different test scenarios and emphasizes the test settings roughly.
Test Settings: According to table 1.1, the first test scenario Test A demonstrates an example from the article ’Iterating over collections in Java 8’  executed on a 64 Bit Windows system with a dual-core Pentium processor and 4 GB RAM, while the following three test scenarios (Test B – Test D) are own designed tests executed on a 64 Bit Mac OSX system with a Intel i5 – 2,5 GHz processor (two cores) and 8 GB RAM.
Measuring Approach: To measure the execution time, one and the same test scenario was executed 100 times. The duration of each iteration was measured in nanoseconds, converted to milliseconds and averaged over all 100 iterations (see listing 1.1). This is how the values in the result tables were calculated. Since for every single test scenario (like an external iteration of an ArrayList or a parallel stream processing of an ArrayList), a new application was launched, the JVM warm up phase was not taken into account, because the prerequisites were the same for all of them. This fact ensures the comparability of the results. To demonstrate also the impact of the underlying data structure when traversing collections, the tests ’A’, ’B’ and ’C’ were executed for the most common ones, that are LinkedList, ArrayList, HashSet, LinkedHashSet and TreeSet.
Result Representation: The results of the test scenario are represented by a table with the following columns:
Collection type: Underlying type of data structure on which the test scenario is executed.
External iteration (A): Execution time in milliseconds for processing the collection with an external iteration approach.
Sequential stream (B): Execution time in ms for processing the collection with a sequential stream approach.
Parallel stream (C): Execution time in ms for processing the collection with a parallel stream approach.
Relative change (C/A): Shows the relative change of execution time between external iteration and parallel stream processing. A positive percentage value indicates a higher execution time for parallel processing in contrast to external iteration, while a negative one indicates a faster execution time.
Relative change (C/B): Similar to the previous point, but this column demonstrates the relative change between sequential stream and parallel stream processing.
Trend: Possible values are increasing, indicating a slower execution time of parallel execution, and decreasing to indicate a faster execution time parallel execution.
This test scenario is taken from article . The benchmark code counts the numbers of names in a collection with 1 million elements (N = 1.000.000) that begin with letter ’A’. The costs for each element (Q) are very low, since there is only one method call to check if the element starts with the letter ’A’.
The test scenarios were run with five different collection types and by using the three different approaches that are external iteration, sequential and parallel stream processing (see table 1.2). The average results of several runs are shown in table 1.2.
To verify the data taken out of the literature in the last section, a similar test should be performed where the number of elements of the collection (N = 20 Mill.) is much higher. To examine also the influence of boxing (autoboxing), the test scenario has been changed slightly. The test now deals with integers and checks a modulo-condition. This fulfills the requirements needed to emphasize the impact of autoboxing. The last method in table 1.3 shows the same test scenario, but by using an IntStream in parallel mode to avoid auto boxing, to emphasize also the impact of boxing.
Table 1.3 shows the measurement results. The trends of these results are the same like in the previous test case (compare with table 1.2), but in contrast to them, the influence on the execution time is more distinct, since the number of elements (N) is much higher.
The additional last row in table 1.2 shows the influence on execution time when omitting autoboxing, by using an IntStream instead of Stream<Integer>. As a result the execution time decreased dramatically, no matter which iteration approach was used (external, internal or parallel internal approach).
In contrast to the last test scenario, the number of elements is now very small (N = 10) while the cost (Q = 2 Mill.) of processing each element is much higher by iterating over a loop 2 million times.
The measurement results are represented in table 1.4. The finding clearly shows that the influence of the underlying data structure is negligible in this case. That is because the number of elements which has to be processed is with only 10 elements very low, and so the influence of a worse decomposability of some data structure is marginal. That leads to an improvement of execution time when using parallel stream processing, no matter which type of data structure the underlying collection has.
The last test should demonstrate the influence of blocking or synchronized functions. For that test scenario, only the collection type ArrayList is used, which is a perfect decomposable data structure and consists of 100.000 elements (N). The cost for processing each element (Q) is low since every element is only outputted to console without doing further operations. As the values in table 1.5 show, this task is not suitable for parallelism, because the output function is a blocking function. Because of the additional overhead, which has to be performed when using streams and parallel streams, the execution time is even higher when parallel stream processing is used.
 D. Lea, B. Goetz, P. Sandoz, A. Shipilev, H. Kabutz, and J. Bowbeer. When to use parallel streams. Online-Draft: http://gee.cs.oswego.edu/dl/html/ StreamParallelGuidance.html, Sept 2014.
 J. I. Moore. Iterating over collections in java 8. Java World, pages 1–36, Aug. 2014.
 R.G. Urma, M. Fuso, and A. Mycroft. Java 8 in Action. Manning, 2014.
 Richard Warburton. Java 8 Lambdas : Pragmatic Functional Programming. O’Reilly Media, Inc., 2014.