使用Java 8处理并行数据库流

什么是并行数据库流？

阅读这篇文章，了解如何使用并行流和Speedment并行处理数据库中的数据。在许多情况下，并行流可能比通常的顺序流快得多。

随着Java 8的引入，我们得到了期待已久的Stream库。流的优点之一是使流并行非常容易。基本上，我们可以采用任何流，然后只应用方法parallel()获得并行流，而不是顺序流。默认情况下，并行流由公共ForkJoinPool执行。

尖塔和公爵并行工作

因此，如果我们有工作量相对较高的工作项，那么并行流通常是有意义的。如果要在并行流管线中执行的工作项在很大程度上未耦合并且在几个线程相对较低。同样，合并并行结果的努力也必须相对较低。

Speedment是开源的ORM Java工具包和RuntimeJava工具，它将现有的数据库及其表包装到Java 8流中。我们可以使用现有的数据库并运行Speedment工具，它将生成与我们使用该工具选择的表相对应的POJO类。

Speedment的一项很酷的功能是，数据库流使用标准的Stream语义支持并行性。 这样，与顺序处理流相比，我们可以轻松地并行处理数据库内容并更快地产生结果！

加速入门

访问GitHub上的开放源Speedment ，了解如何开始Speedment项目。将工具连接到现有数据库应该非常容易。

在本文中，下面的示例使用以下MySQL表。

CREATE TABLE `prime_candidate` (`id` int(11) NOT NULL AUTO_INCREMENT,`value` bigint(20) NOT NULL,`prime` bit(1) DEFAULT NULL,PRIMARY KEY (`id`)
) ENGINE=InnoDB;

这个想法是人们可以在表中插入值，然后我们将编写一个应用程序来计算插入的值是否是质数。在实际情况下，我们可以使用MySQL，PostgreSQL或MariaDB数据库中的任何表。

编写顺序流解决方案

首先，我们需要一个方法，如果值是素数则返回。这是一种简单的方法。请注意， 故意使算法变慢，因此我们可以清楚地了解并行流在昂贵的操作上的效果。

public class PrimeUtil {/*** Returns if the given parameter is a prime number.** @param n the given prime number candidate* @return if the given parameter is a prime number*/static boolean isPrime(long n) {// primes are equal or greater than 2 if (n < 2) {return false;}// check if n is evenif (n % 2 == 0) {// 2 is the only even prime// all other even n:s are notreturn n == 2;}// if odd, then just check the odds// up to the square root of n// for (int i = 3; i * i <= n; i += 2) {//// Make the methods purposely slow by// checking all the way up to nfor (int i = 3; i <= n; i += 2) {if (n % i == 0) {return false;}}return true;}}

同样，这篇文章的目的不是设计一种有效的质数确定方法。

使用这种简单的质数方法，我们现在可以轻松编写一个Speedment应用程序，该应用程序将扫描数据库表以查找未确定的质数候选者，然后将确定它们是否为质数并相应地更新表。看起来可能是这样：

final JavapotApplication app = new JavapotApplicationBuilder().withPassword("javapot") // Replace with the real password.withLogging(LogType.STREAM).build();final Manager<PrimeCandidate> candidates = app.getOrThrow(PrimeCandidateManager.class);candidates.stream().filter(PrimeCandidate.PRIME.isNull())                      // Filter out undetermined primes.map(pc -> pc.setPrime(PrimeUtil.isPrime(pc.getValue())))   // Sets if it is a prime or not.forEach(candidates.updater());                             // Applies the Manager's updater

最后一部分包含有趣的内容。首先，我们在“ prime”列为
使用stream().filter(PrimeCandidate.PRIME.isNull())方法stream().filter(PrimeCandidate.PRIME.isNull()) null 。重要的是要了解，Speedment流实现将识别过滤谓词，并能够使用该谓词来减少从数据库中实际提取的候选者的数量（例如，“ SELECT * FROM FROM WHERE prime IS NULL”将使用）。

然后，对每个这样的总理候选人PC，我们无论是“黄金”列设置为true ，如果pc.getValue()是一个主要的或false ，如果pc.getValue()是不是一个素数。有趣的是， pc.setPrime()方法返回实体pc本身，使我们能够轻松地标记多个流操作。在最后一行，我们通过应用candidates.updater()函数update。 candidates.updater()将检查结果更新数据库。因此，该应用程序的主要功能实际上是单行的（分为五行以提高可读性）。

现在，在测试应用程序之前，我们需要生成一些测试数据输入。这是使用Speedment如何完成的示例：

final JavapotApplication app = new JavapotApplicationBuilder().withPassword("javapot") // Replace with the real password.build();final Manager<PrimeCandidate> candidates = app.getOrThrow(PrimeCandidateManager.class);final Random random = new SecureRandom();// Create a bunch of new prime candidatesrandom.longs(1_100, 0, Integer.MAX_VALUE).mapToObj(new PrimeCandidateImpl()::setValue)  // Sets the random value .forEach(candidates.persister());              // Applies the Manager's persister function

同样，我们只需几行代码就可以完成我们的任务。

尝试默认的并行流

如果要并行化流，则只需向以前的解决方案中添加一个方法即可：

candidates.stream().parallel()                                 // Now indicates a parallel stream.filter(PrimeCandidate.PRIME.isNull()).map(pc -> pc.setPrime(PrimeUtil.isPrime(pc.getValue()))).forEach(candidates.updater());             // Applies the Manager's updater

和我们平行！但是，默认情况下，Speedment使用Java的默认并行化行为（在Spliterators::spliteratorUnknownSize定义），该行为针对非计算密集型操作进行了优化。如果我们分析Java的默认并行化行为，我们将确定它将对第一个1024个工作项使用第一个线程，对随后的2 * 1024 = 2048个工作项使用第二个线程，然后对第三个线程使用3 * 1024 = 3072个工作项线程等。

这对我们的应用程序不利，因为每个应用程序的成本都很高。如果我们正在计算1100个主要候选对象，我们将仅使用两个线程，因为第一个线程将处理前1024个项目，第二个线程将处理其余的76个项目。现代服务器的线程要多得多。阅读下一节，了解如何解决此问题。

内置并行化策略

速度有许多内置的并行化策略，我们可以根据工作项的预期计算需求进行选择。这是对仅具有一种默认策略的Java 8的改进。内置的并行策略是：

@FunctionalInterface
public interface ParallelStrategy {/*** A Parallel Strategy that is Java's default <code>Iterator</code> to* <code>Spliterator</code> converter. It favors relatively large sets (in* the ten thousands or more) with low computational overhead.** @return a ParallelStrategy*/static ParallelStrategy computeIntensityDefault() {...}/*** A Parallel Strategy that favors relatively small to medium sets with* medium computational overhead.** @return a ParallelStrategy*/static ParallelStrategy computeIntensityMedium() {...}/*** A Parallel Strategy that favors relatively small to medium sets with high* computational overhead.** @return a ParallelStrategy*/static ParallelStrategy computeIntensityHigh() {...}/*** A Parallel Strategy that favors small sets with extremely high* computational overhead. The set will be split up in solitary elements* that are executed separately in their own thread.** @return a ParallelStrategy*/static ParallelStrategy computeIntensityExtreme() {...}<T> Spliterator<T> spliteratorUnknownSize(Iterator<? extends T> iterator, int characteristics);static ParallelStrategy of(final int... batchSizes) {return new ParallelStrategy() {@Overridepublic <T> Spliterator<T> spliteratorUnknownSize(Iterator<? extends T> iterator, int characteristics) {return ConfigurableIteratorSpliterator.of(iterator, characteristics, batchSizes);}};}

应用并行策略

我们要做的唯一一件事就是为这样的管理器配置并行化策略，我们很高兴：

Manager<PrimeCandidate> candidatesHigh = app.configure(PrimeCandidateManager.class).withParallelStrategy(ParallelStrategy.computeIntensityHigh()).build();candidatesHigh.stream() // Better parallel performance for our case!.parallel().filter(PrimeCandidate.PRIME.isNull()).map(pc -> pc.setPrime(PrimeUtil.isPrime(pc.getValue()))).forEach(candidatesHigh.updater());

ParallelStrategy.computeIntensityHigh()策略会将工作项分解成更小的块。因为我们现在将使用所有可用的线程，所以这将使我们获得更好的性能。如果我们深入研究，可以看到该策略的定义如下：

private final static int[] BATCH_SIZES = IntStream.range(0, 8).map(ComputeIntensityUtil::toThePowerOfTwo).flatMap(ComputeIntensityUtil::repeatOnHalfAvailableProcessors).toArray();

这意味着，在具有8个线程的计算机上，它将在线程1-4上放置一个项目，在线程5-8上放置两个项目，当任务完成时，接下来的四个可用线程上将有四个项目，然后是八个依此类推，直到达到256，这是任何线程上的最大项目数。显然，对于该特定问题，此策略比Java的标准策略好得多。

这是常见的ForkJoinPool中的线程在我的8线程笔记本电脑上的样子：

共叉联合池

创建自己的并行策略

Speedment的一件很酷的事情是，我们可以很容易地编写并行化策略，然后将其注入流中。考虑以下自定义并行化策略：

public static class MyParallelStrategy implements ParallelStrategy {private final static int[] BATCH_SIZES = {1, 2, 4, 8};@Overridepublic <T> Spliterator<T> spliteratorUnknownSize(Iterator<? extends T> iterator, int characteristics) {return ConfigurableIteratorSpliterator.of(iterator, characteristics, BATCH_SIZES);}}

实际上，它可以表达得更短：

ParallelStrategy myParallelStrategy = ParallelStrategy.of(1, 2, 4, 8);

此策略将在第一个可用线程上放置一个工作项，在第二个可用线程上放置两个，在第三个线程上放置四个，在第四个线程上放置八个，其中八个是数组中的最后一位。最后一位数字将用于所有后续可用线程。因此，订单实际上变成了1、2、4、8、8、8、8...。现在，我们可以使用以下新策略：

Manager<PrimeCandidate> candidatesCustom = app.configure(PrimeCandidateManager.class).withParallelStrategy(myParallelStrategy).build();candidatesCustom.stream().parallel().filter(PrimeCandidate.PRIME.isNull()).map(pc -> pc.setPrime(PrimeUtil.isPrime(pc.getValue()))).forEach(candidatesCustom.updater());

瞧！我们完全控制工作项在可用执行线程上的布局方式。

基准测试

所有基准都使用相同的主要候选者输入。测试是在MacBook Pro，具有4个物理核心和8个线程的2.2 GHz Intel Core i7上进行的。

StrategySequential                       265 s (One thread processed all 1100 items)
Parallel Default Java 8          235 s (Because 1024 items were processed by thread 1 and 76 items by thread 2)
Parallel computeIntensityHigh()   69 s (All 4 hardware cores were used)