JVM Bugs That Slow Down Presto Queries, and How to Fix Them

Background

While maintaining our Presto clusters we ran into several JVM bugs that severely slowed down Presto queries. This post describes how those bugs show up and how to work around them.

Ref Proc takes too long

After our production Presto had been running for a while, queries got slower and slower. A flame graph showed that about 30% of the CPU was being wasted on young GC, and the GC log looked like this:

2020-12-21T20:50:15.009+0800: 4753827.928: [GC pause (GCLocker Initiated GC) (young) (initial-mark), 0.5723063 secs]
[Parallel Time: 66.9 ms, GC Workers: 33]
[GC Worker Start (ms): Min: 4753827940.4, Avg: 4753827940.5, Max: 4753827940.7, Diff: 0.4]
[Ext Root Scanning (ms): Min: 26.2, Avg: 27.3, Max: 54.9, Diff: 28.7, Sum: 899.6]
[Update RS (ms): Min: 0.0, Avg: 9.2, Max: 9.8, Diff: 9.8, Sum: 304.3]
[Processed Buffers: Min: 0, Avg: 19.9, Max: 35, Diff: 35, Sum: 657]
[Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Object Copy (ms): Min: 0.5, Avg: 11.2, Max: 13.8, Diff: 13.3, Sum: 369.9]
[Termination (ms): Min: 0.0, Avg: 7.8, Max: 8.4, Diff: 8.4, Sum: 256.8]
[Termination Attempts: Min: 1, Avg: 1.5, Max: 4, Diff: 3, Sum: 51]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.5]
[GC Worker Total (ms): Min: 55.3, Avg: 55.5, Max: 55.7, Diff: 0.5, Sum: 1831.3]
[GC Worker End (ms): Min: 4753827996.0, Avg: 4753827996.0, Max: 4753827996.1, Diff: 0.1]
[Code Root Fixup: 0.4 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.6 ms]
[Other: 504.4 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 501.3 ms]
[Ref Enq: 0.0 ms]
[Redirty Cards: 0.5 ms]
[Humongous Register: 0.2 ms]
[Humongous Reclaim: 0.0 ms]
[Free CSet: 0.2 ms]
[Eden: 320.0M(3264.0M)->0.0B(3392.0M) Survivors: 416.0M->288.0M Heap: 9068.7M(72.0G)->8759.2M(72.0G)]
[Times: user=1.70 sys=0.39, real=0.57 secs]

In the log above, Ref Proc took 501.3 ms out of a total young GC pause of 0.57 s, so reference processing alone accounts for almost the whole pause. The next step is to figure out where that reference-processing time actually goes.
Adding the GC flag -XX:+PrintReferenceGC breaks the reference-processing time down by reference type. With that in place, the log below shows that JNI Weak Reference is the main bottleneck.
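For context, the GC logging options behind output in this format would look roughly like the following (a sketch: only -XX:+PrintReferenceGC is mentioned above, and the other flags are standard JDK 8 options I am assuming were also in place):

-Xloggc:<gc_log_path>
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintReferenceGC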

2021-02-07T09:17:42.299+0800: 3865450.117: [GC pause (G1 Evacuation Pause) (mixed)2021-02-07T09:17:42.394+0800: 3865450.212: [SoftReference, 0 refs, 0.0020337 secs]2021-02-07T09:17:42.396+0800: 3865450.214: [WeakReference, 75 refs, 0.0014819 secs]2021-02-07T09:17:42.397+0800: 3865450.215: [FinalReference, 1447 refs, 0.0012562 secs]2021-02-07T09:17:42.398+0800: 3865450.217: [PhantomReference, 0 refs, 70 refs, 0.0021257 secs]2021-02-07T09:17:42.401+0800: 3865450.219: [JNI Weak Reference, 0.1966801 secs], 0.3046075 secs]
[Parallel Time: 91.6 ms, GC Workers: 33]
[GC Worker Start (ms): Min: 3865450119.3, Avg: 3865450119.5, Max: 3865450119.7, Diff: 0.4]
[Ext Root Scanning (ms): Min: 2.9, Avg: 7.5, Max: 88.0, Diff: 85.0, Sum: 247.0]
[Update RS (ms): Min: 0.0, Avg: 21.9, Max: 25.3, Diff: 25.3, Sum: 721.7]
[Processed Buffers: Min: 0, Avg: 27.6, Max: 40, Diff: 40, Sum: 912]
[Scan RS (ms): Min: 0.0, Avg: 12.3, Max: 13.9, Diff: 13.8, Sum: 406.1]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 1.3, Diff: 1.3, Sum: 2.1]
[Object Copy (ms): Min: 2.7, Avg: 48.7, Max: 57.2, Diff: 54.6, Sum: 1605.6]
[Termination (ms): Min: 0.0, Avg: 0.3, Max: 0.3, Diff: 0.3, Sum: 8.8]
[Termination Attempts: Min: 1, Avg: 123.7, Max: 152, Diff: 151, Sum: 4081]
[GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.4, Diff: 0.4, Sum: 4.3]
[GC Worker Total (ms): Min: 90.5, Avg: 90.8, Max: 91.1, Diff: 0.6, Sum: 2995.6]
[GC Worker End (ms): Min: 3865450210.2, Avg: 3865450210.3, Max: 3865450210.4, Diff: 0.2]
[Code Root Fixup: 0.1 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 1.4 ms]
[Other: 211.6 ms]
[Choose CSet: 1.2 ms]
[Ref Proc: 204.5 ms]
[Ref Enq: 0.5 ms]
[Redirty Cards: 1.4 ms]
[Humongous Register: 0.1 ms]
[Humongous Reclaim: 0.0 ms]
[Free CSet: 1.9 ms]
[Eden: 3200.0M(3200.0M)->0.0B(3200.0M) Survivors: 480.0M->480.0M Heap: 17.9G(72.0G)->14.8G(72.0G)]
[Times: user=2.41 sys=0.84, real=0.31 secs]

Next we need to find out why JNI Weak Reference processing takes so long. Running jstack against the process showed that the number of JNI global references had reached about 9.6 million.

Watching Presto over the following days, the JNI global reference count kept growing and never dropped; it only climbed, and young GC got slower and slower as a result. A manually triggered full GC did not bring it down either. After a few weeks, young GC can end up consuming more than 30% of CPU time.
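A quick way to keep an eye on that number over time (my own habit, not something from the original workflow) is to grep the summary line that jstack prints at the end of its output:

jstack <presto_worker_pid> | grep "JNI global references"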

Solution

Upgrade to the latest JVM release. For JDK 8 I upgraded to jdk1.8.0_271; after it had been running for another three weeks, checking with jstack showed the JNI global reference count no longer kept growing.

JVM deoptimization

I have covered this one a few times before, and plenty of people running Presto have hit it or asked about it. For the details see: https://mp.weixin.qq.com/s/OK39zkUj-kJfY6HYixj_Lw

This bug has two stages. In the first stage, older JDK 8 builds that trigger deoptimization can end up in Deoptimization::Action_none, doing nothing useful in what amounts to an endless loop. Newer JDK 8 builds fix that, but deoptimization can still drag performance down badly. A flame graph looks like the figure below:

If your flame graph shows the process spending most of its time in Deoptimization, you have hit this deoptimization problem, and it will hurt performance badly.
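If you do not have a flame graph handy, another way to confirm heavy deoptimization churn (an approach I am adding here, not one used above) is to start the JVM with -XX:+PrintCompilation and watch for hot methods that keep getting marked "made not entrant" and recompiled:

-XX:+PrintCompilation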

Solution

Adjust the JVM parameters below: a larger code cache plus higher per-method / per-bytecode recompilation cutoffs keep HotSpot from running out of code cache or giving up on hot methods and leaving them running in the interpreter:

-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000

Young GC Termination takes too long

2021-01-29T22:10:48.109+0800: 341.173: [GC pause (GCLocker Initiated GC) (young)2021-01-29T22:10:50.843+0800: 343.907: [SoftReference, 0 refs, 0.0000531 secs]2021-01-29T22:10:50.843+0800: 343.907: [WeakReference, 11 refs, 0.0000136 secs]2021-01-29T22:10:50.843+0800: 343.907: [FinalReference, 9 refs, 0.0000572 secs]2021-01-29T22:10:50.844+0800: 343.907: [PhantomReference, 75 refs, 154 refs, 0.0000679 secs]2021-01-29T22:10:50.844+0800: 343.907: [JNI Weak Reference, 0.0001989 secs], 2.7381984 secs]
[Parallel Time: 2732.6 ms, GC Workers: 28]
[GC Worker Start (ms): Min: 341174.0, Avg: 341174.1, Max: 341174.2, Diff: 0.2]
[Ext Root Scanning (ms): Min: 1.8, Avg: 2.1, Max: 6.9, Diff: 5.1, Sum: 60.0]
[Update RS (ms): Min: 0.0, Avg: 1.9, Max: 3.2, Diff: 3.2, Sum: 54.4]
[Processed Buffers: Min: 0, Avg: 20.6, Max: 46, Diff: 46, Sum: 577]
[Scan RS (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.1, Sum: 2.9]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.2]
[Object Copy (ms): Min: 172.4, Avg: 225.0, Max: 283.3, Diff: 110.9, Sum: 6299.9]
[Termination (ms): Min: 2443.7, Avg: 2503.1, Max: 2555.8, Diff: 112.1, Sum: 70085.7]
[Termination Attempts: Min: 16850, Avg: 18434.4, Max: 19585, Diff: 2735, Sum: 516163]
[GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 2.3]
[GC Worker Total (ms): Min: 2732.2, Avg: 2732.3, Max: 2732.5, Diff: 0.3, Sum: 76505.5]
[GC Worker End (ms): Min: 343906.4, Avg: 343906.4, Max: 343906.5, Diff: 0.2]
[Code Root Fixup: 0.1 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 1.3 ms]
[Other: 4.2 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 0.8 ms]
[Ref Enq: 0.0 ms]
[Redirty Cards: 0.3 ms]
[Humongous Register: 0.1 ms]
[Humongous Reclaim: 0.0 ms]
[Free CSet: 1.1 ms]
[Eden: 15.1G(14.9G)->0.0B(16.2G) Survivors: 672.0M->960.0M Heap: 22.8G(96.0G)->8251.1M(96.0G)]
[Times: user=76.24 sys=0.29, real=2.74 secs]

The GC log above shows Termination taking as long as 2555.8 ms, with Termination Attempts reaching 19585.

A flame graph looks like this:

You can see the young GC workers sitting in offer_termination, with roughly 30% of the CPU wasted in the spin lock (SpinPause).

This problem is very stubborn. We tried CMS, G1 and ZGC (on JDK 11; still slow, although the bottleneck was no longer in Termination), on both JDK 8 and JDK 11, and none of it helped. A colleague who works on the JVM explained that the termination logic is shared between CMS and G1, so if one of them hits it the other will too, and suggested trying ZGC on JDK 15 to see whether that solves it.

Worse, once this problem appears, a restart does not make it go away. To fix it we tried Alibaba's open source JDK, dragonwell8, and the problem disappeared. The reason appears to be a new flag that dragonwell adds and enables by default:

-XX:+UseOWSTTaskTerminator

The idea is roughly that for large objects you should not have many workers trying to steal tasks, because after stealing them all they can do is busy-wait.
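To check whether a given JDK build actually ships this flag (a quick sanity check I am adding here, not part of the original investigation), list the final flag values and grep for it:

java -XX:+PrintFlagsFinal -version | grep OWST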

Tencent appears to have run into a similar problem as well; their open source Kona 8 JDK also addresses it, as described in the document below:

https://github.com/tencentyun/qcloud-documents/blob/master/product/%E5%A4%A7%E6%95%B0%E6%8D%AE%E4%B8%8EAI/%E5%BC%B9%E6%80%A7MapReduce/EMR%20%E5%BC%80%E5%8F%91%E6%95%99%E7%A8%8B/EMR%E5%A4%A9%E7%A9%B9%E7%89%88%E6%9C%AC%E5%BC%80%E5%8F%91%E6%8C%87%E5%8D%97/EMR%E5%A4%A9%E7%A9%B9%E7%89%88%E6%9C%AC%E8%87%AA%E7%A0%94%20Kona%20JDK%E7%89%B9%E6%80%A7%E4%BB%8B%E7%BB%8D%E4%B8%8E%E4%BD%BF%E7%94%A8.md

GC worker threads balance load via work stealing to reduce GC pause times. In the original work-stealing algorithm, every idle GC thread keeps trying to steal work, wasting CPU; the new algorithm uses a single master thread to decide when stealing should happen and wakes up the other idle GC threads at the right moment, greatly reducing CPU waste and CPU contention between processes. OWST is supported in OpenJDK community builds from version 12 onwards; in the HiBench big data benchmark it shows up to a 30% performance improvement, 8% on average. However, when I tested Kona 8, the flag mentioned in that document did not exist and the symptom still reproduced.

Solution

To sum up the fix for this one: the problem has not been fixed in the community JDK yet. Try Alibaba's open source dragonwell; if that does not work, upgrade to JDK 15 and give ZGC a try.
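If you do go the JDK 15 + ZGC route, enabling it is a single flag (ZGC is a production feature as of JDK 15, so no experimental-options switch is needed):

-XX:+UseZGC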

Summary

When you run into a performance problem, the important thing is to pin down why it got slower: is it the engine, the JVM, or something else? If it is the JVM, you then need to figure out which phase is responsible, and when you suspect a bug it is worth checking the OpenJDK community to see whether it has already been fixed. It also pays to profile with perf regularly to find the actual bottleneck.

This post covered only three common JVM bugs that can seriously hurt Presto performance. Some companies running the newer PrestoSQL also see slowdowns caused by codegen, which we have not hit yet. And because new Presto releases tend to rely on many new JVM features, you can run into plenty of JVM bugs when building Presto as well; while migrating from PrestoDB to PrestoSQL, I once hit JVM compilation bugs five times in a single day. If you run into JVM problems, upgrade to the latest JVM release whenever possible to avoid known pitfalls.
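For the "profile with perf" advice above, one common recipe (my assumption about the tooling; the post does not say exactly which tools were used) is perf plus Brendan Gregg's FlameGraph scripts:

perf record -F 99 -g -p <presto_worker_pid> -- sleep 60
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > presto.svg

Note that resolving JIT-compiled Java frames usually also needs a perf map (for example via perf-map-agent); the GC worker and SpinPause frames discussed above are native and show up either way.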
