Presto Master JVM Core问题调研

背景

Presto master出现jvm coredump情况,排查问题,这里记录下排查过程。

排查过程

先看下JVM Coredump日志:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# Possible reasons:
# The system is out of physical RAM or swap space
# In 32 bit mode, the process size limit was hit
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full
# Use 64 bit Java on a 64 bit OS
# Decrease Java heap size (-Xmx/-Xms)
# Decrease number of Java threads
# Decrease Java thread stack sizes (-Xss)
# Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
# Out of Memory Error (os_linux.cpp:2640), pid=22120, tid=0x00007ef9d2ed3700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode linux-amd64 )
# Core dump written. Default location: /data1/cluster-data/core or core.22120 (max size 52428800 kB). To ensure a full core dump, try "ulimit -c unlimited" before starting Java again
#


--------------- T H R E A D ---------------

Current thread (0x00007eff5eaf5000): JavaThread "IPC Client (1499102350) connection to presto-host.nmg01/data-node:8020 from user" daemon [_thread_new, id=28463, stack(0x00007ef9d2e93000,0x00007ef9d2ed4000)]

Stack: [0x00007ef9d2e93000,0x00007ef9d2ed4000], sp=0x00007ef9d2ed2760, free space=253k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0xacb18a] VMError::report_and_die()+0x2ba
V [libjvm.so+0x4ff4db] report_vm_out_of_memory(char const*, int, unsigned long, VMErrorType, char const*)+0x8b
V [libjvm.so+0x927d23] os::Linux::commit_memory_impl(char*, unsigned long, bool)+0x103
V [libjvm.so+0x927dec] os::pd_commit_memory(char*, unsigned long, bool)+0xc
V [libjvm.so+0x9217ba] os::commit_memory(char*, unsigned long, bool)+0x2a
V [libjvm.so+0x9261df] os::pd_create_stack_guard_pages(char*, unsigned long)+0x7f
V [libjvm.so+0xa6ffce] JavaThread::create_stack_guard_pages()+0x5e
V [libjvm.so+0xa797b4] JavaThread::run()+0x34
V [libjvm.so+0x92a338] java_start(Thread*)+0x108
C [libpthread.so.0+0x7dc5] start_thread+0xc5

可以看到报错是显示资源不足而导致JVM Coredump了。我们需要先看下是否是内存导致的问题:

所以先看下JVM日志,如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
2020-09-09T16:20:57.056+0800: 501342.858: [GC pause (G1 Evacuation Pause) (young), 0.1361232 secs]
[Parallel Time: 116.8 ms, GC Workers: 28]
[GC Worker Start (ms): Min: 501342860.5, Avg: 501342860.9, Max: 501342861.4, Diff: 0.9]
[Ext Root Scanning (ms): Min: 2.3, Avg: 4.7, Max: 39.3, Diff: 36.9, Sum: 132.9]
[Update RS (ms): Min: 0.0, Avg: 6.2, Max: 8.4, Diff: 8.4, Sum: 173.4]
[Processed Buffers: Min: 0, Avg: 20.4, Max: 63, Diff: 63, Sum: 571]
[Scan RS (ms): Min: 0.0, Avg: 0.2, Max: 0.3, Diff: 0.3, Sum: 4.3]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.2]
[Object Copy (ms): Min: 77.0, Avg: 104.9, Max: 106.9, Diff: 29.9, Sum: 2937.9]
[Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
[Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 28]
[GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.2, Sum: 3.9]
[GC Worker Total (ms): Min: 115.6, Avg: 116.2, Max: 116.6, Diff: 1.0, Sum: 3252.8]
[GC Worker End (ms): Min: 501342977.0, Avg: 501342977.0, Max: 501342977.2, Diff: 0.2]
[Code Root Fixup: 0.2 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 2.1 ms]
[Other: 17.0 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 8.3 ms]
[Ref Enq: 1.2 ms]
[Redirty Cards: 0.4 ms]
[Humongous Register: 0.2 ms]
[Humongous Reclaim: 0.0 ms]
[Free CSet: 2.7 ms]
[Eden: 27.0G(27.0G)->0.0B(24.7G) Survivors: 384.0M->768.0M Heap: 39.2G(64.0G)->12.6G(64.0G)]
[Times: user=1.75 sys=1.55, real=0.14 secs]

可以看到GC日志是正常的,young gc后,内存占用8G左右。

之前写脚本统计下内存占用,查看了/proc/meminfo信息:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
MemTotal:       131779324 kB
MemFree: 28158200 kB
MemAvailable: 54607180 kB
Buffers: 229940 kB
Cached: 29816108 kB
SwapCached: 0 kB
Active: 94592988 kB
Inactive: 6479088 kB
Active(anon): 73867112 kB
Inactive(anon): 1373984 kB
Active(file): 20725876 kB
Inactive(file): 5105104 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 388 kB
Writeback: 0 kB
AnonPages: 71026488 kB
Mapped: 232460 kB
Shmem: 4215064 kB
Slab: 1416472 kB
SReclaimable: 1244120 kB
SUnreclaim: 172352 kB
KernelStack: 28192 kB
PageTables: 188212 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 65889660 kB
Committed_AS: 118328976 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 500096 kB
VmallocChunk: 34291843068 kB
HardwareCorrupted: 0 kB
AnonHugePages: 4096 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 208476 kB
DirectMap2M: 17494016 kB
DirectMap1G: 118489088 kB

可以看到 MemAvailable 还有50+G内存,说明当时内存是充足的。不是内存不足导致的问题。那再需要确认下是否是线程数太多导致的问题?

1
2
[presto@prestomaster00 cluster-data]$ grep "JavaThread" hs_err_pid22120.log | wc -l
30993

线程数接近3.1W,有些多,去server.log里查找是否有线程创建失败的Log,命令:

grep -C 70 “unable to create new native thread” server.log.1 | less

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
java.io.IOException: Failed on local exception: java.io.IOException: Couldn't set up IO streams; Host Details : local host is: "prestomaster00/xx.xx.xx.xx"; destination host is: "datanode":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
at org.apache.hadoop.ipc.Client.call(Client.java:1507)
at org.apache.hadoop.ipc.Client.call(Client.java:1441)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy218.getListing(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:580)
at sun.reflect.GeneratedMethodAccessor1052.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:253)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
at com.sun.proxy.$Proxy219.getListing(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2154)
at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1029)
at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1012)
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:957)
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:953)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:971)
at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1771)
at com.facebook.presto.hive.HadoopDirectoryLister.list(HadoopDirectoryLister.java:30)
at com.facebook.presto.hive.util.HiveFileIterator$FileStatusIterator.<init>(HiveFileIterator.java:130)
at com.facebook.presto.hive.util.HiveFileIterator$FileStatusIterator.<init>(HiveFileIterator.java:118)
at com.facebook.presto.hive.util.HiveFileIterator.getLocatedFileStatusRemoteIterator(HiveFileIterator.java:107)
at com.facebook.presto.hive.util.HiveFileIterator.computeNext(HiveFileIterator.java:100)
at com.facebook.presto.hive.util.HiveFileIterator.computeNext(HiveFileIterator.java:37)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
at java.util.Spliterators$IteratorSpliterator.tryAdvance(Spliterators.java:1811)
at java.util.stream.StreamSpliterators$WrappingSpliterator.lambda$initPartialTraversalState$0(StreamSpliterators.java:294)
at java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.fillBuffer(StreamSpliterators.java:206)
at java.util.stream.StreamSpliterators$AbstractWrappingSpliterator.doAdvance(StreamSpliterators.java:161)
at java.util.stream.StreamSpliterators$WrappingSpliterator.tryAdvance(StreamSpliterators.java:300)
at java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:259)
at com.facebook.presto.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:91)
at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:185)
at com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
at com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
at com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Couldn't set up IO streams
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:822)
at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:394)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1556)
at org.apache.hadoop.ipc.Client.call(Client.java:1480)
... 42 more
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:815)
... 45 more

发现确实是无法创建线程,因为内存是充足的,说明是线程数太多导致的,那我们看下系统允许创建的线程数是多少?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
[presto@prestomaster00 log]$ ulimit -a
core file size (blocks, -c) 52428800
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 600000
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1000000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 102400
cpu time (seconds, -t) unlimited
max user processes (-u) 100000
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

可以看到max user processes 是10W个,但是这个是系统限制某用户下最多可以运行多少进程或线程,这个值是表象,这个值是受全局的kernel.pid_max的值限制,那我们需要看下pid_max值:

1
2
[presto@prestomaster00 log]$ cat  /proc/sys/kernel/pid_max
40960

猜测超过了40960个进程,修改此值到10W,依然复现。

猜测线程数超过了JVM最大线程数,需要设置下JVM运行创建的线程数:

1
2
echo "200000" > /proc/sys/vm/max_map_count

默认65530,永久修改在/etc/sysctl.conf。在其他资源可用的前提下,单个JVM能开启的最大线程数是/proc/sys/vm/max_map_count的设置数的一半。

修改后,发现JVM不再Coredump。

但是这个只是临时解决,帮助避免节点频发JVM Coredump,核心还是看下为啥会创建3W+线程。