通过Wireshark和arthas排查由DNS引发的Ignite生产故障案例- 故障背景
- 故障分析
- 第一次定位问题
- 跨架构假设
- 生产重现故障
- Wireshark抓包分析
- 客户端10秒超时源码跟踪
- 结合日志查看服务端卡点
- 第二次故障定位
- arthas定位
- 通过trace查看耗时方法
- 通过thread查看线程
- jstack
- kill -3 pid
- 最终问题定位
- 结合ignite源代码回顾
- 客户端收集本地信息
- 服务端反序列化解析hostname
- 跨架构假设
- 生产重现故障
- Wireshark抓包分析
- 客户端10秒超时源码跟踪
- 结合日志查看服务端卡点
- arthas定位
- 通过trace查看耗时方法
- 通过thread查看线程
- jstack
- kill -3 pid
- 客户端收集本地信息
- 服务端反序列化解析hostname
故障背景
一次维护人员在上完线后,发现在分布式内存数据网格apache Ignite集群上通过客户端执行加载数据任务时,出现客户端节点连不上服务节点的问题。
生产环境情况: 分为A、B两中心 A中心 x86架构主机8台,为ignite服务节点 B中心 power架构主机8台,为ignite服务节点 Ignite版本为1.10,jdk为1.8
问题反馈: 当用x86 客户端连power架构的服务节点时,出现客户端节点连不上问题。而power 客户端连power服务端没有问题;x86 或者power客户端连x86节点也没有问题。
故障分析
由于开发人员log4j2日志配置不对,导致生产上没有看到错误日志
目前从维护人员得到的信息来看,猜测可能有2个原因导致故障: 1、跨架构导致故障,因为客户端x86架构,服务端power架构 2、网络问题导致故障,怀疑是跨中心,跨了网段导致故障
第一次定位问题
跨架构假设
针对第一点的假设,通过并行环境模拟,模拟不出故障,因此只能到生产环境重现故障,再做下一步定位。
生产重现故障
修改log4j2日志,得到报错信息(以下是后续本地环境模拟的,报错内容和生产一致): 服务端节点错误:
代码语言:javascript复制[2021-11-25T18:13:28,887][ERROR][tcp-disco-sock-reader-[045905d0 192.168.139.129:44809 client]-#5-#58][TcpDiscoverySpi] Caught exception on message read [sock=Socket[addr=/192.168.139.129,port=44809,localport=47500], locNodeId=5be3f5a4-08a1-4f46-aa14-e009963dc83c, rmtNodeId=045905d0-bcaa-4631-bd70-7cc1922a8422]
java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0_211]
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) ~[?:1.8.0_211]
at java.net.SocketOutputStream.write(SocketOutputStream.java:134) ~[?:1.8.0_211]
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1807) ~[ignite-core-2.10.0.jar:2.10.0]
at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.processJoinRequestMessage(ServerImpl.java:7581) ~[ignite-core-2.10.0.jar:2.10.0]
at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7092) [ignite-core-2.10.0.jar:2.10.0]
at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58) [ignite-core-2.10.0.jar:2.10.0]
客户端报错日志:
代码语言:javascript复制[2021-11-25T18:13:23,792][ERROR][tcp-client-disco-msg-worker-#4-#42][TcpDiscoverySpi] Exception on joining: Read timed out
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_211]
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_211]
at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_211]
at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_211]
at java.net.SocketInputStream.read(SocketInputStream.java:224) ~[?:1.8.0_211]
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.readReceipt(TcpDiscoverySpi.java:1904) ~[ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl.sendJoinRequest(ClientImpl.java:806) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl.joinTopology(ClientImpl.java:629) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl.access$1000(ClientImpl.java:150) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.tryJoin(ClientImpl.java:2108) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1751) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl$1.body(ClientImpl.java:317) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58) [ignite-01-1.0-SNAPSHOT.jar:?]
[18:13:23] Failed to connect to any address from IP finder (will retry to join topology every 2000 ms; change 'reconnectDelay' to configure the frequency of retries): [/192.168.139.128:47500]
[2021-11-25T18:13:27,042][ERROR][tcp-client-disco-msg-worker-#4-#42][TcpDiscoverySpi] Exception on joining: Failed to serialize object: TcpDiscoveryJoinRequestMessage [node=TcpDiscoveryNode [id=045905d0-bcaa-4631-bd70-7cc1922a8422, consistentId=045905d0-bcaa-4631-bd70-7cc1922a8422, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.139.129], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, /192.168.139.129:0], discPort=0, order=0, intOrder=0, lastExchangeTime=1637835172514, loc=true, ver=2.10.0#20211125-sha1:00000000, isClient=true], dataPacket=org.apache.ignite.spi.discovery.tcp.internal.DiscoveryDataPacket@29f3a5ed, super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=8e765965d71-045905d0-bcaa-4631-bd70-7cc1922a8422, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=true]]
org.apache.ignite.IgniteCheckedException: Failed to serialize object: TcpDiscoveryJoinRequestMessage [node=TcpDiscoveryNode [id=045905d0-bcaa-4631-bd70-7cc1922a8422, consistentId=045905d0-bcaa-4631-bd70-7cc1922a8422, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.139.129], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, /192.168.139.129:0], discPort=0, order=0, intOrder=0, lastExchangeTime=1637835172514, loc=true, ver=2.10.0#20211125-sha1:00000000, isClient=true], dataPacket=org.apache.ignite.spi.discovery.tcp.internal.DiscoveryDataPacket@29f3a5ed, super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=8e765965d71-045905d0-bcaa-4631-bd70-7cc1922a8422, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=true]]
at org.apache.ignite.marshaller.jdk.JdkMarshaller.marshal0(JdkMarshaller.java:102) ~[ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.marshal(AbstractNodeNameAwareMarshaller.java:68) ~[ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.internal.util.IgniteUtils.marshal(IgniteUtils.java:10597) ~[ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1764) ~[ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1701) ~[ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl.sendJoinRequest(ClientImpl.java:797) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl.joinTopology(ClientImpl.java:629) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl.access$1000(ClientImpl.java:150) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.tryJoin(ClientImpl.java:2108) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1751) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl$1.body(ClientImpl.java:317) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58) [ignite-01-1.0-SNAPSHOT.jar:?]
Caused by: java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0_211]
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) ~[?:1.8.0_211]
at java.net.SocketOutputStream.write(SocketOutputStream.java:155) ~[?:1.8.0_211]
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_211]
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[?:1.8.0_211]
at org.apache.ignite.marshaller.jdk.JdkMarshallerOutputStreamWrapper.flush(JdkMarshallerOutputStreamWrapper.java:58) ~[ignite-01-1.0-SNAPSHOT.jar:?]
at java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1823) ~[?:1.8.0_211]
at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719) ~[?:1.8.0_211]
at org.apache.ignite.marshaller.jdk.JdkMarshaller.marshal0(JdkMarshaller.java:99) ~[ignite-01-1.0-SNAPSHOT.jar:?]
... 12 more
Suppressed: java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0_211]
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) ~[?:1.8.0_211]
at java.net.SocketOutputStream.write(SocketOutputStream.java:155) ~[?:1.8.0_211]
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_211]
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[?:1.8.0_211]
at org.apache.ignite.marshaller.jdk.JdkMarshallerOutputStreamWrapper.flush(JdkMarshallerOutputStreamWrapper.java:58) ~[ignite-01-1.0-SNAPSHOT.jar:?]
at java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1823) ~[?:1.8.0_211]
at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719) ~[?:1.8.0_211]
at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740) ~[?:1.8.0_211]
at org.apache.ignite.marshaller.jdk.JdkMarshaller.marshal0(JdkMarshaller.java:100) ~[ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.marshal(AbstractNodeNameAwareMarshaller.java:68) ~[ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.internal.util.IgniteUtils.marshal(IgniteUtils.java:10597) ~[ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1764) ~[ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1701) ~[ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl.sendJoinRequest(ClientImpl.java:797) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl.joinTopology(ClientImpl.java:629) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl.access$1000(ClientImpl.java:150) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.tryJoin(ClientImpl.java:2108) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1751) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.discovery.tcp.ClientImpl$1.body(ClientImpl.java:317) [ignite-01-1.0-SNAPSHOT.jar:?]
at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58) [ignite-01-1.0-SNAPSHOT.jar:?]
Wireshark抓包分析
通过tcpdump分别对客户端,服务端进行抓包,来分析网络情况。 命令如下: tcpdump -i ens192 host 10.1.12.XXX and port 47500 -w server.pcap
47500 为Ignite客户端与服务器建立连接的端口号 ens192 指定网卡
服务端抓包如下:
TCP层,有个FLAGS字段,这个字段有以下几个标识:SYN, FIN, ACK, PSH, RST, URG。 SYN表示建立连接 FIN表示关闭连接 ACK表示响应 PSH表示有 DATA数据传输 RST表示连接重置
以下重点分析几处: 第10行结合源码分析,客户端发TcpDiscoveryJoinRequestMessage包到服务端,seq=440,ack=434,Len=14480
服务端顺利接收到,为13行,seq=434,ack=14920,Len=0
备注: 在已经建立好连接的TCP上(只考虑数据包和ack包),seq和ack的计算规则为 本次要发送的包的 seq = 上一个发送的包的seq 上一个发送的包的长度(不含包头) 本次要发送的包的 ack = 上一个接收到的包的seq 上一个接收到的包的长度(不含包头)
上面图中画圈的部分发现,第19行和第20行相差了10秒,而且客户端(10.48.131.13)发起[RST,ACK]重连。
接着重试后往下看:
在客户端重试的TCP包中,37行发现发送len为1的应答包(RES_OK),但是重试前服务端没有发送这个应答包,结合处理TcpDiscoveryJoinRequest源码看,代码为:
代码语言:javascript复制 private boolean processJoinRequestMessage(TcpDiscoveryJoinRequestMessage msg,
@Nullable ClientMessageWorker clientMsgWrk) throws IOException {
....
spi.writeToSocket(msg, sock, RES_OK, sockTimeout);
....
}
这里得到结论:服务端在第一次处理TcpDiscoveryJoinRequest包等待了很久,客户端socket到了超时时间10秒后,重连了,在重连后,后续都正常。
客户端10秒超时源码跟踪
ClientImpl 805行:
代码语言:javascript复制 return new T3<>(new SocketStream(sock),
spi.readReceipt(sock, timeoutHelper.nextTimeoutChunk(ackTimeout0)),
res.clientAck());
timeoutHelper.nextTimeoutChunk继续往下跟IgniteSpiOperationTimeoutHelper :
代码语言:javascript复制 public long nextTimeoutChunk(long dfltTimeout) throws IgniteSpiOperationTimeoutException {
long now = System.nanoTime();
long left;
// 因为timeoutEnabled为true,走到这一行
if (timeoutEnabled)
left = timeoutThreshold - now;
...
timeoutThreshold的值源于ClientImpl 722:
代码语言:javascript复制IgniteSpiOperationTimeoutHelper timeoutHelper = new IgniteSpiOperationTimeoutHelper(spi, true);
再跟到IgniteSpiOperationTimeoutHelper 58行
代码语言:javascript复制long timeout = (lastRelatedOperationTime > 0 ? lastRelatedOperationTime : System.nanoTime())
U.millisToNanos(srvOp ? adapter.failureDetectionTimeout() : adapter.clientFailureDetectionTimeout());
srvOp为true ,即System.nanoTime() failureDetectionTimeout,failureDetectionTimeout默认10秒
找到源码后,客户端修改超时时间setFailureDetectionTimeout示例如下:
代码语言:javascript复制 IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setClientMode(true);
cfg.setFailureDetectionTimeout(30000L);
结合日志查看服务端卡点
结合日志来看,11:05:50,288 服务端收到Initialized connection with remote client node ,11:06:05,325后,服务端打印日志Message has been received,这里得到结论,服务端卡住15秒
代码语言:javascript复制2021-11-08 11:05:50,288 INFO tcp-disco-sock-reader-[a0a3eaf9 10.45.179.99:24449 client]-#7%npbossdev-gridB-test%-#799%npbossdev-gridB-test% (org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:478) - Initialized connection with remote client node [nodeId=a0a3eaf9-b1c2-4a61-b581-7dfa545045f7, rmtAddr=/10.45.179.99:24449]
2021-11-08 11:06:05,325 DEBUG tcp-disco-sock-reader-[a0a3eaf9 10.45.179.99:24449 client]-#7%npbossdev-gridB-test%-#799%npbossdev-gridB-test% (org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:465) - Message has been received: TcpDiscoveryJoinRequestMessage [node=TcpDiscoveryNode [id=a0a3eaf9-b1c2-4a61-b581-7dfa545045f7, consistentId=a0a3eaf9-b1c2-4a61-b581-7dfa545045f7, addrs=ArrayList [10.45.179.99], sockAddrs=HashSet [/10.45.179.99:0], discPort=0, order=0, intOrder=0, lastExchangeTime=1636340750309, loc=false, ver=2.10.0#20211108-sha1:00000000, isClient=true], dataPacket=org.apache.ignite.spi.discovery.tcp.internal.DiscoveryDataPacket@637f47b5, super=TcpDiscoveryAbstractMessage [sndNodeId=a0a3eaf9-b1c2-4a61-b581-7dfa545045f7, id=3b2528dfc71-a0a3eaf9-b1c2-4a61-b581-7dfa545045f7, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=true]]
2021-11-08 11:06:05,326 ERROR tcp-disco-sock-reader-[a0a3eaf9 10.45.179.99:24449 client]-#7%npbossdev-gridB-test%-#799%npbossdev-gridB-test% (org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:498) - Caught exception on message read [sock=Socket[addr=/10.45.179.99,port=24449,localport=47500], locNodeId=faa30201-a16e-4035-bf56-b2901c6a1298, rmtNodeId=a0a3eaf9-b1c2-4a61-b581-7dfa545045f7]
java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0]
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:122) ~[?:1.8.0]
at java.net.SocketOutputStream.write(SocketOutputStream.java:145) ~[?:1.8.0]
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1807) ~[ignite-core-2.10.0.jar:2.10.0]
at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.processJoinRequestMessage(ServerImpl.java:7581) ~[ignite-core-2.10.0.jar:2.10.0]
at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7092) [ignite-core-2.10.0.jar:2.10.0]
at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58) [ignite-core-2.10.0.jar:2.10.0]
第二次故障定位
发现服务端有超时15秒的问题后,决定通过arthas继续跟踪生产哪个方法导致耗时这么久。
arthas定位
通过trace查看耗时方法
根据服务端报错信息发现org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body这个方法有问题
执行trace命令跟踪: trace org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader body ‘#cost>10000’
代码语言:javascript复制`---ts=2021-11-25 10:55:55;thread_name=tcp-disco-sock-reader-[]-#49-#468;id=1fa;is_daemon=false;priority=10;TCCL=sun.misc.Launcher$AppClassLoader@18b4aac2
`---[35092.814909ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader:body()
---[0.029574ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:getConfiguredNodeId() #6750
---[0.013376ms] org.apache.ignite.IgniteLogger:isInfoEnabled() #6756
---[0.417309ms] org.apache.ignite.IgniteLogger:info() #6757
---[0.081638ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:configureSocketOptions() #6767
---[2.029919ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:readMessage() #6821
---[0.024223ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:client() #6863
---[0.006424ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:creatorNodeId() #6865
---[0.047115ms] org.apache.ignite.internal.util.typedef.internal.U:id8() #6869
---[0.005641ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:client() #6870
---[0.200221ms] org.apache.ignite.internal.util.typedef.internal.U:enhanceThreadName() #6869
---[0.008674ms] org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNode:internalOrder() #6873
---[0.012978ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeResponse:<init>() #6873
---[0.075802ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:allNodesSupport() #6875
---[0.060064ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeResponse:setDiscoveryDataPacketCompression() #6875
---[0.00566ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:client() #6877
---[0.013879ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeResponse:clientAck() #6878
---[0.005648ms] org.apache.ignite.IgniteLogger:isDebugEnabled() #6941
---[0.026675ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:getEffectiveSocketTimeout() #6943
---[0.049792ms] org.apache.ignite.IgniteLogger:debug() #6942
---[0.004571ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:getEffectiveSocketTimeout() #6946
---[0.48207ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:writeToSocket() #6946
---[0.007472ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:client() #6959
---[0.030182ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$ClientMessageWorker:<init>() #6960
---[0.007268ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:access$3900() #6963
---[0.002032ms] org.apache.ignite.IgniteLogger:isDebugEnabled() #6992
---[0.023339ms] org.apache.ignite.IgniteLogger:debug() #6993
---[0.002474ms] org.apache.ignite.IgniteLogger:isInfoEnabled() #7001
---[0.010884ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:client() #7003
---[0.021883ms] org.apache.ignite.IgniteLogger:info() #7002
---[0.005246ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:getEffectiveSocketTimeout() #7060
---[0.007627ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader:isInterrupted() #7062
---[0.010837ms] org.apache.ignite.internal.processors.security.SecurityUtils:serializeVersion() #7064
---[0.006864ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:marshaller() #7066
---[0.001182ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:ignite() #7067
---[0.005046ms] org.apache.ignite.Ignite:configuration() #7067
---[0.004567ms] org.apache.ignite.internal.util.typedef.internal.U:resolveClassLoader() #7067
---[35086.430093ms] org.apache.ignite.internal.util.typedef.internal.U:unmarshal() #7066
---[0.021627ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryAbstractMessage:senderNodeId() #7069
---[0.025284ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:messageLogger() #7071
---[0.026901ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoveryImpl$DebugLogger:isDebugEnabled() #7073
---[0.073315ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoveryImpl$DebugLogger:debug() #7074
---[0.042802ms] org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryStatistics:onMessageReceived() #7076
---[0.010485ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryJoinRequestMessage:responded() #7091
---[0.272454ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader:processJoinRequestMessage() #7092 [throws Exception]
---[0.002176ms] org.apache.ignite.IgniteLogger:isDebugEnabled() #7331
---[0.61067ms] org.apache.ignite.internal.util.typedef.internal.U:error() #7332
---[0.002346ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader:isInterrupted() #7335
---[0.008792ms] org.apache.ignite.internal.util.typedef.X:hasCause() #7339
---[0.017524ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:access$9200() #7340
---[0.025243ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:onException() #7346
---[0.013936ms] org.apache.ignite.internal.processors.security.SecurityUtils:restoreDefaultSerializeVersion() #7352
---[0.001579ms] org.apache.ignite.IgniteLogger:isDebugEnabled() #7358
---[0.013629ms] org.apache.ignite.IgniteLogger:debug() #7359
---[0.004428ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:access$3900() #7362
---[0.002056ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$ClientMessageWorker:runner() #7364
---[0.004352ms] org.apache.ignite.internal.util.typedef.internal.U:interrupt() #7364
---[0.025334ms] org.apache.ignite.internal.util.typedef.internal.U:closeQuiet() #7367
---[0.00144ms] org.apache.ignite.IgniteLogger:isInfoEnabled() #7369
---[0.012279ms] org.apache.ignite.IgniteLogger:info() #7370
---[0.039086ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:isLocalNodeCoordinator() #7373
---[0.0024ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:access$1500() #7373
---[0.002276ms] org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing:hasRemoteServerNodes() #7373
---[0.005297ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:access$10200() #7374
`---[0.006973ms] org.apache.ignite.internal.util.typedef.internal.U:enhanceThreadName() #7374
发现unmarshal() 这个方法最耗时,继续trace跟踪: trace --skipJDKMethod false org.apache.ignite.marshaller.jdk.JdkMarshaller unmarshal0 ‘#cost>10000’ 备注: --skipJDKMethod false 开启这个可以跟踪jdk方法,以及开启options unsafe true(系统级别的类(即java.*)默认不能进行增强,需要增强是请参考这里的unsafe开关,增强系统类时请谨慎操作)
代码语言:javascript复制`---ts=2021-11-25 10:31:36;thread_name=tcp-disco-sock-reader-[5f6cbd0a 192.168.139.129:35461 client]-#45-#298;id=150;is_daemon=false;priority=10;TCCL=sun.misc.Launcher$AppClassLoader@18b4aac2
`---[35081.943106ms] org.apache.ignite.marshaller.jdk.JdkMarshaller:unmarshal0()
---[0.001963ms] org.apache.ignite.marshaller.jdk.JdkMarshallerInputStreamWrapper:<init>() #122
---[50.965952ms] org.apache.ignite.marshaller.jdk.JdkMarshallerObjectInputStream:<init>() #122
---[35030.958673ms] java.io.ObjectInputStream:readObject() #124
`---[0.002752ms] java.io.ObjectInputStream:close() #125
这里通过trace和结合查看源码,大体明白服务节点卡在反序列化代码
通过thread查看线程
本次也可以通过日志"ERROR tcp-disco-sock-reader-"得知发生报错的线程,从而通过arthas的thread查看tcp-disco-sock-reader线程卡在什么位置: 执行命令:thread -all | grep tcp-disco-sock-reader- 得到结果:
代码语言:javascript复制2362 tcp-disco-sock-reader-[0e1a224e 192.168.1 main 10 RUNNABLE 0.0 0.000 0:0.017 false false
根据线程ID继续查此线程卡在什么位置: 执行命令:thread 2362 得到结果:
代码语言:javascript复制"tcp-disco-sock-reader-[0e1a224e 192.168.139.129:39423 client]-#180-#2312" Id=2362 RUNNABLE (in native)
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
at java.net.InetAddress.getAllByName(InetAddress.java:1193)
at java.net.InetAddress.getAllByName(InetAddress.java:1127)
at java.net.InetAddress.getByName(InetAddress.java:1077)
at java.net.InetSocketAddress.<init>(InetSocketAddress.java:220)
at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9433)
at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9411)
at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNode.readExternal(TcpDiscoveryNode.java:614)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:2118)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2067)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:124)
at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:92)
at org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10408)
at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7066)
at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)
现在查到这里,已经确认hostname导致本次问题
jstack
jstack也可以导出卡住线程的堆栈信息,命令如下: jstack -l PID >> output.log ,PID可以通过JPS得到
得到结果后,过滤出tcp-disco-sock-reader线程信息
代码语言:javascript复制"tcp-disco-sock-reader-[1d3a8137 192.168.139.129:52379 client]-#9-#70" #95 prio=10 os_prio=0 tid=0x0000000000a20000 nid=0x251b runnable [0x00007f71d0c65000]
java.lang.Thread.State: RUNNABLE
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
at java.net.InetAddress.getAllByName(InetAddress.java:1193)
at java.net.InetAddress.getAllByName(InetAddress.java:1127)
at java.net.InetAddress.getByName(InetAddress.java:1077)
at java.net.InetSocketAddress.<init>(InetSocketAddress.java:220)
at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9433)
at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9411)
at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNode.readExternal(TcpDiscoveryNode.java:614)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:2118)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2067)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:124)
at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:92)
at org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10408)
at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7066)
at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)
Locked ownable synchronizers:
- None
kill -3 pid
由于power架构下 arthas或者jstack 不能使用,所以也可以通过kill -3 加上pid 导出卡住线程堆栈 命令如下: kill -3 pid ,pid通过jps可以查到 结果会输出到控制台,过滤出tcp-disco-sock-reader线程信息
代码语言:javascript复制"tcp-disco-sock-reader-[4eeee5c0 192.168.139.129:36905 client]-#4-#45" #65 prio=10 os_prio=0 tid=0x00000000010d6000 nid=0x2444 runnable [0x00007f20b869e000]
java.lang.Thread.State: RUNNABLE
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
at java.net.InetAddress.getAllByName(InetAddress.java:1193)
at java.net.InetAddress.getAllByName(InetAddress.java:1127)
at java.net.InetAddress.getByName(InetAddress.java:1077)
at java.net.InetSocketAddress.<init>(InetSocketAddress.java:220)
at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9433)
at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9411)
at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNode.readExternal(TcpDiscoveryNode.java:614)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:2118)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2067)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:124)
at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:92)
at org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10408)
at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7066)
at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)
最终问题定位
由上面分析,已经分析到是因为hostname导致本地故障,对照生产配置发现: power服务节点在/etc/hosts 配置了power客户端的hostname,但是没有配置x86客户端节点的hostname, 而x86服务节点服务端都配置了x86和power的客户端的hostname,所以会导致x86 客户端连power架构的服务节点时,出现客户端节点连不上问题。而x86 或者power客户端连x86节点是没有问题的。
对于这个问题,有个前提条件:客户端节点在自己这台的/etc/hosts 配置上了的hostname,如果没配置,是不会引发问题。
至于和DNS的关系,原因如下: 导致问题的代码为InetSocketAddress(String hostname, int port),InetSocketAddress会自动去解析hostname,也就是DNS解析,获取到IP,然后连接;
之前生产环境为什么没发现问题,是因为dns没有开启,本次由于其他业务上线,开启dns
相关配置如下: 原来没开启DNS: cat /etc/nsswitch.conf hosts: files 本次上线,维护在files后加上dns就会开启DNS,导致问题产生 hosts: files dns
15秒配置如下: 位于/etc/resolv.conf options timeout:5 attempts:3 rotate 超时时间5秒,重试次数为3
总结:这次引起bug的原因有2个,一是power服务端节点遗漏配置x86客户端的hostname,二是维护开启DNS,导致DNS解析hostname找不到,卡住15秒
结合ignite源代码回顾
客户端收集本地信息
首先客户端在启动加入节点前,会收集本地信息,这时候就会带上hostname,源码跟踪如下:
在IgniteUtils类里,
代码语言:javascript复制 public static IgniteBiTuple<Collection<String>, Collection<String>> resolveLocalAddresses(InetAddress locAddr,
boolean allHostNames) throws IOException {
...
for (InetAddress addr : locAddrs)
addresses(addr, addrs, hostNames, allHostNames);
...
}
在addresses(addr, addrs, hostNames, allHostNames)方法里,获得hostname 即InetAddress getHostName方法
代码语言:javascript复制 private static void addresses(InetAddress addr, Collection<String> addrs, Collection<String> hostNames,
boolean allHostNames) {
...
String hostName = addr.getHostName();
String ipAddr = addr.getHostAddress();
...
如果客户端节点在自己这台的/etc/hosts 配置上了自己的hostname,这时代码就会把hostname存在加入节点里(locNode)。
最后,ClientImpl 把要加入的客户端节点(包含上面的hostname),包装为TcpDiscoveryJoinRequestMessage,传给服务端
代码语言:javascript复制 ...
TcpDiscoveryNode node = locNode;
...
TcpDiscoveryJoinRequestMessage joinReqMsg = new TcpDiscoveryJoinRequestMessage(node, discoveryData);
...
msg = joinReqMsg;
...
spi.writeToSocket(sock, msg, timeoutHelper.nextTimeoutChunk(spi.getSocketTimeout()));
服务端反序列化解析hostname
客户端把TcpDiscoveryJoinRequestMessage发给服务端,服务端进行反序列化,代码为ServerImpl 7066行,
代码语言:javascript复制 TcpDiscoveryAbstractMessage msg = U.unmarshal(spi.marshaller(), in,
U.resolveClassLoader(spi.ignite().configuration()));
U.unmarshal底层调用ObjectInputStream的readObject方法,ObjectInputStream 里会调用readExternal接口,ignite里TcpDiscoveryJoinRequestMessage 重写了readExternal,在重写的readExternal里会初始化InetSocketAddress。
TcpDiscoveryJoinRequestMessage 主要包含TcpDiscoveryNode和DiscoveryDataPacket
TcpDiscoveryNode里包含writeExternal和readExternal
查看readExternal源码,包含初始化InetSocketAddress:
代码语言:javascript复制 @Override public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
...
sockAddrs = U.toSocketAddresses(this, discPort);
...
继续跟入U.toSocketAddresses源码,找到InetSocketAddress:
代码语言:javascript复制 public static Collection<InetSocketAddress> toSocketAddresses(Collection<String> addrs,
Collection<String> hostNames, int port) {
Set<InetSocketAddress> res = new HashSet<>(addrs.size());
Iterator<String> hostNamesIt = hostNames.iterator();
for (String addr : addrs) {
String hostName = hostNamesIt.hasNext() ? hostNamesIt.next() : null;
if (!F.isEmpty(hostName)) {
InetSocketAddress inetSockAddr = new InetSocketAddress(hostName, port);
...