通过Wireshark和arthas排查由DNS引发的Ignite生产故障案例

2021-12-08 17:14:20 浏览数 (1)

通过Wireshark和arthas排查由DNS引发的Ignite生产故障案例
  • 故障背景
  • 故障分析
  • 第一次定位问题
    • 跨架构假设
    • 生产重现故障
    • Wireshark抓包分析
    • 客户端10秒超时源码跟踪
    • 结合日志查看服务端卡点
  • 第二次故障定位
    • arthas定位
      • 通过trace查看耗时方法
      • 通过thread查看线程
    • jstack
    • kill -3 pid
  • 最终问题定位
  • 结合ignite源代码回顾
    • 客户端收集本地信息
    • 服务端反序列化解析hostname

故障背景

一次维护人员在上完线后,发现在分布式内存数据网格apache Ignite集群上通过客户端执行加载数据任务时,出现客户端节点连不上服务节点的问题。

生产环境情况: 分为A、B两中心 A中心 x86架构主机8台,为ignite服务节点 B中心 power架构主机8台,为ignite服务节点 Ignite版本为1.10,jdk为1.8

问题反馈: 当用x86 客户端连power架构的服务节点时,出现客户端节点连不上问题。而power 客户端连power服务端没有问题;x86 或者power客户端连x86节点也没有问题。

故障分析

由于开发人员log4j2日志配置不对,导致生产上没有看到错误日志

目前从维护人员得到的信息来看,猜测可能有2个原因导致故障: 1、跨架构导致故障,因为客户端x86架构,服务端power架构 2、网络问题导致故障,怀疑是跨中心,跨了网段导致故障

第一次定位问题

跨架构假设

针对第一点的假设,通过并行环境模拟,模拟不出故障,因此只能到生产环境重现故障,再做下一步定位。

生产重现故障

修改log4j2日志,得到报错信息(以下是后续本地环境模拟的,报错内容和生产一致): 服务端节点错误:

代码语言:javascript复制
[2021-11-25T18:13:28,887][ERROR][tcp-disco-sock-reader-[045905d0 192.168.139.129:44809 client]-#5-#58][TcpDiscoverySpi] Caught exception on message read [sock=Socket[addr=/192.168.139.129,port=44809,localport=47500], locNodeId=5be3f5a4-08a1-4f46-aa14-e009963dc83c, rmtNodeId=045905d0-bcaa-4631-bd70-7cc1922a8422]
java.net.SocketException: Broken pipe (Write failed)
        at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0_211]
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) ~[?:1.8.0_211]
        at java.net.SocketOutputStream.write(SocketOutputStream.java:134) ~[?:1.8.0_211]
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1807) ~[ignite-core-2.10.0.jar:2.10.0]
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.processJoinRequestMessage(ServerImpl.java:7581) ~[ignite-core-2.10.0.jar:2.10.0]
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7092) [ignite-core-2.10.0.jar:2.10.0]
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58) [ignite-core-2.10.0.jar:2.10.0]

客户端报错日志:

代码语言:javascript复制
[2021-11-25T18:13:23,792][ERROR][tcp-client-disco-msg-worker-#4-#42][TcpDiscoverySpi] Exception on joining: Read timed out
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_211]
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_211]
        at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_211]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_211]
        at java.net.SocketInputStream.read(SocketInputStream.java:224) ~[?:1.8.0_211]
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.readReceipt(TcpDiscoverySpi.java:1904) ~[ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl.sendJoinRequest(ClientImpl.java:806) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl.joinTopology(ClientImpl.java:629) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl.access$1000(ClientImpl.java:150) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.tryJoin(ClientImpl.java:2108) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1751) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl$1.body(ClientImpl.java:317) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58) [ignite-01-1.0-SNAPSHOT.jar:?]
[18:13:23] Failed to connect to any address from IP finder (will retry to join topology every 2000 ms; change 'reconnectDelay' to configure the frequency of retries): [/192.168.139.128:47500]
[2021-11-25T18:13:27,042][ERROR][tcp-client-disco-msg-worker-#4-#42][TcpDiscoverySpi] Exception on joining: Failed to serialize object: TcpDiscoveryJoinRequestMessage [node=TcpDiscoveryNode [id=045905d0-bcaa-4631-bd70-7cc1922a8422, consistentId=045905d0-bcaa-4631-bd70-7cc1922a8422, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.139.129], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, /192.168.139.129:0], discPort=0, order=0, intOrder=0, lastExchangeTime=1637835172514, loc=true, ver=2.10.0#20211125-sha1:00000000, isClient=true], dataPacket=org.apache.ignite.spi.discovery.tcp.internal.DiscoveryDataPacket@29f3a5ed, super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=8e765965d71-045905d0-bcaa-4631-bd70-7cc1922a8422, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=true]]
org.apache.ignite.IgniteCheckedException: Failed to serialize object: TcpDiscoveryJoinRequestMessage [node=TcpDiscoveryNode [id=045905d0-bcaa-4631-bd70-7cc1922a8422, consistentId=045905d0-bcaa-4631-bd70-7cc1922a8422, addrs=ArrayList [0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.139.129], sockAddrs=HashSet [/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, /192.168.139.129:0], discPort=0, order=0, intOrder=0, lastExchangeTime=1637835172514, loc=true, ver=2.10.0#20211125-sha1:00000000, isClient=true], dataPacket=org.apache.ignite.spi.discovery.tcp.internal.DiscoveryDataPacket@29f3a5ed, super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=8e765965d71-045905d0-bcaa-4631-bd70-7cc1922a8422, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=true]]
        at org.apache.ignite.marshaller.jdk.JdkMarshaller.marshal0(JdkMarshaller.java:102) ~[ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.marshal(AbstractNodeNameAwareMarshaller.java:68) ~[ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.util.IgniteUtils.marshal(IgniteUtils.java:10597) ~[ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1764) ~[ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1701) ~[ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl.sendJoinRequest(ClientImpl.java:797) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl.joinTopology(ClientImpl.java:629) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl.access$1000(ClientImpl.java:150) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.tryJoin(ClientImpl.java:2108) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1751) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.discovery.tcp.ClientImpl$1.body(ClientImpl.java:317) [ignite-01-1.0-SNAPSHOT.jar:?]
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58) [ignite-01-1.0-SNAPSHOT.jar:?]
Caused by: java.net.SocketException: Broken pipe (Write failed)
        at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0_211]
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) ~[?:1.8.0_211]
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155) ~[?:1.8.0_211]
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_211]
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[?:1.8.0_211]
        at org.apache.ignite.marshaller.jdk.JdkMarshallerOutputStreamWrapper.flush(JdkMarshallerOutputStreamWrapper.java:58) ~[ignite-01-1.0-SNAPSHOT.jar:?]
        at java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1823) ~[?:1.8.0_211]
        at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719) ~[?:1.8.0_211]
        at org.apache.ignite.marshaller.jdk.JdkMarshaller.marshal0(JdkMarshaller.java:99) ~[ignite-01-1.0-SNAPSHOT.jar:?]
        ... 12 more
        Suppressed: java.net.SocketException: Broken pipe (Write failed)
                at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0_211]
                at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) ~[?:1.8.0_211]
                at java.net.SocketOutputStream.write(SocketOutputStream.java:155) ~[?:1.8.0_211]
                at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_211]
                at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[?:1.8.0_211]
                at org.apache.ignite.marshaller.jdk.JdkMarshallerOutputStreamWrapper.flush(JdkMarshallerOutputStreamWrapper.java:58) ~[ignite-01-1.0-SNAPSHOT.jar:?]
                at java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1823) ~[?:1.8.0_211]
                at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719) ~[?:1.8.0_211]
                at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740) ~[?:1.8.0_211]
                at org.apache.ignite.marshaller.jdk.JdkMarshaller.marshal0(JdkMarshaller.java:100) ~[ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.marshal(AbstractNodeNameAwareMarshaller.java:68) ~[ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.internal.util.IgniteUtils.marshal(IgniteUtils.java:10597) ~[ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1764) ~[ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1701) ~[ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.discovery.tcp.ClientImpl.sendJoinRequest(ClientImpl.java:797) [ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.discovery.tcp.ClientImpl.joinTopology(ClientImpl.java:629) [ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.discovery.tcp.ClientImpl.access$1000(ClientImpl.java:150) [ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.tryJoin(ClientImpl.java:2108) [ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1751) [ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.discovery.tcp.ClientImpl$1.body(ClientImpl.java:317) [ignite-01-1.0-SNAPSHOT.jar:?]
                at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58) [ignite-01-1.0-SNAPSHOT.jar:?]

Wireshark抓包分析

通过tcpdump分别对客户端,服务端进行抓包,来分析网络情况。 命令如下: tcpdump -i ens192 host 10.1.12.XXX and port 47500 -w server.pcap

47500 为Ignite客户端与服务器建立连接的端口号 ens192 指定网卡

服务端抓包如下:

TCP层,有个FLAGS字段,这个字段有以下几个标识:SYN, FIN, ACK, PSH, RST, URG。 SYN表示建立连接 FIN表示关闭连接 ACK表示响应 PSH表示有 DATA数据传输 RST表示连接重置

以下重点分析几处: 第10行结合源码分析,客户端发TcpDiscoveryJoinRequestMessage包到服务端,seq=440,ack=434,Len=14480

服务端顺利接收到,为13行,seq=434,ack=14920,Len=0

备注: 在已经建立好连接的TCP上(只考虑数据包和ack包),seq和ack的计算规则为 本次要发送的包的 seq = 上一个发送的包的seq 上一个发送的包的长度(不含包头) 本次要发送的包的 ack = 上一个接收到的包的seq 上一个接收到的包的长度(不含包头)

上面图中画圈的部分发现,第19行和第20行相差了10秒,而且客户端(10.48.131.13)发起[RST,ACK]重连。

接着重试后往下看:

在客户端重试的TCP包中,37行发现发送len为1的应答包(RES_OK),但是重试前服务端没有发送这个应答包,结合处理TcpDiscoveryJoinRequest源码看,代码为:

代码语言:javascript复制
 private boolean processJoinRequestMessage(TcpDiscoveryJoinRequestMessage msg,
            @Nullable ClientMessageWorker clientMsgWrk) throws IOException {
....
 spi.writeToSocket(msg, sock, RES_OK, sockTimeout);
....
}

这里得到结论:服务端在第一次处理TcpDiscoveryJoinRequest包等待了很久,客户端socket到了超时时间10秒后,重连了,在重连后,后续都正常。

客户端10秒超时源码跟踪

ClientImpl 805行:

代码语言:javascript复制
 return new T3<>(new SocketStream(sock),
     spi.readReceipt(sock, timeoutHelper.nextTimeoutChunk(ackTimeout0)),
     res.clientAck());

timeoutHelper.nextTimeoutChunk继续往下跟IgniteSpiOperationTimeoutHelper :

代码语言:javascript复制
    public long nextTimeoutChunk(long dfltTimeout) throws IgniteSpiOperationTimeoutException {
        long now = System.nanoTime();
        long left;
        // 因为timeoutEnabled为true,走到这一行
        if (timeoutEnabled)
            left = timeoutThreshold - now;
            ...

timeoutThreshold的值源于ClientImpl 722:

代码语言:javascript复制
IgniteSpiOperationTimeoutHelper timeoutHelper = new IgniteSpiOperationTimeoutHelper(spi, true);  

再跟到IgniteSpiOperationTimeoutHelper 58行

代码语言:javascript复制
long timeout = (lastRelatedOperationTime > 0 ? lastRelatedOperationTime : System.nanoTime())  
    U.millisToNanos(srvOp ? adapter.failureDetectionTimeout() : adapter.clientFailureDetectionTimeout());

srvOp为true ,即System.nanoTime() failureDetectionTimeout,failureDetectionTimeout默认10秒

找到源码后,客户端修改超时时间setFailureDetectionTimeout示例如下:

代码语言:javascript复制
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientMode(true);
        cfg.setFailureDetectionTimeout(30000L);

结合日志查看服务端卡点

结合日志来看,11:05:50,288 服务端收到Initialized connection with remote client node ,11:06:05,325后,服务端打印日志Message has been received,这里得到结论,服务端卡住15秒

代码语言:javascript复制
2021-11-08 11:05:50,288 INFO  tcp-disco-sock-reader-[a0a3eaf9 10.45.179.99:24449 client]-#7%npbossdev-gridB-test%-#799%npbossdev-gridB-test% (org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:478) - Initialized connection with remote client node [nodeId=a0a3eaf9-b1c2-4a61-b581-7dfa545045f7, rmtAddr=/10.45.179.99:24449]
2021-11-08 11:06:05,325 DEBUG tcp-disco-sock-reader-[a0a3eaf9 10.45.179.99:24449 client]-#7%npbossdev-gridB-test%-#799%npbossdev-gridB-test% (org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:465) - Message has been received: TcpDiscoveryJoinRequestMessage [node=TcpDiscoveryNode [id=a0a3eaf9-b1c2-4a61-b581-7dfa545045f7, consistentId=a0a3eaf9-b1c2-4a61-b581-7dfa545045f7, addrs=ArrayList [10.45.179.99], sockAddrs=HashSet [/10.45.179.99:0], discPort=0, order=0, intOrder=0, lastExchangeTime=1636340750309, loc=false, ver=2.10.0#20211108-sha1:00000000, isClient=true], dataPacket=org.apache.ignite.spi.discovery.tcp.internal.DiscoveryDataPacket@637f47b5, super=TcpDiscoveryAbstractMessage [sndNodeId=a0a3eaf9-b1c2-4a61-b581-7dfa545045f7, id=3b2528dfc71-a0a3eaf9-b1c2-4a61-b581-7dfa545045f7, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=true]]
2021-11-08 11:06:05,326 ERROR tcp-disco-sock-reader-[a0a3eaf9 10.45.179.99:24449 client]-#7%npbossdev-gridB-test%-#799%npbossdev-gridB-test% (org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:498) - Caught exception on message read [sock=Socket[addr=/10.45.179.99,port=24449,localport=47500], locNodeId=faa30201-a16e-4035-bf56-b2901c6a1298, rmtNodeId=a0a3eaf9-b1c2-4a61-b581-7dfa545045f7]
java.net.SocketException: Broken pipe (Write failed)
	at java.net.SocketOutputStream.socketWrite0(Native Method) ~[?:1.8.0]
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:122) ~[?:1.8.0]
	at java.net.SocketOutputStream.write(SocketOutputStream.java:145) ~[?:1.8.0]
	at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.writeToSocket(TcpDiscoverySpi.java:1807) ~[ignite-core-2.10.0.jar:2.10.0]
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.processJoinRequestMessage(ServerImpl.java:7581) ~[ignite-core-2.10.0.jar:2.10.0]
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7092) [ignite-core-2.10.0.jar:2.10.0]
	at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58) [ignite-core-2.10.0.jar:2.10.0]

第二次故障定位

发现服务端有超时15秒的问题后,决定通过arthas继续跟踪生产哪个方法导致耗时这么久。

arthas定位

通过trace查看耗时方法

根据服务端报错信息发现org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body这个方法有问题

执行trace命令跟踪: trace org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader body ‘#cost>10000’

代码语言:javascript复制
`---ts=2021-11-25 10:55:55;thread_name=tcp-disco-sock-reader-[]-#49-#468;id=1fa;is_daemon=false;priority=10;TCCL=sun.misc.Launcher$AppClassLoader@18b4aac2
    `---[35092.814909ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader:body()
         ---[0.029574ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:getConfiguredNodeId() #6750
         ---[0.013376ms] org.apache.ignite.IgniteLogger:isInfoEnabled() #6756
         ---[0.417309ms] org.apache.ignite.IgniteLogger:info() #6757
         ---[0.081638ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:configureSocketOptions() #6767
         ---[2.029919ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:readMessage() #6821
         ---[0.024223ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:client() #6863
         ---[0.006424ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:creatorNodeId() #6865
         ---[0.047115ms] org.apache.ignite.internal.util.typedef.internal.U:id8() #6869
         ---[0.005641ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:client() #6870
         ---[0.200221ms] org.apache.ignite.internal.util.typedef.internal.U:enhanceThreadName() #6869
         ---[0.008674ms] org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNode:internalOrder() #6873
         ---[0.012978ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeResponse:<init>() #6873
         ---[0.075802ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:allNodesSupport() #6875
         ---[0.060064ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeResponse:setDiscoveryDataPacketCompression() #6875
         ---[0.00566ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:client() #6877
         ---[0.013879ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeResponse:clientAck() #6878
         ---[0.005648ms] org.apache.ignite.IgniteLogger:isDebugEnabled() #6941
         ---[0.026675ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:getEffectiveSocketTimeout() #6943
         ---[0.049792ms] org.apache.ignite.IgniteLogger:debug() #6942
         ---[0.004571ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:getEffectiveSocketTimeout() #6946
         ---[0.48207ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:writeToSocket() #6946
         ---[0.007472ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:client() #6959
         ---[0.030182ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$ClientMessageWorker:<init>() #6960
         ---[0.007268ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:access$3900() #6963
         ---[0.002032ms] org.apache.ignite.IgniteLogger:isDebugEnabled() #6992
         ---[0.023339ms] org.apache.ignite.IgniteLogger:debug() #6993
         ---[0.002474ms] org.apache.ignite.IgniteLogger:isInfoEnabled() #7001
         ---[0.010884ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryHandshakeRequest:client() #7003
         ---[0.021883ms] org.apache.ignite.IgniteLogger:info() #7002
         ---[0.005246ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:getEffectiveSocketTimeout() #7060
         ---[0.007627ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader:isInterrupted() #7062
         ---[0.010837ms] org.apache.ignite.internal.processors.security.SecurityUtils:serializeVersion() #7064
         ---[0.006864ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:marshaller() #7066
         ---[0.001182ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi:ignite() #7067
         ---[0.005046ms] org.apache.ignite.Ignite:configuration() #7067
         ---[0.004567ms] org.apache.ignite.internal.util.typedef.internal.U:resolveClassLoader() #7067
         ---[35086.430093ms] org.apache.ignite.internal.util.typedef.internal.U:unmarshal() #7066
         ---[0.021627ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryAbstractMessage:senderNodeId() #7069
         ---[0.025284ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:messageLogger() #7071
         ---[0.026901ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoveryImpl$DebugLogger:isDebugEnabled() #7073
         ---[0.073315ms] org.apache.ignite.spi.discovery.tcp.TcpDiscoveryImpl$DebugLogger:debug() #7074
         ---[0.042802ms] org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryStatistics:onMessageReceived() #7076
         ---[0.010485ms] org.apache.ignite.spi.discovery.tcp.messages.TcpDiscoveryJoinRequestMessage:responded() #7091
         ---[0.272454ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader:processJoinRequestMessage() #7092 [throws Exception]
         ---[0.002176ms] org.apache.ignite.IgniteLogger:isDebugEnabled() #7331
         ---[0.61067ms] org.apache.ignite.internal.util.typedef.internal.U:error() #7332
         ---[0.002346ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader:isInterrupted() #7335
         ---[0.008792ms] org.apache.ignite.internal.util.typedef.X:hasCause() #7339
         ---[0.017524ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:access$9200() #7340
         ---[0.025243ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:onException() #7346
         ---[0.013936ms] org.apache.ignite.internal.processors.security.SecurityUtils:restoreDefaultSerializeVersion() #7352
         ---[0.001579ms] org.apache.ignite.IgniteLogger:isDebugEnabled() #7358
         ---[0.013629ms] org.apache.ignite.IgniteLogger:debug() #7359
         ---[0.004428ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:access$3900() #7362
         ---[0.002056ms] org.apache.ignite.spi.discovery.tcp.ServerImpl$ClientMessageWorker:runner() #7364
         ---[0.004352ms] org.apache.ignite.internal.util.typedef.internal.U:interrupt() #7364
         ---[0.025334ms] org.apache.ignite.internal.util.typedef.internal.U:closeQuiet() #7367
         ---[0.00144ms] org.apache.ignite.IgniteLogger:isInfoEnabled() #7369
         ---[0.012279ms] org.apache.ignite.IgniteLogger:info() #7370
         ---[0.039086ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:isLocalNodeCoordinator() #7373
         ---[0.0024ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:access$1500() #7373
         ---[0.002276ms] org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing:hasRemoteServerNodes() #7373
         ---[0.005297ms] org.apache.ignite.spi.discovery.tcp.ServerImpl:access$10200() #7374
        `---[0.006973ms] org.apache.ignite.internal.util.typedef.internal.U:enhanceThreadName() #7374

发现unmarshal() 这个方法最耗时,继续trace跟踪: trace --skipJDKMethod false org.apache.ignite.marshaller.jdk.JdkMarshaller unmarshal0 ‘#cost>10000’ 备注: --skipJDKMethod false 开启这个可以跟踪jdk方法,以及开启options unsafe true(系统级别的类(即java.*)默认不能进行增强,需要增强是请参考这里的unsafe开关,增强系统类时请谨慎操作)

代码语言:javascript复制
`---ts=2021-11-25 10:31:36;thread_name=tcp-disco-sock-reader-[5f6cbd0a 192.168.139.129:35461 client]-#45-#298;id=150;is_daemon=false;priority=10;TCCL=sun.misc.Launcher$AppClassLoader@18b4aac2
    `---[35081.943106ms] org.apache.ignite.marshaller.jdk.JdkMarshaller:unmarshal0()
         ---[0.001963ms] org.apache.ignite.marshaller.jdk.JdkMarshallerInputStreamWrapper:<init>() #122
         ---[50.965952ms] org.apache.ignite.marshaller.jdk.JdkMarshallerObjectInputStream:<init>() #122
         ---[35030.958673ms] java.io.ObjectInputStream:readObject() #124
        `---[0.002752ms] java.io.ObjectInputStream:close() #125

这里通过trace和结合查看源码,大体明白服务节点卡在反序列化代码

通过thread查看线程

本次也可以通过日志"ERROR tcp-disco-sock-reader-"得知发生报错的线程,从而通过arthas的thread查看tcp-disco-sock-reader线程卡在什么位置: 执行命令:thread -all | grep tcp-disco-sock-reader- 得到结果:

代码语言:javascript复制
2362   tcp-disco-sock-reader-[0e1a224e 192.168.1 main                 10            RUNNABLE     0.0           0.000         0:0.017       false         false   

根据线程ID继续查此线程卡在什么位置: 执行命令:thread 2362 得到结果:

代码语言:javascript复制
"tcp-disco-sock-reader-[0e1a224e 192.168.139.129:39423 client]-#180-#2312" Id=2362 RUNNABLE (in native)
    at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
    at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
    at java.net.InetAddress.getAllByName(InetAddress.java:1193)
    at java.net.InetAddress.getAllByName(InetAddress.java:1127)
    at java.net.InetAddress.getByName(InetAddress.java:1077)
    at java.net.InetSocketAddress.<init>(InetSocketAddress.java:220)
    at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9433)
    at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9411)
    at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNode.readExternal(TcpDiscoveryNode.java:614)
    at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:2118)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2067)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
    at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:124)
    at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:92)
    at org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10408)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7066)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)

现在查到这里,已经确认hostname导致本次问题

jstack

jstack也可以导出卡住线程的堆栈信息,命令如下: jstack -l PID >> output.log ,PID可以通过JPS得到

得到结果后,过滤出tcp-disco-sock-reader线程信息

代码语言:javascript复制
"tcp-disco-sock-reader-[1d3a8137 192.168.139.129:52379 client]-#9-#70" #95 prio=10 os_prio=0 tid=0x0000000000a20000 nid=0x251b runnable [0x00007f71d0c65000]
   java.lang.Thread.State: RUNNABLE
	at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
	at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
	at java.net.InetAddress.getAllByName(InetAddress.java:1193)
	at java.net.InetAddress.getAllByName(InetAddress.java:1127)
	at java.net.InetAddress.getByName(InetAddress.java:1077)
	at java.net.InetSocketAddress.<init>(InetSocketAddress.java:220)
	at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9433)
	at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9411)
	at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNode.readExternal(TcpDiscoveryNode.java:614)
	at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:2118)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2067)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
	at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:124)
	at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:92)
	at org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10408)
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7066)
	at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)

   Locked ownable synchronizers:
	- None

kill -3 pid

由于power架构下 arthas或者jstack 不能使用,所以也可以通过kill -3 加上pid 导出卡住线程堆栈 命令如下: kill -3 pid ,pid通过jps可以查到 结果会输出到控制台,过滤出tcp-disco-sock-reader线程信息

代码语言:javascript复制
"tcp-disco-sock-reader-[4eeee5c0 192.168.139.129:36905 client]-#4-#45" #65 prio=10 os_prio=0 tid=0x00000000010d6000 nid=0x2444 runnable [0x00007f20b869e000]
   java.lang.Thread.State: RUNNABLE
	at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
	at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
	at java.net.InetAddress.getAllByName(InetAddress.java:1193)
	at java.net.InetAddress.getAllByName(InetAddress.java:1127)
	at java.net.InetAddress.getByName(InetAddress.java:1077)
	at java.net.InetSocketAddress.<init>(InetSocketAddress.java:220)
	at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9433)
	at org.apache.ignite.internal.util.IgniteUtils.toSocketAddresses(IgniteUtils.java:9411)
	at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNode.readExternal(TcpDiscoveryNode.java:614)
	at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:2118)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2067)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
	at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:124)
	at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:92)
	at org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10408)
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:7066)
	at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:58)

最终问题定位

由上面分析,已经分析到是因为hostname导致本地故障,对照生产配置发现: power服务节点在/etc/hosts 配置了power客户端的hostname,但是没有配置x86客户端节点的hostname, 而x86服务节点服务端都配置了x86和power的客户端的hostname,所以会导致x86 客户端连power架构的服务节点时,出现客户端节点连不上问题。而x86 或者power客户端连x86节点是没有问题的。

对于这个问题,有个前提条件:客户端节点在自己这台的/etc/hosts 配置上了的hostname,如果没配置,是不会引发问题。

至于和DNS的关系,原因如下: 导致问题的代码为InetSocketAddress(String hostname, int port),InetSocketAddress会自动去解析hostname,也就是DNS解析,获取到IP,然后连接;

之前生产环境为什么没发现问题,是因为dns没有开启,本次由于其他业务上线,开启dns

相关配置如下: 原来没开启DNS: cat /etc/nsswitch.conf hosts: files 本次上线,维护在files后加上dns就会开启DNS,导致问题产生 hosts: files dns

15秒配置如下: 位于/etc/resolv.conf options timeout:5 attempts:3 rotate 超时时间5秒,重试次数为3

总结:这次引起bug的原因有2个,一是power服务端节点遗漏配置x86客户端的hostname,二是维护开启DNS,导致DNS解析hostname找不到,卡住15秒

结合ignite源代码回顾

客户端收集本地信息

首先客户端在启动加入节点前,会收集本地信息,这时候就会带上hostname,源码跟踪如下:

在IgniteUtils类里,

代码语言:javascript复制
    public static IgniteBiTuple<Collection<String>, Collection<String>> resolveLocalAddresses(InetAddress locAddr,
        boolean allHostNames) throws IOException {
...
 for (InetAddress addr : locAddrs)
                    addresses(addr, addrs, hostNames, allHostNames);
...
}

在addresses(addr, addrs, hostNames, allHostNames)方法里,获得hostname 即InetAddress getHostName方法

代码语言:javascript复制
    private static void addresses(InetAddress addr, Collection<String> addrs, Collection<String> hostNames,
        boolean allHostNames) {
...
        String hostName = addr.getHostName();
        String ipAddr = addr.getHostAddress();
...

如果客户端节点在自己这台的/etc/hosts 配置上了自己的hostname,这时代码就会把hostname存在加入节点里(locNode)。

最后,ClientImpl 把要加入的客户端节点(包含上面的hostname),包装为TcpDiscoveryJoinRequestMessage,传给服务端

代码语言:javascript复制
 ...
 TcpDiscoveryNode node = locNode;
 ...
 TcpDiscoveryJoinRequestMessage joinReqMsg = new TcpDiscoveryJoinRequestMessage(node, discoveryData);
 ...
 msg = joinReqMsg;
 ...
 spi.writeToSocket(sock, msg, timeoutHelper.nextTimeoutChunk(spi.getSocketTimeout()));

服务端反序列化解析hostname

客户端把TcpDiscoveryJoinRequestMessage发给服务端,服务端进行反序列化,代码为ServerImpl 7066行,

代码语言:javascript复制
  TcpDiscoveryAbstractMessage msg = U.unmarshal(spi.marshaller(), in,
      U.resolveClassLoader(spi.ignite().configuration()));

U.unmarshal底层调用ObjectInputStream的readObject方法,ObjectInputStream 里会调用readExternal接口,ignite里TcpDiscoveryJoinRequestMessage 重写了readExternal,在重写的readExternal里会初始化InetSocketAddress。

TcpDiscoveryJoinRequestMessage 主要包含TcpDiscoveryNode和DiscoveryDataPacket

TcpDiscoveryNode里包含writeExternal和readExternal

查看readExternal源码,包含初始化InetSocketAddress:

代码语言:javascript复制
    @Override public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
    ...
    sockAddrs = U.toSocketAddresses(this, discPort);
    ...

继续跟入U.toSocketAddresses源码,找到InetSocketAddress:

代码语言:javascript复制
    public static Collection<InetSocketAddress> toSocketAddresses(Collection<String> addrs,
        Collection<String> hostNames, int port) {
        Set<InetSocketAddress> res = new HashSet<>(addrs.size());

        Iterator<String> hostNamesIt = hostNames.iterator();

        for (String addr : addrs) {
            String hostName = hostNamesIt.hasNext() ? hostNamesIt.next() : null;

            if (!F.isEmpty(hostName)) {
                InetSocketAddress inetSockAddr = new InetSocketAddress(hostName, port);
                ...

0 人点赞