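1. Discovering the problem
One node was unable to query networks. It then turned out that no client could run queries correctly, and all of them failed with the same error.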
The cinder service is unreachable:
[root@controller01 ~]# cinder list
ERROR: Unable to establish connection to http://nt-controller:8776/v2/364307d25ca8465daa7982dafc625f05/volumes/detail: ('Connection aborted.', BadStatusLine("''",))
The nova service is unreachable:
[root@controller01 ~]# nova list
/usr/lib/python2.7/site-packages/novaclient/client.py:278: UserWarning: The 'tenant_id' argument is deprecated in Ocata and its use may result in errors in future releases. As 'project_id' is provided, the 'tenant_id' argument will be ignored.
warnings.warn(msg)
ERROR (ConnectFailure): Unable to establish connection to http://nt-controller:8774/v2.1/364307d25ca8465daa7982dafc625f05/servers/detail: ('Connection aborted.', BadStatusLine("''",))
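`BadStatusLine("''")` in both errors means the client opened a TCP connection and sent its request, but the peer closed the connection without returning an HTTP status line. Below is a minimal Python 3 sketch that reproduces that symptom against a throwaway local server. The helper names `silent_server` and `request_status` are hypothetical and are not part of any OpenStack client. The logs above come from Python 2.7; in Python 3 the same condition surfaces as `http.client.RemoteDisconnected`, which is a subclass of `BadStatusLine`.

```python
import http.client
import socket
import threading


def silent_server():
    """Start a local TCP server that reads one request and hangs up without replying.

    Returns the ephemeral port it listens on. (Hypothetical helper for illustration.)
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    def serve_once():
        conn, _ = listener.accept()
        conn.recv(65536)  # consume the request so the close is a clean FIN
        conn.close()      # no status line is ever sent
        listener.close()

    threading.Thread(target=serve_once, daemon=True).start()
    return port


def request_status(host, port, path="/"):
    """Return the HTTP status code, or 'empty reply' if the peer hung up first."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    except (http.client.BadStatusLine, ConnectionResetError):
        # RemoteDisconnected inherits from both of these exception types.
        return "empty reply"
    finally:
        conn.close()
```

Calling `request_status("127.0.0.1", silent_server())` should take the `except` branch, which is the same class of failure the cinder and nova clients reported.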
2. Troubleshooting
A manual telnet to the port connects, but the remote end then closes the connection right away:
[root@controller01 ~]# telnet nt-controller 8774
Trying 192.168.105.253...
Connected to nt-controller.
Escape character is '^]'.
Connection closed by foreign host.
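The telnet output shows a port that accepts the connection and then drops it, which is different from a port where nothing is listening. As a rough sketch, the same distinction can be made programmatically. The function name `probe_port` and its return labels are assumptions made for illustration; errors such as DNS failures are left to propagate.

```python
import socket


def probe_port(host, port, timeout=3.0):
    """Classify a TCP port the way the telnet check above does by eye.

    Returns one of: 'refused', 'timeout', 'closed by peer', 'open'.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"          # nothing is listening on the port
    except socket.timeout:
        return "timeout"          # no answer to the connection attempt
    with sock:
        try:
            data = sock.recv(1)   # wait briefly without sending anything
        except socket.timeout:
            return "open"         # peer keeps the connection open, waiting for a request
        except ConnectionResetError:
            return "closed by peer"
        # An empty read means the peer closed the connection, as in the telnet session.
        return "closed by peer" if data == b"" else "open"
```

Against a healthy HTTP endpoint this would normally report `open`, because the server waits for a request; the telnet session above corresponds to `closed by peer`.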
The conductor and api services log errors about being unable to connect to the database:
2018-05-30 02:16:08.609 29270 ERROR nova.servicegroup.drivers.db return self.dbapi.connect(*cargs, **cparams)
2018-05-30 02:16:08.609 29270 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/pymysql/__init__.py", line 90, in Connect
2018-05-30 02:16:08.609 29270 ERROR nova.servicegroup.drivers.db return Connection(*args, **kwargs)
2018-05-30 02:16:08.609 29270 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 694, in __init__
2018-05-30 02:16:08.609 29270 ERROR nova.servicegroup.drivers.db self.connect()
2018-05-30 02:16:08.609 29270 ERROR nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 947, in connect
2018-05-30 02:16:08.609 29270 ERROR nova.servicegroup.drivers.db raise exc
2018-05-30 02:16:08.609 29270 ERROR nova.servicegroup.drivers.db DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on 'nt-controller' ([Errno 111] ECONNREFUSED)")
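The last line of the traceback carries the useful detail: MySQL client error 2003, raised because the connection to `nt-controller` was refused (errno 111). That is the same host name the API endpoints use, which suggests looking at what sits in front of the database rather than only at the database itself. A small sketch for pulling those fields out of such log lines follows; the function name `parse_db_connect_error` is hypothetical, and the pattern is written only for the message format shown above.

```python
import re

# Matches the pymysql OperationalError text shown in the traceback above.
_DB_ERR = re.compile(
    r"""\((?P<code>\d+),\s*"Can't connect to MySQL server on """
    r"""'(?P<host>[^']+)' \(\[Errno (?P<errno>\d+)\] (?P<name>\w+)\)"\)"""
)


def parse_db_connect_error(line):
    """Return code, host, errno and errno name from a log line, or None if absent."""
    match = _DB_ERR.search(line)
    if match is None:
        return None
    return {
        "code": int(match.group("code")),    # MySQL client error code
        "host": match.group("host"),         # host the service tried to reach
        "errno": int(match.group("errno")),  # OS-level errno
        "name": match.group("name"),         # symbolic errno name
    }
```

Running it over the conductor and api logs and grouping by `host` would show whether every refusal points at the same address.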
Check the database cluster status (the cluster is healthy):
MariaDB [(none)]> SHOW STATUS LIKE 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 3     |
+--------------------+-------+
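The check above looks only at `wsrep_cluster_size`. As a hedged sketch, the same idea can be turned into a small function that also looks at `wsrep_cluster_status` and `wsrep_ready`. Those two variables are an addition that the original check did not use, and the function name `galera_problems` is hypothetical.

```python
def galera_problems(status, expected_size):
    """Return a list of human-readable problems found in Galera status values.

    status: dict mapping wsrep variable names to their string values,
            for example built from the rows of SHOW GLOBAL STATUS LIKE 'wsrep_%'.
    expected_size: number of nodes the cluster is supposed to have (3 above).
    """
    expected = {
        "wsrep_cluster_size": str(expected_size),
        "wsrep_cluster_status": "Primary",  # assumption: node should be in the primary component
        "wsrep_ready": "ON",                # assumption: node should accept queries
    }
    problems = []
    for name, want in expected.items():
        got = status.get(name)
        if got is None:
            problems.append("%s: missing" % name)
        elif str(got) != want:
            problems.append("%s: expected %s, got %s" % (name, want, got))
    return problems
```

An empty list corresponds to the "cluster is healthy" conclusion reached above. With a client library such as pymysql, the input dict could presumably be built with `dict(cursor.fetchall())` after running the `SHOW GLOBAL STATUS` query.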
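3. Resolution
None of the services could be reached, yet keystone was working (the authentication service does not run locally) and the database was healthy. The one component that all of the failing services have in common is haproxy. Manually restarting haproxy resolved the problem (haproxy's listening ports looked normal; it may have crashed).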
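Since haproxy turned out to be the common factor, checking its own view of the backends is one way to shorten this kind of investigation. The sketch below parses the CSV that haproxy returns for the `show stat` command and lists entries that are not up. It assumes a stats socket is enabled in the haproxy configuration, which the original setup may or may not have had; the function name `unhealthy_entries` and the sample data in the usage note are hypothetical.

```python
import csv
import io

# Status values treated as healthy; anything else (DOWN, MAINT, NOLB, ...) is reported.
HEALTHY_PREFIXES = ("UP", "OPEN", "no check")


def unhealthy_entries(show_stat_csv):
    """Return (proxy, server, status) tuples for entries that are not healthy.

    show_stat_csv: text of haproxy's 'show stat' output, whose header line
                   starts with '# pxname,svname,...'.
    """
    text = show_stat_csv.lstrip()
    if text.startswith("# "):
        text = text[2:]  # drop the comment marker so the header parses as CSV
    bad = []
    for row in csv.DictReader(io.StringIO(text)):
        status = (row.get("status") or "").strip()
        if status and not status.startswith(HEALTHY_PREFIXES):
            bad.append((row["pxname"], row["svname"], status))
    return bad
```

For example, given a trimmed, made-up output with a `nova_api` proxy whose `controller01` server is `DOWN`, the function would return `[("nova_api", "controller01", "DOWN")]`.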