How istio-proxy injection changes the return code of HTTP probes

2023-06-19 16:47:29

Background

After a recent production incident, a business Pod was restarted because its liveness probe failed. The failure message was:

```
Liveness probe failed: HTTP probe failed with statuscode: 500
```

The incident itself was not in question, but the business team raised one: why was the HTTP return code 500? The URI configured for the liveness probe does nothing but output a fixed string "success", so an HTTP 500 return code should be impossible.

Analysis

First reaction: that shouldn't happen

It is indeed hard to imagine how an HTTP service that only outputs fixed text could produce an HTTP 500 return code; after all, HTTP 500 signals an internal server error. So, was the code rewritten by Kubernetes?

Intuition says Kubernetes does not rewrite the return code of HTTP probes. But intuition is not proof, so let's check the code. The relevant logic is in DoHTTPProbe in pkg/probe/http/http.go:

```go
// DoHTTPProbe checks if a GET request to the url succeeds.
// If the HTTP response code is successful (i.e. 400 > code >= 200), it returns Success.
// If the HTTP response code is unsuccessful or HTTP communication fails, it returns Failure.
// This is exported because some other packages may want to do direct HTTP probes.
func DoHTTPProbe(url *url.URL, headers http.Header, client GetHTTPInterface) (probe.Result, string, error) {
	req, err := http.NewRequest("GET", url.String(), nil)
	if err != nil {
		// Convert errors into failures to catch timeouts.
		return probe.Failure, err.Error(), nil
	}
	...
	if res.StatusCode >= http.StatusOK && res.StatusCode < http.StatusBadRequest {
		if res.StatusCode >= http.StatusMultipleChoices { // Redirect
			klog.V(4).Infof("Probe terminated redirects for %s, Response: %v", url.String(), *res)
			return probe.Warning, fmt.Sprintf("Probe terminated redirects, Response body: %v", body), nil
		}
		klog.V(4).Infof("Probe succeeded for %s, Response: %v", url.String(), *res)
		return probe.Success, body, nil
	}
	klog.V(4).Infof("Probe failed for %s with request headers %v, response body: %v", url.String(), headers, body)
	return probe.Failure, fmt.Sprintf("HTTP probe failed with statuscode: %d", res.StatusCode), nil
}
```

There is indeed no logic here that rewrites the HTTP return code.

Could the Pod's probe configuration have changed?

Could the business Pod's probe configuration have been different at some point, so that when the liveness probe failed, its URI actually pointed at some more complex logic?

Comparing the versions of the Pod's ReplicaSet showed that the liveness probe configuration had never changed. Keep digging into how a handler that simply outputs "success" could return HTTP 500? Forget it; that path almost certainly leads nowhere.

While inspecting the ReplicaSet, though, one thing stood out: the liveness probe configuration of the running Pod differed from that of the ReplicaSet. In the ReplicaSet, the probe's path is /ok.jsp, but in the Pod the path had become /app-health/jetty/livez, and the port had become 15020. Anyone who has used istio will recognize the 1502X port range, so this was probably istio-related.

Checking the Pod's YAML confirmed that istio-proxy had indeed been injected.

What does istio-proxy injection change about probes?

The official istio documentation, Health Checking of Istio Services, is quite clear: HTTP probes are rewritten by default, and it explains why:

```
The command approach works with no changes required, but HTTP requests, TCP probes, and gRPC probes require Istio to make changes to the pod configuration.

The health check requests to the liveness-http service are sent by Kubelet. This becomes a problem when mutual TLS is enabled, because the Kubelet does not have an Istio issued certificate. Therefore the health check requests will fail.

TCP probe checks need special handling, because Istio redirects all incoming traffic into the sidecar, and so all TCP ports appear open. The Kubelet simply checks if some process is listening on the specified port, and so the probe will always succeed as long as the sidecar is running.

Istio solves both these problems by rewriting the application PodSpec readiness/liveness probe, so that the probe request is sent to the sidecar agent. For HTTP and gRPC requests, the sidecar agent redirects the request to the application and strips the response body, only returning the response code. For TCP probes, the sidecar agent will then do the port check while avoiding the traffic redirection.
```

Specifically, the path is rewritten to /app-health/<container name>/livez. The code is FormatProberURL in pilot/cmd/pilot-agent/status/server.go:

```go
// FormatProberURL returns a set of HTTP URLs that pilot agent will serve to take over Kubernetes
// app probers.
func FormatProberURL(container string) (string, string, string) {
	return fmt.Sprintf("/app-health/%v/readyz", container),
		fmt.Sprintf("/app-health/%v/livez", container),
		fmt.Sprintf("/app-health/%v/startupz", container)
}
```

The port is uniformly rewritten to a StatusPort, 15020 by default. The code is applyRewrite in pkg/kube/inject/webhook.go:

```go
func applyRewrite(pod *corev1.Pod, req InjectionParameters) error {
	valuesStruct := &opconfig.Values{}
	if err := gogoprotomarshal.ApplyYAML(req.valuesConfig, valuesStruct); err != nil {
		log.Infof("Failed to parse values config: %v [%v]\n", err, req.valuesConfig)
		return fmt.Errorf("could not parse configuration values: %v", err)
	}

	rewrite := ShouldRewriteAppHTTPProbers(pod.Annotations, valuesStruct.GetSidecarInjectorWebhook().GetRewriteAppHTTPProbe())
	sidecar := FindSidecar(pod.Spec.Containers)

	// We don't have to escape json encoding here when using golang libraries.
	if rewrite && sidecar != nil {
		if prober := DumpAppProbers(&pod.Spec, req.meshConfig.GetDefaultConfig().GetStatusPort()); prober != "" {
			sidecar.Env = append(sidecar.Env, corev1.EnvVar{Name: status.KubeAppProberEnvName, Value: prober})
		}
		patchRewriteProbe(pod.Annotations, pod, req.meshConfig.GetDefaultConfig().GetStatusPort())
	}
	return nil
}
```

Does istio rewrite the HTTP return code?

Although at this point we were fairly sure istio had rewritten the HTTP return code, we still needed evidence, so back to the code.

After an HTTP probe is rewritten, the probe port is actually served by pilot-agent, so that is the code to check. With the direction clear, the problem is simple; the code is handleAppProbeHTTPGet in pilot/cmd/pilot-agent/status/server.go:

```go
func (s *Server) handleAppProbeHTTPGet(w http.ResponseWriter, req *http.Request, prober *Prober, path string) {
	proberPath := prober.HTTPGet.Path
	if !strings.HasPrefix(proberPath, "/") {
		proberPath = "/" + proberPath
	}
	...
	// get the http client must exist because
	httpClient := s.appProbeClient[path]

	// Send the request.
	response, err := httpClient.Do(appReq)
	if err != nil {
		log.Errorf("Request to probe app failed: %v, original URL path = %v\napp URL path = %v", err, path, proberPath)
		w.WriteHeader(http.StatusInternalServerError)
		return
	}
	defer func() {
		// Drain and close the body to let the Transport reuse the connection
		_, _ = io.Copy(io.Discard, response.Body)
		_ = response.Body.Close()
	}()

	if isRedirect(response.StatusCode) { // Redirect
		// In other cases, we return the original status code. For redirects, it is illegal to
		// not have Location header, so we need to switch to just 200.
		w.WriteHeader(http.StatusOK)
		return
	}
	// We only write the status code to the response.
	w.WriteHeader(response.StatusCode)
}
```

Two points in this code are worth noting:

  1. The liveness probe does not open a new connection for every check; connections are kept in a pool and an existing connection is reused to send the request
  2. If sending the request fails, it directly returns http.StatusInternalServerError, i.e. an HTTP 500 return code

At this point, the picture is clear.

Conclusion

Afterwards, correlating with the business logs showed that the service had actually been overwhelmed by a traffic spike at the time, and the business process hit an OOM. As a result, the connection between pilot-agent and the business process was torn down. The actual error was "read: connection reset by peer". The error log from istio-proxy was:

```
Request to probe app failed: Get "http://xxx/ok.jsp": read tcp 127.0.0.6:58949->xxx: read: connection reset by peer, original URL path = /app-health/jetty/livez
app URL path = /ok.jsp
```
