websockets on GKE with Istio gives 'no healthy upstream' and 'CrashLoopBackOff'
Posted: 2019-05-27 23:40:49
Question:
I am using Istio version 1.0.3 on GKE. I am trying to get my express.js backend with socket.io (and the uws engine) working with websockets. I previously ran this backend with websockets on a 'non-Kubernetes server' without problems.
When I simply enter external_gke_ip as the URL, I get my backend's HTML page, so HTTP works. But when my client app makes a socket.io authentication call, I get a 503 error in the browser console:
WebSocket connection to 'ws://external_gke_ip/socket.io/?EIO=3&transport=websocket' failed: Error during WebSocket handshake: Unexpected response code: 503
When I enter external_gke_ip as the URL while socket calls are being made, the browser shows: no healthy upstream. The pod reports: CrashLoopBackOff.
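The failing handshake can also be reproduced outside the browser, which makes it easier to see which status code the gateway returns. The following is a minimal sketch using only Node's built-in `http` and `crypto` modules; the helper names `handshakeOptions` and `probeHandshake` are hypothetical, and `EXTERNAL_GKE_IP` in the usage comment is a placeholder, not a value from the question.

```javascript
const http = require('http');
const crypto = require('crypto');

// Build request options for a WebSocket opening handshake (RFC 6455).
// Hypothetical helper, not part of the original application.
function handshakeOptions(host, path, port = 80) {
  return {
    host,
    port,
    path,
    method: 'GET',
    headers: {
      Connection: 'Upgrade',
      Upgrade: 'websocket',
      'Sec-WebSocket-Version': '13',
      // The key is 16 random bytes, base64-encoded.
      'Sec-WebSocket-Key': crypto.randomBytes(16).toString('base64'),
    },
  };
}

// Resolves with the HTTP status code the server answers with:
// 101 if the upgrade succeeds, otherwise whatever the proxy or app returns (for example 503).
function probeHandshake(host, path, port) {
  return new Promise((resolve, reject) => {
    const req = http.request(handshakeOptions(host, path, port));
    req.on('upgrade', (res, socket) => {
      socket.destroy();
      resolve(res.statusCode);
    });
    req.on('response', (res) => {
      res.resume();
      resolve(res.statusCode);
    });
    req.on('error', reject);
    req.end();
  });
}

// Usage (EXTERNAL_GKE_IP is a placeholder):
// probeHandshake('EXTERNAL_GKE_IP', '/socket.io/?EIO=3&transport=websocket')
//   .then((code) => console.log('handshake status:', code));
```

Running this repeatedly while watching `kubectl get pods` may help correlate each 503 with a container restart.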
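I found this somewhere: 'In node.js land, socket.io typically does a few non-websocket handshakes to the server before eventually upgrading to Websockets. If you don't have sticky sessions, the upgrade never works.' So maybe I need sticky sessions? Or not, since I only have one replica of my app? This seems to be done by setting sessionAffinity: ClientIP, but with Istio I do not know how to do this. In the GUI I can edit some values of the load balancer, but session affinity shows 'None' and I cannot edit it.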
Other settings that might be relevant, and that I am not sure how to set with Istio, are:
externalTrafficPolicy=Local
ttl
My manifest config file:
apiVersion: v1
kind: Service
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    app: myapp
  ports:
  - port: 8089
    targetPort: 8089
    protocol: TCP
    name: http
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: gcr.io/myproject/firstapp:v1
        imagePullPolicy: Always
        ports:
        - containerPort: 8089
        env:
        - name: POSTGRES_DB_HOST
          value: 127.0.0.1:5432
        - name: POSTGRES_DB_USER
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: username
        - name: POSTGRES_DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: password
        readinessProbe:
          httpGet:
            path: /healthz
            scheme: HTTP
            port: 8089
          initialDelaySeconds: 10
          timeoutSeconds: 5
      - name: cloudsql-proxy
        image: gcr.io/cloudsql-docker/gce-proxy:1.11
        command: ["/cloud_sql_proxy",
                  "-instances=myproject:europe-west4:osm=tcp:5432",
                  "-credential_file=/secrets/cloudsql/credentials.json"]
        securityContext:
          runAsUser: 2
          allowPrivilegeEscalation: false
        volumeMounts:
        - name: cloudsql-instance-credentials
          mountPath: /secrets/cloudsql
          readOnly: true
      volumes:
      - name: cloudsql-instance-credentials
        secret:
          secretName: cloudsql-instance-credentials
---
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: myapp-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - "*"
  gateways:
  - myapp-gateway
  http:
  - match:
    - uri:
        prefix: /
    route:
    - destination:
        host: myapp
      weight: 100
    websocketUpgrade: true
---
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: google-apis
spec:
  hosts:
  - "*.googleapis.com"
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  location: MESH_EXTERNAL
---
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: cloud-sql-instance
spec:
  hosts:
  - 35.204.XXX.XX # ip of cloudsql database
  ports:
  - name: tcp
    number: 3307
    protocol: TCP
  location: MESH_EXTERNAL
Various outputs (taken while socket calls are being made; when I stop these, the deployment restarts and READY returns to 3/3):
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
myapp-8888 2/3 CrashLoopBackOff 11 1h
$ kubectl describe pod/myapp-8888
gives:
Name: myapp-8888
Namespace: default
Node: gke-standard-cluster-1-default-pool-888888-9vtk/10.164.0.36
Start Time: Sat, 19 Jan 2019 14:33:11 +0100
Labels: app=myapp
pod-template-hash=207157
Annotations:
kubernetes.io/limit-ranger:
LimitRanger plugin set: cpu request for container app; cpu request for container cloudsql-proxy
sidecar.istio.io/status:
"version":"3c9617ff82c9962a58890e4fa987c69ca62487fda71c23f3a2aad1d7bb46c748","initContainers":["istio-init"],"containers":["istio-proxy"]...
Status: Running
IP: 10.44.0.5
Controlled By: ReplicaSet/myapp-64c59c94dc
Init Containers:
istio-init:
Container ID: docker://a417695f99509707d0f4bfa45d7d491501228031996b603c22aaf398551d1e45
Image: gcr.io/gke-release/istio/proxy_init:1.0.2-gke.0
Image ID: docker-pullable://gcr.io/gke-release/istio/proxy_init@sha256:e30d47d2f269347a973523d0c5d7540dbf7f87d24aca2737ebc09dbe5be53134
Port: <none>
Host Port: <none>
Args:
-p
15001
-u
1337
-m
REDIRECT
-i
*
-x
-b
8089,
-d
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 19 Jan 2019 14:33:19 +0100
Finished: Sat, 19 Jan 2019 14:33:19 +0100
Ready: True
Restart Count: 0
Environment: <none>
Mounts: <none>
Containers:
app:
Container ID: docker://888888888888888888888888
Image: gcr.io/myproject/firstapp:v1
Image ID: docker-pullable://gcr.io/myproject/firstapp@sha256:8888888888888888888888888
Port: 8089/TCP
Host Port: 0/TCP
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 19 Jan 2019 14:40:14 +0100
Finished: Sat, 19 Jan 2019 14:40:37 +0100
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 19 Jan 2019 14:39:28 +0100
Finished: Sat, 19 Jan 2019 14:39:46 +0100
Ready: False
Restart Count: 3
Requests:
cpu: 100m
Readiness: http-get http://:8089/healthz delay=10s timeout=5s period=10s #success=1 #failure=3
Environment:
POSTGRES_DB_HOST: 127.0.0.1:5432
POSTGRES_DB_USER: <set to the key 'username' in secret 'mysecret'> Optional: false
POSTGRES_DB_PASSWORD: <set to the key 'password' in secret 'mysecret'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-rclsf (ro)
cloudsql-proxy:
Container ID: docker://788888888888888888888888888
Image: gcr.io/cloudsql-docker/gce-proxy:1.11
Image ID: docker-pullable://gcr.io/cloudsql-docker/gce-proxy@sha256:5c690349ad8041e8b21eaa63cb078cf13188568e0bfac3b5a914da3483079e2b
Port: <none>
Host Port: <none>
Command:
/cloud_sql_proxy
-instances=myproject:europe-west4:osm=tcp:5432
-credential_file=/secrets/cloudsql/credentials.json
State: Running
Started: Sat, 19 Jan 2019 14:33:40 +0100
Ready: True
Restart Count: 0
Requests:
cpu: 100m
Environment: <none>
Mounts:
/secrets/cloudsql from cloudsql-instance-credentials (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-rclsf (ro)
istio-proxy:
Container ID: docker://f3873d0f69afde23e85d6d6f85b1f
Image: gcr.io/gke-release/istio/proxyv2:1.0.2-gke.0
Image ID: docker-pullable://gcr.io/gke-release/istio/proxyv2@sha256:826ef4469e4f1d4cabd0dc846
Port: <none>
Host Port: <none>
Args:
proxy
sidecar
--configPath
/etc/istio/proxy
--binaryPath
/usr/local/bin/envoy
--serviceCluster
myapp
--drainDuration
45s
--parentShutdownDuration
1m0s
--discoveryAddress
istio-pilot.istio-system:15007
--discoveryRefreshDelay
1s
--zipkinAddress
zipkin.istio-system:9411
--connectTimeout
10s
--statsdUdpAddress
istio-statsd-prom-bridge.istio-system:9125
--proxyAdminPort
15000
--controlPlaneAuthPolicy
NONE
State: Running
Started: Sat, 19 Jan 2019 14:33:54 +0100
Ready: True
Restart Count: 0
Requests:
cpu: 10m
Environment:
POD_NAME: myapp-64c59c94dc-8888 (v1:metadata.name)
POD_NAMESPACE: default (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
ISTIO_META_POD_NAME: myapp-64c59c94dc-8888 (v1:metadata.name)
ISTIO_META_INTERCEPTION_MODE: REDIRECT
Mounts:
/etc/certs/ from istio-certs (ro)
/etc/istio/proxy from istio-envoy (rw)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
cloudsql-instance-credentials:
Type: Secret (a volume populated by a Secret)
SecretName: cloudsql-instance-credentials
Optional: false
default-token-rclsf:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-rclsf
Optional: false
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
istio-certs:
Type: Secret (a volume populated by a Secret)
SecretName: istio.default
Optional: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m31s default-scheduler Successfully assigned myapp-64c59c94dc-tdb9c to gke-standard-cluster-1-default-pool-65b9e650-9vtk
Normal SuccessfulMountVolume 7m31s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk MountVolume.SetUp succeeded for volume "istio-envoy"
Normal SuccessfulMountVolume 7m31s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk MountVolume.SetUp succeeded for volume "cloudsql-instance-credentials"
Normal SuccessfulMountVolume 7m31s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk MountVolume.SetUp succeeded for volume "default-token-rclsf"
Normal SuccessfulMountVolume 7m31s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk MountVolume.SetUp succeeded for volume "istio-certs"
Normal Pulling 7m30s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk pulling image "gcr.io/gke-release/istio/proxy_init:1.0.2-gke.0"
Normal Pulled 7m25s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Successfully pulled image "gcr.io/gke-release/istio/proxy_init:1.0.2-gke.0"
Normal Created 7m24s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Created container
Normal Started 7m23s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Started container
Normal Pulling 7m4s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk pulling image "gcr.io/cloudsql-docker/gce-proxy:1.11"
Normal Pulled 7m3s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Successfully pulled image "gcr.io/cloudsql-docker/gce-proxy:1.11"
Normal Started 7m2s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Started container
Normal Pulling 7m2s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk pulling image "gcr.io/gke-release/istio/proxyv2:1.0.2-gke.0"
Normal Created 7m2s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Created container
Normal Pulled 6m54s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Successfully pulled image "gcr.io/gke-release/istio/proxyv2:1.0.2-gke.0"
Normal Created 6m51s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Created container
Normal Started 6m48s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Started container
Normal Pulling 111s (x2 over 7m22s) kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk pulling image "gcr.io/myproject/firstapp:v3"
Normal Created 110s (x2 over 7m4s) kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Created container
Normal Started 110s (x2 over 7m4s) kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Started container
Normal Pulled 110s (x2 over 7m7s) kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Successfully pulled image "gcr.io/myproject/firstapp:v3"
Warning Unhealthy 99s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Readiness probe failed: HTTP probe failed with statuscode: 503
Warning BackOff 85s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Back-off restarting failed container
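The manifest's readiness probe expects GET /healthz on port 8089 to return a 2xx status, and the events above show it receiving 503 instead. The question does not show the application's health route, so the following is only a sketch of what such a handler could look like; `makeHealthzHandler` and the `isReady` callback are hypothetical names, and the handler uses only methods available on Node's `http.ServerResponse`.

```javascript
// Hypothetical readiness handler factory.
// `isReady` is an assumed callback supplied by the application
// (for example, one that checks whether the database connection is up).
function makeHealthzHandler(isReady) {
  return function healthz(req, res) {
    const ready = Boolean(isReady());
    // 200 marks the container as ready; 503 keeps it out of rotation.
    res.writeHead(ready ? 200 : 503, { 'Content-Type': 'text/plain' });
    res.end(ready ? 'ok' : 'not ready');
  };
}

// Usage with the built-in http module (answers every path, not started here):
// require('http').createServer(makeHealthzHandler(() => true)).listen(8089);
```

In an Express-style application the same function could be mounted on the /healthz path only, so that it does not shadow other routes.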
Also:
$ kubectl logs myapp-8888 myapp
> api_server@0.0.0 start /usr/src/app
> node src/
info: Feathers application started on http://localhost:8089
And the database proxy logs (which look fine, since some 'startup script entries' written by the app can be retrieved with psql):
$ kubectl logs myapp-8888 cloudsql-proxy
2019/01/19 13:33:40 using credential file for authentication; email=proxy-user@myproject.iam.gserviceaccount.com
2019/01/19 13:33:40 Listening on 127.0.0.1:5432 for myproject:europe-west4:osm
2019/01/19 13:33:40 Ready for new connections
2019/01/19 13:33:54 New connection for "myproject:europe-west4:osm"
2019/01/19 13:33:55 couldn't connect to "myproject:europe-west4:osm": Post https://www.googleapis.com/sql/v1beta4/projects/myproject/instances/osm/createEphemeral?alt=json: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp 74.125.143.95:443: getsockopt: connection refused
2019/01/19 13:39:06 New connection for "myproject:europe-west4:osm"
2019/01/19 13:39:06 New connection for "myproject:europe-west4:osm"
2019/01/19 13:39:06 Client closed local connection on 127.0.0.1:5432
2019/01/19 13:39:13 New connection for "myproject:europe-west4:osm"
2019/01/19 13:39:14 New connection for "myproject:europe-west4:osm"
2019/01/19 13:39:14 New connection for "myproject:europe-west4:osm"
2019/01/19 13:39:14 New connection for "myproject:europe-west4:osm"
EDIT: Here is the server-side log of the websocket calls to my app that return 503:
insertId: "465nu9g3xcn5hf"
jsonPayload:
apiClaims: ""
apiKey: ""
clientTraceId: ""
connection_security_policy: "unknown"
destinationApp: "myapp"
destinationIp: "10.44.XX.XX"
destinationName: "myapp-888888-88888"
destinationNamespace: "default"
destinationOwner: "kubernetes://apis/extensions/v1beta1/namespaces/default/deployments/myapp"
destinationPrincipal: ""
destinationServiceHost: "myapp.default.svc.cluster.local"
destinationWorkload: "myapp"
httpAuthority: "35.204.XXX.XXX"
instance: "accesslog.logentry.istio-system"
latency: "1.508885ms"
level: "info"
method: "GET"
protocol: "http"
receivedBytes: 787
referer: ""
reporter: "source"
requestId: "bb31d922-8f5d-946b-95c9-83e4c022d955"
requestSize: 0
requestedServerName: ""
responseCode: 503
responseSize: 57
responseTimestamp: "2019-01-18T20:53:03.966513Z"
sentBytes: 164
sourceApp: "istio-ingressgateway"
sourceIp: "10.44.X.X"
sourceName: "istio-ingressgateway-8888888-88888"
sourceNamespace: "istio-system"
sourceOwner: "kubernetes://apis/extensions/v1beta1/namespaces/istio-system/deployments/istio-ingressgateway"
sourcePrincipal: ""
sourceWorkload: "istio-ingressgateway"
url: "/socket.io/?EIO=3&transport=websocket"
userAgent: "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
xForwardedFor: "10.44.X.X"
logName: "projects/myproject/logs/stdout"
metadata:
systemLabels:
container_image: "gcr.io/gke-release/istio/mixer:1.0.2-gke.0"
container_image_id: "docker-pullable://gcr.io/gke-release/istio/mixer@sha256:888888888888888888888888888888"
name: "mixer"
node_name: "gke-standard-cluster-1-default-pool-88888888888-8887"
provider_instance_id: "888888888888"
provider_resource_type: "gce_instance"
provider_zone: "europe-west4-a"
service_name: [
0: "istio-telemetry"
]
top_level_controller_name: "istio-telemetry"
top_level_controller_type: "Deployment"
userLabels:
app: "telemetry"
istio: "mixer"
istio-mixer-type: "telemetry"
pod-template-hash: "88888888888"
receiveTimestamp: "2019-01-18T20:53:08.135805255Z"
resource:
labels:
cluster_name: "standard-cluster-1"
container_name: "mixer"
location: "europe-west4-a"
namespace_name: "istio-system"
pod_name: "istio-telemetry-8888888-8888888"
project_id: "myproject"
type: "k8s_container"
severity: "INFO"
timestamp: "2019-01-18T20:53:03.965100Z"
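In the browser, it at first seems to switch protocols correctly, but this is then followed by repeated 503 responses, and the resulting health problems cause repeated restarts. The websocket call that switches protocols:
General: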
Request URL: ws://localhost:8080/sockjs-node/842/s4888/websocket
Request Method: GET
Status Code: 101 Switching Protocols [GREEN]
Response headers:
Connection: Upgrade
Sec-WebSocket-Accept: NS8888888888888888888
Upgrade: websocket
Request headers:
Accept-Encoding: gzip, deflate, br
Accept-Language: nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7
Cache-Control: no-cache
Connection: Upgrade
Cookie: _ga=GA1.1.1118102238.18888888; hblid=nSNQ2mS8888888888888; olfsk=ol8888888888
Host: localhost:8080
Origin: http://localhost:8080
Pragma: no-cache
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Sec-WebSocket-Key: b8zkVaXlEySHasCkD4aUiw==
Sec-WebSocket-Version: 13
Upgrade: websocket
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1
Its frames:
Following the above, I get multiple of these:
Chrome output for the websocket call:
General:
Request URL: ws://35.204.210.134/socket.io/?EIO=3&transport=websocket
Request Method: GET
Status Code: 503 Service Unavailable
Response headers:
connection: close
content-length: 19
content-type: text/plain
date: Sat, 19 Jan 2019 14:06:39 GMT
server: envoy
Request headers:
Accept-Encoding: gzip, deflate
Accept-Language: nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7
Cache-Control: no-cache
Connection: Upgrade
Host: 35.204.210.134
Origin: http://localhost:8080
Pragma: no-cache
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Sec-WebSocket-Key: VtKS5xKF+GZ4u3uGih2fig==
Sec-WebSocket-Version: 13
Upgrade: websocket
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1
Frames:
Data: (Opcode -1)
Length: 63
Time: 15:06:44.412
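Comments:
I see that your pod is in the CrashLoopBackOff state because the myapp container keeps ending up in that state. Please try to get a shell to that container and troubleshoot the corresponding PostgreSQL database connection and mounts to check that they work.
@mehdisharifi Thanks, I can now indeed get a shell, but I ran into an EADDRINUSE problem, see here.
In the Stack link you shared, I see that you resolved the 'Error: listen EADDRINUSE' error by fixing the health service issue. However, your myapp container still crashes when you use WebSockets. Do you still get the same error message? Are the pod and the container both in the same CrashLoopBackOff state? Please take a look at this link. It offers ideas about WebSocket handshake problems.
@mehdisharifi The current situation is described in this issue. Once I solve the Cloud SQL problem, I may not even have the websocket problem anymore, or at least I may get better error messages to continue working on it.
Could you share more details about what happens when you resolve the Cloud SQL issue? Troubleshooting needs more data and status information.
Answer 1:
Using uws (uWebSockets) as the websocket engine caused these errors. When I swapped this code in my backend app:
app.configure(socketio({
  wsEngine: 'uws',
  timeout: 120000,
  reconnect: true
}))
for this:
app.configure(socketio())
everything worked as expected.
EDIT: It now also works with uws. I had been using an alpine Docker container based on Node 10, which does not work with uws. After switching to a container based on Node 8, it works.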
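The workaround in this answer can be captured in a small guard that only selects the uws engine on a Node version where it was reported to work. This is a sketch under assumptions: `buildSocketOptions` is a hypothetical helper, and the cutoff at Node 8 reflects only the two versions mentioned in the answer (8 worked, 10 did not); other versions were not tested there.

```javascript
// Hypothetical helper: builds the options object passed to socketio().
// `nodeVersion` defaults to the version of the running Node process.
function buildSocketOptions(nodeVersion = process.versions.node) {
  const major = parseInt(nodeVersion.split('.')[0], 10);
  // Option values copied from the answer's original configuration.
  const options = { timeout: 120000, reconnect: true };
  // Assumption: enable uws only up to Node 8, the version reported to work.
  // On newer versions the option is omitted, so socket.io uses its default engine.
  if (major <= 8) {
    options.wsEngine = 'uws';
  }
  return options;
}

// Usage inside the application from the answer:
// app.configure(socketio(buildSocketOptions()));
```

Keeping the decision in one place makes it simple to revisit if a later uws release supports newer Node versions.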