mongodb StatefulSet on kubernetes is not working anymore after kubernetes update

Posted: 2019-05-19 00:14:26

【Question】:

I have updated my AKS (Azure Kubernetes Service) cluster to version 1.11.5. A MongoDB StatefulSet runs in this cluster:

The StatefulSet was created with this file:

---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: default-view
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
  - kind: ServiceAccount
    name: default
    namespace: default
---
apiVersion: v1
kind: Service
metadata:
  name: mongo
  labels:
    name: mongo
spec:
  ports:
  - port: 27017
    targetPort: 27017
  clusterIP: None
  selector:
    role: mongo
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: mongo
spec:
  serviceName: "mongo"
  replicas: 2
  template:
    metadata:
      labels:
        role: mongo
        environment: test
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: mongo
          image: mongo
          command:
            - mongod
            - "--replSet"
            - rs0
            - "--bind_ip"
            - 0.0.0.0            
            - "--smallfiles"
            - "--noprealloc"
          ports:
            - containerPort: 27017
          volumeMounts:
            - name: mongo-persistent-storage
              mountPath: /data/db
        - name: mongo-sidecar
          image: cvallance/mongo-k8s-sidecar
          env:
            - name: MONGO_SIDECAR_POD_LABELS
              value: "role=mongo,environment=test"
  volumeClaimTemplates:
  - metadata:
      name: mongo-persistent-storage
      annotations:
        volume.beta.kubernetes.io/storage-class: "managed-premium"
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 32Gi

After the mentioned update of the cluster to the new k8s version, I get this error:

mongo-0                        1/2     CrashLoopBackOff   6          9m
mongo-1                        2/2     Running            0          1h
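
The status above shows mongo-0 crash-looping. A typical way to inspect such a pod (a sketch; the pod name mongo-0 and the container name mongo come from the manifest above) is:

# Events and restart reasons for the crashing pod
kubectl describe pod mongo-0

# mongod log from the previous (crashed) container run
kubectl logs mongo-0 -c mongo --previous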

The detailed log from the pod looks like this:

2018-12-18T14:28:44.281+0000 W STORAGE  [initandlisten] Detected configuration for non-active storage engine mmapv1 when current storage engine is wiredTiger
2018-12-18T14:28:44.281+0000 I CONTROL  [initandlisten]
2018-12-18T14:28:44.281+0000 I CONTROL  [initandlisten] ** WARNING: Access control is not enabled for the database.
2018-12-18T14:28:44.281+0000 I CONTROL  [initandlisten] **          Read and write access to data and configuration is unrestricted.
2018-12-18T14:28:44.281+0000 I CONTROL  [initandlisten] ** WARNING: You are running this process as the root user, which is not recommended.
2018-12-18T14:28:44.281+0000 I CONTROL  [initandlisten]
2018-12-18T14:28:44.281+0000 I CONTROL  [initandlisten]
2018-12-18T14:28:44.281+0000 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2018-12-18T14:28:44.281+0000 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2018-12-18T14:28:44.281+0000 I CONTROL  [initandlisten]
2018-12-18T14:28:44.477+0000 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/data/db/diagnostic.data'
2018-12-18T14:28:44.478+0000 I REPL     [initandlisten] Rollback ID is 7
2018-12-18T14:28:44.479+0000 I REPL     [initandlisten] Recovering from stable timestamp: Timestamp(1545077719, 1) (top of oplog:  ts: Timestamp(1545077349, 1), t: 5 , appliedThrough:  ts: Timestamp(1545077719, 1), t: 6 , TruncateAfter: Timestamp(0, 0))
2018-12-18T14:28:44.480+0000 I REPL     [initandlisten] Starting recovery oplog application at the stable timestamp: Timestamp(1545077719, 1)
2018-12-18T14:28:44.480+0000 F REPL     [initandlisten] Applied op  : Timestamp(1545077719, 1)  not found. Top of oplog is  : Timestamp(1545077349, 1) .
2018-12-18T14:28:44.480+0000 F -        [initandlisten] Fatal Assertion 40313 at src/mongo/db/repl/replication_recovery.cpp 361
2018-12-18T14:28:44.480+0000 F -        [initandlisten]

***aborting after fassert() failure

The two instances seem to be out of sync and cannot recover. Can anyone help?

【Comments】:

Related to your issue: jira.mongodb.org/browse/SERVER-37318

Thanks for the link!

【Answer 1】:

I have a workaround for this problem:

1. Add a MongoDB container to the cluster to dump and restore the MongoDB data
2. Dump the current database
3. Delete the MongoDB instance
4. Recreate a new MongoDB instance
5. Restore the data to the new instance

Yes, unfortunately, this causes downtime. A rough sketch of the commands is below.
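
A minimal sketch of those steps with kubectl and the MongoDB tools, assuming the StatefulSet, headless service and container names from the manifest above; the helper pod name "mongodump", the manifest filename "mongo-statefulset.yaml" and the dump paths are illustrative, not part of the original answer:

# 1. Start a throwaway pod that has the MongoDB tools available
kubectl run mongodump --image=mongo --restart=Never --command -- sleep infinity

# 2. Dump the current database through the headless service and copy it out
kubectl exec mongodump -- mongodump --host mongo-0.mongo --out /tmp/dump
kubectl cp mongodump:/tmp/dump ./dump

# 3. + 4. Delete the MongoDB instance (including its volumes) and recreate it
kubectl delete statefulset mongo
kubectl delete pvc mongo-persistent-storage-mongo-0 mongo-persistent-storage-mongo-1
kubectl apply -f mongo-statefulset.yaml

# 5. Copy the dump back in and restore it into the new instance
kubectl cp ./dump mongodump:/tmp/dump
kubectl exec mongodump -- mongorestore --host mongo-0.mongo /tmp/dump

# Clean up the helper pod
kubectl delete pod mongodump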

【Comments】:

【Answer 2】:

Look for answers from reliable and/or official sources.

One official source is "Running MongoDB on Kubernetes with StatefulSets" (from 2017, so it may need some adaptation), but you seem to have followed it already.

Your error message was reported two months ago in mongodb.org SERVER-37724:

In 4.0 we did make a change to the journaling process, where it follows the oplog rather than the data files themselves. It's possible that this is what's happening here.

To test this, try running MongoDB 3.6 and see if the problem persists.
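
A quick way to try that, as a sketch (it assumes the StatefulSet and container names from the manifest above and is not part of the original answer), is to pin the image tag instead of relying on the floating mongo tag:

# Pin the mongod container to the 3.6 image and watch the pods roll over
kubectl set image statefulset/mongo mongo=mongo:3.6
kubectl rollout status statefulset/mongo

Changing the image: line in the manifest to mongo:3.6 and re-applying it has the same effect.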

【Comments】:

I reverted to version 3.6. Stay tuned to see whether that solves the problem.
