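Disclaimer: parts of this article were generated with AI assistance, then edited, reviewed, and supplemented with personal experience by a human.
Update note: this article was last updated on 2025-05-25.
Kubernetes Production Deployment: A Record of Pitfalls
Last year the company decided to migrate its core services from Docker Compose to Kubernetes, and I was responsible for the whole migration. Going from an empty cluster to production, I hit a long list of pitfalls. This post records the process in the hope that it helps whoever comes next.
Cluster Setup
Choosing: Self-Managed or Hosted
At first I went back and forth between a self-managed cluster and a cloud provider's managed offering.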
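| Option | Pros | Cons | Best suited for |
| --- | --- | --- | --- |
| Self-managed kubeadm | Full control, low cost | Complex to maintain, needs dedicated staff | Teams with an ops group |
| Managed cloud (EKS/ACK) | Low effort, good integration | Expensive, vendor lock-in | Getting to production quickly |
| Rancher | Friendly UI | Extra complexity | Multi-cluster management |
| K3s | Lightweight, low resource usage | Reduced feature set | Edge/IoT |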
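We chose self-managed kubeadm because we were cost-sensitive and someone on the team already had Kubernetes experience.
Installing with kubeadm
Three CentOS 7.9 machines: 1 master + 2 workers.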
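```shell
# Disable swap now and remove it from fstab so it stays off after a reboot
swapoff -a
sed -i '/swap/d' /etc/fstab

# Load the kernel modules needed for container networking
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF

sudo modprobe overlay
sudo modprobe br_netfilter

# Let iptables see bridged traffic and enable IP forwarding
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF

sudo sysctl --system

# Install and start containerd
yum install -y containerd.io
systemctl enable --now containerd

# Write out the default config, then switch runc to the systemd cgroup driver
# (with the runc v2 shim the option is SystemdCgroup, under the runc runtime options)
containerd config default > /etc/containerd/config.toml
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
systemctl restart containerd
```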
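Pitfall 1: Wrong containerd configuration
containerd's default configuration does not enable the systemd cgroup driver. kubeadm sets the kubelet up to use the systemd driver by default, so the two disagreed and Pods failed to start.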
```
Error: failed to create shim task: OCI runtime create failed
```
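Before running `kubeadm init`, a quick sanity check of the containerd config can save a round of debugging. The sketch below is a hypothetical helper (`check_containerd_config` is my own name, not part of containerd or kubeadm). It assumes a version 2 `config.toml`, where the runc option is spelled `SystemdCgroup`, and it also flags a `disabled_plugins` line that lists `cri`, which some packaged default configs have shipped with.

```shell
# check_containerd_config FILE
# Prints "ok" when the CRI plugin is not disabled and SystemdCgroup is true.
# This is a plain text scan, not a TOML parser, so treat it as a first check only.
check_containerd_config() {
    cfg=$1
    # A disabled CRI plugin means the kubelet cannot talk to containerd at all.
    if grep -Eq '^[[:space:]]*disabled_plugins[[:space:]]*=.*"cri"' "$cfg"; then
        echo "cri plugin disabled"
        return 1
    fi
    # The runc runtime options must request the systemd cgroup driver.
    if grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$cfg"; then
        echo "ok"
    else
        echo "SystemdCgroup not enabled"
        return 1
    fi
}
```

On a node this would be called as `check_containerd_config /etc/containerd/config.toml`.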
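Pitfall 2: Firewall left on
Kubernetes components talk to each other over quite a few ports. With the firewall rules not set up correctly, nodes could not join the cluster.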
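```shell
# Option A: turn firewalld off entirely (simplest, fine for a test environment)
systemctl stop firewalld
systemctl disable firewalld

# Option B: keep firewalld running and open the control-plane ports
firewall-cmd --permanent --add-port=6443/tcp        # kube-apiserver
firewall-cmd --permanent --add-port=10250/tcp       # kubelet API
firewall-cmd --permanent --add-port=10259/tcp       # kube-scheduler (10251 on older releases)
firewall-cmd --permanent --add-port=10257/tcp       # kube-controller-manager (10252 on older releases)
firewall-cmd --permanent --add-port=2379-2380/tcp   # etcd
firewall-cmd --reload
```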
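Worker nodes need a different, smaller set of ports than the control plane. As a rough sketch, the hypothetical helper below (`k8s_firewall_cmds` is my own name) prints the `firewall-cmd` lines for each role, using the port list from the Kubernetes "Ports and Protocols" reference for recent releases. It does not cover whatever the CNI plugin needs, so treat the list as a starting point.

```shell
# k8s_firewall_cmds ROLE
# ROLE is "control-plane" or "worker". Prints commands; it does not run them.
k8s_firewall_cmds() {
    case "$1" in
        control-plane)
            # apiserver, etcd, kubelet, controller-manager, scheduler
            ports="6443 2379-2380 10250 10257 10259"
            ;;
        worker)
            # kubelet and the default NodePort range
            ports="10250 30000-32767"
            ;;
        *)
            echo "usage: k8s_firewall_cmds control-plane|worker" >&2
            return 1
            ;;
    esac
    for p in $ports; do
        echo "firewall-cmd --permanent --add-port=$p/tcp"
    done
    echo "firewall-cmd --reload"
}
```

Review the output first, then pipe it to `sh` on the node, for example `k8s_firewall_cmds worker | sh`.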
Initializing the Master
```shell
kubeadm init \
  --pod-network-cidr=10.244.0.0/16 \
  --service-cidr=10.96.0.0/12 \
  --apiserver-advertise-address=192.168.1.10
```
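Pitfall 3: CIDR conflict
The company intranet uses addresses in 10.0.0.0/8. I initially set `--pod-network-cidr=10.0.0.0/16`, which collided with the office network, and Pods could not reach the external network.
Fix: switch to 10.244.0.0/16, Flannel's default range. Strictly speaking this block still sits inside 10.0.0.0/8, so it is only safe as long as no subnet actually in use on the intranet falls within it. Check your routing before settling on a range.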
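An overlap check is easy to script and worth running against every subnet you know is in use before `kubeadm init`. The helpers below are a minimal sketch (`ip_to_int` and `cidr_overlap` are my own names, IPv4 only). Two CIDR blocks overlap exactly when they agree on the bits covered by the shorter of the two prefixes.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# cidr_overlap A B: print "overlap" if the two blocks share any address,
# otherwise "disjoint".
cidr_overlap() {
    net_a=${1%/*}; len_a=${1#*/}
    net_b=${2%/*}; len_b=${2#*/}
    # Compare only the bits covered by the shorter prefix.
    if [ "$len_a" -lt "$len_b" ]; then len=$len_a; else len=$len_b; fi
    mask=$(( (4294967295 << (32 - $len)) & 4294967295 ))
    a=$(ip_to_int "$net_a")
    b=$(ip_to_int "$net_b")
    if [ $(( a & mask )) -eq $(( b & mask )) ]; then
        echo overlap
    else
        echo disjoint
    fi
}

cidr_overlap 10.0.0.0/8 10.0.0.0/16         # prints "overlap"
cidr_overlap 192.168.0.0/16 10.244.0.0/16   # prints "disjoint"
```

Note that a broad aggregate such as 10.0.0.0/8 reports an overlap with any 10.x range, so compare against the specific subnets that are really routed on your network rather than the aggregate.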
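CNI Network Plugin
I tried both Flannel and Calico and ended up choosing Calico.
```shell
# Flannel (what I tried first)
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml

# Calico (what we kept); install only one CNI plugin on a cluster
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.0/manifests/calico.yaml
```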
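Pitfall 4: Calico picks the wrong network interface
Calico auto-detects which interface to use, but on machines with multiple NICs it often picks the wrong one.
```shell
# Check the calico-node logs to see which address was detected
kubectl logs -n kube-system -l k8s-app=calico-node
```
Fix: specify the interface manually
```shell
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=eth0
```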
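Pod Scheduling
Resource Limits
At first we set no resource limits. One Pod ate all the CPU on its node and every other Pod on that node ground to a halt.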
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:              # required by apps/v1; must match the template labels
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api         # same label the Services in this post select on
    spec:
      containers:
      - name: api
        image: myapp:latest
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
```
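Pitfall 5: Setting limits without requests
```yaml
resources:
  limits:
    memory: "512Mi"
    cpu: "500m"
```
With this, requests default to the limits, so the scheduler reserves 500m of CPU for the Pod even if it really uses only about 100m. The result is poor node utilization.
The right approach: set requests and limits separately
```yaml
resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "1000m"
```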
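The scheduler packs nodes by requests, not by actual usage, so the effect on utilization is easy to estimate with some arithmetic. This is a minimal sketch: the helper names are mine, the 4-core node is an assumed example, and the calculation looks at CPU only, ignoring memory and whatever the system reserves.

```shell
# Convert a Kubernetes CPU quantity ("250m", "1", "0.5") to millicores.
cpu_to_millicores() {
    case "$1" in
        *m) echo "${1%m}" ;;
        *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
    esac
}

# pods_that_fit NODE_CPU POD_CPU_REQUEST
# How many Pods the scheduler can place on a node by CPU request alone.
pods_that_fit() {
    node=$(cpu_to_millicores "$1")
    req=$(cpu_to_millicores "$2")
    echo $(( node / req ))
}

pods_that_fit 4 500m   # prints 8  (requests defaulted to the 500m limit)
pods_that_fit 4 100m   # prints 40 (requests set to what the Pod really needs)
```

Same node, same workload, five times the density, purely because the request reflects real usage instead of the ceiling.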
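Affinity and Anti-Affinity
Database Pods need to be spread across different nodes so that a single node failure does not take them all down.
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  # serviceName, selector, and containers omitted for brevity
  template:
    metadata:
      labels:
        app: postgres    # the label the anti-affinity rule below matches on
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - postgres
            topologyKey: kubernetes.io/hostname
```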
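Pitfall 6: Anti-affinity causes scheduling failures
We had three Postgres replicas and an anti-affinity rule requiring each one on a different node, but only two worker nodes.
```
Warning FailedScheduling 30s default-scheduler 0/3 nodes are available: 2 node(s) didn't match pod anti-affinity rules.
```
Fix: switch to preferred (honored when possible, but the Pod is still scheduled when it cannot be). The trade-off is that two replicas can end up on the same node.
```yaml
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - postgres
      topologyKey: kubernetes.io/hostname
```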
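The arithmetic behind this failure is simple: a required anti-affinity rule with `topologyKey: kubernetes.io/hostname` allows at most one matching Pod per node, so any replicas beyond the number of schedulable nodes stay Pending. A tiny sketch (`pending_replicas` is a hypothetical helper of mine) makes that explicit:

```shell
# pending_replicas REPLICAS SCHEDULABLE_NODES
# With required per-hostname anti-affinity, each node holds at most one replica.
pending_replicas() {
    replicas=$1
    nodes=$2
    if [ "$replicas" -gt "$nodes" ]; then
        echo $(( $replicas - $nodes ))
    else
        echo 0
    fi
}

pending_replicas 3 2   # prints 1: our case, the tainted master does not count
```

Count only nodes the Pod can actually land on; a tainted control-plane node is not one of them unless the Pod tolerates the taint.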
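Taints and Tolerations
The master (control-plane) node carries a NoSchedule taint by default, so ordinary Pods are not scheduled onto it. Remove the taint if you want it to run workloads. Going the other way, you can taint dedicated nodes (GPU nodes, for example) so that only Pods with a matching toleration land there.
```shell
# Show the taints on a node
kubectl describe node master | grep Taints

# Remove the control-plane taint so the master can run ordinary Pods
kubectl taint nodes master node-role.kubernetes.io/control-plane:NoSchedule-

# Taint a dedicated GPU node
kubectl taint nodes gpu-node nvidia.com/gpu=true:NoSchedule
```
A Pod that should run on the tainted GPU node declares a matching toleration:
```yaml
spec:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```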
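When a Pod unexpectedly stays off a tainted node, it usually comes down to the matching rule: the keys must match, the effects must match (an empty toleration effect matches any), and then either the operator is `Exists` or it is `Equal` with identical values. The function below is my own sketch of that rule for reasoning about a single taint and toleration pair, not code taken from Kubernetes.

```shell
# tolerates TAINT TOL_KEY TOL_OPERATOR TOL_VALUE TOL_EFFECT
# TAINT uses kubectl's "key=value:effect" (or "key:effect") notation.
# Returns 0 if the toleration matches the taint, 1 otherwise.
tolerates() {
    taint=$1 tol_key=$2 tol_op=$3 tol_value=$4 tol_effect=$5
    effect=${taint##*:}
    kv=${taint%:*}
    key=${kv%%=*}
    case "$kv" in
        *=*) value=${kv#*=} ;;
        *)   value= ;;
    esac

    # An empty toleration effect matches every effect.
    if [ -n "$tol_effect" ] && [ "$tol_effect" != "$effect" ]; then
        return 1
    fi
    # An empty key is only valid with Exists, and then matches every key.
    if [ -z "$tol_key" ]; then
        [ "$tol_op" = "Exists" ]
        return
    fi
    [ "$tol_key" = "$key" ] || return 1

    case "$tol_op" in
        Exists) return 0 ;;
        Equal)  [ "$tol_value" = "$value" ] ;;
        *)      return 1 ;;
    esac
}

tolerates "nvidia.com/gpu=true:NoSchedule" nvidia.com/gpu Equal true NoSchedule \
    && echo "tolerated"   # prints "tolerated"
```

A Pod is only schedulable on a node if every NoSchedule taint on that node is matched by at least one of its tolerations.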
Service Discovery
Choosing a Service Type
```yaml
# ClusterIP: reachable only from inside the cluster (the default type)
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  type: ClusterIP
  selector:
    app: api
  ports:
  - port: 80
    targetPort: 8080
---
# NodePort: additionally exposed on every node's IP at a fixed port
apiVersion: v1
kind: Service
metadata:
  name: api-nodeport
spec:
  type: NodePort
  selector:
    app: api
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30080
---
# LoadBalancer: asks the environment for an external load balancer address
apiVersion: v1
kind: Service
metadata:
  name: api-lb
spec:
  type: LoadBalancer
  selector:
    app: api
  ports:
  - port: 80
    targetPort: 8080
```
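Pitfall 7: NodePort port range
The default NodePort range is 30000-32767. Changing it means editing the kube-apiserver configuration (the `--service-node-port-range` flag).
```shell
# Add the flag to the kube-apiserver command in the static Pod manifest,
# e.g. --service-node-port-range=20000-40000 (example range)
vi /etc/kubernetes/manifests/kube-apiserver.yaml

# The kubelet watches this directory and restarts kube-apiserver by itself
# when the manifest changes; once it is back, confirm the Pod is running
kubectl get pods -n kube-system -l component=kube-apiserver
```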
Ingress Configuration
In production we use Ingress for layer-7 routing.
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    # Caution: with ingress-nginx, a rewrite-target of "/" without capture groups
    # sends every matched request to "/"; drop it if the backend needs the original path
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.example.com
    secretName: api-tls
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80
```
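Pitfall 8: No Ingress controller installed
I created the Ingress resource, but nothing was reachable from outside. After a long hunt it turned out that no Ingress controller was installed; an Ingress object does nothing without one.
```shell
# Install the ingress-nginx controller (bare-metal manifest)
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.8.0/deploy/static/provider/baremetal/deploy.yaml

# Confirm the controller Pods are running
kubectl get pods -n ingress-nginx
```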
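Pitfall 9: No External IP for Ingress on bare metal
Cloud providers assign a LoadBalancer IP automatically; bare metal has nothing to do that.
```shell
# On bare metal the EXTERNAL-IP of a LoadBalancer Service stays <pending>
kubectl get svc -n ingress-nginx
```
Fix: use MetalLB for bare-metal load balancing. MetalLB only assigns addresses to Services of type LoadBalancer, so the ingress-nginx controller Service has to be of that type.
```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default
  namespace: metallb-system
spec:
  addresses:
  - 192.168.1.100-192.168.1.110
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
```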
Storage
PV and PVC
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ssd
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-001
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd
  local:
    path: /data/pv-001
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-ssd
  resources:
    requests:
      storage: 10Gi
```
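Pitfall 10: local volumes cannot be dynamically provisioned
The directory behind a local PV has to be created on the node ahead of time; nothing creates it for you.
```shell
# Run on worker-1, the node named in the PV's nodeAffinity
mkdir -p /data/pv-001
chmod 777 /data/pv-001   # wide open; tighten to the UID the workload runs as if you can
```
Pitfall 11: Data lost after a Pod moves
A Pod using node-local storage was rescheduled onto another node, and its data stayed behind on the original node.
Fix: pin the data to its node with nodeAffinity, as in the local PV above (the scheduler then keeps the Pod on that node, at the cost of the Pod staying Pending while the node is down), or switch to shared storage such as NFS or Ceph.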
NFS Shared Storage
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteMany
  nfs:
    server: 192.168.1.5
    path: /data/nfs
```
Pitfall 12: NFS permission problems
Pods got Permission Denied when writing to the NFS volume.
Monitoring and Alerting
Prometheus + Grafana
```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# admin123 is an example password; change it for real use
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin123
```
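Pitfall 13: Prometheus data volume is large and the disk runs out
The kube-prometheus-stack chart retains 10 days of data by default. With a high data volume the disk fills up quickly.
```yaml
prometheus:
  prometheusSpec:
    retention: "7d"
    retentionSize: "50GB"
    storageSpec:
      volumeClaimTemplate:
        spec:
          # local-ssd has no provisioner, so a matching PV must already exist
          storageClassName: local-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
```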
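To size the disk up front, the Prometheus storage documentation gives a rough formula: needed space = retention in seconds × ingested samples per second × bytes per sample, with samples averaging roughly 1 to 2 bytes each. The sketch below applies it. The function name is mine and the 100,000 samples/s figure is only an assumed example; on a running server the real ingestion rate can be read from `rate(prometheus_tsdb_head_samples_appended_total[1h])`.

```shell
# prometheus_disk_gb RETENTION_DAYS SAMPLES_PER_SEC [BYTES_PER_SAMPLE]
# Rough disk estimate in GB (10^9 bytes), rounded up.
# BYTES_PER_SAMPLE defaults to 2, the pessimistic end of the documented range.
prometheus_disk_gb() {
    days=$1
    samples_per_sec=$2
    bytes_per_sample=${3:-2}
    bytes=$(( $days * 86400 * $samples_per_sec * $bytes_per_sample ))
    echo $(( ($bytes + 999999999) / 1000000000 ))
}

prometheus_disk_gb 7 100000    # prints 121
prometheus_disk_gb 10 100000   # prints 173
```

This covers sample data only, so leave headroom for the write-ahead log and compaction on top of the estimate.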
Custom Alert Rules
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
  labels:
    # By default kube-prometheus-stack only loads rules labeled with its Helm
    # release name; "prometheus" assumes the install command used in this post
    release: prometheus
spec:
  groups:
  - name: app.rules
    rules:
    - alert: HighErrorRate
      expr: |
        (
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
        ) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value | humanizePercentage }}"

    - alert: PodRestarting
      expr: |
        rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod is restarting frequently"
```
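Pitfall 14: Too many alerts, and it turns into crying wolf
At first we defined a lot of alerts. The result was several hundred notifications a day, and the team simply went numb to them.
Fix:
- Tier the alerts: critical, warning, info
- Set a sensible `for` duration so momentary spikes do not fire
- Consolidate alerts, merging ones of the same kind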
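Troubleshooting
Pod Fails to Start
```shell
# List Pods and their status
kubectl get pods

# Logs from the previous (crashed) container instance
kubectl logs api-5d4f8b7c9-x2k9m --previous

# Events and detailed state
kubectl describe pod api-5d4f8b7c9-x2k9m
```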
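Common causes:
| Status | Cause | Fix |
| --- | --- | --- |
| ImagePullBackOff | Image pull failed | Check the image name and registry credentials |
| CrashLoopBackOff | Container crashes after starting | Read the logs and debug the application |
| Pending | Scheduling failed | Insufficient resources, taints, affinity rules |
| OOMKilled | Memory limit exceeded | Raise the limits or reduce memory usage |
| Error | Startup error | Read the logs |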
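The table above can be turned into a small lookup so the first command is always at hand. This is only a convenience sketch: `triage_hint` is a hypothetical helper of mine that prints a suggested `kubectl` command for a status and does not run anything.

```shell
# triage_hint STATUS POD
# Prints the first command worth running for a given Pod status.
triage_hint() {
    pod=$2
    case "$1" in
        CrashLoopBackOff|Error)
            # The container started and died, so its last logs are the best lead.
            echo "kubectl logs $pod --previous"
            ;;
        ImagePullBackOff|ErrImagePull|Pending|OOMKilled)
            # Pull errors, scheduling failures, and the last termination reason
            # all show up in the describe output.
            echo "kubectl describe pod $pod"
            ;;
        *)
            echo "kubectl describe pod $pod"
            ;;
    esac
}
```

For example, `triage_hint CrashLoopBackOff api-5d4f8b7c9-x2k9m` prints `kubectl logs api-5d4f8b7c9-x2k9m --previous`.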
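Network Problems
```shell
# Start a throwaway debug Pod
kubectl run debug --rm -it --image=busybox -- /bin/sh

# Inside the debug Pod: test the Service over HTTP
wget -O- http://api-service.default.svc.cluster.local:80

# From your workstation: check that the Service has endpoints
kubectl get endpoints api-service

# Inside the debug Pod: check DNS resolution
nslookup api-service.default.svc.cluster.local
```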
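The long hostname in those commands follows a fixed pattern, `<service>.<namespace>.svc.<cluster-domain>`. A one-line helper makes the pattern explicit; `svc_fqdn` is my own name, and `cluster.local` is the default cluster domain, which a cluster may override.

```shell
# svc_fqdn SERVICE [NAMESPACE] [CLUSTER_DOMAIN]
# Builds the in-cluster DNS name of a Service.
svc_fqdn() {
    echo "$1.${2:-default}.svc.${3:-cluster.local}"
}

svc_fqdn api-service              # prints api-service.default.svc.cluster.local
svc_fqdn prometheus monitoring    # prints prometheus.monitoring.svc.cluster.local
```

When debugging, test the full name first. The short forms (`api-service` from the same namespace, `api-service.default` from another) depend on the search path in the Pod's `/etc/resolv.conf`, which is one more thing that can be wrong.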
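Pitfall 15: CoreDNS crash breaks service discovery
The CoreDNS Pods ran short of memory, and after they were OOM-killed, Service names stopped resolving.
```shell
kubectl -n kube-system set resources deployment coredns \
  --limits=memory=256Mi \
  --requests=memory=128Mi
```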
Node Failures
```shell
# Check node status
kubectl get nodes

# Conditions, events, and resource allocation for one node
kubectl describe node worker-1
```
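Production Go-Live Checklist
Check all of these before going live:
- Resource limits: every Pod has requests and limits
- Health checks: livenessProbe and readinessProbe are configured
- Replicas: critical services have at least 2 replicas, spread across nodes
- Persistent data: databases and similar workloads use StatefulSet + PVC
- Backup strategy: etcd is backed up regularly
- Monitoring and alerting: Prometheus + Alertmanager are running normally
- Log collection: EFK/Loki is running normally
- Security: RBAC configuration, NetworkPolicy, Secret encryption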
```shell
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```
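Regular snapshots pile up, so a scheduled backup job usually needs a pruning step as well. The sketch below assumes the `etcd-YYYYMMDD.db` naming from the command above; the function name and the 7-day default are my own choices, not anything etcd prescribes. It relies on the `-maxdepth` and `-delete` options that GNU and BSD `find` provide.

```shell
# prune_etcd_snapshots DIR [KEEP_DAYS]
# Deletes etcd-*.db snapshots in DIR last modified more than KEEP_DAYS days ago
# and prints the path of each file it removes. Other files are left alone.
prune_etcd_snapshots() {
    dir=$1
    keep_days=${2:-7}
    find "$dir" -maxdepth 1 -type f -name 'etcd-*.db' \
        -mtime +"$keep_days" -print -delete
}
```

Run it right after a successful `snapshot save`, for example `prune_etcd_snapshots /backup 7`, and copy snapshots off the master as well, since a backup that lives only on the node it protects is lost together with that node.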
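Summary
Core lessons from deploying Kubernetes to production:
- Networking is the foundation: CNI choice, CIDR planning, DNS configuration; one wrong step and everything after it goes wrong
- Limit resources: always set requests and limits to prevent contention
- Plan storage: local, NFS, and cloud disks each have trade-offs; choose by use case
- Monitoring comes first: have Prometheus and alerting in place before go-live
- Rehearse failures: regularly simulate node and Pod failures to verify high availability
Where we hit the most pitfalls:
- CNI networking broken, Pods unable to reach each other
- No resource limits, one Pod taking down a whole node
- No storage planning, data lost when a Pod moved
- No Ingress controller installed, nothing reachable from outside
- CoreDNS short on memory, service discovery down
Kubernetes really is complex, but once you are fluent with it, deployment and operations are far more efficient than the traditional way. The key is to practice a lot and hit a lot of pitfalls; hit enough of them and the road gets smooth.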