DO280 OpenShift Commands and Troubleshooting: Common Troubleshooting and Chapter Labs

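🎹 About me: Hi everyone, I'm 金鱼哥, a CSDN rising-star creator in the operations field, Huawei Cloud 云享专家 (Cloud Expert), and Alibaba Cloud Community expert blogger
📚Qualifications: CCNA, HCNP, CSNA (network analyst), Soft Exam junior and intermediate network engineer, RHCSA, RHCE, RHCA, RHCI, ITIL😜
💬Motto: Hard work does not guarantee success, but success requires hard work🔥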

🎈Support me: like👍, bookmark⭐️, or leave a comment📝

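📜Common environment information

When OCP is installed from RPMs, the OCP-related services on the master and nodes run as Red Hat Enterprise Linux services. Run the standard sosreport utility on the master and nodes to collect information about the environment, along with docker- and OpenShift-related information.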

[root@master ~]# sosreport -k docker.all=on -k docker.logs=on

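The sosreport command creates a compressed archive containing all the relevant information and saves it in the /var/tmp directory.

Another useful diagnostic tool is the oc adm diagnostics command, which runs multiple diagnostic checks against an OpenShift cluster, including checks of the network, logging, the internal registry, and the master and node services. Run oc adm diagnostics --help for help.

📜Common diagnostic commands

The oc client command is the main tool for detecting and troubleshooting problems in an OpenShift cluster. It has many options for detecting, diagnosing, and fixing problems with the hosts and nodes, services, and resources managed by the cluster. With the required permissions, you can directly edit the configuration of most managed resources in the cluster.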

📑oc get events

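Events let OpenShift record information about life-cycle events in the cluster, so that information about OpenShift components can be viewed in a unified way. The oc get events command provides event information for an OpenShift namespace and captures events such as:

- Pod creation and deletion
- The node a pod is scheduled to
- Master and node status

Events are typically used in troubleshooting to get high-level information about failures and problems in the cluster; log files and other oc subcommands then narrow the problem down.

Example: list the events in a specific project.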

[student@workstation ~]$ oc get events -n <project>

Events can also be viewed in the web console.
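When a project produces many events, it can help to look at the warnings first. The sketch below is one rough way to do that with a text filter; warnings_only is a hypothetical helper name (not part of oc), and the sample rows are illustrative data rather than real cluster output.

```shell
#!/bin/sh
# Hypothetical helper (not an oc feature): keep the header row plus every
# line that contains the word "Warning" surrounded by whitespace.
# This is a crude text match, so a message containing " Warning " also matches.
warnings_only() {
  awk 'NR == 1 || /[[:space:]]Warning[[:space:]]/'
}

# Illustrative sample input; in practice, pipe real output instead:
#   oc get events -n <project> | warnings_only
warnings_only <<'EOF'
LAST SEEN   KIND   TYPE      REASON             MESSAGE
1m          Pod    Normal    Scheduled          Successfully assigned hello-1-abcde to node1
16s         Pod    Warning   FailedScheduling   0/3 nodes are available
EOF
```

The header is kept so the remaining columns stay readable.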

📑oc logs

The oc logs command shows the log output of a build, deployment, or pod.

Example 1: view the logs of a pod.

[student@workstation ~]$ oc logs <pod>

Example 2: view the logs of a build.

[student@workstation ~]$ oc logs bc/<build-name>

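Use oc logs with the -f option to follow the log output in real time. This is useful, for example, for continuously monitoring the progress of a build and checking for errors.

Logs can also be viewed in the web console.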

📑oc rsync

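The oc rsync command copies content to or from a directory in a running pod. If a pod has multiple containers, use the -c option to specify the container name; otherwise the command defaults to the first container in the pod. It is commonly used to transfer log files and configuration files from containers.

Example 1: copy the contents of a pod directory to a local directory.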

[student@workstation ~]$ oc rsync <pod>:<pod_dir> <local_dir> -c <container>

Example 2: copy content from a local directory to a directory in the pod.

[student@workstation ~]$ oc rsync <local_dir> <pod>:<pod_dir> -c <container>

📑oc port-forward

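Use the oc port-forward command to forward one or more local ports to a pod. This lets you listen locally on a specific or random port and forward the data to a specific port in the pod.

Example 1: listen locally on port 3306 and forward to port 3306 in the pod.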

[student@workstation ~]$ oc port-forward <pod> 3306:3306

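📜Common failures

📑Resource limit and quota problems

In projects with resource limits and quotas set, an inappropriate resource configuration causes deployments to fail. Use the oc get events and oc describe commands to find the cause of the failure.

For example, if you try to create a pod that would push the project past its quota (here, the CPU quota), oc get events reports: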

Warning FailedCreate {hello-1-deploy} Error creating: pods "hello-1" is forbidden:
exceeded quota: project-quota, requested: cpu=250m, used: cpu=750m, limited: cpu=900m
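The numbers in the event message can be checked by hand: 750m already in use plus the 250m requested is 1000m, which is more than the 900m limit. A minimal sketch of that arithmetic follows; exceeds_quota is a hypothetical helper written for illustration, and it assumes all values are millicores with the trailing "m" removed.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when used + requested
# would exceed the limit. $1 = used, $2 = requested, $3 = limit,
# all in millicores with the trailing "m" stripped.
exceeds_quota() {
  [ $(( $1 + $2 )) -gt "$3" ]
}

# Values from the event message: used=750m, requested=250m, limited=900m.
if exceeds_quota 750 250 900; then
  echo "forbidden: 750m + 250m = 1000m exceeds the 900m quota"
else
  echo "allowed"
fi
```

With these values the check reports the request as forbidden, which matches the FailedCreate event.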

📑S2I build failures

Use the oc logs command to see why an S2I build failed. For example, to view the logs for a build configuration named hello:

[student@workstation ~]$ oc logs bc/hello

You can adjust the verbosity of the build logs by specifying the BUILD_LOGLEVEL environment variable in the build configuration strategy.

{
  "sourceStrategy": {
    ...
    "env": [
      {
        "name": "BUILD_LOGLEVEL",
        "value": "5"
      }
    ]
  }
}

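📑ErrImagePull and ImagePullBackOff errors

These errors are usually caused by an incorrect deployment configuration, a wrong or missing image referenced during deployment, or an improper Docker configuration.

Use the oc get events and oc describe commands to investigate, then fix the error by editing the deployment configuration with oc edit dc/<name>, where <name> is the deployment configuration to fix.

📑Docker configuration problems

An incorrect docker configuration on the master and nodes can cause many errors during deployment.

Check the ADD_REGISTRY, INSECURE_REGISTRY, and BLOCK_REGISTRY settings in particular. Use systemctl status, oc logs, oc get events, and oc describe to troubleshoot the problem.

You can change the docker service log level by adding the --log-level parameter to the OPTIONS variable in the /etc/sysconfig/docker configuration file.

Example: set the log level to debug.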

OPTIONS='--insecure-registry=172.30.0.0/16 --selinux-enabled --log-level=debug'

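📑Master and node failures

Run the systemctl status command to look for problems in the atomic-openshift-master, atomic-openshift-node, etcd, and docker services. Use journalctl -u <unit> to view the system logs for the services listed above.

To get more detailed logging from the atomic-openshift-node, atomic-openshift-master-controllers, and atomic-openshift-master-api services, edit the --loglevel option in the respective configuration file and then restart the associated service.

Example: set the OpenShift master controllers log level to debug by editing the /etc/sysconfig/atomic-openshift-master-controllers file.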

OPTIONS=--loglevel=4 --listen=https://0.0.0.0:8444

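Further detail:

Red Hat OpenShift Container Platform has five levels of log verbosity. Regardless of the log configuration, messages with fatal, error, warning, and some info severities always appear in the logs.

0: errors and warnings only
2: normal information (default)
4: debug-level information
6: API-level debug information (request/response)
8: API debug information with the full request body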
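Changing the level means rewriting the --loglevel value on the OPTIONS line of the relevant sysconfig file. The following is only a sketch of that substitution: set_loglevel is a hypothetical helper, and the starting value of 2 in the sample line is an assumption based on the default listed above.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the --loglevel value on lines read from stdin.
# $1 = new level (for example 4 for debug-level information)
set_loglevel() {
  sed -E "s/--loglevel=[0-9]+/--loglevel=$1/"
}

# Sample OPTIONS line (the starting level 2 is assumed for illustration).
echo 'OPTIONS=--loglevel=2 --listen=https://0.0.0.0:8444' | set_loglevel 4
```

The function only transforms text on stdin, so it can be tried safely before touching a real file; after editing the actual configuration file, the associated service still has to be restarted for the new level to take effect.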

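📑Pod scheduling failures

The OpenShift master schedules pods to run on nodes. Scheduling usually fails either because the node itself is not in the Ready state or because resource limits and quotas prevent the pod from running.

Use the oc get nodes command to verify node status. While scheduling fails, the pod stays in the Pending state, which you can check with oc get pods -o wide; this command also shows which node the pod is scheduled to run on. Use oc get events and oc describe pod to inspect the details of the scheduling failure.

Example 1: the pod below fails to schedule because of insufficient CPU.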

{default-scheduler } Warning FailedScheduling pod (hello-phb4j) failed to
fit in any node
fit failure on node (hello-wx0s): Insufficient cpu
fit failure on node (hello-tgfm): Insufficient cpu
fit failure on node (hello-qwds): Insufficient cpu

Example 2: the pod below fails to schedule because no node is in the Ready state; use oc describe to investigate.

{default-scheduler } Warning FailedScheduling pod (hello-phb4j): no nodes
available to schedule pods

📜Textbook exercise

📑Environment setup

[student@workstation ~]$ lab install-prepare setup
[student@workstation ~]$ cd /home/student/do280-ansible
[student@workstation do280-ansible]$ ./install.sh

Note: skip this step if you already have a complete environment.

📑Exercise setup

[student@workstation ~]$ lab common-troubleshoot setup

📑Create the application

[student@workstation ~]$ oc login -u developer -p redhat  https://master.lab.example.com
[student@workstation ~]$ oc new-project common-troubleshoot
[student@workstation ~]$ oc new-app --name=hello -i php:5.4 \
 http://services.lab.example.com/php-helloworld         # create the application from source code
error: multiple images or templates matched "php:5.4": 2

The argument "php:5.4" could apply to the following Docker images, OpenShift image streams, or templates:

* Image stream "php" (tag "5.6") in project "openshift"
  Use --image-stream="openshift/php:5.6" to specify this image or template

* Image stream "php" (tag "7.0") in project "openshift"
  Use --image-stream="openshift/php:7.0" to specify this image or template

📑View details

[student@workstation ~]$ oc describe is php -n openshift
7.1 (latest)
  tagged from registry.lab.example.com/rhscl/php-71-rhel7:latest

  Build and run PHP 7.1 applications on RHEL 7. For more information about using this builder image, including OpenShift considerations, see https://github.com/sclorg/s2i-php-container/blob/master/7.1/README.md.
  Tags: builder, php
  Supports: php:7.1, php
  Example Repo: https://github.com/openshift/cakephp-ex.git

  ! error: Import failed (NotFound): dockerimage.image.openshift.io "registry.lab.example.com/rhscl/php-71-rhel7:latest" not found
      3 days ago
…………
5.5
  tagged from registry.lab.example.com/openshift3/php-55-rhel7:latest

  Build and run PHP 5.5 applications on RHEL 7. For more information about using this builder image, including OpenShift considerations, see https://github.com/sclorg/s2i-php-container/blob/master/5.5/README.md.
  Tags: hidden, builder, php
  Supports: php:5.5, php
  Example Repo: https://github.com/openshift/cakephp-ex.git

  ! error: Import failed (NotFound): dockerimage.image.openshift.io "registry.lab.example.com/openshift3/php-55-rhel7:latest" not found

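Conclusion: the requested php:5.4 image is not available in the registry; the new-app output above offers only the php 5.6 and 7.0 tags.

📑Fix the error

[student@workstation ~]$ oc new-app --name=hello -i php:7.0 http://services.lab.example.com/php-helloworld
[student@workstation ~]$ oc get pod -o wide      # check again: the pod stays in Pending
NAME            READY     STATUS    RESTARTS   AGE       IP        NODE
hello-1-build   0/1       Pending   0          40s       <none>    <none>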

📑View details

[student@workstation ~]$ oc log hello-1-build		# view the log (oc log is deprecated; use oc logs)
W0301 17:25:02.867828    4584 cmd.go:358] log is DEPRECATED and will be removed in a future version. Use logs instead. 

[student@workstation ~]$ oc get events			# view the events
LAST SEEN   FIRST SEEN   COUNT     NAME                             KIND      SUBOBJECT   TYPE      REASON             SOURCE              MESSAGE
16s         47s          7         hello-1-build.16682daab914ecb6   Pod                   Warning   FailedScheduling   default-scheduler   0/3 nodes are available: 1 MatchNodeSelector, 2 NodeNotReady.
[student@workstation ~]$ oc describe pod hello-1-build	# view the pod details
……
Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  23s (x8 over 1m)  default-scheduler  0/3 nodes are available: 1 MatchNodeSelector, 2 NodeNotReady.

Conclusion: no node is available to schedule this pod.

[root@master ~]# oc get nodes				# on the master, check the nodes further
NAME                     STATUS     ROLES     AGE       VERSION
master.lab.example.com   Ready      master    1d        v1.9.1+a0ce1bc657
node1.lab.example.com    NotReady   compute   1d        v1.9.1+a0ce1bc657
node2.lab.example.com    NotReady   compute   1d        v1.9.1+a0ce1bc657

Conclusion: the node status is abnormal; node1 and node2 are both NotReady.
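On a cluster with more nodes, scanning the STATUS column by eye gets tedious. As a rough sketch, the filter below prints only the nodes that are not Ready; not_ready_nodes is a hypothetical helper name, and the sample input is the oc get nodes output shown above.

```shell
#!/bin/sh
# Hypothetical helper: skip the header row, then print the NAME of every
# node whose STATUS column is not exactly "Ready". A combined status such
# as "Ready,SchedulingDisabled" is therefore also printed.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Sample input copied from the output above; in practice:
#   oc get nodes | not_ready_nodes
not_ready_nodes <<'EOF'
NAME                     STATUS     ROLES     AGE       VERSION
master.lab.example.com   Ready      master    1d        v1.9.1+a0ce1bc657
node1.lab.example.com    NotReady   compute   1d        v1.9.1+a0ce1bc657
node2.lab.example.com    NotReady   compute   1d        v1.9.1+a0ce1bc657
EOF
```

For this sample the filter prints node1.lab.example.com and node2.lab.example.com, the two hosts whose services need checking.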

📑Check the services

[root@node1 ~]# systemctl status atomic-openshift-node.service
[root@node2 ~]# systemctl status atomic-openshift-node.service
[root@node2 ~]# systemctl status docker
[root@node1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2021-03-01 17:23:12 CST; 4min 52s ago
     Docs: http://docs.docker.com
 Main PID: 17637 (code=exited, status=0/SUCCESS)

Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.375792111+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.382396227+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.387020843+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.394091193+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.402339410+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.404059183+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.413005258+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.436107140+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.485170808+08:00" level=i...ed"
Mar 01 17:23:12 node1.lab.example.com systemd[1]: Stopped Docker Application Container Engine.
Hint: Some lines were ellipsized, use -l to show in full.

Conclusion: docker on the nodes is abnormal; the service is inactive (dead).

📑Start the service

[root@node1 ~]# systemctl start docker
[root@node2 ~]# systemctl start docker

📑Verify

[root@master ~]# oc get nodes                # check node status again
NAME                     STATUS    ROLES     AGE       VERSION
master.lab.example.com   Ready     master    1d        v1.9.1+a0ce1bc657
node1.lab.example.com    Ready     compute   1d        v1.9.1+a0ce1bc657
node2.lab.example.com    Ready     compute   1d        v1.9.1+a0ce1bc657

[student@workstation ~]$ oc get pods       # confirm the pod is now scheduled to a node
NAME            READY     STATUS    RESTARTS   AGE
hello-1-build   1/1       Running   0          22m

[student@workstation ~]$ oc describe is    # view the image stream details
Name:			hello
Namespace:		common-troubleshoot
Created:		15 minutes ago
Labels:			app=hello
Annotations:		openshift.io/generated-by=OpenShiftNewApp
Docker Pull Spec:	docker-registry.default.svc:5000/common-troubleshoot/hello
Image Lookup:		local=false
Unique Images:		1
Tags:			1

latest
  no spec tag

  * docker-registry.default.svc:5000/common-troubleshoot/hello@sha256:8d63ed61d6e9c74933fe0d0d8aadceecb71751abf260f10645c19737a3e13354
      10 minutes ago

Conclusion: the image stream shows that the built image has been pushed to the internal registry.

📑Clean up the project

[student@workstation ~]$ oc delete project common-troubleshoot

📜Comprehensive lab

📑Environment setup

[student@workstation ~]$ lab install-prepare setup
[student@workstation ~]$ cd /home/student/do280-ansible
[student@workstation do280-ansible]$ ./install.sh

Note: skip this step if you already have a complete environment.

📑Lab setup

[student@workstation ~]$ lab execute-review setup

📑Clone the git project locally

[student@workstation ~]$ cd /home/student/DO280/labs/execute-review/
[student@workstation execute-review]$ git clone http://services.lab.example.com/node-hello

📑Build the image with docker

[student@workstation execute-review]$ cd node-hello/
[student@workstation node-hello]$ docker build -t node-hello:latest .
[student@workstation node-hello]$ docker images              # list the images
REPOSITORY                                      TAG                 IMAGE ID            CREATED             SIZE
node-hello                                      latest              9b3befb0536b        9 seconds ago       495 MB
registry.lab.example.com/rhscl/nodejs-6-rhel7   latest              fba56b5381b7        3 years ago         489 MB

📑Tag the image

[student@workstation node-hello]$ docker tag 9b3befb0536b registry.lab.example.com/node-hello:latest
[student@workstation node-hello]$ docker images
REPOSITORY                                      TAG                 IMAGE ID            CREATED              SIZE
node-hello                                      latest              9b3befb0536b        About a minute ago   495 MB
registry.lab.example.com/node-hello             latest              9b3befb0536b        About a minute ago   495 MB
registry.lab.example.com/rhscl/nodejs-6-rhel7   latest              fba56b5381b7        3 years ago          489 MB

📑Push the image

[student@workstation node-hello]$ docker push registry.lab.example.com/node-hello:latest

📑Create the application in the project

[student@workstation ~]$ oc login -u developer -p redhat https://master.lab.example.com
[student@workstation ~]$ oc projects
[student@workstation ~]$ oc project execute-review
[student@workstation ~]$ oc new-app registry.lab.example.com/node-hello --name hello
[student@workstation ~]$ oc get all           # list all resources
NAME                      REVISION   DESIRED   CURRENT   TRIGGERED BY
deploymentconfigs/hello   1          1         1         config,image(hello:latest)

NAME                 DOCKER REPO                                             TAGS      UPDATED
imagestreams/hello   docker-registry.default.svc:5000/execute-review/hello   latest    12 seconds ago

NAME                READY     STATUS             RESTARTS   AGE
po/hello-1-deploy   1/1       Running            0          12s
po/hello-1-zswgc    0/1       ImagePullBackOff   0          9s

NAME         DESIRED   CURRENT   READY     AGE
rc/hello-1   1         1         0         12s

NAME        TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
svc/hello   ClusterIP   172.30.7.229   <none>        3000/TCP,8080/TCP   12s

📑Troubleshoot ImagePullBackOff

[student@workstation ~]$ oc logs hello-1-zswgc          # view the logs
Error from server (BadRequest): container "hello" in pod "hello-1-zswgc" is waiting to start: trying and failing to pull image
[student@workstation ~]$ oc describe pod hello-1-zswgc   # view the pod details
[student@workstation ~]$ oc get events --sort-by='.metadata.creationTimestamp'   # view the events

Conclusion: the image pull failed.

📑Pull the image manually

[student@workstation ~]$ oc get pod -o wide
NAME             READY     STATUS             RESTARTS   AGE       IP            NODE
hello-1-deploy   1/1       Running            0          32s       10.129.0.93   node2.lab.example.com
hello-1-zswgc    0/1       ImagePullBackOff   0          30s       <none>        node2.lab.example.com 
[root@node2 ~]# docker pull registry.lab.example.com/node-hello
Using default tag: latest
Trying to pull repository registry.lab.example.com/node-hello ... 
All endpoints blocked.

Conclusion: all endpoints are blocked. In OpenShift, this kind of error usually comes from an incorrect deployment configuration or an invalid docker configuration.

📑Fix the docker configuration

[root@node1 ~]# vi /etc/sysconfig/docker
# Change this line:
BLOCK_REGISTRY='--block-registry registry.access.redhat.com --block-registry docker.io --block-registry registry.lab.example.com'
# to:
BLOCK_REGISTRY='--block-registry registry.access.redhat.com --block-registry docker.io'
[root@node1 ~]# systemctl restart docker

Note: apply the same change on node2.
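Because the same edit has to be repeated on every node, it may be convenient to express it as a text substitution instead of a manual vi session. The sketch below shows one way to do that; unblock_registry is a hypothetical helper, not a docker or OpenShift tool, and it only transforms lines read from stdin.

```shell
#!/bin/sh
# Hypothetical helper: remove one "--block-registry <host>" entry from any
# BLOCK_REGISTRY line read on stdin. $1 = registry host to unblock.
unblock_registry() {
  # Escape dots so they match literally in the regular expression.
  host=$(printf '%s' "$1" | sed 's/\./\\./g')
  sed -E "/^BLOCK_REGISTRY=/ s/ ?--block-registry ${host}//"
}

# The BLOCK_REGISTRY line before the fix, as shown above.
before="BLOCK_REGISTRY='--block-registry registry.access.redhat.com --block-registry docker.io --block-registry registry.lab.example.com'"
printf '%s\n' "$before" | unblock_registry registry.lab.example.com
```

For the line above this prints the corrected BLOCK_REGISTRY value with only registry.access.redhat.com and docker.io blocked. Applying it to the real /etc/sysconfig/docker file and restarting docker remain separate steps on each node.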

📑Redeploy the pod

[student@workstation ~]$ oc rollout latest hello
[student@workstation ~]$ oc get pods        # confirm
NAME             READY     STATUS    RESTARTS   AGE
hello-1-deploy   0/1       Error     0          10m
hello-2-scrbl    1/1       Running   0          28s

📑Verify

[student@workstation ~]$ oc logs hello-2-scrbl 
nodejs server running on http://0.0.0.0:3000

📑Expose the service

[student@workstation ~]$ oc expose svc hello --hostname=hello.apps.lab.example.com
route "hello" exposed

📑Test the service

[student@workstation ~]$ curl http://hello.apps.lab.example.com
Hi! I am running on host -> hello-2-scrbl
[student@workstation ~]$ lab execute-review grade # grade the lab with the script

📑Clean up the lab

[student@workstation ~]$ oc delete project execute-review

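💡Summary

RHCA certification takes five courses and exams, so it needs a fair amount of time for study and preparation. Keep at it, you can do it🤪.

That concludes 【金鱼哥】's overview and walkthrough of Chapter 4, OpenShift Commands and Troubleshooting: Common Troubleshooting and Chapter Labs. I hope it helps anyone who reads this article.

💾Red Hat certification column series:
RHCSA column: 戏说 RHCSA 认证
RHCE column: 戏说 RHCE 认证
This article is part of the RHCA column: RHCA 回忆录

If this article helped you, please give 【金鱼哥】 a like👍. Writing takes effort; compared with the official wording, I prefer to explain each point in plain, easy-to-follow language.

If you are interested in operations technology, you are welcome to follow ❤️❤️❤️ 【金鱼哥】 ❤️❤️❤️; I will bring you plenty of rewards and surprises💕💕!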

Original article: https://blog.csdn.net/qq_41765918/article/details/125377175