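🎹 About me: Hi everyone, I'm 金鱼哥, a CSDN Rising Star Creator in the operations field, a Huawei Cloud 云享 Expert, and an Alibaba Cloud Community Expert Blogger
📚 Certifications: CCNA, HCNP, CSNA (network analyst), 软考 junior and intermediate network engineer, RHCSA, RHCE, RHCA, RHCI, ITIL 😜
💬 Motto: Hard work does not guarantee success, but success requires hard work 🔥 🎈 Support me: like 👍, bookmark ⭐️, or leave a comment 📝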
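When OCP is installed from RPMs, the OCP services on the master and nodes run as Red Hat Enterprise Linux services. Run the standard sosreport utility on the master and nodes to collect information about the environment, including docker- and OpenShift-related information.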
[root@master ~]# sosreport -k docker.all=on -k docker.logs=on
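The sosreport command creates a compressed archive containing all the relevant information and saves it in the /var/tmp directory.
Another useful diagnostic tool is the oc adm diagnostics command, which runs a number of diagnostic checks against an OpenShift cluster, covering networking, logging, the internal registry, and the master and node services, among others. Run oc adm diagnostics --help for help.
The oc client command is the primary tool for detecting and troubleshooting problems in an OpenShift cluster. It has many options for detecting, diagnosing, and fixing problems with the hosts and nodes, services, and resources managed by the cluster. With the required permissions, you can directly edit the configuration of most managed resources in the cluster.
Events let OpenShift record information about lifecycle events in the cluster, so that information about OpenShift components can be viewed in a unified way. The oc get events command provides the event information for an OpenShift namespace.
Events are typically used in troubleshooting to get high-level information about failures and problems in the cluster; log files and other oc subcommands are then used to narrow things down.
Example: use the following command to list the events in a specific project.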
[student@workstation ~]$ oc get events -n <project>
Events can also be viewed in the web console.
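When a project has many events, it can help to narrow the list down to warnings. The following is only a minimal sketch: it assumes the tabular layout that oc get events prints in this article's environment, and the function name warning_events is made up for illustration.

```shell
# warning_events: hypothetical helper that keeps the header row and every
# row containing the standalone field "Warning". It reads `oc get events`
# output on stdin, so it can be tried on saved output without a cluster.
# Caveat: a row whose message contains the bare word "Warning" also matches.
warning_events() {
    awk 'NR == 1 { print; next }
         { for (i = 1; i <= NF; i++) if ($i == "Warning") { print; next } }'
}

# Usage (assumes you are logged in and <project> exists):
#   oc get events -n <project> | warning_events
```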
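The oc logs command shows the log output of a build, deployment, or pod.
Example 1: view the logs of a pod with the oc command.
[student@workstation ~]$ oc logs <pod>
Example 2: view the logs of a build with the oc command.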
[student@workstation ~]$ oc logs bc/build-name
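Use the oc logs command with the -f option to follow the log output in real time. This is useful, for example, for continuously monitoring the progress of a build and checking for errors.
Logs can also be viewed in the web console.
The oc rsync command copies content to or from a directory in a running pod. If a pod has multiple containers, use the -c option to specify the container; otherwise it defaults to the first container in the pod. It is typically used to transfer log files and configuration files from a container.
Example 1: copy the contents of a pod directory to a local directory.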
[student@workstation ~]$ oc rsync <pod>:<pod_dir> <local_dir> -c <container>
Example 2: copy content from a local directory to a directory in the pod.
[student@workstation ~]$ oc rsync <local_dir> <pod>:<pod_dir> -c <container>
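Use the oc port-forward command to forward one or more local ports to a pod. This lets you listen on a specific or random port locally and have the data forwarded to a specific port in the pod.
Example 1: listen on local port 3306 and forward to port 3306 in the pod.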
[student@workstation ~]$ oc port-forward <pod> 3306:3306
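For projects with resource limits and quotas set, an improper resource configuration causes deployments to fail. Use the oc get events and oc describe commands to find the cause of the failure.
For example, when creating a pod would exceed the project's quota (the CPU quota in the output below), running oc get events reports: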
Warning FailedCreate {hello-1-deploy} Error creating: pods "hello-1" is forbidden:
exceeded quota: project-quota, requested: cpu=250m, used: cpu=750m, limited: cpu=900m
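The numbers in the message explain the rejection: the new pod requests 250m of CPU on top of the 750m already in use, and 750m + 250m = 1000m is more than the 900m quota. A small sketch of that check, with a made-up function name and all values in millicores:

```shell
# fits_quota: hypothetical helper; all three arguments are CPU millicores.
# Succeeds (exit 0) only if used + requested stays within the limit.
fits_quota() {
    requested=$1
    used=$2
    limit=$3
    [ $((used + requested)) -le "$limit" ]
}

# Values taken from the event above: 750 + 250 = 1000 > 900
if fits_quota 250 750 900; then
    echo "request fits in the quota"
else
    echo "request exceeds the quota"   # this branch runs
fi
```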
Use the oc logs command to investigate S2I build failures. For example, to view the logs of the build configuration named hello:
[student@workstation ~]$ oc logs bc/hello
You can adjust the verbosity of the build log by specifying the BUILD_LOGLEVEL environment variable in the build configuration strategy.
{
"sourceStrategy": {
...
"env": [
{
"name": "BUILD_LOGLEVEL",
"value": "5"
}
]
}
}
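Instead of editing the JSON by hand, the same variable can most likely be set from the command line. This is a sketch under assumptions: it relies on the oc set env subcommand of the oc 3.x client and on a build configuration named hello, as in the earlier example; the wrapper name set_build_loglevel is made up.

```shell
# set_build_loglevel: hypothetical wrapper around `oc set env`.
# $1 = build configuration name, $2 = BUILD_LOGLEVEL value (default 5).
set_build_loglevel() {
    oc set env "bc/$1" "BUILD_LOGLEVEL=${2:-5}"
}

# Usage (assumes you are logged in to the right project):
#   set_build_loglevel hello 5
#   oc start-build hello   # assumption: start a new build to see the more verbose log
```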
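Deployment failures are usually caused by an incorrect deployment configuration, a wrong or missing image referenced during deployment, or an improper Docker configuration.
Use the oc get events and oc describe commands to investigate, and fix the error by editing the deployment configuration with oc edit dc/<name>.
An incorrect docker configuration on the master and nodes can cause many errors during deployment.
Typically, check the ADD_REGISTRY, INSECURE_REGISTRY, and BLOCK_REGISTRY settings. Use the systemctl status, oc logs, oc get events, and oc describe commands to troubleshoot the problem.
You can change the log level of the docker service by adding the --log-level parameter in the /etc/sysconfig/docker configuration file.
Example: set the log level to debug.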
OPTIONS='--insecure-registry=172.30.0.0/16 --selinux-enabled --log-level=debug'
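To confirm which log level a node is configured with, the OPTIONS line can be inspected. A minimal sketch, assuming the file uses a single OPTIONS= line like the one above; the function name docker_log_level is made up.

```shell
# docker_log_level: hypothetical helper that prints the value of
# --log-level found on the OPTIONS line of a sysconfig-style file
# (default: /etc/sysconfig/docker), or "unset" when the flag is absent.
docker_log_level() {
    file=${1:-/etc/sysconfig/docker}
    level=$(sed -n 's/^OPTIONS=.*--log-level=\([a-z]*\).*/\1/p' "$file" | tail -n 1)
    echo "${level:-unset}"
}

# Usage on a node:
#   docker_log_level
```

A change to the file only takes effect after the docker service is restarted with systemctl restart docker.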
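Run the systemctl status command to troubleshoot problems in the atomic-openshift-master, atomic-openshift-node, etcd, and docker services. Use the journalctl -u command to view the system logs for the services listed above.
You can increase the logging verbosity of the atomic-openshift-node, atomic-openshift-master-controllers, and atomic-openshift-master-api services by editing the --loglevel variable in their respective configuration files and then restarting the associated service.
Example: to set the log level of the OpenShift master controllers to debug, edit the /etc/sysconfig/atomic-openshift-master-controllers file.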
OPTIONS=--loglevel=4 --listen=https://0.0.0.0:8444
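Further reading:
Red Hat OpenShift Container Platform has five levels of log verbosity. Regardless of the log configuration, messages with fatal, error, warning, and some info severities always appear in the logs.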
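As far as I remember from the OpenShift 3.x documentation, the five levels are 0, 2, 4, 6, and 8. The lookup below is a sketch built on that recollection rather than on anything stated in this article, so check the descriptions against the documentation for your release; the function name is made up.

```shell
# describe_loglevel: hypothetical lookup table for --loglevel values.
# The descriptions are recalled from the OpenShift 3.x documentation and
# are an assumption here; verify them for your version.
describe_loglevel() {
    case "$1" in
        0) echo "errors and warnings only" ;;
        2) echo "normal information" ;;
        4) echo "debugging-level information" ;;
        6) echo "API-level debugging information (request/response)" ;;
        8) echo "body-level API debugging information" ;;
        *) echo "not one of the five documented levels" ;;
    esac
}

describe_loglevel 4   # prints: debugging-level information
```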
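The OpenShift master schedules pods to run on nodes. A pod usually fails to run because the node itself is not in the Ready state, or because of resource limits and quotas.
Use the oc get nodes command to verify node status. While scheduling is failing, the pod stays in the Pending state, which you can check with oc get pods -o wide; that command also shows the node the pod is scheduled to run on. Use the oc get events and oc describe pod commands to see the details of the scheduling failure.
Example 1: the pod below fails to schedule because of insufficient CPU.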
{default-scheduler } Warning FailedScheduling pod (hello-phb4j) failed to
fit in any node
fit failure on node (hello-wx0s): Insufficient cpu
fit failure on node (hello-tgfm): Insufficient cpu
fit failure on node (hello-qwds): Insufficient cpu
Example 2: the pod below fails to schedule because no node is in the Ready state; investigate with oc describe.
{default-scheduler } Warning FailedScheduling pod (hello-phb4j): no nodes
available to schedule pods
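When the scheduler reports that no nodes are available, a quick first step is to list the nodes that are not Ready. A minimal sketch, assuming the column layout of oc get nodes (NAME first, STATUS second); the function name not_ready_nodes is made up.

```shell
# not_ready_nodes: hypothetical helper that reads `oc get nodes` output on
# stdin and prints the name of every node whose STATUS column does not
# contain "Ready" as a whole comma-separated entry. "NotReady" is reported;
# "Ready,SchedulingDisabled" is not.
not_ready_nodes() {
    awk 'NR > 1 {
        ready = 0
        n = split($2, parts, ",")
        for (i = 1; i <= n; i++) if (parts[i] == "Ready") ready = 1
        if (!ready) print $1
    }'
}

# Usage (on the master, or anywhere with permission to list nodes):
#   oc get nodes | not_ready_nodes
```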
[student@workstation ~]$ lab install-prepare setup
[student@workstation ~]$ cd /home/student/do280-ansible
[student@workstation do280-ansible]$ ./install.sh
Note: skip this step if you already have a complete environment.
[student@workstation ~]$ lab common-troubleshoot setup
[student@workstation ~]$ oc login -u developer -p redhat https://master.lab.example.com
[student@workstation ~]$ oc new-project common-troubleshoot
[student@workstation ~]$ oc new-app --name=hello -i php:5.4 \
http://services.lab.example.com/php-helloworld # create the application from source code
error: multiple images or templates matched "php:5.4": 2
The argument "php:5.4" could apply to the following Docker images, OpenShift image streams, or templates:
* Image stream "php" (tag "5.6") in project "openshift"
Use --image-stream="openshift/php:5.6" to specify this image or template
* Image stream "php" (tag "7.0") in project "openshift"
Use --image-stream="openshift/php:7.0" to specify this image or template
[student@workstation ~]$ oc describe is php -n openshift
7.1 (latest)
tagged from registry.lab.example.com/rhscl/php-71-rhel7:latest
Build and run PHP 7.1 applications on RHEL 7. For more information about using this builder image, including OpenShift considerations, see https://github.com/sclorg/s2i-php-container/blob/master/7.1/README.md.
Tags: builder, php
Supports: php:7.1, php
Example Repo: https://github.com/openshift/cakephp-ex.git
! error: Import failed (NotFound): dockerimage.image.openshift.io "registry.lab.example.com/rhscl/php-71-rhel7:latest" not found
3 days ago
…………
5.5
tagged from registry.lab.example.com/openshift3/php-55-rhel7:latest
Build and run PHP 5.5 applications on RHEL 7. For more information about using this builder image, including OpenShift considerations, see https://github.com/sclorg/s2i-php-container/blob/master/5.5/README.md.
Tags: hidden, builder, php
Supports: php:5.5, php
Example Repo: https://github.com/openshift/cakephp-ex.git
! error: Import failed (NotFound): dockerimage.image.openshift.io "registry.lab.example.com/openshift3/php-55-rhel7:latest" not found
Conclusion: the output above shows that the required image does not exist in the registry.
[student@workstation ~]$ oc new-app --name=hello -i php:7.0 http://services.lab.example.com/php-helloworld
[student@workstation ~]$ oc get pod -o wide # check again: the pod stays in Pending
NAME READY STATUS RESTARTS AGE IP NODE
hello-1-build 0/1 Pending 0 40s <none> <none>
[student@workstation ~]$ oc log hello-1-build # view the log
W0301 17:25:02.867828 4584 cmd.go:358] log is DEPRECATED and will be removed in a future version. Use logs instead.
[student@workstation ~]$ oc get events # view events
LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
16s 47s 7 hello-1-build.16682daab914ecb6 Pod Warning FailedScheduling default-scheduler 0/3 nodes are available: 1 MatchNodeSelector, 2 NodeNotReady.
[student@workstation ~]$ oc describe pod hello-1-build # view details
……
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 23s (x8 over 1m) default-scheduler 0/3 nodes are available: 1 MatchNodeSelector, 2 NodeNotReady.
Conclusion: the output above shows that no node is available to schedule this pod.
[root@master ~]# oc get nodes # on the master, check the node status further
NAME STATUS ROLES AGE VERSION
master.lab.example.com Ready master 1d v1.9.1+a0ce1bc657
node1.lab.example.com NotReady compute 1d v1.9.1+a0ce1bc657
node2.lab.example.com NotReady compute 1d v1.9.1+a0ce1bc657
Conclusion: the output above shows that the nodes are unhealthy; neither compute node is in the Ready state.
[root@node1 ~]# systemctl status atomic-openshift-node.service
[root@node2 ~]# systemctl status atomic-openshift-node.service
[root@node1 ~]# systemctl status docker
[root@node2 ~]# systemctl status docker
[root@node1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
Active: inactive (dead) since Mon 2021-03-01 17:23:12 CST; 4min 52s ago
Docs: http://docs.docker.com
Main PID: 17637 (code=exited, status=0/SUCCESS)
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.375792111+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.382396227+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.387020843+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.394091193+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.402339410+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.404059183+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.413005258+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.436107140+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.485170808+08:00" level=i...ed"
Mar 01 17:23:12 node1.lab.example.com systemd[1]: Stopped Docker Application Container Engine.
Hint: Some lines were ellipsized, use -l to show in full.
Conclusion: the output above shows that docker is not running normally on the nodes.
[root@node1 ~]# systemctl start docker
[root@node2 ~]# systemctl start docker
[root@master ~]# oc get nodes # check the node status again
NAME STATUS ROLES AGE VERSION
master.lab.example.com Ready master 1d v1.9.1+a0ce1bc657
node1.lab.example.com Ready compute 1d v1.9.1+a0ce1bc657
node2.lab.example.com Ready compute 1d v1.9.1+a0ce1bc657
[student@workstation ~]$ oc get pods # confirm that the pod is now scheduled to a node
NAME READY STATUS RESTARTS AGE
hello-1-build 1/1 Running 0 22m
[student@workstation ~]$ oc describe is # view the image stream details
Name: hello
Namespace: common-troubleshoot
Created: 15 minutes ago
Labels: app=hello
Annotations: openshift.io/generated-by=OpenShiftNewApp
Docker Pull Spec: docker-registry.default.svc:5000/common-troubleshoot/hello
Image Lookup: local=false
Unique Images: 1
Tags: 1
latest
no spec tag
* docker-registry.default.svc:5000/common-troubleshoot/hello@sha256:8d63ed61d6e9c74933fe0d0d8aadceecb71751abf260f10645c19737a3e13354
10 minutes ago
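Conclusion: the output above shows that the image has also been pushed to the internal registry, as the image stream reports.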
[student@workstation ~]$ oc delete project common-troubleshoot
[student@workstation ~]$ lab install-prepare setup
[student@workstation ~]$ cd /home/student/do280-ansible
[student@workstation do280-ansible]$ ./install.sh
Note: skip this step if you already have a complete environment.
[student@workstation ~]$ lab execute-review setup
[student@workstation ~]$ cd /home/student/DO280/labs/execute-review/
[student@workstation execute-review]$ git clone http://services.lab.example.com/node-hello
[student@workstation execute-review]$ cd node-hello/
[student@workstation node-hello]$ docker build -t node-hello:latest .
[student@workstation node-hello]$ docker images # list images
REPOSITORY TAG IMAGE ID CREATED SIZE
node-hello latest 9b3befb0536b 9 seconds ago 495 MB
registry.lab.example.com/rhscl/nodejs-6-rhel7 latest fba56b5381b7 3 years ago 489 MB
[student@workstation node-hello]$ docker tag 9b3befb0536b registry.lab.example.com/node-hello:latest
[student@workstation node-hello]$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
node-hello latest 9b3befb0536b About a minute ago 495 MB
registry.lab.example.com/node-hello latest 9b3befb0536b About a minute ago 495 MB
registry.lab.example.com/rhscl/nodejs-6-rhel7 latest fba56b5381b7 3 years ago 489 MB
[student@workstation node-hello]$ docker push registry.lab.example.com/node-hello:latest
[student@workstation ~]$ oc login -u developer -p redhat https://master.lab.example.com
[student@workstation ~]$ oc projects
[student@workstation ~]$ oc project execute-review
[student@workstation ~]$ oc new-app registry.lab.example.com/node-hello --name hello
[student@workstation ~]$ oc get all # view all resources
NAME REVISION DESIRED CURRENT TRIGGERED BY
deploymentconfigs/hello 1 1 1 config,image(hello:latest)
NAME DOCKER REPO TAGS UPDATED
imagestreams/hello docker-registry.default.svc:5000/execute-review/hello latest 12 seconds ago
NAME READY STATUS RESTARTS AGE
po/hello-1-deploy 1/1 Running 0 12s
po/hello-1-zswgc 0/1 ImagePullBackOff 0 9s
NAME DESIRED CURRENT READY AGE
rc/hello-1 1 1 0 12s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/hello ClusterIP 172.30.7.229 <none> 3000/TCP,8080/TCP 12s
[student@workstation ~]$ oc logs hello-1-zswgc # view the logs
Error from server (BadRequest): container "hello" in pod "hello-1-zswgc " is waiting to start: trying and failing to pull image
[student@workstation ~]$ oc describe pod hello-1-zswgc # view details
[student@workstation ~]$ oc get events --sort-by='.metadata.creationTimestamp' # view events
Conclusion: the output above shows that the image pull failed.
[student@workstation ~]$ oc get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE
hello-1-deploy 1/1 Running 0 32s 10.129.0.93 node2.lab.example.com
hello-1-zswgc 0/1 ImagePullBackOff 0 30s <none> node2.lab.example.com
[root@node2 ~]# docker pull registry.lab.example.com/node-hello
Using default tag: latest
Trying to pull repository registry.lab.example.com/node-hello ...
All endpoints blocked.
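Conclusion: the output above shows that all endpoints are blocked. In OpenShift, this type of error is usually caused by an incorrect deployment configuration or an invalid docker configuration.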
[root@node1 ~]# vi /etc/sysconfig/docker
Change the line BLOCK_REGISTRY='--block-registry registry.access.redhat.com --block-registry docker.io --block-registry registry.lab.example.com'
to
BLOCK_REGISTRY='--block-registry registry.access.redhat.com --block-registry docker.io'
[root@node1 ~]# systemctl restart docker
Note: repeat the same steps on node2.
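Before rolling out again, it may be worth confirming on each node that the lab registry is no longer blocked. A minimal sketch, assuming a single BLOCK_REGISTRY= line in the format of the docker sysconfig file edited above; the function name registry_blocked is made up.

```shell
# registry_blocked: hypothetical helper. Succeeds (exit 0) if the given
# registry is listed on the BLOCK_REGISTRY line of a sysconfig-style file
# (default: /etc/sysconfig/docker). The line is split on spaces, quotes and
# "=" so that the registry name is compared as a whole token.
registry_blocked() {
    registry=$1
    file=${2:-/etc/sysconfig/docker}
    grep '^BLOCK_REGISTRY=' "$file" | tr -s " '\"=" '\n' | grep -qxF -e "$registry"
}

# Usage on a node:
#   if registry_blocked registry.lab.example.com; then
#       echo "still blocked"
#   else
#       echo "not blocked"
#   fi
```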
[student@workstation ~]$ oc rollout latest hello
[student@workstation ~]$ oc get pods # confirm
NAME READY STATUS RESTARTS AGE
hello-1-deploy 0/1 Error 0 10m
hello-2-scrbl 1/1 Running 0 28s
[student@workstation ~]$ oc logs hello-2-scrbl
nodejs server running on http://0.0.0.0:3000
[student@workstation ~]$ oc expose svc hello --hostname=hello.apps.lab.example.com
route "hello" exposed
[student@workstation ~]$ curl http://hello.apps.lab.example.com
Hi! I am running on host -> hello-2-scrbl
[student@workstation ~]$ lab execute-review grade # verify the lab with the grading script
[student@workstation ~]$ oc delete project execute-review
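The RHCA certification takes five courses and exams, so it requires a good deal of study and preparation time. Keep at it, you can do it 🤪.
That wraps up 金鱼哥's overview of Chapter 4, OpenShift Commands and Troubleshooting: Common Troubleshooting and the Chapter Lab. I hope it helps anyone who reads it.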
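This article is part of the RHCA column: RHCA 回忆录 (RHCA Memoirs).
If this article helped you, please give 金鱼哥 a like 👍. Writing takes effort, and compared with the official wording, I prefer to explain each topic in plain, easy-to-understand language.
If you are interested in operations and maintenance technology, you are welcome to follow ❤️❤️❤️ 金鱼哥 ❤️❤️❤️; I hope to bring you plenty to learn and a few surprises 💕💕!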