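In vCenter environments you will often see vCLS VM creation fail and then repeat over and over, with each task reporting "The task was canceled by a user", as shown in the figure below. These repeated tasks clutter the task list and make other events in the environment harder to review. More importantly, they usually point to a serious underlying problem.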

Starting with vSphere 7.0 U1, cluster DRS depends on vCLS being available, so this problem deserves a thorough investigation.
1. Check the vCenter cluster alarms
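When a virtual machine has problems, for example a power-on failure reporting that "the virtual machine is pinned to the host", start by checking the cluster's alarms. Navigate to Cluster > Monitor > Issues and Alarms > All Issues. In this case there were alarms reporting that DRS was not functioning because vCLS was unhealthy, which confirms that vCLS is the problem.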
2. Log in to vCenter and check the vmware-vsan-health service
First log in over SSH as root, then type shell to switch to the Bash shell:

Check the service status:
service-control --status vmware-vsan-health
The service is indeed in the Stopped state:
root@vcsa [ /storage/log/vmware ]# service-control --status vmware-vsan-health
Stopped:
vmware-vsan-health
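The status output above lists a state header ("Stopped:") followed by the services in that state. If you want to check this from a script, the following is a minimal Python sketch that turns that text into a mapping from state to service names.

It assumes the plain layout shown above: a header line ending in a colon, then one service name per line. It also assumes clean stdout. Interleaved tracebacks like the ones further down would confuse it, because lines such as "Call stack:" also end in a colon. parse_status is a hypothetical helper, not part of service-control.

```python
from typing import Dict, List, Optional


def parse_status(output: str) -> Dict[str, List[str]]:
    """Map each state header (e.g. 'Stopped') to the service names listed under it.

    Assumes the layout shown above: a header line ending in ':' followed by
    one service name per line (leading whitespace is ignored).
    """
    result: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in output.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        if stripped.endswith(":"):
            # Header such as 'Stopped:' starts a new group
            current = stripped[:-1]
            result.setdefault(current, [])
        elif current is not None:
            # Any other line is a service name under the current header
            result[current].append(stripped)
    return result


print(parse_status("Stopped:\n vmware-vsan-health"))  # {'Stopped': ['vmware-vsan-health']}
```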
Start it with the following command:
service-control --start vmware-vsan-health
This produces the following errors:
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.7/logging/__init__.py", line 1029, in emit
    self.flush()
  File "/usr/lib/python3.7/logging/__init__.py", line 1009, in flush
    self.stream.flush()
OSError: [Errno 28] No space left on device
Call stack:
  File "/usr/bin/service-control", line 181, in <module>
    logging.info('********** Start %s **********', sys.argv[1:])
Message: '********** Start %s **********'
Arguments: (['--status', 'vmware-vsan-health'],)
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.7/logging/__init__.py", line 1029, in emit
    self.flush()
  File "/usr/lib/python3.7/logging/__init__.py", line 1009, in flush
    self.stream.flush()
OSError: [Errno 28] No space left on device
Call stack:
  File "/usr/bin/service-control", line 183, in <module>
    exit(main())
  File "/usr/bin/service-control", line 167, in main
    process_arguments(parsed_args)
  File "/usr/bin/service-control", line 121, in process_arguments
    svcs_status = get_services_status(svc_names, ignore_err=ignore_err)
  File "/usr/lib/vmware/site-packages/cis/svcsController.py", line 643, in get_services_status
    quiet=_quiet)
  File "/usr/lib/vmware/site-packages/cis/svcsController.py", line 62, in _svc_log
    logging.info(msg)
Message: "Get services status, svcnames=['vmware-vsan-health']"
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.7/logging/__init__.py", line 1029, in emit
    self.flush()
  File "/usr/lib/python3.7/logging/__init__.py", line 1009, in flush
    self.stream.flush()
OSError: [Errno 28] No space left on device
Call stack:
  File "/usr/bin/service-control", line 183, in <module>
    exit(main())
  File "/usr/bin/service-control", line 167, in main
    process_arguments(parsed_args)
  File "/usr/bin/service-control", line 149, in process_arguments
    logging.info(msg)
Message: 'Stopped:\n vmware-vsan-health'
Arguments: ()
Stopped:
vmware-vsan-health
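Every logging error above ends in OSError: [Errno 28] No space left on device. On Linux, errno 28 is ENOSPC. If you write your own maintenance scripts, it can help to tell a full disk apart from other I/O failures.

Below is a small Python sketch of that idea. is_disk_full and can_write are hypothetical helper names. The write probe is only a heuristic: it assumes that forcing a tiny write to disk with fsync surfaces ENOSPC when the filesystem is full. Note that root may still be able to write into reserved blocks even when df reports 100%.

```python
import errno
import os
import tempfile


def is_disk_full(exc: OSError) -> bool:
    """True if the OSError is 'No space left on device' (ENOSPC, errno 28 on Linux)."""
    return exc.errno == errno.ENOSPC


def can_write(directory: str) -> bool:
    """Probe `directory` by writing and fsyncing a 1-byte temporary file.

    Returns False only when the write fails with ENOSPC; any other OSError
    (e.g. permission denied) is re-raised so it is not mistaken for a full disk.
    The temporary file is deleted automatically when closed.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as f:
            f.write(b"x")
            f.flush()
            os.fsync(f.fileno())  # force allocation so ENOSPC surfaces here
    except OSError as exc:
        if is_disk_full(exc):
            return False
        raise
    return True
```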

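A closer look shows that the command fails because the system has run out of disk space, and this may well be the root cause of the service failure too. Checking disk usage showed that /storage/log was 100% full. We deleted some old log files and used mv to move others elsewhere until there was enough free space.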
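The disk check above is normally done with df -h. Below is a sketch of the same check in Python using the standard-library shutil.disk_usage.

The 90% threshold and the helper names are illustrative assumptions. /storage/log is simply the mount point that filled up in this case. The percentage here is used/total, so it can read slightly lower than df's Use%, because df leaves root-reserved blocks out of its denominator.

```python
import os
import shutil
from typing import NamedTuple


class UsageReport(NamedTuple):
    path: str
    percent_used: float
    free_bytes: int


def check_usage(path: str) -> UsageReport:
    """Report how full the filesystem containing `path` is (used / total)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent = 100.0 * usage.used / usage.total if usage.total else 0.0
    return UsageReport(path, round(percent, 1), usage.free)


def is_nearly_full(report: UsageReport, threshold: float = 90.0) -> bool:
    """True when usage meets or exceeds the (assumed) alert threshold."""
    return report.percent_used >= threshold


if __name__ == "__main__":
    # Mount points are examples; only existing paths are checked.
    for mount in ("/storage/log", "/"):
        if os.path.exists(mount):
            r = check_usage(mount)
            flag = "NEARLY FULL" if is_nearly_full(r) else "ok"
            print(f"{r.path}: {r.percent_used}% used, {r.free_bytes // 1024 ** 2} MiB free [{flag}]")
```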

Note: the log partition on the VMware vCenter Server appliance fills up fairly often. Free up space as appropriate for your environment.
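Before deleting or moving anything, it helps to list the oldest and largest files first. Below is a read-only Python sketch that only lists candidates and deletes nothing.

The 30-day cutoff and the name old_log_files are assumptions. Review the list and decide file by file, since a running service may still have some of these files open.

```python
import os
import stat
import time
from typing import List, Optional, Tuple


def old_log_files(root: str, days: int = 30,
                  now: Optional[float] = None) -> List[Tuple[str, int, float]]:
    """Return (path, size_bytes, age_days) for regular files under `root`
    not modified in the last `days` days, largest first. Nothing is deleted.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: symlinks are not followed
            except OSError:
                continue  # vanished or unreadable; skip it
            if stat.S_ISREG(st.st_mode) and st.st_mtime < cutoff:
                found.append((path, st.st_size, (now - st.st_mtime) / 86400))
    found.sort(key=lambda item: item[1], reverse=True)
    return found


if __name__ == "__main__":
    # Show the 20 largest candidates under the partition that filled up.
    for path, size, age in old_log_files("/storage/log", days=30)[:20]:
        print(f"{size / 1024 ** 2:8.1f} MiB  {age:6.0f} d  {path}")
```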
Run the start command again:
root@vcsa [ /storage/log/vmware ]# service-control --start vmware-vsan-health
Operation not cancellable. Please wait for it to finish...
Performing start operation on service vsan-health...
Successfully started service vsan-health
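If you want to script this retry, one option is to run the same service-control --start command and classify its output. Below is a hedged Python sketch of that approach.

Matching on the strings "Successfully started service" and "No space left on device" is an assumption based only on the output shown in this post, not a documented contract. diagnose and start_service are hypothetical helpers, and start_service only works on the appliance shell, where service-control exists.

```python
import subprocess


def diagnose(output: str) -> str:
    """Classify service-control output using the messages seen in this post."""
    if "Successfully started service" in output:
        return "started"
    if "No space left on device" in output:
        return "disk_full"
    return "unknown"


def start_service(name: str) -> str:
    """Run `service-control --start <name>` and classify the combined output."""
    proc = subprocess.run(
        ["service-control", "--start", name],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so tracebacks are captured too
        universal_newlines=True,
    )
    return diagnose(proc.stdout)
```

Usage on the appliance would be start_service("vmware-vsan-health"). A "disk_full" result sends you back to the disk-space check rather than to a blind retry.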
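OK, the service is running normally again.
3. Check the cluster DRS status
Back in vCenter, the DRS alarms on the cluster have cleared, and the vCLS VMs are once again being created and powered on successfully. DRS has recovered, and the affected virtual machines now power on normally.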