When the TensorFlow and CUDA versions do not match, TensorFlow cannot access the GPU, so TF 1.15 needs its own dedicated CUDA environment:
https://raychiu.blog.csdn.net/article/details/126682812
Official reference:
https://github.com/tensorflow/models/tree/v1.13.0/research/slim#Export
Official documentation for the containerized environment:
https://www.tensorflow.org/install/docker
Create a new conda environment with Python 3.7, activate it, and install the TensorFlow 1.x environment (1.15 here):
pip install tensorflow-gpu==1.15
pip install --upgrade protobuf==3.20.1
pip install tf_slim
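Check that the environment works (run the second command from the models/research/slim directory, where the nets package lives):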
python -c "import tensorflow.contrib.slim as slim; eval = slim.evaluation.evaluate_once"
python -c "from nets import cifarnet; mynet = cifarnet.cifarnet"
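The two checks above can also be folded into one small script that reports every missing module at once and asks TensorFlow whether it can reach a GPU. This is only a sketch: the helper names `check_imports` and `gpu_visible` are made up for illustration, and `tf.test.is_gpu_available()` is the TF 1.x call assumed for the GPU check.

```python
import importlib


def check_imports(module_names):
    """Try to import each module; map name -> None on success, or the error text."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # ImportError, or a failure raised inside the module
            results[name] = "{}: {}".format(type(exc).__name__, exc)
    return results


def gpu_visible():
    """Return True/False from TensorFlow, or None if TensorFlow cannot be imported."""
    try:
        import tensorflow as tf
    except Exception:
        return None
    return tf.test.is_gpu_available()


if __name__ == "__main__":
    # Run from models/research/slim so that the `nets` package is importable.
    report = check_imports(["tensorflow", "tf_slim", "nets.cifarnet"])
    for name, err in report.items():
        print(name, "OK" if err is None else "FAILED ({})".format(err))
    print("GPU visible:", gpu_visible())
```

If `GPU visible` prints `False` while `nvidia-smi` shows the card, a TF/CUDA version mismatch is the first thing to suspect.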
Install the remaining dependencies:
pip install contextlib2 pillow
Download and convert the dataset:
DATA_DIR=~/data/dataset/data/flowers
python download_and_convert_data.py --dataset_name=flowers --dataset_dir="${DATA_DIR}"
Running ls ${DATA_DIR} shows that several TFRecord files were created, along with labels.txt, which contains the mapping from integer labels to class names.
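To inspect the result without TensorFlow, the label mapping can be read with plain Python. A minimal sketch, assuming each line of labels.txt has the form `index:class_name` and that the shard files end in `.tfrecord`; the helper names `read_label_file` and `list_tfrecords` are illustrative and not part of the slim code.

```python
import glob
import os


def read_label_file(dataset_dir, filename="labels.txt"):
    """Parse lines of the assumed form '<int>:<class name>' into a dict."""
    labels = {}
    with open(os.path.join(dataset_dir, filename)) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            index, name = line.split(":", 1)
            labels[int(index)] = name
    return labels


def list_tfrecords(dataset_dir):
    """Return the sorted TFRecord file names found directly in dataset_dir."""
    pattern = os.path.join(dataset_dir, "*.tfrecord")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))


if __name__ == "__main__":
    data_dir = os.path.expanduser("~/data/dataset/data/flowers")
    if os.path.isdir(data_dir):
        print(list_tfrecords(data_dir))
        print(read_label_file(data_dir))
```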
Load the dataset:
import tensorflow as tf
from datasets import flowers
slim = tf.contrib.slim
# Selects the 'validation' dataset.
dataset = flowers.get_split('validation', "/home/xiaoling/data/dataset/data/flowers")
# Creates a TF-Slim DataProvider which reads the dataset in the background
# during both training and testing.
provider = slim.dataset_data_provider.DatasetDataProvider(dataset)
[image, label] = provider.get(['image', 'label'])

Run the official fine-tuning script:
https://github.com/tensorflow/models/blob/master/research/slim/scripts/finetune_inception_v3_on_flowers.sh
Monitor the GPU while it runs:
watch -n 1 nvidia-smi
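When the numbers are needed inside a script rather than on screen, nvidia-smi can also be queried in CSV form. A rough sketch, assuming the `--query-gpu` and `--format=csv,noheader,nounits` options of nvidia-smi; the function names and the choice of fields are illustrative.

```python
import subprocess

QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def _to_int(value):
    """Convert a CSV cell to int; cells such as '[N/A]' become None."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse csv,noheader,nounits output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [_to_int(cell.strip()) for cell in line.split(",")]
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (memory values in MiB)."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ])
    return parse_gpu_csv(out.decode())
```

Calling `query_gpus()` once per second from a loop gives roughly the same view as the `watch` command, in a form that can be logged next to the training loss.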
Training finished quickly.

Running the official code raised this error:
Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above
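Handle it like this:
import os
# CUDA_VISIBLE_DEVICES takes a comma-separated list of GPU indices, not TensorFlow device names such as '/gpu:0'.
# Set it before TensorFlow initializes the GPU.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # expose only GPU 0; by default the program then reserves all of its memory
# With multiple GPUs, if only one is being used, set this to '0,1' and so on.
Or:
# Alternative form
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'  # number the GPUs by PCI bus ID, starting from 0
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'  # use GPUs 0 and 1, which TensorFlow names '/gpu:0' and '/gpu:1'
# '0,1' and '1,0' give different orderings: the device listed first has higher priority and is used first.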
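The same settings can be wrapped in a helper that rejects anything that is not a plain index, which catches values such as '/gpu:0' early. A minimal sketch; `select_gpus` is a hypothetical name, not a TensorFlow or CUDA API.

```python
import os
import sys


def select_gpus(indices, order_by_pci_bus=True):
    """Expose only the given GPU indices to CUDA and return the value that was set."""
    # int() raises ValueError for anything that is not a plain index, e.g. '/gpu:0'.
    value = ",".join(str(int(i)) for i in indices)
    if order_by_pci_bus:
        os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    if "tensorflow" in sys.modules:
        # If TensorFlow has already initialized CUDA, the new value has no effect.
        print("warning: tensorflow is already imported", file=sys.stderr)
    return value
```

Call it at the very top of the training script, for example `select_gpus([0])`, before the `import tensorflow` line.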
After that, the script ran.
Install nvidia-docker:
https://raychiu.blog.csdn.net/article/details/126629132
Using the GPU from TensorFlow inside a container:
https://raychiu.blog.csdn.net/article/details/126687326
Launch command:
sudo docker run -it --name ubuntu-tf-gpu --rm -v /home/xiaoling/data/projects/pythonHome/0902_test_GPU/models/research/slim:/workspaces --gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all -p 1234:22 tensorflow/tensorflow:1.15.0-gpu
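Because the command is long, it can help to see it split into its parts. The sketch below rebuilds the same argument list in Python so that the mounted directory and port mapping are easy to change; `docker_run_command` is an illustrative helper and `/path/to/models/research/slim` is a placeholder for the local slim checkout.

```python
import shlex


def docker_run_command(host_dir,
                       image="tensorflow/tensorflow:1.15.0-gpu",
                       name="ubuntu-tf-gpu",
                       container_dir="/workspaces",
                       host_port=1234,
                       container_port=22):
    """Build the argument list for the docker run command used above."""
    return [
        "docker", "run", "-it", "--rm",                    # interactive; remove container on exit
        "--name", name,
        "-v", "{}:{}".format(host_dir, container_dir),     # mount the slim code into the container
        "--gpus", "all",                                   # expose every GPU to the container
        "-e", "NVIDIA_DRIVER_CAPABILITIES=compute,utility",
        "-e", "NVIDIA_VISIBLE_DEVICES=all",
        "-p", "{}:{}".format(host_port, container_port),   # host port -> container port
        image,
    ]


if __name__ == "__main__":
    cmd = docker_run_command("/path/to/models/research/slim")  # placeholder path
    print(" ".join(shlex.quote(part) for part in cmd))
```

The printed line can be pasted into a shell (prefixed with `sudo` if required), or the list can be passed to `subprocess.run`.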
Training then runs inside the container without problems as well.