• Installing the SlowFast network environment and testing the model on Ubuntu 18.04 (Python 3.9)


    When installing with pip, a mirror inside China is recommended, e.g. pip install xxx -i https://pypi.tuna.tsinghua.edu.cn/simple

    Contents

    1. Create the conda env

    2. install pytorch

    3. install fvcore

    4. install simplejson

    5. Check the gcc version

    6. PyAV

    7. ffmpeg with PyAV

    8. PyYaml, tqdm

    9. iopath

    10. psutil

    11. opencv

    12. tensorboard

    13. moviepy

    14. PyTorchVideo

    15. Detectron2

    16. FairScale

    17. SlowFast

    Run the demo to test the model

    Errors encountered during installation

    error0

    error1

    error2

    error3

    error4

    error5

    error6

    error7


    1. Create the conda env

    conda create -n py39 python=3.9

    2. install pytorch 

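    First check the CUDA version, then choose the matching PyTorch version.

    Check the highest CUDA version supported by the system's NVIDIA driver (shown in the header of the nvidia-smi output).

    Check the currently installed CUDA version (nvcc -V).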
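The two checks above can also be scripted. The following is a minimal sketch, assuming that `nvidia-smi` prints a `CUDA Version: X.Y` field in its header and that `nvcc --version` prints a `release X.Y` field; the helper names are hypothetical and not part of any library.

```python
import re
import shutil
import subprocess


def parse_driver_cuda(smi_text):
    """Highest CUDA version the driver supports, from nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*([\d.]+)", smi_text)
    return match.group(1) if match else None


def parse_toolkit_cuda(nvcc_text):
    """Installed CUDA toolkit version, from `nvcc --version` output."""
    match = re.search(r"release\s+([\d.]+)", nvcc_text)
    return match.group(1) if match else None


def run(cmd):
    """Return the command's stdout, or an empty string if it is not on PATH."""
    if shutil.which(cmd[0]) is None:
        return ""
    return subprocess.run(cmd, capture_output=True, text=True).stdout


def report():
    print("driver supports up to CUDA:", parse_driver_cuda(run(["nvidia-smi"])))
    print("installed CUDA toolkit:", parse_toolkit_cuda(run(["nvcc", "--version"])))


if __name__ == "__main__":
    report()
```

The cudatoolkit version passed to conda in the next step should not exceed the version reported for the driver.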

    Install the pytorch and torchvision builds that match that CUDA version:

    source activate py39
    conda install pytorch torchvision cudatoolkit=11.3 -c pytorch
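After the install, it may be worth confirming that PyTorch actually sees the GPU. This is a small sketch, not part of the original steps; `describe_torch` is a hypothetical helper name, and it only relies on `torch.__version__`, `torch.version.cuda` and `torch.cuda.is_available()`.

```python
import importlib.util


def describe_torch():
    """Summarize the PyTorch install without failing if torch is absent."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch

    return {
        "installed": True,
        "version": torch.__version__,
        # CUDA version the binary was built against (None for CPU-only builds)
        "cuda_build": torch.version.cuda,
        # True only if a usable GPU and driver are visible at runtime
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch())
```

If `cuda_available` is False while `cuda_build` shows a version, the driver or the chosen cudatoolkit version is the first thing to re-check.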

    3. install fvcore

    pip install git+https://github.com/facebookresearch/fvcore

    4. install simplejson

    pip install simplejson 

    5. Check the gcc version

    gcc -v

    The version here is 7.5.0.

    6. PyAV

    conda install av -c conda-forge

    7. ffmpeg with PyAV

    pip install av

    8. PyYaml, tqdm

    Both are installed along with fvcore. To confirm:

    pip show pyyaml tqdm

    9. iopath

    pip install -U iopath

    10. psutil

    pip install psutil

    11. opencv

    pip install opencv-python

    12. tensorboard

    Check whether tensorboard is already installed:

    conda list tensorboard

    If tensorboard is not installed:

    pip install tensorboard

    13. moviepy

    pip install moviepy

    14. PyTorchVideo

    pip install pytorchvideo

    15. Detectron2

    git clone https://github.com/facebookresearch/detectron2 detectron2_repo

    pip install -e detectron2_repo

    16. FairScale

    pip install git+https://github.com/facebookresearch/fairscale

    17. SlowFast

    git clone https://github.com/facebookresearch/SlowFast.git


    cd SlowFast
    python setup.py build develop
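Once the steps above are done, a quick check of which packages are still missing can save a few rounds of ImportError. The sketch below is an assumption-level helper, not part of SlowFast; the list holds the import names that correspond to the packages installed in this guide plus those that come up in the errors section (for example `yaml` for PyYaml, `cv2` for opencv-python, `sklearn` for scikit-learn).

```python
import importlib.util

# Import names for the packages used in this guide (assumed mapping).
REQUIRED_MODULES = [
    "torch", "torchvision", "fvcore", "simplejson", "av", "yaml", "tqdm",
    "iopath", "psutil", "cv2", "tensorboard", "moviepy", "pytorchvideo",
    "detectron2", "fairscale", "slowfast", "scipy", "sklearn", "PIL",
]


def find_missing(module_names):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("missing modules:", missing if missing else "none")
```

This only checks that each module can be located; it does not import them, so binary problems such as the libgnutls error described under error3 will not show up here.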

    Run the demo to test the model

    python3 tools/run_net.py --cfg demo/AVA/SLOWFAST_32x2_R101_50_50.yaml

    Errors encountered during installation

    error0 

    PIL cannot be found.

    Fix: in setup.py, change PIL to Pillow.

    error1

    from pytorchvideo.layers.distributed import ( # noqa
    ImportError: cannot import name 'cat_all_gather' from 'pytorchvideo.layers.distributed' (/home/cxgk/anaconda3/envs/sf/lib/python3.9/site-packages/pytorchvideo/layers/distributed.py)

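    Fix:

    Option 1: copy the contents of the pytorchvideo/ directory on the main branch of facebookresearch/pytorchvideo on GitHub (see Reference) into the corresponding directory of the virtual environment, which here is: /home/cxgk/anaconda3/envs/sf/lib/python3.9/site-packages/pytorchvideo/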
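The site-packages path above is specific to the author's machine. To find the equivalent directory in another environment, something like the following sketch should work; `package_dir` is a hypothetical helper name, and the code only uses `importlib.util.find_spec` from the standard library.

```python
import importlib.util
import os


def package_dir(name):
    """Return the directory an installed package is loaded from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return os.path.abspath(list(spec.submodule_search_locations)[0])


if __name__ == "__main__":
    # Run inside the activated conda env to see where pytorchvideo lives.
    print(package_dir("pytorchvideo"))
```

Running it with the environment activated prints the directory to copy into, or None if pytorchvideo is not installed there.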

    Option 2:
    Add the following content to layers/distributed.py:

    # Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.

    """Distributed helpers."""

    import torch
    import torch.distributed as dist
    from torch._C._distributed_c10d import ProcessGroup
    from torch.autograd.function import Function

    _LOCAL_PROCESS_GROUP = None


    def get_world_size() -> int:
        """
        Simple wrapper for correctly getting worldsize in both distributed
        / non-distributed settings
        """
        return (
            torch.distributed.get_world_size()
            if torch.distributed.is_available() and torch.distributed.is_initialized()
            else 1
        )


    def cat_all_gather(tensors, local=False):
        """Performs the concatenated all_reduce operation on the provided tensors."""
        if local:
            gather_sz = get_local_size()
        else:
            gather_sz = torch.distributed.get_world_size()
        tensors_gather = [torch.ones_like(tensors) for _ in range(gather_sz)]
        torch.distributed.all_gather(
            tensors_gather,
            tensors,
            async_op=False,
            group=_LOCAL_PROCESS_GROUP if local else None,
        )
        output = torch.cat(tensors_gather, dim=0)
        return output


    def init_distributed_training(cfg):
        """
        Initialize variables needed for distributed training.
        """
        if cfg.NUM_GPUS <= 1:
            return
        num_gpus_per_machine = cfg.NUM_GPUS
        num_machines = dist.get_world_size() // num_gpus_per_machine
        for i in range(num_machines):
            ranks_on_i = list(
                range(i * num_gpus_per_machine, (i + 1) * num_gpus_per_machine)
            )
            pg = dist.new_group(ranks_on_i)
            if i == cfg.SHARD_ID:
                global _LOCAL_PROCESS_GROUP
                _LOCAL_PROCESS_GROUP = pg


    def get_local_size() -> int:
        """
        Returns:
            The size of the per-machine process group,
            i.e. the number of processes per machine.
        """
        if not dist.is_available():
            return 1
        if not dist.is_initialized():
            return 1
        return dist.get_world_size(group=_LOCAL_PROCESS_GROUP)


    def get_local_rank() -> int:
        """
        Returns:
            The rank of the current process within the local (per-machine) process group.
        """
        if not dist.is_available():
            return 0
        if not dist.is_initialized():
            return 0
        assert _LOCAL_PROCESS_GROUP is not None
        return dist.get_rank(group=_LOCAL_PROCESS_GROUP)


    def get_local_process_group() -> ProcessGroup:
        assert _LOCAL_PROCESS_GROUP is not None
        return _LOCAL_PROCESS_GROUP


    class GroupGather(Function):
        """
        GroupGather performs all gather on each of the local process/ GPU groups.
        """

        @staticmethod
        def forward(ctx, input, num_sync_devices, num_groups):
            """
            Perform forwarding, gathering the stats across different process/ GPU
            group.
            """
            ctx.num_sync_devices = num_sync_devices
            ctx.num_groups = num_groups

            input_list = [torch.zeros_like(input) for k in range(get_local_size())]
            dist.all_gather(
                input_list, input, async_op=False, group=get_local_process_group()
            )

            inputs = torch.stack(input_list, dim=0)
            if num_groups > 1:
                rank = get_local_rank()
                group_idx = rank // num_sync_devices
                inputs = inputs[
                    group_idx * num_sync_devices : (group_idx + 1) * num_sync_devices
                ]
            inputs = torch.sum(inputs, dim=0)
            return inputs

        @staticmethod
        def backward(ctx, grad_output):
            """
            Perform backwarding, gathering the gradients across different process/ GPU
            group.
            """
            grad_output_list = [
                torch.zeros_like(grad_output) for k in range(get_local_size())
            ]
            dist.all_gather(
                grad_output_list,
                grad_output,
                async_op=False,
                group=get_local_process_group(),
            )

            grads = torch.stack(grad_output_list, dim=0)
            if ctx.num_groups > 1:
                rank = get_local_rank()
                group_idx = rank // ctx.num_sync_devices
                grads = grads[
                    group_idx
                    * ctx.num_sync_devices : (group_idx + 1)
                    * ctx.num_sync_devices
                ]
            grads = torch.sum(grads, dim=0)
            return grads, None, None

    error2

    from scipy.ndimage import gaussian_filter

    ModuleNotFoundError: No module named 'scipy'

    Fix:

    pip install scipy

    error3

    from av._core import time_base, library_versions

    ImportError: /home/cxgk/anaconda3/envs/sf/lib/python3.9/site-packages/av/../../.././libgnutls.so.30: symbol mpn_copyi version HOGWEED_6 not defined in file libhogweed.so.6 with link time reference
     

    Fix:

    First remove the existing av package, then install it with pip:

    pip install av


    error4

    File "/media/cxgk/Linux/work/SlowFast/slowfast/models/losses.py", line 11, in <module>
    from pytorchvideo.losses.soft_target_cross_entropy import (
    ModuleNotFoundError: No module named 'pytorchvideo.losses'

    Fix:

    Go to "/home/cxgk/anaconda3/envs/sf/lib/python3.9/site-packages/pytorchvideo/losses" (create the folder if it does not exist), create a new file soft_target_cross_entropy.py in it, and add the following code:

    # Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from pytorchvideo.layers.utils import set_attributes
    from pytorchvideo.transforms.functional import convert_to_one_hot


    class SoftTargetCrossEntropyLoss(nn.Module):
        """
        Adapted from Classy Vision: ./classy_vision/losses/soft_target_cross_entropy_loss.py.
        This allows the targets for the cross entropy loss to be multi-label.
        """

        def __init__(
            self,
            ignore_index: int = -100,
            reduction: str = "mean",
            normalize_targets: bool = True,
        ) -> None:
            """
            Args:
                ignore_index (int): sample should be ignored for loss if the class is this value.
                reduction (str): specifies reduction to apply to the output.
                normalize_targets (bool): whether the targets should be normalized to a sum of 1
                    based on the total count of positive targets for a given sample.
            """
            super().__init__()

            # Stores ignore_index, reduction and normalize_targets as attributes
            set_attributes(self, locals())
            assert isinstance(self.normalize_targets, bool)
            if self.reduction not in ["mean", "none"]:
                raise NotImplementedError(
                    'reduction type "{}" not implemented'.format(self.reduction)
                )

            self.eps = torch.finfo(torch.float32).eps

        def forward(self, input: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
            """
            Args:
                input (torch.Tensor): the shape of the tensor is N x C, where N is the number of
                    samples and C is the number of classes. The tensor is raw input without
                    softmax/sigmoid.
                target (torch.Tensor): the shape of the tensor is N x C or N. If the shape is N, we
                    will convert the target to one hot vectors.
            """
            # Check if targets are inputted as class integers
            if target.ndim == 1:
                assert (
                    input.shape[0] == target.shape[0]
                ), "SoftTargetCrossEntropyLoss requires input and target to have same batch size!"
                target = convert_to_one_hot(target.view(-1, 1), input.shape[1])

            assert input.shape == target.shape, (
                "SoftTargetCrossEntropyLoss requires input and target to be same "
                f"shape: {input.shape} != {target.shape}"
            )

            # Samples where the targets are ignore_index do not contribute to the loss
            N, C = target.shape
            valid_mask = torch.ones((N, 1), dtype=torch.float).to(input.device)
            if 0 <= self.ignore_index <= C - 1:
                # The attribute is ignore_index; "ignore_idx" would raise AttributeError
                drop_idx = target[:, self.ignore_index] > 0
                valid_mask[drop_idx] = 0

            valid_targets = target.float() * valid_mask
            if self.normalize_targets:
                valid_targets /= self.eps + valid_targets.sum(dim=1, keepdim=True)
            per_sample_per_target_loss = -valid_targets * F.log_softmax(input, -1)

            per_sample_loss = torch.sum(per_sample_per_target_loss, -1)
            # Perform reduction
            if self.reduction == "mean":
                # Normalize based on the number of samples with > 0 non-ignored targets
                loss = per_sample_loss.sum() / torch.sum(
                    (torch.sum(valid_mask, -1) > 0)
                ).clamp(min=1)
            elif self.reduction == "none":
                loss = per_sample_loss

            return loss
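To make the quantity computed by this loss concrete, here is a pure-Python illustration for a single sample. It is a sketch of the formula (negative sum of normalized targets times log-softmax), not the pytorchvideo implementation; the function name and the `eps` default are assumptions chosen for the example.

```python
import math


def soft_target_cross_entropy(logits, targets, normalize_targets=True, eps=1e-7):
    """Loss for one sample: -sum_c t_c * log_softmax(logits)_c."""
    # Numerically stable log-softmax: shift by the largest logit first.
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    log_probs = [x - log_z for x in logits]

    weights = [float(t) for t in targets]
    if normalize_targets:
        # Rescale multi-label targets so they sum to (almost exactly) 1.
        total = eps + sum(weights)
        weights = [w / total for w in weights]

    return -sum(w * lp for w, lp in zip(weights, log_probs))


if __name__ == "__main__":
    # Two classes with equal logits: each class has probability 0.5.
    print(soft_target_cross_entropy([0.0, 0.0], [1, 0]))  # about ln 2 = 0.6931
    # Multi-label target: both classes positive, weights become 0.5 each.
    print(soft_target_cross_entropy([0.0, 0.0], [1, 1]))  # also about ln 2
```

With a one-hot target this reduces to the ordinary cross entropy; the normalization step is what lets a sample carry several positive labels.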

    error5

    from sklearn.metrics import confusion_matrix

    ModuleNotFoundError: No module named 'sklearn'

    Fix:

    pip install scikit-learn

    error6

    raise KeyError("Non-existent config key: {}".format(full_key))

    KeyError: 'Non-existent config key: TENSORBOARD.MODEL_VIS.TOPK'

    Fix:

    Comment out the following three lines in the YAML config:

    TENSORBOARD

    MODEL_VIS

    TOPK

    error7

    RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 3.94 GiB total capacity; 2.83 GiB already allocated; 25.44 MiB free; 2.84 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

    Fix:

    Reduce the number of frames in the YAML config:

    DATA:
      NUM_FRAMES: 16
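If the original config should stay untouched, the same change can be applied to a copy of the file's text. The sketch below is one possible way to do that with a regular expression; `set_num_frames` is a hypothetical helper, and it assumes the config contains a line of the form `NUM_FRAMES: <integer>`.

```python
import re


def set_num_frames(yaml_text, num_frames):
    """Return yaml_text with every 'NUM_FRAMES: <int>' value replaced."""
    pattern = re.compile(r"^([ \t]*NUM_FRAMES:[ \t]*)\d+", flags=re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{num_frames}", yaml_text
    )
    if count == 0:
        raise ValueError("no NUM_FRAMES entry found in the config text")
    return new_text


if __name__ == "__main__":
    sample = "DATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n"
    print(set_num_frames(sample, 16))
```

The returned text can be written to a new YAML file and passed to --cfg in place of the original.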

    Reference:

    https://github.com/facebookresearch/pytorchvideo/blob/main/pytorchvideo

  • Original article: https://blog.csdn.net/y459541195/article/details/126278476