• 41. cuBLAS Developer Guide (Chinese edition): the Level-2 function gemvStridedBatched() in cuBLAS


    2.6.25. cublas<t>gemvStridedBatched()

    cublasStatus_t cublasSgemvStridedBatched(cublasHandle_t handle,
                                             cublasOperation_t trans,
                                             int m, int n,
                                             const float           *alpha,
                                             const float           *A, int lda,
                                             long long int         strideA,
                                             const float           *x, int incx,
                                             long long int         stridex,
                                             const float           *beta,
                                             float                 *y, int incy,
                                             long long int         stridey,
                                             int batchCount)
    cublasStatus_t cublasDgemvStridedBatched(cublasHandle_t handle,
                                             cublasOperation_t trans,
                                             int m, int n,
                                             const double          *alpha,
                                             const double          *A, int lda,
                                             long long int         strideA,
                                             const double          *x, int incx,
                                             long long int         stridex,
                                             const double          *beta,
                                             double                *y, int incy,
                                             long long int         stridey,
                                             int batchCount)
    cublasStatus_t cublasCgemvStridedBatched(cublasHandle_t handle,
                                             cublasOperation_t trans,
                                             int m, int n,
                                             const cuComplex       *alpha,
                                             const cuComplex       *A, int lda,
                                             long long int         strideA,
                                             const cuComplex       *x, int incx,
                                             long long int         stridex,
                                             const cuComplex       *beta,
                                             cuComplex             *y, int incy,
                                             long long int         stridey,
                                             int batchCount)
    cublasStatus_t cublasZgemvStridedBatched(cublasHandle_t handle,
                                             cublasOperation_t trans,
                                             int m, int n,
                                             const cuDoubleComplex *alpha,
                                             const cuDoubleComplex *A, int lda,
                                             long long int         strideA,
                                             const cuDoubleComplex *x, int incx,
                                             long long int         stridex,
                                             const cuDoubleComplex *beta,
                                             cuDoubleComplex       *y, int incy,
                                             long long int         stridey,
                                             int batchCount)
    cublasStatus_t cublasHSHgemvStridedBatched(cublasHandle_t handle,
                                               cublasOperation_t trans,
                                               int m, int n,
                                               const float           *alpha,
                                               const __half          *A, int lda,
                                               long long int         strideA,
                                               const __half          *x, int incx,
                                               long long int         stridex,
                                               const float           *beta,
                                               __half                *y, int incy,
                                               long long int         stridey,
                                               int batchCount)
    cublasStatus_t cublasHSSgemvStridedBatched(cublasHandle_t handle,
                                               cublasOperation_t trans,
                                               int m, int n,
                                               const float           *alpha,
                                               const __half          *A, int lda,
                                               long long int         strideA,
                                               const __half          *x, int incx,
                                               long long int         stridex,
                                               const float           *beta,
                                               float                 *y, int incy,
                                               long long int         stridey,
                                               int batchCount)
    cublasStatus_t cublasTSTgemvStridedBatched(cublasHandle_t handle,
                                               cublasOperation_t trans,
                                               int m, int n,
                                               const float           *alpha,
                                               const __nv_bfloat16   *A, int lda,
                                               long long int         strideA,
                                               const __nv_bfloat16   *x, int incx,
                                               long long int         stridex,
                                               const float           *beta,
                                               __nv_bfloat16         *y, int incy,
                                               long long int         stridey,
                                               int batchCount)
    cublasStatus_t cublasTSSgemvStridedBatched(cublasHandle_t handle,
                                               cublasOperation_t trans,
                                               int m, int n,
                                               const float           *alpha,
                                               const __nv_bfloat16   *A, int lda,
                                               long long int         strideA,
                                               const __nv_bfloat16   *x, int incx,
                                               long long int         stridex,
                                               const float           *beta,
                                               float                 *y, int incy,
                                               long long int         stridey,
                                               int batchCount)

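    This function performs the matrix-vector multiplication of a batch of matrices and vectors. The batch is considered "uniform": all instances have the same dimensions (m, n), leading dimension (lda), increments (incx, incy), and transposition (trans) for their respective A matrix and x and y vectors. For each instance of the batch, the input matrix A, the input vector x, and the output vector y are located at a fixed offset, measured in number of elements, from their locations in the previous instance. The user passes pointers to the A matrix and the x and y vectors of the first instance, together with the offsets strideA, stridex, and stridey, which determine the locations of the input matrices and vectors and of the output vectors in subsequent instances.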

    $$y + i \cdot stridey = \alpha \, \mathrm{op}(A + i \cdot strideA) \, (x + i \cdot stridex) + \beta \, (y + i \cdot stridey), \quad \text{for } i \in [0, batchCount - 1]$$

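    where $\alpha$ and $\beta$ are scalars, A points to the first of the matrices A[i], each stored in column-major format with dimensions m x n, and x and y point to the first of the vectors x[i] and y[i]. Also, for matrix A[i],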

    $$\mathrm{op}(A[i]) = \begin{cases} A[i] & \text{if } trans == \text{CUBLAS\_OP\_N} \\ A[i]^T & \text{if } trans == \text{CUBLAS\_OP\_T} \\ A[i]^H & \text{if } trans == \text{CUBLAS\_OP\_C} \end{cases}$$

    Note: the y[i] vectors must not overlap, that is, the individual gemv operations must be computable independently; otherwise, the behavior is undefined.
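
To make the formula and the definition of op(A[i]) concrete, here is a minimal CPU reference sketch in C for the real single-precision case. It is not the cuBLAS implementation: the function name `gemv_strided_batched_ref` and the integer flag `transposed` are illustrative assumptions, the conjugate-transpose case is left out, and positive incx and incy are assumed.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference sketch (hypothetical helper, not part of cuBLAS).
 * transposed == 0 means op(A[i]) = A[i]; any other value means op(A[i]) = A[i]^T.
 * Assumes incx > 0 and incy > 0. */
void gemv_strided_batched_ref(int transposed, int m, int n, float alpha,
                              const float *A, int lda, long long strideA,
                              const float *x, int incx, long long stridex,
                              float beta,
                              float *y, int incy, long long stridey,
                              int batchCount)
{
    assert(m >= 0 && n >= 0 && batchCount >= 0);
    assert(lda >= (m > 1 ? m : 1) && incx > 0 && incy > 0);

    int len_x = transposed ? m : n; /* length of each x[i] */
    int len_y = transposed ? n : m; /* length of each y[i] */

    for (int i = 0; i < batchCount; ++i) {
        /* Instance i sits at a fixed element offset from instance 0. */
        const float *Ai = A + i * strideA;
        const float *xi = x + i * stridex;
        float *yi = y + i * stridey;

        for (int r = 0; r < len_y; ++r) {
            float acc = 0.0f;
            for (int c = 0; c < len_x; ++c) {
                /* Column-major: element (row, col) of A[i] is Ai[row + col * lda]. */
                float a = transposed ? Ai[c + (size_t)r * lda]
                                     : Ai[r + (size_t)c * lda];
                acc += a * xi[(size_t)c * incx];
            }
            /* With beta == 0 the old value of y is never read. */
            float *out = &yi[(size_t)r * incy];
            *out = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * (*out);
        }
    }
}
```

A reference like this can be handy for checking GPU results on small inputs before moving to realistic problem sizes.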

    For some problem sizes, making multiple calls to cublas<t>gemv in different CUDA streams may be more advantageous than using this API.

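    Note: in the table below, A[i], x[i], and y[i] denote the A matrix and the x and y vectors in the i-th instance of the batch, implicitly assuming that they are offset by strideA, stridex, and stridey elements from A[i-1], x[i-1], and y[i-1], respectively. The offsets are expressed in number of elements and must not be zero.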
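
One conservative way to satisfy both the non-zero offset rule and the no-overlap requirement on y[i] is to make stridey at least as large as the span of one vector. The sketch below illustrates that check; `vector_span` and `y_instances_disjoint` are hypothetical helper names, positive incy is assumed, and the test is sufficient but not necessary (interleaved layouts can also be disjoint).

```c
#include <assert.h>
#include <stdlib.h>

/* Number of elements covered by one vector of `len` entries with increment `inc`.
 * Hypothetical helper; assumes inc > 0. */
long long vector_span(int len, int inc)
{
    assert(inc > 0);
    if (len <= 0) return 0;
    return 1 + (long long)(len - 1) * inc;
}

/* Returns 1 when consecutive y[i] are guaranteed not to overlap because
 * |stridey| covers a whole vector span, and 0 otherwise. */
int y_instances_disjoint(int len_y, int incy, long long stridey, int batchCount)
{
    if (batchCount <= 1 || len_y <= 0) return 1; /* nothing to overlap with */
    return stridey != 0 && llabs(stridey) >= vector_span(len_y, incy);
}
```

For example, with three elements per vector and incy = 2, each y[i] spans five elements, so stridey = 5 passes the check while stridey = 4 does not.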

    | Param. | Memory | In/out | Meaning |
    |---|---|---|---|
    | handle | | input | Handle to the cuBLAS library context. |
    | trans | | input | Operation op(A[i]) that is non- or (conj.) transpose. |
    | m | | input | Number of rows of matrix A[i]. |
    | n | | input | Number of columns of matrix A[i]. |
    | alpha | host or device | input | <type> scalar used for multiplication. |
    | A | device | input | <type>* pointer to the A matrix corresponding to the first instance of the batch, with dimensions lda x n with lda>=max(1,m). |
    | lda | | input | Leading dimension of the two-dimensional array used to store each matrix A[i]. |
    | strideA | | input | Value of type long long int that gives the offset in number of elements between A[i] and A[i+1]. |
    | x | device | input | <type>* pointer to the x vector corresponding to the first instance of the batch, with each x[i] of dimension n if trans==CUBLAS_OP_N and m otherwise. |
    | incx | | input | Stride between consecutive elements of x. |
    | stridex | | input | Value of type long long int that gives the offset in number of elements between x[i] and x[i+1]. |
    | beta | host or device | input | <type> scalar used for multiplication. If beta == 0, y does not have to be a valid input. |
    | y | device | in/out | <type>* pointer to the y vector corresponding to the first instance of the batch, with each y[i] of dimension m if trans==CUBLAS_OP_N and n otherwise. Vectors y[i] should not overlap; otherwise, undefined behavior is expected. |
    | incy | | input | Stride between consecutive elements of y. |
    | stridey | | input | Value of type long long int that gives the offset in number of elements between y[i] and y[i+1]. |
    | batchCount | | input | Number of gemv operations to perform in the batch. |
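
As an illustration of how the size and stride parameters in the table relate to one another, the sketch below computes tightly packed strides and the number of elements to allocate for each buffer. `BatchLayout` and `packed_layout` are hypothetical names, not part of cuBLAS; positive increments are assumed, and a real application may well choose larger strides for padding or alignment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a tightly packed strided-batched layout. */
typedef struct {
    long long strideA, stridex, stridey; /* offsets between instances, in elements */
    size_t elemsA, elemsx, elemsy;       /* total elements to allocate per buffer */
} BatchLayout;

/* transposed == 0 means trans == CUBLAS_OP_N; otherwise x has m entries and y has n. */
BatchLayout packed_layout(int transposed, int m, int n, int lda,
                          int incx, int incy, int batchCount)
{
    assert(m > 0 && n > 0 && batchCount > 0);
    assert(lda >= m && incx > 0 && incy > 0);

    int len_x = transposed ? m : n;
    int len_y = transposed ? n : m;

    BatchLayout layout;
    /* Each column-major matrix occupies lda * n elements. */
    layout.strideA = (long long)lda * n;
    /* A strided vector spans from its first to its last element inclusive. */
    layout.stridex = 1 + (long long)(len_x - 1) * incx;
    layout.stridey = 1 + (long long)(len_y - 1) * incy;

    layout.elemsA = (size_t)(layout.strideA * batchCount);
    layout.elemsx = (size_t)(layout.stridex * batchCount);
    layout.elemsy = (size_t)(layout.stridey * batchCount);
    return layout;
}
```

The element counts still have to be multiplied by the size of the element type (for example `sizeof(float)` for the S variant) to obtain allocation sizes in bytes.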

    The error values this function may return, and their meanings, are listed in the table below:

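    | Error Value | Meaning |
    |---|---|
    | CUBLAS_STATUS_SUCCESS | The operation completed successfully. |
    | CUBLAS_STATUS_NOT_INITIALIZED | The library was not initialized. |
    | CUBLAS_STATUS_INVALID_VALUE | The parameters m, n, or batchCount < 0. |
    | CUBLAS_STATUS_EXECUTION_FAILED | The function failed to launch on the GPU. |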
  • Original article: https://blog.csdn.net/kunhe0512/article/details/128143973