Volta884_h884gemm in CUTLASS 1.3.3


    CUTLASS is a collection of CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It uses hierarchical decomposition and data-movement strategies similar to those implemented in cuBLAS and cuDNN.

    The latest CUTLASS release is 3.3, which differs substantially from 1.3.3. Revisiting 1.3.3 is still worthwhile, however, because it is much easier to understand:

    • it matches the material presented in PROGRAMMING TENSOR CORES: NATIVE VOLTA TENSOR CORES WITH CUTLASS;
    • it supports only one Tensor Core architecture, Volta;
    • its Tensor Core path supports only one data type, half;
    • it uses only one instruction, HMMA.884.F16.F16.

    Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply observes that on the T4 GPU, once Tensor Cores were introduced, GEMM, previously compute-bound, became IO-bound. Although the V100 has roughly three times the bandwidth of the T4, the bandwidth shortfall exists there as well. CUTLASS therefore optimizes the data path as follows (a minimal illustration of the 128-bit access pattern follows the list):

    • 128-bit access granularity along the whole path: LDG.128, STS.128, LDS.128, STG.128;
    • conflict-free Shared Memory permutation: no padding of Shared Memory is needed for the transpose;
    • software pipelining: the LDG.128, LDS.128, and HMMA.884.F16.F16 instruction streams proceed in parallel to hide data movement.
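
    The 128-bit granularity is what the LDG.128 / STS.128 / LDS.128 / STG.128 sequence boils down to in CUDA C++: each thread moves 16 bytes (eight halves) per access so the compiler can emit single 128-bit instructions. A minimal sketch, not CUTLASS code; the kernel name and layout are illustrative, and src/dst must be 16-byte aligned:

    #include <cuda_fp16.h>

    // Illustrative round trip global -> shared -> global with 128-bit accesses.
    // Launch with 256 threads per block; n must be a multiple of 8.
    __global__ void copy_128bit(half const* __restrict__ src, half* __restrict__ dst, int n) {
      __shared__ uint4 smem[256];                        // one 16-byte slot per thread
      int idx = blockIdx.x * blockDim.x + threadIdx.x;   // index in units of 8 halves
      if (8 * idx < n) {
        uint4 v = reinterpret_cast<uint4 const*>(src)[idx];      // LDG.128
        smem[threadIdx.x] = v;                                    // STS.128
      }
      __syncthreads();
      if (8 * idx < n) {
        reinterpret_cast<uint4*>(dst)[idx] = smem[threadIdx.x];   // LDS.128 then STG.128
      }
    }

    Whether these actually compile to 128-bit instructions can be confirmed by inspecting the SASS with cuobjdump.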

    The implementation of Volta884_h884gemm is described below using one matrix-multiplication test case as an example.

    TEST(Volta884_h884gemm_128x64x32_nt, 520x264x136)

    OutputTile is the threadblock tile; in this test case it is set to 32x64x128. WarpGemmShape is 32x64x64, which is a fixed value.
    run_gemm initializes Volta884GemmTraits::Params and a GemmTestbed, calls Gemm::launch to run the kernel, and then compares the results.

    TEST(Volta884_h884gemm_128x64x32_nt, 520x264x136) {
    
      typedef cutlass::gemm::Volta884GemmTraits<
        cutlass::MatrixLayout::kColumnMajor,
        cutlass::MatrixLayout::kRowMajor,
        cutlass::Shape<32, 64, 128>,
        cutlass::Shape<32, 64, 64>,
        half,
        half,
        half,
        2
      > GemmTraits;
    
      run_gemm<GemmTraits>(520, 264, 136);
    }
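
    run_gemm belongs to the CUTLASS 1.x test harness; outside the tests, the same traits can be driven through cutlass::gemm::Gemm directly. A minimal host-side sketch, assuming the Params::initialize / Gemm::launch interface used by the CUTLASS 1.x basic_gemm example and device buffers d_A, d_B, d_C that the caller has already allocated (the header paths and the exact initialize signature are assumptions here):

    #include <cutlass/gemm/gemm.h>
    #include <cutlass/gemm/volta884_gemm_traits.h>

    typedef cutlass::gemm::Volta884GemmTraits<
      cutlass::MatrixLayout::kColumnMajor,
      cutlass::MatrixLayout::kRowMajor,
      cutlass::Shape<32, 64, 128>,
      cutlass::Shape<32, 64, 64>,
      half, half, half, 2
    > GemmTraits;

    // Hypothetical wrapper around the traits defined above.
    cudaError_t run_volta884_gemm(int m, int n, int k,
                                  half alpha, half const* d_A, int lda,
                                  half const* d_B, int ldb,
                                  half beta, half* d_C, int ldc) {
      typedef cutlass::gemm::Gemm<GemmTraits> Gemm;
      typename Gemm::Params params;
      // CUTLASS 1.x convention: C is both the epilogue source and the destination D.
      int result = params.initialize(m, n, k, alpha, d_A, lda, d_B, ldb,
                                     beta, d_C, ldc, d_C, ldc);
      if (result) {
        return cudaErrorInvalidValue;
      }
      return Gemm::launch(params);
    }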
    

    The class hierarchy of the Volta884 implementation in CUTLASS is shown below (originally a diagram; only the component names are preserved):

    [Figure: hierarchy of Gemm, Volta884GemmTraits, Volta884MultiplyAdd, GemmMainloop, Volta884Multiplicand, IdentityBlockSwizzle, GlobalLoadStreamPair, SharedStreamPair, MMAEpilogue, Volta884EpilogueTraits, PredicatedTileLoadStream, PredicatedTileStoreStream, TileStoreStream, TileLoadStream, mma, TileLoadIterator, Volta884ThreadblockMultiplicandStoreIterator, Volta884WarpMultiplicandLoadIterator, MMAGlobalLoadStream, Copy, and MMASharedLoadStream.]

    gemm_kernel_nolb


    The kernel function obtains dynamic Shared Memory, hands it to GemmMainloop, and then calls GemmMainloop::multiply_add to perform the computation. (A sketch of how such a kernel is typically launched follows the listing.)

    /// GEMM kernel without launch bounds specified
    template <typename Gemm_>
    __global__ /* __launch_bounds__(Gemm_::kThreads) */
    void gemm_kernel_nolb(typename Gemm_::Params params) {
    
      // Dynamic shared memory base pointer
      extern __shared__ int GemmSharedStorageBase[];
    
      // Declare pointer to dynamic shared memory.
      typename Gemm_::SharedStorage *shared_storage = 
        reinterpret_cast<typename Gemm_::SharedStorage *>(GemmSharedStorageBase);
    
      // Construct the GEMM object.
      Gemm_ gemm(params, *shared_storage);
    
      // Run GEMM.
      gemm.multiply_add();
    }
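
    A minimal sketch of launching this kernel with dynamic Shared Memory, assuming a Gemm_ type that exposes SharedStorage and kThreads as above. The opt-in to more than 48 KB of dynamic Shared Memory via cudaFuncSetAttribute mirrors what CUTLASS's Gemm::launch does internally, but the grid computation is left to the caller here and the wrapper itself is illustrative:

    // Hypothetical launcher for gemm_kernel_nolb<Gemm_>.
    template <typename Gemm_>
    cudaError_t launch_gemm_nolb(typename Gemm_::Params const& params, dim3 grid,
                                 cudaStream_t stream = 0) {
      int smem_bytes = int(sizeof(typename Gemm_::SharedStorage));

      // Volta kernels may need more than the default 48 KB of dynamic shared memory.
      if (smem_bytes >= (48 << 10)) {
        cudaError_t err = cudaFuncSetAttribute(gemm_kernel_nolb<Gemm_>,
                                               cudaFuncAttributeMaxDynamicSharedMemorySize,
                                               smem_bytes);
        if (err != cudaSuccess) {
          return err;
        }
      }

      dim3 block(Gemm_::kThreads, 1, 1);
      gemm_kernel_nolb<Gemm_><<<grid, block, smem_bytes, stream>>>(params);
      return cudaGetLastError();
    }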
    

    GemmMainloop

    GemmMainloop implements software pipelining (the original post illustrates this with a figure, omitted here).

    Shared Memory and the registers each require double buffers, and the SM's warp scheduling keeps three pipelines running concurrently. Loads from Global Memory into Shared Memory need synchronization, while moves from Shared Memory into registers do not. Because architectures before Ampere cannot copy from Global Memory to Shared Memory directly, the overall data movement is fairly involved. As the data-path figure below shows, the code calls Copy::transform in several places to produce a transformed_fragment; the intent is presumably type conversion, but since Volta supports only half, it has no practical effect here. A simplified sketch of the double-buffered pipelining pattern follows.
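
    To make the pattern concrete, here is a minimal double-buffered kernel, not CUTLASS code: while the current Shared Memory buffer is being consumed, the next tile is already being fetched from Global Memory into registers, and only afterwards stored into the other Shared Memory buffer. The kernel, its tile size, and the reduction it performs are illustrative assumptions.

    #include <cuda_fp16.h>

    // Illustrative double-buffered pipeline: global -> registers -> shared -> compute.
    // Sums num_tiles tiles of 256 halves into one float per block; launch with 256 threads.
    __global__ void pipelined_sum(half const* __restrict__ src, float* out, int num_tiles) {
      constexpr int kTile = 256;            // one element per thread per tile
      __shared__ half buf[2][kTile];        // double-buffered shared memory

      // Prologue: fetch tile 0 into a register, then commit it to buffer 0.
      half fetched = src[threadIdx.x];
      buf[0][threadIdx.x] = fetched;
      __syncthreads();                      // plays the role of shared_store_fence

      float acc = 0.f;
      for (int t = 0; t < num_tiles; ++t) {
        int cur = t & 1;
        // Issue the global load for the *next* tile early (the LDG stage).
        half fetched_next = __float2half(0.f);
        if (t + 1 < num_tiles) {
          fetched_next = src[(t + 1) * kTile + threadIdx.x];
        }
        // Consume the current shared-memory buffer (the LDS / compute stage).
        acc += __half2float(buf[cur][threadIdx.x]);
        // Commit the prefetched tile into the other buffer (the STS stage) and fence
        // so it is visible before the next iteration reads it.
        if (t + 1 < num_tiles) {
          buf[cur ^ 1][threadIdx.x] = fetched_next;
          __syncthreads();
        }
      }
      atomicAdd(out, acc);
    }

    The real GemmMainloop additionally double-buffers the register fragments and interleaves the HMMA.884 computation with these stages, but the ordering of LDG, LDS/compute, and STS is the same.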

    [Figure: data path from Global_Memory to Shared_Memory and into the MMA. GlobalLoadStream::copy loads a fetched_fragment from Global_Memory, Copy::transform turns it into a transformed_fragment, and GlobalLoadStream::commit writes it to Shared_Memory via Volta884ThreadblockMultiplicandStoreIterator::store_post_increment. MMASharedLoadStream::copy/commit then bring the A/B fragments into registers for Volta884MultiplyAdd, which updates the accumulators. In the epilogue, PredicatedTileLoadStream::copy and TileLoadStream::commit load C, LinearScaling combines it with the accumulators (fetched_fragment to transformed_fragment), and PredicatedTileStoreStream / TileStoreStream::copy/commit write D through Copy::transform and TileStoreIterator::store_post_increment.]
    template <typename Traits_>
    struct GemmMainloop {
    
      //
      // Type definitions
      //
    
      /// The traits.
      typedef Traits_ Traits;
    
      /// The GEMM mainloop
      typedef typename Traits::KernelClass KernelClass;
    
      /// The shared storage.
      typedef typename Traits::SharedStorage SharedStorage;
    
      /// The scalar for A.
      typedef typename Traits::ScalarA ScalarA;
      /// The scalar for B.
      typedef typename Traits::ScalarB ScalarB;
      /// The scalar in the epilogue.
      typedef typename Traits::Epilogue::Scalar ScalarEpilogue;
      /// The scalar for C.
      typedef typename Traits::Epilogue::ScalarC ScalarC;
      /// The scalar for D.
      typedef typename Traits::Epilogue::ScalarD ScalarD;
      /// The index.
      typedef typename Traits::Index Index;
    
      /// Define the mainloop iteration size
      typedef typename Traits::MultiplyAdd MultiplyAdd;
    
      /// The number of threads.
      static int const kThreads = Traits::GemmConfig::kThreads;
    

    AccumulatorsPerWarp is GemmConfig::AccumulatorsPerWarp, i.e. Volta884MultiplyAdd::WarpGemmShape, which is 32x64x64.
    Volta884MultiplyAdd::InstructionShape is 4x32x32, so kWarpGemmSteps is 8 (32 / 4 along the K dimension). A small compile-time check of this arithmetic follows the listing below.

      // Number of warp-level multiply-accumulate steps executed by each warp.
      static Index const kWarpGemmSteps =
          Traits::GemmConfig::AccumulatorsPerWarp::kD / MultiplyAdd::InstructionShape::kD;
    
      /*
      // Make sure we have at least 2 unrolling steps or our pipeling is not going to work.
      static_assert(kWarpGemmSteps >= 2, "The pipelining assumes at least two steps");
      */
    
      /// Use the params object defined in traits
      typedef typename Traits::Params Params;
    
      //
      // Data members
      //
    
      /// The params.
      Params const& params;
    
      /// SharedStorage object
      SharedStorage& shared_storage;
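
    A minimal compile-time restatement of that arithmetic with plain constants instead of the CUTLASS Shape types (kD is the first component of each shape, i.e. the K extent; the constant names here are illustrative):

    // WarpGemmShape (AccumulatorsPerWarp) = 32x64x64, InstructionShape = 4x32x32.
    constexpr int kWarpGemmK     = 32;  // AccumulatorsPerWarp::kD
    constexpr int kInstructionK  = 4;   // Volta884MultiplyAdd::InstructionShape::kD
    constexpr int kWarpGemmSteps = kWarpGemmK / kInstructionK;
    static_assert(kWarpGemmSteps == 8, "eight warp-level MMA steps per mainloop iteration");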
    
      //
      // Methods
      //
    
      /// Ctor.
      CUTLASS_DEVICE GemmMainloop(Params const& params_, SharedStorage& shared_storage_)
          : params(params_), shared_storage(shared_storage_) {}
    

    GemmMainloop::fetch_global


    Volta884GemmTraits::GlobalLoadStream is the GlobalLoadStreamPair type.
    GlobalLoadStreamPair::residue calls MMAGlobalLoadStream::residue twice to compute the predicate masks needed for the last load of the threadblock tile.
    GlobalLoadStreamPair::copy calls MMAGlobalLoadStream::copy twice to copy matrix elements from Global Memory into registers; the latter calls TileLoadIterator::load_post_increment. (A sketch of predicate-guarded loads follows the listing below.)

      /// Fetches global stream pair
      template <bool Residue>
      CUTLASS_DEVICE void fetch_global(typename Traits::GlobalLoadStream& global_to_shared_stream,
                                       Index outer_k) {
        // If residue portion and not calculating residue in prolog, update residue predicates now.
        if (Residue) {
          global_to_shared_stream.residue(outer_k);
        }
        global_to_shared_stream.copy();
      }
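
    The residue predicates exist because the last tile along K may lie only partially inside the matrix. A minimal sketch of the idea, independent of the CUTLASS iterator machinery (the function name and layout are illustrative):

    #include <cuda_fp16.h>

    // Each thread loads one element of a K-tile, guarded by a predicate so that
    // out-of-range elements of the final (residual) tile read back as zero.
    __device__ half guarded_load(half const* __restrict__ ptr, int k_offset, int residue_k) {
      bool guard = (k_offset < residue_k);   // computed once per residual tile in residue()
      return guard ? ptr[k_offset] : __float2half(0.f);
    }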
    

    GemmMainloop::consume_tile

    If kWarpGemmSteps is less than or equal to 4, kGlobalStreamFirst is true and the data for the next main-loop iteration is fetched from Global Memory before the shared-memory loads.

      /// Computes a warp-level GEMM on data held in shared memory
      template <bool Residue, bool LastIteration>
      CUTLASS_DEVICE void consume_tile(typename Traits::GlobalLoadStream& global_to_shared_stream,
                                       typename Traits::SharedStream& shared_load_stream,
                                       typename MultiplyAdd::Accumulators& accumulators,
                                       Index outer_k) {
    
        // Whether to load global stream before loading shared stream
        const bool kGlobalStreamFirst = (kWarpGemmSteps <= 4);
    
        // Load data for the next iteration of the main loop (unless it's the last iteration).
        if (kGlobalStreamFirst && !LastIteration) {
          fetch_global<Residue>(global_to_shared_stream, outer_k);
        }
    

    First, the inputs for the next warp-level step are loaded from Shared Memory, which is double-buffered.
    MMASharedLoadStream::copy calls Volta884WarpMultiplicandLoadIterator::load to load the data into registers.
    One question: if GemmMainloop::fetch_global was not called in the previous step, is copying from Shared Memory still safe? (Presumably yes: fetch_global only fills register fragments for the next threadblock tile, whereas the Shared Memory read here only depends on data committed and fenced in the previous main-loop iteration.)

        CUTLASS_PRAGMA_UNROLL
        for (int step = 0; step < kWarpGemmSteps; ++step) {
    
          // Trigger the copy from shared memory for the next A/B values.
          shared_load_stream.copy((step + 1) % kWarpGemmSteps);
    

    If kGlobalStreamFirst is false, GemmMainloop::fetch_global is instead called to load the inputs during the first step of the loop.

          // Load data for the next iteration of the main loop (unless it's the last iteration).
          if (!kGlobalStreamFirst && (step == 0) && !LastIteration) {
            fetch_global<Residue>(global_to_shared_stream, outer_k);
          }
    

    At the second-to-last step, the data must be guaranteed to have reached Shared Memory:
    Volta884GemmTraits::shared_load_fence decides whether to synchronize the threads based on the StageCount passed in from outside.
    GlobalLoadStreamPair::commit calls GlobalLoadStream::commit for each of the two matrices to copy the data into Shared Memory.
    Volta884GemmTraits::shared_store_fence synchronizes the threads.
    MMASharedLoadStream::inc_stage increments stage_index.

          if (step == kWarpGemmSteps - 2) {
              // Make sure the data from shared memory has been entirely consumed.
              Traits::shared_load_fence(true);
    
              global_to_shared_stream.commit();
    
              // Make sure the data is in shared memory.
              Traits::shared_store_fence(true);
    
              // Move to the next stage for the load (if it makes sense).
              shared_load_stream.inc_stage();
          }
    

    MMASharedLoadStream::commit performs the copy through Copy. Volta884WarpMultiplicandLoadIterator::Fragment is simply a Fragment.
    Volta884MultiplyAdd::multiply_add performs the warp-tile computation. (A sketch of the underlying HMMA instruction follows the listing below.)

          // Make sure the values are available for the current iteration to do the multiply-add.
          shared_load_stream.commit(step);
    
          // Do the math on the fragments of the current iteration.
          MultiplyAdd multiply_add;
          multiply_add.multiply_add(shared_load_stream.fragment_a(step),
                                    shared_load_stream.fragment_b(step),
                                    accumulators,
                                    accumulators);
        }
      }
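
    Each Volta884MultiplyAdd::multiply_add step ultimately issues HMMA.884.F16.F16 instructions, which at the PTX level correspond to mma.sync with the m8n8k4 shape and f16 accumulation (see the PTX ISA section on matrix fragments for mma.m8n8k4 in the references). A minimal sketch of issuing one such instruction from inline PTX, compiled for sm_70; the .row.col layout qualifiers are just one valid combination, and the per-thread fragment layout required for a meaningful result is defined by the PTX ISA and not reproduced here:

    #include <cstdint>

    // One m8n8k4 f16 MMA: D = A * B + C. Every 32-bit register holds two packed halves;
    // the operand-to-thread mapping follows the PTX ISA and is not shown in this sketch.
    __device__ void hmma_884_f16(uint32_t const (&a)[2], uint32_t const (&b)[2],
                                 uint32_t const (&c)[4], uint32_t (&d)[4]) {
      asm volatile(
          "mma.sync.aligned.m8n8k4.row.col.f16.f16.f16.f16 "
          "{%0,%1,%2,%3}, {%4,%5}, {%6,%7}, {%8,%9,%10,%11};\n"
          : "=r"(d[0]), "=r"(d[1]), "=r"(d[2]), "=r"(d[3])
          : "r"(a[0]), "r"(a[1]), "r"(b[0]), "r"(b[1]),
            "r"(c[0]), "r"(c[1]), "r"(c[2]), "r"(c[3]));
    }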
    

    GemmMainloop::multiply_add

    [Flowchart: GemmMainloop::multiply_add → IdentityBlockSwizzle::get_threadblock_offset → IdentityBlockSwizzle::get_threadblock_bounds → IdentityBlockSwizzle::get_batch_id → GlobalLoadStreamPair::add_batch_offset → GlobalLoadStreamPair::move_to_residue → GlobalLoadStreamPair::copy → GlobalLoadStreamPair::commit → Volta884GemmTraits::shared_store_fence → GlobalLoadStreamPair::rollback → SharedLoadStream::copy → ClearAccumulators::clear → GemmMainloop::consume_tile → GemmEpilogue::epilogue → End]

    make_Coord_from_shape creates a Coord object from a Shape.

    IdentityBlockSwizzle::get_threadblock_offset returns the offset of the current threadblock within the two-dimensional output.
    Volta884GemmTraits::ClearAccumulators is ClearAccumulators.
    IdentityBlockSwizzle::get_threadblock_bounds returns the three-dimensional bounds of the threadblock. (A rough sketch of what the identity swizzle amounts to follows this paragraph.)
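
    As a rough illustration only (a paraphrase, not the CUTLASS source; the axis ordering is an assumption): an identity block swizzle simply scales blockIdx by the threadblock tile extents, with no remapping for cache behavior.

    // Illustrative identity swizzle: blockIdx maps directly to a tile origin.
    // Which of kH/kW corresponds to M or N is assumed here, not taken from CUTLASS.
    __device__ void identity_threadblock_offset(int tile_h, int tile_w, int (&offset)[3]) {
      offset[0] = 0;                    // the K offset is advanced by the main loop instead
      offset[1] = blockIdx.y * tile_h;  // offset along OutputTile::kH
      offset[2] = blockIdx.x * tile_w;  // offset along OutputTile::kW
    }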

      /// Do the GEMM.
      CUTLASS_DEVICE void multiply_add() {
        // Swizzle the IDs of the block (to enable better cache behavior).
        typename Traits::BlockSwizzle block_swizzle;
        Coord<3> threadblock_offset =
            block_swizzle.get_threadblock_offset(make_Coord_from_shape<typename Traits::OutputTile>());
    
        // We may want to use shared memory to clear the registers.
        typedef typename Traits::ClearAccumulators ClearAccumulators;
    
        // Get the bounds for each thread, it maybe different than problem_size
        Coord<3> bounds = block_swizzle.get_threadblock_bounds(params.problem_size,
                                                            params.partitionK_range);
    

    params.global_to_shared_stream is GlobalLoadStreamPair::Params.
    shared_storage.main_loop.global_to_shared_stream is GlobalLoadStreamPair::SharedStorage.
    shared_storage.main_loop.threadblock_tile is GlobalLoadStreamPair::ThreadblockTileStorage, i.e. ZipTileAllocation; ZipTileAllocation::reference returns a ZipTensorRef pointing at the data.
    global_to_shared_stream is of type Volta884GemmTraits::GlobalLoadStream, i.e. GlobalLoadStreamPair.
    GlobalLoadStreamPair::add_batch_offset calls GlobalLoadStream::add_batch_offset to set the iterators' batch offsets.

        // The streams to read A/B from global memory to shared memory.
        typename Traits::GlobalLoadStream global_to_shared_stream(
            params.global_to_shared_stream,
            shared_storage.main_loop.global_to_shared_stream,
            shared_storage.main_loop.threadblock_tile.reference(),
            bounds,
            threadblock_offset);
    
        // update A and B pointer offset based on batch_id and batch_stride_offset
        global_to_shared_stream.add_batch_offset(block_swizzle.get_batch_id());
    
        // Create the accumulator clear.
        ClearAccumulators clear;
    

    GlobalLoadStreamPair::move_to_residue calls MMAGlobalLoadStream::move_to_residue to move the pointers if the residue is handled in the prolog; otherwise it calls GlobalLoadStreamPair::residue directly.
    GlobalLoadStreamPair::copy calls MMAGlobalLoadStream::copy, which calls TileLoadIterator::load_post_increment to load fragments of A and B into Fragment registers.
    GlobalLoadStreamPair::commit calls MMAGlobalLoadStream::commit, which calls Copy::transform to perform the copy and then
    Volta884ThreadblockMultiplicandStoreIterator::store_post_increment to store the data into Shared Memory.
    Volta884GemmTraits::shared_store_fence synchronizes the threads within the threadblock.
    GlobalLoadStreamPair::rollback calls MMAGlobalLoadStream::rollback, which calls TileLoadIterator::initialize_predicates to initialize the predicate vector and then moves the offsets back.

        // Deal with residue in prolog.
        // global_to_shared_stream.move_to_residue(params.problem_size[0], Traits::OutputTile::kD);
        global_to_shared_stream.move_to_residue(bounds[0], Traits::OutputTile::kD);
    
        // Fetch the fragments for A and B from global memory.
        global_to_shared_stream.copy();
    
        // Copy the elements to shared memory (after transformation if needed).
        global_to_shared_stream.commit();
    
        // Make sure the data is in shared memory.
        Traits::shared_store_fence(false);
    
        // Rollback to the beginning of the first tile (if residue exists).
        // global_to_shared_stream.rollback(params.problem_size[0] % Traits::OutputTile::kD);
        global_to_shared_stream.rollback(bounds[0] % Traits::OutputTile::kD);
    

    shared_load_stream is of type Volta884GemmTraits::SharedStream, i.e. SharedStreamPair.
    SharedStreamPair::copy calls MMASharedLoadStream::copy, which calls Volta884WarpMultiplicandLoadIterator::load to load from Shared Memory.
    accumulators is of type Volta884MultiplyAdd::Accumulators, i.e. a Fragment.
    ClearAccumulators::clear calls Fragment::clear to zero the storage.
    What is outer_k? Judging from the code, it is the main-loop counter over the K dimension: it starts at bounds[0] - OutputTile::kD, is decremented by OutputTile::kD per iteration, and is passed to residue() to build the predicates for the final, possibly partial, tile.

        // The stream of data from shared memory to fragments.
        typename Traits::SharedStream shared_load_stream(
            params.shared_stream,
            shared_storage.main_loop.threadblock_tile.reference());
    
        // Trigger the copy from shared memory for the 1st stream.
        shared_load_stream.copy(0);
    
        // Allocate the accumulators.
        typename MultiplyAdd::Accumulators accumulators;
    
        // Clear the accumulators.
        clear.clear(accumulators);
    
        // Initial index
        // Index outer_k = params.problem_size[0] - Traits::OutputTile::kD;
        // problem_size[0] might be bigger than bounds[0]
        Index outer_k = bounds[0] - Traits::OutputTile::kD;
    

    If the residue was computed in the prolog, only the last call needs to deal with the remainder.
    GemmMainloop::consume_tile processes one k = Traits::OutputTile::kD slice per call. For this 520x264x136 test (assuming no split-K, so bounds[0] = K = 136) with OutputTile::kD = 32, outer_k starts at 104 and the loop runs at outer_k = 104, 72, 40, 8, followed by one final consume_tile call; five 32-wide K slices cover K = 136, with 136 mod 32 = 8 elements forming the residue.

        // Check if we are computing residue in prolog or not.
        if (Traits::GemmConfig::kResidueInProlog) {
          // Execute all mainloop iterations but the last one.
    
          CUTLASS_GEMM_LOOP
          for (; outer_k > 0; outer_k -= Traits::OutputTile::kD) {
            CUTLASS_GEMM_LOOP_HEADER
            consume_tile<false, false>(
                global_to_shared_stream, shared_load_stream, accumulators, outer_k);
          }
    
          consume_tile<false, true>(
              global_to_shared_stream, shared_load_stream, accumulators, outer_k);
    

    Otherwise, the residue is handled inside the main loop itself: when kResidueSeparate is set, all but the last two iterations run without any residue handling, and the remaining iterations run with K-residue predicate updates enabled.

        } else {
          // When kResidueSeparate = true, execute all mainloop iterations but the last two without any
          // consideration for K-residue or predicate updates. This improves the steady state of some
          // kernels.
          if (Traits::GemmConfig::kResidueSeparate) {
    
            CUTLASS_GEMM_LOOP
            for (; outer_k > Traits::OutputTile::kD; outer_k -= Traits::OutputTile::kD) {
              CUTLASS_GEMM_LOOP_HEADER
              consume_tile<false, false>(
                  global_to_shared_stream, shared_load_stream, accumulators, outer_k);
            }
          }
    
          // Execute remaining tiles with K-residue predicate updates enabled.
          CUTLASS_GEMM_LOOP
          for (; outer_k > -Traits::OutputTile::kD; outer_k -= Traits::OutputTile::kD) {
            CUTLASS_GEMM_LOOP_HEADER
            consume_tile<true, false>(
                global_to_shared_stream, shared_load_stream, accumulators, outer_k);
          }
        }
    

    Finally, an MMAEpilogue object is created and MMAEpilogue::epilogue is called. (A sketch of the LinearScaling functor that the data-path figure references follows the listing.)

        typedef typename Traits::Epilogue Epilogue;
        Epilogue epilogue(params.epilogue, shared_storage.epilogue, params.problem_size.knm());
        epilogue.epilogue(accumulators, threadblock_offset, block_swizzle.get_batch_id());
      }
    };
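
    The epilogue functor in the data-path figure is LinearScaling, i.e. the usual element-wise D = alpha * (A*B) + beta * C applied to each accumulator fragment. A minimal sketch of that operation (the conventional definition, not code lifted from CUTLASS; the array-based signature is illustrative):

    #include <cuda_fp16.h>

    // Element-wise linear scaling of an accumulator fragment against the C source tile.
    template <int kElements>
    __device__ void linear_scaling(half alpha, half const (&accum)[kElements],
                                   half beta, half const (&source_c)[kElements],
                                   half (&output_d)[kElements]) {
    #pragma unroll
      for (int i = 0; i < kElements; ++i) {
        output_d[i] = __hadd(__hmul(alpha, accum[i]), __hmul(beta, source_c[i]));
      }
    }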
    

    References:

    • [DOC] Where does cutlass’ detailed GEMM kernel? #526
    • Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking
    • Modeling Deep Learning Accelerator Enabled GPUs
    • gpgpu-sim_distribution
    • 理解Tensor Core
    • Flexible Performant GEMM Kernels on GPUs
    • CUDA Tensor Core编程
    • PROGRAMMING TENSOR CORES: NATIVE VOLTA TENSOR CORES WITH CUTLASS
    • The NVIDIA Titan V Deep Learning Deep Dive: It’s All About The Tensor Cores
    • 9.7.13.4.1. Matrix Fragments for mma.m8n8k4 with .f16 floating point type
    • Numerical Behavior of NVIDIA Tensor Cores
    • CUDA Ampere Tensor Core HGEMM 矩阵乘法优化笔记 —— Up To 131 TFLOPS!
    • If we have two or four memory requests by a warp, do they need coalesced access/contiguity? #328
    • Do bank conflicts increase when using more shared memory?
    • How does parameter computeType affect the computation?
    • 2.1.10. GEMM Algorithms Numerical Behavior
    • cuBLAS的使用
    • RAFT在Knowhere上的一些评估测试[1]
    • cudnn-frontend/tree/main/samples/samples/conv_sample.cpp
    • Is a union in C++ actually a class?
    • A Generalized Micro-kernel Abstraction for GPU Linear Algebra
    • Implementing Strassen’s Algorithm with CUTLASS on NVIDIA Volta GPUs
    • Double-buffering in shared memory, details? #227
    • Efficient GEMM in CUDA
    • Thread synchronization with syncwarp
    • Using CUDA Warp-Level Primitives
    • CUDA微架构与指令集(3)-SASS指令集分类
    • VOLTA Architecture and performance optimization
    • How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
    • Determining registers holding the data after executing LDG.E.128
    • 刘冰、郑鹏|GPU编程和优化-最佳实践分享
Original article: https://blog.csdn.net/yiran103/article/details/134537689