• Computer Systems: A Programmer's Perspective (CS:APP), Chapter 5: Optimizing Program Performance


    Resources:

    Video course:
    深入理解计算机系统_第5章 优化程序性能 (CS:APP, Chapter 5: Optimizing Program Performance)

    Chapter 5 builds on some material from Chapter 4; I found this chapter easier to follow with the video lectures.

    5.2 Expressing Program Performance

    We introduce the metric cycles per element, abbreviated CPE, to express program
    performance in a way that can guide us in improving the code.
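    For instance (illustrative numbers of my own, not measurements from the book): if a procedure needs roughly 80 + 4.0n clock cycles to process n elements, the constant 80 is per-call overhead and the CPE is 4.0, i.e., the slope obtained by a least-squares fit of measured cycle counts against n.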

    5.4 Eliminating Loop Inefficiencies

    Example:

    /* Convert string to lowercase: slow */
    void lower1(char *s)
    {
        long i;

        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

    /* Convert string to lowercase: faster */
    void lower2(char *s)
    {
        long i;
        long len = strlen(s);

        for (i = 0; i < len; i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

    /* Sample implementation of library function strlen */
    /* Compute length of string */
    size_t strlen(const char *s)
    {
        long length = 0;
        while (*s != '\0') {
            s++;
            length++;
        }
        return length;
    }
    

    Clearly lower2 is better than lower1: it calls strlen only once instead of on every iteration, so lower2 runs in time linear in the string length while lower1's running time grows quadratically.

    5.5 Reducing Procedure Calls

    Example:

    /* Move call to vec_length out of loop */
    void combine2(vec_ptr v, data_t *dest)
    {
        long i;
        long length = vec_length(v);

        *dest = IDENT;
        for (i = 0; i < length; i++) {
            data_t val;
            get_vec_element(v, i, &val);
            *dest = *dest OP val;
        }
    }
    
    /* Direct access to vector data */
    void combine3(vec_ptr v, data_t *dest)
    {
        long i;
        long length = vec_length(v);
        data_t *data = get_vec_start(v);

        *dest = IDENT;
        for (i = 0; i < length; i++) {
            *dest = *dest OP data[i];
        }
    }
    

    combine3 no longer calls get_vec_element on every iteration: it fetches the start of the data array once with get_vec_start and then indexes into it directly.

    5.6 Eliminating Unneeded Memory References

    The book shows the assembly code for the inner loop of combine3 (listing omitted here).

    You can see that every iteration both reads and writes memory, which is unnecessary; it is enough to write the accumulated result to memory once, after the loop finishes. Hence the following modification:

    /* Accumulate result in local variable */
    void combine4(vec_ptr v, data_t *dest)
    {
        long i;
        long length = vec_length(v);
        data_t *data = get_vec_start(v);
        data_t acc = IDENT;

        for (i = 0; i < length; i++) {
            acc = acc OP data[i];
        }
        *dest = acc;
    }
    

    Its inner-loop assembly (the book's "Inner loop of combine4" listing, omitted here) accumulates entirely in a register, with no memory references inside the loop.

    In combine3, the compiler reads and writes *dest on every iteration because it must allow for the possibility of memory aliasing.

    Aliasing: two different memory references specify a single location.

    Aliasing is easy to have happen in C, because the language:
    1. allows address arithmetic;
    2. gives direct access to storage structures.

    Remedy:
    1. Get in the habit of introducing local variables to accumulate within loops.
    2. This is your way of telling the compiler not to check for aliasing. (A concrete sketch follows below.)
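    To make the aliasing hazard concrete, here is a small self-contained sketch (my own illustration, not code from the book; the names prod_into and prod_local are made up): a combine3-style loop that accumulates through *dest and a combine4-style loop that accumulates in a local variable give different answers as soon as dest points into the data array, which is exactly why the compiler cannot perform the combine4 transformation on its own.

    #include <stdio.h>

    /* combine3 style: accumulate directly through *dest (sensitive to aliasing) */
    void prod_into(long *data, long n, long *dest)
    {
        long i;
        *dest = 1;
        for (i = 0; i < n; i++)
            *dest = *dest * data[i];   /* re-reads memory that *dest may alias */
    }

    /* combine4 style: accumulate in a local variable, write memory once */
    void prod_local(long *data, long n, long *dest)
    {
        long i;
        long acc = 1;
        for (i = 0; i < n; i++)
            acc = acc * data[i];
        *dest = acc;
    }

    int main(void)
    {
        long a[3] = {2, 3, 5};
        long b[3] = {2, 3, 5};

        prod_into(a, 3, &a[2]);    /* dest aliases a[2]: 1*2=2, 2*3=6, 6*6=36 */
        prod_local(b, 3, &b[2]);   /* local accumulator: 2*3*5 = 30 */
        printf("%ld %ld\n", a[2], b[2]);   /* prints "36 30" */
        return 0;
    }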

    5.7 Understanding Modern Processors

    The latency bound : a series of operations must be performed in strict sequence, because the result of one operation is required before the next one can begin.
    Latency bound: a series of operations has data dependencies, so it must execute in strict sequence.

    The throughput bound: characterizes the raw computing capacity of the processor’s functional units. This bound becomes the ultimate limit on program performance.
    Throughput bound: determined by the number and capability of the functional units, i.e., the raw computing capacity of the hardware. This bound is the ultimate limit on how far the program can be optimized.

    5.7.1 Overall Operation

    Figure 5.11 shows a very simplified view of a modern microprocessor.
    (Figure 5.11, "Block diagram of an out-of-order processor", is not reproduced here.)

    These processors are described in the industry as being superscalar, which
    means they can perform multiple operations on every clock cycle and out of order, meaning that the order in which instructions execute need not correspond to their ordering in the machine-level program.
    (This requires the instructions to be independent of one another.)

    Figure 5.11 divides the processor into two parts:
    1. Instruction control unit (ICU): Reading a sequence of instructions from memory and generating from these a set of primitive operations to perform on program data.
    2. Execution unit (EU): Executes these operations.

    Advantage: they are better at achieving higher degrees of instruction-level parallelism.

    The ICU reads the instructions from an instruction cache—a special high-speed memory containing the most recently accessed instructions.

    In general, the ICU fetches well ahead of the currently executing instructions, so that it has enough time to decode these and send operations down to the EU.

    The complication:
    When the processor encounters a branch it predicts the outcome (as covered in Chapter 4) and fetches and decodes instructions along the predicted path before knowing whether they are needed; if the prediction turns out to be wrong, it must go back and fetch and decode from the correct target instead.
    Under such speculative execution, results are not stored to the program registers or to memory until the processor is certain the prediction was correct.

    Load unit: reads data from memory into the processor.
    Store unit: writes data from the processor to memory, and can also compute the store address.
    As Figure 5.11 shows, the load and store units access memory through a data cache, a high-speed memory holding the most recently accessed data values.

    Figure 5.11 indicates that the different functional units are designed to perform different operations.

    Arithmetic operations: perform various combinations of integer and floating-point operations.

    The arithmetic units are intentionally designed to be able to perform a variety of different operations, since the required operations vary widely across different programs.

    For example, our Intel Core i7 Haswell reference machine has eight functional units, numbered 0–7. Here is a partial list of each one’s capabilities:
    (The book's partial list of the eight functional units' capabilities is omitted here.)

    In the above list, “integer arithmetic” refers to basic operations, such as addition, bitwise operations, and shifting.
    Multiplication and division require more specialized resources.

    A store operation requires two functional units: one to compute the store address and one to actually store the data.

    We can see that this combination of functional units has the potential to
    perform multiple operations of the same type simultaneously.

    Within the ICU, the retirement unit keeps track of the ongoing processing and makes sure that it obeys the sequential semantics of the machine-level program.

    Our figure shows a register file containing the integer, floating-point, and, more
    recently, SSE and AVX registers as part of the retirement unit, because this unit controls the updating of these registers.

    As instructions are decoded, information about them is placed in a first-in, first-out queue.
    They stay in this queue until the outcome of any branch prediction they depend on is known.
    If the prediction was correct, the speculatively executed instructions can be retired and the program registers updated.
    If the prediction was wrong, the mispredicted instructions are flushed and their computed results discarded, so they never modify the program state.

    The program registers are updated only when an instruction retires.

    Different parts of the execution unit can pass operation results to one another directly.

    The most common mechanism for controlling the communication of operands among the execution units is called register renaming.

    Internally, the hardware keeps a pool of physical register copies (typically a few hundred in total), used to hold values that have not yet been committed to the actual program registers.

    When an instruction that updates register r is decoded, a tag t is generated to stand for the result of that operation.

    An entry (r, t) is added to a table maintaining the association between program register r and tag t for an operation that will update this register.

    When a subsequent instruction using register r as an operand is decoded, the operation sent to the execution unit will contain t as the source for the operand value. (This confused me at first: it means the consuming operation does not read r from the register file at all; it carries the tag t and receives the value as soon as the operation that produces it completes.)

    When some execution unit completes the first operation, it generates a result (v, t), indicating that the operation with tag t produced value v.

    Any operation waiting for t as a source will then use v as the source value, a form of data forwarding.

    By this mechanism, values can be forwarded directly from one operation to another, rather than being written to and read from the register file, enabling the second operation to begin as soon as the first has completed.

    The renaming table only contains entries for registers having pending write operations.

    With register renaming, an entire sequence of operations can be performed
    speculatively, even though the registers are updated only after the processor is
    certain of the branch outcomes.
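    A toy model can make the (r, t) bookkeeping more tangible. The sketch below is purely illustrative, my own construction rather than the book's or real hardware's design; it only shows how a renaming table maps a register to the tag of its pending write, and how a completed result (v, t) reaches consumers without going through the register file.

    #include <stdio.h>

    #define NREGS 4
    #define NOTAG (-1)

    long regfile[NREGS];   /* architectural register file */
    int  pending[NREGS];   /* tag of the operation that will write r, or NOTAG */

    /* Decode the destination of an operation that writes r: record entry (r, t). */
    int rename_dest(int r, int *next_tag)
    {
        int t = (*next_tag)++;
        pending[r] = t;
        return t;
    }

    /* Decode a source operand: hand back a ready value or the tag to wait for. */
    void rename_src(int r, long *val, int *wait_tag)
    {
        if (pending[r] == NOTAG) {
            *val = regfile[r];      /* value already committed: read register file */
            *wait_tag = NOTAG;
        } else {
            *wait_tag = pending[r]; /* value not produced yet: wait for this tag */
        }
    }

    /* An execution unit finishes: result (v, t) is forwarded to waiting operations. */
    void complete(int r, int t, long v)
    {
        printf("result (v=%ld, t=%d) forwarded to operations waiting on that tag\n", v, t);
        if (pending[r] == t)        /* no newer rename of r happened in the meantime */
            pending[r] = NOTAG;
        regfile[r] = v;             /* architectural update happens at retirement */
    }

    int main(void)
    {
        int next_tag = 0, t_mul, wait;
        long val = 0;

        for (int r = 0; r < NREGS; r++)
            pending[r] = NOTAG;
        regfile[1] = 6;                       /* suppose r1 already holds 6 */

        /* op A: r0 <- r1 * 7; decoding allocates tag t_mul for its result */
        t_mul = rename_dest(0, &next_tag);

        /* op B: r2 <- r0 + 1; its r0 operand is still pending, so op B is
           sent to the execution unit carrying tag t_mul instead of a value */
        rename_src(0, &val, &wait);
        printf("op B waits on tag %d rather than reading r0\n", wait);

        /* op A completes; its result is forwarded straight to op B */
        complete(0, t_mul, regfile[1] * 7);
        return 0;
    }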

    Chapter 4 introduced the concept of data forwarding and gave many examples.

    5.7.2 Functional Unit Performance

    latency : The total time required to perform the operation.
    Latency: the total time one operation takes to complete.

    issue time : The minimum number of clock cycles between two independent operations of the same type (i.e., two operations handled by the same kind of functional unit).

    (Cycles/issue: because the functional units are pipelined, a unit can start a new operation before the previous one finishes, so this gap can be as small as one cycle.)

    capacity : The number of functional units capable of performing that operation.
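    Putting the three parameters together: the throughput bound of an operation is its issue time divided by its capacity. For example, using the book's Haswell figures, floating-point multiplication has latency 5, issue time 1, and capacity 2, so its latency bound is 5.0 CPE and its throughput bound is 1/2 = 0.5 CPE (the 0.5 value quoted later in these notes).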

    5.8 Loop Unrolling

    Loop unrolling is a program transformation that reduces the number of iterations for a loop by increasing the number of elements computed on each iteration.

    That is, each loop iteration computes several values instead of just one.

    Example:
    The code for combine4 was shown above in Section 5.6.

    The book then gives an improved combine5 that applies loop unrolling (listing omitted here; a sketch follows below).

    The unrolled code handles two elements per iteration, so a second loop is needed to finish off any leftover elements: with 5 elements, for example, the first loop processes only 4, and the second loop processes the 5th.
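    A minimal sketch of the 2×1 unrolled version, following the book's combine5 and the combine4 framework above (data_t, OP, IDENT, vec_ptr, vec_length, and get_vec_start are the book's vector abstractions):

    /* 2 x 1 loop unrolling: two elements per iteration, one accumulator */
    void combine5(vec_ptr v, data_t *dest)
    {
        long i;
        long length = vec_length(v);
        long limit = length - 1;
        data_t *data = get_vec_start(v);
        data_t acc = IDENT;

        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i += 2) {
            acc = (acc OP data[i]) OP data[i+1];
        }

        /* Finish any remaining elements */
        for (; i < length; i++) {
            acc = acc OP data[i];
        }
        *dest = acc;
    }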

    The effect of this change is shown in the book (CPE table omitted here).

    With the loop unrolled by a factor of 2, only integer addition improves (the case where OP is + and IDENT is 0); the other combinations do not, because the code still executes sequentially: one operation must finish before the next can begin, since the operations are not independent (acc OP data[i] is computed first, and only then is data[i+1] combined with that result).

    Further optimization:
    Change acc = (acc OP data[i]) OP data[i+1]; into acc = acc OP (data[i] OP data[i+1]);, i.e., simply move the parentheses. The book calls the reassociated version combine7 (its CPE table is omitted here).

    Compared with combine5, just moving the parentheses cuts the time for integer multiplication and for the floating-point operations almost in half.
    Reason: the operations become independent. While one instruction is computing data[0] OP data[1], the next can already start computing data[2] OP data[3].
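    For reference, the reassociated inner loop of combine7 might look like this (otherwise identical to the combine5 sketch above):

        /* Combine 2 elements at a time, reassociated to break the dependency */
        for (i = 0; i < limit; i += 2) {
            acc = acc OP (data[i] OP data[i+1]);
        }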

    A caveat: integer addition and multiplication are associative and commutative, so moving the parentheses cannot change the result. Floating-point arithmetic, however, is not associative: after reassociation the rounding behavior can differ and overflow can even occur, so the result may not match the original.

    The throughput bound of floating-point multiplication is 0.5 CPE because the machine has two floating-point multipliers.

    Further optimization: keep two separate accumulators, computing the sum or product of two groups of elements independently. The book's listing for this version, combine6, is omitted here; a sketch follows.
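    A sketch of the 2×2 version with two accumulators (again using the book's vector abstractions):

    /* 2 x 2 loop unrolling: two elements per iteration, two accumulators */
    void combine6(vec_ptr v, data_t *dest)
    {
        long i;
        long length = vec_length(v);
        long limit = length - 1;
        data_t *data = get_vec_start(v);
        data_t acc0 = IDENT;   /* accumulates elements with even indices */
        data_t acc1 = IDENT;   /* accumulates elements with odd indices */

        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i += 2) {
            acc0 = acc0 OP data[i];
            acc1 = acc1 OP data[i+1];
        }

        /* Finish any remaining elements */
        for (; i < length; i++) {
            acc0 = acc0 OP data[i];
        }
        *dest = acc0 OP acc1;
    }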

    combine6 splits the computation into two groups, combining the even-indexed elements into one accumulator and the odd-indexed elements into another. Its performance is again shown in the book (CPE table omitted here).

    Compared with combine7, the time for integer addition drops further.

    5.10 Summary of Results for Optimizing Combining Code

    Continuing the optimization, we increase the degree of unrolling (combine6 above is 2×2, i.e., two elements per iteration split across two accumulators) so that performance moves from the latency bound down toward the throughput bound. (The book's summary table is omitted here.)

    At 10×10 unrolling, the results improve further over 2×2.

    5.11 Some Limiting Factors

    5.11.1 Register Spilling

    As described above, we improve performance by increasing the amount of work done per loop iteration, and larger unrolling factors generally give shorter running times. There is an upper limit, though, imposed by the number of available registers: once the accumulators no longer fit in registers, the compiler spills some of them to memory, typically allocating space on the run-time stack. Because memory access is slower than register access, the running time then actually increases, as the figure below shows (results table omitted here).

    You can see that 20×20 unrolling is actually slower than 10×10.

    Fortunately, x86-64 has enough registers that most loops will become throughput limited before this occurs.

    5.11.2 Branch Prediction and Misprediction Penalties

    Modern processors work well ahead of the currently executing instructions,
    reading new instructions from memory and decoding them to determine what operations to perform on what operands. This instruction pipelining works well as long as the instructions follow in a simple sequence.

    However, when a branch must be predicted, the instruction control unit works well ahead of the execution unit, and by the time an instruction is actually executed the processor must make sure it belongs to the correct branch path (as discussed earlier) so that program behavior is not affected. If the prediction was wrong, the processor has to return to the branch point, fetch and decode instructions from the correct target, and only then resume execution.

    A misprediction does not corrupt the program either, because the speculatively executed instructions only modify register copies: each register has many physical copies, and each intermediate result is kept in one of them. On a misprediction the processor simply cancels those erroneous register updates and restores the correct values (this is the register renaming mechanism described earlier).

    5.12 Understanding Memory Performance

    All modern processors contain one or more cache memories to provide fast access to such small amounts of memory.

    Chapter 6 covers this in detail.

    5.14 Identifying and Eliminating Performance Bottlenecks

    This section is mainly about using code profilers to analyze a program's performance: by seeing how much time is spent in different parts of the program, you can decide where to focus your optimization effort.
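    For example, with GCC on Linux the usual gprof workflow described in the book is: compile and link with the -pg flag (e.g., gcc -Og -pg prog.c -o prog), run the program once so that it writes the profiling file gmon.out, and then run gprof prog to see how much time was spent in each function and how many times each function was called.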

  • Original article: https://blog.csdn.net/Lee567/article/details/126256960