Cache Lab

文章目录

Part A（Cache Stimulator）

Part A是写一个关于cache的模拟器。

分为三个步骤：

接受命令行参数并解析，为后面确定cache大小做铺垫；
从本地读取trace文件，即读取valgrind执行结果（valgrind是一种检测内存泄露的工具）；
模拟cache（包括初始化cache，访问cache等）。

注意细节：

不要重复造轮子，使用现有的Linux下的C语言库；
分配内存后一定要释放内存；
小心隐式类型转换；
注意指针的使用。

1. 使用getopt函数进行命令解析

使用man 3 getopt可看getopt的使用方法。
getopt的函数原型如下：

#include 

       int getopt(int argc, char * const argv[],
                  const char *optstring);

       extern char *optarg;
       extern int optind, opterr, optopt;
1
2
3
4
5
6
7

getopt()函数将传递给mian()函数的argc,argv作为参数，同时接受字符串参数optstring – optstring是由选项Option字母组成的字符串。
关于optstring的格式规范简单总结如下：
(1) 单个字符，表示该选项Option不需要参数。
(2) 单个字符后接一个冒号":“，表示该选项Option需要一个选项参数Option argument。选项参数Option argument可以紧跟在选项Option之后，
或者以空格隔开。选项参数Option argument的首地址赋给optarg。
(3) 单个字符后接两个冒号”::"，表示该选项Option的选项参数Option argument是可选的。当提供了Option argument时，必须紧跟Option之后，
不能以空格隔开，否则getopt()会认为该选项Option没有选项参数Option argument，optarg赋值为NULL。相反，提供了选项参数Option argument，
则optarg指向Option argument。

为了使用getopt()，我们需要在while循环中不断地调用直到其返回-1为止。每一次调用，当getopt()找到一个有效的Option的时候就会返回这个Option字符，并设置几个全局变量。

getopt()设置的全局变量包括：
char *optarg －－当匹配一个选项后，如果该选项带选项参数，则optarg指向选项参数字符串；若该选项不带选项参数，则optarg为NULL；
若该选项的选项参数为可选时，optarg为NULL表明无选项参数，optarg不为NULL时则指向选项参数字符串。

int optind －－下一个待处理元素在argv中的索引值。即下一次调用getopt的时候，从optind存储的位置处开始扫描选项。当getopt()返回-1后，
optind是argv中第一个Operands的索引值。optind的初始值为1。

int opterr －－ opterr的值非0时，在getopt()遇到无法识别的选项，或者某个选项丢失选项参数的时候，getopt()会打印错误信息到标准错误输出。
opterr值为0时，则不打印错误信息。

int optopt －－在上述两种错误之一发生时，一般情况下getopt()会返回’?'，并且将optopt赋值为发生错误的选项。

具体看man手册。

2. 使用fscanf进行trace文件的读取

fscanf 函数原型为 int fscanf (FILE * stream, const char * format, [argument…]); 其功能为根据数据格式 (format)，从输入流 (stream)中读入数据，存储到argument中，遇到空格和换行时结束。 fscanf位于C标准库头文件中。

3. 使用malloc进行cache内存的动态分配

malloc函数声明：

void *malloc(size_t size)
// size -- 内存块的大小，以字节为单位。
1
2

C 库函数 void *malloc(size_t size) 分配所需的内存空间，并返回一个指向它的指针。
该函数返回一个指针，指向已分配大小的内存。如果请求失败，则返回 NULL。

使用完malloc，一定要注意使用free函数释放内存。

4. 使用LRU算法进行block的evict

这里，LRU的实现可采用两种方法，一种是基于queue，另一种是使用timestamp来记录block的最近使用时间戳。

5. 总体实现

#include "cachelab.h"
#include 
#include 
#include 
#include 
#include 
#include 

// cache line的定义
typedef struct cache_line
{
    short int valid_bit;
    unsigned int tag;
    unsigned int LRU_counter;
} cache_line;

unsigned int s, E, b, hit_count = 0, miss_count = 0, eviction_count = 0, time = 0;
// s：组数；E：行数；b：块数；hit_count：命中次数；miss_count：缺失次数；eviction_count：冲突次数
// time：表示时间戳



void accessCache(unsigned int set, unsigned int tag, cache_line (*cache)[E])
{
    // 检查tag位，然后检查valid位，看是否命中
    for (int i = 0; i < E; ++i)
    {
        if ((cache[set][i].tag == tag) && (cache[set][i].valid_bit == 1))
        {
            hit_count += 1;
            cache[set][i].LRU_counter = time++;
            return;
        }
    }
    
    miss_count += 1;
    // 如果不命中（有空行）
    for (int i = 0; i < E; ++i)
    {
        if (cache[set][i].valid_bit == 0)
        {
            cache[set][i].valid_bit = 1;
            cache[set][i].tag = tag;
            cache[set][i].LRU_counter = time++;
            return;
        }
    }

    // 冲突不命中
    int min_time = INT_MAX, tmp;
    eviction_count += 1;

    for (int i = 0; i < E; ++i)
    {
        if (cache[set][i].LRU_counter < min_time)
        {
            min_time = cache[set][i].LRU_counter;
            tmp = i;
        }
    }
    cache[set][tmp].valid_bit = 1;
    cache[set][tmp].tag = tag;
    cache[set][tmp].LRU_counter = time++;
}



int main(int argc, char **argv)
{
    int opt;
    unsigned int tag, set;
    char fileName[50];

    // 解析命令行参数
    while (-1 != (opt = getopt(argc, argv, "s:E:b:t:")))
    {
        switch (opt)
        {
        case 's':
            s = atoi(optarg);
            break;
        case 'E':
            E = atoi(optarg);
            break;
        case 'b':
            b = atoi(optarg);
            break;
        case 't':
            strcpy(fileName, optarg);
            break;
        default:
            printf("wrong argument\n");
            break;
        }
    }

    // 初始化cache
    unsigned int cache_size = (1 << s) * E;
    cache_line(*cache)[E] = (cache_line(*)[E])malloc(cache_size * sizeof(cache_line));

    for (int i = 0; i < (1 << s); ++i)
    {
        for (int j = 0; j < E; ++j)
        {
            cache[i][j].valid_bit = 0;
            cache[i][j].tag = 0;
            cache[i][j].LRU_counter = 1;
        }
    }

    //读取trace文件
    FILE *pFile;
    pFile = fopen(fileName, "r");

    char identifier;
    unsigned address;
    int size;

    while (fscanf(pFile, "%c %x,%d", &identifier, &address, &size) > 0)
    {
        // 访问cache
        set = (address >> b) % (1 << s);   // 提取set位
        tag = address >> (s + b);     // 提取tag位
        switch (identifier)    // 识别要执行的操作
        {
        case 'M':
            accessCache(set, tag, cache);  // 一次modify操作，包含一次load，一次store
        case 'L':
        case 'S':
            accessCache(set, tag, cache); // load操作和store操作等价
            break;

        default:
            break;
        }
    }

    fclose(pFile);
    

    printSummary(hit_count, miss_count, eviction_count); // 打印最终结果

    // 终止
    free(cache);  // 释放cache
    return 0;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146

Part B（Efficient Matrix Transpose）

Part B主要是使用blocking技术来提高cache命中率。
要求实现矩阵的转置函数：transpose_submit。

这里只提供结果，具体可参考这篇，博主写得十分详细。

1. 32 * 32

分块大小：8*8。

for (int i = 0; i < N; i += 8)
{
	for (int j = 0; j < M; j += 8) // 分成8 * 8的小块
    {
            for (int i1 = i; i1 < i + 8; ++i1)
            {
                   if (i == j) // 处理对角线冲突
                   {
                        a1 = A[i1][j];
                        a2 = A[i1][j + 1];
                        a3 = A[i1][j + 2];
                        a4 = A[i1][j + 3];
                        a5 = A[i1][j + 4];
                        a6 = A[i1][j + 5];
                        a7 = A[i1][j + 6];
                        a8 = A[i1][j + 7];
                        B[j][i1] = a1;
                        B[j + 1][i1] = a2;
                        B[j + 2][i1] = a3;
                        B[j + 3][i1] = a4;
                        B[j + 4][i1] = a5;
                        B[j + 5][i1] = a6;
                        B[j + 6][i1] = a7;
                        B[j + 7][i1] = a8;
                        continue;
                    }
                    for (int j1 = j; j1 < j + 8; ++j1) // 其他情况
                        B[j1][i1] = A[i1][j1];
             }
      }
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31

2. 64 * 64

分块大小：由于要重复利用cache，所以沿用88，但是通过将88再分成4块，然后对这4块分别处理。

for (int i = 0; i < N; i += 8)
{
	for (int j = 0; j < M; j += 8)
    {
    	for (int i1 = i; i1 < i + 4; ++i1)
        {
              a1 = A[i1][j];
              a2 = A[i1][j + 1];
              a3 = A[i1][j + 2];
              a4 = A[i1][j + 3];
              a5 = A[i1][j + 4];
              a6 = A[i1][j + 5];
              a7 = A[i1][j + 6];
              a8 = A[i1][j + 7];

              B[j][i1] = a1;
              B[j + 1][i1] = a2;
              B[j + 2][i1] = a3;
              B[j + 3][i1] = a4;
              B[j][i1 + 4] = a5;
              B[j + 1][i1 + 4] = a6;
              B[j + 2][i1 + 4] = a7;
              B[j + 3][i1 + 4] = a8;
         }
         for (int j1 = j; j1 < j + 4; ++j1)
         {
             a1 = A[i + 4][j1];
             a2 = A[i + 5][j1];
             a3 = A[i + 6][j1];
             a4 = A[i + 7][j1];
             a5 = B[j1][i + 4];
             a6 = B[j1][i + 5];
             a7 = B[j1][i + 6];
             a8 = B[j1][i + 7];

             B[j1][i + 4] = a1;
             B[j1][i + 5] = a2;
             B[j1][i + 6] = a3;
             B[j1][i + 7] = a4;
             B[j1 + 4][i] = a5;
             B[j1 + 4][i + 1] = a6;
             B[j1 + 4][i + 2] = a7;
             B[j1 + 4][i + 3] = a8;
        }
        for (int i1 = i + 4; i1 < i + 8; ++i1)
        {
             a1 = A[i1][j + 4];
             a2 = A[i1][j + 5];
             a3 = A[i1][j + 6];
             a4 = A[i1][j + 7];
             B[j + 4][i1] = a1;
             B[j + 5][i1] = a2;
             B[j + 6][i1] = a3;
             B[j + 7][i1] = a4;
         }
      }
   }
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58

3. 61 * 67

分块大小：由于矩阵的大小不规格，通过对block的大小进行多次测试来选择最好的一种方案。
这里是选择：17*17。

for (int i = 0; i < N; i += 17)
{
    for (int j = 0; j < M; j += 17)
    {
         for (int i1 = i; (i1 < i + 17) && (i1 < N); ++i1)   // 注意边界情况
         {
              for (int j1 = j; (j1 < j + 17) && (j1 < M); ++j1)
              {
                  B[j1][i1] = A[i1][j1];
               }
          }
     }
}
1
2
3
4
5
6
7
8
9
10
11
12
13

测试结果

Part A :           
                        Your simulator     Reference simulator
Points (s,E,b)    Hits  Misses  Evicts    Hits  Misses  Evicts
     3 (1,1,1)       9       8       6       9       8       6  traces/yi2.trace
     3 (4,2,4)       4       5       2       4       5       2  traces/yi.trace
     3 (2,1,4)       2       3       1       2       3       1  traces/dave.trace
     3 (2,1,3)     167      71      67     167      71      67  traces/trans.trace
     3 (2,2,3)     201      37      29     201      37      29  traces/trans.trace
     3 (2,4,3)     212      26      10     212      26      10  traces/trans.trace
     3 (5,1,5)     231       7       0     231       7       0  traces/trans.trace
     6 (5,1,5)  265189   21775   21743  265189   21775   21743  traces/long.trace

Summary for official submission (func 0): correctness=1 misses=287
    
TEST_CSIM_RESULTS=27


Part B :

1.  32 * 32
Function 0 (2 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 0 (Transpose submission): hits:1766, misses:287, evictions:255

Function 1 (2 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 1 (Simple row-wise scan transpose): hits:870, misses:1183, evictions:1151

Summary for official submission (func 0): correctness=1 misses=287

TEST_TRANS_RESULTS=1:287


2.   64 * 64
Function 0 (2 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 0 (Transpose submission): hits:9066, misses:1179, evictions:1147

Function 1 (2 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 1 (Simple row-wise scan transpose): hits:3474, misses:4723, evictions:4691

Summary for official submission (func 0): correctness=1 misses=1179

TEST_TRANS_RESULTS=1:1179



3. 67 * 61
Function 0 (2 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 0 (Transpose submission): hits:6229, misses:1950, evictions:1918

Function 1 (2 total)
Step 1: Validating and generating memory traces
Step 2: Evaluating performance (s=5, E=1, b=5)
func 1 (Simple row-wise scan transpose): hits:3756, misses:4423, evictions:4391

Summary for official submission (func 0): correctness=1 misses=1950

TEST_TRANS_RESULTS=1:1950
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66

相关阅读:
极端业务场景下，我们应该如何做好稳定性保障?
Win10 KB5017308安全更新补丁获取及更新说明
 vue2 修改密码 (4)
【Hadoop】 Hive：内部表与外部表的创建与查看
 08.29web自动化测试
 开放智慧，助力学习——电大搜题，打开学无止境的新篇章
 基本微信小程序的电影票务系统-电影票预订系统
 Win10+Anaconda+labelme安装（done）
WPF学习笔记：给文字添加线性渐变效果
 电子书制作软件Vellum mac中文版特点
原文地址：https://blog.csdn.net/weixin_50697073/article/details/127942117