CS144（2023 Spring）Lab 1: stitching substrings into a byte stream

文章目录

前言
- 其他笔记
- 相关链接
1. Getting started
2. Putting substrings in sequence
3. 测试与优化

前言

这一个Lab主要是实现一个TCP receiver的字符串接收重组部分。

其他笔记

Lab 0: networking warmup
Lab 1: stitching substrings into a byte stream

1. Getting started

CS144这个Lab本来是自上而下从头到尾代码复用的，这导致我开局顺手就把远程库给干掉了，结果没想到到Lab居然叫我merge，那就重新弄一下吧：

git remote add base  git@github.com:CS144/minnow.git
git fetch base
1
2

打开VS的分支管理器，可以看到这个远程库里面应该是check1 - 6挨个发布的，看提交记录这个应该是当时上课的时候才一步步弄的仓库，理论上来讲直接合并check6的分支就可以了
在这里插入图片描述
右键合并到main，接受冲突，然后提交上传，就可以继续撸码了

2. Putting substrings in sequence

2.1 需求分析

在这里插入图片描述
Lab1和2做的事情是写一个TCP接收器，大概工作就如同Lab0的末尾写的那样，写一个类去处理字节流，不过这个数据将不用内存传输上，而是通过网络传输。

由于网络传输的不确定性以及成本问题，在传输数据时我们都是将串切成一段一段的，比如这里提到的每个 $short\ segments$ 不超过1460个 $b y t e$ ，又是考虑到网络传输的不确定性以及TCP的性质，这些字段通常会出现乱序、丢失的情况，而我们需要保证能够重排回最初的字符串。

具体到本Lab，我们要实现一个叫 $R e a sse mb l er$ 的东西，这是用来在接收端接受上面说的那一堆字段的，而每一个 $B y t e$ （而非 $se g m e n t$ ）都有一个对应的 $in d e x$ 。文档约束了这个类的两个必要接口，insert将一个data写入output，写入的位置自first_index起，它还用了一个bool变量去标识当前段是否为最后一个段；而bytes_pending则仅仅返回一下存在 $R e a sse mb l er$ 中的字节，但是哪些字节存在这里面呢？我们知道单纯网络传输不保证顺序，有可能提前接收到了后面的字段，就只好暂存在 $R e a sse mb l er$ ，等它前面的字段写完了再存进去。

然后进一步展示了这个类应当做的工作的一些细节。
在这里插入图片描述
首先，我们应当知道流的下一个待接收的字节（的 $in d e x$ ），正如上面说的那样，类内部还有一大堆字段嗷嗷待哺等着进流；

在这里插入图片描述
然后，我们需要处理提前到达的暂时没被推进流的串；

在这里插入图片描述
而对于哪些超出流接收能力的字节，应当直接扔掉；

在这里插入图片描述
然后这个图演示了总共存在三类byte：未进流暂存的、已进流缓存的、已被read弹出的，第三个我们这个Lab应该不用考虑。绿色内存以及类内暂存的一整块空间共同组成了capacity，可知我们的红色内存的最大值只能为capacity - buffered，超过这个的字节就得丢掉了。在实现上，这个值就是上一个Lab实现的available_capacity。

2.2 注意事项

在这里插入图片描述
然后是一些FAQ，我们可以提炼出这些信息：

流的 $in d e x$ 自0始；
我们会同Lab0一样有个跑分环境；
每个字段都是来自字符串的准确切片，不用做异常处理；
鼓励用标准库、数据结构；
尽可能早地将字节推进流，免得一直存着；
insert接收到的data字符串是有可能与其他字符串重叠的；
可以往类里面加私有成员（这不是废话吗）；
对于每一个字节，类内部应当只存储它的一份副本，不要存重叠的字符串；
运行./scripts/lines-of-code以计算实现代码行数，这个值一般在50-60。

2.3 代码实现

下面给出我的代码实现，里面有很多注释，就不挨着说了，不过注意我这里用到了std::ranges和std::views，因此你的编译器要在gcc13.1及以上。稍微需要探讨一下的是用什么数据结构来当做 $b u ff er$ ，这个数据结构需要满足什么样的需求呢？首先它会有频繁的任意处插入，然后它需要去频繁遍历比较大小查找，给出几个常见的数据结构的复杂度：

	插入	头删	删除k个	查找
list	$O (1)$	$O (1)$	$O (k)$	$O (n)$
vector	$O (n)$	$O (n)$	$O (n)$	$O(\log n)$
deque	$O (n)$	$O (1)$	$O (n)$	$O(\log n)$
map	$O(\log n)$	$O(\log n)$	$O(\log n + k)$	$O (n)$

可以看到，综合考虑下list基本是最优秀的容器了。其中虽然map红黑树自带的查找是 $O(\log n)$ ，但是我们的查找是要查找两个端点，如果将左右区间的pair作为key的话就不能用它内置的二分查找算法——它无法传递自定义比较谓词，而使用中的二分算法的话又因为它的迭代器不满足随即迭代器的条件，意味着只能 $O (n)$ 查找。综合来看，我们维护一个有序链表是最优的。

此外在向 $b u ff er$ 暂存的过程中，可能涉及到区间合并的问题，可以参考LeetCode 57. 插入区间
，给出这道题我的实现，本Lab直接套用即可：

// https://leetcode.cn/u/zi-bu-yu-mf/
class Solution {
public:
    vector<vector<int>> insert(vector<vector<int>>& intervals, vector<int>& newInterval) {
        auto beg = intervals.begin(), end = intervals.end();
        int& a = newInterval[0], & b = newInterval[1];
        auto l = lower_bound(beg, end, a, [](auto& a, int b) { return a[1] < b; });
        auto r = upper_bound(  l, end, b, [](int b, auto& a) { return a[0] > b; });
        if (l != end) a = min(a, l[ 0][0]);
        if (r != beg) b = max(b, r[-1][1]);
        intervals.insert(intervals.erase(l, r), newInterval);
        return intervals;
    }
};
1
2
3
4
5
6
7
8
9
10
11
12
13
14

/*****************************************************************//**
 * \file   reassembler.hh
 * \brief  实现一个 Reassembler 类, 用于将乱序的字符串重新组装成有序的
 *         字符串，并推入字节流.
 * 
 * \author JMC
 * \date   August 2023
 *********************************************************************/
#pragma once

#include "byte_stream.hh"

#include 
#include 
#include 

class Reassembler
{
	bool had_last_ {};	// 是否已经插入了最后一个字符串
	uint64_t next_index_ {};	// 下一个要写入的字节的索引
	uint64_t buffer_size_ {};	// buffer_中的字节数
	std::list<std::tuple<uint64_t, uint64_t, std::string>> buffer_ {};

	/**
	 * \breif 将data推入output流.
	 */
	void push_to_output(std::string data, Writer& output);

	/**
	 * \brief 将data推入buffer暂存区.
	 * \param first_index data的第一个字节的索引
	 * \param last_index  data的最后一个字节的索引
	 * \param data        待推入的字符串, 下标为[first_index, last_index]闭区间
	 */
	void buffer_push( uint64_t first_index, uint64_t last_index, std::string data );

	/**
	 * 尝试将buffer中的串推入output流.
	 */
	void buffer_pop(Writer& output);

public:
  /*
   * Insert a new substring to be reassembled into a ByteStream.
   *   `first_index`: the index of the first byte of the substring
   *   `data`: the substring itself
   *   `is_last_substring`: this substring represents the end of the stream
   *   `output`: a mutable reference to the Writer
   *
   * The Reassembler's job is to reassemble the indexed substrings (possibly out-of-order
   * and possibly overlapping) back into the original ByteStream. As soon as the Reassembler
   * learns the next byte in the stream, it should write it to the output.
   *
   * If the Reassembler learns about bytes that fit within the stream's available capacity
   * but can't yet be written (because earlier bytes remain unknown), it should store them
   * internally until the gaps are filled in.
   *
   * The Reassembler should discard any bytes that lie beyond the stream's available capacity
   * (i.e., bytes that couldn't be written even if earlier gaps get filled in).
   *
   * The Reassembler should close the stream after writing the last byte.
   */
  void insert( uint64_t first_index, std::string data, bool is_last_substring, Writer& output );

  // How many bytes are stored in the Reassembler itself?
  uint64_t bytes_pending() const;
};
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67

/*****************************************************************//**
 * \file   reassembler.cc
 * \brief  实现一个 Reassembler 类, 用于将乱序的字符串重新组装成有序的
 *         字符串，并推入字节流.
 * \author JMC
 * \date   August 2023
 *********************************************************************/
#include "reassembler.hh"

#include 
#include 

using namespace std;
void Reassembler::push_to_output( std::string data, Writer& output ) {
  next_index_ += data.size();
  output.push( move( data ) );
}

void Reassembler::buffer_push( uint64_t first_index, uint64_t last_index, std::string data )
{
  // 合并区间
  auto l = first_index, r = last_index;
  auto beg = buffer_.begin(), end = buffer_.end();
  auto lef = lower_bound( beg, end, l, []( auto& a, auto& b ) { return get<1>( a ) < b; } );
  auto rig = upper_bound( lef, end, r, []( auto& b, auto& a ) { return get<0>( a ) > b; } );
  if (lef != end) l = min( l, get<0>( *lef ) );
  if (rig != beg) r = max( r, get<1>( *prev( rig ) ) );
  
  // 当data已在buffer_中时，直接返回
  if ( lef != end && get<0>( *lef ) == l && get<1>( *lef ) == r ) {
    return;
  }

  buffer_size_ += 1 + r - l;
  if ( data.size() == r - l + 1 && lef == rig ) { // 当buffer_中没有data重叠的部分
	buffer_.emplace( rig, l, r, move( data ) );
	return;
  }
  string s( 1 + r - l, 0 );

  for ( auto&& it : views::iota( lef, rig ) ) {
	auto& [a, b, c] = *it;
	buffer_size_ -= c.size();
    ranges::copy(c, s.begin() + a - l);
  }
  ranges::copy(data, s.begin() + first_index - l);
  buffer_.emplace( buffer_.erase( lef, rig ), l, r, move( s ) );
}

void Reassembler::buffer_pop( Writer& output ) {
  while ( !buffer_.empty() && get<0>( buffer_.front() ) == next_index_ ) {
    auto& [a, b, c] = buffer_.front();
    buffer_size_ -= c.size();
    push_to_output( move( c ), output ); 
    buffer_.pop_front();
  }

  if ( had_last_ && buffer_.empty() ) {
    output.close();
  }
}

void Reassembler::insert( uint64_t first_index, string data, bool is_last_substring, Writer& output )
{
  if ( data.empty() ) {
    if ( is_last_substring ) {
      output.close();
    }
    return;
  }
  auto end_index = first_index + data.size();                  // data: [first_index, end_index)
  auto last_index = next_index_ + output.available_capacity(); // 可用范围: [next_index_, last_index)
  if ( end_index < next_index_ || first_index >= last_index ) {
    return; // 不在可用范围内, 直接返回
  }

  // 调整data的范围
  if ( last_index < end_index ) {
    end_index = last_index;
    data.resize( end_index - first_index );
    is_last_substring = false;
  }
  if ( first_index < next_index_ ) {
    data = data.substr( next_index_ - first_index );
    first_index = next_index_;
  }

  // 若data可以直接写入output, 则直接写入
  if ( first_index == next_index_ && ( buffer_.empty() || end_index < get<1>( buffer_.front() ) + 2 ) ) {
    if ( buffer_.size() ) { // 若重叠, 则调整data的范围
      data.resize( min( end_index, get<0>( buffer_.front() ) ) - first_index );
    }
    push_to_output( move( data ), output );
  } else { // 否则, 将data插入buffer_
    buffer_push( first_index, end_index - 1, data );
  }
  had_last_ |= is_last_substring;
  
  // 尝试将buffer_中的数据写入output
  buffer_pop(output);
}

uint64_t Reassembler::bytes_pending() const
{
  return buffer_size_;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106

3. 测试与优化

写这个Lab我跑测试遇到了挺多Bug的，心态比较麻，看一下我的git记录，大体代码从五点到六点半就写完了，然后又调了三四个小时的Bug。。。：
在这里插入图片描述

在这里插入图片描述

这个Lab最主要的优化点还是要记得使用std::move转发字符串，然后这是我不管字符串能不能直接进流，都先进buffer后进流的速度
在这里插入图片描述
这是我对这种情况的特殊处理后的速度，可以看到来到了11+Gbit/s这个速度，按理说我的入buffer操作对于这种情况仅仅是多跑了一遍链表而已，没想到速度也差不少：

最后看看代码行数，运行./scripts/lines-of-code，如果报错则需要安装一个工具sudo apt-get install sloccount：
在这里插入图片描述
它说基础代码是22行，我们就写了77行。我看了一下这个统计行数其实就是去除了注释和空行，他说50-60行是正常的，我这里写得比较详细，想压行也不是压不了，也差不多。

相关阅读:
Grass推出Layer 2 Data Rollup
【FPGA教程案例66】硬件开发板调试6——基于FPGA的UDP网口通信和数据传输
Linux 使用 crontab 定时拆分日志、清理过期文件
Java8函数式编程应用
【飞桨PaddleSpeech语音技术课程】— 语音识别-流式服务
＜C++＞手撕搜索二叉树
HLS学习1：点灯
vivo 全球商城：电商平台通用取货码设计
Linux编写脚本使用Docker部署项目
稀疏-平坦矩阵信号的重构条件

原文地址：https://blog.csdn.net/J__M__C/article/details/132567895