• 字符串压缩(一)之ZSTD


    前言

      最近项目上有大量的字符串数据需要存储到内存,并且需要储存至一定时间,于是自然而然的想到了使用字符串压缩算法对“源串”进行压缩存储。由此触发了对一些优秀压缩算法的调研。
      字符串压缩,我们通常的需求有几个,一是高压缩率,二是压缩速率高,三是解压速率高。不过高压缩率与高压缩速率是鱼和熊掌的关系,不可皆得,优秀的算法一般也是采用压缩率与性能折中的方案。从压缩率、压缩速率、解压速率考虑,zstd与lz4有较好的压缩与解压性能,最终选取zstd与lz4进行调研。

      zstd是facebook开源的提供高压缩比的快速压缩算法(参考https://github.com/facebook/zstd),很想了解一下它在压缩与解压方面的实际表现。

    一、zstd压缩与解压

      ZSTD_compress属于ZSTD的Simple API范畴,只有压缩级别可以设置。

      ZSTD_compress函数原型如下:

      size_t ZSTD_compress(void* dst, size_t dstCapacity, const void* src, size_t srcSize, int compressionLevel)

     

      ZSTD_decompress函数原型如下:
      size_t ZSTD_decompress( void* dst, size_t dstCapacity, const void* src, size_t compressedSize);

      我们先来看看zstd的压缩与解压缩示例。

    复制代码
     1 #include <stdio.h>
     2 #include <string.h>
     3 #include <sys/time.h>
     4 #include <malloc.h>
     5 #include <zstd.h>
     6 #include <iostream>
     7 
     8 using namespace std;
     9 
    10 int main()
    11 {
    12     // compress
    13     size_t com_space_size;
    14     size_t peppa_pig_text_size;
    15 
    16     char *com_ptr = NULL;
    17     char peppa_pig_buf[2048] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining. Can we go out to play?Daddy: Alright, run along you two.Narrator: Peppa loves jumping in muddy puddles.Peppa: I love muddy puddles.Mummy: Peppa. If you jumping in muddy puddles, you must wear your boots.Peppa: Sorry, Mummy.Narrator: George likes to jump in muddy puddles, too.Peppa: George. If you jump in muddy puddles, you must wear your boots.Narrator: Peppa likes to look after her little brother, George.Peppa: George, let's find some more pud dles.Narrator: Peppa and George are having a lot of fun. Peppa has found a lttle puddle. George hasfound a big puddle.Peppa: Look, George. There's a really big puddle.Narrator: George wants to jump into the big puddle first.Peppa: Stop, George. | must check if it's safe for you. Good. It is safe for you. Sorry, George. It'sonly mud.Narrator: Peppa and George love jumping in muddy puddles.Peppa: Come on, George. Let's go and show Daddy.Daddy: Goodness me.Peppa: Daddy. Daddy. Guess what we' ve been doing.Daddy: Let me think... Have you been wa tching television?Peppa: No. No. Daddy.Daddy: Have you just had a bath?Peppa: No. No.Daddy: | know. You've been jumping in muddy puddles.Peppa: Yes. Yes. Daddy. We've been jumping in muddy puddles.Daddy: Ho. Ho. And look at the mess you're in.Peppa: Oooh....Daddy: Oh, well, it's only mud. Let's clean up quickly before Mummy sees the mess.Peppa: Daddy, when we've cleaned up, will you and Mummy Come and play, too?Daddy: Yes, we can all play in the garden.Narrator: Peppa and George are wearing their boots. Mummy and Daddy are wearing their boots.Peppa loves jumping up and down in muddy puddles. Everyone loves jumping up and down inmuddy puddles.Mummy: Oh, Daddy pig, look at the mess you're in. .Peppa: It's only mud.";
    18     
    19     peppa_pig_text_size = strlen(peppa_pig_buf);
    20     com_space_size= ZSTD_compressBound(peppa_pig_text_size);
    21     com_ptr = (char *)malloc(com_space_size);
    22     if(NULL == com_ptr) {
    23         cout << "compress malloc failed" << endl;
    24         return -1;
    25     }
    26 
    27     size_t com_size;
    28     com_size = ZSTD_compress(com_ptr, com_space_size, peppa_pig_buf, peppa_pig_text_size, ZSTD_fast);
    29     cout << "peppa pig text size:" << peppa_pig_text_size << endl;
    30     cout << "compress text size:" << com_size << endl;
    31     cout << "compress ratio:" << (float)peppa_pig_text_size / (float)com_size << endl << endl;
    32 
    33 
    34     // decompress
    35     char* decom_ptr = NULL;
    36     unsigned long long decom_buf_size;
    37     decom_buf_size = ZSTD_getFrameContentSize(com_ptr, com_size);
    38     
    39     decom_ptr = (char *)malloc((size_t)decom_buf_size);
    40     if(NULL == decom_ptr) {
    41         cout << "decompress malloc failed" << endl;
    42         return -1;
    43     }
    44 
    45     size_t decom_size;
    46     decom_size = ZSTD_decompress(decom_ptr, decom_buf_size, com_ptr, com_size);
    47     cout << "decompress text size:" << decom_size << endl;
    48 
    49     if(strncmp(peppa_pig_buf, decom_ptr, peppa_pig_text_size)) {
    50         cout << "decompress text is not equal peppa pig text" << endl;
    51     }
    52     
    53     free(com_ptr);
    54     free(decom_ptr);
    55     return 0;
    56 }
    复制代码

    执行结果:

      从结果可以发现,压缩之前的peppa pig文本长度为1827,压缩后的文本长度为759,压缩率为2.4,解压后的长度与压缩前相等。

      另外,上文提到可以调整ZSTD_compress函数的压缩级别,zstd的默认级别为ZSTD_CLEVEL_DEFAULT = 3,最小值为0,最大值为ZSTD_MAX_CLEVEL = 22。另外也提供一些策略设置,例如 ZSTD_fast, ZSTD_greedy, ZSTD_lazy, ZSTD_lazy2, ZSTD_btlazy2。压缩级别越高,压缩率越高,但是压缩速率越低。

    二、ZSTD压缩与解压性能探索

      上面探索了zstd的基础压缩与解压方法,接下来再摸索一下zstd的压缩与解压缩性能。

      测试方法是,使用ZSTD_compress连续压缩同一段文本并持续10秒,最后得到每一秒的平均压缩速率。测试压缩性能的代码示例如下:

    复制代码
     1 #include <stdio.h>
     2 #include <string.h>
     3 #include <sys/time.h>
     4 #include <malloc.h>
     5 #include <zstd.h>
     6 #include <iostream>
     7 
     8 using namespace std;
     9 
    10 int main()
    11 {
    12     int cnt = 0;
    13 
    14     size_t com_size;
    15     size_t com_space_size;
    16     size_t peppa_pig_text_size;
    17     
    18     char *com_ptr = NULL;
    19     char peppa_pig_buf[2048] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining. Can we go out to play?Daddy: Alright, run along you two.Narrator: Peppa loves jumping in muddy puddles.Peppa: I love muddy puddles.Mummy: Peppa. If you jumping in muddy puddles, you must wear your boots.Peppa: Sorry, Mummy.Narrator: George likes to jump in muddy puddles, too.Peppa: George. If you jump in muddy puddles, you must wear your boots.Narrator: Peppa likes to look after her little brother, George.Peppa: George, let's find some more pud dles.Narrator: Peppa and George are having a lot of fun. Peppa has found a lttle puddle. George hasfound a big puddle.Peppa: Look, George. There's a really big puddle.Narrator: George wants to jump into the big puddle first.Peppa: Stop, George. | must check if it's safe for you. Good. It is safe for you. Sorry, George. It'sonly mud.Narrator: Peppa and George love jumping in muddy puddles.Peppa: Come on, George. Let's go and show Daddy.Daddy: Goodness me.Peppa: Daddy. Daddy. Guess what we' ve been doing.Daddy: Let me think... Have you been wa tching television?Peppa: No. No. Daddy.Daddy: Have you just had a bath?Peppa: No. No.Daddy: | know. You've been jumping in muddy puddles.Peppa: Yes. Yes. Daddy. We've been jumping in muddy puddles.Daddy: Ho. Ho. And look at the mess you're in.Peppa: Oooh....Daddy: Oh, well, it's only mud. Let's clean up quickly before Mummy sees the mess.Peppa: Daddy, when we've cleaned up, will you and Mummy Come and play, too?Daddy: Yes, we can all play in the garden.Narrator: Peppa and George are wearing their boots. Mummy and Daddy are wearing their boots.Peppa loves jumping up and down in muddy puddles. Everyone loves jumping up and down inmuddy puddles.Mummy: Oh, Daddy pig, look at the mess you're in. .Peppa: It's only mud.";
    20 
    21     timeval st, et;
    22 
    23     peppa_pig_text_size = strlen(peppa_pig_buf);
    24     com_space_size= ZSTD_compressBound(peppa_pig_text_size);
    25 
    26     gettimeofday(&st, NULL);
    27     while(1) {
    28         
    29         com_ptr = (char *)malloc(com_space_size);
    30         com_size = ZSTD_compress(com_ptr, com_space_size, peppa_pig_buf, peppa_pig_text_size, ZSTD_fast);
    31         
    32         free(com_ptr);
    33         cnt++;
    34         
    35         gettimeofday(&et, NULL);
    36         if(et.tv_sec - st.tv_sec >= 10) {
    37             break;
    38         }
    39     }
    40 
    41     cout << "comparess per second:" << cnt/10 << " times" << endl;
    42     return 0;
    43 }
    复制代码

    执行结果:

      结果显示ZSTD的压缩性能大概在每秒6-7万次左右,这个结果其实并不是太理想。需要说明的是压缩性能与待压缩文本的长度、字符内容也是有关系的。

      我们再来探索一下ZSTD的解压缩性能。与上面的测试方法类似,先对本文进行压缩,然后连续解压同一段被压缩过的数据并持续10秒,最后得到每一秒的平均解压速率。测试解压性能的代码示例如下:

    复制代码
     1 #include <stdio.h>
     2 #include <string.h>
     3 #include <sys/time.h>
     4 #include <malloc.h>
     5 #include <zstd.h>
     6 #include <iostream>
     7 
     8 using namespace std;
     9 
    10 int main()
    11 {
    12     int cnt = 0;
    13 
    14     size_t com_size;
    15     size_t com_space_size;
    16     size_t peppa_pig_text_size;
    17 
    18     timeval st, et;
    19     
    20     char *com_ptr = NULL;
    21     char peppa_pig_buf[2048] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining. Can we go out to play?Daddy: Alright, run along you two.Narrator: Peppa loves jumping in muddy puddles.Peppa: I love muddy puddles.Mummy: Peppa. If you jumping in muddy puddles, you must wear your boots.Peppa: Sorry, Mummy.Narrator: George likes to jump in muddy puddles, too.Peppa: George. If you jump in muddy puddles, you must wear your boots.Narrator: Peppa likes to look after her little brother, George.Peppa: George, let's find some more pud dles.Narrator: Peppa and George are having a lot of fun. Peppa has found a lttle puddle. George hasfound a big puddle.Peppa: Look, George. There's a really big puddle.Narrator: George wants to jump into the big puddle first.Peppa: Stop, George. | must check if it's safe for you. Good. It is safe for you. Sorry, George. It'sonly mud.Narrator: Peppa and George love jumping in muddy puddles.Peppa: Come on, George. Let's go and show Daddy.Daddy: Goodness me.Peppa: Daddy. Daddy. Guess what we' ve been doing.Daddy: Let me think... Have you been wa tching television?Peppa: No. No. Daddy.Daddy: Have you just had a bath?Peppa: No. No.Daddy: | know. You've been jumping in muddy puddles.Peppa: Yes. Yes. Daddy. We've been jumping in muddy puddles.Daddy: Ho. Ho. And look at the mess you're in.Peppa: Oooh....Daddy: Oh, well, it's only mud. Let's clean up quickly before Mummy sees the mess.Peppa: Daddy, when we've cleaned up, will you and Mummy Come and play, too?Daddy: Yes, we can all play in the garden.Narrator: Peppa and George are wearing their boots. Mummy and Daddy are wearing their boots.Peppa loves jumping up and down in muddy puddles. Everyone loves jumping up and down inmuddy puddles.Mummy: Oh, Daddy pig, look at the mess you're in. .Peppa: It's only mud.";
    22 
    23     size_t decom_size;
    24     char* decom_ptr = NULL;
    25     unsigned long long decom_buf_size;
    26 
    27     peppa_pig_text_size = strlen(peppa_pig_buf);
    28     com_space_size= ZSTD_compressBound(peppa_pig_text_size);
    29     com_ptr = (char *)malloc(com_space_size);
    30     
    31     com_size = ZSTD_compress(com_ptr, com_space_size, peppa_pig_buf, peppa_pig_text_size, 1);
    32 
    33     gettimeofday(&st, NULL);
    34     decom_buf_size = ZSTD_getFrameContentSize(com_ptr, com_size);
    35     
    36     while(1) {
    37     
    38         decom_ptr = (char *)malloc((size_t)decom_buf_size);
    39         
    40         decom_size = ZSTD_decompress(decom_ptr, decom_buf_size, com_ptr, com_size);
    41         if(decom_size != peppa_pig_text_size) {
    42         
    43             cout << "decompress error" << endl;
    44             break;
    45         }
    46         
    47         free(decom_ptr);
    48         
    49         cnt++;
    50         gettimeofday(&et, NULL);
    51         if(et.tv_sec - st.tv_sec >= 10) {
    52             break;
    53         }
    54     }
    55 
    56     cout << "decompress per second:" << cnt/10 << " times" << endl;
    57     
    58     free(com_ptr);
    59     return 0;
    60 }
    复制代码

    执行结果:

       结果显示ZSTD的解压缩性能大概在每秒12万次左右,解压性能比压缩性能高。

    三、zstd的高级用法

      zstd提供了一个名为PZSTD的压缩和解压工具。PZSTD(parallel zstd),并行压缩的zstd,是一个使用多线程对待压缩文本进行切片分段,且进行并行压缩的命令行工具。

      其实高版本(v1.4.0及以上)的zstd也提供了指定多线程对文本进行并行压缩的相关API接口,也就是本小节要介绍的zstd高级API用法。下面我们再来探索一下zstd的多线程压缩使用方法。

      多线程并行压缩的两个关键API,一个是参数设置API,另一个是压缩API。

      参数设置API的原型是:

      size_t ZSTD_CCtx_setParameter(ZSTD_CCtx* cctx, ZSTD_cParameter param, int value)

      压缩API的原型是:

      size_t ZSTD_compress2(ZSTD_CCtx* cctx, void* dst, size_t dstCapacity, const void* src, size_t srcSize)

      下面给出zstd并行压缩的示例demo,通过ZSTD_CCtx_setParameter设置线程数为3,即指定宏ZSTD_c_nbWorkers为3,通过ZSTD_compress2压缩相关文本。另外,为了展示zstd确实使用了多线程,需要先读取一个非常大的文件,作为zstd的压缩文本源,尽量使zstd运行较长时间。

    复制代码
     1 #include <stdio.h>
     2 #include <string.h>
     3 #include <sys/time.h>
     4 #include <malloc.h>
     5 #include <zstd.h>
     6 #include <iostream>
     7 
     8 using namespace std;
     9 
    10 int main()
    11 {
    12     size_t com_size;
    13     size_t com_space_size;
    14 
    15     FILE *fp = NULL;
    16     unsigned int file_len;
    17     
    18     char *com_ptr = NULL;
    19     char *file_text_ptr = NULL;
    20 
    21     fp = fopen("xxxxxx", "r");
    22     if(NULL == fp){
    23          cout << "file open failed" << endl;
    24          return -1;
    25     }
    26 
    27     fseek(fp, 0, SEEK_END);
    28     file_len = ftell(fp);
    29     fseek(fp, 0, SEEK_SET);
    30     cout << "file length:" << file_len << endl;
    31 
    32     // malloc space for file content
    33     file_text_ptr = (char *)malloc(file_len);
    34     if(NULL == file_text_ptr) {
    35         cout << "malloc failed" << endl;
    36         return -1;
    37     }
    38 
    39     // malloc space for compress space
    40     com_space_size = ZSTD_compressBound(file_len);
    41     com_ptr = (char *)malloc(com_space_size);
    42     if(NULL == com_ptr) {
    43         cout << "malloc failed" << endl;
    44         return -1;
    45     }
    46 
    47     // read text from source file
    48     fread(file_text_ptr, 1, file_len, fp);
    49     fclose(fp);
    50 
    51     ZSTD_CCtx* cctx;
    52     cctx = ZSTD_createCCtx();
    53 
    54     // set multi-thread parameter
    55     ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 3);
    56     ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, ZSTD_btlazy2);
    57 
    58     com_size = ZSTD_compress2(cctx, com_ptr, com_space_size, file_text_ptr, file_len);
    59     
    60     free(com_ptr);
    61     free(file_text_ptr);
    62     return 0;
    63 }
    复制代码

      运行上述demo,可见zstd确实启动了3个线程对文本进行了并行压缩。且设置的线程数越多,压缩时间越短,这里就不详细展示了,读者可以自行实验。

      需要说明的是,zstd当前默认编译单线程的库文件,要实现多线程的API调用,需要在make的时候指定编译参数ZSTD_MULTITHREAD。

      

      另外,zstd还支持线程池的方式,线程池的函数原型:

      POOL_ctx* ZSTD_createThreadPool(size_t numThreads)

      线程池可以避免在多次、连续压缩场景时频繁的去创建线程、撤销线程产生的非必要开销,使得算力主要开销在文本压缩方面。

    四、总结

       本篇分享了zstd压缩与解压缩使用的基本方法,对压缩与解压的性能进行了摸底,最后探索了zstd多线程压缩的使用方法。

      从压缩测试来看,zstd的压缩比其实已经比较好了,比原文所占用空间缩小了一半以上,当然压缩比也跟待压缩文本的内容有关。

      从性能执行结果来看,zstd的压缩与解压性能表现比较勉强,我认为zstd在鱼(性能)和熊掌(压缩比)之间更偏向熊掌一些,不过对一些性能要求不太高的,但是要高压缩比的场景是比较符合的。

      多线程并行压缩,在有大文本需要连续多次压缩的场景下,结合线程池可以很好的提升压缩速率。

     

      下一篇介绍调研的LZ4压缩算法。码字不易,还请各位技术爱好者登录点个赞呀!

     

      本文来自博客园,作者:T-BARBARIANS,转载请注明原文链接:https://www.cnblogs.com/t-bar/p/15956868.html 谢谢!

  • 相关阅读:
    整理之前的书籍
    Linux常用操作集合(二)
    集成电路模拟版图入门-版图基础学习笔记(六)
    SVN的基本使用
    数学建模Matlab之检验与相关性分析
    NodeJs 面试题 2022.6
    天天都在说的用户画像到底该如何构建?看这篇就够了!
    GBase 8s数据库级别权限
    锯齿波-RC充放电路
    体验Semantic Kernel图片内容识别
  • 原文地址:https://www.cnblogs.com/t-bar/p/15956868.html