• 字符串压缩(三)之短字符串压缩


      本文来自博客园,作者:T-BARBARIANS,转载请注明原文链接:https://www.cnblogs.com/t-bar/p/16506289.html 谢谢!

     

     前言

      上一篇探索了LZ4的压缩和解压性能,以及对LZ4和ZSTD的压缩、解压性能进行了横向对比。文末的最后也给了一个彩蛋:任意长度的字符串都可以被ZSTD、LZ4之类的压缩算压缩得很好吗?

      本篇我们就来一探究竟。

    一、通用算法的短字符压缩

      开门见山,我们使用一段比较短的文本:Narrator: It is raining today. So, Peppa and George cannot  play outside.Peppa: Daddy, it's stopped raining.

      使用ZSTD与LZ4分别压缩一下上面这段短文本。下面分别是它们的压缩结果。

      ZSTD:

      

       LZ4:

      

      对短文本的压缩,zstd的压缩率很低,lz4压缩后的文本长度尽然超过了原有字符串的长度。这是为什么?说实话在这之前我也没想到。

      引用两位大佬的名言:

      Are you ok?  

      What's your problem?

    二、短字符串压缩

      从上面的结果可以得知,任何压缩算法都有它的使用场景,并不是所有长度的字符串都适合被某种算法压缩。一般原因是通用压缩算法维护了被压缩字符串的,用于字符串还原的相关数据结构,而这些数据结构的长度超过了被压缩短字符串的自身长度。

      那么问题来了,“我真的有压缩短字符串的需求,我想体验压缩的极致感,怎么办?”。

      短字符压缩算法它来了。这里挑选了3种比较优异的短字符压缩算法,分别是smaz,shoco,以及压轴的unisox2。跟前两章一样,还是从压缩率,压缩和解压缩性能的角度,一起看看他们在短字符压缩场景的各自表现吧。

    (1)Smaz

    1、Smaz的压缩和解压缩

    复制代码
     1 #include 
     2 #include <string.h>
     3 #include 
     4 #include "smaz.h"
     5 
     6 using namespace std;
     7 
     8 int main()
     9 {
    10     int buf_len;
    11     int com_size;
    12     int decom_size;
    13     
    14     char com_buf[4096] = {0};
    15     char decom_buf[4096] = {0};
    16 
    17     char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";
    18 
    19     buf_len = strlen(str_buf);
    20     com_size = smaz_compress(str_buf, buf_len, com_buf, 4096);
    21     
    22     cout << "text size:" << buf_len << endl;
    23     cout << "compress text size:" << com_size << endl;
    24     cout << "compress ratio:" << (float)buf_len / (float)com_size << endl << endl;
    25 
    26     decom_size = smaz_decompress(com_buf, com_size, decom_buf, 4096);
    27     cout << "decompress text size:" << decom_size << endl;
    28 
    29     if(strncmp(str_buf, decom_buf, buf_len)) {
    30         cout << "decompress text is not equal to source text" << endl;
    31     }
    32 
    33     return 0;
    34 }
    复制代码

      执行结果如下:

       

       通过smaz压缩后的短字符串长度为77,和源字符串相比,减少了30Byte。

    2、Smaz的压缩和解压缩性能

    复制代码
     1 #include 
     2 #include <string.h>
     3 #include 
     4 #include 
     5 #include "smaz.h"
     6 
     7 using namespace std;
     8 
     9 int main()
    10 {
    11     int cnt = 0;
    12     int buf_len;
    13     int com_size;
    14     int decom_size;
    15 
    16     timeval st, et;
    17     
    18     char *com_ptr = NULL;
    19     char* decom_ptr = NULL;
    20 
    21     char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";
    22 
    23     buf_len = strlen(str_buf);
    24     gettimeofday(&st, NULL);
    25     while(1) {
    26         
    27         com_ptr = (char *)malloc(buf_len);
    28         com_size = smaz_compress(str_buf, buf_len, com_ptr, buf_len);
    29 
    30         free(com_ptr);
    31         cnt++;
    32         
    33         gettimeofday(&et, NULL);
    34         if(et.tv_sec - st.tv_sec >= 10) {
    35             break;
    36         }
    37     }
    38 
    39     cout << endl <<"compress per second:" << cnt/10 << " times" << endl;
    40 
    41     cnt = 0;
    42     com_ptr = (char *)malloc(buf_len);
    43     com_size = smaz_compress(str_buf, buf_len, com_ptr, buf_len);
    44     
    45     gettimeofday(&st, NULL);
    46     while(1) {
    47 
    48         // decompress length not more than origin buf length
    49         decom_ptr = (char *)malloc(buf_len + 1);
    50         decom_size = smaz_decompress(com_ptr, com_size, decom_ptr, buf_len + 1);
    51 
    52         // check decompress length
    53         if(buf_len != decom_size) {
    54             cout << "decom error" << endl;
    55         }
    56         
    57         free(decom_ptr);
    58         cnt++;
    59         
    60         gettimeofday(&et, NULL);
    61         if(et.tv_sec - st.tv_sec >= 10) {
    62             break;
    63         }
    64     }
    65 
    66     cout << "decompress per second:" << cnt/10 << " times" << endl << endl;
    67     
    68     free(com_ptr);
    69     return 0;
    70 }
    复制代码

      结果如何?

      

       压缩性能在40w条/S,解压在百万级,好像还不错哈!

    (2)Shoco

    1、Shoco的压缩和解压缩

    复制代码
     1 #include 
     2 #include <string.h>
     3 #include 
     4 #include "shoco.h"
     5 
     6 using namespace std;
     7 
     8 int main()
     9 {
    10     int buf_len;
    11     int com_size;
    12     int decom_size;
    13     
    14     char com_buf[4096] = {0};
    15     char decom_buf[4096] = {0};
    16 
    17     char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";
    18 
    19     buf_len = strlen(str_buf);
    20     com_size = shoco_compress(str_buf, buf_len, com_buf, 4096);
    21     
    22     cout << "text size:" << buf_len << endl;
    23     cout << "compress text size:" << com_size << endl;
    24     cout << "compress ratio:" << (float)buf_len / (float)com_size << endl << endl;
    25 
    26     decom_size = shoco_decompress(com_buf, com_size, decom_buf, 4096);
    27     cout << "decompress text size:" << decom_size << endl;
    28 
    29     if(strncmp(str_buf, decom_buf, buf_len)) {
    30         cout << "decompress text is not equal to source text" << endl;
    31     }
    32 
    33     return 0;
    34 }
    复制代码

      执行结果如下:

      

       通过shoco压缩后的短字符串长度为86,和源字符串相比,减少了21Byte。压缩率比smaz要低。

     2、Shoco的压缩和解压缩性能

    复制代码
     1 #include 
     2 #include <string.h>
     3 #include 
     4 #include 
     5 #include "shoco.h"
     6 
     7 using namespace std;
     8 
     9 int main()
    10 {
    11     int cnt = 0;
    12     int buf_len;
    13     int com_size;
    14     int decom_size;
    15 
    16     timeval st, et;
    17     
    18     char *com_ptr = NULL;
    19     char* decom_ptr = NULL;
    20 
    21     char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";
    22 
    23     buf_len = strlen(str_buf);
    24     gettimeofday(&st, NULL);
    25     while(1) {
    26         
    27         com_ptr = (char *)malloc(buf_len);
    28         com_size = shoco_compress(str_buf, buf_len, com_ptr, buf_len);
    29 
    30         free(com_ptr);
    31         cnt++;
    32         
    33         gettimeofday(&et, NULL);
    34         if(et.tv_sec - st.tv_sec >= 10) {
    35             break;
    36         }
    37     }
    38 
    39     cout << endl <<"compress per second:" << cnt/10 << " times" << endl;
    40 
    41     cnt = 0;
    42     com_ptr = (char *)malloc(buf_len);
    43     com_size = shoco_compress(str_buf, buf_len, com_ptr, buf_len);
    44     
    45     gettimeofday(&st, NULL);
    46     while(1) {
    47 
    48         // decompress length not more than origin buf length
    49         decom_ptr = (char *)malloc(buf_len + 1);
    50         decom_size = shoco_decompress(com_ptr, com_size, decom_ptr, buf_len + 1);
    51 
    52         // check decompress length
    53         if(buf_len != decom_size) {
    54             cout << "decom error" << endl;
    55         }
    56         
    57         free(decom_ptr);
    58         cnt++;
    59         
    60         gettimeofday(&et, NULL);
    61         if(et.tv_sec - st.tv_sec >= 10) {
    62             break;
    63         }
    64     }
    65 
    66     cout << "decompress per second:" << cnt/10 << " times" << endl << endl;
    67     
    68     free(com_ptr);
    69     return 0;
    70 }
    复制代码

      执行结果如何呢?

      

       holy shit!压缩和解压缩居然都达到了惊人的百万级。就像算法作者们自己说的一样:“在长字符串压缩领域,shoco不想与通用压缩算法竞争,我们的优势是短字符的快速压缩,虽然压缩率很烂!”。这样说,好像也没毛病。

    (3)Unisox2

      我们再来看看unisox2呢。

    1、Unisox2的压缩和解压缩

    复制代码
     1 #include 
     2 #include <string.h>
     3 #include "unishox2.h"
     4 
     5 int main()
     6 {
     7     int buf_len;
     8     int com_size;
     9     int decom_size;
    10     
    11     char com_buf[4096] = {0};
    12     char decom_buf[4096] = {0};
    13     
    14     char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";
    15     
    16     buf_len = strlen(str_buf);
    17     com_size = unishox2_compress_simple(str_buf, buf_len, com_buf);
    18 
    19     printf("text size:%d\n", buf_len);
    20     printf("compress text size:%d\n", com_size);
    21     printf("compress ratio:%f\n\n", (float)buf_len / (float)com_size);
    22     
    23     decom_size = unishox2_decompress_simple(com_buf, com_size, decom_buf);
    24 
    25     printf("decompress text size:%d\n", decom_size);
    26 
    27     if(strncmp(str_buf, decom_buf, buf_len)) {
    28         printf("decompress text is not equal to source text\n");
    29     }
    30 
    31     return 0;
    32 }
    复制代码

      结果如下:

      

       通过Unisox2压缩后的短字符串长度为67,和源字符串相比,减少了40Byte,相当于是打了6折啊!不错不错。

     2、Unisox2的压缩和解压缩性能

      Unisox2的压缩能力目前来看是三者中最好的,如果他的压缩和解压性能也不错的话,那就真的就比较完美了。再一起看看Unisox2的压缩和解压性能吧!

    复制代码
     1 #include 
     2 #include <string.h>
     3 #include <malloc.h>
     4 #include 
     5 #include "unishox2.h"
     6 
     7 int main()
     8 {
     9     int cnt = 0;
    10     int buf_len;
    11     int com_size;
    12     int decom_size;
    13 
    14     struct timeval st, et;
    15     
    16     char *com_ptr = NULL;
    17     char* decom_ptr = NULL;
    18 
    19     char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";
    20 
    21     buf_len = strlen(str_buf);
    22     gettimeofday(&st, NULL);
    23     while(1) {
    24         
    25         com_ptr = (char *)malloc(buf_len);
    26         com_size = unishox2_compress_simple(str_buf, buf_len, com_ptr);
    27 
    28         free(com_ptr);
    29         cnt++;
    30         
    31         gettimeofday(&et, NULL);
    32         if(et.tv_sec - st.tv_sec >= 10) {
    33             break;
    34         }
    35     }
    36 
    37     printf("\ncompress per second:%d times\n", cnt/10);
    38 
    39     cnt = 0;
    40     com_ptr = (char *)malloc(buf_len);
    41     com_size = unishox2_compress_simple(str_buf, buf_len, com_ptr);
    42     
    43     gettimeofday(&st, NULL);
    44     while(1) {
    45 
    46         // decompress length not more than origin buf length
    47         decom_ptr = (char *)malloc(buf_len + 1);
    48         decom_size = unishox2_decompress_simple(com_ptr, com_size, decom_ptr);
    49 
    50         // check decompress length
    51         if(buf_len != decom_size) {
    52             printf("decom error\n");
    53         }
    54         
    55         free(decom_ptr);
    56         cnt++;
    57         
    58         gettimeofday(&et, NULL);
    59         if(et.tv_sec - st.tv_sec >= 10) {
    60             break;
    61         }
    62     }
    63 
    64     printf("decompress per second:%d times\n\n", cnt/10);
    65     
    66     free(com_ptr);
    67     return 0;
    68 }
    复制代码

      执行结果如下:

      

      事与愿违,Unisox2虽然有三个算法中最好的压缩率,可是却也拥有最差的压缩和解压性能。跟前两章分析的不谋而合:有高压缩率,就会损失自身的压缩性能,两者不可兼得。

    三、总结

      本篇分享了smaz,shoco,unisox2三种短字符串压缩算法,分别探索了它们各自的压缩率与压缩和解压缩性能,结果如下表所示。

    表1

      shoco的压缩率最低,但是拥有最高的压缩和解压速率;smaz居中;unisox2拥有最高的压缩率,可是它的压缩和解压性能最低。

      结论与前两章有关长字符串压缩的分析不谋而合:拥有高压缩率,就会损失自身的压缩性能,两者不可兼得。

      实际使用还是看自身需求和环境吧。如果适当压缩就好,那就可以选用shoco,毕竟性能高;想要节约更多的空间,那就选择smaz或者unisox2。

      好了,字符串压缩系列的分享就到此为止了,如果对你有些许帮助,还请各位技术爱好者登录点赞呀,谢谢!

     

      本文来自博客园,作者:T-BARBARIANS,转载请注明原文链接:https://www.cnblogs.com/t-bar/p/16506289.html 谢谢!

  • 相关阅读:
    Sympy-nonlinsolve-只保留实数解问题
    四、Vue
    openGauss学习笔记-125 openGauss 数据库管理-设置账本数据库-校验账本数据一致性
    高通WLAN框架学习(33)-- WLAN/P2P get IOCTLs(iwpriv)命令大全
    关于Linux学习中的诸多问题
    python爬虫开源项目代码基于Python实现的数据分析系统
    限流及及常用解决方案
    剑指 Offer 16. 数值的整数次方
    【云原生生态圈】:服务快速上云--Docker部署SpringBoot案例详解
    Epoch和episodes的区别
  • 原文地址:https://www.cnblogs.com/t-bar/p/16506289.html