比较图片相似度算法介绍与应用（Java版）

引言

在前面两篇博客中，介绍了两种文本内容相似度比较算法，SimHash和MinHash，通过技术验证的结果来看，符合项目产品方案规划需求，接下来将把这两种算法应用于不同的场景。

而通常，我们的数据中不仅仅只有文本，也会存在图片。此前，存在图片的数据系统不会做任何处理，都只能直接交由人工处理。这样一来，工作量显然很庞大。所以，这次也调研了图片对比的几种算法，从结果来说，还是能够满足实际使用场景的。

下面介绍的三种算法本质思想都是一样的，即先计算图片的Hash，再通过海明距离表示差异，海明距离越小，则说明图片越相似。

均值Hash算法

算法过程与核心代码

缩小尺寸为8*8

public static BufferedImage thumb(BufferedImage source, int width,
                                 int height, boolean b) {
   // targetW，targetH分别表示目标长和宽
   int type = source.getType();
   BufferedImage target = null;
   double sx = (double) width / source.getWidth();
   double sy = (double) height / source.getHeight();

   if (b) {
       if (sx > sy) {
           sx = sy;
           width = (int) (sx * source.getWidth());
       } else {
           sy = sx;
           height = (int) (sy * source.getHeight());
       }
   }

   if (type == BufferedImage.TYPE_CUSTOM) { // handmade
       ColorModel cm = source.getColorModel();
       WritableRaster raster = cm.createCompatibleWritableRaster(width,
               height);
       boolean alphaPremultiplied = cm.isAlphaPremultiplied();
       target = new BufferedImage(cm, raster, alphaPremultiplied, null);
   } else
       target = new BufferedImage(width, height, type);
   Graphics2D g = target.createGraphics();
   // smoother than exlax:
   g.setRenderingHint(RenderingHints.KEY_RENDERING,
           RenderingHints.VALUE_RENDER_QUALITY);
   g.drawRenderedImage(source, AffineTransform.getScaleInstance(sx, sy));
   g.dispose();
   return target;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34

简化色彩，转变为灰度图

public static int rgbToGray(int pixels) {
    // int _alpha = (pixels >> 24) & 0xFF;
    int _red = (pixels >> 16) & 0xFF;
    int _green = (pixels >> 8) & 0xFF;
    int _blue = (pixels) & 0xFF;
    return (int) (0.3 * _red + 0.59 * _green + 0.11 * _blue);
}
1
2
3
4
5
6
7

计算64个像素的灰度平均值

public static int average(int[] pixels) {
    float m = 0;
    for (int i = 0; i < pixels.length; ++i) {
        m += pixels[i];
    }
    m = m / pixels.length;
    return (int) m;
}
1
2
3
4
5
6
7
8

比较每个像素的灰度，与平均值进行比较。大于或等于平均值记为1；小于平均值记为0。
将上一步的结果组合在一起，就构成了一个64位的整数，即为图片的指纹。
比较hash值，计算海明距离。

/**
 * 计算"海明距离"（Hamming distance）。
 * 如果不相同的数据位不超过5，就说明两张图片很相似；如果大于10，就说明这是两张不同的图片。
 *
 * @param sourceHashCode 源hashCode
 * @param hashCode       与之比较的hashCode
 */
public static int hammingDistance(String sourceHashCode, String hashCode) {
    int difference = 0;
    int len = sourceHashCode.length();

    for (int i = 0; i < len; i++) {
        if (sourceHashCode.charAt(i) != hashCode.charAt(i)) {
            difference++;
        }
    }

    return difference;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

感知Hash算法

算法过程与核心代码

缩小尺寸为8*8

private static BufferedImage resize(BufferedImage image, int width, int height) {
    BufferedImage resizedImage = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
    Graphics2D g = resizedImage.createGraphics();
    g.drawImage(image, 0, 0, width, height, null);
    g.dispose();
    return resizedImage;
}
1
2
3
4
5
6
7

简化色彩，转变为灰度图

private static int gray(int rgb) {
    //将最高位（24-31）的信息（alpha通道）存储到a变量
    int a = rgb & 0xff000000;
    //取出次高位（16-23）红色分量的信息
    int r = (rgb >> 16) & 0xff;
    //取出中位（8-15）绿色分量的信息
    int g = (rgb >> 8) & 0xff;
    //取出低位（0-7）蓝色分量的信息
    int b = rgb & 0xff;
    // NTSC luma，算出灰度值
    rgb = (r * 77 + g * 151 + b * 28) >> 8;
    //(int)(r * 0.3 + g * 0.59 + b * 0.11)
    //将灰度值送入各个颜色分量
    return a | (rgb << 16) | (rgb << 8) | rgb;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

计算DCT（离散余弦变换），得到32*32的系数矩阵。缩小DCT，只保留左上角的8乘8的矩阵

public static int[] DCT(int[] pix, int n) {
    double[][] iMatrix = new double[n][n];
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            iMatrix[i][j] = (double) (pix[i * n + j]);
        }
    }
    double[][] quotient = coefficient(n);    //求系数矩阵
    double[][] quotientT = transposingMatrix(quotient, n);    //转置系数矩阵

    double[][] temp = matrixMultiply(quotient, iMatrix, n);
    iMatrix = matrixMultiply(temp, quotientT, n);

    int newpix[] = new int[n * n];
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            newpix[i * n + j] = (int) iMatrix[i][j];
        }
    }
    return newpix;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

计算DCT的平均值

public static int averageGray(int[] pix, int w, int h) {
    int sum = 0;
    for (int i = 0; i < h; i++) {
        for (int j = 0; j < w; j++) {
            sum = sum + pix[i * w + j];
        }
    }
    return sum / (w * h);
}
1
2
3
4
5
6
7
8
9

计算哈希值
比较hash值，计算海明距离

差异值Hash算法

算法过程与核心代码

图片缩放为9*8大小
将图片灰度化
差异值计算（每行相邻像素的差值，这样会生成8*8的差值，前一个像素大于后一个像素则为1，否则为0）
生成哈希值
比较hash值，计算海明距离

public class ImageDHash {
    /**
     * 计算dHash方法
     *
     * @param file 文件
     * @return hash
     */
    public static String getDHash(File file) {
        //读取文件
        BufferedImage srcImage;
        try {
            srcImage = ImageIO.read(file);
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }

        //文件转成9*8像素，为算法比较通用的长宽
        BufferedImage buffImg = new BufferedImage(9, 8, BufferedImage.TYPE_INT_RGB);
        buffImg.getGraphics().drawImage(srcImage.getScaledInstance(9, 8, Image.SCALE_SMOOTH), 0, 0, null);

        int width = buffImg.getWidth();
        int height = buffImg.getHeight();
        int[][] grayPix = new int[width][height];
        StringBuffer figure = new StringBuffer();

        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                //图片灰度化
                int rgb = buffImg.getRGB(x, y);
                int r = rgb >> 16 & 0xff;
                int g = rgb >> 8 & 0xff;
                int b = rgb & 0xff;
                int gray = (r * 30 + g * 59 + b * 11) / 100;
                grayPix[x][y] = gray;

                //开始计算dHash 总共有9*8像素 每行相对有8个差异值 总共有 8*8=64 个
                if (x != 0) {
                    long bit = grayPix[x - 1][y] > grayPix[x][y] ? 1 : 0;
                    figure.append(bit);
                }
            }
        }

        return figure.toString();
    }

    /**
     * 计算海明距离
     * 
     * 原本用于编码的检错和纠错的一个算法
     * 现在拿来计算相似度，如果差异值小于一定阈值则相似，一般经验值小于5为同一张图片
     *
     * @param str1
     * @param str2
     * @return 距离
     */
    private static long getHammingDistance(String str1, String str2) {
        int distance;
        if (str1 == null || str2 == null || str1.length() != str2.length()) {
            distance = -1;
        } else {
            distance = 0;
            for (int i = 0; i < str1.length(); i++) {
                if (str1.charAt(i) != str2.charAt(i)) {
                    distance++;
                }
            }
        }
        return distance;
    }
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72

总结

相比感知Hash算法，差异值Hash算法的速度要快的多，相比平均值Hash算法，差异值Hash算法在效率几乎相同的情况下的效果要更好，它是基于渐变实现的。所以在项目中选择了最后一种差异值Hash算法，结果也是与预期一致。

相关阅读:
【Rust日报】2022-07-15 Rust 宏定义，拯救运行时错误
 招投标系统软件源码,招投标全流程在线化管理
 沙盘游戏培训感悟
 杭州ALIENWARE外星人电脑（大悦城旗舰店），玩起来就是不一样
 flex布局回顾
 MySQL常见面试题（二）
AWS SAA-C03 #36
Java进阶之路——从初级程序员到架构师，从小工到专家
 温岭商务局局长施珍娟莅临卧兔总部调研考察
 【计算机毕业设计】基于SpringBoot+Vue贵州旅游系统的设计与实现
原文地址：https://blog.csdn.net/u013034223/article/details/127866745