• PostgreSQL数据库统计信息——统计信息系统表


    pg_statistic

    系统表pg_ statistic存储关于数据库内容的统计数据。条目由ANALYZE创建,随后由查询计划器使用。请注意,所有统计数据本质上都是近似的,即使假设它是最新的。The catalog pg_statistic stores statistical data about the contents of the database. Entries are created by ANALYZE and subsequently used by the query planner. Note that all the statistical data is inherently approximate, even assuming that it is up-to-date.

    通常,对于已分析的每个表列,都有一个条目,该列的stainherit必须为false。如果表具有继承子表,则还将创建第二个stainherit为true的条目。该行表示继承树上的列的统计信息,比如,使用SELECT column FROM table可以看到的数据统计信息,而stainherit为false行表示仅从表中选择列的结果SELECT column FROM ONLY table。Normally there is one entry, with stainherit = false, for each table column that has been analyzed. If the table has inheritance children, a second entry with stainherit = true is also created. This row represents the column’s statistics over the inheritance tree, i.e., statistics for the data you’d see with SELECT column FROM table, whereas the stainherit = false row represents the results of SELECT column FROM ONLY table. 其实就是如果pg_statistic系统表中列条目的stainherit字段为true,该行代表所有继承子表列的统计信息;如果为false,该行代表starelid指定表staattnum指定列的统计信息。

    pg_statistic还存储关于索引表达式的值的统计数据。这些被描述为好像它们是实际的数据列;特别是starelid引用索引。但是,普通的非表达式索引列没有条目,因为它与基础表列的条目是冗余的。目前,索引表达式的条目始终具有stainherit为false。pg_statistic also stores statistical data about the values of index expressions. These are described as if they were actual data columns; in particular, starelid references the index. No entry is made for an ordinary non-expression index column, however, since it would be redundant with the entry for the underlying table column. Currently, entries for index expressions always have stainherit = false.

    由于不同类型的统计数据可能适用于不同类型的数据,因此pg_statistic设计为不太假设其存储的统计数据类型。在pg_statistic中,只有极一般的统计信息(如空值)被指定为专用列。其他所有内容都存储在“插槽”中,插槽是一组相关列,其内容由插槽列中的一列中的代码编号标识。有关更多信息,请参阅src/include/catalog/pg_statistic.h。Since different kinds of statistics might be appropriate for different kinds of data, pg_statistic is designed not to assume very much about what sort of statistics it stores. Only extremely general statistics (such as nullness) are given dedicated columns in pg_statistic. Everything else is stored in “slots”, which are groups of associated columns whose content is identified by a code number in one of the slot’s columns. For more information see src/include/catalog/pg_statistic.h.

    pg_statistic不应被公众阅读,因为即使是关于表格内容的统计信息也可能被认为是敏感的。(示例:薪资列的最小值和最大值可能非常有趣。)pg_stats是一个关于pg_ statistic的公共可读视图,它仅公开当前用户可读的表的信息。pg_statistic should not be readable by the public, since even statistical information about a table’s contents might be considered sensitive. (Example: minimum and maximum values of a salary column might be quite interesting.) pg_stats is a publicly readable view on pg_statistic that only exposes information about those tables that are readable by the current user.

    pg_stats视图是我们经常用来观察pg_statistics系统表统计信息的工具,如下其提供了很多指标:null_frac表示null空值的比率、avg_width表示以字节为单位的平均宽度、n_distinct大于零就是非重复值的数量小于零则是非重复值的个数除以行数。这些都是比较好理解的指标,从pg_stats sql中也可以看出,其反映的也就是pg_statistics系统表对应的指标,这里不做过多解释。

    postgres=# \d pg_stats
                         View "pg_catalog.pg_stats"
             Column         |   Type   | Collation | Nullable | Default 
    ------------------------+----------+-----------+----------+---------
     schemaname             | name     |           |          |     ---模式名
     tablename              | name     |           |          |     ---表名
     attname                | name     |           |          |     ---列名
     inherited              | boolean  |           |          |     ---是否是继承列
     null_frac              | real     |           |          |     ---null空值的比率
     avg_width              | integer  |           |          |     ---平均宽度,字节
     n_distinct             | real     |           |          |     ---大于零就是非重复值的数量,小于零则是非重复值的个数除以行数
     most_common_vals       | anyarray |           |          |     ---高频值
     most_common_freqs      | real[]   |           |          |     ---高频值的频率
     histogram_bounds       | anyarray |           |          |     ---直方图
     correlation            | real     |           |          |     ---物理顺序和逻辑顺序的关联性
     most_common_elems      | anyarray |           |          |     ---高频元素,比如数组
     most_common_elem_freqs | real[]   |           |          |     ---高频元素的频率
     elem_count_histogram   | real[]   |           |          |     ---直方图(元素)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18

    剩下的指标如:correlation表示元组的物理顺序和逻辑顺序的关联度(-1到1之间,1表示逻辑顺序与存储的物理顺序相同,-1表示逻辑顺序与存储的物理顺序相反)、most_common_vals(MCV,高频值列表)、most_common_freqs(MCF,高频值的频率)、most_common_elems(高频元素,比如数组)、most_common_elem_freqs(高频元素的频率)、elem_count_histogram (直方图)、histogram_bounds(直方图,表示该字段除了高频值以外值的的柱状图信息,直方图中的数据不包含MCV/MCF的部分,两者的值是补充关系而且不会重合,但不一定互补(两种加起来未必是全部数据),用于将对应列的值划分为多个分组)。那么pg_stats是如何从pg_statistic中提取这些信息的呢。从下面的sql中我们知道pg_statistic stakindN可以提供五个槽位,这里的N对应着真正的数据槽N(stavaluesN或stanumbersN)。

    • 如果向stakindN设置为1,则说明有most_common_vals(MCV,高频值列表)和most_common_freqs(MCF,高频值的频率)指标,且存储在stavaluesN和stanumbersN中
    • 如果向stakindN设置为2,则说明有histogram_bounds(直方图)指标,且存储在stavaluesN中
    • 如果向stakindN设置为3,则说明有correlation(关联度),且存储在stanumbersN[1]中
    • 如果向stakindN设置为4,则说明有most_common_elems(高频元素)和most_common_elem_freqs(高频元素的频率),且存储在stavaluesN和stanumbersN中
    • 如果向stakindN设置为5,则说明有elem_count_histogram(直方图),且存储在stanumbersN中
    CREATE VIEW pg_stats WITH (security_barrier) AS
        SELECT nspname AS schemaname, relname AS tablename, attname AS attname, stainherit AS inherited, stanullfrac AS null_frac,
            stawidth AS avg_width, stadistinct AS n_distinct,
            CASE
                WHEN stakind1 = 1 THEN stavalues1
                WHEN stakind2 = 1 THEN stavalues2
                WHEN stakind3 = 1 THEN stavalues3
                WHEN stakind4 = 1 THEN stavalues4
                WHEN stakind5 = 1 THEN stavalues5
            END AS most_common_vals,
            CASE
                WHEN stakind1 = 1 THEN stanumbers1
                WHEN stakind2 = 1 THEN stanumbers2
                WHEN stakind3 = 1 THEN stanumbers3
                WHEN stakind4 = 1 THEN stanumbers4
                WHEN stakind5 = 1 THEN stanumbers5
            END AS most_common_freqs,
            CASE
                WHEN stakind1 = 2 THEN stavalues1
                WHEN stakind2 = 2 THEN stavalues2
                WHEN stakind3 = 2 THEN stavalues3
                WHEN stakind4 = 2 THEN stavalues4
                WHEN stakind5 = 2 THEN stavalues5
            END AS histogram_bounds,
            CASE
                WHEN stakind1 = 3 THEN stanumbers1[1]
                WHEN stakind2 = 3 THEN stanumbers2[1]
                WHEN stakind3 = 3 THEN stanumbers3[1]
                WHEN stakind4 = 3 THEN stanumbers4[1]
                WHEN stakind5 = 3 THEN stanumbers5[1]
            END AS correlation,
            CASE
                WHEN stakind1 = 4 THEN stavalues1
                WHEN stakind2 = 4 THEN stavalues2
                WHEN stakind3 = 4 THEN stavalues3
                WHEN stakind4 = 4 THEN stavalues4
                WHEN stakind5 = 4 THEN stavalues5
            END AS most_common_elems,
            CASE
                WHEN stakind1 = 4 THEN stanumbers1
                WHEN stakind2 = 4 THEN stanumbers2
                WHEN stakind3 = 4 THEN stanumbers3
                WHEN stakind4 = 4 THEN stanumbers4
                WHEN stakind5 = 4 THEN stanumbers5
            END AS most_common_elem_freqs,
            CASE
                WHEN stakind1 = 5 THEN stanumbers1
                WHEN stakind2 = 5 THEN stanumbers2
                WHEN stakind3 = 5 THEN stanumbers3
                WHEN stakind4 = 5 THEN stanumbers4
                WHEN stakind5 = 5 THEN stanumbers5
            END AS elem_count_histogram
        FROM pg_statistic s JOIN pg_class c ON (c.oid = s.starelid)
             JOIN pg_attribute a ON (c.oid = attrelid AND attnum = s.staattnum)
             LEFT JOIN pg_namespace n ON (n.oid = c.relnamespace)
        WHERE NOT attisdropped
        AND has_column_privilege(c.oid, a.attnum, 'select')
        AND (c.relrowsecurity = false OR NOT row_security_active(c.oid));
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    • 55
    • 56
    • 57
    • 58

    pg_statistic系统表的定义如下:
    在这里插入图片描述

    construct_array

    pg_statistic_ext_data

    系统表pg_statistic_ext_data保存pg_ statistics_ext中定义的扩展计划器统计数据的数据。该目录中的每一行对应于使用CREATE statistics创建的统计对象。与pg_statistic一样,pg_ statistic_ext_data不应被公众读取,因为内容可能被认为是敏感的。(示例:列中最常见的值组合可能非常有趣。)pg_stats_ext是pg_ statistic_exts_data上的公共可读视图(在与pgUstatic_expt连接后),它仅公开当前用户可读的表和列的信息。The catalog pg_statistic_ext_data holds data for extended planner statistics defined in pg_statistic_ext. Each row in this catalog corresponds to a statistics object created with CREATE STATISTICS. Like pg_statistic, pg_statistic_ext_data should not be readable by the public, since the contents might be considered sensitive. (Example: most common combinations of values in columns might be quite interesting.) pg_stats_ext is a publicly readable view on pg_statistic_ext_data (after joining with pg_statistic_ext) that only exposes information about those tables and columns that are readable by the current user.
    在这里插入图片描述

    系统表pg_statistic_ext包含扩展计划器统计信息的定义。该系统表中的每一行对应于使用CREATE statistics创建的统计对象。在创建统计数据期间,pg_statistic_ext条目被完全填充,但随后不会计算实际统计值。随后的分析ANALYZE命令计算期望值并填充pg_statistic_ext_data目录中的条目。The catalog pg_statistic_ext holds definitions of extended planner statistics. Each row in this catalog corresponds to a statistics object created with CREATE STATISTICS. The pg_statistic_ext entry is filled in completely during CREATE STATISTICS, but the actual statistical values are not computed then. Subsequent ANALYZE commands compute the desired values and populate an entry in the pg_statistic_ext_data catalog.
    在这里插入图片描述

    针对pg_statistic系统表的更新操作

    针对pg_statistic系统表的更新操作主要集中于update_attstats和RemoveStatistics函数,主要调用堆栈如下所示:

    • do_analyze_rel->update_attstats(src/backend/commands/analyze.c) 这个流程已经分析过
    • ATExecAlterColumnType(src/backend/commands/tablecmds.c)/index_drop(src/backend/catalog/index.c)/heap_drop_with_catalog(src/backend/catalog/heap.c)/RemoveAttributeById(src/backend/catalog/heap.c)–>RemoveStatistics(src/backend/catalog/heap.c) 为单表或单列删除pg_statistic系统表记录

    RemoveStatistics函数用于为单表或单列删除pg_statistic系统表中对应的记录。如果attnum为零,删除该表的所有记录;如果指定attnum列,则删除该列的记录。

    /* RemoveStatistics --- remove entries in pg_statistic for a rel or column
     * If attnum is zero, remove all entries for rel; else remove only the one(s) for that column.
     */
    void RemoveStatistics(Oid relid, AttrNumber attnum) {
    	SysScanDesc scan; ScanKeyData key[2];
    	int			nkeys; HeapTuple	tuple;
    	Relation pgstatistic = table_open(StatisticRelationId, RowExclusiveLock);
    	ScanKeyInit(&key[0],Anum_pg_statistic_starelid,BTEqualStrategyNumber, F_OIDEQ,ObjectIdGetDatum(relid));
    	if (attnum == 0) nkeys = 1; // 删除该表的所有记录
    	else{
    		ScanKeyInit(&key[1],Anum_pg_statistic_staattnum,BTEqualStrategyNumber, F_INT2EQ,Int16GetDatum(attnum));
    		nkeys = 2; // 删除该列的记录
    	}
    
    	scan = systable_beginscan(pgstatistic, StatisticRelidAttnumInhIndexId, true, NULL, nkeys, key);	
    	while (HeapTupleIsValid(tuple = systable_getnext(scan))) /* we must loop even when attnum != 0, in case of inherited stats */
    		CatalogTupleDelete(pgstatistic, &tuple->t_self); // 删除对应行记录
    
    	systable_endscan(scan);
    	table_close(pgstatistic, RowExclusiveLock);
    }
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
  • 相关阅读:
    《红蓝攻防对抗实战》三.内网探测协议出网之HTTP/HTTPS协议探测出网
    day23-数据接口和selenium基础
    OkHttp原理解析(二)
    华为机试 - 全排列
    外包干了3个月,技术倒退1年。。。
    知识增强语言模型提示 零样本知识图谱问答10.8+10.11
    spark底层为什么选择使用scala语言开发
    FFmpeg源代码简单分析-其他-libavdevice的avdevice_register_all()
    某省医保局:强化医保信息化高质量建设,提升数字医疗保障服务能力
    springboot项目打包优化,将所有第三方包单独打包至lib目录
  • 原文地址:https://blog.csdn.net/asmartkiller/article/details/126684183