• Presto: 是谁偷走了我的一天


    Presto (including PrestoDB and PrestoSQL later re-branded to Trino) is a distributed query engine for big data using the SQL query language. Its architecture allows users to query data sources such as Hadoop, Cassandra, Kafka, AWS S3, Alluxio, MySQL, MongoDB and Teradata. One can even query data from multiple data sources within a single query. Presto is community driven open-source software released under the Apache License.

    起因

    最近在数据迁移的过程中遇到一个好玩的bug, 当我们要计算两个日期的间隔的是时候需要调用一些数据分析组件内置的函数, 如下所示: 代码表示的含义很简单 就是计算8月9日和8月8日之间相隔天数, 但是这样的计算的结果竟然不相同: 竟然出现同样的时间段里hive presto多一天这种情况. 即在hive分析中得到是相隔一天, 但是在presto分析中得到数据只有0天 .

    (组件版本号: Hive:2.3.7 Presto:0.240.1)

    ---hive: 1
    select datediff(
        '2022-08-09',
        '2022-08-08 08:00:01.0' 
       
      ) as b
    ---presto: 0
    select
      date_diff(
        'day',
       cast('2022-08-08 08:00:01.0' as timestamp),
       cast('2022-08-09'  as timestamp)
      ) as b
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14

    探究

    遇事不決 量子力学; 问题是小问题, 但是假如运用到生活的例子是你发工资的时候计算的时长, 这不就少计算一天. 为了打工人, 一定要找到是谁偷走了presto的一天.
    首先, 看了找了一下https://github.com/apache/hivedatediff()

    /**
     * UDFDateDiff.
     *
     * Calculate the difference in the number of days. The time part of the string
     * will be ignored. If dateString1 is earlier than dateString2, then the
     * result can be negative.
     *
     */
    @Description(name = "datediff",
        value = "_FUNC_(date1, date2) - Returns the number of days between date1 and date2",
        extended = "date1 and date2 are strings in the format "
            + "'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time parts are ignored."
            + "If date1 is earlier than date2, the result is negative.\n"
            + "Example:\n "
            + "  > SELECT _FUNC_('2009-07-30', '2009-07-31') FROM src LIMIT 1;\n"
            + "  1")
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17

    官方文档说的很清楚, yyyy-MM-dd HH:mm:ss or yyyy-MM-dd. The time parts are ignored. 在计算的过程中, 会自动截取时间, 只用日期进行计算. 那么以此类推, 是不是 presto 没有进行截取呢? 可能吗

    select cast('2022-08-09'  as timestamp) as b
    
    -- 2022-08-09 00:00:00.000
    -- 
    
    • 1
    • 2
    • 3
    • 4

    果然, 2022-08-09 转成timestamp日期格式就是 2022-08-09 00:00:00.000 时间直接用午夜零点补齐了. 如果不转timestamp 直接用日期可以吗?

    
    
    select
      date_diff(
        'day',
       cast('2022-08-08 08:00:01.0' as timestamp),
       '2022-08-09'
      ) as b
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9

    得报错了,貌似是格式不对,看来又得去官方源码找了, 从报错信息来看两者的日期类型需要对齐保持一样.

    Unexpected parameters (varchar(3), timestamp, varchar(10)) for function date_diff. Expected:
    date_diff(varchar(x), timestamp, timestamp) ,
    date_diff(varchar(x), date, date) ,
    date_diff(varchar(x), time, time) ,
    date_diff(varchar(x), time with time zone, time with time zone) ,
    date_diff(varchar(x), timestamp with time zone, timestamp with time zone)

    https://github.com/prestodb/presto

    
       ......
    
        @Description("difference of the given times in the given unit")
        @ScalarFunction("date_diff")
        @LiteralParameters("x")
        @SqlType(StandardTypes.BIGINT)
        public static long diffTimestamp(
                SqlFunctionProperties properties,
                @SqlType("varchar(x)") Slice unit,
                @SqlType(StandardTypes.TIMESTAMP) long timestamp1,
                @SqlType(StandardTypes.TIMESTAMP) long timestamp2)
        {
            if (properties.isLegacyTimestamp()) {
                return getTimestampField(getChronology(properties.getTimeZoneKey()), unit).getDifferenceAsLong(timestamp2, timestamp1);
            }
    
            return getTimestampField(UTC_CHRONOLOGY, unit).getDifferenceAsLong(timestamp2, timestamp1);
        }
    .....
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21

    从中定位到了date_diff 函数, 这个其中当是timestamp 的时候, 会调用这个 getDifferenceAsLong

    
    
        /**
         * Computes the difference between two instants, as measured in the units
         * of this field. Any fractional units are dropped from the result. Calling
         * getDifference reverses the effect of calling add. In the following code:
         *
         * 
         * long instant = ...
         * long v = ...
         * long age = getDifferenceAsLong(add(instant, v), instant);
         * 
    * * The value 'age' is the same as the value 'v'. * * @param minuendInstant the milliseconds from 1970-01-01T00:00:00Z to * subtract from * @param subtrahendInstant the milliseconds from 1970-01-01T00:00:00Z to * subtract off the minuend * @return the difference in the units of this field */
    public abstract long getDifferenceAsLong(long minuendInstant, long subtrahendInstant);
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23

    以上, 可以得出, 当不同date 类型 转成 timestamp的时候 时间部分直接用 00:00:00.000 替代, 然后调用的getDifferenceAsLong 是通过和 1970-01-01T00:00:00Z 进行比较 然后返回数值, 当其数值不大于24小时的部门, 自然不会计算当做一天. 所以当我们在计算时间间隔的时候 , 有时候精确度不需要那么苛刻的时候 , 建议 date_diff(varchar(x), date, date) 参数进行计算.

    由于时间仓促、水平有限,其中有许多不足之处在所难免,敬请各位大佬批评指正,给予多多留言指导。

    [ 1 ] https://github.com/prestodb/presto
    [ 2 ] https://github.com/apache/hive
    [ 3 ] https://prestodb.io/

  • 相关阅读:
    MySQL -- DQL
    python作业
    【PyTorch笔记】60分钟入门PyTorch——训练一个图片分类器
    《人生七年》纪录片-个体心理学中的自卑与超越角度解读
    C语言——流程控制
    电商前台项目(一):项目前的初始化及搭建
    SQL 查询的执行顺序
    自从拥有数位板,妈妈再也不用担心我学不好ZBrush了,ZBrush零基础新手必看菜单入门讲解
    软工作业2:个人项目
    “好物”推荐+Xshell连接实例+使用Conda创建独立的Python环境
  • 原文地址:https://blog.csdn.net/jankin6/article/details/126193242