• Spark DataFrame join后移除重复的列


    Spark,两个DataFrame做join操作后,会出现重复的列。例如:

     Dataset<Row> moviesWithRating = moviesDF
                    .join(averageRatingMoviesDF,
                            moviesDF.col("movieId").equalTo(averageRatingMoviesDF.col("movieId")));
    
    • 1
    • 2
    • 3

    其schema如下:

    //moviesWithRating.printSchema();
            /**
             * root
             *  |-- _id: struct (nullable = true)
             *  |    |-- oid: string (nullable = true)
             *  |-- actors: string (nullable = true)
             *  |-- description: string (nullable = true)
             *  |-- directors: string (nullable = true)
             *  |-- genres: string (nullable = true)
             *  |-- issue: string (nullable = true)
             *  |-- language: string (nullable = true)
             *  |-- movieId: integer (nullable = true)
             *  |-- shoot: string (nullable = true)
             *  |-- timeLong: string (nullable = true)
             *  |-- title: string (nullable = true)
             *  |-- movieId: integer (nullable = true)
             *  |-- avgRating: double (nullable = true)
             */
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18

    我们在继续操作这个DataFrame时,可能就会报错,如下:org.apache.spark.sql.AnalysisException: Reference ‘movieId’ is ambiguous

    解决方案有两种方法可以用来移除重复的

    • 方法一:join表达式使用字符串数组(用于join的列)
    Seq<String> joinColumns = JavaConversions.asScalaBuffer(Arrays.asList("movieId")).toList();
    Dataset<Row> moviesWithRating = moviesDF.join(
                            averageRatingMoviesDF,
                            joinColumns,
                            "inner");
    
    • 1
    • 2
    • 3
    • 4
    • 5

    这里DataFrame moviesDF和averageRatingMoviesDF使用了movieId和movieId两列来做join,返回的结果会对这两列去重
    如果是scala,解决方案如下:

     val moviesWithRating = moviesDf.join(averageRatingMoviesDF, Seq("movieId")) 
    
    • 1
    • 方法二:使用select返回指定的列
    Dataset<Row> moviesWithRating = moviesDF
                    .join(averageRatingMoviesDF,
                            moviesDF.col("movieId").equalTo(averageRatingMoviesDF.col("movieId")))
                    .select(
                            moviesDF.col("movieId"),
    
                            col("actors"),
                            col("description"),
                            col("directors"),
                            col("genres"),
                            col("issue"),
                            col("language"),
                            col("shoot"),
                            col("timeLong"),
                            col("title"),
                            col("avgRating")
                    );
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17

    说明:
    如果列较少, 推荐使用第二种.
    如果列较多, 推荐使用第一种.

  • 相关阅读:
    【Linux 之二】Ubuntu下开发环境的搭建(NFS \ SSH \ FTP \ Smba \ ...)
    【Leetcode】1444. Number of Ways of Cutting a Pizza
    C++是如何工作的
    SpringMVC的数据绑定
    Git简单使用
    一文梳理2048小游戏从开发到上云全流程
    Android - 无序广播动态注册广播
    Makefile基础
    Gartner发布当前至2024年的五大隐私趋势
    力扣刷题(122. 买卖股票的最佳时机 II)动规、贪心
  • 原文地址:https://blog.csdn.net/weixin_56058578/article/details/134399798