【spark】Use limit on DataFrames with caution
Official documentation: limit is generally used together with order by, which ensures that the result is deterministic.
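To see why the ordering matters, here is a small pure-Python sketch (no Spark required). It models partitions as plain lists; the function names limit_rows and ordered_limit_rows are made up for this illustration and are not Spark APIs. The assumption is simply that a limit without an ordering returns whichever rows arrive first.

```python
from typing import List, Sequence


def limit_rows(partitions: Sequence[Sequence[int]], n: int) -> List[int]:
    """Take the first n rows in whatever order the partitions happen to arrive."""
    flat = [row for part in partitions for row in part]
    return flat[:n]


def ordered_limit_rows(partitions: Sequence[Sequence[int]], n: int) -> List[int]:
    """Sort all rows first, then take the first n (the order by + limit pattern)."""
    flat = sorted(row for part in partitions for row in part)
    return flat[:n]


# The same six rows, but the partitions arrive in a different order in each run.
run_a = [[5, 1], [4, 2], [3, 0]]
run_b = [[3, 0], [5, 1], [4, 2]]

print(limit_rows(run_a, 3))          # [5, 1, 4]
print(limit_rows(run_b, 3))          # [3, 0, 5]
print(ordered_limit_rows(run_a, 3))  # [0, 1, 2]
print(ordered_limit_rows(run_b, 3))  # [0, 1, 2]
```

In this toy model the plain limit returns different rows for the two runs, while the ordered version returns the same rows both times. In Spark the ordered result is only fully deterministic if the sort key has no ties.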

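limit runs in two steps:
1. LocalLimit, which is applied inside each partition
2. GlobalLimit, which triggers a shuffle that gathers the rows into a single partition
When the requested n is large, the second step is relatively time-consuming, because each partition can send up to n rows through the Exchange to a single task.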
```
== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand (5)
+- * GlobalLimit (4)
   +- Exchange (3)
      +- * LocalLimit (2)
         +- Scan csv  (1)
```
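The two steps in the plan above can be sketched with a toy model in plain Python. This is only an approximation of the plan shape (LocalLimit, Exchange, GlobalLimit), not Spark's actual implementation; local_limit and global_limit are hypothetical helper names, and partitions are modeled as lists.

```python
from typing import List, Sequence, Tuple


def local_limit(partition: Sequence[int], n: int) -> List[int]:
    """Step 1: each partition keeps at most n of its own rows."""
    return list(partition[:n])


def global_limit(partitions: Sequence[Sequence[int]], n: int) -> Tuple[List[int], int]:
    """Step 2: move the surviving rows into one partition, then keep n of them.

    Returns (result_rows, rows_moved), where rows_moved counts the rows that
    had to travel to the single partition in this model.
    """
    survivors = [local_limit(part, n) for part in partitions]
    single_partition = [row for part in survivors for row in part]
    return single_partition[:n], len(single_partition)


# 4 partitions with 100 rows each; ask for 50 rows.
partitions = [list(range(i * 100, (i + 1) * 100)) for i in range(4)]
rows, moved = global_limit(partitions, 50)
print(len(rows), moved)  # 50 200
```

In this model the number of rows moved grows with both n and the number of partitions (up to n rows per partition), which is why a large n makes the second step expensive.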
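If the order in which rows are sampled does not matter, tablesample can be used instead. As the plan below shows, it needs neither an Exchange nor a GlobalLimit.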
```
== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand (3)
+- * Sample (2)
   +- Scan csv  (1)
```
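On the DataFrame side, the counterpart of tablesample is DataFrame.sample. Below is a minimal sketch, assuming you already know (or can estimate) the total row count; sample_fraction, approx_take and total_rows are hypothetical names introduced for this example. Note that sample returns approximately, not exactly, the requested share of rows.

```python
def sample_fraction(n_wanted: int, total_rows: int, oversample: float = 1.0) -> float:
    """Fraction to pass to sample() so that roughly n_wanted rows come back."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    # Clamp to [0, 1]; sampling without replacement cannot exceed the full table.
    return min(1.0, max(0.0, oversample * n_wanted / total_rows))


def approx_take(df, n_wanted: int, total_rows: int, seed: int = 42):
    """Return roughly n_wanted rows using sample() instead of limit().

    df is expected to be a pyspark.sql.DataFrame (assumption: only its
    sample() method is used here).
    """
    fraction = sample_fraction(n_wanted, total_rows)
    return df.sample(withReplacement=False, fraction=fraction, seed=seed)
```

With a real DataFrame df of about one million rows, approx_take(df, 10_000, 1_000_000) would request a 1% sample. If an exact row count is required, this approach is not a drop-in replacement for limit.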
References

Official documentation
Stop using the LIMIT clause wrong with Spark
DataFrame orderBy followed by limit in Spark
Original article: https://blog.csdn.net/Code_LT/article/details/132627659