Reading data from a ClickHouse database
```scala
import scala.collection.mutable.ArrayBuffer
import java.util.Properties
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession

def getCKJdbcProperties(
    batchSize: String = "100000",
    socketTimeout: String = "300000",
    numPartitions: String = "50",
    rewriteBatchedStatements: String = "true"): Properties = {
  val properties = new Properties
  properties.put("driver", "ru.yandex.clickhouse.ClickHouseDriver")
  properties.put("user", "default")
  properties.put("password", "<database password>")
  properties.put("batchsize", batchSize)
  properties.put("socket_timeout", socketTimeout)
  properties.put("numPartitions", numPartitions)
  properties.put("rewriteBatchedStatements", rewriteBatchedStatements)
  properties
}

// Read data from the ClickHouse database
val today = "2023-06-05"
val ckProperties = getCKJdbcProperties()
val ckUrl = "jdbc:clickhouse://233.233.233.233:8123/ss"
val ckTable = "ss.test"
val ckDF = spark.read.jdbc(ckUrl, ckTable, ckProperties)
```
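The read above assumes the spark-shell, where a SparkSession named `spark` is already in scope. In a standalone application you would build the session yourself; a minimal sketch (the app name and the `local[*]` master are placeholders for local testing, not part of the original setup):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: build the SparkSession a standalone app needs.
// "local[*]" runs Spark in-process; on a cluster the master is
// normally supplied via spark-submit rather than hard-coded.
val spark = SparkSession.builder()
  .appName("clickhouse-reader")
  .master("local[*]")
  .getOrCreate()

// With the session in scope, the read is the same as in the shell:
// val ckDF = spark.read.jdbc(ckUrl, ckTable, ckProperties)
```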
**show**

Displays the data, similar in effect to `select * from test`:

- `ckDF.show`: shows the first 20 rows by default
- `ckDF.show(3)`: shows the specified number of rows
- `ckDF.show(false)`: controls whether long values are truncated in the output
- `ckDF.show(3, 0)`: number of rows plus the truncate length (0 disables truncation)
**collect**

- `ckDF.collect`: fetches all of the data in `ckDF` back to the driver and returns it as an `Array` of rows
- `ckDF.collectAsList`: same as `collect`, except the result is returned as a `List`
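A small contrast between the two, using an in-memory DataFrame as a stand-in for `ckDF` (the local SparkSession and the sample rows are illustrative only):

```scala
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder()
  .appName("collect-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Three rows standing in for the ClickHouse table.
val df = Seq("10.0.0.1", "10.0.0.2", "10.0.0.3").toDF("ip_src")

val asArray: Array[Row] = df.collect()                // Scala Array[Row]
val asList: java.util.List[Row] = df.collectAsList()  // java.util.List[Row]

// Both pull every row into the driver, so reserve them for small results.
```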
**ckDF.describe**

Returns summary statistics for the specified columns:
```
scala> ckDF.describe("ip_src").show(3)
+-------+------+
|summary|ip_src|
+-------+------+
|  count|855035|
|   mean|  null|
| stddev|  null|
+-------+------+
only showing top 3 rows
```
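`mean` and `stddev` come back null above because `ip_src` is a string column; on numeric columns `describe` fills them in. A quick local sketch, with in-memory data standing in for `ckDF`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("describe-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3).toDF("n")

// describe returns count, mean, stddev, min and max,
// all rendered as strings alongside a "summary" column.
val stats = df.describe("n")
stats.show()
```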
**first, head, take, takeAsList**

Fetch a limited number of rows:

- `first`: returns the first row
- `head`: returns the first row; `head(n: Int)` returns the first n rows
- `take(n: Int)`: returns the first n rows
- `takeAsList(n: Int)`: returns the first n rows as a `List`

These methods return the data as a `Row` (one row) or an `Array[Row]` (several rows). `first` and `head` are equivalent. `take` and `takeAsList` bring the fetched rows back to the driver, so mind the data volume when calling them to avoid an `OutOfMemoryError` on the driver.
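The four methods side by side on a small in-memory DataFrame (the local session and sample data are illustrative). When you only need a sample of a large table, `df.limit(n)` is the safer habit, since it caps the result before anything reaches the driver:

```scala
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder()
  .appName("take-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3, 4, 5).toDF("n")

val f: Row = df.first()                          // first row, same as head()
val h: Array[Row] = df.head(2)                   // first 2 rows
val t: Array[Row] = df.take(2)                   // first 2 rows, same as head(2)
val tl: java.util.List[Row] = df.takeAsList(2)   // first 2 rows as a java.util.List

// Safer pattern for big tables: cap the result before collecting it.
val sample: Array[Row] = df.limit(2).collect()
```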