For more example demos, see my GitHub:
https://github.com/NuistGeorgeYoung/flink_stream_test/
Writing a Flink program roughly breaks down into the following steps: obtain an execution environment, configure it, define the data sources, apply transformations, define the sinks, and finally call execute(). First, obtain the execution environment:
```scala
val env = StreamExecutionEnvironment.getExecutionEnvironment
```
After that, you can set the following configuration options (a minimal sketch follows the list):
- Parallelism: set via `env.setParallelism(parallelism)`.
- Execution mode: local (LOCAL) vs. cluster (CLUSTER) execution. `getExecutionEnvironment` picks a mode based on your environment by default, but in some cases you may need to set it explicitly; this is usually controlled through configuration files or command-line parameters rather than through `StreamExecutionEnvironment`.
- Time characteristic: set via `env.setStreamTimeCharacteristic(timeCharacteristic)`.
- State backend: set via `env.setStateBackend(backend)`. Common backends include `FsStateBackend` (filesystem-based) and `RocksDBStateBackend` (RocksDB-based).
- Checkpointing: enable via `env.enableCheckpointing(interval)` and fine-tune the behavior with methods such as `setCheckpointingMode` and `setCheckpointTimeout`.
- Restart strategy: set via `env.setRestartStrategy`, e.g. fixed-delay restart or failure-rate restart.
- Cluster resource allocation and task scheduling are not set through `StreamExecutionEnvironment`; they are configured in Flink's configuration file (e.g. `flink-conf.yaml`) or via command-line parameters at job submission.
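A minimal sketch of these calls, assuming Flink 1.13-era APIs; the checkpoint path, interval, and restart values below are placeholder choices for illustration, not recommendations:

```scala
import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

object EnvConfigSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env.setParallelism(4)                                             // default operator parallelism
    env.setStateBackend(new FsStateBackend("file:///tmp/flink-ckpt")) // placeholder checkpoint path
    env.enableCheckpointing(60 * 1000L)                               // checkpoint every 60 s
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.getCheckpointConfig.setCheckpointTimeout(2 * 60 * 1000L)      // abort checkpoints after 2 min
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)))
  }
}
```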
Flink's DataSet API ships the following built-in data sources:

- readTextFile(path) / TextInputFormat: reads files line by line and returns them as Strings.
- readTextFileWithValue(path) / TextValueInputFormat: reads files line by line and returns them as StringValues. StringValues are mutable strings.
- readCsvFile(path) / CsvInputFormat: parses files of comma- (or another character-) delimited fields. Returns a DataSet of tuples or POJOs. Supports the basic Java types and their Value counterparts as field types.
- readFileOfPrimitives(path, Class) / PrimitiveInputFormat: parses files of newline- (or another character-sequence-) delimited primitive data types such as String or Integer.
- readFileOfPrimitives(path, delimiter, Class) / PrimitiveInputFormat: parses files of primitive data types such as String or Integer, using the given delimiter.
- readSequenceFile(Key, Value, path) / SequenceFileInputFormat: creates a JobConf, reads the file from the given path with type SequenceFileInputFormat, Key class, and Value class, and returns the records as Tuple2<Key, Value>.
- fromCollection(Collection): creates a data set from a Java java.util.Collection. All elements in the collection must be of the same type.
- fromCollection(Iterator, Class): creates a data set from an iterator. The class specifies the data type of the elements returned by the iterator.
- fromElements(T ...): creates a data set from the given sequence of objects. All objects must be of the same type.
- fromParallelCollection(SplittableIterator, Class): creates a data set from an iterator, in parallel. The class specifies the data type of the elements returned by the iterator.
- generateSequence(from, to): generates the sequence of numbers in the given interval, in parallel.
- readFile(inputFormat, path) / FileInputFormat: accepts a file input format.
- createInput(inputFormat) / InputFormat: accepts a generic input format.
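A small batch-side sketch using a few of these sources; the input path is a placeholder:

```scala
import org.apache.flink.api.scala._

object SourceSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val lines: DataSet[String] = env.readTextFile("file:///path/to/input.txt") // TextInputFormat
    val nums:  DataSet[Long]   = env.generateSequence(1, 100)                  // parallel number sequence
    val words: DataSet[String] = env.fromElements("flink", "kafka", "hadoop")  // from given elements

    println(words.count()) // count() is eager and triggers execution
  }
}
```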
- Transformations include map, flatMap, mapPartition, filter, groupBy, aggregate, distinct, join, and more; see the official documentation for the full list (a short sketch follows the link below):
- https://flink.sojb.cn/dev/batch/dataset_transformations.html#filter
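A short sketch chaining a few of these transformations on a DataSet (the input elements are arbitrary):

```scala
import org.apache.flink.api.scala._

object TransformSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val result = env
      .fromElements("Flink", "flink", "kafka", "")
      .map(_.toLowerCase)   // normalize case
      .filter(_.nonEmpty)   // drop empty strings
      .distinct()           // deduplicate

    result.print()          // eager: triggers execution
  }
}
```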
Note: the print(), printToErr(), print(String msg), and printToErr(String msg) methods are not directly tied to any particular OutputFormat class; they are debugging helpers that write straight to the console. The output() method, by contrast, is a more general interface that relies on a concrete OutputFormat implementation to define how the data is written.
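For illustration, a sketch of the common sink calls; the output path is a placeholder, and `PrintingOutputFormat` is just one concrete `OutputFormat`:

```scala
import org.apache.flink.api.java.io.PrintingOutputFormat
import org.apache.flink.api.scala._

object SinkSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val ds  = env.fromElements("a", "b", "c")

    ds.writeAsText("file:///tmp/out")             // TextOutputFormat under the hood
    ds.output(new PrintingOutputFormat[String]()) // generic output(OutputFormat) sink
    env.execute("sink-sketch")                    // run the registered sinks

    // print()/printToErr() are eager debug sinks and trigger execution themselves:
    // ds.print()
  }
}
```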
The Maven dependencies (pom.xml) used for these examples:

```xml
<dependencies>

    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.20</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>

    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-reflect</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-compiler</artifactId>
        <version>${scala.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>${kafka.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>${mysql.version}</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>${fastjson.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-scala-bridge_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-planner-blink_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-common</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-csv</artifactId>
        <version>${flink.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-base</artifactId>
        <version>${flink.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>1.13.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.12</artifactId>
        <version>1.13.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-java-bridge_2.12</artifactId>
        <version>1.13.2</version>
    </dependency>

    <dependency>
        <groupId>com.typesafe.play</groupId>
        <artifactId>play-json_2.12</artifactId>
        <version>2.9.2</version>
    </dependency>
</dependencies>
```
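Since the dependencies above pull in flink-connector-kafka, here is a minimal sketch of wiring a Kafka source into a streaming job, assuming the Flink 1.13-era connector API; the topic name, group id, and bootstrap servers are placeholders:

```scala
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object KafkaSourceSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092") // placeholder broker address
    props.setProperty("group.id", "flink-demo")              // placeholder consumer group

    val consumer = new FlinkKafkaConsumer[String]("demo-topic", new SimpleStringSchema(), props)
    env.addSource(consumer).print()

    env.execute("kafka-source-sketch")
  }
}
```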
In the run method I continuously generate simulated temperature readings (TemperatureReading); each reading carries a timestamp, a sensor ID, and a temperature value.
Random delays simulate the intervals at which data arrives, and the temperature value also rises slowly as time (in milliseconds) passes.
TemperatureTimestampExtractor assigns a timestamp to each reading and emits watermarks. Here the watermark is simply the current timestamp minus 1; this is a rough setting for learning purposes, not necessarily a best practice.
The job defines a processing-time tumbling window of 1 second. Note that even though timestamps and watermarks are assigned, a processing-time window is not driven by them; an event-time window (e.g. TumblingEventTimeWindows) would be.
Within each window, reduce computes a rolling pairwise average of the temperatures (an approximation of the window mean), and the result is simply printed via print().
```scala
package stream

import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

import scala.util.Random

object FlinkRandomTemperatureSource {
  private case class TemperatureReading(timestamp: Long, sensorId: String, temperature: Double)

  private class RandomTemperatureSource(var totalElements: Int) extends SourceFunction[TemperatureReading] {
    private var isRunning = true
    private val random = new Random()
    private var currentTime = System.currentTimeMillis()
    private var baseTemperature = 30.0

    override def run(ctx: SourceFunction.SourceContext[TemperatureReading]): Unit = {
      while (isRunning && totalElements > 0) {
        val delay = (random.nextDouble() * 1000 + 500).toLong // random delay between 0.5 and 1.5 seconds
        Thread.sleep(delay)

        // simulate the passage of time and a slowly rising temperature
        currentTime += delay
        baseTemperature += delay / 1000000.0

        // generate and emit one reading
        val sensorId = "sensor" + random.nextInt(10)
        val temperatureReading = TemperatureReading(currentTime, sensorId, baseTemperature)
        ctx.collect(temperatureReading)

        totalElements -= 1
      }

      ctx.close()
    }

    override def cancel(): Unit = {
      isRunning = false
    }
  }

  /**
   * Custom timestamp and watermark assigner.
   * AssignerWithPunctuatedWatermarks is deprecated;
   * prefer assignTimestampsAndWatermarks with a WatermarkStrategy
   * (see the sketch after this example).
   */
  private class TemperatureTimestampExtractor extends AssignerWithPunctuatedWatermarks[TemperatureReading] {
    override def extractTimestamp(element: TemperatureReading, previousElementTimestamp: Long): Long = {
      element.timestamp
    }

    override def checkAndGetNextWatermark(lastElement: TemperatureReading, extractedTimestamp: Long): Watermark = {
      new Watermark(extractedTimestamp - 1) // naive: watermark is simply the current timestamp minus 1
    }
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val source = new RandomTemperatureSource(100000) // emit 100000 readings
    val stream = env.addSource(source)

    // assign timestamps and watermarks to the source stream
    val timestampedStream = stream.assignTimestampsAndWatermarks(new TemperatureTimestampExtractor)

    timestampedStream
      .keyBy(_.sensorId)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(1)))
      // rolling pairwise average: approximates, but is not exactly, the window mean
      .reduce((r1, r2) => TemperatureReading(r1.timestamp, r1.sensorId, (r1.temperature + r2.temperature) / 2))
      // equivalent form with an explicit ReduceFunction:
      // .reduce(new ReduceFunction[TemperatureReading] {
      //   override def reduce(value1: TemperatureReading,
      //                       value2: TemperatureReading): TemperatureReading = {
      //     val avgTemp = (value1.temperature + value2.temperature) / 2
      //     TemperatureReading(value1.timestamp, value1.sensorId, avgTemp)
      //   }
      // })
      .print()

    env.execute("Flink Random Temperature Source with Event Time")
  }
}
```
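As noted in the scaladoc above, AssignerWithPunctuatedWatermarks is deprecated. Below is a minimal sketch of the replacement, the WatermarkStrategy API introduced in Flink 1.11, reusing `stream` and `TemperatureReading` from the example; the 1 ms out-of-orderness bound mirrors the timestamp-minus-1 watermark and is an arbitrary choice:

```scala
import java.time.Duration
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}

// drop-in replacement for `new TemperatureTimestampExtractor` above
val timestampedStream = stream.assignTimestampsAndWatermarks(
  WatermarkStrategy
    .forBoundedOutOfOrderness[TemperatureReading](Duration.ofMillis(1))
    .withTimestampAssigner(new SerializableTimestampAssigner[TemperatureReading] {
      override def extractTimestamp(element: TemperatureReading, recordTimestamp: Long): Long =
        element.timestamp
    })
)
```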
