【从零开始的大数据学习】Flink官方教程学习笔记（一）

【从零开始的大数据学习】Flink官方教程学习笔记（一）
Flink官方教程学习笔记
学习资源

 基础Scala语法

The Scala Book:https://docs.scala-lang.org/overviews/scala-book/

Scala Basics:https://docs.scala-lang.org/tour/basics.html

Scala数据结构专题
- Scala 列表：
  - https://www.runoob.com/scala/scala-lists.html
  - https://docs.scala-lang.org/overviews/scala-book/list-class.html#inner-main
  - Scala中List是不可改变的！！！
  - 用apply索引
- Scala 元组：https://www.runoob.com/scala/scala-tuples.html
- Scala Set:
  - https://docs.scala-lang.org/scala3/book/collections-classes.html#working-with-sets
  - 用take索引
  - 有序集合：https://www.cnblogs.com/zhaohadoopone/p/9534982.html
```
val s =scala.collection.mutable.LinkedHashSet[Tuple2[String,String]]()
1
```
声明变量
```
var x = 2 // 变量
x =  x + 1  // correct
val x = 2 // 常量
x =  x + 1  // wrong!
1
2
3
4
```
代码块
- 块的最后一行是块的值
```
println({
  val x = 1 + 1
  x + 1 // the last expression is the result of the whole block
}) // 3
1
2
3
4
```
函数（function）
- 定义方式：函数名=(参数：参数类型)=>返回值
```
val addOne = (x: Int) => x + 1
println(addOne(1)) // 2
1
2
```
参数列表可以为空：
```
val getTheAnswer = () => 42
println(getTheAnswer()) // 42
1
2
```
- 函数本身就可以是一个参数：
```
def whileLoop(condition: => Boolean)(body: => Unit): Unit =
{}
1
2
```
方法(methods)

def 函数名（参数：参数类型，…）：
最后一行是返回值
```
def add(x:Int,y:Int):
  Int = x+y
1
2
```
Traits (接口)

包含自己的方法，可以被继承和实现，但是不能被初始化。
- useful as generic types and with abstract methods.
```
trait Iterator[A] { def hasNext: Boolean def next(): A }
1
```
Extending the trait Iterator[A] requires a type A and implementations of the methods hasNext and next.
- 继承用class extends 实现
```
class IntIterator(to: Int) extends Iterator[Int]{
private var t=0
override def hasNext: 
    Boolean = current < to 
override def next(): Int ={t}
}
1
2
3
4
5
6
```
class（类）
```
class User 
val user1 = new User // 实例化了一个类
1
2
```
示例代码：
```
//class
class  User
val user = new User

class Point(var x:Int, var y:Int){ // x 和y是Point class的成员变量
  def move(dx:Int, dy:Int):Unit = {
    x = x + dx
    y=  y + dy
  }

  override def toString: String =
    s"($x, $y)"
}

val point1 = new Point(2, 3)
println(point1.x)  // prints 2
println(point1)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
```
注意：Java和Scala混用时，constructor的默认的参数值不管用。

tuple（元组）

和Python的元组相似：
```
val ingredient = ("Sugar", 25)
println(ingredient._1) // Sugar 
println(ingredient._2) // 25
1
2
3
```
Mutiple Parameter List
```
val numbers = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) 
val res = numbers.foldLeft(0)((m, n) => m + n) 
println(res) // 55
1
2
3
```
传入的第一个参数是初始值，第二个参数是一个函数，用于定义初始值和List中每个元素的运算

 Flink Exercises （强推）

Flink 教程
- EG: https://github.com/apache/flink-training/blob/release-1.15/README.md
- CN: https://github.com/apache/flink-training/blob/release-1.15/README_zh.md
- Flink Chinese:https://flink.apachecn.org/#/
- Flink English: https://flink.apache.org/
Flink数据集
- Data: https://github.com/apache/flink-training/blob/release-1.15/README.md#using-the-taxi-data-streams
- 数据介绍：
  taxi event
```
rideId         : Long      // a unique id for each ride
taxiId         : Long      // a unique id for each taxi
driverId       : Long      // a unique id for each driver
isStart        : Boolean   // TRUE for ride start events, FALSE for ride end events
eventTime      : Instant   // the timestamp for this event
startLon       : Float     // the longitude of the ride start location
startLat       : Float     // the latitude of the ride start location
endLon         : Float     // the longitude of the ride end location
endLat         : Float     // the latitude of the ride end location
passengerCnt   : Short     // number of passengers on the ride
1
2
3
4
5
6
7
8
9
10
```
fare event
```
rideId         : Long      // a unique id for each ride
taxiId         : Long      // a unique id for each taxi
driverId       : Long      // a unique id for each driver
startTime      : Instant   // the start time for this ride
paymentType    : String    // CASH or CARD
tip            : Float     // tip for this ride
tolls          : Float     // tolls for this ride
totalFare      : Float     // total fare collected
1
2
3
4
5
6
7
8
```
配置Flink Tutorial所需的环境
1. 安装Flink：
- 注意Java版本：必须在11以上
- 注意Flink版本：没有Web UI可能是Flink的版本太低（1.4没有，1.9实测可用）
- Windows下推荐使用cygdrive命令行环境
  安装后界面：
2.下载Flink Tutorial：
https://github.com/apache/flink-training/tree/release-1.15/

在Win界面下可能会报错【# \r‘:command not found】

修改换行方式：
https://blog.csdn.net/fangye945a/article/details/120660824

3. 完善Exercise.scala文件，运行test文件，即可看到结果。
Scala中也可以调用Java的函数：https://docs.scala-lang.org/scala3/book/interacting-with-java.html

Flink Tutorial学习笔记

 流式处理

教程链接： https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/learn-flink/overview/
- Streams are data’s natural habitat.
- 批处理: ingest the entire dataset before producing any results
- 流处理：the input may never end, and so you are forced to continuously process the data as it arrives.
  在Flink中，数据流从source中读入，被operator转换，最终流入sink.
- 一次转换可能包含多个operator.
流可以从消息队列或分布式日志系统中读入，例如：Apache Kafka or Kinesis.但是Flink也可以读入bounded的数据来源。输出同理。

并行的数据流

Flink程序内在本身就是并行且分布式的。
- 每一个数据流都有多个stream partition.
- 每一个operator都有多个operator subtasks，不同操作符的并行级别不一样.
  - The number of operator subtasks is the parallelism of that particular operator. Different operators of the same program may have different levels of parallelism.
- One-to-one streams：例如source和map
- Redistributing streams：
  - 改变了数据流的划分
  - introduce non-determinism regarding the order in which the aggregated results for different keys arrive at the Sink
- 实时流：可以通过在数据中加入时间戳
有状态的流处理
- Flink’s operations can be stateful. This means that how one event is handled can depend on the accumulated effect of all the events that came before it.
- A Flink application is run in parallel on a distributed cluster. The various parallel instances of a given operator will execute independently,in separate threads, and in general will be running on different machines.
- The 3rd operator is stateful.
- A fully-connected network shuffle is occurring between the second and third operators.
- This is being done to partition the stream by some key, so that all of the events that need to be processed together, will be.
数据流API

教程链接：https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/learn-flink/datastream_api/
- 可以流式处理的：
- basic types, i.e., String, Long, Integer, Boolean, Array
- composite types: Tuples, POJOs, and Scala case classes
执行环境
- Streaming applications need to use a StreamExecutionEnvironment.
- When env.execute() is called the job graph is packaged up and sent to the JobManager, which parallelizes the job and distributes slices of it to the Task Managers for execution.
- Each parallel slice of your job will be executed in a task slot.
Basic Stream functions

1.env.fromCollections:从列表创建
```
List people = new ArrayList();

people.add(new Person("Fred", 35));
people.add(new Person("Wilma", 35));
people.add(new Person("Pebbles", 2));

DataStream flintstones = env.fromCollection(people);
1
2
3
4
5
6
7
```
2.env.socketTextStream/readTextfile 从远程/文件读取
```
DataStream lines = env.socketTextStream("localhost", 9999);
1
```
```
DataStream lines = env.readTextFile("file:///path");
1
```
小结：流主要通过env中的函数读取。

Streams could also be debugged by inserting local breakpoints,etc.

ETL

教程链接：https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/learn-flink/etl/
Flink’s table API：https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/table/overview/
- map(): only suitable for one-to-one corespondence (全射)
  - for each and every stream element coming in, map() will emit one transformed element.
- flatmap(): otherwise cases
相关阅读:
python环境安装（windows）
手动安装Docker安装详细教程
 MySQL索引优化
 GPT-Engineer：一个提示就能生成完整应用｜全自动代码生成神器
 （纯文字版）靶向 HER2 阳性乳腺癌：进展和未来方向 Nature Reviwe Drug Discovery（一）
C# Winform自动更新
 数据库表设计优化
 Spark源码(启动ApplicationMaster和Driver线程)-第二期
 仿钉钉考勤统计页面的日历组件，通过日历展示每日考勤打卡情况，支持在日历上打两种不同类型的点，大致适配各种分辨率效果图
 Promise的简单用法
原文地址：https://blog.csdn.net/qq_41145832/article/details/127437917

Flink官方教程学习笔记

学习资源

基础Scala语法

Scala数据结构专题

声明变量

代码块

函数（function）

方法(methods)

Traits (接口)

class（类）

tuple（元组）

Mutiple Parameter List

Flink Exercises （强推）

Flink 教程

Flink数据集

配置Flink Tutorial所需的环境

Flink Tutorial学习笔记

流式处理

并行的数据流

有状态的流处理

数据流API

执行环境

Basic Stream functions

ETL