• 【从零开始的大数据学习】Flink官方教程学习笔记(一)


    学习资源

    基础Scala语法

    The Scala Book:https://docs.scala-lang.org/overviews/scala-book/

    Scala Basics:https://docs.scala-lang.org/tour/basics.html

    Scala数据结构专题

    • Scala 列表:
      • https://www.runoob.com/scala/scala-lists.html
      • https://docs.scala-lang.org/overviews/scala-book/list-class.html#inner-main
      • Scala中List是不可改变的!!!
      • 用apply索引
    • Scala 元组:https://www.runoob.com/scala/scala-tuples.html
    • Scala Set:
      • https://docs.scala-lang.org/scala3/book/collections-classes.html#working-with-sets
      • take索引
      • 有序集合:https://www.cnblogs.com/zhaohadoopone/p/9534982.html
    val s =scala.collection.mutable.LinkedHashSet[Tuple2[String,String]]()
    
    • 1
    声明变量
    var x = 2 // 变量
    x =  x + 1  // correct
    val x = 2 // 常量
    x =  x + 1  // wrong!
    
    • 1
    • 2
    • 3
    • 4
    代码块
    • 块的最后一行是块的值
    println({
      val x = 1 + 1
      x + 1 // the last expression is the result of the whole block
    }) // 3
    
    • 1
    • 2
    • 3
    • 4
    函数(function)
    • 定义方式:函数名=(参数:参数类型)=>返回值
    val addOne = (x: Int) => x + 1
    println(addOne(1)) // 2
    
    • 1
    • 2

    参数列表可以为空:

    val getTheAnswer = () => 42
    println(getTheAnswer()) // 42
    
    • 1
    • 2
    • 函数本身就可以是一个参数:
    def whileLoop(condition: => Boolean)(body: => Unit): Unit =
    {}
    
    • 1
    • 2
    方法(methods)

    def 函数名(参数:参数类型,…):
    最后一行是返回值

    def add(x:Int,y:Int):
      Int = x+y
    
    • 1
    • 2
    Traits (接口)

    包含自己的方法,可以被继承和实现,但是不能被初始化。

    • useful as generic types and with abstract methods.
    trait Iterator[A] { def hasNext: Boolean def next(): A }
    
    • 1

    Extending the trait Iterator[A] requires a type A and implementations of the methods hasNext and next.

    • 继承用class extends 实现
    class IntIterator(to: Int) extends Iterator[Int]{
    private var t=0
    override def hasNext: 
        Boolean = current < to 
    override def next(): Int ={t}
    }
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    class(类)
    class User 
    val user1 = new User // 实例化了一个类
    
    • 1
    • 2

    示例代码:

    //class
    class  User
    val user = new User
    
    class Point(var x:Int, var y:Int){ // x 和y是Point class的成员变量
      def move(dx:Int, dy:Int):Unit = {
        x = x + dx
        y=  y + dy
      }
    
      override def toString: String =
        s"($x, $y)"
    }
    
    val point1 = new Point(2, 3)
    println(point1.x)  // prints 2
    println(point1)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17

    注意:Java和Scala混用时,constructor的默认的参数值不管用。

    tuple(元组)

    和Python的元组相似:

    val ingredient = ("Sugar", 25)
    println(ingredient._1) // Sugar 
    println(ingredient._2) // 25
    
    • 1
    • 2
    • 3
    Mutiple Parameter List
    val numbers = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) 
    val res = numbers.foldLeft(0)((m, n) => m + n) 
    println(res) // 55
    
    • 1
    • 2
    • 3

    传入的第一个参数是初始值,第二个参数是一个函数,用于定义初始值和List中每个元素的运算

    Flink Exercises (强推)

    Flink 教程

    Flink数据集

    rideId         : Long      // a unique id for each ride
    taxiId         : Long      // a unique id for each taxi
    driverId       : Long      // a unique id for each driver
    isStart        : Boolean   // TRUE for ride start events, FALSE for ride end events
    eventTime      : Instant   // the timestamp for this event
    startLon       : Float     // the longitude of the ride start location
    startLat       : Float     // the latitude of the ride start location
    endLon         : Float     // the longitude of the ride end location
    endLat         : Float     // the latitude of the ride end location
    passengerCnt   : Short     // number of passengers on the ride
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10

    fare event

    rideId         : Long      // a unique id for each ride
    taxiId         : Long      // a unique id for each taxi
    driverId       : Long      // a unique id for each driver
    startTime      : Instant   // the start time for this ride
    paymentType    : String    // CASH or CARD
    tip            : Float     // tip for this ride
    tolls          : Float     // tolls for this ride
    totalFare      : Float     // total fare collected
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8

    配置Flink Tutorial所需的环境

    1. 安装Flink:
    • 注意Java版本:必须在11以上
    • 注意Flink版本:没有Web UI可能是Flink的版本太低(1.4没有,1.9实测可用)
    • Windows下推荐使用cygdrive命令行环境
      安装后界面:
      在这里插入图片描述

    2.下载Flink Tutorial:
    https://github.com/apache/flink-training/tree/release-1.15/

    在Win界面下可能会报错【# \r‘:command not found】
    在这里插入图片描述
    修改换行方式:
    https://blog.csdn.net/fangye945a/article/details/120660824
    在这里插入图片描述
    3. 完善Exercise.scala文件,运行test文件,即可看到结果。
    Scala中也可以调用Java的函数:https://docs.scala-lang.org/scala3/book/interacting-with-java.html

    Flink Tutorial学习笔记

    流式处理

    教程链接: https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/learn-flink/overview/

    • Streams are data’s natural habitat.
    • 批处理: ingest the entire dataset before producing any results
    • 流处理:the input may never end, and so you are forced to continuously process the data as it arrives.
      在Flink中,数据流从source中读入,被operator转换,最终流入sink.
    • 一次转换可能包含多个operator.

    流可以从消息队列或分布式日志系统中读入,例如:Apache Kafka or Kinesis.但是Flink也可以读入bounded的数据来源。输出同理。
    在这里插入图片描述

    并行的数据流

    Flink程序内在本身就是并行且分布式的。

    • 每一个数据流都有多个stream partition.

    • 每一个operator都有多个operator subtasks,不同操作符的并行级别不一样.

      • The number of operator subtasks is the parallelism of that particular operator. Different operators of the same program may have different levels of parallelism.
      • 在这里插入图片描述
    • One-to-one streams:例如source和map

    • Redistributing streams:

      • 改变了数据流的划分
      • introduce non-determinism regarding the order in which the aggregated results for different keys arrive at the Sink
    • 实时流:可以通过在数据中加入时间戳

    有状态的流处理

    • Flink’s operations can be stateful. This means that how one event is handled can depend on the accumulated effect of all the events that came before it.

    • A Flink application is run in parallel on a distributed cluster. The various parallel instances of a given operator will execute independently,in separate threads, and in general will be running on different machines.
      在这里插入图片描述

    • The 3rd operator is stateful.

    • A fully-connected network shuffle is occurring between the second and third operators.

    • This is being done to partition the stream by some key, so that all of the events that need to be processed together, will be.

    数据流API

    教程链接:https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/learn-flink/datastream_api/

    • 可以流式处理的:
    • basic types, i.e., String, Long, Integer, Boolean, Array
    • composite types: Tuples, POJOs, and Scala case classes

    执行环境

    • Streaming applications need to use a StreamExecutionEnvironment.
    • When env.execute() is called the job graph is packaged up and sent to the JobManager, which parallelizes the job and distributes slices of it to the Task Managers for execution.
    • Each parallel slice of your job will be executed in a task slot.
      在这里插入图片描述

    Basic Stream functions

    1.env.fromCollections:从列表创建

    List people = new ArrayList();
    
    people.add(new Person("Fred", 35));
    people.add(new Person("Wilma", 35));
    people.add(new Person("Pebbles", 2));
    
    DataStream flintstones = env.fromCollection(people);
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7

    2.env.socketTextStream/readTextfile 从远程/文件读取

    DataStream lines = env.socketTextStream("localhost", 9999);
    
    • 1
    DataStream lines = env.readTextFile("file:///path");
    
    • 1

    小结:流主要通过env中的函数读取。

    Streams could also be debugged by inserting local breakpoints,etc.

    ETL

    教程链接:https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/learn-flink/etl/
    Flink’s table API:https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/table/overview/

    • map(): only suitable for one-to-one corespondence (全射)
      • for each and every stream element coming in, map() will emit one transformed element.
    • flatmap(): otherwise cases
  • 相关阅读:
    python环境安装(windows)
    手动安装Docker安装详细教程
    MySQL索引优化
    GPT-Engineer:一个提示就能生成完整应用|全自动代码生成神器
    (纯文字版)靶向 HER2 阳性乳腺癌:进展和未来方向 Nature Reviwe Drug Discovery(一)
    C# Winform自动更新
    数据库表设计优化
    Spark源码(启动ApplicationMaster和Driver线程)-第二期
    仿钉钉考勤统计页面的日历组件,通过日历展示每日考勤打卡情况,支持在日历上打两种不同类型的点,大致适配各种分辨率效果图
    Promise的简单用法
  • 原文地址:https://blog.csdn.net/qq_41145832/article/details/127437917