2024年8月

声明: 本文来自 Poe 的回答。

In Apache Flink, processing keyed streams can be done with or without windowing, and each approach serves different use cases. Here’s a comparison of the two:

Processing Keyed Streams Without Windowing

Characteristics:

  • Continuous Processing: Data is processed as it arrives, without grouping into windows.
  • Stateful Operations: You can maintain state across events for each key using operations like keyBy(), map(), or reduce().
  • Event-Time Handling: You can handle events in real time, allowing for low-latency processing.
  • Use Cases: Suitable for applications that need to respond immediately to each event, such as real-time analytics, alerting systems, or monitoring applications.

Example:

DataStream<Event> stream = // your source stream    
stream.keyBy(event -> event.getKey())
      .process(new ProcessFunction<Event, Result>() {
          @Override
          public void processElement(Event value, Context ctx, Collector<Result> out) {
              // Processing logic here
          }
      });

Processing Keyed Streams With Windowing

Characteristics:

  • Batch Processing: Data is grouped into finite windows based on time or count before processing.
  • Aggregation: Commonly used for calculating aggregates over a specific time frame (e.g., sum, average).
  • Window Types: Flink supports different window types, such as tumbling, sliding, and session windows.
  • Event-Time and Processing-Time: You can specify whether to use event-time or processing-time semantics for windowing.
  • Use Cases: Ideal for scenarios where you want to analyze data over intervals, such as calculating metrics per minute, hour, or day.

Example:

DataStream<Event> stream = // your source stream
stream.keyBy(event -> event.getKey())
      .window(TumblingEventTimeWindows.of(Time.minutes(1))) // or any other window type
      .aggregate(new SumAggregator());

Summary of Differences

image.png

Conclusion

  • Without Windowing: Use for real-time, low-latency applications that need immediate responses to each event.
  • With Windowing: Use for scenarios requiring aggregation or computation over a fixed period, like summarizing data trends over time.

内心希望是个 deep dive(虽然也不够资格),但是实际可能是个入门的简单介绍。

什么是 Flink - Overview

Stateful Computations over Data Streams

"Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale." - 摘自 Flink 官网首页

数据流之上的有状态计算
“Apache Flink 是一个框架和分布式处理引擎,用于在无界和有界数据流上进行有状态计算。Flink 设计用于在所有常见的集群环境中运行,以内存速度和任何规模执行计算。” - 翻译 DeepL 完成

架构 - 高阶视角

基于 Flink snapshot version,非 stable version。

Flink Cluster - high level architecture

Flink 集群剖析

Flink 程序剖析

关键概念

Event Time 与 Processing Time

Notions of Time: Event Time and Processing Time

Event Time 与 Watermarks

Event Time and Watermarks

Parallelism

深入话题

任务和调度

Job/Task

计算

一些坑(也就是实现要注意的地方)

参考

https://soulteary.com/:LLM/AI 实现编程相关
https://colobu.com/:鸟窝,Golang 和编程相关
https://blog.wildcat.io/blog/:国外来人,有点杂
https://www.huangjiwei.com/blog/:语言文字
https://www.kawabangga.com/:优秀的运维
https://jaapgrolleman.com/:生活在上海的荷兰人
https://www.daemonology.net/hn-daily/: Hacker News Daily top 10 posts
https://www.skyue.com/:
https://justgoidea.com/: 思考者,通过笔记类的文章发现的
https://manateelazycat.github.io/:优秀的思考者,商业人和开源者

均支持 RSS 订阅。