2024年8月

Flink processing keyed streams with or without windowing

作者: jfcai
时间: 2024-08-30
分类: 默认分类
评论

声明：本文来自 Poe 的回答。

In Apache Flink, processing keyed streams can be done with or without windowing, and each approach serves different use cases. Here’s a comparison of the two:

Processing Keyed Streams Without Windowing

Characteristics:

Continuous Processing: Data is processed as it arrives, without grouping into windows.
Stateful Operations: You can maintain state across events for each key using operations like keyBy(), map(), or reduce().
Event-Time Handling: You can handle events in real time, allowing for low-latency processing.
Use Cases: Suitable for applications that need to respond immediately to each event, such as real-time analytics, alerting systems, or monitoring applications.

Example:

DataStream<Event> stream = // your source stream    
stream.keyBy(event -> event.getKey())
      .process(new ProcessFunction<Event, Result>() {
          @Override
          public void processElement(Event value, Context ctx, Collector<Result> out) {
              // Processing logic here
          }
      });

Processing Keyed Streams With Windowing

Characteristics:

Batch Processing: Data is grouped into finite windows based on time or count before processing.
Aggregation: Commonly used for calculating aggregates over a specific time frame (e.g., sum, average).
Window Types: Flink supports different window types, such as tumbling, sliding, and session windows.
Event-Time and Processing-Time: You can specify whether to use event-time or processing-time semantics for windowing.
Use Cases: Ideal for scenarios where you want to analyze data over intervals, such as calculating metrics per minute, hour, or day.

Example:

DataStream<Event> stream = // your source stream
stream.keyBy(event -> event.getKey())
      .window(TumblingEventTimeWindows.of(Time.minutes(1))) // or any other window type
      .aggregate(new SumAggregator());

Summary of Differences

Conclusion

Without Windowing: Use for real-time, low-latency applications that need immediate responses to each event.
With Windowing: Use for scenarios requiring aggregation or computation over a fixed period, like summarizing data trends over time.

Flink 介绍

作者: jfcai
时间: 2024-08-21
分类: 默认分类
评论

内心希望是个 deep dive（虽然也不够资格），但是实际可能是个入门的简单介绍。

什么是 Flink - Overview

Stateful Computations over Data Streams

"Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale." - 摘自 Flink 官网首页

数据流之上的有状态计算
“Apache Flink 是一个框架和分布式处理引擎，用于在无界和有界数据流上进行有状态计算。Flink 设计用于在所有常见的集群环境中运行，以内存速度和任何规模执行计算。” - 翻译 DeepL 完成

架构 - 高阶视角

基于 Flink snapshot version，非 stable version。

Flink Cluster - high level architecture

Flink 集群剖析

Flink 程序剖析

关键概念

Event Time 与 Processing Time

Notions of Time: Event Time and Processing Time

Event Time 与 Watermarks

Event Time and Watermarks

Parallelism

深入话题

任务和调度

Job/Task

计算

一些坑（也就是实现要注意的地方）

参考

优秀的博客

作者: jfcai
时间: 2024-08-06
分类: 默认分类
评论

https://soulteary.com/：LLM/AI 实现编程相关
https://colobu.com/：鸟窝，Golang 和编程相关
https://blog.wildcat.io/blog/：国外来人，有点杂
https://www.huangjiwei.com/blog/：语言文字
https://www.kawabangga.com/：优秀的运维
https://jaapgrolleman.com/：生活在上海的荷兰人
https://www.daemonology.net/hn-daily/: Hacker News Daily top 10 posts
https://www.skyue.com/: 杂
https://justgoidea.com/: 思考者，通过笔记类的文章发现的
https://manateelazycat.github.io/：优秀的思考者，商业人和开源者

均支持 RSS 订阅。