14 Jul, 2025
systemctl-restart-hadoop-datanode-fail
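Symptom

In an Airflow job, restarting the Hadoop DataNode often fails, yet when I log in to the machine afterwards the DataNode turns out to be running.

journalctl log (newest entries first)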
```
Jul 10 17:18:05 XXXX systemd[1]: Started Hadoop datanode.
Jul 10 17:18:05 XXXX hadoop-hdfs-datanode[3678026]: Started Hadoop datanode (hadoop-hdfs-datanode):[  OK  ]
Jul 10 17:17:56 XXXX su[3677203]: pam_unix(su:session): session opened for user root(uid=0) by (uid=0)
Jul 10 17:17:56 XXXX su[3677203]: (to root) root on none
Jul 10 17:17:56 XXXX systemd[1]: Starting Hadoop datanode...
Jul 10 17:17:56 XXXX systemd[1]: Stopped Hadoop datanode.
Jul 10 17:17:56 XXXX systemd[1]: datanode.service: Scheduled restart job, restart counter is at 1.
Jul 10 17:15:56 XXXX systemd[1]: Failed to start Hadoop datanode.
Jul 10 17:15:56 XXXX systemd[1]: datanode.service: Failed with result 'exit-code'.
Jul 10 17:15:56 XXXX systemd[1]: datanode.service: Control process exited, code=exited, status=1/FAILURE
Jul 10 17:15:56 XXXX hadoop-hdfs-datanode[3675354]: Failed to start Hadoop datanode. Return value: 1[FAILED]
Jul 10 17:15:46 XXXX su[3675290]: pam_unix(su:session): session opened for user root(uid=0) by (uid=0)
Jul 10 17:15:46 XXXX su[3675290]: (to root) root on none
Jul 10 17:15:46 XXXX systemd[1]: Starting Hadoop datanode...
Jul 10 17:15:46 XXXX systemd[1]: Stopped Hadoop datanode.
Jul 10 17:15:46 XXXX systemd[1]: datanode.service: Failed with result 'signal'.
Jul 10 17:15:46 XXXX hadoop-hdfs-datanode[3675282]: Stopped Hadoop datanode:[  OK  ]
Jul 10 17:15:46 XXXX systemd[1]: datanode.service: Main process exited, code=killed, status=9/KILL
Jul 10 17:15:46 XXXX hadoop-hdfs-datanode[3675238]: datanode did not stop gracefully after 5 seconds: killing with kill -9
Jul 10 17:15:41 XXXX hadoop-hdfs-datanode[3675238]: stopping datanode
Jul 10 17:15:41 XXXX systemd[1]: Stopping Hadoop datanode...
```
The log shows that after the stop, the first start failed and the second start succeeded.

Service configuration
```
# cat /etc/systemd/system/datanode.service
[Unit]
Description=Hadoop datanode
After=syslog.target local-fs.target network-online.target rc-local.service
Requires=rc-local.service

[Service]
Type=forking
#RemainAfterExit=yes
ExecStart=/etc/init.d/hadoop-hdfs-datanode start
ExecStop=/etc/init.d/hadoop-hdfs-datanode stop
ExecStartPost=/usr/bin/cp /run/hadoop-hdfs/hadoop-hdfs-datanode.pid /run/
PIDFile=/run/hadoop-hdfs-datanode.pid
TimeoutStartSec=60
Restart=always
RestartSec=120

[Install]
WantedBy=multi-user.target
```
The 120-second delay before the restart comes from RestartSec=120 in the datanode unit file.
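As a quick check of that reading, the arithmetic can be done on the two timestamps in the log: the failed start at 17:15:56 plus RestartSec should give the time of the scheduled restart. This is only a sketch that reuses numbers from the excerpt above.

```python
from datetime import datetime, timedelta

# Time of "Failed to start Hadoop datanode." in the journalctl excerpt.
failed_at = datetime.strptime("17:15:56", "%H:%M:%S")
restart_sec = timedelta(seconds=120)  # RestartSec=120 from the unit file

expected_restart = failed_at + restart_sec
print(expected_restart.strftime("%H:%M:%S"))
```

The result matches the "Scheduled restart job" line in the log.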

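I am not at all familiar with the Hadoop / Java / jsvc stack. At first I could not find anything useful in the jsvc log, and I did not understand how jsvc works, so I did not pin down the cause right away. At one point I used execsnoop from bcc to trace the start-up, but this version of bcc seems to have a bug: every ret code in the exec output is 0. Even so, it was enough to establish that jsvc itself failed on start-up (it started and then exited, and its PID did not match the one in the pidfile).

A key log entry from /opt/log/hadoop/jsvc.err: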
```
Still running according to PID file /var/run/hadoop-hdfs/hadoop_secure_dn.pid, PID is 3911276
Service exit with a return value of 122
```
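Cause

I no longer remember every step of the investigation, so here is just a summary of the cause.

start and stop

Our company's Hadoop start/stop scripts work like this:

jsvc is used to start the datanode.

On stop, the script first sends a plain kill; if the process is still there after 5 seconds, it finishes it off with kill -9.

What kill does to jsvc

Once started, jsvc consists of a parent process (process no. 2) and a child process (process no. 3).

The pidfile records the PID of the child process.

When the Hadoop service script kills, it kills the parent process.

With the default kill (SIGTERM), the parent tries to stop the child gracefully.

With kill -9, the parent dies immediately and the child is not killed.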

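What happens when our service stops
1. The parent process is killed (not the PID in the pidfile).
2. On receiving the signal, the parent tries to stop the child gracefully.
3. After 5 seconds both parent and child are still alive (some graceful-shutdown work is still in progress), so the parent gets kill -9.
4. The child is not killed and keeps running its graceful-shutdown work.
5. systemd considers the service stopped and tries to start it again.
6. On start-up, jsvc finds that the PID in the pidfile (the child) is still running, concludes that the service is still running, and exits.
7. systemd considers the start to have failed.
8. systemd restarts the service. This second start comes 120 seconds later; by then the child has finished its graceful-shutdown work and exited, the PID in the pidfile no longer exists, and the service starts successfully.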
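
Steps 6 to 8 hinge on a "is the PID in the pidfile still alive?" check. The sketch below is not jsvc's code; it only illustrates, under the assumption of a POSIX system, how such a check typically behaves. The function names are made up for this example.

```python
import os
import tempfile


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def pidfile_state(pidfile: str) -> str:
    """Classify a pidfile as 'missing', 'stale' or 'running'."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return "missing"
    return "running" if pid_is_running(pid) else "stale"


# Demo with a temporary pidfile that points at this very process.
with tempfile.NamedTemporaryFile("w", suffix=".pid", delete=False) as f:
    f.write(str(os.getpid()))
demo_state = pidfile_state(f.name)
os.unlink(f.name)
print(demo_state)
```

While the orphaned child is still finishing its shutdown work, a check like this reports "running", which is why the first start is refused; once the child exits, the same pidfile becomes "stale" and the start can proceed.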
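Solution

The service scripts are fairly complicated, with long call chains across several scripts. There are also several pid files: one used by systemd to decide whether the service is running, and one passed to jsvc as its pid file option (-pidfile).

Changing the scripts directly would be a lot of trouble for me.

So I added a retry to the Airflow job, with a 15-second interval.

In practice, once the first systemctl restart has failed, a second systemctl restart issued 15 seconds later blocks until the 120-second mark and then runs together with systemd's own retry.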
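
The retry itself is simple. Below is a minimal sketch in plain Python rather than Airflow's own retry settings; the function name and its parameters are invented for illustration, and the runner is injectable so the logic can be exercised without systemctl.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_retry(
    cmd: Sequence[str],
    attempts: int = 3,
    delay_seconds: float = 15.0,
    run: Callable[[Sequence[str]], int] = lambda c: subprocess.run(c).returncode,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run cmd until it exits 0, sleeping delay_seconds between attempts."""
    for attempt in range(1, attempts + 1):
        if run(cmd) == 0:
            return True
        if attempt < attempts:
            sleep(delay_seconds)
    return False


# Dry run with a fake runner that fails once, then succeeds.
results = iter([1, 0])
waits = []
ok = run_with_retry(
    ["systemctl", "restart", "datanode"],
    run=lambda c: next(results),
    sleep=waits.append,
)
print(ok, waits)
```

In the dry run the first attempt fails, one 15-second wait is recorded, and the second attempt succeeds, which mirrors what the job sees in practice.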

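This is only an engineering workaround, and not a great one.

But compared with rewriting all the scripts and pushing them to several thousand machines, it has a much smaller blast radius, so it will do for now.

If I were to rewrite the scripts from scratch, I would rather not use jsvc at all and rely on systemd's own mechanisms.

If jsvc is already used this way, at least do not wait a mere 5 seconds before kill -9. Match the systemd setting and wait around the default 90 seconds, or somewhat less so that it does not collide with systemd's retry timing.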
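
The "wait longer before kill -9" idea can be sketched as follows. This is an illustration only: it stops a direct child process through subprocess, whereas the real scripts work with PIDs read from files, and the 80-second grace period is just an example value below 90.

```python
import subprocess
import sys


def stop_gracefully(proc: subprocess.Popen, grace_seconds: float) -> str:
    """Send SIGTERM, wait up to grace_seconds, and only then fall back to SIGKILL."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()
        return "killed"


# Demo: a child that simply sleeps exits as soon as it receives SIGTERM.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
outcome = stop_gracefully(child, grace_seconds=80)
print(outcome)
```

The point of the design is that the hard kill is a last resort after a grace period long enough for shutdown work to finish, not a 5-second reflex.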

7 May, 2025

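How the knowledge graph works in RAGFlow

I had been curious about how knowledge graphs work in RAG: mainly how they are organized and how they are used at search time.

RAGFlow happens to use one, so I read its code and printed logs during testing to understand how its knowledge graph works.

The RAGFlow version at the time of writing is 0.18.0.

I cover two aspects: how the knowledge graph is built and organized (and stored), and how it is searched.

Short version

The walkthrough further down is so long-winded that even I get tired of reading it, so here is a short version up front.

Creation: an LLM (prompt below) extracts entities and relations and summarizes a description for each. They are stored in ES. There are no foreign keys linking them; they are just independent strings.
Search: an LLM extracts keywords and entities from the user's question, then a vector search over the knowledge graph finds the relevant edges and nodes. RAGFlow does not go on to follow node relationships to other documents.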

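Building the knowledge graph

After a user adds a new document, task_executor first parses the file, creates chunks, and writes them to the vector store (we use ES).

Once the file is parsed, queue_raptor_o_graphrag_tasks is called to trigger the knowledge graph task.

Then run_graphrag starts building the knowledge graph.

There are two major steps: generate_subgraph -> merge_subgraph

generate_subgraph builds the graph for the new document; merge_subgraph then merges that graph into the knowledge graph of the whole knowledge base.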

subgraph = await generate_subgraph(
    LightKGExt
    if row["kb_parser_config"]["graphrag"]["method"] != "general"
    else GeneralKGExt,
    tenant_id,
    kb_id,
    doc_id,
    chunks,
    language,
    row["kb_parser_config"]["graphrag"]["entity_types"],
    chat_model,
    embedding_model,
    callback,
)

# ...

new_graph = await merge_subgraph(
    tenant_id,
    kb_id,
    doc_id,
    subgraph,
    embedding_model,
    callback,
)

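The core is generate_subgraph, so let's look at it in detail:

It initializes an Extractor instance; the entity_types argument comes from the UI configuration. Extractor implements __call__. (There are two Extractor implementations, GeneralKGExt and LightKGExt. The former is the general one and the latter the lightweight one, though I did not notice any major difference between them.)
It calls __call__, which returns all nodes and edges; these are used to build the subgraph.

The heart of __call__ is _process_single_content.

_process_single_content is called for every chunk, and all results are collected in out_results.

Inside _process_single_content the knowledge graph is extracted by an LLM. Here is the prompt: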

-Goal-
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized, in language of 'Text'
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities in language of 'Text'
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other in language of 'Text'
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
 Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_strength>)

3. Return output as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.

4. When finished, output {completion_delimiter}

######################
-Examples-
######################
Example 1:

Entity_types: [person, technology, mission, organization, location]
Text:
while Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor's authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan's shared commitment to discovery was an unspoken rebellion against Cruz's narrowing vision of control and order.

Then Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. “If this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us.”

The underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor's, a wordless clash of wills softening into an uneasy truce.

It was a small transformation, barely perceptible, but one that Alex noted with an inward nod. They had all been brought here by different paths
################
Output:
("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is a character who experiences frustration and is observant of the dynamics among other characters."){record_delimiter}
("entity"{tuple_delimiter}"Taylor"{tuple_delimiter}"person"{tuple_delimiter}"Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective."){record_delimiter}
("entity"{tuple_delimiter}"Jordan"{tuple_delimiter}"person"{tuple_delimiter}"Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device."){record_delimiter}
("entity"{tuple_delimiter}"Cruz"{tuple_delimiter}"person"{tuple_delimiter}"Cruz is associated with a vision of control and order, influencing the dynamics among other characters."){record_delimiter}
("entity"{tuple_delimiter}"The Device"{tuple_delimiter}"technology"{tuple_delimiter}"The Device is central to the story, with potential game-changing implications, and is revered by Taylor."){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Taylor"{tuple_delimiter}"Alex is affected by Taylor's authoritarian certainty and observes changes in Taylor's attitude towards the device."{tuple_delimiter}7){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Jordan"{tuple_delimiter}"Alex and Jordan share a commitment to discovery, which contrasts with Cruz's vision."{tuple_delimiter}6){record_delimiter}
("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"Jordan"{tuple_delimiter}"Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce."{tuple_delimiter}8){record_delimiter}
("relationship"{tuple_delimiter}"Jordan"{tuple_delimiter}"Cruz"{tuple_delimiter}"Jordan's commitment to discovery is in rebellion against Cruz's vision of control and order."{tuple_delimiter}5){record_delimiter}
("relationship"{tuple_delimiter}"Taylor"{tuple_delimiter}"The Device"{tuple_delimiter}"Taylor shows reverence towards the device, indicating its importance and potential impact."{tuple_delimiter}9){completion_delimiter}
#############################
Example 2:

Entity_types: [person, technology, mission, organization, location]
Text:
They were no longer mere operatives; they had become guardians of a threshold, keepers of a message from a realm beyond stars and stripes. This elevation in their mission could not be shackled by regulations and established protocols—it demanded a new perspective, a new resolve.

Tension threaded through the dialogue of beeps and static as communications with Washington buzzed in the background. The team stood, a portentous air enveloping them. It was clear that the decisions they made in the ensuing hours could redefine humanity's place in the cosmos or condemn them to ignorance and potential peril.

Their connection to the stars solidified, the group moved to address the crystallizing warning, shifting from passive recipients to active participants. Mercer's latter instincts gained precedence— the team's mandate had evolved, no longer solely to observe and report but to interact and prepare. A metamorphosis had begun, and Operation: Dulce hummed with the newfound frequency of their daring, a tone set not by the earthly
#############
Output:
("entity"{tuple_delimiter}"Washington"{tuple_delimiter}"location"{tuple_delimiter}"Washington is a location where communications are being received, indicating its importance in the decision-making process."){record_delimiter}
("entity"{tuple_delimiter}"Operation: Dulce"{tuple_delimiter}"mission"{tuple_delimiter}"Operation: Dulce is described as a mission that has evolved to interact and prepare, indicating a significant shift in objectives and activities."){record_delimiter}
("entity"{tuple_delimiter}"The team"{tuple_delimiter}"organization"{tuple_delimiter}"The team is portrayed as a group of individuals who have transitioned from passive observers to active participants in a mission, showing a dynamic change in their role."){record_delimiter}
("relationship"{tuple_delimiter}"The team"{tuple_delimiter}"Washington"{tuple_delimiter}"The team receives communications from Washington, which influences their decision-making process."{tuple_delimiter}7){record_delimiter}
("relationship"{tuple_delimiter}"The team"{tuple_delimiter}"Operation: Dulce"{tuple_delimiter}"The team is directly involved in Operation: Dulce, executing its evolved objectives and activities."{tuple_delimiter}9){completion_delimiter}
#############################
Example 3:

Entity_types: [person, role, technology, organization, event, location, concept]
Text:
their voice slicing through the buzz of activity. "Control may be an illusion when facing an intelligence that literally writes its own rules," they stated stoically, casting a watchful eye over the flurry of data.

"It's like it's learning to communicate," offered Sam Rivera from a nearby interface, their youthful energy boding a mix of awe and anxiety. "This gives talking to strangers' a whole new meaning."

Alex surveyed his team—each face a study in concentration, determination, and not a small measure of trepidation. "This might well be our first contact," he acknowledged, "And we need to be ready for whatever answers back."

Together, they stood on the edge of the unknown, forging humanity's response to a message from the heavens. The ensuing silence was palpable—a collective introspection about their role in this grand cosmic play, one that could rewrite human history.

The encrypted dialogue continued to unfold, its intricate patterns showing an almost uncanny anticipation
#############
Output:
("entity"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"person"{tuple_delimiter}"Sam Rivera is a member of a team working on communicating with an unknown intelligence, showing a mix of awe and anxiety."){record_delimiter}
("entity"{tuple_delimiter}"Alex"{tuple_delimiter}"person"{tuple_delimiter}"Alex is the leader of a team attempting first contact with an unknown intelligence, acknowledging the significance of their task."){record_delimiter}
("entity"{tuple_delimiter}"Control"{tuple_delimiter}"concept"{tuple_delimiter}"Control refers to the ability to manage or govern, which is challenged by an intelligence that writes its own rules."){record_delimiter}
("entity"{tuple_delimiter}"Intelligence"{tuple_delimiter}"concept"{tuple_delimiter}"Intelligence here refers to an unknown entity capable of writing its own rules and learning to communicate."){record_delimiter}
("entity"{tuple_delimiter}"First Contact"{tuple_delimiter}"event"{tuple_delimiter}"First Contact is the potential initial communication between humanity and an unknown intelligence."){record_delimiter}
("entity"{tuple_delimiter}"Humanity's Response"{tuple_delimiter}"event"{tuple_delimiter}"Humanity's Response is the collective action taken by Alex's team in response to a message from an unknown intelligence."){record_delimiter}
("relationship"{tuple_delimiter}"Sam Rivera"{tuple_delimiter}"Intelligence"{tuple_delimiter}"Sam Rivera is directly involved in the process of learning to communicate with the unknown intelligence."{tuple_delimiter}9){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"First Contact"{tuple_delimiter}"Alex leads the team that might be making the First Contact with the unknown intelligence."{tuple_delimiter}10){record_delimiter}
("relationship"{tuple_delimiter}"Alex"{tuple_delimiter}"Humanity's Response"{tuple_delimiter}"Alex and his team are the key figures in Humanity's Response to the unknown intelligence."{tuple_delimiter}8){record_delimiter}
("relationship"{tuple_delimiter}"Control"{tuple_delimiter}"Intelligence"{tuple_delimiter}"The concept of Control is challenged by the Intelligence that writes its own rules."{tuple_delimiter}7){completion_delimiter}
#############################
-Real Data-
######################
Entity_types: {entity_types}
Text: {input_text}
######################
Output:

There is also a follow-up step: the LLM is asked again whether any entities or relationships were missed.

CONTINUE_PROMPT = "MANY entities were missed in the last extraction.  Add them below using the same format:\n"
LOOP_PROMPT = "It appears some entities may have still been missed. Answer Y if there are still entities that need to be added, or N if there are none. Please answer with a single letter Y or N.\n"

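After _process_single_content has run for every chunk, all results are in out_results.

What follows is mostly plumbing, roughly:

Iterate over out_results and put the data into maybe_nodes and maybe_edges.
Merge maybe_nodes and maybe_edges to get all_entities_data and all_relationships_data, and return them.
Iterate over all_entities_data and all_relationships_data, calling add_node and add_edge to build a subgraph, and return it. This is the generate_subgraph step of the two major steps mentioned earlier.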
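
The parse-and-group part of this plumbing can be sketched in plain Python. This is not RAGFlow's code: the delimiter values, the function names, and the choice to treat edges as undirected are all assumptions of the sketch; only the record layout follows the prompt above.

```python
from collections import defaultdict

# Hypothetical delimiter values; the prompt only names the placeholders.
TUPLE_DELIMITER = "<|>"
RECORD_DELIMITER = "##"
COMPLETION_DELIMITER = "<|COMPLETE|>"


def parse_records(llm_output: str):
    """Split one LLM response into (kind, fields) tuples."""
    body = llm_output.replace(COMPLETION_DELIMITER, "")
    for record in body.split(RECORD_DELIMITER):
        record = record.strip()
        if not (record.startswith("(") and record.endswith(")")):
            continue
        fields = [f.strip().strip('"') for f in record[1:-1].split(TUPLE_DELIMITER)]
        yield fields[0], fields[1:]


def merge(out_results):
    """Group per-chunk results by entity name and by (source, target) pair."""
    maybe_nodes, maybe_edges = defaultdict(list), defaultdict(list)
    for llm_output in out_results:
        for kind, fields in parse_records(llm_output):
            if kind == "entity" and len(fields) == 3:
                name, etype, desc = fields
                maybe_nodes[name].append({"entity_type": etype, "description": desc})
            elif kind == "relationship" and len(fields) == 4:
                src, tgt, desc, weight = fields
                # Assumption: edges are undirected, so the pair is sorted.
                maybe_edges[tuple(sorted((src, tgt)))].append(
                    {"description": desc, "weight": float(weight)}
                )
    return dict(maybe_nodes), dict(maybe_edges)


# Two made-up chunk results that both mention the same entity.
chunk_1 = (
    '("entity"<|>"JBOD"<|>"CATEGORY"<|>"Disks attached without RAID.")##'
    '("entity"<|>"DISK"<|>"CATEGORY"<|>"A hard drive.")##'
    '("relationship"<|>"JBOD"<|>"DISK"<|>"JBOD lets disks work independently."<|>7)<|COMPLETE|>'
)
chunk_2 = '("entity"<|>"JBOD"<|>"CATEGORY"<|>"Just a bunch of disks.")<|COMPLETE|>'
nodes, edges = merge([chunk_1, chunk_2])
print(len(nodes["JBOD"]), edges[("DISK", "JBOD")][0]["weight"])
```

An entity seen in two chunks ends up with two candidate descriptions under one name, which is the situation the merge step then has to resolve into a single node.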

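Data structures

Data structures during parsing

The https://networkx.org/ library is used to create and manipulate the graph. The subgraph mentioned above is a NetworkX Graph object.

Data structures in storage (ES)

Parsing yields three kinds of data: entity (nodes, many), relation (edges, many), and graph (one).

entity

knowledge_graph_kwd: "entity" marks the record as an entity (node).

entity_type_kwd is the node type. The available types are configured by the user in the RAGFlow UI.

description is generated by the LLM from the document (and from its own knowledge). It is returned by later searches and handed to the LLM as part of the knowledge-base content used to compose the answer.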

{
  "knowledge_graph_kwd": "entity",
  "entity_kwd": "JBOD",
  "entity_type_kwd": "CATEGORY",
  "content_with_weight": "{\"entity_type\": \"CATEGORY\", \"description\": \"JBOD refers to a storage configuration where multiple hard drives are connected without RAID, allowing for independent operation.\", \"source_id\": [\"ab6bae962a3c11f09c470242ac1b0002\"], \"entity_name\": \"JBOD\", \"pagerank\": 0.037568948943104266}",
}

relation

{
  "knowledge_graph_kwd": "relation",
  "from_entity_kwd": "JBOD",
  "to_entity_kwd": "硬盘",
  "content_with_weight": "{\"src_id\": \"JBOD\", \"tgt_id\": \"硬盘\", \"description\": \"JBOD配置允许硬盘独立工作，适用于不需要RAID的环境。\", \"keywords\": [\"存储配置, 硬件管理\"], \"weight\": 7.0, \"source_id\": [\"ab6bae962a3c11f09c470242ac1b0002\"]}",
}
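
To make the "no foreign keys, just strings" point concrete, here is a small sketch that works on trimmed copies of the two records above. The field selection is mine; the values are taken from the records.

```python
import json

# The two records shown above, trimmed to the fields used here.
entity = {
    "knowledge_graph_kwd": "entity",
    "entity_kwd": "JBOD",
    "content_with_weight": '{"entity_type": "CATEGORY", "entity_name": "JBOD", "pagerank": 0.037568948943104266}',
}
relation = {
    "knowledge_graph_kwd": "relation",
    "from_entity_kwd": "JBOD",
    "to_entity_kwd": "硬盘",
    "content_with_weight": '{"src_id": "JBOD", "tgt_id": "硬盘", "weight": 7.0}',
}

# content_with_weight is itself a JSON string and has to be decoded separately.
entity_payload = json.loads(entity["content_with_weight"])
relation_payload = json.loads(relation["content_with_weight"])

# The only "link" between the two records is that the names compare equal.
linked = relation["from_entity_kwd"] == entity["entity_kwd"]
print(linked, entity_payload["pagerank"], relation_payload["weight"])
```

Nothing in storage enforces that from_entity_kwd refers to an existing entity record; the association exists only because the two strings happen to match.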

graph

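graph is the overall graph containing all nodes and edges. It does not seem to be used in search, only for display in the front end.

Searching the knowledge graph

RAGFlow's implementation differs from what I had imagined.

What I expected: first find relevant documents in the vector store, then find relevant edges and nodes in the knowledge graph, then use those nodes to fetch the corresponding documents. All these documents would be concatenated and handed to the LLM to answer the question.

What RAGFlow does: first find relevant documents in the vector store, then search the knowledge graph directly with the user's question, score the nodes and edges with a specific algorithm, and return the top-N results to the LLM. It is mainly their description that gets used.

The code does leave some hooks, though, that might later allow fetching the documents behind the graph's edges and nodes:

kb_prompt uses the doc_id attribute of nodes and edges to fetch the corresponding documents from the vector store. For now, however, the doc_id of edges and nodes is still an empty string.
At present it only adds the document's meta information.
If documents really were to be fetched, it should be done by chunk_id rather than doc_id.

There are three places in the RAGFlow UI where you can search.

The knowledge base's "Retrieval testing" page, where you can tick "Knowledge graph".
The "Search" page at the top, which can search several knowledge bases at once but has no knowledge graph option.
A chat assistant you create, where you can choose the knowledge bases and whether to use the knowledge graph.

We use the third kind of search to look at the implementation.

Core code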

# Retrieve from the knowledge base by keywords and vectors (knowledge graph not included)
kbinfos = retriever.retrieval(
    " ".join(questions),
    embd_mdl,
    tenant_ids,
    dialog.kb_ids,
    1,
    dialog.top_n,
    dialog.similarity_threshold,
    dialog.vector_similarity_weight,
    doc_ids=attachments,
    top=dialog.top_k,
    aggs=False,
    rerank_mdl=rerank_mdl,
    rank_feature=label_question(" ".join(questions), kbs),
)

# Knowledge graph retrieval, analyzed in detail below
ck = settings.kg_retrievaler.retrieval(
    " ".join(questions),
    tenant_ids,
    dialog.kb_ids,
    embd_mdl,
    LLMBundle(dialog.tenant_id, LLMType.CHAT),
)

kbinfos["chunks"].insert(0, ck)

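## Enrich the data; currently this adds the meta information of each doc
knowledges = kb_prompt(kbinfos, max_tokens)

# Have the LLM answer the question based on knowledges, then post-process the result so that it renders better as HTML.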
# ...

Core code of knowledge graph retrieval, settings.kg_retrievaler.retrieval:

# Use the LLM to extract keyword_type (always empty in the current version) and entities
ty_kwds, ents = self.query_rewrite(llm, qst, [index_name(tid) for tid in tenant_ids], kb_ids)

# Knowledge graph retrieval

## Vector search over entities (nodes)
ents_from_query = self.get_relevant_ents_by_keywords(ents, filters, idxnms, kb_ids, emb_mdl, ent_sim_threshold)
## Tokenized (non-vector) search on keyword_type
ents_from_types = self.get_relevant_ents_by_types(ty_kwds, filters, idxnms, kb_ids, 10000)
## Vector search over relations (edges)
rels_from_txt = self.get_relevant_relations_by_txt(qst, filters, idxnms, kb_ids, emb_mdl, rel_sim_threshold)

# Score the edges and nodes

# Concatenate the data of all nodes and edges in a simple CSV format and return it
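
The last two steps (keep the best-scored items, then render them as CSV text for the LLM) can be sketched like this. The scoring itself is deliberately left out: the scores below are made-up inputs, and the function name and column names are assumptions, not RAGFlow's.

```python
import csv
import io


def top_n_as_csv(rows, n, columns):
    """Keep the n highest-scoring rows and render them as CSV text."""
    best = sorted(rows, key=lambda r: r["score"], reverse=True)[:n]
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(best)
    return buf.getvalue()


# Hypothetical scored entities; real scores would come from the retrieval step.
entities = [
    {"entity": "JBOD", "score": 0.9, "description": "Disks attached without RAID."},
    {"entity": "RAID", "score": 0.4, "description": "Redundant array of disks."},
    {"entity": "DISK", "score": 0.7, "description": "A hard drive."},
]
text = top_n_as_csv(entities, n=2, columns=["entity", "score", "description"])
print(text)
```

What reaches the LLM is therefore a small table of names and descriptions, not the documents the nodes were extracted from.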

Community reports

Not studied yet; to be added later.

20 Aug, 2024
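OIDC login and access control with nginx and vouch-proxy
Requirements

I built a web app with separate front end and back end (Vue.js 3 in front, Go behind).

I need access control so that certain "dangerous" operations can only be performed by specific people. I define dangerous operations as the POST, PUT, DELETE and similar methods.

Access control first requires user information, preferably tied to the company's LDAP.

I did not want to implement this from scratch; I wanted to get there with existing third-party components and a little configuration. Our company also has an OIDC service, so components that can use OIDC are preferred.

Those are my requirements.

OIDC

I chose vouch-proxy and nginx for the job. (The company already has an OIDC service; without one you may have to set up your own.)

Requesting an OIDC client

After applying you get a clientid and a clientsecret, both of which go into the vouch config (example at the end).

The application asks for a callback address. The same address goes into the vouch config file, and it is best to get it exactly right, including the scheme and the trailing /. Some OIDC services are strict about this.

The OIDC provider also gives you the address of the OIDC service, for example https://oidc.corp.com.

vouch, however, needs both token_url and user_info_url configured. If the OIDC provider did not give you these two addresses, you can look them up yourself at https://oidc.corp.com/.well-known/openid-configuration. (Some OIDC components fetch this address on their own, so the user does not have to configure anything.)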
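
Looking the addresses up amounts to reading three fields of the discovery document. The sketch below parses a made-up, trimmed document for the example issuer instead of doing a network call; authorization_endpoint, token_endpoint and userinfo_endpoint are the standard discovery field names, while the helper name is invented here.

```python
import json

# A trimmed, made-up discovery document for the example issuer.
discovery_json = """
{
  "issuer": "https://oidc.corp.com",
  "authorization_endpoint": "https://oidc.corp.com/authorize",
  "token_endpoint": "https://oidc.corp.com/token",
  "userinfo_endpoint": "https://oidc.corp.com/userinfo"
}
"""


def vouch_urls(discovery: dict) -> dict:
    """Map OIDC discovery fields to the vouch config keys used in this post."""
    return {
        "auth_url": discovery["authorization_endpoint"],
        "token_url": discovery["token_endpoint"],
        "user_info_url": discovery["userinfo_endpoint"],
    }


urls = vouch_urls(json.loads(discovery_json))
for key, value in urls.items():
    print(f"{key}: {value}")
```

Against a real provider the JSON would come from fetching the .well-known/openid-configuration address, for instance with urllib.request.urlopen.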

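Domain names

Our own page is app.corp.com

The vouch service gets its own domain name, vouch.app.corp.com

Two notes:
1. They should share the same parent domain, which makes it easy to set the cookie.
2. A single domain name might also work with some nginx configuration. I have not tried it and will not discuss it here.
How OIDC works

Here is how OIDC works in combination with vouch-proxy. Knowing the mechanism makes later problems easier to deal with.

OIDC itself does not provide single sign-on, so nothing below involves single sign-on.
1. When a page is requested, nginx hands the request to vouch through auth_request for access control. Note that "access control" here (I am not sure what the more precise term is) does not mean checking whether the user may perform a specific operation; in this scenario it just means checking whether the user is logged in. Readers unfamiliar with the nginx auth_request module should read up on it first; it is not explained here.
2. vouch uses a cookie to decide whether the user is logged in and to obtain the user name and other information. Where the cookie comes from is covered below.
3. On the first visit the cookie is certainly missing, so vouch returns 401 as the auth_request protocol requires. Through the auth_request module nginx learns that the user is not logged in and issues a 302 redirect to a login page; here it is configured to redirect to vouch.app.corp.com/login
4. vouch's login handler redirects to the OIDC login page according to the OIDC configuration.
5. After the OIDC login, the browser is redirected back according to redirect_url to vouch's vouch.app.corp.com/auth, with a code parameter attached.
6. vouch sends the code to the OIDC service; if the code is valid, user information is returned.
7. Once the auth handler has the user information, it issues a Set-Cookie that stores a token (an encrypted string) under the app.corp.com domain.
8. When the user opens app.corp.com again, the request is handed to vouch as in step 1 to decide whether access is allowed.
9. vouch gets the user information from the token in the cookie and returns 200 right away: access granted.
10. In nginx, auth_request_set passes the user name on to the proxy_pass backend for other uses. I do not use it at this point yet; the access control below does.
That completes the OIDC part: users must now log in to use app.corp.com.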

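Access control

This part is done purely in nginx configuration, because auth_request_set already gives us the user name inside nginx. It was still a struggle, mainly because nginx directives do not follow the normal way of thinking about code (my personal impression). OpenResty might be a better fit.

I will not go into much detail; here are the failed attempts, followed by the configuration that currently works.
1. The simplest idea is the if directive. But wherever it is written in the configuration, if runs before auth_request, so the user information is not available yet. Someone has described the problem and their attempts in detail (https://serverfault.com/questions/1121736/nginx-is-not-handling-auth-request-before-if-statement). The replies to that post are also where I found the next thing to try.
2. Following one of the replies to the question in item 1, I tried a map. That failed too, and for a strange reason: when a variable is used as the proxy_pass backend, the URL reaches the backend completely truncated and nothing works. I could not find official documentation on this, so I tried several configurations blindly, and all failed. In the end I gave up on this seemingly elegant and fairly simple approach.
3. At one point I wanted to do this inside vouch, but that turned out to be unsuitable. If vouch returns 401, the user is sent to the OIDC login again and is left confused ("I have already logged in!") in an endless loop. So vouch is not the place for it.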
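
For orientation, the rule that the working nginx configuration enforces with its chain of if blocks can be written as one small function. This is only a restatement in Python for readability, not something that runs in the deployment; the function name is made up, and the two user names are the example users from the configuration.

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}
PRIVILEGED_USERS = {"zhangsan", "lisi"}  # example users from the nginx config


def is_allowed(method: str, vouch_user: str) -> bool:
    """Allow read-only methods for everyone and write methods for listed users."""
    return method in SAFE_METHODS or vouch_user in PRIVILEGED_USERS


for method, user in [("GET", "wangwu"), ("POST", "wangwu"), ("DELETE", "lisi")]:
    print(method, user, 200 if is_allowed(method, user) else 403)
```

The nginx version has to spell this out as separate if blocks setting a flag because the if directive cannot combine conditions.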
Configuration

nginx configuration
```
upstream healer {
    server 127.0.0.1:8001;
}

server {
    access_log  off;
    listen 8001;

    set $allow 0;

    if ($request_method = GET) {
      set $allow 1;
    }
    if ($request_method = HEAD) {
      set $allow 1;
    }
    if ($request_method = OPTIONS) {
      set $allow 1;
    }
    if ($http_x_vouch_user = "zhangsan" ) {
      set $allow 1;
    }
    if ($http_x_vouch_user = "lisi" ) {
      set $allow 1;
    }

    if ($allow = 0) {
      return 403;
    }

    location / {
         proxy_pass http://127.0.0.1:8000;
    }
}

server {
    log_format  main  '$remote_addr $auth_resp_x_vouch_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for" "$request_body"';

    access_log  /opt/logs/access.log  main;

    listen       8080 default_server;
    server_name  app.corp.com;

    auth_request /validate;
    location = /validate {
        # forward the /validate request to Vouch Proxy
        proxy_pass http://127.0.0.1:9090/validate;

        # be sure to pass the original host header
        proxy_set_header Host $http_host;

        # Vouch Proxy only acts on the request headers
        proxy_pass_request_body off;
        proxy_set_header Content-Length "";

        # optionally add X-Vouch-User as returned by Vouch Proxy along with the request
        auth_request_set $auth_resp_x_vouch_user $upstream_http_x_vouch_user;

        # these return values are used by the @error401 call
        auth_request_set $auth_resp_jwt $upstream_http_x_vouch_jwt;
        auth_request_set $auth_resp_err $upstream_http_x_vouch_err;
        auth_request_set $auth_resp_failcount $upstream_http_x_vouch_failcount;
    }

    error_page 401 = @error401;
    location @error401 {
      return 302 http://vouch.app.corp.com/login?url=$scheme://$http_host$request_uri&vouch-failcount=$auth_resp_failcount&X-Vouch-Token=$auth_resp_jwt&error=$auth_resp_err;

    }

    location / {
        if ($http_user_agent ~* HealthChecker) {
            return 200;
        }
        if ($http_user_agent ~* kube-probe) {
            return 200;
        }

        root   /opt/kafka-admin/;

        auth_request_set $auth_resp_x_vouch_user $upstream_http_x_vouch_user;
        auth_request_set $auth_resp_x_vouch_idp_claims_groups $upstream_http_x_vouch_idp_claims_groups;
        auth_request_set $auth_resp_x_vouch_idp_claims_given_name $upstream_http_x_vouch_idp_claims_given_name;
    }

    location /api/ {
        auth_request_set $auth_resp_x_vouch_user $upstream_http_x_vouch_user;
        auth_request_set $auth_resp_x_vouch_idp_claims_groups $upstream_http_x_vouch_idp_claims_groups;
        auth_request_set $auth_resp_x_vouch_idp_claims_given_name $upstream_http_x_vouch_idp_claims_given_name;

        proxy_pass http://healer/;
        proxy_set_header X-Vouch-User $auth_resp_x_vouch_user;
    }


    #error_page  404              /404.html;

    # redirect server error pages to the static page /50x.html
    #
    error_page   500 502 503 504  /50x.html;
    location = /50x.html {
        root   /usr/share/nginx/html;
    }
}

server {
    listen 8080 ;
    server_name vouch.app.corp.com;
    location / {
        proxy_pass http://127.0.0.1:9090;
        # be sure to pass the original host header
        proxy_set_header Host vouch.app.corp.com;
    }
}
```
vouch OIDC configuration
```
vouch:
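  session: # When deployed on several machines behind one LB, session.key must be identical on all of them. If it is not set here, a random key is generated and authentication fails.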
    key: 4LjeTY9/E17RvQx1itF0p6CsfFuqhgQiNtVQQGh32is=

  #whitelist:
    #- rmself
    #- childe

  #allowAllUsers: true

  domains:
  - corp.com # Used to check that the email returned by OIDC belongs to one of these domains; if not, the result is 401.
  cookie:
    secure: false
    # vouch.cookie.domain must be set when enabling allowAllUsers
    domain: app.corp.com

oauth:
  # Generic OpenID Connect
  provider: oidc
  client_id: xxxxxx
  client_secret: xxxxxx
  auth_url: https://oidc.corp.com/authorize
  token_url: https://oidc.corp.com/token
  user_info_url: https://oidc.corp.com/userinfo
  scopes:
    - openid
    - email
    - profile
  callback_url: http://vouch.app.corp.com/auth
```
3 Feb, 2024
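A problem with Python black formatting
Just a faithful record.

I write Python scripts with neovim + coc. The LSP is pyright, and black does the formatting, configured to run automatically on save.

But black cannot sort imports.

coc-python supports sorting imports, and lets the user choose between pyright and isort. I was using pyright.

I could not find how to make coc-python sort imports automatically on save, so I added a line to my vim configuration: autocmd BufWritePre *.py :silent call CocAction('runCommand', 'editor.action.organizeImport')

That is the background. Code formatting worked fine for a while, until today.

Some parameters in the project are configured through environment variables. Today I started using dotenv in the project to load the environment variables I had configured.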
```
- main.py
- utils
  - common.py
```
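The project layout is roughly as above. common.py loads some parameters from environment variables. I called load_dotenv in main.py, but by then it was too late: main.py imports utils.common before load_dotenv runs, so the environment variables have already been read at that point, without the values load_dotenv provides.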
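
The timing problem can be reproduced without dotenv at all. In this sketch APP_MODE is a hypothetical variable name, and a plain assignment to os.environ stands in for load_dotenv; the module-level constant plays the role of a parameter in common.py.

```python
import os

os.environ.pop("APP_MODE", None)  # start from a clean state

# What a module-level constant in common.py amounts to: the value is read
# once, at import time.
MODE_AT_IMPORT = os.environ.get("APP_MODE", "unset")

# Stand-in for load_dotenv(): it only fills os.environ and cannot reach back
# into values that were already copied out.
os.environ["APP_MODE"] = "production"


def mode_now() -> str:
    """Reading inside a function defers the lookup until call time."""
    return os.environ.get("APP_MODE", "unset")


print(MODE_AT_IMPORT, mode_now())
```

The constant keeps the value from before the environment was filled, which is why the load has to happen before the import, or the lookup has to be deferred.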

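The fix is of course simple: put from dotenv import load_dotenv; load_dotenv() at the very top of main.py. But then the trouble starts: on save, load_dotenv gets moved below all the imports!

Finding a solution took a while. Here is why it was fiddly:

black itself supports #fmt: off and #fmt: on to exclude a stretch of code from formatting. But pyright's import sorting does not honor black's markers, and it has no skip-format feature of its own.

In the end I switched coc-python to isort for sorting imports.

Adding profile=black to the isort configuration was enough. With that, both black and isort respect the fmt: off syntax.

My isort is installed at ~/.vim/isort/bin/isort, and the isort configuration lives there too.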
```
# cat ~/.vim/isort/.isort.cfg
[settings]
profile=black
```
main.py then looks like this
```
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# fmt: off
from dotenv import load_dotenv

load_dotenv()
# fmt: on

import argparse
import json
```
16 Jun, 2023
golang-klog
klog

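klog is a permanent fork of glog, made in order to improve on glog.

Why was klog created?

glog is no longer actively developed. It has some shortcomings, and we also need some new features, so the only option was to create klog.

Let's go through glog's shortcomings and the new features that are needed.
1. glog has many gotchas, which makes it challenging to use in containerized environments.
2. glog provides no convenient way to test it, which affects the reliability of software that uses it.
3. There is also a long-term goal: a logging interface that allows adding context, changing the output format, and so on.
glog's flaws

The Kubernetes project has an issue that records glog's flaws, "Use of glog for logging is problematic". Let's take a look.
1. glog registers many flags in init, and its behavior cannot be configured programmatically. Users of the k8s libraries may be surprised that they have to call flag.Parse(), which is error-prone (for example, there was no way to configure glog in coredns/coredns#1597).
2. By default glog logs to files on disk. Users of a library generally do not expect it to write files without being explicitly configured to.
3. Worse, if glog cannot create the file, it calls os.Exit. This can be very harmful and is easy to trigger, especially when running a containerized binary on a read-only root filesystem.
4. glog does nothing to manage the files it writes, so without something like logrotate (especially in containers) log files just keep accumulating. logrotate does not seem to handle glog's logs easily either.
klog's improvements
1. Call klog.InitFlags(nil) to initialize the global flags explicitly, because flags are no longer registered in init().
2. You can now use log_file instead of log_dir to log to a single file.
3. To redirect everything logged through klog somewhere else (syslog, for example!), call klog.SetOutput() with an io.Writer.
4. For more on logging conventions, see Logging Conventions.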

15 Feb, 2023

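Does the cgroup memory limit count page cache?

Test environment

OS: CentOS Linux release 7.6.1810 (Core) ; KERNEL 5.10.56 ; Cgroup: v1

The question

When a process in a cgroup reads or writes files, page cache is created. Does that memory count towards the cgroup memory limit?

My gut feeling was no (in fact it does), because if it did, it would raise the following question:

Question 2: how does the system know which process a piece of page cache belongs to?

If A reads a file and then B reads it too, is the page cache charged to both A and B?

If A reads it and B then uses it, is it charged to B and removed from A?

Result

Straight to the test results.

Page cache is counted towards the cgroup limit.

But when memory goes over, the OOM killer is not triggered right away; part of the page cache is reclaimed instead.

Page cache stays charged to the first process that used it (and that process's cgroup).

Reclaim policy

One more remark: cgroup v2's policy for reclaiming page cache should be better than v1's; see the following: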

Another important topic in cgroup v2, which was unachievable with the previous v1, is a proper way of tracking Page Cache IO writebacks. The v1 can’t understand which memory cgroup generates disk IOPS and therefore, it incorrectly tracks and limits disk operations. Fortunately, the new v2 version fixes these issues. It already provides a bunch of new features which can help with Page Cache writeback.

https://biriukov.dev/docs/page-cache/6-cgroup-v2-and-page-cache/

Verification

Preparation

Create a few files for later use

# dd if=/dev/random of=1 bs=1000000 count=500
500+0 records in
500+0 records out
500000000 bytes (500 MB) copied, 3.03696 s, 165 MB/s
[root@test-page-cache]# for i in {1..6} ; do dd if=/dev/random of=$i bs=1000000 count=500 ; done
500+0 records in
500+0 records out
500000000 bytes (500 MB) copied, 3.13473 s, 160 MB/s
500+0 records in
500+0 records out
500000000 bytes (500 MB) copied, 3.06832 s, 163 MB/s
500+0 records in
500+0 records out
500000000 bytes (500 MB) copied, 3.04262 s, 164 MB/s
500+0 records in
500+0 records out
500000000 bytes (500 MB) copied, 3.03191 s, 165 MB/s
500+0 records in
500+0 records out
500000000 bytes (500 MB) copied, 3.04711 s, 164 MB/s
500+0 records in
500+0 records out
500000000 bytes (500 MB) copied, 3.07169 s, 163 MB/s

Drop the page cache

[root@test-page-cache]# free -h
              total        used        free      shared  buff/cache   available
Mem:            22G        5.8G        2.8G        365M         14G         16G
Swap:          8.0G         16M        8.0G
[root@test-page-cache]# echo 3 > /proc/sys/vm/drop_caches
[root@test-page-cache]# free -h
              total        used        free      shared  buff/cache   available
Mem:            22G        5.3G         16G        365M        1.2G         16G
Swap:          8.0G         16M        8.0G

Start a docker container with a memory limit

docker run --rm -ti -v $PWD:/tmp/test-page-cache -w /tmp/test-page-cache --memory 1G alpine:3.16.2

Test 1

Test 1 checks whether page cache counts towards the limit, and what happens when the limit is hit.

Read two of the files generated earlier (500 MB each)

/tmp/test-page-cache # cat 1 > /dev/null
/tmp/test-page-cache # cat 2 > /dev/null

Check the page cache

[root@89e16b39fdc6da0f6c0e4d936aa0df8141ebe20d033fd8a7540ad0d550d8163c]# free -h
              total        used        free      shared  buff/cache   available
Mem:            22G        5.3G         15G        365M        2.6G         16G
Swap:          8.0G         16M        8.0G

# pwd
/sys/fs/cgroup/memory/docker/89e16b39fdc6da0f6c0e4d936aa0df8141ebe20d033fd8a7540ad0d550d8163c
# cat memory.usage_in_bytes
1073512448
# cat memory.stat
cache 1069584384
rss 106496
rss_huge 0
shmem 0
mapped_file 0
dirty 0
writeback 0
swap 0
pgpgin 488796
pgpgout 227651
pgfault 924
pgmajfault 0
inactive_anon 102400
active_anon 0
inactive_file 1069461504
active_file 0
unevictable 0
hierarchical_memory_limit 1073741824
hierarchical_memsw_limit 2147483648
total_cache 1069584384
total_rss 106496
total_rss_huge 0
total_shmem 0
total_mapped_file 0
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 488796
total_pgpgout 227651
total_pgfault 924
total_pgmajfault 0
total_inactive_anon 102400
total_active_anon 0
total_inactive_file 1069461504
total_active_file 0
total_unevictable 0
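
The numbers above can be checked mechanically. The following is a minimal Python sketch, not part of the original test, that parses a few memory.stat lines copied from the output above and reports how much of the limit is page cache. The helper name parse_memory_stat is made up for illustration.

```python
# Sketch: parse cgroup v1 memory.stat text and compare cache to the limit.
# SAMPLE holds values copied from the output above; parse_memory_stat is a
# hypothetical helper name, not part of any library.

SAMPLE = """\
cache 1069584384
rss 106496
inactive_file 1069461504
active_file 0
hierarchical_memory_limit 1073741824
"""


def parse_memory_stat(text):
    """Turn 'key value' lines into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split()
        stats[key] = int(value)
    return stats


stats = parse_memory_stat(SAMPLE)
limit = stats["hierarchical_memory_limit"]
cache_share = stats["cache"] / limit

print(f"limit: {limit / 2**20:.0f} MiB")
print(f"cache: {stats['cache'] / 2**20:.1f} MiB ({cache_share:.1%} of limit)")
print(f"rss:   {stats['rss'] / 2**10:.0f} KiB")
```

Almost all of the 1 GiB limit is file cache while rss is only about 100 KiB, which is consistent with page cache being charged to the container's cgroup.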

Keep reading files inside the container.

/tmp/test-page-cache # for i in 1 2 3 4 5 6 ; do echo $i ; cat $i > /dev/null ; done
1
2
3
4
5
6
/tmp/test-page-cache #

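The page cache no longer grows: the cgroup's cache stays just under the 1 GiB limit, and the reads finish without the container being killed.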

# free -h
              total        used        free      shared  buff/cache   available
Mem:            22G        5.3G         14G        365M        2.7G         16G
Swap:          8.0G         16M        8.0G
# cat memory.stat
cache 1067421696
rss 241664
rss_huge 0
shmem 0
mapped_file 0
dirty 0
writeback 0
swap 0
pgpgin 1099560
pgpgout 838969
pgfault 1617
pgmajfault 0
inactive_anon 372736
active_anon 0
inactive_file 1067384832
active_file 0
unevictable 0
hierarchical_memory_limit 1073741824
hierarchical_memsw_limit 2147483648
total_cache 1067421696
total_rss 241664
total_rss_huge 0
total_shmem 0
total_mapped_file 0
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 1099560
total_pgpgout 838969
total_pgfault 1617
total_pgmajfault 0
total_inactive_anon 372736
total_active_anon 0
total_inactive_file 1067384832
total_active_file 0
total_unevictable 0
# cat memory.usage_in_bytes
1073528832
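
One way to see the recycling is to compare the two memory.stat snapshots. In cgroup v1, pgpgin and pgpgout count pages charged to and uncharged from the cgroup. The Python sketch below uses the counter values from the two outputs above and assumes a 4 KiB page size, which the test itself does not state.

```python
# Sketch: compare the two memory.stat snapshots shown above.
PAGE_SIZE = 4096  # assumption: 4 KiB pages

before = {"pgpgin": 488796, "pgpgout": 227651, "cache": 1069584384}
after = {"pgpgin": 1099560, "pgpgout": 838969, "cache": 1067421696}

charged_pages = after["pgpgin"] - before["pgpgin"]
uncharged_pages = after["pgpgout"] - before["pgpgout"]
cache_delta = after["cache"] - before["cache"]

print(f"charged:   {charged_pages * PAGE_SIZE / 2**20:.0f} MiB")
print(f"uncharged: {uncharged_pages * PAGE_SIZE / 2**20:.0f} MiB")
print(f"cache change: {cache_delta / 2**20:.1f} MiB")
```

A bit over 2 GiB worth of pages was charged and about the same amount was uncharged, while the cache value barely moved. That fits the picture of old cache pages being reclaimed to make room for new ones instead of the limit being exceeded.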

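A tool such as vmtouch should show that the cache of the earlier files has been evicted. It was inconvenient to install here, so I did not verify this further.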

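Test 2

Test 2 verifies that page cache is charged to the first process that reads the file. In other words, if a file is already in the page cache, reading it again does not increase the cache value of the cgroup that the reading process belongs to.

First drop the page cache on the host to get a clean environment. Then cat a file on the host so that it is loaded into the page cache.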

# echo 3 > /proc/sys/vm/drop_caches
# free -h
              total        used        free      shared  buff/cache   available
Mem:            22G        5.3G         16G        365M        1.2G         16G
Swap:          8.0G         16M        8.0G
[root@VMS172906 ~]# cd /tmp/test-page-cache/
# cat 1 > /dev/null

Then cat 1 again inside the container. The container's memory usage does not increase.

# cat memory.stat
cache 675840
rss 241664
rss_huge 0
shmem 0
mapped_file 0
dirty 0
writeback 0
swap 0
pgpgin 1100286
pgpgout 1100138
pgfault 2673
pgmajfault 0
inactive_anon 507904
active_anon 0
inactive_file 520192
active_file 0
unevictable 0
hierarchical_memory_limit 1073741824
hierarchical_memsw_limit 2147483648
total_cache 675840
total_rss 241664
total_rss_huge 0
total_shmem 0
total_mapped_file 0
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 1100286
total_pgpgout 1100138
total_pgfault 2673
total_pgmajfault 0
total_inactive_anon 507904
total_active_anon 0
total_inactive_file 520192
total_active_file 0
total_unevictable 0

27 Jan, 2023
golang gc
https://tip.golang.org/doc/gc-guide

https://github.com/golang/go/issues/30333
27 Dec, 2022
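A detail in root certificate verification
Companies often use private certificates internally. Once we download the private root certificate and trust it locally, we can access the company's internal HTTPS domains.

More concretely, when we access an HTTPS domain, the site sends its certificate to the client (usually a browser, or a command such as curl). The certificate contains the CN, the public key, and so on. It also carries information about the certificate chain, from the intermediate certificates up to the root. That chain information is incomplete, though: for example, the root certificate's CN may be there, but not its public key data.

So the question is: how does the client know which root certificate to verify this certificate against? By comparing the CN? Or some other field?

If you already know the answer, you can stop reading here. If not, you can also jump straight to the last section.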

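Uh-oh, something broke!

I suddenly got a call from a colleague on a neighboring team: when they accessed one of my applications, app.corp.com, they got a certificate-expired error.

My first reaction: this domain's certificate expired about six months ago and caused an incident. Had it expired again?

I immediately opened my laptop and loaded the site. It worked normally, and the certificate details showed nothing unusual. (Note: there should have been something wrong. Perhaps the Edge browser ignored it? Later, with a different browser, the site would not open.)

But the screenshot my colleague sent said it plainly: "certificate expired".

I then logged in to a server and ran curl -v https://app.corp.com, and there was indeed a problem: the output said the issuer's certificate had expired.

Our domain certificate is issued directly by the root certificate. Did that mean the root certificate had expired?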

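My understanding had been wrong all along

On the one hand I was relieved that this was not my fault. On the other hand I thought: if the root certificate has expired, isn't this going to be a big deal?

Because this is how I had always understood root certificates:

At the very beginning, the company creates a private key and a public key, and uses them to make a root certificate (the CA). This CA is used to issue certificates for the subdomains, and it also has to be distributed to every server and every employee's work computer.

Under that understanding, if the root certificate expires, we would have to make a new root certificate from the key pair, reissue a certificate for every subdomain, and distribute the new root certificate to every server and every employee's work computer.

That is a lot of work.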

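Not so

We contacted our SRE colleagues. They said we only needed to download a new root certificate and install it on the affected server, that is, the server of the colleague calling our service. (No subdomain certificates needed to be reissued.)

Testing confirmed it.

Which raises the question: how does the certificate sent by app.corp.com get matched to the newly downloaded root certificate?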

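Answers from the Golang source

I searched online and did not find the answer (maybe my keywords were off). But everything is in the code, so let's go read it.

Golang's tls package has a code example: https://pkg.go.dev/crypto/tls#example-Dial. Stepping into it with a debugger and reading carefully, I finally understood the mechanism.

In short, golang iterates over the root certificates and checks whether a root certificate's public key can verify the signature on app.corp.com's certificate. If it can, that is the root certificate we are looking for. Only then does it check whether that root certificate has expired, and so on. This would explain why the new root certificate worked without reissuing anything, presumably because it was generated from the same key pair as the old one.

A few more details:
1. golang first sorts the candidate root certificates so that the more likely ones are verified first, for example those with a matching CN.
2. golang does not verify against every root certificate. There is a hard-coded limit: at most 100 are tried.
3. After finding a valid root certificate, golang does not stop. It goes on to try all the chain information for later use (this is probably something golang could improve). In black-box testing, curl does stop. In other words, if our machine has no valid root certificate, the client spends considerably more time in root certificate verification.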
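
The selection logic described above can be illustrated with a toy model in Python. This is only a sketch under stated assumptions, not golang's crypto/x509 code: the Cert class, the string "keys", and the names such as "Corp Root CA" are all hypothetical, and signature_ok is a stand-in for real public-key signature verification. It also assumes the new root certificate reuses the old root's key pair.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_SIGNATURE_CHECKS = 100  # mirrors the hard-coded cap described above


@dataclass(frozen=True)
class Cert:
    subject: str       # the certificate's own name (CN)
    issuer: str        # name of whoever signed it
    key: str           # stand-in for this certificate's key pair
    signed_with: str   # stand-in for the key that produced the signature
    expired: bool = False


def signature_ok(child: Cert, parent: Cert) -> bool:
    # Stand-in for verifying child's signature with parent's public key.
    return child.signed_with == parent.key


def find_root(leaf: Cert, roots: List[Cert]) -> Optional[Cert]:
    # Roots whose subject matches the leaf's issuer are tried first.
    candidates = sorted(roots, key=lambda r: r.subject != leaf.issuer)
    for root in candidates[:MAX_SIGNATURE_CHECKS]:
        if signature_ok(leaf, root) and not root.expired:
            return root
    return None


leaf = Cert("app.corp.com", "Corp Root CA", "leaf-key", "root-key")
old_root = Cert("Corp Root CA", "Corp Root CA", "root-key", "root-key", expired=True)
new_root = Cert("Corp Root CA", "Corp Root CA", "root-key", "root-key")
other = Cert("Other CA", "Other CA", "other-key", "other-key")

print(find_root(leaf, [other, old_root]))            # only the expired root
print(find_root(leaf, [other, old_root, new_root]))  # renewed root installed
```

In this model the leaf certificate never changes. Installing a renewed root that carries the same key is enough for verification to succeed, which matches what the SRE team's fix showed.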

24 Nov, 2022

Why rsyslog collected etcd logs into kernel.log

etcd version:

etcd Version: 3.4.9
Git SHA: 54ba95891
Go Version: go1.12.17
Go OS/Arch: linux/amd64

Testing showed that the following rsyslog rule is what sent the logs to kernel.log:

#### RULES ####

# Log all kernel messages to the console.
# Logging much else clutters up the screen.
kern.*                                                  /var/log/kernel

rsyslogd version

rsyslogd 8.29.0, compiled with:
	PLATFORM:				x86_64-redhat-linux-gnu
	PLATFORM (lsb_release -d):
	FEATURE_REGEXP:				Yes
	GSSAPI Kerberos 5 support:		No
	FEATURE_DEBUG (debug build, slow code):	No
	32bit Atomic operations supported:	Yes
	64bit Atomic operations supported:	Yes
	memory allocator:			system default
	Runtime Instrumentation (slow code):	No
	uuid support:				Yes
	Number of Bits in RainerScript integers: 64

See http://www.rsyslog.com for more information.

One etcd log entry in journald looks like this:

{
	"__CURSOR" : "s=78fdc2bf435b4aa6b7df9f50ff1e9c9f;i=662c5390;b=f3260b93a64641889bbf8fed67f4365a;m=2f4d3ea8e9fc;t=5ee329ba40462;x=82eb53b68142dca7",
	"__REALTIME_TIMESTAMP" : "1669276010546274",
	"__MONOTONIC_TIMESTAMP" : "52008810244604",
	"_BOOT_ID" : "f3260b93a64641889bbf8fed67f4365a",
	"PRIORITY" : "7",
	"SYSLOG_IDENTIFIER" : "etcd",
	"_TRANSPORT" : "journal",
	"_PID" : "840",
	"_UID" : "997",
	"_GID" : "993",
	"_COMM" : "etcd",
	"_EXE" : "/usr/local/bin/etcd",
	"_CMDLINE" : "/usr/local/bin/etcd --config-file /etc/etcd/etcd.conf.yml --log-output stderr",
	"_CAP_EFFECTIVE" : "0",
	"_SYSTEMD_CGROUP" : "/system.slice/etcd.service",
	"_SYSTEMD_UNIT" : "etcd.service",
	"_SYSTEMD_SLICE" : "system.slice",
	"_MACHINE_ID" : "c94b645006e94b62b253832779707d12",
	"_HOSTNAME" : "SVR15178IN5112",
	"PACKAGE" : "etcdserver/api/v3rpc",
	"MESSAGE" : "start time = 2022-11-24 15:46:50.545251478 +0800 CST m=+51575280.245951347, time spent = 821.324\uffffffc2\uffffffb5s, remote = 10.4.241.133:54952, response type = /etcdserverpb.KV/Txn, request count = 0, request size = 0, response count = 0, response size = 31, request content = compare:<key:\"cilium/state/identities/v1/id/292984\" version:0 > success:<request_put:<key:\"cilium/state/identities/v1/id/292984\" value_size:196 >> failure:<>",
	"_SOURCE_REALTIME_TIMESTAMP" : "1669276010546119"
}

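At first glance, the cause is that the library etcd uses to log to journald does not set the FACILITY field.

When rsyslog collects a log entry, it computes the facility as PRIORITY >> 3. In the entry above PRIORITY is 7, so the result is 0, which rsyslog treats as the kernel facility.

Some rsyslog constants: https://github.com/rsyslog/rsyslog/blob/d083a2a2c20df6852a53e45f1e7a3f47679236d6/runtime/rsyslog.h#L202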

#define	LOG_KERN	(0<<3)	/* kernel messages */
#define	LOG_USER	(1<<3)	/* random user-level messages */
#define	LOG_MAIL	(2<<3)	/* mail system */
#define	LOG_DAEMON	(3<<3)	/* system daemons */
#define	LOG_AUTH	(4<<3)	/* security/authorization messages */
#define	LOG_SYSLOG	(5<<3)	/* messages generated internally by syslogd */
#define	LOG_LPR		(6<<3)	/* line printer subsystem */
#define	LOG_NEWS	(7<<3)	/* network news subsystem */
#define	LOG_UUCP	(8<<3)	/* UUCP subsystem */

The rsyslog helper that computes the facility: https://github.com/rsyslog/rsyslog/blob/d083a2a2c20df6852a53e45f1e7a3f47679236d6/runtime/rsyslog.h#L251

pri2fac(const syslog_pri_t pri)
{
	unsigned fac = pri >> 3;
	return (fac > 23) ? LOG_FAC_INVLD : fac;
}
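
The arithmetic is easy to reproduce. Below is a small Python sketch of the same shift, using the facility order from the constants listed above. The names FACILITY_NAMES and pri2sev are made up for this example, and returning None for an invalid facility is a simplification of the LOG_FAC_INVLD return value.

```python
# Sketch: split a syslog PRI value into facility and severity.
# Facility order follows the LOG_* constants quoted above.
FACILITY_NAMES = ["kern", "user", "mail", "daemon", "auth",
                  "syslog", "lpr", "news", "uucp"]


def pri2fac(pri):
    """Facility is everything above the low 3 bits."""
    fac = pri >> 3
    return None if fac > 23 else fac  # None stands in for LOG_FAC_INVLD


def pri2sev(pri):
    """Severity is the low 3 bits (0 = emerg ... 7 = debug)."""
    return pri & 0x07


# journald's PRIORITY for the etcd entry is 7, with no facility bits set.
etcd_pri = 7
print(FACILITY_NAMES[pri2fac(etcd_pri)], pri2sev(etcd_pri))

# A daemon-facility debug message would carry (3 << 3) | 7.
daemon_debug = (3 << 3) | 7
print(FACILITY_NAMES[pri2fac(daemon_debug)], pri2sev(daemon_debug))
```

With only the severity present, the shift yields 0, and 0 is exactly LOG_KERN, so the kern.* rule matches.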

2 Nov, 2022
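[Translation] When the Linux OOM killer is triggered
Original: Roughly when the Linux Out-Of-Memory killer triggers (as of mid-2019)

Original publication date: 2019-08-11

For unrelated reasons, I recently wanted to understand when the Linux OOM killer does and does not trigger, and why. Detailed documentation on this is scarce, and some of it is out of date. I can't add detailed documentation here either, because that would require knowing the Linux kernel code well, but I can at least write down some rough notes for my own use.

These days there are two different kinds of OOM killer: the global OOM killer and the cgroup-based OOM killer (the latter implemented through the cgroup memory controller). I'm mainly interested in the global one, partly because the cgroup OOM killer is comparatively easy to predict.

Put simply, the global OOM killer triggers when the kernel has trouble allocating pages of physical memory. When the kernel tries to allocate pages (for any purpose, whether for the kernel itself or for a process that needs pages) and fails, it tries various ways to reclaim and compact memory. If that succeeds, or at least makes some progress, the kernel keeps retrying the allocation (as far as I can tell from the code). If it fails to free pages or make progress, it triggers the OOM killer in many (but not all) cases.

(For example, a failed kernel allocation of a large contiguous range of pages does not trigger it; see Decoding the Linux kernel’s page allocation failure messages. The OOM killer is only triggered when the requested contiguous memory is 32 KB or less. According to git blame, it has been this way since 2007.)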
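
The size threshold is easier to reason about as an allocation order: the kernel allocates 2^order contiguous pages at a time. The Python sketch below assumes 4 KiB pages and a cutoff of order 3 (the kernel constant PAGE_ALLOC_COSTLY_ORDER). Both values are assumptions taken from common kernel configurations, not from the text above, and the function names are made up.

```python
# Sketch: relate page-allocation order to size and to the OOM cutoff.
PAGE_SIZE = 4096   # assumption: 4 KiB pages
COSTLY_ORDER = 3   # assumption: value of PAGE_ALLOC_COSTLY_ORDER


def order_to_bytes(order):
    """An order-N allocation is 2**N contiguous pages."""
    return PAGE_SIZE << order


def may_trigger_oom(order):
    """Only allocations at or below the costly order can invoke the OOM killer."""
    return order <= COSTLY_ORDER


for order in range(6):
    size_kib = order_to_bytes(order) // 1024
    print(f"order {order}: {size_kib:>4} KiB, OOM killer possible: {may_trigger_oom(order)}")
```

Under these assumptions the largest allocation that can still trigger the OOM killer is order 3, which is 32 KiB.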
