- bpf 最初是为bsd操作系统开发,全称是 「Berkeley Packet Filter」 ,翻译过来是 「伯克利包过滤器」 ,顾名思义,它是在伯克利大学诞生的,1992年Steven McCanne 和 Van Jacobson 写了一篇 《The BSD Packet Filter: A New Architecture for User-level Packet Capture》论文
- 用户使用bpf 虚拟机的指令集(bpf字节码)定义过滤器表达式,然后传递给内核, 由解释器执行。从而使得包过滤器可以在内核中直接执行。避免向用户态进程复制每个数据包,从而提升数据包过滤的性能。
工具体验
代码语言:javascript复制root@heidsoft-dev:~# apt-get install libbpfcc
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
libclang-common-7-dev libclang1-7 liblldb-12 libllvm7 libz3-4 libz3-dev llvm-12-runtime llvm-12-tools llvm-9-runtime llvm-9-tools python3-lldb-12
Use 'apt autoremove' to remove them.
The following packages will be REMOVED:
bcc-tools libbcc libbcc-examples python-bcc
The following NEW packages will be installed:
libbpfcc
0 upgraded, 1 newly installed, 4 to remove and 183 not upgraded.
Need to get 14.9 MB of archives.
After this operation, 15.3 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 http://cn.archive.ubuntu.com/ubuntu focal/universe amd64 libbpfcc amd64 0.12.0-2 [14.9 MB]
Fetched 14.9 MB in 7s (1,995 kB/s)
(Reading database ... 282090 files and directories currently installed.)
Removing bcc-tools (0.10.0-34.git.6836168) ...
Removing python-bcc (0.10.0-34.git.6836168) ...
Removing libbcc-examples (0.10.0-34.git.6836168) ...
Removing libbcc (0.10.0-34.git.6836168) ...
Selecting previously unselected package libbpfcc.
(Reading database ... 281556 files and directories currently installed.)
Preparing to unpack .../libbpfcc_0.12.0-2_amd64.deb ...
Unpacking libbpfcc (0.12.0-2) ...
Setting up libbpfcc (0.12.0-2) ...
Processing triggers for libc-bin (2.31-0ubuntu9.7) ...
root@heidsoft-dev:~# apt-get install bpftrace
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
libclang-common-7-dev libclang1-7 liblldb-12 libllvm7 libz3-4 libz3-dev llvm-12-runtime llvm-12-tools llvm-9-runtime llvm-9-tools python3-lldb-12
Use 'apt autoremove' to remove them.
The following NEW packages will be installed:
bpftrace
0 upgraded, 1 newly installed, 0 to remove and 183 not upgraded.
Need to get 457 kB of archives.
After this operation, 1,547 kB of additional disk space will be used.
Get:1 http://cn.archive.ubuntu.com/ubuntu focal/universe amd64 bpftrace amd64 0.9.4-1 [457 kB]
Fetched 457 kB in 3s (160 kB/s)
Selecting previously unselected package bpftrace.
(Reading database ... 281568 files and directories currently installed.)
Preparing to unpack .../bpftrace_0.9.4-1_amd64.deb ...
Unpacking bpftrace (0.9.4-1) ...
Setting up bpftrace (0.9.4-1) ...
Processing triggers for man-db (2.9.1-1) ...
root@heidsoft-dev:~# bpftrace --help
USAGE:
bpftrace [options] filename
bpftrace [options] -e 'program'
OPTIONS:
-B MODE output buffering mode ('full', 'none')
-f FORMAT output format ('text', 'json')
-o file redirect bpftrace output to file
-d debug info dry run
-dd verbose debug info dry run
-b force BTF (BPF type format) processing
-e 'program' execute this program
-h, --help show this help message
-I DIR add the directory to the include search path
--include FILE add an #include file before preprocessing
-l [search] list probes
-p PID enable USDT probes on PID
-c 'CMD' run CMD and enable USDT probes on resulting process
--unsafe allow unsafe builtin functions
-v verbose messages
--info Print information about kernel BPF support
-V, --version bpftrace version
ENVIRONMENT:
BPFTRACE_STRLEN [default: 64] bytes on BPF stack per str()
BPFTRACE_NO_CPP_DEMANGLE [default: 0] disable C symbol demangling
BPFTRACE_MAP_KEYS_MAX [default: 4096] max keys in a map
BPFTRACE_CAT_BYTES_MAX [default: 10k] maximum bytes read by cat builtin
BPFTRACE_MAX_PROBES [default: 512] max number of probes
BPFTRACE_LOG_SIZE [default: 409600] log size in bytes
BPFTRACE_NO_USER_SYMBOLS [default: 0] disable user symbol resolution
BPFTRACE_VMLINUX [default: None] vmlinux path used for kernel symbol resolution
BPFTRACE_BTF [default: None] BTF file
EXAMPLES:
bpftrace -l '*sleep*'
list probes containing "sleep"
bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...n", pid); }'
trace processes calling sleep
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
count syscalls by process name
代码语言:javascript复制 bpftrace -e 'tracepoint:syscalls:sys_enter_open {printf("%s %sn",comm,str(args->filename));}'
代码语言:javascript复制 bpftrace -e 'tracepoint:syscalls:sys_enter_open {printf("%s %sn",comm,str(args->filename));}'
代码语言:javascript复制bpftrace -l 'tracepoint:syscalls:sys_enter_open*'
tracepoint:syscalls:sys_enter_open_by_handle_at
tracepoint:syscalls:sys_enter_open_tree
tracepoint:syscalls:sys_enter_open
tracepoint:syscalls:sys_enter_openat
tracepoint:syscalls:sys_enter_openat2
root@heidsoft-dev:~# bpftrace -e 'tracepoint:syscalls:sys_enter_open* { @[probe] = count(); }'
Attaching 5 probes...
^C
@[tracepoint:syscalls:sys_enter_openat]: 33
代码语言:javascript复制root@heidsoft-dev:~# opensnoop.bt
Attaching 6 probes...
Tracing open syscalls... Hit Ctrl-C to end.
PID COMM FD ERR PATH
891 irqbalance 2 0 /proc/interrupts
891 irqbalance 2 0 /proc/stat
891 irqbalance 2 0 /proc/irq/19/smp_affinity
891 irqbalance 2 0 /proc/interrupts
891 irqbalance 2 0 /proc/stat
891 irqbalance 2 0 /proc/irq/19/smp_affinity
833 vmtoolsd 2 0 /proc/meminfo
833 vmtoolsd 2 0 /proc/vmstat
833 vmtoolsd 2 0 /proc/stat
833 vmtoolsd 2 0 /proc/zoneinfo
833 vmtoolsd 2 0 /proc/uptime
833 vmtoolsd 2 0 /proc/diskstats
探针类型
Type | Description |
---|---|
tracepoint | Kernel static instrumentation points |
usdt | User-level statically defined tracing |
kprobe | Kernel dynamic function instrumentation |
kretprobe | Kernel dynamic function return instrumentation |
uprobe | User-level dynamic function instrumentation |
uretprobe | User-level dynamic function return instrumentation |
software | Kernel software-based events |
hardware | Hardware counter-based instrumentation |
watchpoint | Memory watchpoint events (in development) |
profile | Timed sampling across all CPUs |
interval | Timed reporting (from one CPU) |
BEGIN | Start of bpftrace |
END | End of bpftrace |
Listing probes | bpftrace -l 'tracepoint:syscalls:sys_enter_*' |
---|---|
Hello world | bpftrace -e 'BEGIN { printf("hello worldn") }' |
File opens | bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s %sn", comm, str(args->filename)) }' |
Syscall counts by process | bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count() }' |
Distribution of read() bytes | bpftrace -e 'tracepoint:syscalls:sys_exit_read /pid == 18644/ { @bytes = hist(args->retval) }' |
Kernel dynamic tracing of read() bytes | bpftrace -e 'kretprobe:vfs_read { @bytes = lhist(retval, 0, 2000, 200) }' |
Timing read()s | bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs } kretprobe:vfs_read /@start[tid]/ { @ns[comm] = hist(nsecs - @start[tid]); delete(@start[tid]) }' |
Count process-level events | bpftrace -e 'tracepoint:sched:sched* { @[name] = count() } interval:s:5 { exit() }' |
Profile on-CPU kernel stacks | bpftrace -e 'profile:hz:99 { @[stack] = count() }' |
Scheduler tracing | bpftrace -e 'tracepoint:sched:sched_switch { @[stack] = count() }' |
Block I/O tracing | bpftrace -e 'tracepoint:block:block_rq_issue { @ = hist(args->bytes); } |
Kernel struct tracing (a script, not a one-liner) | Command: bpftrace path.bt, where the path.bt file is:#include <linux/path.h>#include <linux/dcache.h>kprobe:vfs_open { printf("open path: %sn", str(((path *)arg0)->dentry->d_name.name)); } |
典型项目
- https://github.com/heidsoft-linux/cilium Cilium is an open source project to provide networking, security, and observability for cloud native environments such as Kubernetes clusters and other container orchestration platforms.
- Cilium是一个开源项目,旨在为云原生环境(如Kubernetes集群和其他容器编排平台)提供网络,安全性和可观察性。
- At the foundation of Cilium is a new Linux kernel technology called eBPF, which enables the dynamic insertion of powerful security, visibility, and networking control logic into the Linux kernel. eBPF is used to provide high-performance networking, multi-cluster and multi-cloud capabilities, advanced load balancing, transparent encryption, extensive network security capabilities, transparent observability, and much more.
- Cilium 的基础是一种名为 eBPF 的新型 Linux 内核技术,它能够将强大的安全性、可见性和网络控制逻辑动态插入到 Linux 内核中。eBPF 用于提供高性能网络、多集群和多云功能、高级负载平衡、透明加密、广泛的网络安全功能、透明的可观察性等等。
- Cilium consists of an agent running on all cluster nodes and servers in your environment. It provides networking, security, and observability to the workloads running on that node. Workloads can be containerized or running natively on the system.
- Cilium 由在环境中的所有群集节点和服务器上运行的代理组成。它为该节点上运行的工作负荷提供网络、安全性和可观察性。工作负载可以在系统上进行容器化或本机运行。
eBPF
eBPF is a revolutionary technology with origins in the Linux kernel that can run sandboxed programs in an operating system kernel. It is used to safely and efficiently extend the capabilities of the kernel without requiring to change kernel source code or load kernel modules.
eBPF是一项革命性的技术,起源于Linux内核,可以在操作系统内核中运行沙盒程序。它用于安全有效地扩展内核的功能,而无需更改内核源代码或加载内核模块。
Historically, the operating system has always been an ideal place to implement observability, security, and networking functionality due to the kernel’s privileged ability to oversee and control the entire system. At the same time, an operating system kernel is hard to evolve due to its central role and high requirement towards stability and security. The rate of innovation at the operating system level has thus traditionally been lower compared to functionality implemented outside of the operating system.
从历史上看,由于内核具有监督和控制整个系统的特权能力,操作系统一直是实现可观察性,安全性和网络功能的理想场所。同时,操作系统内核由于其核心作用以及对稳定性和安全性的高要求而难以发展。因此,与在操作系统之外实现的功能相比,操作系统级别的创新率传统上较低。
eBPF changes this formula fundamentally. By allowing to run sandboxed programs within the operating system, application developers can run eBPF programs to add additional capabilities to the operating system at runtime. The operating system then guarantees safety and execution efficiency as if natively compiled with the aid of a Just-In-Time (JIT) compiler and verification engine. This has led to a wave of eBPF-based projects covering a wide array of use cases, including next-generation networking, observability, and security functionality.
eBPF从根本上改变了这个公式。通过允许在操作系统中运行沙盒程序,应用程序开发人员可以运行 eBPF 程序,以便在运行时向操作系统添加其他功能。然后,操作系统保证安全性和执行效率,就像在实时(JIT)编译器和验证引擎的帮助下进行本机编译一样。这导致了一波基于eBPF的项目,涵盖了广泛的用例,包括下一代网络,可观察性和安全功能。
today, eBPF is used extensively to drive a wide variety of use cases: Providing high-performance networking and load-balancing in modern data centers and cloud native environments, extracting fine-grained security observability data at low overhead, helping application developers trace applications, providing insights for performance troubleshooting, preventive application and container runtime security enforcement, and much more. The possibilities are endless, and the innovation that eBPF is unlocked has only just begun.
如今,eBPF被广泛用于驱动各种用例:在现代数据中心和云原生环境中提供高性能网络和负载平衡,以低开销提取细粒度的安全可观察性数据,帮助应用程序开发人员跟踪应用程序,为性能故障排除、预防性应用程序和容器运行时安全实施提供见解等等。可能性是无穷无尽的,eBPF解锁的创新才刚刚开始。
eBPF programs are event-driven and are run when the kernel or an application passes a certain hook point. Pre-defined hooks include system calls, function entry/exit, kernel tracepoints, network events, and several others.
eBPF 程序是事件驱动的,在内核或应用程序通过某个挂钩点时运行。预定义的挂钩包括系统调用、函数进入/退出、内核跟踪点、网络事件和其他几个。
If a predefined hook does not exist for a particular need, it is possible to create a kernel probe (kprobe) or user probe (uprobe) to attach eBPF programs almost anywhere in kernel or user applications.
如果不存在针对特定需求的预定义钩子,则可以创建内核探测器(kprobe)或用户探测器(uprobe),以将eBPF程序附加到内核或用户应用程序中的几乎任何位置。
In a lot of scenarios, eBPF is not used directly but indirectly via projects like Cilium, bcc, or bpftrace which provide an abstraction on top of eBPF and do not require to write programs directly but instead offer the ability to specify intent-based definitions which are then implemented with eBPF.
在很多情况下,eBPF不是直接使用,而是通过Cilium,bcc或bpftrace等项目间接使用,这些项目在eBPF之上提供抽象,不需要直接编写程序,而是提供指定基于意图的定义的能力,然后使用eBPF实现。
If no higher-level abstraction exists, programs need to be written directly. The Linux kernel expects eBPF programs to be loaded in the form of bytecode. While it is of course possible to write bytecode directly, the more common development practice is to leverage a compiler suite like LLVM to compile pseudo-C code into eBPF bytecode.
如果不存在更高级别的抽象,则需要直接编写程序。Linux内核期望eBPF程序以字节码的形式加载。虽然当然可以直接编写字节码,但更常见的开发实践是利用LLVM等编译器套件将伪C代码编译成eBPF字节码。
When the desired hook has been identified, the eBPF program can be loaded into the Linux kernel using the bpf system call. This is typically done using one of the available eBPF libraries. The next section provides an introduction into the available development toolchains.
当确定了所需的钩子时,可以使用 bpf 系统调用将 eBPF 程序加载到 Linux 内核中。这通常使用一个可用的 eBPF 库来完成。下一节将介绍可用的开发工具链。
As the program is loaded into the Linux kernel, it passes through two steps before being attached to the requested hook:
当程序被加载到 Linux 内核中时,它会在附加到请求的钩子之前经历两个步骤:
Verification 验证
The verification step ensures that the eBPF program is safe to run. It validates that the program meets several conditions, for example:
验证步骤可确保 eBPF 程序安全运行。它验证程序是否满足多个条件,例如:
- The process loading the eBPF program holds the required capabilities (privileges). Unless unprivileged eBPF is enabled, only privileged processes can load eBPF programs.
- 加载 eBPF 程序的过程具有所需的功能(特权)。除非启用了非特权的 eBPF,否则只有特权进程才能加载 eBPF 程序。
- The program does not crash or otherwise harm the system.
- 程序不会崩溃或以其他方式损坏系统。
- The program always runs to completion (i.e. the program does not sit in a loop forever, holding up further processing).
- 程序总是运行到完成(即程序不会永远处于循环中,从而阻碍进一步的处理)。
JIT Compilation JIT 编译
The Just-in-Time (JIT) compilation step translates the generic bytecode of the program into the machine specific instruction set to optimize execution speed of the program. This makes eBPF programs run as efficiently as natively compiled kernel code or as code loaded as a kernel module.
实时 (JIT) 编译步骤将程序的通用字节码转换为计算机特定的指令集,以优化程序的执行速度。这使得 eBPF 程序的运行效率与本机编译的内核代码或作为内核模块加载的代码一样高效。
Maps
A vital aspect of eBPF programs is the ability to share collected information and to store state. For this purpose, eBPF programs can leverage the concept of eBPF maps to store and retrieve data in a wide set of data structures. eBPF maps can be accessed from eBPF programs as well as from applications in user space via a system call.
eBPF程序的一个重要方面是能够共享收集的信息并存储状态。为此,eBPF程序可以利用eBPF映射的概念来存储和检索各种数据结构中的数据。eBPF地图可以通过系统调用从eBPF程序以及用户空间中的应用程序访问。
The following is an incomplete list of supported map types to give an understanding of the diversity in data structures. For various map types, both a shared and a per-CPU variation is available.
- Hash tables, Arrays
- LRU (Least Recently Used)
- Ring Buffer
- Stack Trace
- LPM (Longest Prefix match)
参考文献
- https://github.com/iovisor/bpftrace/blob/master/docs/tutorial_one_liners.md
- https://opensource.com/article/19/8/introduction-bpftrace
- https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf/
- https://cilium.io/blog/2018/11/20/fb-bpf-firewall
- https://ebpf.io/what-is-ebpf/
- https://blogs.oracle.com/linux/post/bpf-a-tour-of-program-types
- https://ebpf.io/what-is-ebpf/