Overview
China Linux System Forum (CLSF) 2026 — Chengdu, June 4–5. CLSF (started 2009) is the Chinese counterpart to LSF/MM: closed-door, invitation-only, single-track, ~50 senior kernel engineers and academic researchers (including authors of two SOSP 2025 best papers and several OSDI papers), with industrial leadership participation from Alibaba’s open-source organization and the openEuler chair.
Attendees spanned Chinese internet majors (Alibaba, Ant, ByteDance, Tencent, Kuaishou, DiDi, JD, China Telecom Cloud), traditional vendors (Huawei, Kylin, Red Hat, Intel), hardware/OEMs (Transsion, LeapIO, WDC), academia (SJTU IPADS, Beihang, PKU/Asterinas), and a small international contingent (Damien Le Moal/WDC, IBM Research).
Talks ran 40–90 minutes with extensive Q&A; conference custom is to not stop a session until disagreements are resolved, so many strong opinions were exchanged in discussion rather than on the slides themselves.
I attended in person and presented joint work with Changwoo Min on scx_lavd — “TaskSize-Aware Load Balancer with Budget-Controlled Invariant Runtime” (see §3.3 below). All public CLSF 2026 slides referenced here are at the CLSF 2026 drive folder.
3.1. SMR Status & Future
- Slides: SMR Status and Future / Linux Kernel SMR Support: State of the Art and Future Work
- What is SMR? Shingled Magnetic Recording: a hard-drive layout where disk tracks overlap like roof shingles to pack ~25% more bits per platter. The catch: you can no longer overwrite in place at random; writes must be sequential within each disk “zone”.
- The takeaway: SMR is no longer a niche capacity format. 65% of all HDD exabytes shipped in 2026 are SMR, projected to reach ~90% within three years. The driver is pure cost (lower $/TB), and AI workloads amplify the demand rather than create it. If your storage stack touches HDDs, you need to plan for SMR.
- Where Linux is today: Damien Le Moal (@Western Digital) gave a decade-in-review. The headline landing is XFS-on-SMR in v6.15: XFS now runs on SMR disks natively (v6.18 LTS is the recommended minimum). Setup is a one-liner:
  `mkfs.xfs -f /dev/<SSD> -r rtdev=/dev/<SMR>` (metadata on SSD, data on the SMR disk).
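The sequential-write rule described above can be sketched as a toy model of a single zone with a write pointer. This is only an illustration of the constraint, not kernel or drive firmware code; the `Zone` class, its block-based units, and the sizes are assumptions made for the example.

```python
class Zone:
    """Toy model of one sequential-write zone (sizes in blocks, not bytes)."""

    def __init__(self, start: int, length: int) -> None:
        self.start = start
        self.length = length
        self.write_pointer = start  # next block that may be written

    def write(self, lba: int, nblocks: int) -> None:
        # A write is accepted only if it begins exactly at the write pointer
        # and does not run past the end of the zone.
        if lba != self.write_pointer:
            raise ValueError(f"unaligned write: lba={lba}, wp={self.write_pointer}")
        if lba + nblocks > self.start + self.length:
            raise ValueError("write crosses the zone boundary")
        self.write_pointer += nblocks

    def reset(self) -> None:
        # Rewriting data means resetting the whole zone and starting over.
        self.write_pointer = self.start


zone = Zone(start=0, length=1024)
zone.write(0, 8)       # sequential: accepted
zone.write(8, 8)       # continues at the write pointer: accepted
try:
    zone.write(0, 8)   # in-place overwrite: rejected
    overwrite_ok = True
except ValueError:
    overwrite_ok = False
```

A filesystem that wants to update block 0 has to write the new copy elsewhere and later reset the old zone, which is why in-place-update metadata is a poor fit for the SMR disk and goes on the SSD in the setup above.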
3.2. MPTCP-KTLS: Bringing TLS to Multipath TCP
- Slides: MPTCP Support
- Geliang Tang (@MPTCP maintainer) presented his two-year effort to make TLS work over MPTCP (Multipath TCP — e.g. a phone using WiFi + cellular together), now at patchset v23.
3.3. TaskSize-Aware Load Balancer with Budget-Controlled Invariant Runtime (LAVD)
- Gavin Guo (@Igalia) presented joint work with Changwoo Min (@Igalia), from OSPM 2026. It targets three `scx_lavd` limitations: (1) the load metric counts queue length, not task size; (2) there is no migration budget, so thundering-herd stealing is papered over by `& 7` randomization; (3) there is no task-type preference.
- Solution: replace load with `queued_load_invr + util_invr` (capacity/frequency/thermal-scaled). Per-domain fair share is capacity-proportional. Migration is budget-controlled: the stealee surrenders half its excess, the stealer accepts half its deficit, and the minimum of the two moves. The symmetric 50/50 rule eliminates the `& 7` heuristic.
- Results (`schbench` p99.9, lower is better):
| Platform | EEVDF | LAVD main | Task-Size-Aware LB | Δ vs. LAVD main |
|---|---|---|---|---|
| Meteor Lake (14 CPUs, hybrid) | 10,587 | 9,899 | 9,195 | −7.1% |
| AMD EPYC 9R14 (192 CPUs) | 11,707 | 7,297 | 6,777 | −7.1% |
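The budget rule summarized above can be written down in a few lines. This is a minimal sketch under stated assumptions, not the `scx_lavd` implementation: the function names, the list-based domain representation, and the load units are hypothetical, and "minimum moves" is read here as "the smaller of the two halves is the amount migrated".

```python
def fair_shares(loads, capacities):
    """Capacity-proportional fair share of the total load for each domain."""
    total_load = sum(loads)
    total_cap = sum(capacities)
    return [total_load * cap / total_cap for cap in capacities]


def migration_budget(stealer, stealee, loads, capacities):
    """Load the stealer may pull from the stealee in one balancing round."""
    shares = fair_shares(loads, capacities)
    excess = max(0.0, loads[stealee] - shares[stealee])    # stealee above its share
    deficit = max(0.0, shares[stealer] - loads[stealer])   # stealer below its share
    # Stealee surrenders at most half its excess, stealer accepts at most half
    # its deficit; the smaller of the two bounds the move.
    return min(excess / 2, deficit / 2)


# Two equal-capacity domains with hypothetical loads.
budget_equal = migration_budget(stealer=1, stealee=0,
                                loads=[300.0, 100.0], capacities=[1024, 1024])

# A big and a small domain carrying the same load: the small one is overloaded.
budget_hybrid = migration_budget(stealer=0, stealee=1,
                                 loads=[300.0, 300.0], capacities=[1024, 512])
```

In both cases the budget is 50 units, half of the 100-unit gap to the fair share. Because each side only closes half of its own gap per round, several stealers targeting the same stealee cannot drain it below its share, which is the property the randomization previously approximated.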
3.4. From Imperative to Declarative OS — Dong Du’s Vision for Agentic Kernels
- Dong Du (@SJTU IPADS) gave a ~70-min keynote arguing that OSes must evolve from imperative POSIX to declarative kernels that take application intent as input. His credentials carry weight: FAST'26 Best Paper + Distinguished Artifact (first time in FAST history) and SOSP'25 Best Paper. The talk was structured around four pillars, each with a concrete artifact.
- Development: SpecOS, a 20K-LoC spec-generated OS running a launcher in QEMU; it played its own slides from inside the demo. Observation: vibe-coding hits a wall past 100K–200K LoC, where the model can no longer keep the dependency graph coherent.
- Abstraction: `isyscall(intent, args)`. The application says what it wants in plain English; an LLM living inside the kernel (exposed as a `/dev/llm`-style character device that any kernel module can `read`/`write`) translates the intent into a sequence of kernel ops. A plan-cache absorbs the LLM-vs-syscall cost gap, so steady-state cache hit rate (not raw inference latency) governs viability. Results: 40–50% latency reduction on fork+execv, ~30% on ash shell.
- Extension: vBPF (OSDI'26) turns an eBPF hook from a fixed execution point into a connection point. A runtime dispatcher routes each event only to the eBPF programs that belong to the current agent’s namespace, instead of running every loaded program in sequence.
- Cultural argument: “千 Agent 千 OS” (a thousand agents, a thousand OSes), meaning one kernel binary, runtime-customized per agent. This counters the “AI will be the OS” narrative from large model vendors: “You are doing One of the Most Important Works (OS) Nowadays!”
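The plan-cache argument in the Abstraction pillar can be made concrete with a small sketch. Everything here is an assumption for illustration: the `PlanCache` class, the `toy_planner` stand-in for the in-kernel LLM, and the cost parameters are hypothetical and do not come from the talk.

```python
class PlanCache:
    """Maps an intent string to a previously planned sequence of kernel ops."""

    def __init__(self, planner):
        self.planner = planner      # slow path: stands in for the in-kernel LLM
        self.plans = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, intent):
        if intent in self.plans:
            self.hits += 1
        else:
            self.misses += 1
            self.plans[intent] = self.planner(intent)
        return self.plans[intent]

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def expected_latency_us(hit_rate, hit_cost_us, miss_cost_us):
    """Average cost per call for a given steady-state hit rate."""
    return hit_rate * hit_cost_us + (1.0 - hit_rate) * miss_cost_us


def toy_planner(intent):
    # Hypothetical translation; a real planner would be an LLM inference.
    return ["open", "read", "close"] if "read" in intent else ["noop"]


cache = PlanCache(toy_planner)
for _ in range(99):
    cache.lookup("read the config file")
cache.lookup("do nothing")
```

With a miss cost several orders of magnitude above the hit cost, `expected_latency_us` is dominated by the `(1 - hit_rate) * miss_cost_us` term, so a small change in hit rate matters more than a large change in inference speed.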
3.5. Asterinas — A Rust OS for Security with Formally Verified CortenMM
- Slides: not provided
- Tian Hong-Liang 田洪亮, the Asterinas project lead (@Ant Group + PKU; ATC'25 + SOSP'25 Best Paper group), positioned Asterinas as the strongest existence proof that a Linux-ABI-compatible, formally verified Rust kernel is production-viable. Security-first framing: memory-safety bugs dominate kernel CVE distributions.
- Framekernel architecture: writing a kernel in Rust is not automatically safe, because operations like page-table updates and MMIO have to bypass Rust’s safety checks using `unsafe`. Earlier Rust OSes sprinkled `unsafe` code everywhere, so the safety claim collapsed to a slogan. Asterinas confines all `unsafe` code to a small foundation library called OSTD; everything built on top of it is pure safe Rust, with the compiler enforcing that boundary. The trusted code base is ~1/10 the size of other Rust OSes, and smaller than many commercial microkernel trusted bases. Unlike a microkernel, everything still runs in one address space, so there is no IPC overhead between subsystems.
- Linux compatibility & first commercial target: Asterinas runs unmodified Linux software. It implements 230+ Linux syscalls (roughly two-thirds of Linux’s ~336), `gdb` and `ptrace` work, 100+ Linux user-space packages run as-is, and it is self-hosting (you can build the Asterinas source from inside an Asterinas VM). The first production deployment target is Intel TDX confidential virtual machines, the workload where you need a guest OS strong enough that a malicious host operator cannot exploit it through the para-virt interfaces.
3.6. AI in Kernel Development — Three Production Systems
- Slides: AI kernel application
- A ByteDance kernel engineer, Zhang Yuchen 章雨晨, walked through three production LLM deployments in their internal kernel team: AI crash triage, AI patch backport, and AI-generated debug tooling. Today every crash is either rule-terminated or AI-analyzed.
- Backport pipeline: Sashiko has been integrated into the review system for AI-backported patch sets to address kernel crashes.
- The provocative structural point: AI is now a producer; humans are a fixed-throughput server. Putting humans at the queue tail caps system throughput at human throughput. The only design that doesn’t bottleneck is AI reviewing AI, with humans setting the gates (risk thresholds, escalation rules) rather than approving every patch. Internal MR policy already accepts an “AI review assistant” account (“假同事”, “fake colleague”) as one of the two required approvals.
- Audience pushback: (1) model-supplier poisoning: models controlled by other companies are a potential security issue; (2) model-supplier bias: if the AI generating patches and the AI reviewing them are the same model family, both can be biased together. The informal mitigation today is “use two different model families.”
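The throughput argument above is simple arithmetic, sketched below. The model is deliberately crude and all numbers are hypothetical: it assumes a fixed daily human review capacity and a fixed fraction of patches that the gates escalate to a human.

```python
def merged_per_day(ai_patches, human_capacity, escalation_rate=1.0):
    """Patches that clear review per day.

    escalation_rate is the fraction of patches that still need a human;
    the rest are cleared by an AI reviewer under human-set gates.
    """
    needs_human = ai_patches * escalation_rate
    auto_cleared = ai_patches - needs_human
    return auto_cleared + min(needs_human, human_capacity)


# Humans approve every patch: throughput is pinned at human capacity.
every_patch = merged_per_day(ai_patches=500, human_capacity=40)

# Humans only see the 5% that the gates escalate.
gated = merged_per_day(ai_patches=500, human_capacity=40, escalation_rate=0.05)
```

In the first case 40 of 500 patches clear per day no matter how fast the producer is; in the second, all 500 clear because the escalated share stays below human capacity. The model says nothing about review quality, which is where the audience pushback applies.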
3.7. spawn_template: Caching exec() Templates for Agent Tool Startup
- Slides: Agent Tool Startup
- The problem: AI agents launch tools constantly. `rg`, `grep`, `git`, `bash`, `python` get spawned thousands of times per agent session. Every launch pays the same fixed cost: re-parsing the binary’s ELF header, walking its program-header table, setting up the dynamic linker. One launch is cheap; under Kubernetes-scale multi-agent fan-out, the aggregate cost adds up.
- The takeaway: Chen Li (@China Telecom Cloud / 天翼云) proposed `spawn_template`, a userspace-opt-in API that caches a slice of the exec setup per hot binary, so repeated launches skip the re-parsing. Eval: +4.99% throughput on a mixed agent benchmark.
- How `spawn_template` works: a new syscall pair. `spawn_template_create(exe)` reads the binary once and caches the ELF header + program-header table; `spawn_template_spawn(tmpl, argv, envp, ...)` then launches future processes from that cache.
- The upstream redirect: Christian Brauner and Kees Cook accepted the motivation but pushed back on adding a one-off syscall.
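The create-once, spawn-many idea can be illustrated with a userspace analogy: parse a binary's ELF header a single time and serve later requests from a cache. This is not the proposed syscall interface. The `SpawnTemplateCache` class, the mtime/size invalidation check, and the synthetic header written to a temporary file are all assumptions made so the sketch runs anywhere.

```python
import os
import struct
import tempfile

ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")  # little-endian ELF64 file header


def parse_elf_header(path):
    """Read and decode the fixed-size ELF64 header of a binary."""
    with open(path, "rb") as f:
        raw = f.read(ELF64_HEADER.size)
    fields = ELF64_HEADER.unpack(raw)
    if fields[0][:4] != b"\x7fELF":
        raise ValueError(f"{path} is not an ELF file")
    return {"entry": fields[4], "phoff": fields[5], "phnum": fields[10]}


class SpawnTemplateCache:
    """Caches parsed headers per binary, invalidated when mtime or size changes."""

    def __init__(self):
        self.templates = {}
        self.parses = 0

    def get(self, path):
        st = os.stat(path)
        stamp = (st.st_mtime_ns, st.st_size)
        cached = self.templates.get(path)
        if cached is None or cached[0] != stamp:
            self.parses += 1                      # slow path: parse the file
            cached = (stamp, parse_elf_header(path))
            self.templates[path] = cached
        return cached[1]


# Build a synthetic 64-byte ELF64 header so the example does not depend on the host.
ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)   # 64-bit, little-endian, version 1
header = ELF64_HEADER.pack(ident, 2, 62, 1, 0x401000, 64, 0, 0, 64, 56, 3, 64, 0, 0)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(header)

cache = SpawnTemplateCache()
for _ in range(1000):
    template = cache.get(tmp.name)   # only the first call parses the file
os.unlink(tmp.name)
```

A thousand lookups cost one parse plus a thousand `stat` calls. How a kernel-side cache would validate a template against a replaced binary is not covered in these notes; the mtime/size check is just the simplest userspace stand-in.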
3.8. AI Inference Storage — GD2FS, KV-Cache, P-D Disaggregation
- Slides: AI reference architecture
- The problem: an LLM-inference server asks two very different things from storage, and the existing FS / block / network stack was designed for neither.
- (1) Model weights: a multi-GB checkpoint loaded once at startup and read constantly; it must reach the GPU’s high-bandwidth memory (HBM) before the model can serve any request.
- (2) KV cache: the per-token attention state every inference step produces and consumes, growing linearly with context length and recomputable on miss. It consists of huge numbers of tiny discontiguous blocks; the industry de-facto on-disk format is Page-First (one small file per attention page).
- The takeaway: Pi Zhenwei (@Tensorfer; formerly ByteDance, the youngest 3.2-era kernel maintainer) gave the closing ~80-min talk and proposed two answers: GD2FS (GPU Direct Distributed File System) as the storage layer, plus a completely stateless inference architecture that scales horizontally like a Redis cluster.
- The design: PCIe bandwidth caps your AI performance. Single-server DDR bandwidth (~500–600 GB/s) ≈ 8×400 Gb/s NIC ingress, so the cache and data networks must merge into one fabric. Pi’s answer is GD2FS (GPU-Direct RDMA + multi-NIC TCP), which scales like Redis. The architecture also relies on P-D disaggregation (prefill and decode on different machines).
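The "DDR ≈ NIC ingress" comparison above is a unit conversion worth spelling out. The sketch assumes "400 G" means a 400 Gbit/s line rate per NIC and ignores protocol overhead; the variable names are illustrative.

```python
NIC_COUNT = 8
NIC_GBIT_PER_S = 400                      # per-NIC line rate in Gbit/s (assumed)

nic_ingress_gbyte_per_s = NIC_COUNT * NIC_GBIT_PER_S / 8   # bits -> bytes
ddr_gbyte_per_s = (500, 600)              # range quoted in the talk

# Fraction of DDR bandwidth that full-rate NIC ingress would consume.
share_low = nic_ingress_gbyte_per_s / ddr_gbyte_per_s[1]
share_high = nic_ingress_gbyte_per_s / ddr_gbyte_per_s[0]
```

Eight such NICs deliver 400 GB/s, which is two-thirds to four-fifths of the quoted DDR bandwidth. Staging that traffic in host memory would leave little headroom for anything else, which is the case for moving data GPU-direct over a single fabric.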
3.9. HUATUO: DiDi’s Production Kernel Observability Stack
- Slides: HUATUO @ CLSF 2026
- The problem: when something goes wrong on a Linux server, you need kernel-level evidence while it’s happening, but the tools that grab that evidence (BPF, kprobes, ftrace) can themselves crash the server if misused. Observability can become the next outage.
- The takeaway: DiDi’s HUATUO runs on tens of thousands of nodes across kernels 4.19 → 6.14 (github.com/ccfos/huatuo). The value is the safety scaffolding (bounded queues, CPU quotas, time-window dedup, blacklists, hard timeouts), captured in their operating principle: “The observation tool can never become a new fault source.”
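Two of the safeguards named above, bounded queues and time-window dedup, are easy to show in miniature. This is not HUATUO code: the `GuardedCollector` class, the event key, and the window length are hypothetical, chosen only to show how the two limits cap the cost of an event storm.

```python
from collections import deque


class GuardedCollector:
    """Event sink with a bounded queue and time-window deduplication."""

    def __init__(self, max_events, dedup_window_s):
        self.queue = deque(maxlen=max_events)  # oldest events are dropped when full
        self.dedup_window_s = dedup_window_s
        self.last_seen = {}                    # event key -> time it was last accepted
        self.suppressed = 0

    def submit(self, key, payload, now):
        last = self.last_seen.get(key)
        if last is not None and now - last < self.dedup_window_s:
            self.suppressed += 1               # same event inside the window: drop it
            return False
        self.last_seen[key] = now
        self.queue.append((now, key, payload))
        return True


collector = GuardedCollector(max_events=4, dedup_window_s=10.0)
for t in range(100):
    # A hypothetical softlockup report firing once per second.
    collector.submit("softlockup:cpu3", {"seq": t}, now=float(t))
```

Of the 100 reports, 90 are suppressed by the dedup window and only the 4 most recent accepted ones stay queued, so memory use and downstream work stay constant however long the fault persists.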
3.10. BTRFS Silent-Corruption Detection via Fuzzing
- The problem: filesystems assume “disks may fail but they don’t lie.” But silent media corruption (bit-flips on platters or NAND that slip past on-device ECC) feeds the FS bytes that look valid but break its internal invariants, often crashing the kernel.
- The takeaway: Huang Zhiyuan (@Beihang/BUAA, PhD) built a smarter fuzzer that targets metadata the workload actually reads (and fixes up the checksum after mutating), unlike Syzkaller’s blunt random byte-flipping. It found 4 real BTRFS bugs.
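The checksum fix-up step is what lets a mutation reach the parsing code instead of being rejected at the first check. The sketch below shows the idea on a made-up block layout. It is not the presenter's tool: the 4-byte-checksum-plus-payload layout and field names are hypothetical, and `zlib.crc32` stands in for the crc32c that BTRFS uses by default.

```python
import struct
import zlib

CSUM_SIZE = 4  # hypothetical layout: 4-byte checksum followed by the payload


def seal(payload):
    """Prefix a payload with its checksum, as an on-disk metadata block would be."""
    return struct.pack("<I", zlib.crc32(payload)) + payload


def checksum_ok(block):
    stored = struct.unpack_from("<I", block)[0]
    return stored == zlib.crc32(block[CSUM_SIZE:])


def mutate(block, offset, value, fix_checksum):
    """Overwrite one payload byte; optionally recompute the checksum afterwards."""
    payload = bytearray(block[CSUM_SIZE:])
    payload[offset] = value
    if fix_checksum:
        return seal(bytes(payload))
    return block[:CSUM_SIZE] + bytes(payload)


original = seal(struct.pack("<QI", 4096, 7))            # hypothetical fields: bytenr, nritems
blind = mutate(original, 8, 0xFF, fix_checksum=False)   # fails the checksum check
fixed = mutate(original, 8, 0xFF, fix_checksum=True)    # passes it, reaches the parser
```

The blind mutation is discarded as an ordinary read error, while the fixed-up block is accepted with an item count of 255 instead of 7, which is the kind of valid-looking but inconsistent input that exercises invariant checks.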
3.11. LeapRAID Driver — Tsinghua Spin-off Brings a Domestic SAS RAID HBA Upstream
- Slides: LeapRAID
- What was presented: a LeapIO Tech (上海纵存科技, Tsinghua spin-off, Dec 2021) engineer walked through the source layout of their Linux driver for the LeapHBA-8200C, a PCIe Gen3 x8 SAS RAID controller card (RAID 0/1/10/5/50/6/60).
- Where they are: performance is “basically on par with Broadcom” (the dominant incumbent); the card is currently shipping to Inspur and similar Chinese server OEMs.
3.12. COPY_MC and ARM64 Support — A 4-Year Saga at v14
- Slides: copy_mc and arm64
- The problem: memory errors are now the #1 hardware-crash cause on big servers. JD Cloud attributes 37% of crashes to memory, Huawei attributes 61%, and HBM (used on GPUs/accelerators) is far less reliable than DRAM. The kernel needs a way to recover from a bad memory line instead of panicking; on x86 the answer is `copy_mc` (a memory-error-safe `memcpy` that returns “bytes remaining” instead of crashing). arm64 still doesn’t have it.
- The takeaway: Wang Kefeng (@Huawei/openEuler) walked through the `copy_mc` on arm64 patchset, now at v14 after 4 years and two co-author shifts, with zero technical objections left standing. Blunt diagnosis: the bottleneck is ARM maintainer disengagement on RAS work, not technical readiness. Linaro Connect, face-to-face meetings, and private channels have all failed to produce a merge signal. Explicit ask to the room: large cloud customers need to coordinate upstream pressure together, because the code is done.
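The "bytes remaining" contract described above is the part callers build on, so a small simulation may help. This is a toy model, not kernel code: the function reuses the name `copy_mc` for readability, models poison per byte (real hardware reports errors at a coarser granularity), and takes the poisoned offsets as an explicit argument.

```python
def copy_mc(dst, src, length, poisoned):
    """Copy up to length bytes; return the number of bytes NOT copied.

    poisoned is a set of source offsets that model uncorrectable memory errors.
    """
    for i in range(length):
        if i in poisoned:
            return length - i      # stop at the bad byte instead of crashing
        dst[i] = src[i]
    return 0


src = bytes(range(16))
dst = bytearray(16)
remaining = copy_mc(dst, src, 16, poisoned={10})
```

Here `remaining` is 6: the first 10 bytes arrived intact and the caller can fail just this one operation (for example, return an I/O error to a single process) rather than taking the whole machine down.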
3.13. ublk — Use Cases and Progress
- Slides: ublk use cases
- Lei Ming (@Red Hat, ublk upstream maintainer) gave a walkthrough making clear that ublk is no longer niche. ublk is a userspace block device (merged in v6.0): the kernel exposes `/dev/ublkbN` like any normal block device, but all the I/O logic (compression, RAID, networking, snapshotting) runs in a userspace daemon connected via `io_uring`. Real adopters: Android 16 (OTA snapshot), SUSE Longhorn, Pure Storage, SPDK, NVIDIA.
3.14. ublk RAID5 — RAID5 Implemented Entirely in ublk + io_uring
- Slides: ublk RAID5
- What was presented: Ni Xiao (@Red Hat / 倪晓) showed a full RAID5 implementation living entirely in ublk + `io_uring` BPF, with no involvement from the kernel’s built-in md-RAID5. Parity XOR runs inside a BPF program against kernel pages, so data never crosses the user/kernel boundary.
- Performance vs md-RAID5 is honestly mixed: reads win +20–31% (credit `io_uring`’s lighter batching), but writes lose 2–4× due to per-stripe lock contention plus an 8-hop read-modify-write round-trip between kernel driver, userspace daemon, and BPF program. One knob recovered most of that: bumping `HASH_BUCKETS` 256→1024 (the stripe-lock hash table) gave +210% seq-write and +132% rand-write, confirming that lock contention, not XOR or BPF cost, is the bottleneck.
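Why the bucket count matters so much can be estimated with a birthday-style calculation. The sketch assumes stripes hash uniformly into lock buckets and picks a hypothetical 32 stripes in flight; neither figure comes from the talk.

```python
def collision_probability(concurrent_stripes, buckets):
    """Chance that at least two in-flight stripes hash to the same lock bucket."""
    p_all_distinct = 1.0
    for i in range(concurrent_stripes):
        p_all_distinct *= (buckets - i) / buckets
    return 1.0 - p_all_distinct


p_256 = collision_probability(concurrent_stripes=32, buckets=256)
p_1024 = collision_probability(concurrent_stripes=32, buckets=1024)
```

Under these assumptions a collision among unrelated stripes is very likely with 256 buckets and much less likely with 1024, so writers to different stripes stop serializing on the same lock. That is consistent with the reported write gains coming from the lock table rather than from XOR or BPF cost.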
3.15. From Test-Time Scaling to Agentic Scaling — Infrastructure Evolution
- The reframe: model quality is no longer just about parameter count. Three new axes matter: (1) inference speed is itself a capability; (2) runtime reasoning compute is a knob; (3) parallel agent attempts on current open-source models.
- The takeaway: Hu Hsin-Wei 胡欣蔚 gave a talk arguing all three axes converge on one design: a “通智网存超节点” (roughly, a supernode that integrates general-purpose compute, AI compute, networking, and storage) that pairs parallel agents with a merged cache+data fabric. The closing slogan: “This is the best window for systems people to participate in AI.”