As the demand for more advanced AI models increases, the efficiency of GPU kernels has become a significant challenge. Custom kernels often deliver performance improvements that standard libraries cannot match, but developing them requires specialized expertise that is becoming harder to find. LinkedIn's Liger Kernel project aims to address this issue, offering substantial performance enhancements while simplifying the optimization process.
Launched as an open-source initiative, the Liger Kernel project reports a 20% increase in throughput and a 60% reduction in memory usage across nearly 40 different model architectures. Its kernels work with popular frameworks like HuggingFace Transformers and with technologies such as Flash Attention, PyTorch FSDP, and DeepSpeed. Since its launch, the project has gained considerable traction, with over 7 million downloads and contributions from more than 100 companies.
However, maintaining and evolving such a large project comes with its own set of challenges. The rapid advancement in AI models requires continuous kernel development and optimization, tasks that typically demand valuable expert resources. To tackle this problem, LinkedIn is implementing AI agents that automate various aspects of kernel engineering, utilizing a concept known as agentic workflows.
Agentic Workflows for Kernel Engineering
The development process for the Liger Kernel follows a structured set of repeatable steps: analysis, implementation, testing, and benchmarking. While these steps are well suited to automation, the complexity of some tasks, such as handling arbitrary shapes and multiple precision modes, calls for more sophisticated methods. LinkedIn's approach embeds Liger-specific domain knowledge into reusable agent-driven workflows that automate intricate engineering tasks.
These workflows operate through a three-stage pipeline that incorporates human review checkpoints. In the first stage, the agent analyzes input materials, reasoning through the problem to create a structured profile for human validation. The second stage involves the agent generating or modifying files based on the approved profile and existing code, ensuring compliance with project conventions. Finally, the agent conducts correctness checks and benchmarks, which are essential for maintaining quality control throughout the process.
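The three stages and their review gates can be pictured as a small orchestration function. The following is a minimal sketch, not LinkedIn's implementation: the names `Profile`, `ReviewRejected`, and `run_pipeline` are hypothetical, each stage is passed in as a plain callable, and placing a second review gate after file generation is an assumption (the article only says the pipeline has human review checkpoints and that the profile is validated by a human).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Profile:
    """Structured result of the analysis stage (hypothetical fields)."""
    op_name: str
    notes: List[str] = field(default_factory=list)


class ReviewRejected(Exception):
    """Raised when the human reviewer does not approve a stage's output."""


def run_pipeline(
    source: str,
    analyze: Callable[[str], Profile],
    generate: Callable[[Profile], Dict[str, str]],
    validate: Callable[[Dict[str, str]], bool],
    approve: Callable[[str, object], bool],
) -> Dict[str, str]:
    # Stage 1: the agent reasons about the input and builds a profile,
    # which a human must approve before any file is written.
    profile = analyze(source)
    if not approve("analysis", profile):
        raise ReviewRejected("analysis")

    # Stage 2: the agent generates or modifies files from the approved profile.
    # The review gate here is an assumption of this sketch.
    files = generate(profile)
    if not approve("generation", files):
        raise ReviewRejected("generation")

    # Stage 3: correctness checks and benchmarks act as the quality gate.
    if not validate(files):
        raise RuntimeError("validation failed")
    return files
```

In this sketch, `approve` is the only point where a person intervenes, so a rejected profile stops the run before any code is generated.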
These agentic workflows have already yielded tangible contributions, including new kernels and model integrations that improve performance.
Automating Kernel Creation with Liger Agents
One key agent, known as liger-kernel-dev, focuses on converting PyTorch operations into optimized Triton kernels. It categorizes operations by complexity, from simple element-wise operations to more complex fused ones, to guide decisions on memory management and tiling strategies. For example, when optimizing the ReLU Squared activation function, the agent classified it as a Tier 1 operation, then generated and validated all the necessary files. The new kernel achieved a 1.9x speedup for the forward pass and a 3.2x speedup for the backward pass, while also reducing memory usage by 37.5% compared with the traditional PyTorch implementation.
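To make the ReLU Squared example concrete, here is the math such a kernel has to reproduce: the forward pass is max(x, 0)^2 and the gradient is 2 * max(x, 0). The sketch below is a plain-Python reference with a finite-difference check. It is not the Triton kernel the agent generated, and the helper names are hypothetical. It only shows the kind of correctness check a generated kernel would need to pass.

```python
from typing import Iterable


def relu_squared(x: float) -> float:
    """Forward pass: max(x, 0) squared."""
    r = max(x, 0.0)
    return r * r


def relu_squared_grad(x: float, grad_out: float = 1.0) -> float:
    """Backward pass: d/dx max(x, 0)^2 = 2 * max(x, 0), scaled by grad_out."""
    return 2.0 * max(x, 0.0) * grad_out


def check_gradient(xs: Iterable[float], eps: float = 1e-6, tol: float = 1e-4) -> bool:
    """Compare the analytic gradient with a central finite difference."""
    for x in xs:
        numeric = (relu_squared(x + eps) - relu_squared(x - eps)) / (2.0 * eps)
        if abs(numeric - relu_squared_grad(x)) > tol:
            return False
    return True
```

Because the operation is element-wise, each output depends on a single input, which is consistent with the article's description of it as a low-complexity case.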
Another agent, liger-autopatch, streamlines the process of adding Liger optimization support for new models within HuggingFace Transformers. Because model architectures differ in subtle ways, this agent identifies 12 architectural decisions, analyzes the model source code, and generates or modifies up to 13 files, including convergence tests across various configurations. Its successful use on models like Nemotron and Mistral highlights the potential for easier integration without extensive manual coding.
Optimizing Existing Kernels
The liger-kernel-perf agent is dedicated to refining already functional kernels through an autonomous optimization loop. The agent profiles kernels, identifies the GPU architecture, detects performance bottlenecks, and generates optimized versions. For instance, it improved the fused_add_rms_norm backward kernel with targeted optimizations that resulted in a 3.35x speedup for a specific hidden dimension and a 59% improvement in full-pass speed without increasing memory usage.
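The core of such an optimization loop is simple to state: a candidate replaces the current kernel only if it produces the same results and measures faster. The sketch below shows that acceptance rule on ordinary Python functions. It is an illustration under assumptions, not the agent's actual loop: `optimize`, `matches`, and `wall_time` are hypothetical names, timing uses the CPU wall clock rather than a GPU profiler, and candidate generation is left out entirely.

```python
import time
from typing import Callable, Dict, List, Sequence, Tuple

Kernel = Callable[[Sequence[float]], List[float]]


def wall_time(kernel: Kernel, inputs: Sequence[float], repeats: int = 5) -> float:
    """Best-of-N wall-clock time in seconds (a stand-in for real GPU profiling)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(inputs)
        best = min(best, time.perf_counter() - start)
    return best


def matches(a: Sequence[float], b: Sequence[float], tol: float = 1e-6) -> bool:
    """Element-wise comparison within an absolute tolerance."""
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


def optimize(
    baseline: Kernel,
    candidates: Dict[str, Kernel],
    inputs: Sequence[float],
    measure: Callable[[Kernel, Sequence[float]], float] = wall_time,
) -> Tuple[str, Kernel]:
    """Return the fastest candidate that reproduces the baseline's output."""
    reference = baseline(inputs)
    best_name, best_kernel = "baseline", baseline
    best_time = measure(baseline, inputs)
    for name, kernel in candidates.items():
        # Correctness comes first: a fast but wrong kernel is rejected.
        if not matches(kernel(inputs), reference):
            continue
        elapsed = measure(kernel, inputs)
        if elapsed < best_time:
            best_name, best_kernel, best_time = name, kernel, elapsed
    return best_name, best_kernel
```

Passing `measure` as a parameter keeps the acceptance rule separate from the profiler, so the same rule could be driven by hardware-specific measurements.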
In addition to these external contributions, LinkedIn incorporates agent-generated kernels into its internal training infrastructure via a custom compiler-based selection library. This library extends torch.compile, automatically identifying operations for fusion and selecting the most appropriate kernel from a registry. This has led to significant reductions in training times, exemplified by a 10x reduction in encoder step time for a recommendation model.
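The idea of picking a kernel from a registry can be sketched in a few lines. LinkedIn's selection library is internal and its API is not described in the article, so everything here is hypothetical: the `KernelRegistry` class, the `supports` predicate, the priority field, and the example kernel names. The sketch also leaves out the torch.compile integration and shows only the lookup step.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class KernelEntry(NamedTuple):
    name: str
    # Predicate deciding whether this kernel can handle a given shape and dtype.
    supports: Callable[[Tuple[int, ...], str], bool]
    # Higher priority wins when several kernels are eligible.
    priority: int


class KernelRegistry:
    def __init__(self) -> None:
        self._entries: Dict[str, List[KernelEntry]] = {}

    def register(self, op: str, entry: KernelEntry) -> None:
        self._entries.setdefault(op, []).append(entry)

    def select(self, op: str, shape: Tuple[int, ...], dtype: str,
               default: str = "eager") -> str:
        """Return the best eligible kernel name, or the default fallback."""
        eligible = [e for e in self._entries.get(op, []) if e.supports(shape, dtype)]
        if not eligible:
            return default
        return max(eligible, key=lambda e: e.priority).name


registry = KernelRegistry()
registry.register("rms_norm", KernelEntry(
    name="fused_rms_norm_bf16",  # hypothetical kernel name
    supports=lambda shape, dtype: dtype == "bfloat16" and shape[-1] % 128 == 0,
    priority=10,
))
registry.register("rms_norm", KernelEntry(
    name="generic_rms_norm",  # hypothetical kernel name
    supports=lambda shape, dtype: True,
    priority=1,
))
```

With these two entries, a bfloat16 input whose last dimension is a multiple of 128 resolves to the specialized kernel, other `rms_norm` inputs fall back to the generic one, and unregistered operations return the default.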
The introduction of agentic workflows marks a notable shift in AI development. Here, artificial intelligence not only analyzes data but also plays an active role in building and improving the underlying infrastructure. As the Liger Kernel project advances, it showcases the potential for AI to simplify complex engineering tasks, ultimately speeding up innovation in the field.
