Subkv: Quantizing Long Context KV Cache for Sub‐Billion Parameter Language Models on Edge Devices

Title:
Subkv: Quantizing Long Context KV Cache for Sub‐Billion Parameter Language Models on Edge Devices
Journal Title:
Software: Practice and Experience
Publication Date:
02 April 2025
Citation:
Zeng, Z., Zhang, T., Lu, Z., Li, W., Zhuang, H., Shao, H., Teo, S. G., & Zou, X. (2025). Subkv: Quantizing Long Context KV Cache for Sub‐Billion Parameter Language Models on Edge Devices. Software: Practice and Experience, 55(8), 1287–1304. https://doi.org/10.1002/spe.3422
Abstract:
Background: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their substantial computational and memory requirements present significant challenges for widespread deployment on edge devices.
Motivation: In long-context scenarios, even sub-billion parameter LLMs face unavoidable memory and performance bottlenecks due to inefficient KV Cache utilization. Existing quantization methods fail to address these challenges effectively.
Method: This paper addresses these challenges by introducing advanced quantization techniques tailored for sub-billion parameter LLMs, specifically targeting memory reduction by converting the model's KV Cache to lower-bit integers. We present SubKV, a quantization method designed to optimize the KV Cache in sub-billion parameter LLMs. Our analysis reveals distinct distributional differences in the magnitudes of the key and value caches. Leveraging this insight, we apply Per-Channel Quantization to the key cache and Per-Token Quantization to the value cache. We further introduce Dynamic Window Quantization to enhance attention computation and, to mitigate the extreme sensitivity of the first token, Attention Sink-Aware Quantization.
Results: Experimental results demonstrate that SubKV significantly reduces the KV Cache size during long-context inference while maintaining model performance, outperforming existing KV Cache quantization methods.
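The sketch below illustrates the two quantization axes the abstract describes: per-channel quantization for the key cache, per-token quantization for the value cache, and a simple attention-sink exemption that keeps the first token in full precision. It is a minimal sketch, not the authors' SubKV implementation; the function names, the 4-bit width, and the sink size of one token are assumptions chosen for illustration.

# Illustrative sketch only: NOT the authors' SubKV implementation.
# It shows per-channel INT4 quantization of the key cache, per-token
# INT4 quantization of the value cache, and an "attention sink"
# exemption that keeps the first token in full precision. Function
# names, bit width, and sink size are assumptions.
import numpy as np

def asym_quantize(x: np.ndarray, axis: int, bits: int = 4):
    """Asymmetric uniform quantization along `axis` (min-max calibrated)."""
    qmax = 2**bits - 1
    x_min = x.min(axis=axis, keepdims=True)
    x_max = x.max(axis=axis, keepdims=True)
    scale = np.maximum(x_max - x_min, 1e-8) / qmax
    q = np.clip(np.round((x - x_min) / scale), 0, qmax).astype(np.uint8)
    return q, scale, x_min

def dequantize(q, scale, x_min):
    return q.astype(np.float32) * scale + x_min

def quantize_kv(keys: np.ndarray, values: np.ndarray, sink: int = 1):
    """keys/values: [seq_len, head_dim].
    Keys are quantized per channel (reducing over the token axis);
    values are quantized per token (reducing over the channel axis).
    The first `sink` token(s) stay in FP32, since the abstract notes
    the first token is extremely sensitive to quantization error."""
    k_q, k_s, k_m = asym_quantize(keys[sink:], axis=0)    # per-channel stats
    v_q, v_s, v_m = asym_quantize(values[sink:], axis=1)  # per-token stats
    return (keys[:sink], k_q, k_s, k_m), (values[:sink], v_q, v_s, v_m)

# Usage: round-trip a toy cache and check reconstruction error.
rng = np.random.default_rng(0)
K = rng.normal(size=(128, 64)).astype(np.float32)
V = rng.normal(size=(128, 64)).astype(np.float32)
(k_sink, k_q, k_s, k_m), _ = quantize_kv(K, V)
K_hat = np.concatenate([k_sink, dequantize(k_q, k_s, k_m)])
print("key cache MAE:", np.abs(K - K_hat).mean())

In a real decoder the sink tokens and dequantization would live inside the attention kernel; the round trip here only demonstrates the calibration axes.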
License type:
Publisher Copyright
Funding Info:
No specific funding was received for this research.
Description:
This is the peer reviewed version of the following article: Zeng, Z., Zhang, T., Lu, Z., Li, W., Zhuang, H., Shao, H., Teo, S. G., & Zou, X. (2025). Subkv: Quantizing Long Context KV Cache for Sub‐Billion Parameter Language Models on Edge Devices. Software: Practice and Experience, 55(8), 1287–1304. https://doi.org/10.1002/spe.3422, which has been published in final form at https://doi.org/10.1002/spe.3422. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions. This article may not be enhanced, enriched or otherwise transformed into a derivative work, without express permission from Wiley or by statutory rights under applicable legislation. Copyright notices must not be removed, obscured or modified. The article must be linked to Wiley's version of record on Wiley Online Library and any embedding, framing or otherwise making available the article or pages thereof by third parties from platforms, services and websites other than Wiley Online Library must be prohibited.
ISSN:
0038-0644
1097-024X
Files uploaded:

File: software-practice-and-experience-1.pdf
Size: 2.58 MB
Format: PDF