Subkv: Quantizing Long Context KV Cache for Sub‐Billion Parameter Language Models on Edge Devices

Title:
Subkv: Quantizing Long Context KV Cache for Sub‐Billion Parameter Language Models on Edge Devices
Journal Title:
Software: Practice and Experience
Publication Date:
02 April 2025
Citation:
Zeng, Z., Zhang, T., Lu, Z., Li, W., Zhuang, H., Shao, H., Teo, S. G., & Zou, X. (2025). Subkv: Quantizing Long Context KV Cache for Sub‐Billion Parameter Language Models on Edge Devices. Software: Practice and Experience, 55(8), 1287–1304. https://doi.org/10.1002/spe.3422
Abstract:
Background: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their substantial computational and memory requirements present significant challenges for widespread deployment on edge devices.
Motivation: In long-context scenarios, even sub-billion parameter LLMs face unavoidable memory and performance bottlenecks due to inefficient KV Cache utilization. Existing quantization methods fail to address these challenges effectively.
Method: This paper addresses these challenges by introducing advanced quantization techniques tailored for sub-billion parameter LLMs, specifically targeting memory reduction by converting the model's KV Cache to lower-bit integers. We present SubKV, a quantization method designed to optimize the KV Cache in sub-billion parameter LLMs. Our analysis reveals distinct distributional differences in the magnitudes of the key and value caches. Leveraging this insight, we apply Per-Channel Quantization to the key cache and Per-Token Quantization to the value cache. We further introduce Dynamic Window Quantization to enhance attention computation and, to mitigate the extreme sensitivity of the first token, Attention Sink-Aware Quantization.
Results: Experimental results demonstrate that SubKV significantly reduces the KV Cache size during long-context inference while maintaining model performance, outperforming existing KV Cache quantization methods.
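The sketch below illustrates the two quantization axes the abstract describes: per-channel quantization for the key cache, per-token quantization for the value cache, and a simple attention-sink exemption that keeps the first token in full precision. It is a minimal sketch, not the authors' SubKV implementation; the function names, the 4-bit width, and the sink size of one token are assumptions chosen for illustration.

# Illustrative sketch only: NOT the authors' SubKV implementation.
# It shows per-channel INT4 quantization of the key cache, per-token
# INT4 quantization of the value cache, and an "attention sink"
# exemption that keeps the first token in full precision. Function
# names, bit width, and sink size are assumptions.
import numpy as np

def asym_quantize(x: np.ndarray, axis: int, bits: int = 4):
    """Asymmetric uniform quantization along `axis` (min-max calibrated)."""
    qmax = 2**bits - 1
    x_min = x.min(axis=axis, keepdims=True)
    x_max = x.max(axis=axis, keepdims=True)
    scale = np.maximum(x_max - x_min, 1e-8) / qmax
    q = np.clip(np.round((x - x_min) / scale), 0, qmax).astype(np.uint8)
    return q, scale, x_min

def dequantize(q, scale, x_min):
    return q.astype(np.float32) * scale + x_min

def quantize_kv(keys: np.ndarray, values: np.ndarray, sink: int = 1):
    """keys/values: [seq_len, head_dim].
    Keys are quantized per channel (reducing over the token axis);
    values are quantized per token (reducing over the channel axis).
    The first `sink` token(s) stay in FP32, since the abstract notes
    the first token is extremely sensitive to quantization error."""
    k_q, k_s, k_m = asym_quantize(keys[sink:], axis=0)    # per-channel stats
    v_q, v_s, v_m = asym_quantize(values[sink:], axis=1)  # per-token stats
    return (keys[:sink], k_q, k_s, k_m), (values[:sink], v_q, v_s, v_m)

# Usage: round-trip a toy cache and check reconstruction error.
rng = np.random.default_rng(0)
K = rng.normal(size=(128, 64)).astype(np.float32)
V = rng.normal(size=(128, 64)).astype(np.float32)
(k_sink, k_q, k_s, k_m), _ = quantize_kv(K, V)
K_hat = np.concatenate([k_sink, dequantize(k_q, k_s, k_m)])
print("key cache MAE:", np.abs(K - K_hat).mean())

In a real decoder the sink tokens and dequantization would live inside the attention kernel; the round trip here only demonstrates the calibration axes.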
License type:
Publisher Copyright
Funding Info:
No specific funding was received for this research.
Description:
This is the peer reviewed version of the following article: Zeng, Z., Zhang, T., Lu, Z., Li, W., Zhuang, H., Shao, H., Teo, S. G., & Zou, X. (2025). Subkv: Quantizing Long Context KV Cache for Sub‐Billion Parameter Language Models on Edge Devices. Software: Practice and Experience, 55(8), 1287–1304. https://doi.org/10.1002/spe.3422, which has been published in final form at https://doi.org/10.1002/spe.3422. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions. This article may not be enhanced, enriched or otherwise transformed into a derivative work, without express permission from Wiley or by statutory rights under applicable legislation. Copyright notices must not be removed, obscured or modified. The article must be linked to Wiley's version of record on Wiley Online Library and any embedding, framing or otherwise making available the article or pages thereof by third parties from platforms, services and websites other than Wiley Online Library must be prohibited.
ISSN:
0038-0644
1097-024X
Files uploaded:

File: software-practice-and-experience-1.pdf
Size: 2.58 MB
Format: PDF