Can LLMs Run on Your Laptop? A Study on Quantized Code Models

DATE POSTED: June 2, 2025

:::info Author:

(1) Enkhbold Nyamsuren, School of Computer Science and IT, University College Cork, Cork, Ireland, T12 XF62 ([email protected]).

:::

Table of Links
  1. Abstract and Introduction

  2. Related Works

    2.1 Code LLMs

    2.2 Quantization

    2.3 Evaluation benchmarks for code LLMs and 2.4 Evaluation metrics

    2.5 Low- and high-resource languages

  3. Methodology

    3.1 Run-time environment

    3.2 Choice of LLMs

    3.3 Choice of benchmarks

    3.4 Evaluation procedure

    3.5 Model parameters and 3.6 Source code and data

  4. Evaluation

    4.1 Pass@1 rates

    4.2 Errors

    4.3 Inference time

    4.4 Lines of code and 4.5 Comparison with FP16 models

  5. Discussion

  6. Conclusions and References

ABSTRACT

Democratization of AI, which makes AI accessible and usable for everyone, is an important topic within the broader issue of the digital divide. It is especially relevant to Large Language Models (LLMs), which are becoming increasingly popular as AI co-pilots but suffer from a lack of accessibility due to high computational demand. In this study, we evaluate whether quantization is a viable approach to enabling LLMs on generic consumer devices. The study assesses the performance of five quantized code LLMs on Lua code generation tasks. All code LLMs had approximately 7 billion parameters and were deployed on a generic CPU-only consumer laptop. To evaluate the impact of quantization, the models were tested at 2-, 4-, and 8-bit integer precisions and compared to non-quantized code LLMs with 1.3, 2, and 3 billion parameters. Along with tasks such as question answering, text summarization, and text generation, programming is one of the most popular applications of AI co-pilots. Furthermore, code generation is a high-precision task, which makes it a suitable benchmark for evaluating and comparing quantized models intended for everyday use by individuals. Lua is chosen as a low-resource language to avoid the models' biases toward high-resource languages. The results suggest that models quantized at 4-bit integer precision offer the best trade-off between performance and model size. These models can be comfortably deployed on an average laptop without a dedicated GPU. Performance drops significantly at 2-bit integer precision, while models at 8-bit integer precision require more inference time that does not effectively translate into better performance. The 4-bit models with 7 billion parameters also considerably outperform the non-quantized models with fewer parameters despite having comparable model sizes in terms of storage and memory demand. While quantization indeed increases the accessibility of smaller LLMs with 7 billion parameters, these LLMs demonstrate overall low performance (below 50%) on high-precision, low-resource tasks such as Lua code generation. Although accessibility is improved, usability has not yet reached a practical level comparable to foundational LLMs such as GPT-4o or Llama 3.1 405B.

1 Introduction

Since the transformer was first proposed in 2017 in the seminal study by Vaswani et al. [1], there have been significant advancements in the development of Artificial Intelligence. Soon after, in less than a year, OpenAI published the first GPT (Generative Pre-trained Transformer) model, which demonstrated an improvement of over 5% over the existing best solutions [2]. Along with BERT (Bidirectional Encoder Representations from Transformers) [3], it became one of the first Large Language Models (LLMs) as we know them today. Since then, transformer-based large language models have experienced rapid development.

When ChatGPT, powered by OpenAI’s GPT-3.5, became accessible to the public in 2022, it convincingly demonstrated its capability to assist humans in various tasks. Today, many efforts attempt to leverage the power of large (language) models in a variety of applications [4; 5]. Specific examples include using LLMs as AI tutors in education [6], clinical decision support systems [7], and coding co-pilots [8].

Because LLMs are finding increasing adoption in everyday life, it is not far-fetched to assume that access to AI co-pilots may soon dictate how productive and, therefore, successful a person or an organization is. We have already experienced how increasing dependency on technology can lead to the digital divide [9]: people being unable to enjoy the same opportunities because of a lack of access to ICT. For example, the earliest form of the digital divide, dating back to the early 2000s, concerned access to the Internet and persists to this day. Lythreatis, Singh, and El-Kassar [9] mention three levels of the digital divide regarding the Internet. The first level concerns the gap in access to the Internet. The second level concerns the digital inequality in the skills and knowledge necessary for using the Internet. The final level concerns the overall beneficial or adverse outcomes of using the Internet. The digital divide at all three levels still persists. Despite the increasing dependency on the Internet for essential everyday activities, people still have unequal access to it [10]. Moreover, young people are often perceived as ‘digital natives’, yet they too are victims of the digital divide, lacking both access and skills [11].

The three levels of the digital divide initially attributed to the Internet are also highly relevant to AI. If AI adoption follows the same trend, then inequality in access to AI can lead to a wider digital divide [12; 13]. Unequal access to AI, a lack of the skills and knowledge needed to use AI effectively and efficiently, and misuse of AI can all contribute to a widening digital divide. This highlights the importance of the democratization of AI, that is, how accessible AI is to ordinary individuals. Democratization is especially relevant to language models because of their high computational demand, which individual users cannot easily overcome.

In this study, we explore AI with respect to the first level of the digital divide: how accessible Large Language Models for code generation are to regular users. For this purpose, we evaluate the performance of smaller LLMs with fewer than 10 billion parameters on code generation tasks. We have specifically chosen the Lua programming language as a testbed. Lua is an example of a low-resource language [14], a language with relatively little data available for training LLMs. For this reason, we can avoid the bias toward high-resource languages and obtain a more representative evaluation of LLM performance on code generation across different languages.

One of the main focuses of this study is to evaluate LLMs on a typical consumer device, e.g., a laptop. Mainstream online solutions, such as GPT-4o and Claude 3.5, are often pay-walled and/or have usage restrictions. Furthermore, these are black-box models that carry certain privacy and security risks. On the other hand, there are free and open-source alternatives such as Llama 3.1 [15]. However, even these open-source models can be too demanding to deploy on consumer hardware. To make LLMs more accessible, post-training quantization [16] can be applied to them. A quantized model is a compressed model with a lessened demand for computing resources at the cost of reduced performance, i.e., the quality of the generated output. Depending on the number of parameters, quantized models can be deployed and run reasonably well on consumer devices.
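As a rough illustration of the idea (the schemes used in practice, such as the block-wise k-quant formats of llama.cpp, are more elaborate), the sketch below quantizes a toy weight matrix to 8-bit integers with a single symmetric scale factor; all names in it are illustrative.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: map float32 weights onto the
    [-127, 127] integer range using a single scale factor."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use at inference time."""
    return q.astype(np.float32) * scale

# A toy weight matrix stands in for one layer of an LLM.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(f"bytes per weight: {w.itemsize} -> {q.itemsize}")   # 4 -> 1
print(f"max rounding error: {np.abs(w - w_hat).max():.6f}")
```

Each weight shrinks from 4 bytes to 1, a 4x reduction in storage and memory, in exchange for a small rounding error in every weight; lower bit widths compress further but round more coarsely.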

As discussed earlier, quantization can result in degraded performance. However, there is a distinct lack of studies exploring the effects of quantization on the performance of LLMs for code generation, or code LLMs for short. In this study, we evaluate how well LLMs quantized at different precision levels perform on Lua-based code generation benchmarks. More specifically, we examine how the quantization level affects the correctness of generated solutions, inference time, and the types of errors produced.
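As a hypothetical illustration of how such correctness can be measured, the sketch below scores pass@1 for Lua tasks by running each generated solution together with its test suite in a standalone `lua` interpreter; the paper's actual harness may differ, and all names here are assumptions.

```python
import os
import subprocess
import tempfile

def solution_passes(generated_code: str, test_suite: str,
                    timeout: float = 10.0) -> bool:
    """Append the benchmark's test suite to a generated Lua solution and
    run it in a standalone `lua` interpreter; the tests are assumed to
    raise an error (non-zero exit code) on any failed assertion."""
    with tempfile.NamedTemporaryFile("w", suffix=".lua", delete=False) as f:
        f.write(generated_code + "\n" + test_suite)
        path = f.name
    try:
        done = subprocess.run(["lua", path], capture_output=True,
                              timeout=timeout)
        return done.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # runaway generations count as failures
    finally:
        os.unlink(path)

def pass_at_1(outcomes: list[bool]) -> float:
    """With one completion per task, pass@1 is simply the fraction of
    tasks whose single generated solution passes all of its tests."""
    return sum(outcomes) / len(outcomes)
```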

Finally, it is possible that non-quantized models with a smaller number of parameters perform better than quantized models with more parameters. To verify this assumption, we compare quantized models with 7 billion parameters against half-precision models with 3 billion or fewer parameters.

Overall, this study aims to answer the following research questions:

• RQ1: Which code LLMs with open-source or permissive licenses can feasibly run on consumer devices with the aid of quantization? (See the deployment sketch after this list.)

• RQ2: How does quantization precision affect code LLMs with respect to the quality of generated code, inference time, and the types of errors in the generated code?

• RQ3: Which quantization precision provides a reasonable trade-off between performance degradation and decreased computational demand?

• RQ4: How do quantized code LLMs perform compared to non-quantized code LLMs of similar model size?
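To make RQ1 concrete, the following sketch shows one way such a deployment could look, using the llama-cpp-python bindings on a CPU-only laptop; the model file name is hypothetical and the parameters are illustrative defaults, not the study's actual configuration.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# The file name is hypothetical: any ~7B code model exported to GGUF at
# 4-bit precision (e.g. a *.Q4_K_M.gguf file) occupies roughly 4-5 GB.
llm = Llama(
    model_path="codellama-7b.Q4_K_M.gguf",
    n_ctx=2048,      # context window in tokens
    n_threads=8,     # CPU-only inference: match the laptop's core count
    n_gpu_layers=0,  # offload nothing; no dedicated GPU assumed
)

out = llm(
    "-- Lua: complete the function\nlocal function factorial(n)\n",
    max_tokens=128,
    temperature=0.0,  # greedy decoding for reproducible benchmark runs
)
print(out["choices"][0]["text"])
```

At 4-bit precision, the weights of a 7-billion-parameter model fit in roughly 4 GB, which is why such models can run in the memory of an average laptop without a dedicated GPU.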


:::info This paper is available on arXiv under a CC BY-SA 4.0 license.

:::
