Fine-tuning models from repositories such as the Hugging Face Model Hub has become increasingly popular as the capabilities of open models have grown. This work has shown how fine-tuning can affect toxicity rates in hard-to-predict ways across models from different AI labs. Model creators’ efforts to reduce toxicity during instruction tuning can easily and inadvertently be undone when models are further fine-tuned on non-adversarial datasets. This phenomenon can be seen in practice in popular models fine-tuned by community contributors, where models fine-tuned for purposes such as improving multilingual capabilities can show surprisingly variable toxicity rates. These results emphasize the need for model creators, community contributors, model users, and policy-makers to pay attention to the toxicity performance of fine-tuned models, even when fine-tuning does not target toxicity.
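Because toxicity regressions can appear even after benign fine-tuning, it is worth spot-checking a fine-tuned model's outputs before relying on it. The sketch below is purely illustrative and is not the evaluation pipeline used in this paper: the model identifier, the prompts, and the choice of the `unitary/toxic-bert` classifier are assumptions, and a serious audit would use a full benchmark with many more prompts.

```python
# Illustrative sketch (not the paper's evaluation pipeline): spot-checking the
# toxicity of a fine-tuned model pulled from the Hugging Face Model Hub.
# The model identifier, prompts, and classifier choice below are assumptions.
from transformers import pipeline

# Hypothetical fine-tuned model identifier; substitute the model under review.
generator = pipeline("text-generation", model="your-org/your-finetuned-model")

# An off-the-shelf toxicity classifier; any comparable scorer could be used.
toxicity_scorer = pipeline("text-classification", model="unitary/toxic-bert")

prompts = [
    "I can't believe those people would",
    "The worst thing about my neighbours is",
]

for prompt in prompts:
    # Sample a short continuation and score the full generated text.
    completion = generator(prompt, max_new_tokens=40, do_sample=True)[0]["generated_text"]
    score = toxicity_scorer(completion, truncation=True)[0]
    print(f"{prompt!r} -> {score['label']}: {score['score']:.3f}")
```

Scores from a single classifier on a handful of prompts are noisy; the point is simply that toxicity should be re-measured after fine-tuning rather than assumed to carry over from the base model.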
Acknowledgments and Disclosure of Funding

The authors would like to thank the following individuals for helpful discussions and feedback throughout the course of this project: Kevin McKee, Inga Campos, Seliem El-Sayed, Laura Weidinger, Ramona Comanescu, and Charvi Rastogi.
\ Brent Mittelstadt and Chris Russell’s contributions to this work have been supported through research funding provided by the Wellcome Trust (grant no. 223765/Z/21/Z), the Sloan Foundation (grant no. G2021-16779), the Department of Health and Social Care, EPSRC (grant no. EP/Y019393/1), and Luminate Group. Their funding supports the Trustworthiness Auditing for AI project and the Governance of Emerging Technologies research programme at the Oxford Internet Institute, University of Oxford. During the course of this work, Will Hawkins was employed at Google DeepMind.
References

Anthropic. (2023). Claude 2. https://www.anthropic.com/news/claude-2
\ Biderman, D., Portes, J., Ortiz, J. J. G., Paul, M., Greengard, P., Jennings, C., King, D., Havens, S., Chiley, V., Frankle, J., Blakeney, C., & Cunningham, J. P. (2024). LoRA Learns Less and Forgets Less (arXiv:2405.09673). arXiv. http://arxiv.org/abs/2405.09673
\ Bilenko, M. (2024, April 23). Introducing Phi-3: Redefining what’s possible with SLMs. Microsoft Azure Blog. https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/
\ Cecchini, D., Nazir, A., Chakravarthy, K., & Kocaman, V. (2024). Holistic Evaluation of Large Language Models: Assessing Robustness, Accuracy, and Toxicity for Real-World Applications. In A. Ovalle, K.-W. Chang, Y. T. Cao, N. Mehrabi, J. Zhao, A. Galstyan, J. Dhamala, A. Kumar, & R. Gupta (Eds.), Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024) (pp. 109–117). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.trustnlp-1.11
\ Conover, M., Hayes, M., Mathur, A., Xie, J., Wan, J., Shah, S., Ghodsi, A., Wendell, P., Zaharia, M., & Xin, R. (2023, December 4). Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM. Databricks. https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
\ Davidson, T., Warmsley, D., Macy, M., & Weber, I. (2017). Automated Hate Speech Detection and the Problem of Offensive Language (arXiv:1703.04009). arXiv. http://arxiv.org/abs/1703.04009
\ Dawson, N. V., & Weiss, R. (2012). Dichotomizing Continuous Variables in Statistical Analysis: A Practice to Avoid. Medical Decision Making, 32(2), 225–226. https://doi.org/10.1177/0272989X12437605
\ Fu, Z., Yang, H., So, A. M.-C., Lam, W., Bing, L., & Collier, N. (2022). On the Effectiveness of Parameter-Efficient Fine-Tuning (arXiv:2211.15583). arXiv. https://doi.org/10.48550/arXiv.2211.15583
\ Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models (arXiv:2009.11462). arXiv. http://arxiv.org/abs/2009.11462
\ Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1(3), 515–534. https://doi.org/10.1214/06-BA117A
\ Gemini Team, Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., Silver, D., Johnson, M., Antonoglou, I., Schrittwieser, J., Glaese, A., Chen, J., Pitler, E., Lillicrap, T., Lazaridou, A., . . . Vinyals, O. (2024). Gemini: A Family of Highly Capable Multimodal Models (arXiv:2312.11805). arXiv. https://doi.org/10.48550/arXiv.2312.11805
\ Gemma Team, Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M. S., Love, J., Tafti, P., Hussenot, L., Sessa, P. G., Chowdhery, A., Roberts, A., Barua, A., Botev, A., Castro-Ros, A., Slone, A., . . . Kenealy, K. (2024). Gemma: Open Models Based on Gemini Research and Technology (arXiv:2403.08295). arXiv. http://arxiv.org/abs/2403.08295
\ He, L., Xia, M., & Henderson, P. (2024). What’s in Your ‘Safe’ Data?: Identifying Benign Data that Breaks Safety (arXiv:2404.01099). arXiv. http://arxiv.org/abs/2404.01099
\ Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685). arXiv. https://doi.org/10.48550/arXiv.2106.09685
\ HuggingFace. (2024, May 18). The Model Hub. https://huggingface.co/docs/hub/en/models-the-hub
\ Irwin, J. R., & McClelland, G. H. (2003). Negative Consequences of Dichotomizing Continuous Predictor Variables. Journal of Marketing Research, 40(3), 366–371. https://doi.org/10.1509/jmkr.40.3.366.19237
\ Kumar, D., Kumar, A., Agarwal, S., & Harshangi, P. (2024). Increased LLM Vulnerabilities from Fine-tuning and Quantization (arXiv:2404.04392). arXiv. http://arxiv.org/abs/2404.04392
\ Lermen, S., Rogers-Smith, C., & Ladish, J. (2023). LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B (arXiv:2310.20624). arXiv. https://doi.org/10.48550/arXiv.2310.20624
\ Liu, H., Liu, Z., Tang, R., Yuan, J., Zhong, S., Chuang, Y.-N., Li, L., Chen, R., & Hu, X. (2024). LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario (arXiv:2403.00108). arXiv. http://arxiv.org/abs/2403.00108
\ Luo, Y., Yang, Z., Meng, F., Li, Y., Zhou, J., & Zhang, Y. (2024). An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning (arXiv:2308.08747). arXiv. http://arxiv.org/abs/2308.08747
\ Meta. (2024a). Introducing Meta Llama 3: The most capable openly available LLM to date. Meta AI. https://ai.meta.com/blog/meta-llama-3/
\ Meta. (2024b). Our responsible approach to Meta AI and Meta Llama 3. Meta AI. https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/
\ Nadeau, D., Kroutikov, M., McNeil, K., & Baribeau, S. (2024). Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations (arXiv:2404.09785). arXiv. http://arxiv.org/abs/2404.09785
\ OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., . . . Zoph, B. (2024). GPT-4 Technical Report (arXiv:2303.08774). arXiv. https://doi.org/10.48550/arXiv.2303.08774
\ Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback (arXiv:2203.02155). arXiv. https://doi.org/10.48550/arXiv.2203.02155
\ Phi-3 Safety Post-Training: Aligning Language Models with a “Break-Fix” Cycle (arXiv:2407.13833). (2024). arXiv. Retrieved 27 September 2024, from https://arxiv.org/html/2407.13833v1
\ Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P., & Henderson, P. (2023). Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! (arXiv:2310.03693). arXiv. http://arxiv.org/abs/2310.03693
\ Royston, P., Altman, D. G., & Sauerbrei, W. (2006). Dichotomizing continuous predictors in multiple regression: A bad idea. Statistics in Medicine, 25(1), 127–141. https://doi.org/10.1002/sim.2331
\ Sun, A. Y., Zemour, E., Saxena, A., Vaidyanathan, U., Lin, E., Lau, C., & Mugunthan, V. (2024). Does fine-tuning GPT-3 with the OpenAI API leak personally-identifiable information? (arXiv:2307.16382). arXiv. http://arxiv.org/abs/2307.16382
\ Taraghi, M., Dorcelus, G., Foundjem, A., Tambon, F., & Khomh, F. (2024). Deep Learning Model Reuse in the HuggingFace Community: Challenges, Benefit and Trends (arXiv:2401.13177). arXiv. http://arxiv.org/abs/2401.13177
\ Tian, K., Mitchell, E., Yao, H., Manning, C. D., & Finn, C. (2023). Fine-tuning Language Models for Factuality (arXiv:2311.08401). arXiv. http://arxiv.org/abs/2311.08401
\ Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., . . . Scialom, T. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models (arXiv:2307.09288). arXiv. http://arxiv.org/abs/2307.09288
\ Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need (arXiv:1706.03762). arXiv. http://arxiv.org/abs/1706.03762
\ Vidgen, B., Thrush, T., Waseem, Z., & Kiela, D. (2020). Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection (arXiv:2012.15761). arXiv. https://arxiv.org/abs/2012.15761v2
\ Wan, A., Wallace, E., Shen, S., & Klein, D. (2023). Poisoning Language Models During Instruction Tuning. Proceedings of the 40th International Conference on Machine Learning, 35413–35425. https://proceedings.mlr.press/v202/wan23b.html
\ Wang, S., Wang, P., Zhou, T., Dong, Y., Tan, Z., & Li, J. (2024). CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models (arXiv:2407.02408). arXiv. https://doi.org/10.48550/arXiv.2407.02408
\ Wang, Y., Ivison, H., Dasigi, P., Hessel, J., Khot, T., Chandu, K. R., Wadden, D., MacMillan, K., Smith, N. A., Beltagy, I., & Hajishirzi, H. (2023). How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources (arXiv:2306.04751). arXiv. https://doi.org/10.48550/arXiv.2306.04751
\ Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., Kenton, Z., Brown, S., Hawkins, W., Stepleton, T., Biles, C., Birhane, A., Haas, J., Rimell, L., Hendricks, L. A., . . . Gabriel, I. (2021). Ethical and social risks of harm from Language Models (arXiv:2112.04359). arXiv. http://arxiv.org/abs/2112.04359
\ Yang, X., Wang, X., Zhang, Q., Petzold, L., Wang, W. Y., Zhao, X., & Lin, D. (2023). Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models (arXiv:2310.02949). arXiv. https://doi.org/10.48550/arXiv.2310.02949
\ Zeng, Y., & Lee, K. (2024). The Expressive Power of Low-Rank Adaptation (arXiv:2310.17513). arXiv. http://arxiv.org/abs/2310.17513
\ Zhan, Q., Fang, R., Bindu, R., Gupta, A., Hashimoto, T., & Kang, D. (2024). Removing RLHF Protections in GPT-4 via Fine-Tuning (arXiv:2311.05553). arXiv. http://arxiv.org/abs/2311.05553
\ Zhang, S., Dong, L., Li, X., Zhang, S., Sun, X., Wang, S., Li, J., Hu, R., Zhang, T., Wu, F., & Wang, G. (2024). Instruction Tuning for Large Language Models: A Survey (arXiv:2308.10792). arXiv. http://arxiv.org/abs/2308.10792
\ Zhao, J., Deng, Z., Madras, D., Zou, J., & Ren, M. (2024). Learning and Forgetting Unsafe Examples in Large Language Models (arXiv:2312.12736). arXiv. http://arxiv.org/abs/2312.12736
\
:::info Authors:
(1) Will Hawkins, Oxford Internet Institute, University of Oxford;
(2) Brent Mittelstadt, Oxford Internet Institute, University of Oxford;
(3) Chris Russell, Oxford Internet Institute, University of Oxford.
:::
:::info This paper is available on arXiv under a CC 4.0 license.
:::