Fine-Tuning Can Accidentally Make AI More Toxic, Study Finds

DATE POSTED: June 9, 2025
Table of Links
  1. Abstract and Introduction
  2. Related Work
  3. Experiments
  4. Discussion
  5. Limitations and Future Work
  6. Conclusion, Acknowledgments and Disclosure of Funding, and References

A. Models assessed

B. Data & Code

NeurIPS Paper Checklist

4 Discussion

This work explored how fine-tuning can affect the propensity of prominent open language models to output toxic content. It demonstrated that the fine-tuning AI labs apply to base models leads to reductions in toxicity, suggesting labs are seeking to reduce toxic content in line with their commitments to safety. We show that, despite this, these mitigations can easily and, crucially, inadvertently be undone. This can be achieved through simple parameter-efficient fine-tuning on non-toxic data, using Google Colab and a T4 GPU, and does not require an adversarial dataset designed to induce toxicity. The downstream impact of this can be seen in the results from the community-tuned experiments, where fine-tuning intended to improve a specific capability, such as support for a particular language, can lead to difficult-to-predict deviations in toxicity rates.
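
To make the setup concrete, below is a minimal sketch of the kind of LoRA-based, parameter-efficient fine-tuning described above, using the Hugging Face transformers and peft libraries. The model, dataset, and hyperparameters are illustrative placeholders rather than the authors' exact configuration; the point is simply that a benign run like this fits comfortably on a free Colab T4.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # placeholder; the paper assesses larger open models
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA trains small low-rank adapters instead of all weights, so a single T4 suffices.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Benign, non-toxic instruction data (assumed dataset; any innocuous corpus would do).
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")

def tokenize(batch):
    texts = [f"{i}\n{o}" for i, o in zip(batch["instruction"], batch["output"])]
    return tokenizer(texts, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=4,
                           num_train_epochs=1, fp16=torch.cuda.is_available()),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```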

As a result, users of fine-tuned models, and developers undertaking fine-tuning themselves, should not assume that prior toxicity performance will carry over after tuning, even if the tuning dataset contains no harmful content. Instead, this work demonstrates the importance of establishing a culture of evaluation for pertinent safety issues both before and after fine-tuning. None of the community-tuned models assessed in this work disclosed safety evaluation results in their Hugging Face documentation, meaning a user would not know how a model might respond to toxic or otherwise adversarial content. This suggests community developers could improve safety evaluation and documentation practices for fine-tuned models. Where evaluation results are not made available, users of fine-tuned models should conduct their own safety evaluations before use.
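
As a rough illustration of what such a pre- and post-fine-tuning check could look like, the sketch below generates completions for a slice of a public prompt set and scores them with an off-the-shelf toxicity classifier (Detoxify). The model, prompt set, sample size, and 0.5 threshold are assumptions for demonstration; the paper's own evaluation setup may differ.

```python
from datasets import load_dataset
from detoxify import Detoxify
from transformers import pipeline

# Small placeholder model and a slice of a public probe set stand in for the
# fine-tuned model and prompt data a real evaluation would use.
generator = pipeline("text-generation", model="gpt2")
scorer = Detoxify("original")

prompts = load_dataset("allenai/real-toxicity-prompts", split="train[:50]")
prompt_texts = [row["prompt"]["text"] for row in prompts]

completions = [
    generator(p, max_new_tokens=30, do_sample=True, pad_token_id=50256)[0]["generated_text"]
    for p in prompt_texts
]

# Fraction of completions scored above a chosen toxicity threshold.
toxicity_scores = scorer.predict(completions)["toxicity"]
toxicity_rate = sum(s > 0.5 for s in toxicity_scores) / len(toxicity_scores)
print(f"Toxic completions: {toxicity_rate:.1%}")
```

Running the same check on the base model, the lab-tuned model, and the community fine-tune makes any shift in toxicity visible before the model is put to use.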

5 Limitations and Future Work

This work focused on models that are popular for fine-tuning within the open-source community, all of which are relatively small compared to state-of-the-art models. It would be valuable to compare the impact across models of different sizes to identify possible variations. Similarly, we focused on LoRA-based fine-tuning because of the popularity and effectiveness of this technique; further work could explore more fine-grained configurations and the impact of different fine-tuning techniques, as sketched below.
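
One possible scaffold for that exploration is to sweep LoRA hyperparameters (rank, alpha, target modules) and rerun the same toxicity evaluation after each run. The values and module names below are assumptions for illustration (the target modules follow Llama-style attention layer names) and would need adjusting to the architecture under test.

```python
from itertools import product
from peft import LoraConfig

# Grid of adapter configurations to compare; ranks, alphas, and target modules
# are illustrative choices, not values used in the paper.
ranks = [4, 8, 16, 64]
alphas = [8, 16, 32]
target_modules = [("q_proj", "v_proj"), ("q_proj", "k_proj", "v_proj", "o_proj")]

configs = [
    LoraConfig(r=r, lora_alpha=a, target_modules=list(t),
               lora_dropout=0.05, task_type="CAUSAL_LM")
    for r, a, t in product(ranks, alphas, target_modules)
]

for cfg in configs:
    # Each configuration would be applied with get_peft_model(base_model, cfg),
    # trained on the same benign dataset, and then scored with the toxicity
    # evaluation sketched above to see how the setting shifts toxicity rates.
    print(cfg.r, cfg.lora_alpha, cfg.target_modules)
```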

With this phenomenon identified and its impact demonstrated for the community, future work could focus on exploring the reasons for such safety changes in the model. One possibility is model forgetting, with the safety fine-tuning conducted by model creators being “forgotten” as additional fine-tuning is applied (Luo et al., 2024). If this were the case, future experiments might find that, after fine-tuning on benign data, models converge towards the underlying pre-training toxicity rate of the base model. Alternatively, the movements in toxicity could be driven solely by the model learning from the new data, shifted by semantic patterns within the fine-tuning dataset. If this were the case, future experiments might find that continued fine-tuning leads all models to converge on a similar toxicity rate when tuned on the same dataset. Additional experiments could explore whether types of fine-tuning beyond LoRA have different impacts on toxicity, and could assess, with larger datasets, whether impacts vary across sub-topics (e.g., race, religion). Finally, an avenue that requires further exploration is the impact of fine-tuning on broader responsibility issues, such as the fairness and representation properties of models.


:::info Authors:

(1) Will Hawkins, Oxford Internet Institute, University of Oxford;

(2) Brent Mittelstadt, Oxford Internet Institute, University of Oxford;

(3) Chris Russell, Oxford Internet Institute, University of Oxford.

:::

:::info This paper is available on arXiv under a CC 4.0 license.

:::
