Enhancing Customer Support with Fine-tuned Falcon LLM

In a bid to revolutionize customer support efficiency, the blog post details the journey of optimizing the Falcon LLM model using QLoRA techniques. The objective was to develop a context-aware solution leveraging a single GPU. The process involved selecting Falcon LLM for its adaptability, refining a dataset of over 100,000 query-response pairs, and employing QLoRA for model optimization. Challenges like GPU constraints and context relevance were addressed, leading to significant improvements in response time, accuracy, and user satisfaction.
Azumo Research
March 21, 2024
illustration for outsourcing


In our pursuit to remain at the forefront of AI-driven solutions, we embarked on an ambitious internal project. Recognizing the potential of AI in enhancing customer support, we sought to develop a solution that delivered both efficiency and precision.


Adapt the Falcon LLM to deliver context-aware and efficient customer support responses, leveraging the power of QLoRA and the efficiency of a single GPU.

  • Choosing the Base - Falcon LLM: We decided upon the Falcon LLM due to its adaptability and capacity to handle diverse queries.
  • Tailoring the Dataset: We extracted and refined over 100,000 query-response pairs from the past interactions, ensuring the data's quality and relevance.
  • QLoRA - The Game Changer: To optimize for the single GPU constraint, we employed QLoRA, a technique designed to fine-tune large models without compromising on performance. QLoRA’s balance of low-rank approximation and quantization allowed us to maintain essential model features with fewer parameters.
  • Optimized Training: Using the Hugging Face Transformers library, we tailored training parameters after extensive testing, ensuring optimal performance. The 8-bit paged atom optimizer was pivotal in managing the model's large size.

Challenges Tackled

  • Navigating GPU Constraints: The memory limitations of a single GPU were addressed using QLoRA's quantization, enabling us to reduce model size without sacrificing its effectiveness.
  • Context Relevance: Continuous validation checks ensured the model's relevance to unique customer interactions.

Performance Highlight: Before/ After

Response Time:
  • Before: Average of 6 seconds
  • After: Reduced to 2.6 seconds
Accuracy Rate:
  • Before: 85% accuracy in automated responses
  • After: A leap to 96% accuracy

User Satisfaction:

  • Before: 75% satisfaction rate
  • After: 92% satisfaction rate

QLoRA Training of Falcon-7B with GSM8K Dataset

Training Report


This initiative not only showcased our capability to innovate but also underscored our dedication to enhancing every aspect of our operations. By integrating the Falcon LLM optimized with QLoRA, we reimagined what AI-driven customer support could achieve, all on a single GPU.

For reference, you can access the trained model from here: 


Vector Database