TLDR: ElectriQ is a new benchmark and dataset designed to evaluate and improve large language models (LLMs) for electric power marketing customer service. It addresses limitations of current systems and general LLMs by providing domain-specific knowledge and evaluation metrics like professionalism, popularity, readability, and user-friendliness. The research demonstrates that even smaller LLMs can achieve high performance in this specialized field after fine-tuning and knowledge augmentation, offering a comprehensive foundation for developing tailored LLMs for the power sector.
Electric power marketing customer service is a vital function, handling everything from inquiries and complaints to service requests. However, traditional systems, such as China’s 95598 hotline, often face challenges like slow response times, rigid processes, and a lack of accuracy in specialized domains. While advanced Large Language Models (LLMs) like GPT-4o and Claude 3 show immense potential in general language tasks, they typically lack the specific domain knowledge and empathetic understanding crucial for effective power marketing customer interactions.
To bridge this gap, researchers have introduced ElectriQ, the first benchmark specifically designed for evaluating and enhancing LLMs in the electric power marketing sector. This innovative benchmark includes a comprehensive dialogue dataset covering six key service categories. It also defines four crucial metrics for assessing response quality: professionalism, popularity (how easy it is to understand), readability, and user-friendliness (empathy and personalized care).
A significant part of the ElectriQ framework is a domain-specific knowledge base paired with a knowledge augmentation method, which supplies LLMs with the specialized information they need to perform well. Experiments on 13 different LLMs showed that even smaller models, such as LLaMA3-8B, can surpass larger, more general models like GPT-4o after being fine-tuned and augmented with this domain-specific knowledge. The improvement was most noticeable in professionalism and user-friendliness.
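The summary does not spell out how the knowledge augmentation works. One common way to supply a model with domain knowledge is to retrieve relevant knowledge-base entries and place them in the prompt, and the sketch below illustrates that idea only. The knowledge-base entries, the word-overlap ranking, and the function names are all assumptions made for illustration; none of them come from ElectriQ.

```python
import re

# Hypothetical knowledge-base entries, invented for illustration.
# They are not taken from the ElectriQ knowledge base.
KNOWLEDGE_BASE = [
    "Tiered pricing: residential electricity is billed in tiers based on monthly usage.",
    "Outage reporting: customers can report a power outage through the service hotline.",
    "Meter reading: smart meters report usage automatically each billing cycle.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, entries, top_k=1):
    """Rank entries by how many distinct words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        entries,
        key=lambda entry: len(query_words & tokenize(entry)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, entries, top_k=1):
    """Prepend the retrieved knowledge to the customer question."""
    context = "\n".join(retrieve(query, entries, top_k))
    return f"Reference knowledge:\n{context}\n\nCustomer question: {query}\nAnswer:"


if __name__ == "__main__":
    print(build_prompt("How do I report a power outage?", KNOWLEDGE_BASE))
```

Word overlap is used here only because it keeps the example self-contained; a real system would more likely rank entries with an embedding model or a search index.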
The development of ElectriQ provides a robust foundation for creating LLMs that are specifically tailored to meet the unique demands of power marketing customer service. The dataset for ElectriQ was meticulously constructed using a combination of real-world customer service voice records, which were transcribed and refined, and augmented data generated by GPT-4o. This augmentation process helped enrich the dataset with diverse scenarios, ensuring the models gain in-depth knowledge and optimize their interactive performance. Human preference-guided dialogue samples were also included to align model responses more closely with user needs.
The evaluation metrics are central to ElectriQ. Professionalism assesses the accuracy of technical terminology and depth of knowledge, ensuring responses adhere to industry rules and include necessary technical parameters. Popularity focuses on converting technical jargon into easily understandable language for everyday users. Readability evaluates the logical structure, grammatical accuracy, and conciseness of the response. User-friendliness measures the emotional care, reassurance, and personalized suggestions provided by the model, making interactions more human-like.
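To make the four metrics concrete, the sketch below stores one score per metric for a single response and reports an overall value and the weakest dimension. The class name, the unweighted average, and the example numbers are assumptions for illustration; the summary does not state the scoring scale or how ElectriQ combines the metrics.

```python
from dataclasses import dataclass, fields


@dataclass
class ResponseScores:
    """Per-response scores on the four ElectriQ metrics (scale is an assumption)."""

    professionalism: float
    popularity: float
    readability: float
    user_friendliness: float

    def overall(self):
        """Unweighted mean of the four metrics (an illustrative choice)."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)


def weakest_metric(scores):
    """Return the name of the lowest-scoring metric."""
    return min(fields(scores), key=lambda f: getattr(scores, f.name)).name


if __name__ == "__main__":
    # Placeholder numbers, not results from the paper.
    example = ResponseScores(
        professionalism=4.0, popularity=3.0, readability=5.0, user_friendliness=2.0
    )
    print(example.overall(), weakest_metric(example))
```

Keeping the metrics as separate fields rather than a single number makes it easy to see which dimension a model needs to improve.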
The experimental results showed a size-performance relationship: before any domain adaptation, larger models generally performed better. However, the combination of supervised fine-tuning (SFT) and knowledge enhancement proved highly effective, especially for models under 10B parameters. For instance, LLaMA3-8B and Mistral-7B, after this targeted training, achieved scores that rivaled or even surpassed GPT-4o in some tasks. This suggests that well-tuned mid-sized models, when equipped with domain-specific knowledge, can be highly competitive.
The study also conducted ablation experiments, confirming that both supervised fine-tuning and knowledge enhancement contribute significantly to performance improvements. Knowledge enhancement was particularly effective for models above 7B parameters, which were reportedly better able to absorb the added knowledge. Furthermore, comparative experiments showed that the augmented data performed very similarly to real data, which helps mitigate data scarcity in the electric power marketing domain.
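An ablation of this kind compares the same model with each component switched on or off. A minimal sketch of that comparison follows, assuming one overall score per configuration. The configuration names and the numbers are placeholders invented for illustration, not results from the study.

```python
def ablation_gains(scores, baseline="base"):
    """Return each configuration's score difference from the baseline."""
    base = scores[baseline]
    return {
        name: value - base
        for name, value in scores.items()
        if name != baseline
    }


if __name__ == "__main__":
    # Placeholder scores for one hypothetical model, not figures from the paper.
    placeholder_scores = {
        "base": 3.0,
        "sft_only": 3.5,
        "knowledge_only": 3.25,
        "sft_plus_knowledge": 4.0,
    }
    for name, gain in ablation_gains(placeholder_scores).items():
        print(f"{name}: {gain:+.2f}")
```

If the combined gain exceeds the sum of the two individual gains, as it does with these placeholder values, the components reinforce each other rather than merely adding up.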
The methodology and dataset proposed in this study also demonstrated good generalizability across other power-related domains, such as substation fault diagnosis, photovoltaic power generation, and hydropower scenarios. This indicates the potential for broader application of this approach within the energy sector.
In conclusion, ElectriQ represents a significant step forward in developing intelligent customer service solutions for the electric power industry. By providing a specialized benchmark and a method for knowledge enhancement, it paves the way for LLMs to deliver more efficient, accurate, and empathetic services. For more detail, refer to the full research paper.


