In light of the remarkable success of large language models (LLMs) in natural language understanding and generation, there is growing interest across many fields in applying LLMs to professional domains with specialized requirements. It is therefore desirable to better understand the level of intelligence LLMs can achieve on domain-specific problems, as well as the resources that must be invested to reach it. This paper studies the problem of generating high-quality test questions targeting specified knowledge points and cognitive levels for AI-assisted teaching and learning. Our study shows that LLMs, even ones as large as GPT-4 or Bard, can hardly fulfill these design objectives, as they lack a clear focus on the cognitive levels pertaining to specific knowledge points. Instead of training models on substantial domain-specific data, which consumes massive computing and memory resources, we explore enhancing the capability of LLMs through system design. We propose a novel design scheme that orchestrates a dual-LLM engine, consisting of a question-generation model and a cognitive-level evaluation model, built from fine-tuned, lightweight baseline models and prompting techniques, to generate high-quality test questions. Experimental results show that the proposed framework, TwinStar, outperforms state-of-the-art LLMs in generating effective test questions, measured by cognitive-level adherence and knowledge relevance. TwinStar implemented with ChatGLM2-6B improves cognitive-level adherence by almost 50% compared to Bard and by 21% compared to GPT-4.0. The overall improvement in the quality of test questions generated by TwinStar reaches 12.0% compared to Bard and 2% compared to GPT-4.0, while our TwinStar implementation consumes only negligible memory compared with GPT-4.0. An implementation of TwinStar using LLaMA2-13B shows a similar trend of improvement.
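To make the dual-LLM idea concrete, the following is a minimal sketch, not the authors' implementation, of the generate-then-evaluate loop the abstract describes: a question-generation model drafts a question for a given knowledge point and target cognitive level, a separate evaluation model judges the cognitive level actually achieved, and the draft is regenerated until the two agree or a retry budget is exhausted. All function names, the retry policy, and the stubbed model calls are hypothetical placeholders for the fine-tuned lightweight models (e.g., ChatGLM2-6B or LLaMA2-13B) mentioned in the paper.

```python
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    knowledge_point: str
    target_level: str  # e.g., a Bloom-style level such as "Apply" or "Analyze"


def generate_question(knowledge_point: str, target_level: str) -> Question:
    """Placeholder for the fine-tuned question-generation model."""
    prompt = (
        f"Write one test question on '{knowledge_point}' "
        f"at the '{target_level}' cognitive level."
    )
    # A real implementation would call the generation LLM here.
    return Question(text=prompt, knowledge_point=knowledge_point, target_level=target_level)


def evaluate_level(question: Question) -> str:
    """Placeholder for the cognitive-level evaluation model; returns the judged level."""
    # Stubbed: a real evaluator LLM would classify question.text.
    return question.target_level


def dual_llm_generate(knowledge_point: str, target_level: str, max_retries: int = 3) -> Question:
    """Regenerate until the evaluator agrees the question meets the target level."""
    draft = generate_question(knowledge_point, target_level)
    for _ in range(max_retries):
        if evaluate_level(draft) == target_level:
            return draft
        draft = generate_question(knowledge_point, target_level)
    return draft  # fall back to the last draft if no attempt satisfies the evaluator


if __name__ == "__main__":
    q = dual_llm_generate("binary search trees", "Apply")
    print(q.text)
```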