ChainRank-DPO: Chain Rank Direct Preference Optimization for LLM Rankers

Abstract
In information retrieval, large language models (LLMs) have shown significant promise in text reranking tasks by leveraging their advanced reasoning capabilities. However, traditional supervised fine-tuning approaches can compromise these models' general-purpose abilities, particularly their reasoning skills. This paper presents a novel methodology that combines Chain-of-Thought prompting with a training pipeline of Supervised Fine-Tuning followed by Direct Preference Optimization (SFT-DPO). This approach aims to enhance ranking performance while preserving the inherent reasoning strengths of LLMs. Experimental evaluations on the TREC Deep Learning datasets demonstrate that our method surpasses existing models such as RankZephyr. Furthermore, it maintains robust performance on the Massive Multitask Language Understanding (MMLU) benchmark, indicating that the staged SFT-DPO pipeline effectively retains general-purpose capabilities.
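
For context on the second training stage, the following is a minimal sketch of the standard DPO objective (Rafailov et al., 2023) as it might be applied to ranking preference pairs. The function name, argument names, and usage are illustrative assumptions, not the paper's implementation; here each log-probability would be the summed log-likelihood of a full model output (e.g., a chain-of-thought plus an ordered list of passage IDs) under the trained policy or the frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over preference pairs (illustrative sketch).

    'chosen' is the preferred output (e.g., the better ranking),
    'rejected' the dispreferred one. Each argument is a tensor of
    summed sequence log-probabilities, shape (batch,).
    """
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # Maximize the margin by which the policy prefers the chosen
    # output over the rejected one, relative to the reference model.
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()

# Illustrative usage with dummy log-probabilities:
if __name__ == "__main__":
    lp = lambda: torch.randn(4)
    print(dpo_loss(lp(), lp(), lp(), lp()).item())
```

Because the reference model anchors the policy to its pre-DPO behavior, this objective is one plausible mechanism for the capability retention the abstract reports, though the paper's exact formulation may differ.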