ChainRank-DPO: Chain Rank Direct Preference Optimization for LLM Rankers

Abstract
In information retrieval, large language models (LLMs) have shown significant promise in text reranking tasks by leveraging their advanced reasoning capabilities. However, traditional supervised fine-tuning approaches can compromise these models' general-purpose abilities, particularly their reasoning skills. This paper presents a novel methodology that combines Chain-of-Thought prompting with a training pipeline of Supervised Fine-Tuning followed by Direct Preference Optimization (SFT-DPO). This approach aims to enhance ranking performance while preserving the inherent reasoning strengths of LLMs. Experimental evaluations on the TREC Deep Learning datasets demonstrate that our method surpasses existing models such as RankZephyr. Furthermore, it maintains robust performance on the Massive Multitask Language Understanding (MMLU) benchmark, indicating that the staged SFT-DPO pipeline effectively retains general-purpose capabilities.
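
For context on the second training stage, the following is a minimal sketch of the standard DPO objective (Rafailov et al., 2023) as it might be applied to ranking preference pairs. The function name, argument names, and usage are illustrative assumptions, not the paper's implementation; here each log-probability would be the summed log-likelihood of a full model output (e.g., a chain-of-thought plus an ordered list of passage IDs) under the trained policy or the frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over preference pairs (illustrative sketch).

    'chosen' is the preferred output (e.g., the better ranking),
    'rejected' the dispreferred one. Each argument is a tensor of
    summed sequence log-probabilities, shape (batch,).
    """
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # Maximize the margin by which the policy prefers the chosen
    # output over the rejected one, relative to the reference model.
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()

# Illustrative usage with dummy log-probabilities:
if __name__ == "__main__":
    lp = lambda: torch.randn(4)
    print(dpo_loss(lp(), lp(), lp(), lp()).item())
```

Because the reference model anchors the policy to its pre-DPO behavior, this objective is one plausible mechanism for the capability retention the abstract reports, though the paper's exact formulation may differ.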