DreamPRM-1.5: Unlocking the Potential of Each Instance for Multimodal Process Reward Model Training

Abstract
Training multimodal process reward models (PRMs) is challenged by distribution shifts and noisy data. We introduce DreamPRM-1.5, an instance-reweighted framework that adaptively adjusts the importance of each training example via bi-level optimization. We design two complementary strategies: Instance Table, effective for smaller datasets, and Instance Net, scalable to larger ones. Integrated into test-time scaling, DreamPRM-1.5 achieves 84.6 accuracy on the MMMU benchmark, surpassing GPT-5.

Overview

DreamPRM-1.5-InstanceTable and DreamPRM-1.5-InstanceNet

By integrating DreamPRM-1.5 into test-time scaling, we achieve a new state-of-the-art accuracy of 84.6 on the validation set of the MMMU benchmark, surpassing even the strongest existing model, GPT-5, while building on the smaller GPT-5-mini. Moreover, we conduct a thorough sanity check on instance reweighting, which highlights DreamPRM-1.5's potential to approach oracle-level performance under test-time scaling.

| Category | Model / Method | Accuracy |
| --- | --- | --- |
| Leaderboard (external, top-performing models) | GPT-5 w/ thinking | 84.2 |
| | Gemini 2.5 Pro Deep-Think | 84.0 |
| | o3 | 82.9 |
| Test-time scaling (built on GPT-5-mini w/ thinking) | Base: GPT-5-mini w/ thinking | 80.0 |
| | VanillaPRM — No Selection | 79.1 (-0.9) |
| | Self-consistency | 81.4 (+1.4) |
| | VisualPRM | 80.5 (+0.5) |
| | DreamPRM-1.5 — Instance Table | 84.6 (+4.6) |
| | DreamPRM-1.5 — Instance Net | 83.6 (+3.6) |

Model Checkpoints

| Model | Hugging Face Link |
| --- | --- |
| DreamPRM-1.5-InstanceTable | 🤗 Checkpoint link |
| DreamPRM-1.5-InstanceNet | 🤗 Checkpoint link |

Quick Start

All commands below are illustrative—rename scripts / paths to match your repo.

1. Code

Clone our repository and move into the project directory:

git clone https://github.com/coder-qicao/DreamPRM-1.5.git
cd DreamPRM-1.5

2. Environment

# (a) create conda env
conda create -n DreamPRM-1.5 python=3.10 -y
conda activate DreamPRM-1.5

# (b) install requirements
pip install -r requirements.txt   # torch, betty-ml, transformers, accelerate, ...

Verify that torch and torchvision installed correctly by running python -c "import torchvision; print(torchvision.__version__)". If it prints the version number without warnings or errors, you are good to go. Otherwise, uninstall them with conda uninstall pytorch torchvision torchaudio cudatoolkit and reinstall following the official PyTorch installation instructions, choosing the command that matches the CUDA version your GPU driver supports (check with nvidia-smi).

3. Instance-reweighting

The current version of DreamPRM-1.5 is built on InternVL3-1B. Please download InternVL3-1B weights from 🤗InternVL3-1B.

Instance-reweighting for DreamPRM-1.5 fine-tuning:

# cold start
bash run_coldstart.sh
# Instance-Table or Instance-Net
bash run_table.sh
bash run_net.sh

You need at least 80 GB of GPU memory for training.
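The bi-level instance reweighting behind these scripts can be illustrated with a minimal scalar example (hypothetical numbers and a 1-D "model"; the actual implementation trains InternVL3-1B with the Betty library). The lower level takes a gradient step on the instance-weighted training loss, and the upper level differentiates the meta loss through that step to update each instance's weight:

```python
# Minimal sketch of instance reweighting via bi-level optimization.
# Hypothetical 1-D setup: the "model" is a single parameter theta, each
# training instance i contributes a weighted loss w[i] * (theta - t[i])**2,
# and the upper level tunes w to reduce the loss on a clean meta instance.

def bilevel_step(theta, w, train_targets, meta_target, lr=0.1, meta_lr=0.1):
    # Lower level: one SGD step on the instance-weighted training loss.
    grad_theta = sum(2 * wi * (theta - ti) for wi, ti in zip(w, train_targets))
    theta_new = theta - lr * grad_theta

    # Upper level: differentiate the meta loss through the inner update
    # to obtain d(meta_loss)/d(w_i), then update the instance weights.
    dmeta_dtheta = 2 * (theta_new - meta_target)
    w_new = []
    for wi, ti in zip(w, train_targets):
        dtheta_dw = -lr * 2 * (theta - ti)
        w_new.append(wi - meta_lr * dmeta_dtheta * dtheta_dw)
    return theta_new, w_new

theta, w = bilevel_step(theta=0.5, w=[1.0, 1.0],
                        train_targets=[1.0, -1.0], meta_target=1.0)
# The instance whose target agrees with the meta data gains weight;
# the conflicting instance loses weight.
print(w)
```

The Instance Table variant stores one such learnable weight per training example, while the Instance Net variant predicts the weight from instance features, which scales to larger datasets.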

Best-of-N selection using the re-trained PRM:

Use the re-trained PRM to select the most promising CoT among the candidates. Try different aggregation functions—such as mean, log-mean, or other variants—to evaluate and aggregate step-level scores effectively.

python test.py
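As a sketch of what such aggregation functions might look like (illustrative scores and helper names; the actual scoring lives in test.py):

```python
import math

# Aggregate per-step PRM scores into one score per chain of thought (CoT),
# then pick the best of N candidates.

def aggregate(step_scores, fn="mean"):
    if fn == "mean":
        return sum(step_scores) / len(step_scores)
    if fn == "min":       # most pessimistic: score of the weakest step
        return min(step_scores)
    if fn == "log_mean":  # geometric mean: heavily penalizes any weak step
        return math.exp(sum(math.log(s) for s in step_scores) / len(step_scores))
    raise ValueError(f"unknown aggregation: {fn}")

def best_of_n(candidates, fn="mean"):
    return max(range(len(candidates)), key=lambda i: aggregate(candidates[i], fn))

cots = [[0.9, 0.8, 0.7],    # steadily solid reasoning
        [0.95, 0.95, 0.2]]  # strong start, one bad step
print(best_of_n(cots, "mean"))  # index of the selected CoT
```

Note how mean and log_mean can rank candidates differently when one chain has a single bad step, which is why trying several aggregation functions is worthwhile.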

4. Configuration Parameters

Data file path and model path

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --train_json_file | str | "./data/train.json" | Training data file path |
| --meta_json_file | str | "./data/meta.json" | Meta data file path |
| --weights_path | str | "./weights" | Model weights path |
| --reward_model | str | "OpenGVLab/InternVL3-1B" | Reward model |

Bi-level optimization configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --iteration_num | int | 100000 | Total iterations |
| --save_every_iterations | int | 1 | Save frequency |
| --unroll_steps | int | 1 | Unroll steps |
| --gradiant_accumulation | int | 1 | Gradient accumulation steps |
| --gradiant_clipping | float | 1.0 | Gradient clipping value |
| --device | str | "cuda" | Device type |
| --precision | str | "bf16" | Precision mode |
| --strategy | str | "default" | Training strategy |
| --rollback | flag | False | Enable rollback |
| --baseline | flag | False | Use baseline |
| --seed | int | 1 | Random seed |
| --local_rank | int | 0 | Local rank for distributed training |

Lower-level optimization hyperparameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --lr | float | 5e-5 | Learning rate |
| --momentum | float | 0.9 | Optimizer momentum |
| --weight_decay | float | 0.01 | Weight decay |
| --batch_size | int | 1 | Batch size |
| --scheduler_step_size | int | 1000 | Scheduler step size |
| --scheduler_gamma | float | 0.95 | Scheduler decay factor |

Upper-level optimization hyperparameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --meta_lr | float | 1e-1 | Meta learning rate |
| --meta_momentum | float | 0.9 | Meta optimizer momentum |
| --meta_weight_decay | float | 1e-3 | Meta weight decay |
| --meta_batch_size | int | 1 | Meta batch size |
| --meta_scheduler_step_size | int | 1000 | Meta scheduler step size |
| --meta_scheduler_gamma | float | 0.95 | Meta scheduler decay factor |

Other important parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --retrain | bool | False | Retrain flag |
| --activation_function | str | "LeakyReLU" | Activation function (LeakyReLU \| ReLU \| No \| Clip) |
| --aggregation_function | str | "mean" | Aggregation function (mean \| max \| min \| log_mean) |
| --loss_target | str | "both" | Loss target (+ \| both) |
| --initialization | float | 1.0 | Initialization value |
| --max_patch_num | int | 6 | Maximum patch number |
| --scheduler_type | str | "cosine_schedule_with_warmup" | Scheduler type |
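A minimal argparse sketch of how a representative subset of these parameters might be declared (defaults copied from the tables above; the real training script defines the full set, so treat this as illustrative):

```python
import argparse

# Declare a subset of the configuration parameters documented above.
def build_parser():
    p = argparse.ArgumentParser(description="DreamPRM-1.5 training (subset)")
    p.add_argument("--train_json_file", type=str, default="./data/train.json")
    p.add_argument("--reward_model", type=str, default="OpenGVLab/InternVL3-1B")
    p.add_argument("--iteration_num", type=int, default=100000)
    p.add_argument("--lr", type=float, default=5e-5)       # lower-level LR
    p.add_argument("--meta_lr", type=float, default=1e-1)  # upper-level LR
    p.add_argument("--aggregation_function", type=str, default="mean",
                   choices=["mean", "max", "min", "log_mean"])
    p.add_argument("--rollback", action="store_true")      # flag, default False
    return p

args = build_parser().parse_args(["--lr", "1e-4", "--rollback"])
print(args.lr, args.rollback)  # overridden values; all others keep their defaults
```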

Acknowledgement

License

This repository is under Apache License 2.0.

Citation

If you use this work in your research, please cite:

@misc{cao2025dreamprmdomainreweightedprocessreward,
      title={DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning}, 
      author={Qi Cao and Ruiyi Wang and Ruiyi Zhang and Sai Ashish Somayajula and Pengtao Xie},
      year={2025},
      eprint={2505.20241},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.20241}, 
}