	---
	base_model:
	- Qwen/Qwen2.5-VL-7B-Instruct
	language:
	- en
	license: apache-2.0
	pipeline_tag: video-text-to-text
	tags:
	- multimodal
	library_name: transformers
	---

	# TimeSearch-R-7B
	- Code: https://github.com/Time-Search/TimeSearch-R
	- Paper: [TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning](https://arxiv.org/abs/2511.05489)

	## Usage

	The snippet below shows a simple generation process for our model. For more details, refer to the [GitHub repository](https://github.com/Time-Search/TimeSearch-R).

	```python
	import numpy as np
	import torch
	from longvu.builder import load_pretrained_model
	from longvu.constants import (
	    DEFAULT_IMAGE_TOKEN,
	    IMAGE_TOKEN_INDEX,
	)
	from longvu.conversation import conv_templates, SeparatorStyle
	from longvu.mm_datautils import (
	    KeywordsStoppingCriteria,
	    process_images,
	    tokenizer_image_token,
	)
	from decord import cpu, VideoReader

	# Load the tokenizer, model, and image processor from a local checkpoint.
	tokenizer, model, image_processor, context_len = load_pretrained_model(
	    "./checkpoints/longvu_qwen", None, "cambrian_qwen",
	)

	model.eval()
	video_path = "./examples/video1.mp4"
	qs = "Describe this video in detail"

	# Sample roughly one frame per second of video.
	vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
	fps = float(vr.get_avg_fps())
	frame_indices = np.array([i for i in range(0, len(vr), round(fps))])
	video = []
	for frame_index in frame_indices:
	    img = vr[frame_index].asnumpy()
	    video.append(img)
	video = np.stack(video)
	image_sizes = [video[0].shape[:2]]
	video = process_images(video, image_processor, model.config)
	video = [item.unsqueeze(0) for item in video]

	# Build the prompt: image placeholder token, a newline, then the question.
	qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
	conv = conv_templates["qwen"].copy()
	conv.append_message(conv.roles[0], qs)
	conv.append_message(conv.roles[1], None)
	prompt = conv.get_prompt()

	input_ids = tokenizer_image_token(
	    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
	).unsqueeze(0).to(model.device)

	# Stop generation at the conversation separator.
	stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
	keywords = [stop_str]
	stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

	with torch.inference_mode():
	    output_ids = model.generate(
	        input_ids,
	        images=video,
	        image_sizes=image_sizes,
	        do_sample=False,  # greedy decoding
	        temperature=0.2,  # has no effect while do_sample=False
	        max_new_tokens=128,
	        use_cache=True,
	        stopping_criteria=[stopping_criteria],
	    )
	pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
	```
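
The frame-sampling line in the snippet above keeps roughly one frame per second by stepping through the video in strides of `round(fps)`. The sketch below isolates that index computation in plain Python. `sample_frame_indices` is a hypothetical helper name, not part of the TimeSearch-R or LongVU code, and the `max(1, ...)` guard is an added assumption so that a very low reported frame rate does not produce a zero step.

```python
def sample_frame_indices(num_frames: int, fps: float) -> list[int]:
    """Return indices of roughly one frame per second of video.

    Hypothetical helper for illustration; it mirrors
    range(0, len(vr), round(fps)) from the usage snippet.
    """
    # Guard (an assumption, not in the original snippet): range() raises
    # ValueError on a zero step, which round(fps) yields for fps <= 0.5.
    step = max(1, round(fps))
    return list(range(0, num_frames, step))


# A 3-second clip at 30 fps yields one index per second.
print(sample_frame_indices(90, 30.0))  # [0, 30, 60]
```

With decord, `num_frames` would be `len(vr)` and `fps` would be `vr.get_avg_fps()`, as in the usage snippet.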

	## Citation

	If you find our work helpful, please consider citing it.

	```bibtex
	@article{timesearch-r,
	title={TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning},
	author={Pan, Junwen and Zhang, Qizhe and Zhang, Rui and Lu, Ming and Wan, Xin and Zhang, Yuan and Liu, Chang and She, Qi},
	journal={arXiv preprint arXiv:2511.05489},
	year={2025}
	}
	```