Pitfalls

David Liu, August 17, 2025, about 1 min read

Vision tower missing during eval?

# Create the directory (sudo leaves it owned by root, so hand it to the current user)
sudo mkdir -p /data0/jacklishufan
sudo chown "$USER" /data0/jacklishufan

# Download the vision tower to the expected location (via the hf-mirror.com mirror)
export HF_ENDPOINT="https://hf-mirror.com"
python -c "
from huggingface_hub import snapshot_download

print('Downloading SigLIP vision tower to /data0/jacklishufan/siglip-so400m-patch14-384...')
snapshot_download(
    repo_id='google/siglip-so400m-patch14-384',
    local_dir='/data0/jacklishufan/siglip-so400m-patch14-384',
    resume_download=True
)
print('Download complete!')
"

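Before re-running the eval, it can help to confirm that the download actually landed where the code looks for it. The sketch below is only a sanity check under assumptions: the helper name `missing_files` is hypothetical, and the list of expected files is a guess at what a typical Hugging Face snapshot of this model contains, not something the eval code is known to verify.

```python
from pathlib import Path

# Assumed file names for illustration; adjust to what the repo really ships.
EXPECTED_FILES = ("config.json", "preprocessor_config.json")


def missing_files(local_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from local_dir."""
    root = Path(local_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    target = "/data0/jacklishufan/siglip-so400m-patch14-384"
    missing = missing_files(target)
    if missing:
        print(f"{target} is incomplete, missing: {missing}")
    else:
        print(f"{target} looks complete")
```

If the check reports missing files, re-running the download command above is usually enough, since interrupted downloads resume.
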
Running out of GPU memory (OOM)

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.69 GiB. GPU 0 has a total capacity of 23.55 GiB of which 1.34 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 19.56 GiB is allocated by PyTorch, and 2.19 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
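
One rough way to read this message: if the "reserved by PyTorch but unallocated" figure is at least as large as the failed request, the memory probably exists but is split into pieces that are too small, so fragmentation is a plausible cause. The sketch below encodes that heuristic. The function name `fragmentation_likely` is hypothetical, and the regular expressions assume the exact message wording shown above.

```python
import re

_SIZE = r"([\d.]+) (GiB|MiB)"
_TO_GIB = {"GiB": 1.0, "MiB": 1.0 / 1024}


def _gib(value, unit):
    """Convert a (number, unit) pair from the message into GiB."""
    return float(value) * _TO_GIB[unit]


def fragmentation_likely(message):
    """Heuristic: True if reserved-but-unallocated memory >= the failed request.

    Returns None when the message does not match the assumed wording.
    """
    requested = re.search(r"Tried to allocate " + _SIZE, message)
    reserved = re.search(_SIZE + r" is reserved by PyTorch but unallocated", message)
    if not (requested and reserved):
        return None
    return _gib(*reserved.groups()) >= _gib(*requested.groups())


if __name__ == "__main__":
    # Shortened copy of the error message above.
    log = (
        "CUDA out of memory. Tried to allocate 1.69 GiB. "
        "Of the allocated memory 19.56 GiB is allocated by PyTorch, "
        "and 2.19 GiB is reserved by PyTorch but unallocated."
    )
    print(fragmentation_likely(log))
```

For the log in this post, 2.19 GiB reserved against a 1.69 GiB request fits the fragmentation pattern, which is why the allocator setting is worth trying first.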

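Reduce allocator fragmentation (the simplest fix). The log shows 2.19 GiB reserved but unallocated, more than the 1.69 GiB request, so fragmentation is the likely culprit: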

bash

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Reduce the per-step memory: you are already at --batch_size 1, so the remaining knob is the generation length:

bash

--batch_size 1 --gen_kwargs "max_new_tokens=32"  # limit generation length

Load the model with 8-bit quantization: add load_in_8bit=True to model_args:

bash

--model_args "pretrained=...,conv_template=llada,model_name=llava_llada,load_in_8bit=True"

Free GPU memory: before restarting the program, run:

bash

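# Reset the GPU (needs root, and no process may be using the GPU)
sudo nvidia-smi --gpu-reset
# Or list the processes still holding the GPU devices, then kill any stale ones
sudo fuser -v /dev/nvidia*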

Recommended full command:

bash

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
accelerate launch -m lmms_eval \
  --model llava_llada \
  --model_args "pretrained=lavida-ckpts/lavida-llada-hd,conv_template=llada,model_name=llava_llada,load_in_8bit=True" \
  --tasks mmbench_en_dev_lite \
  --batch_size 1 \
  --gen_kwargs "max_new_tokens=32" \
  --output_path ./logs/

This should resolve the out-of-memory error. Try the environment variable first; if that is not enough, add 8-bit quantization.
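
The escalation described here (allocator setting first, 8-bit only if still needed) can also be scripted. This is a minimal sketch, assuming the same flags as the command above; the helper names `build_eval_command` and `build_env` are hypothetical, and the script only prints the commands rather than launching them.

```python
import os
import shlex


def build_eval_command(pretrained, load_in_8bit=False, max_new_tokens=32):
    """Assemble the lmms_eval invocation from the post as an argument list."""
    model_args = {
        "pretrained": pretrained,
        "conv_template": "llada",
        "model_name": "llava_llada",
    }
    if load_in_8bit:
        model_args["load_in_8bit"] = "True"
    return [
        "accelerate", "launch", "-m", "lmms_eval",
        "--model", "llava_llada",
        "--model_args", ",".join(f"{k}={v}" for k, v in model_args.items()),
        "--tasks", "mmbench_en_dev_lite",
        "--batch_size", "1",
        "--gen_kwargs", f"max_new_tokens={max_new_tokens}",
        "--output_path", "./logs/",
    ]


def build_env(base=None):
    """Copy the environment and enable expandable segments in the allocator."""
    env = dict(os.environ if base is None else base)
    env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    return env


if __name__ == "__main__":
    ckpt = "lavida-ckpts/lavida-llada-hd"
    # Step 1: allocator setting only. Step 2: add 8-bit if step 1 still OOMs.
    for use_8bit in (False, True):
        print(shlex.join(build_eval_command(ckpt, load_in_8bit=use_8bit)))
```

To actually run a step, pass the list to `subprocess.run(cmd, env=build_env(), check=True)` and move on to the 8-bit variant only if the first attempt fails.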