Pitfalls
August 17, 2025
Vision tower missing during eval?
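# Create the directory and make it writable for the current user
sudo mkdir -p /data0/jacklishufan
sudo chown "$USER" /data0/jacklishufan
# Download the vision tower to the expected location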
export HF_ENDPOINT="https://hf-mirror.com"
python -c "
import os
from huggingface_hub import snapshot_download
print('Downloading the SigLIP vision tower to /data0/jacklishufan/siglip-so400m-patch14-384...')
snapshot_download(
repo_id='google/siglip-so400m-patch14-384',
local_dir='/data0/jacklishufan/siglip-so400m-patch14-384',
resume_download=True
)
print('Download complete!')
"
Out of GPU memory
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.69 GiB. GPU 0 has a total capacity of 23.55 GiB of which 1.34 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 19.56 GiB is allocated by PyTorch, and 2.19 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
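The numbers in this message are worth reading before picking a fix. The sketch below pulls them out with regular expressions and applies a rough heuristic: if the memory that PyTorch has reserved but not allocated would alone have covered the failed request, fragmentation is a plausible cause. The helper name `parse_oom` is hypothetical, and the heuristic is an assumption for illustration, not an official PyTorch diagnostic.

```python
import re

# Hypothetical helper: reads the GiB figures out of a PyTorch CUDA OOM message.
def parse_oom(message: str) -> dict:
    """Extract the GiB figures from a torch.OutOfMemoryError message."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"total capacity of ([\d.]+) GiB",
        "free": r"of which ([\d.]+) GiB is free",
        "allocated": r"([\d.]+) GiB is allocated by PyTorch",
        "reserved_unallocated": r"([\d.]+) GiB is reserved by PyTorch but unallocated",
    }
    found = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            found[key] = float(match.group(1))
    return found

# Shortened copy of the message shown above.
msg = (
    "CUDA out of memory. Tried to allocate 1.69 GiB. GPU 0 has a total capacity "
    "of 23.55 GiB of which 1.34 GiB is free. Of the allocated memory 19.56 GiB is "
    "allocated by PyTorch, and 2.19 GiB is reserved by PyTorch but unallocated."
)
info = parse_oom(msg)

# Rough heuristic (an assumption): the cached-but-unused pool is larger than
# the failed request, so the request may have failed for lack of a contiguous block.
fragmentation_suspected = info["reserved_unallocated"] >= info["requested"]
print(info, fragmentation_suspected)
```

Here the reserved-but-unallocated pool (2.19 GiB) exceeds the request (1.69 GiB), which is consistent with the hint in the message itself to try `expandable_segments:True`.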
- Reduce allocator fragmentation (the simplest fix):
bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
- Limit the generation length: --batch_size is already 1, so cap the number of generated tokens instead:
bash
--batch_size 1 --gen_kwargs "max_new_tokens=32" # limit the generation length
- Enable 8-bit quantization: add load_in_8bit=True to model_args:
bash
--model_args "pretrained=...,conv_template=llada,model_name=llava_llada,load_in_8bit=True"
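- Free GPU memory held by stale processes before restarting the program:
bash
nvidia-smi --gpu-reset # needs root and no process using the GPU
# Or list the processes that still hold the GPU so they can be killed
sudo fuser -v /dev/nvidia*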
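After cleaning up, it can help to confirm that the memory was actually released before launching again. The sketch below is one way to do that from Python by parsing the CSV output of `nvidia-smi --query-gpu`. The helper names `parse_gpu_memory` and `gpu_memory` are hypothetical, and the sample text is made up for illustration, not a real measurement.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_gpu_memory(csv_text: str) -> list[tuple[int, int, int]]:
    """Parse lines like '0, 20028, 24564' into (index, used_MiB, total_MiB)."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows

def gpu_memory() -> list[tuple[int, int, int]]:
    """Run nvidia-smi and return per-GPU memory usage in MiB."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)

# Illustrative sample text, not a real measurement.
sample = "0, 20028, 24564\n1, 3, 24564\n"
for index, used, total in parse_gpu_memory(sample):
    print(f"GPU {index}: {used} / {total} MiB used")
```

On a machine with an NVIDIA driver installed, calling `gpu_memory()` instead of parsing the sample returns the live figures.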
Recommended full command:
bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
accelerate launch -m lmms_eval \
--model llava_llada \
--model_args "pretrained=lavida-ckpts/lavida-llada-hd,conv_template=llada,model_name=llava_llada,load_in_8bit=True" \
--tasks mmbench_en_dev_lite \
--batch_size 1 \
--gen_kwargs "max_new_tokens=32" \
--output_path ./logs/
This should resolve the out-of-memory error. Try the environment variable first; if that is not enough, add 8-bit quantization.
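To try the fixes one at a time, the recommended command can be assembled from Python so that 8-bit loading is a single switch. This is a minimal sketch: `build_eval_command` is a hypothetical helper that only reproduces the flags already shown above, and it does not call lmms_eval itself.

```python
import os
import shlex

# Hypothetical helper: assembles the same command line as the recommended one.
def build_eval_command(
    pretrained: str,
    tasks: str,
    max_new_tokens: int = 32,
    load_in_8bit: bool = True,
    output_path: str = "./logs/",
) -> list[str]:
    model_args = {
        "pretrained": pretrained,
        "conv_template": "llada",
        "model_name": "llava_llada",
    }
    if load_in_8bit:
        model_args["load_in_8bit"] = "True"
    return [
        "accelerate", "launch", "-m", "lmms_eval",
        "--model", "llava_llada",
        "--model_args", ",".join(f"{k}={v}" for k, v in model_args.items()),
        "--tasks", tasks,
        "--batch_size", "1",
        "--gen_kwargs", f"max_new_tokens={max_new_tokens}",
        "--output_path", output_path,
    ]

cmd = build_eval_command("lavida-ckpts/lavida-llada-hd", "mmbench_en_dev_lite")
# Pass the allocator setting through the environment of the child process.
env = dict(os.environ, PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True")
print(shlex.join(cmd))
# To launch it: subprocess.run(cmd, env=env, check=True)
```

Calling the helper with `load_in_8bit=False` gives the variant that relies on the environment variable alone, which matches the order of attempts suggested above.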