fix crash in tiiuae/falcon-11B-vlm image-to-text generation #34728

Open · wants to merge 1 commit into base: main

Conversation

sywangyi (Contributor)

What does this PR do?

Fixes a crash when using tiiuae/falcon-11B-vlm for the image-to-text generation task.

Two crashes are fixed:

  1. In the inputs_embeds calculation: the model config sets "image_token_index": 65024 while "vocab_size" is also 65024, so looking the image token up in the text embedding table goes out of bounds (see the sketch below).
  2. In the language-model forward call inside llava_next: the call passes num_logits_to_keep, but the forward of FalconForCausalLM does not accept that parameter, so the call crashes.
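For the first crash, a minimal standalone sketch of the out-of-bounds lookup and the masking workaround (illustrative tensor sizes and names, not the actual modeling_llava_next code):

import torch
import torch.nn as nn

vocab_size = 65024
image_token_index = 65024  # equal to vocab_size, i.e. one past the last valid embedding row

embed_tokens = nn.Embedding(vocab_size, 16)
input_ids = torch.tensor([[1, 2, image_token_index, 3]])

# embed_tokens(input_ids) would raise "index out of range in self" here.
# Workaround along the lines of the fix: zero out image-token positions before the
# lookup; those positions are overwritten with projected image features afterwards.
input_ids_masked = input_ids.clone()
input_ids_masked[input_ids == image_token_index] = 0
inputs_embeds = embed_tokens(input_ids_masked)
print(inputs_embeds.shape)  # torch.Size([1, 4, 16])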

qubvel (Member) commented Nov 15, 2024

Hi @sywangyi, thanks for submitting a PR! Can you please provide your environment setup and a minimal example so we can reproduce the error on our side too? Adding a test would also be much appreciated! Thank you!

cc @zucchini-nlp for vlms

Copilot reviewed 2 out of 2 changed files in this pull request and generated no suggestions.

Comments skipped due to low confidence (2)

src/transformers/models/llava_next/modeling_llava_next.py:840

  • Ensure that self.config.image_token_index is defined and used correctly to avoid runtime errors.
input_ids_mask[input_ids == self.config.image_token_index] = 0

src/transformers/models/falcon/modeling_falcon.py:1280

  • The new parameter 'num_logits_to_keep' should have test cases to ensure the slicing behavior of 'hidden_states' is correct.
num_logits_to_keep: int = 0,
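
For context on the second review note, a hedged sketch of how num_logits_to_keep is typically applied in transformers causal-LM forwards (the function and tensor names below are made up for illustration, not the actual FalconForCausalLM code):

import torch

def project_logits(hidden_states: torch.Tensor, lm_head: torch.nn.Linear, num_logits_to_keep: int = 0) -> torch.Tensor:
    # Only the last num_logits_to_keep positions are projected to vocabulary logits;
    # with the default of 0 the slice [-0:] keeps every position.
    return lm_head(hidden_states[:, -num_logits_to_keep:, :])

hidden_states = torch.randn(1, 10, 32)  # (batch, sequence, hidden)
lm_head = torch.nn.Linear(32, 100)      # hidden -> vocab
print(project_logits(hidden_states, lm_head, num_logits_to_keep=1).shape)  # torch.Size([1, 1, 100])
print(project_logits(hidden_states, lm_head).shape)                        # torch.Size([1, 10, 100])
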
sywangyi (Contributor, Author)

I'm using the latest transformers code, and here's my example:

from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor
from PIL import Image
import requests
import torch

processor = LlavaNextProcessor.from_pretrained("tiiuae/falcon-11B-vlm", tokenizer_class='PreTrainedTokenizerFast')
model = LlavaNextForConditionalGeneration.from_pretrained("tiiuae/falcon-11B-vlm", torch_dtype=torch.bfloat16)
model.to('cuda:0')

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "User:<image>\nWhat is shown in this image?\nAssistant:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to('cuda:0')
output = model.generate(**inputs, max_new_tokens=100, do_sample=False)

generated_captions = processor.decode(output[0], skip_special_tokens=True).strip()
print(generated_captions)
