fix crash in tiiuae/falcon-11B-vlm image-to-text generation #34728

Open · wants to merge 1 commit into base: main

Conversation

sywangyi (Contributor)

What does this PR do?

Fixes a crash when using tiiuae/falcon-11B-vlm for the image-to-text generation task.

Two crashes are fixed:

  1. In the inputs_embeds calculation: the model config sets "image_token_index": 65024 while "vocab_size" is also 65024, so looking the image token up in the text embedding table goes out of bounds (see the sketch below).
  2. In the language-model forward call inside llava_next: the call passes num_logits_to_keep, but the forward of FalconForCausalLM does not accept that parameter, so the call crashes.
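For the first crash, a minimal standalone sketch of the out-of-bounds lookup and the masking workaround (illustrative tensor sizes and names, not the actual modeling_llava_next code):

import torch
import torch.nn as nn

vocab_size = 65024
image_token_index = 65024  # equal to vocab_size, i.e. one past the last valid embedding row

embed_tokens = nn.Embedding(vocab_size, 16)
input_ids = torch.tensor([[1, 2, image_token_index, 3]])

# embed_tokens(input_ids) would raise "index out of range in self" here.
# Workaround along the lines of the fix: zero out image-token positions before the
# lookup; those positions are overwritten with projected image features afterwards.
input_ids_masked = input_ids.clone()
input_ids_masked[input_ids == image_token_index] = 0
inputs_embeds = embed_tokens(input_ids_masked)
print(inputs_embeds.shape)  # torch.Size([1, 4, 16])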

qubvel (Member) commented Nov 15, 2024

Hi @sywangyi, thanks for submitting a PR! Can you please provide your environment setup and a minimal example so we can reproduce the error on our side too? Adding a test would also be much appreciated! Thank you!

cc @zucchini-nlp for vlms

Copilot reviewed 2 out of 2 changed files in this pull request and generated no suggestions.

Comments skipped due to low confidence (2)

src/transformers/models/llava_next/modeling_llava_next.py:840

  • Ensure that self.config.image_token_index is defined and used correctly to avoid runtime errors.
input_ids_mask[input_ids == self.config.image_token_index] = 0

src/transformers/models/falcon/modeling_falcon.py:1280

  • The new parameter 'num_logits_to_keep' should have test cases to ensure the slicing behavior of 'hidden_states' is correct.
num_logits_to_keep: int = 0,
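
For context on the second review note, a hedged sketch of how num_logits_to_keep is typically applied in transformers causal-LM forwards (the function and tensor names below are made up for illustration, not the actual FalconForCausalLM code):

import torch

def project_logits(hidden_states: torch.Tensor, lm_head: torch.nn.Linear, num_logits_to_keep: int = 0) -> torch.Tensor:
    # Only the last num_logits_to_keep positions are projected to vocabulary logits;
    # with the default of 0 the slice [-0:] keeps every position.
    return lm_head(hidden_states[:, -num_logits_to_keep:, :])

hidden_states = torch.randn(1, 10, 32)  # (batch, sequence, hidden)
lm_head = torch.nn.Linear(32, 100)      # hidden -> vocab
print(project_logits(hidden_states, lm_head, num_logits_to_keep=1).shape)  # torch.Size([1, 1, 100])
print(project_logits(hidden_states, lm_head).shape)                        # torch.Size([1, 10, 100])
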
sywangyi (Contributor, Author)

I'm using the latest transformers code, and here's my example:

from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor
from PIL import Image
import requests
import torch

processor = LlavaNextProcessor.from_pretrained("tiiuae/falcon-11B-vlm", tokenizer_class='PreTrainedTokenizerFast')
model = LlavaNextForConditionalGeneration.from_pretrained("tiiuae/falcon-11B-vlm", torch_dtype=torch.bfloat16)
model.to('cuda:0')

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "User:<image>\nWhat is shown in this image?\nAssistant:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to('cuda:0')
output = model.generate(**inputs, max_new_tokens=100, do_sample=False)

generated_captions = processor.decode(output[0], skip_special_tokens=True).strip()
print(generated_captions)
