Text Generation (Vision)

Vision models behave similarly to text models, but they can accept and interpret images as well.

Model Spotlight: Qwen2.5-VL-7B-Instruct (Vision + Language)

Qwen2.5-VL-7B-Instruct is a powerful multimodal model designed to understand and process both visual and textual inputs. Whether you're building visual assistants, image-based chatbots, or AI tools that analyze pictures, this model works as a plug-and-play solution.


Getting Started with Vision-Language Inference

  1. Log in to your Sentry Block account.

  2. Make sure you have an active credit balance.

  3. Go to the "Serverless Endpoints" tab and select the Qwen2.5-VL-7B-Instruct model.

  4. Configure your parameters, upload your image (if applicable), enter your prompt, and hit Enter.


Additional Parameter (for Vision-Language models)

In addition to the standard generation parameters, vision-language models accept:

  • Image – The image URL or file to be analyzed alongside the text prompt. See the message-format sketch below.
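
For illustration, here is one common way vision-language endpoints accept an image alongside text: an OpenAI-style messages array whose content mixes text and image_url parts. The field names below are an assumption about Sentry Block's request schema, not taken from its docs; confirm the authoritative format in the [API Reference].

```json
{
  "model": "Qwen/Qwen2.5-VL-7B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "What is shown in this image?" },
        { "type": "image_url", "image_url": { "url": "https://example.com/cat.jpg" } }
      ]
    }
  ]
}
```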


Using the API Directly

You can integrate the model directly into your own projects via the Sentry Block API. Below are some quick examples to get started.

Important: Don’t forget to include your API key. You can find authentication details in the [API Reference].


cURL Example
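
The sketch below assumes an OpenAI-compatible chat-completions endpoint; the base URL https://api.sentryblock.com/v1/chat/completions is a placeholder, and YOUR_API_KEY stands in for your real key (see the [API Reference]).

```bash
curl https://api.sentryblock.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
        ]
      }
    ],
    "max_tokens": 256
  }'
```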


Python Example
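
A minimal Python sketch using the requests library, under the same assumptions as the cURL example (placeholder base URL, OpenAI-style request schema).

```python
import requests

API_KEY = "YOUR_API_KEY"
# Placeholder base URL; confirm the real endpoint in the API Reference.
URL = "https://api.sentryblock.com/v1/chat/completions"

payload = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
    "max_tokens": 256,
}

response = requests.post(
    URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()

# Print the model's reply from the first choice.
print(response.json()["choices"][0]["message"]["content"])
```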


JavaScript Example
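
An equivalent sketch in JavaScript using the built-in fetch API, again with the placeholder URL and assumed OpenAI-style request schema.

```javascript
// Placeholder base URL and assumed payload shape; confirm both in the API Reference.
const response = await fetch("https://api.sentryblock.com/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "Qwen/Qwen2.5-VL-7B-Instruct",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Describe this image in one sentence." },
          { type: "image_url", image_url: { url: "https://example.com/cat.jpg" } },
        ],
      },
    ],
    max_tokens: 256,
  }),
});

// Log the model's reply from the first choice.
const data = await response.json();
console.log(data.choices[0].message.content);
```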


Model Identifier

Use the following model ID in your API calls:

  • Qwen2.5-VL-7B-Instruct: Qwen/Qwen2.5-VL-7B-Instruct


Response Example (Non-Streaming)
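
The response shape below is typical of OpenAI-compatible endpoints; the ID, timestamps, token counts, and content are illustrative values, not output captured from Sentry Block.

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1719000000,
  "model": "Qwen/Qwen2.5-VL-7B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The image shows a tabby cat sitting on a windowsill."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1042,
    "completion_tokens": 14,
    "total_tokens": 1056
  }
}
```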


Response Example (Streaming Enabled)
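
When the request sets "stream": true, OpenAI-compatible endpoints typically return server-sent events, each carrying a small content delta. The chunk below is illustrative and pretty-printed for readability; in the actual stream, each data: line is a single line of JSON.

```json
data: {
  "id": "chatcmpl-abc123",
  "object": "chat.completion.chunk",
  "created": 1719000000,
  "model": "Qwen/Qwen2.5-VL-7B-Instruct",
  "choices": [
    {
      "index": 0,
      "delta": { "content": "The image shows" },
      "finish_reason": null
    }
  ]
}
```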

Multiple chunks like this are streamed in real time; concatenating the delta content fields as they arrive yields the full generated response.
