
Omost: An AI Project that Transforms LLM Coding Capabilities into Image Composition


Omost is an innovative project designed to enhance the image generation capabilities of large language models (LLMs) by converting their coding proficiency into advanced image composition skills. Pronounced "almost," the name symbolizes two key ideas: first, after using Omost, the image will be "almost" perfect; second, "O" stands for "omni" (multi-modal), and "most" signifies extracting the utmost potential from the technology.

Omost equips LLMs with the ability to write code that composes visual content on a virtual Canvas agent. This Canvas can then be rendered using specific implementations of image generators to create actual images.

Example prompt: "a ragged man wearing a tattered jacket in the nineteenth century."

Key Features and Models

Currently, Omost provides three pretrained LLM models based on variations of Llama3 and Phi3:

1. omost-llama-3-8b

2. omost-dolphin-2.9-llama3-8b

3. omost-phi-3-mini-128k

These models are trained using a diverse dataset that includes:

  • Ground-truth annotations from several datasets, including Open-Images.
  • Data extracted through automatic image annotation.
  • Reinforcement learning via Direct Preference Optimization (DPO), ensuring the code can be compiled by Python 3.10.
  • A small amount of tuning data from OpenAI GPT-4's multi-modal capabilities.

To start using Omost, users can access the official HuggingFace space or deploy it locally. Local deployment requires an Nvidia GPU with at least 8GB of VRAM.

Understanding the Canvas Agent

The Canvas agent is central to Omost’s image composition. It provides functions to set global and local descriptions of images:

  • `Canvas.set_global_description`: Annotates the entire image.
  • `Canvas.add_local_description`: Annotates a specific part of the image.
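To make the interface concrete, here is a minimal stand-in Canvas class. It is an illustrative sketch only: the parameter names follow the article's description of the composition parameters, but the real signatures in the Omost repository may differ.

```python
# Minimal stand-in for Omost's Canvas agent (illustrative; the real class
# in the Omost codebase has a richer signature).
class Canvas:
    def __init__(self):
        self.global_description = None
        self.local_descriptions = []

    def set_global_description(self, description, detailed_descriptions=None):
        # Annotates the entire image with an overall sub-prompt.
        self.global_description = {
            'description': description,
            'detailed_descriptions': detailed_descriptions or [],
        }

    def add_local_description(self, location, offset, area, description,
                              distance_to_viewer=1.0,
                              HTML_web_color_name='gray'):
        # Annotates one region: placement words, relative depth, base color.
        self.local_descriptions.append({
            'location': location,
            'offset': offset,
            'area': area,
            'description': description,
            'distance_to_viewer': distance_to_viewer,
            'HTML_web_color_name': HTML_web_color_name,
        })

canvas = Canvas()
canvas.set_global_description('a street scene in the nineteenth century')
canvas.add_local_description(
    location='in the center', offset='no offset', area='a large square area',
    description='a ragged man wearing a tattered jacket',
    distance_to_viewer=5.0, HTML_web_color_name='darkslategray')
print(len(canvas.local_descriptions))  # → 1
```

An LLM fine-tuned for Omost emits exactly this kind of call sequence as ordinary Python, which a renderer then interprets.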

Parameters for Image Composition

  • Descriptions: These are “sub-prompts” (less than 75 tokens) that describe elements independently.
  • Location, Offset, and Area: These define an element's bounding box on a coarse 9×9 grid, refined by offset and area words, yielding 729 possible placements.
  • Distance to Viewer: Indicates the relative depth of elements.
  • HTML Web Color Name: Specifies the color using standard HTML color names.
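The grid-based placement can be sketched as follows. The vocabulary and arithmetic here are hypothetical (the real Omost renderer defines its own word-to-box mapping); the sketch only shows how location, offset, and area words could resolve to a pixel bounding box on a 9×9 grid.

```python
# Hypothetical sketch: map (location, offset, area) words onto a 9x9 cell
# grid and derive a pixel bounding box. Vocabularies are illustrative.
LOCATIONS = {  # coarse placement -> grid cell (row, col) on 0..8
    'in the center': (4, 4), 'on the left': (4, 1), 'on the right': (4, 7),
    'on the top': (1, 4), 'on the bottom': (7, 4),
}
OFFSETS = {  # fine nudge, measured in grid cells
    'no offset': (0, 0), 'slightly to the upper': (-1, 0),
    'slightly to the lower': (1, 0), 'slightly to the left': (0, -1),
    'slightly to the right': (0, 1),
}
AREAS = {  # half-extent of the box, in grid cells
    'a small square area': 1, 'a medium square area': 2,
    'a large square area': 3,
}

def to_bbox(location, offset, area, image_size=1024):
    cell = image_size / 9  # size of one grid cell in pixels
    r, c = LOCATIONS[location]
    dr, dc = OFFSETS[offset]
    half = AREAS[area]
    r, c = r + dr, c + dc
    # Clamp so the box stays inside the 9x9 grid, then scale to pixels.
    x0 = max(0, c - half) * cell
    y0 = max(0, r - half) * cell
    x1 = min(9, c + half + 1) * cell
    y1 = min(9, r + half + 1) * cell
    return (round(x0), round(y0), round(x1), round(y1))

print(to_bbox('in the center', 'no offset', 'a large square area'))
# → (114, 114, 910, 910)
```

With roughly 9 locations, 9 offsets, and 9 areas, such a scheme produces the 729 placements the parameters describe.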

Advanced Rendering Techniques

Omost provides a baseline renderer based on attention manipulation, offering several methods for region-guided diffusion, including:

1. Multi-Diffusion: Runs UNet on different locations and merges results.

2. Attention Decomposition: Splits attention to handle different regions separately.

3. Attention Score Manipulation: Modifies attention scores to ensure proper activation in specified regions.

4. Gradient Optimization: Uses attention activations to compute loss functions and optimize gradients.

5. External Control Models: Utilizes models like GLIGEN and InstanceDiffusion for region guidance.
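The first of these methods, Multi-Diffusion, can be illustrated with a toy NumPy sketch of the merge step: a per-region prediction is made for each bounding box, and overlapping predictions are averaged back into one latent. A real implementation would call the diffusion UNet on each view at every denoising step; the `fake_unet` here is a stand-in.

```python
import numpy as np

# Toy sketch of the Multi-Diffusion merge step. fake_unet stands in for
# one denoising prediction by the UNet on a cropped latent view.
def fake_unet(latent_patch):
    return latent_patch * 0.9  # stand-in prediction

def merge_regions(latent, boxes):
    acc = np.zeros_like(latent)    # sum of per-region predictions
    count = np.zeros_like(latent)  # how many regions cover each pixel
    for (y0, y1, x0, x1) in boxes:
        acc[y0:y1, x0:x1] += fake_unet(latent[y0:y1, x0:x1])
        count[y0:y1, x0:x1] += 1
    # Average where regions overlap; keep the original latent elsewhere.
    return np.where(count > 0, acc / np.maximum(count, 1), latent)

latent = np.ones((8, 8))
merged = merge_regions(latent, [(0, 5, 0, 5), (3, 8, 3, 8)])
print(merged[4, 4])  # covered by both boxes: average of two 0.9 predictions
```

The same accumulate-and-average pattern is what lets region-guided diffusion compose many local prompts into one globally consistent image.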

Experimental Features

  • Prompt Prefix Tree: A structure that improves prompt understanding by merging sub-prompts into coherent descriptions.
  • Tags, Atmosphere, Style, and Quality Meta: Experimental parameters that can enhance the overall quality and atmosphere of the generated image.
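The prompt prefix tree idea can be sketched as follows. This is an illustrative reading of the structure, not Omost's actual implementation: each node holds a sub-prompt, and a leaf's final prompt is the merge of all sub-prompts on the path from the root, so every region's description stays coherent with its ancestors.

```python
# Illustrative prompt prefix tree: each node carries a sub-prompt, and a
# leaf's full prompt merges every sub-prompt on the root-to-leaf path.
class PromptNode:
    def __init__(self, sub_prompt, children=None):
        self.sub_prompt = sub_prompt
        self.children = children or []

def collect_prompts(node, prefix=None):
    prefix = (prefix or []) + [node.sub_prompt]
    if not node.children:              # leaf: emit the merged prompt
        return [', '.join(prefix)]
    prompts = []
    for child in node.children:
        prompts.extend(collect_prompts(child, prefix))
    return prompts

tree = PromptNode('a street scene in the nineteenth century', [
    PromptNode('a ragged man', [PromptNode('wearing a tattered jacket')]),
    PromptNode('a gas lamp glowing dimly'),
])
for p in collect_prompts(tree):
    print(p)
# a street scene in the nineteenth century, a ragged man, wearing a tattered jacket
# a street scene in the nineteenth century, a gas lamp glowing dimly
```

Merging sub-prompts this way keeps each regional prompt under the 75-token budget while preserving global context.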

Omost represents a significant step forward in leveraging LLMs for sophisticated image composition. By combining robust coding capabilities with advanced rendering techniques, Omost allows users to generate high-quality images with detailed descriptions and precise control over visual elements. Whether using the official HuggingFace space or deploying locally, Omost provides a powerful toolset for creating compelling visual content.


Niharika is a technical consulting intern at Marktechpost. She is a third-year undergraduate pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a highly enthusiastic individual with a keen interest in machine learning, data science, and AI, and an avid reader of the latest developments in these fields.


