Alibaba

HappyHorse-1.0

Text-to-video and image-to-video model with 1080p output and short-form clip control

Text to Video · Image to Video

HappyHorse-1.0 Overview

HappyHorse-1.0 is a video generation model for text-to-video and image-to-video workflows. It supports output at 720p or 1080p, clip durations from 3 to 15 seconds, seeded generation, watermark control, and first-frame image conditioning for image-to-video generation.

How to Use HappyHorse-1.0

Overview

HappyHorse-1.0 is a short-form video generation model for text-to-video and image-to-video workflows.

It is designed for short clips with configurable resolution, duration, aspect ratio, and seed control, plus a separate first-frame image workflow for image-to-video generation.

Strengths

Text-to-Video and Image-to-Video in One Family

HappyHorse-1.0 exposes both text-to-video and image-to-video variants. This makes it suitable for teams that want one model family for both prompt-only generation and first-frame-guided motion generation.

720p and 1080p Output

The model supports both 720p and 1080p output, allowing a trade-off between faster preview-style generations and higher-quality HD delivery.

Short-Form Clip Control

HappyHorse-1.0 supports clip durations from 3 to 15 seconds, making it useful for ads, short-form social content, quick concept clips, and other concise generation workflows.

Aspect Ratio Flexibility for Text-to-Video

The text-to-video interface supports multiple aspect ratios, including 16:9, 9:16, 1:1, 4:3, and 3:4. This makes it easier to target landscape, portrait, and square delivery formats from the same model family.

Deterministic Generation Control

The model supports a seed parameter for controlling generation determinism, which is useful when testing prompt variations, iterating on the same concept, or trying to reproduce a successful output more closely.
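
As a sketch of how a fixed seed supports controlled iteration (the payload field names other than seed and the variant name are illustrative assumptions, since this page does not show the exact request shape):

  # A minimal sketch of seeded iteration. Only the seed parameter itself is
  # documented above; the other field names are illustrative assumptions.
  base = {
      "model": "happyhorse-1.0-t2v",
      "prompt": "A paper boat drifting down a rain-soaked street",
      "seed": 42,  # fixed seed keeps repeated runs comparable
  }

  # Changing only the prompt while holding the seed fixed isolates the effect
  # of the wording change from run-to-run randomness.
  variant = {**base, "prompt": base["prompt"] + ", golden hour lighting"}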

Capabilities

Text-to-Video

HappyHorse-1.0 supports prompt-based video generation through the happyhorse-1.0-t2v model variant.
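
A minimal text-to-video request might look like the sketch below. The endpoint URL and authentication header are hypothetical; only the happyhorse-1.0-t2v variant name and the resolution, duration, and aspect ratio values come from this page.

  import requests

  API_URL = "https://api.example.com/v1/video/generate"  # hypothetical endpoint

  payload = {
      "model": "happyhorse-1.0-t2v",  # text-to-video variant
      "prompt": "A horse galloping across a misty meadow at sunrise",
      "resolution": "1080p",          # 720p or 1080p
      "duration": 5,                  # 3 to 15 seconds
      "aspect_ratio": "16:9",         # 16:9, 9:16, 1:1, 4:3, or 3:4
  }

  response = requests.post(
      API_URL,
      json=payload,
      headers={"Authorization": "Bearer YOUR_API_KEY"},
  )
  response.raise_for_status()
  print(response.json())  # response shape depends on the hosting platform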

Image-to-Video

HappyHorse-1.0 supports first-frame image-to-video generation through the happyhorse-1.0-i2v model variant.

First-Frame Conditioning

The image-to-video workflow accepts a single first_frame image as the visual anchor for the generated clip.
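
A corresponding image-to-video sketch, again with a hypothetical endpoint; only the happyhorse-1.0-i2v variant name and the first_frame parameter come from this page.

  import requests

  payload = {
      "model": "happyhorse-1.0-i2v",  # image-to-video variant
      "prompt": "The camera slowly pulls back as fog rolls in",
      "first_frame": "https://example.com/first-frame.png",  # single anchor image URL
      "resolution": "720p",
      "duration": 8,
  }

  response = requests.post(
      "https://api.example.com/v1/video/generate",  # hypothetical endpoint
      json=payload,
      headers={"Authorization": "Bearer YOUR_API_KEY"},
  )
  response.raise_for_status()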

Resolution and Duration Controls

The API exposes controls for resolution, duration, seed, and watermark behavior across the documented workflows.
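
Put together, a payload exercising all four documented controls might look like this sketch (the exact key names, in particular watermark, are assumptions; this page only states that these behaviors are controllable):

  # One payload exercising the four documented controls: resolution,
  # duration, seed, and watermark. Exact key names are assumptions.
  payload = {
      "model": "happyhorse-1.0-t2v",
      "prompt": "Time-lapse of clouds over a mountain lake",
      "resolution": "1080p",
      "duration": 15,      # maximum documented clip length
      "seed": 7,           # deterministic generation control
      "watermark": False,  # watermark behavior; exact key assumed
  }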

Input and Output

  • AIR ID: alibaba:[email protected]
  • Input: text prompts, or a text prompt plus one first-frame image URL for image-to-video
  • Output: generated video clips
  • Resolution: 720p or 1080p
  • Duration: 3 to 15 seconds
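
On platforms that address models through AIR-style identifiers, a request can reference the model by the AIR ID listed above; the call below is a sketch against a hypothetical endpoint, not a specific SDK.

  import requests

  MODEL_AIR_ID = "alibaba:[email protected]"  # from the spec list above

  response = requests.post(
      "https://api.example.com/v1/video/generate",  # hypothetical endpoint
      json={"model": MODEL_AIR_ID, "prompt": "A neon-lit street in the rain"},
      headers={"Authorization": "Bearer YOUR_API_KEY"},
  )
  response.raise_for_status()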

Best Fit

  • Short-form text-to-video generation
  • First-frame-guided image-to-video animation
  • Social, ad, and concept clips in multiple aspect ratios
  • Controlled prompt iteration with seed support
  • HD video generation workflows that do not need long-form output

More models from Alibaba

Wan2.7 is Alibaba's next-generation multimodal video model supporting text-to-video, image-to-video, reference-to-video, and video editing. It features multi-shot storytelling, subject-consistent multi-character generation, first-and-last-frame interpolation, video continuation, style transfer, instruction-based editing, and audio-conditioned generation with auto-dubbing. It outputs 720p or 1080p clips at 30 FPS in multiple aspect ratios.

Wan2.7 Image Pro is the premium variant of Wan2.7 Image offering more stable composition and more precise prompt comprehension. It shares all capabilities of the standard model including avatar customization, color palette control, marquee editing, multilingual text rendering across 12 languages, and multi-image composition, with improved consistency and fidelity for professional workflows.

Wan2.7 Image is a unified image generation and editing model from Alibaba that combines generation and interactive editing in a shared latent space. It features virtual avatar face customization with fine bone structure and eye shape control, a color palette system for extracting and applying consistent color schemes, precise marquee selection editing for pixel-level element manipulation, multilingual text rendering supporting up to 3000 tokens in 12 languages, and compositional generation of up to 12 images in a single output.

Qwen3.5-27B

Coming Soon

Qwen3.5-27B is a 27B-parameter Qwen large language model for general reasoning, coding, multilingual generation, and long-context text workflows. It supports 262K native context extensible to about 1M tokens and is positioned as a smaller open-weight alternative to the flagship Qwen3.5 MoE models.

Qwen3.5-397B

Coming Soon

Qwen3.5-397B is a frontier Qwen large language model for reasoning, coding, search, and agent workflows. The underlying open-weight flagship uses a sparse MoE design with 397B total parameters and 17B activated parameters, supports 262K native context extensible to about 1M tokens, and is designed for high-throughput long-context inference.

Qwen-Image-2.0-Pro builds on Qwen-Image-2.0 with optimized visual fidelity, improved layout and typography handling, and advanced editing control for professional creative and enterprise applications. It delivers richer detail, more accurate text and iconography rendering, and refined editing semantics across a wide range of visual styles, making it suitable for advertising, branding, design systems, and high-impact visual content.

Qwen-Image-2.0 is an advanced image generation and editing model from Alibaba that produces high-quality images at native 2K resolution and renders professional-grade text within visuals. It unifies text-to-image and image-to-image editing into a single model with strong semantic understanding and adheres to detailed prompt instructions. The model excels at generating images that include complex textual content, infographics, posters, and layout-driven visuals.

Z-Image is a powerful open-source image generation model with 6 billion parameters built on a scalable single-stream diffusion transformer architecture. It delivers high visual fidelity, strong prompt adherence, and diverse stylistic output for text-to-image and image-to-image tasks, and serves as the full-capacity foundation for distilled variants like Z-Image-Turbo.

Wan2.6 Flash is a distilled, low-latency variant of the Wan2.6 multimodal video model designed for rapid image-to-video generation with fluid motion, visual stability, and optional synchronized audio. It produces HD clips from detailed static images while preserving subject structure and motion realism, making it suitable for preview workflows and high-throughput creative pipelines.

Qwen3-TTS 1.7B VoiceDesign is a text-to-speech model from Alibaba that creates custom voices from natural language descriptions specifying emotion, tone, and prosody. It supports voice cloning from a 3-second audio sample, generates speech in 10+ languages including Chinese, English, Japanese, Korean, and European languages, and achieves latency as low as 97ms.

Qwen3-TTS 1.7B Base is the foundation text-to-speech model from Alibaba's Qwen3-TTS family. It generates human-like speech across 10+ languages including Chinese, English, Japanese, Korean, and European languages. It supports voice cloning from a 3-second audio sample and achieves latency as low as 97ms for real-time applications.

Qwen3-TTS 1.7B CustomVoice is a text-to-speech model from Alibaba that offers nine premium preset timbres across various combinations of gender, age, language, and dialect. It provides precise style control over target voices through user instructions, supports voice cloning from a 3-second sample, and generates speech in 10+ languages with latency as low as 97ms.

Qwen-Image-2512

API Only

Qwen-Image-2512 is an improved version of the Qwen-Image image foundation model with enhanced prompt understanding, superior text rendering accuracy, and more realistic visual details. It generates high-fidelity images from text prompts across diverse subjects and styles.

Qwen-Image-Layered

API Only

Qwen-Image-Layered decomposes a static image into multiple RGBA layers, enabling independent editing of semantically distinct components without interfering with other parts of the image. This layered representation supports high-fidelity image editing tasks like resizing, repositioning, recoloring, and object manipulation with consistent detail and transparency handling.

Wan2.6 is a multimodal video model for text-to-video and image-to-video generation with support for multi-shot sequencing and native sound. It emphasizes temporal stability, consistent visual structure across shots, and reliable alignment between visuals and audio in short-form video generation.

Wan2.6 Image is a single-frame image generation model derived from the Wan2.6 multimodal video architecture. It focuses on strong prompt adherence, clean spatial structure, and visually coherent results, delivering video-grade image quality for creative, editorial, and product-oriented workflows.

Z-Image-Turbo is a distilled vision model for sub-second image generation. It produces sharp photorealistic results and supports accurate Chinese and English text inside images. It follows complex layout instructions with stable structure for UI, posters, and scenes.

Qwen-Image-Edit-2511

API Only

Qwen-Image-Edit-2511 is an image editing model that applies text instructions to modify existing images with precise semantic and appearance control. It preserves visual consistency during edits, supports multi-person and character consistency, and integrates selected features and extensions that enhance object manipulation, geometric reasoning, and layout coherence.

Wan2.5-Preview is Alibaba's multimodal video model in research preview. It supports text-to-video and image-to-video with native audio generation for clips around 10 seconds. It offers strong prompt adherence, smooth motion, and multilingual audio for narrative scenes.

Wan2.5-Preview Image is a single-frame generator built from the Wan2.5 video stack. It focuses on detailed depth structure, strong prompt following, multilingual text rendering, and video-grade visual quality for production-ready stills in creative or product workflows.

Qwen-Image-Edit-Plus is a 20B image editing model that supports multi-image workflows and strong identity preservation. It improves consistency on single-image edits and adds native ControlNet-style conditioning for precise structure control, layout edits, and bilingual text manipulation.

Wan2.2 Animate

API Only

Wan2.2 Animate is a unified video model that produces character-focused animations from static images and reference videos or replaces characters in existing footage while preserving motion, expressions, and scene consistency. It uses the Wan2.2 mixture-of-experts architecture to generate coherent character movement and seamless integration with background video.

Wan2.2 Animate Turbo is an accelerated variant of Wan2.2 Animate designed for faster character animation and replacement in video. It generates coherent character motion and expression from images or existing footage while prioritizing reduced inference time for rapid iteration workflows.

Qwen-Image-Edit is an instruction-based image editing model built on the 20B Qwen-Image foundation. It performs semantic edits and local appearance changes while preserving layout and text fidelity. Ideal for programmatic asset cleanup, style tweaks, and precise bilingual text updates.

Qwen-Image is a 20B-parameter vision-language model from Alibaba Cloud. It focuses on precise text-conditioned image generation and supports complex Chinese or English typography. It also enables accurate image editing workflows that need layout control and strong prompt following.

Wan2.2 A14B is a mixture-of-experts video model with two 14B experts for layout and detail. It supports text prompts or reference images to generate cinematic 480p or 720p clips with stable inference cost and consistent motion. Ideal for pipelines on high-end GPUs.

Qwen2.5-VL-7B-Instruct is a multimodal model that processes images and text together to perform visual reasoning, captioning, question answering, and structured output generation. It integrates a vision encoder with a 7B instruction-tuned language backbone to support rich interactive multimodal understanding.

Qwen2.5-VL-3B-Instruct is a multimodal model that processes images and text together to perform visual reasoning, captioning, question answering, and structured output tasks. It integrates a vision encoder with an instruction-tuned language backbone to support complex visual understanding and interactive multimodal responses.

Wan2.2 A14B Turbo accelerates Wan2.2 with a fused Lightning LoRA for ultra-fast diffusion. It cuts inference to 8 steps while preserving cinematic structure and detail. Ideal for rapid 480p-to-720p video prototyping and iteration in production workflows.

Qwen2.5-VL-7B Age Detector is a multimodal model that analyzes a facial image to estimate age. It uses the vision encoder from Qwen2.5-VL-7B and its instruction-tuned language backbone to interpret visual features and output age predictions or age categories.