
Holo1: New family of GUI automation VLMs powering GUI agent Surfer-H

Published June 3, 2025 by Mats L. Richter and Pierre-Louis Cedoz (H Company)

Today, at H Company, we are releasing Holo1, a family of Action Vision-Language Models (VLMs), and WebClick, a new multimodal localization benchmark, on the Hugging Face Hub. Surfer-H, a web-native agent that interacts with browsers like a human, relies on Holo1.

## Holo1

Holo1 is the first family of open-source Action VLMs designed specifically for deep web UI understanding and precise localization. The family includes the Holo1-3B and Holo1-7B models, with the latter achieving 76.2% average accuracy on common UI localization benchmarks, the highest among small-size models. H Company has released these models as open source on Hugging Face, along with the WebClick benchmark containing 1,639 human-like UI tasks. For full details, see the technical report cited at the end of this post.

## Use with Transformers

Holo1 models are based on the Qwen2.5-VL architecture and are fully compatible with transformers. Here we provide a simple usage example. You can load the model and the processor as follows:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

model = AutoModelForImageTextToText.from_pretrained(
    "Hcompany/Holo1-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Hcompany/Holo1-3B")
```

Next, point to the image and define the localization prompt:

```python
image_url = "https://huggingface.co/Hcompany/Holo1-3B/resolve/main/calendar_example.jpg"
guidelines = (
    "Localize an element on the GUI image according to my instructions and "
    "output a click position as Click(x, y) with x num pixels from the left "
    "edge and y num pixels from the top edge."
)
```
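The guidelines above ask the model to answer in a fixed `Click(x, y)` format. To act on that answer programmatically you need to recover the coordinates; as an illustration (the helper name and regex below are our own, not part of the Holo1 API), a minimal parser might look like:

```python
import re

def parse_click(text):
    """Extract (x, y) pixel coordinates from a 'Click(x, y)' style answer.

    Returns a tuple of ints, or None when the text does not match the
    expected format (e.g. the model produced free-form output instead).
    """
    match = re.search(r"Click\((\d+),\s*(\d+)\)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Example on a string shaped like the expected model answer:
print(parse_click("Click(352, 348)"))  # (352, 348)
```

Guarding against a `None` result is worthwhile in practice, since a generative model is not guaranteed to follow the output format on every call.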
Then build the chat messages and preprocess them:

```python
instruction = "Select July 14th as the check-out date"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {"type": "text", "text": f"{guidelines}\n{instruction}"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
```

We can now run inference:

```python
generated_ids = model.generate(**inputs, max_new_tokens=128)
decoded = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
# Click(352, 348)
```

## Surfer-H

Web automation represents one of AI's most practical applications for businesses, but until now, solutions have often sacrificed cost-efficiency for performance. By making our Holo1 Action Models available on Hugging Face, users can now implement web automation solutions that achieve 92.2% accuracy on real-world web tasks at only $0.13 per task.

Surfer-H relies on the Holo1 family of open-weights models. It is a flexible, modular architecture for complete web task automation that performs reading, thinking, clicking, scrolling, typing, and validating. It is composed of three independent components: a Policy model that plans and drives the agent's behavior, a Localizer model that understands visual UIs for precise interactions, and a Validator model that confirms whether tasks are completed successfully. Unlike other agents that rely on custom APIs or brittle wrappers, Surfer-H operates purely through the browser, just like a real user.

Together, these solutions represent a new frontier in web automation, achieving state-of-the-art localization performance and setting the Pareto frontier in cost-efficient web navigation on the WebVoyager benchmark.

We're looking forward to seeing what you'll build with Holo1! Let's meet in the discussion tab of this blog post and the model repository!
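To make the three-component design concrete, here is a minimal sketch of how a Policy, a Localizer, and a Validator could be wired together in an agent loop. Every class, method, and value below is an illustrative assumption for exposition, not the actual Surfer-H API; real implementations would wrap Holo1 model calls and a live browser.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "click", "type", "scroll", "done"
    target: str = ""   # natural-language description of the UI element

class Policy:
    """Plans the next action from the task (stub: acts twice, then stops)."""
    def next_action(self, task, step):
        return Action("done") if step >= 2 else Action("click", f"element for step {step}")

class Localizer:
    """Maps an element description on a screenshot to pixel coordinates."""
    def locate(self, description):
        return (100, 200)  # a real Localizer would query Holo1 here

class Validator:
    """Judges whether the task was completed successfully (stub: always yes)."""
    def is_complete(self, task):
        return True

def run_agent(task, max_steps=10):
    policy, localizer, validator = Policy(), Localizer(), Validator()
    for step in range(max_steps):
        action = policy.next_action(task, step)
        if action.kind == "done":
            break
        x, y = localizer.locate(action.target)  # a real agent would click/type at (x, y)
    return validator.is_complete(task)

print(run_agent("Select July 14th as the check-out date"))  # True
```

The point of the sketch is the separation of concerns: because the three components are independent, each model can be swapped or upgraded (for instance, a larger Localizer) without changing the loop itself.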
## Citation

```bibtex
@misc{andreux2025surferhmeetsholo1costefficient,
  title={Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights},
  author={Mathieu Andreux and Breno Baldas Skuk and Hamza Benchekroun and Emilien Biré and Antoine Bonnet and Riaz Bordie and Matthias Brunel and Pierre-Louis Cedoz and Antoine Chassang and Mickaël Chen and Alexandra D. Constantinou and Antoine d'Andigné and Hubert de La Jonquière and Aurélien Delfosse and Ludovic Denoyer and Alexis Deprez and Augustin Derupti and Michael Eickenberg and Mathïs Federico and Charles Kantor and Xavier Koegler and Yann Labbé and Matthew C. H. Lee and Erwan Le Jumeau de Kergaradec and Amir Mahla and Avshalom Manevich and Adrien Maret and Charles Masson and Rafaël Maurin and Arturo Mena and Philippe Modard and Axel Moyal and Axel Nguyen Kerbel and Julien Revelle and Mats L. Richter and María Santos and Laurent Sifre and Maxime Theillard and Marc Thibault and Louis Thiry and Léo Tronchon and Nicolas Usunier and Tony Wu},
  year={2025},
  eprint={2506.02865},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.02865},
}
```