Vision-language models (VLMs) are fundamentally changing how humans and robots work together in manufacturing environments, creating opportunities for more intelligent, flexible, and safer industrial operations. These AI systems, which jointly process images and language, allow robots to interpret complex scenes, follow spoken or written instructions, and generate multi-step plans: capabilities that traditional rule-based systems could not achieve. A new survey published in Frontiers of Engineering Management provides the first comprehensive mapping of how VLMs are reshaping human-robot collaboration in smart manufacturing.
The research, conducted by a team from The Hong Kong Polytechnic University and KTH Royal Institute of Technology, examines 109 studies published between 2020 and 2024 to demonstrate how VLMs add a powerful cognitive layer to industrial robots. According to the survey available at https://doi.org/10.1007/s42524-025-4136-9, these models enable robots to plan tasks, navigate complex environments, perform manipulation, and learn new skills directly from multimodal demonstrations. The authors emphasize that VLMs mark a turning point for industrial robotics because they enable a shift from scripted automation to contextual understanding.
In task planning applications, VLMs help robots interpret human commands, analyze real-time scenes, break down multi-step instructions, and generate executable action sequences. Systems built on CLIP, GPT-4V, BERT, and ResNet architectures report success rates above 90% in collaborative assembly and tabletop manipulation tasks. For navigation, VLMs allow robots to translate natural-language goals into movement, mapping visual cues to spatial decisions. These models can follow detailed step-by-step instructions or reason from higher-level intent, enabling robust autonomy across domestic, industrial, and other embodied settings.
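The grounding step at the heart of such planners can be illustrated with a short sketch. The snippet below, a minimal example assuming the Hugging Face transformers library and a public CLIP checkpoint, scores a set of candidate action descriptions against a camera frame and picks the best match as the robot's next step; the file name and candidate phrases are illustrative, not drawn from the surveyed systems.

```python
# Minimal sketch of CLIP-style grounding for task planning: score candidate
# action descriptions against the current camera frame and select the best
# match. Model checkpoint, image path, and phrases are illustrative only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.open("workcell_frame.jpg")  # hypothetical camera capture
candidate_steps = [
    "pick up the hex screwdriver",
    "insert the bolt into the left bracket",
    "hand the assembled part to the operator",
]

inputs = processor(text=candidate_steps, images=frame,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the frame's similarity to each candidate step
probs = outputs.logits_per_image.softmax(dim=-1)
best = candidate_steps[probs.argmax().item()]
print(f"next step: {best} (confidence {probs.max().item():.2f})")
```

In a full planner, the selected step would be handed to a motion controller and the loop repeated with a fresh frame, which is roughly the perception-to-action cycle the surveyed systems describe.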
In manipulation tasks critical for factory safety, VLMs help robots recognize objects, evaluate affordances, and adjust to human motion. The review also highlights emerging work in multimodal skill transfer, where robots learn directly from visual-language demonstrations rather than labor-intensive manual programming. This capability could significantly reduce the time and expertise required to reprogram industrial robots for new tasks, potentially lowering barriers to automation adoption across manufacturing sectors.
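As a hedged illustration of the affordance-evaluation idea, the sketch below sends a workcell image to a hosted VLM through the OpenAI Python SDK and asks which parts a gripper could safely grasp near a human operator. The prompt wording, model choice, and file name are assumptions for illustration, not the survey's own pipeline.

```python
# Illustrative affordance query to a hosted VLM. Prompt, model choice, and
# image file are hypothetical; assumes the OpenAI Python SDK (openai>=1.0)
# and an OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

with open("workcell_frame.jpg", "rb") as f:  # hypothetical camera capture
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "List the objects a two-finger gripper could safely "
                     "grasp in this scene, and flag any that are within "
                     "reach of the human operator's hands. Answer as a "
                     "JSON list."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Requesting structured output, as the prompt does here, makes it easier for a downstream grasp planner to consume the model's answer and to apply safety checks before acting.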
The authors envision VLM-enabled robots becoming central to future smart factories, capable of adjusting to changing tasks, assisting workers in assembly, retrieving tools, managing logistics, conducting equipment inspections, and coordinating multi-robot systems. As VLMs mature, robots could learn new procedures from video-and-language demonstrations, reason through long-horizon plans, and collaborate fluidly with humans without extensive reprogramming. This represents a profound shift from robots as scripted tools to robots as flexible collaborators.
However, the survey cautions that achieving large-scale deployment will require addressing challenges in model efficiency, robustness, and data collection, as well as developing industrial-grade multimodal benchmarks for reliable evaluation. The authors conclude that breakthroughs in efficient VLM architectures, high-quality multimodal datasets, and dependable real-time processing will be key to unlocking their full industrial impact. These developments could usher in a new era of safe, adaptive, and human-centric manufacturing where robots comprehend both what they see and what they are told, making human-robot interaction more intuitive and productive.