Get trending papers in your email inbox once a day!
Get trending papers in your email inbox!
SubscribeHigh-Speed Motion Planning for Aerial Swarms in Unknown and Cluttered Environments
Coordinated flight of multiple drones allows to achieve tasks faster such as search and rescue and infrastructure inspection. Thus, pushing the state-of-the-art of aerial swarms in navigation speed and robustness is of tremendous benefit. In particular, being able to account for unexplored/unknown environments when planning trajectories allows for safer flight. In this work, we propose the first high-speed, decentralized, and synchronous motion planning framework (HDSM) for an aerial swarm that explicitly takes into account the unknown/undiscovered parts of the environment. The proposed approach generates an optimized trajectory for each planning agent that avoids obstacles and other planning agents while moving and exploring the environment. The only global information that each agent has is the target location. The generated trajectory is high-speed, safe from unexplored spaces, and brings the agent closer to its goal. The proposed method outperforms four recent state-of-the-art methods in success rate (100% success in reaching the target location), flight speed (67% faster), and flight time (42% lower). Finally, the method is validated on a set of Crazyflie nano-drones as a proof of concept.
UAVs Meet Agentic AI: A Multidomain Survey of Autonomous Aerial Intelligence and Agentic UAVs
Agentic UAVs represent a new frontier in autonomous aerial intelligence, integrating perception, decision-making, memory, and collaborative planning to operate adaptively in complex, real-world environments. Driven by recent advances in Agentic AI, these systems surpass traditional UAVs by exhibiting goal-driven behavior, contextual reasoning, and interactive autonomy. We provide a comprehensive foundation for understanding the architectural components and enabling technologies that distinguish Agentic UAVs from traditional autonomous UAVs. Furthermore, a detailed comparative analysis highlights advancements in autonomy with AI agents, learning, and mission flexibility. This study explores seven high-impact application domains precision agriculture, construction & mining, disaster response, environmental monitoring, infrastructure inspection, logistics, security, and wildlife conservation, illustrating the broad societal value of agentic aerial intelligence. Furthermore, we identify key challenges in technical constraints, regulatory limitations, and data-model reliability, and we present emerging solutions across hardware innovation, learning architectures, and human-AI interaction. Finally, a future roadmap is proposed, outlining pathways toward self-evolving aerial ecosystems, system-level collaboration, and sustainable, equitable deployments. This survey establishes a foundational framework for the future development, deployment, and governance of agentic aerial systems (Agentic UAVs) across diverse societal and industrial domains.
Behavior Trees Enable Structured Programming of Language Model Agents
Language models trained on internet-scale data sets have shown an impressive ability to solve problems in Natural Language Processing and Computer Vision. However, experience is showing that these models are frequently brittle in unexpected ways, and require significant scaffolding to ensure that they operate correctly in the larger systems that comprise "language-model agents." In this paper, we argue that behavior trees provide a unifying framework for combining language models with classical AI and traditional programming. We introduce Dendron, a Python library for programming language model agents using behavior trees. We demonstrate the approach embodied by Dendron in three case studies: building a chat agent, a camera-based infrastructure inspection agent for use on a mobile robot or vehicle, and an agent that has been built to satisfy safety constraints that it did not receive through instruction tuning or RLHF.
FlyPose: Towards Robust Human Pose Estimation From Aerial Views
Unmanned Aerial Vehicles (UAVs) are increasingly deployed in close proximity to humans for applications such as parcel delivery, traffic monitoring, disaster response and infrastructure inspections. Ensuring safe and reliable operation in these human-populated environments demands accurate perception of human poses and actions from an aerial viewpoint. This perspective challenges existing methods with low resolution, steep viewing angles and (self-)occlusion, especially if the application demands realtime feasibile models. We train and deploy FlyPose, a lightweight top-down human pose estimation pipeline for aerial imagery. Through multi-dataset training, we achieve an average improvement of 6.8 mAP in person detection across the test-sets of Manipal-UAV, VisDrone, HIT-UAV as well as our custom dataset. For 2D human pose estimation we report an improvement of 16.3 mAP on the challenging UAV-Human dataset. FlyPose runs with an inference latency of ~20 milliseconds including preprocessing on a Jetson Orin AGX Developer Kit and is deployed onboard a quadrotor UAV during flight experiments. We also publish FlyPose-104, a small but challenging aerial human pose estimation dataset, that includes manual annotations from difficult aerial perspectives: https://github.com/farooqhassaan/FlyPose.
A 3D Multimodal Feature for Infrastructure Anomaly Detection
Ageing structures require periodic inspections to identify structural defects. Previous work has used geometric distortions to locate cracks in synthetic masonry bridge point clouds but has struggled to detect small cracks. To address this limitation, this study proposes a novel 3D multimodal feature, 3DMulti-FPFHI, that combines a customized Fast Point Feature Histogram (FPFH) with an intensity feature. This feature is integrated into the PatchCore anomaly detection algorithm and evaluated through statistical and parametric analyses. The method is further evaluated using point clouds of a real masonry arch bridge and a full-scale experimental model of a concrete tunnel. Results show that the 3D intensity feature enhances inspection quality by improving crack detection; it also enables the identification of water ingress which introduces intensity anomalies. The 3DMulti-FPFHI outperforms FPFH and a state-of-the-art multimodal anomaly detection method. The potential of the method to address diverse infrastructure anomaly detection scenarios is highlighted by the minimal requirements for data compared to learning-based methods. The code and related point cloud dataset are available at https://github.com/Jingyixiong/3D-Multi-FPFHI.
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work. We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service with a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world.
Beyond Coverage Path Planning: Can UAV Swarms Perfect Scattered Regions Inspections?
Unmanned Aerial Vehicles (UAVs) have revolutionized inspection tasks by offering a safer, more efficient, and flexible alternative to traditional methods. However, battery limitations often constrain their effectiveness, necessitating the development of optimized flight paths and data collection techniques. While existing approaches like coverage path planning (CPP) ensure comprehensive data collection, they can be inefficient, especially when inspecting multiple non connected Regions of Interest (ROIs). This paper introduces the Fast Inspection of Scattered Regions (FISR) problem and proposes a novel solution, the multi UAV Disjoint Areas Inspection (mUDAI) method. The introduced approach implements a two fold optimization procedure, for calculating the best image capturing positions and the most efficient UAV trajectories, balancing data resolution and operational time, minimizing redundant data collection and resource consumption. The mUDAI method is designed to enable rapid, efficient inspections of scattered ROIs, making it ideal for applications such as security infrastructure assessments, agricultural inspections, and emergency site evaluations. A combination of simulated evaluations and real world deployments is used to validate and quantify the method's ability to improve operational efficiency while preserving high quality data capture, demonstrating its effectiveness in real world operations. An open source Python implementation of the mUDAI method can be found on GitHub (https://github.com/soc12/mUDAI) and the collected and processed data from the real world experiments are all hosted on Zenodo (https://zenodo.org/records/13866483). Finally, this online platform (https://sites.google.com/view/mudai-platform/) allows interested readers to interact with the mUDAI method and generate their own multi UAV FISR missions.
InfraDiffusion: zero-shot depth map restoration with diffusion models and prompted segmentation from sparse infrastructure point clouds
Point clouds are widely used for infrastructure monitoring by providing geometric information, where segmentation is required for downstream tasks such as defect detection. Existing research has automated semantic segmentation of structural components, while brick-level segmentation (identifying defects such as spalling and mortar loss) has been primarily conducted from RGB images. However, acquiring high-resolution images is impractical in low-light environments like masonry tunnels. Point clouds, though robust to dim lighting, are typically unstructured, sparse, and noisy, limiting fine-grained segmentation. We present InfraDiffusion, a zero-shot framework that projects masonry point clouds into depth maps using virtual cameras and restores them by adapting the Denoising Diffusion Null-space Model (DDNM). Without task-specific training, InfraDiffusion enhances visual clarity and geometric consistency of depth maps. Experiments on masonry bridge and tunnel point cloud datasets show significant improvements in brick-level segmentation using the Segment Anything Model (SAM), underscoring its potential for automated inspection of masonry assets. Our code and data is available at https://github.com/Jingyixiong/InfraDiffusion-official-implement.
TACK Tunnel Data (TTD): A Benchmark Dataset for Deep Learning-Based Defect Detection in Tunnels
Tunnels are essential elements of transportation infrastructure, but are increasingly affected by ageing and deterioration mechanisms such as cracking. Regular inspections are required to ensure their safety, yet traditional manual procedures are time-consuming, subjective, and costly. Recent advances in mobile mapping systems and Deep Learning (DL) enable automated visual inspections. However, their effectiveness is limited by the scarcity of tunnel datasets. This paper introduces a new publicly available dataset containing annotated images of three different tunnel linings, capturing typical defects: cracks, leaching, and water infiltration. The dataset is designed to support supervised, semi-supervised, and unsupervised DL methods for defect detection and segmentation. Its diversity in texture and construction techniques also enables investigation of model generalization and transferability across tunnel types. By addressing the critical lack of domain-specific data, this dataset contributes to advancing automated tunnel inspection and promoting safer, more efficient infrastructure maintenance strategies.
AutoRepo: A general framework for multi-modal LLM-based automated construction reporting
Ensuring the safety, quality, and timely completion of construction projects is paramount, with construction inspections serving as a vital instrument towards these goals. Nevertheless, the predominantly manual approach of present-day inspections frequently results in inefficiencies and inadequate information management. Such methods often fall short of providing holistic, exhaustive assessments, consequently engendering regulatory oversights and potential safety hazards. To address this issue, this paper presents a novel framework named AutoRepo for automated generation of construction inspection reports. The unmanned vehicles efficiently perform construction inspections and collect scene information, while the multimodal large language models (LLMs) are leveraged to automatically generate the inspection reports. The framework was applied and tested on a real-world construction site, demonstrating its potential to expedite the inspection process, significantly reduce resource allocation, and produce high-quality, regulatory standard-compliant inspection reports. This research thus underscores the immense potential of multimodal large language models in revolutionizing construction inspection practices, signaling a significant leap forward towards a more efficient and safer construction management paradigm.
CrackNex: a Few-shot Low-light Crack Segmentation Model Based on Retinex Theory for UAV Inspections
Routine visual inspections of concrete structures are imperative for upholding the safety and integrity of critical infrastructure. Such visual inspections sometimes happen under low-light conditions, e.g., checking for bridge health. Crack segmentation under such conditions is challenging due to the poor contrast between cracks and their surroundings. However, most deep learning methods are designed for well-illuminated crack images and hence their performance drops dramatically in low-light scenes. In addition, conventional approaches require many annotated low-light crack images which is time-consuming. In this paper, we address these challenges by proposing CrackNex, a framework that utilizes reflectance information based on Retinex Theory to help the model learn a unified illumination-invariant representation. Furthermore, we utilize few-shot segmentation to solve the inefficient training data problem. In CrackNex, both a support prototype and a reflectance prototype are extracted from the support set. Then, a prototype fusion module is designed to integrate the features from both prototypes. CrackNex outperforms the SOTA methods on multiple datasets. Additionally, we present the first benchmark dataset, LCSD, for low-light crack segmentation. LCSD consists of 102 well-illuminated crack images and 41 low-light crack images. The dataset and code are available at https://github.com/zy1296/CrackNex.
Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors?
Construction safety inspections typically involve a human inspector identifying safety concerns on-site. With the rise of powerful Vision Language Models (VLMs), researchers are exploring their use for tasks such as detecting safety rule violations from on-site images. However, there is a lack of open datasets to comprehensively evaluate and further fine-tune VLMs in construction safety inspection. Current applications of VLMs use small, supervised datasets, limiting their applicability in tasks they are not directly trained for. In this paper, we propose the ConstructionSite 10k, featuring 10,000 construction site images with annotations for three inter-connected tasks, including image captioning, safety rule violation visual question answering (VQA), and construction element visual grounding. Our subsequent evaluation of current state-of-the-art large pre-trained VLMs shows notable generalization abilities in zero-shot and few-shot settings, while additional training is needed to make them applicable to actual construction sites. This dataset allows researchers to train and evaluate their own VLMs with new architectures and techniques, providing a valuable benchmark for construction safety inspection.
dacl10k: Benchmark for Semantic Bridge Damage Segmentation
Reliably identifying reinforced concrete defects (RCDs)plays a crucial role in assessing the structural integrity, traffic safety, and long-term durability of concrete bridges, which represent the most common bridge type worldwide. Nevertheless, available datasets for the recognition of RCDs are small in terms of size and class variety, which questions their usability in real-world scenarios and their role as a benchmark. Our contribution to this problem is "dacl10k", an exceptionally diverse RCD dataset for multi-label semantic segmentation comprising 9,920 images deriving from real-world bridge inspections. dacl10k distinguishes 12 damage classes as well as 6 bridge components that play a key role in the building assessment and recommending actions, such as restoration works, traffic load limitations or bridge closures. In addition, we examine baseline models for dacl10k which are subsequently evaluated. The best model achieves a mean intersection-over-union of 0.42 on the test set. dacl10k, along with our baselines, will be openly accessible to researchers and practitioners, representing the currently biggest dataset regarding number of images and class diversity for semantic segmentation in the bridge inspection domain.
YOLO9tr: A Lightweight Model for Pavement Damage Detection Utilizing a Generalized Efficient Layer Aggregation Network and Attention Mechanism
Maintaining road pavement integrity is crucial for ensuring safe and efficient transportation. Conventional methods for assessing pavement condition are often laborious and susceptible to human error. This paper proposes YOLO9tr, a novel lightweight object detection model for pavement damage detection, leveraging the advancements of deep learning. YOLO9tr is based on the YOLOv9 architecture, incorporating a partial attention block that enhances feature extraction and attention mechanisms, leading to improved detection performance in complex scenarios. The model is trained on a comprehensive dataset comprising road damage images from multiple countries, including an expanded set of damage categories beyond the standard four. This broadened classification range allows for a more accurate and realistic assessment of pavement conditions. Comparative analysis demonstrates YOLO9tr's superior precision and inference speed compared to state-of-the-art models like YOLO8, YOLO9 and YOLO10, achieving a balance between computational efficiency and detection accuracy. The model achieves a high frame rate of up to 136 FPS, making it suitable for real-time applications such as video surveillance and automated inspection systems. The research presents an ablation study to analyze the impact of architectural modifications and hyperparameter variations on model performance, further validating the effectiveness of the partial attention block. The results highlight YOLO9tr's potential for practical deployment in real-time pavement condition monitoring, contributing to the development of robust and efficient solutions for maintaining safe and functional road infrastructure.
A Study on Unsupervised Anomaly Detection and Defect Localization using Generative Model in Ultrasonic Non-Destructive Testing
In recent years, the deterioration of artificial materials used in structures has become a serious social issue, increasing the importance of inspections. Non-destructive testing is gaining increased demand due to its capability to inspect for defects and deterioration in structures while preserving their functionality. Among these, Laser Ultrasonic Visualization Testing (LUVT) stands out because it allows the visualization of ultrasonic propagation. This makes it visually straightforward to detect defects, thereby enhancing inspection efficiency. With the increasing number of the deterioration structures, challenges such as a shortage of inspectors and increased workload in non-destructive testing have become more apparent. Efforts to address these challenges include exploring automated inspection using machine learning. However, the lack of anomalous data with defects poses a barrier to improving the accuracy of automated inspection through machine learning. Therefore, in this study, we propose a method for automated LUVT inspection using an anomaly detection approach with a diffusion model that can be trained solely on negative examples (defect-free data). We experimentally confirmed that our proposed method improves defect detection and localization compared to general object detection algorithms used previously.
