Design and Validation of a Real-Time Depth Perception and Omnidirectional Obstacle Avoidance System for Autonomous Drones

Keywords: autonomous drones, obstacle avoidance, depth estimation, MAVLink, embedded artificial intelligence, 360-degree perception

Publication date: 2026/06/26

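Abstract (translated from the Chinese)

This article presents a real-time detection and obstacle avoidance system for autonomous drones developed in a laboratory at the Industrial Technology Research Institute (ITRI). ORBIT (Omnidirectional Reactive Behavior with Integrated Tracking) combines multi-camera visual input, deep-learning-based depth estimation, camera calibration, IMU-assisted temporal stabilization, and a MAVLink communication interface to generate obstacle-distance information for the flight controller in real time on a single embedded computer. Whereas GNSS provides only global positioning and cannot sense nearby obstacles, ORBIT builds continuous, physically consistent spatial awareness in complex, obstacle-dense environments. The system uses a depth estimation model based on DINOv2 ViT-Small and a Dense Prediction Transformer. Through an inference engine and efficient depth processing, it converts high-resolution depth maps into a multi-sector polar obstacle array, which is then sent to the flight control platform over MAVLink. The system scales from a single forward-facing camera to full 360-degree omnidirectional perception and is suited to real-world autonomous drone obstacle avoidance tasks.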

Abstract

This article introduces ORBIT (Omnidirectional Reactive Behavior with Integrated Tracking), a real-time detect-and-avoid system for autonomous unmanned aerial vehicles. ORBIT converts visual input from scalable multi-camera configurations into metric depth and obstacle-distance data that can be used by PX4 and ArduPilot through the MAVLink protocol. By integrating low-latency image capture, preprocessing, transformer-based depth estimation, calibration, Gaussian-weighted strip processing, IMU-assisted stabilization, and onboard MAVLink communication, ORBIT enables drones to perceive nearby hazards that GNSS-based navigation cannot detect. Built on a DINOv2 ViT-Small backbone with a Dense Prediction Transformer decoder, ORBIT provides a modular path from single-camera sensing to full 360-degree awareness for safer autonomous flight in cluttered environments.

Introduction

Today's drones excel at navigating open airspace, but their capabilities falter when confronted with complex, obstacle-filled environments. Omnidirectional Reactive Behavior with Integrated Tracking (ORBIT) is a Detect and Avoid (DAA) system that converts visual input from a multi-camera array into accurate, real-time spatial awareness for autonomous aircraft. Everything (image capture, neural-network inference, calibration, and communication with the autopilot) happens onboard, on a single embedded computer, in real time. While GNSS provides global positioning, it offers no information about immediate hazards such as objects just meters away. ORBIT addresses this gap by delivering real-time depth sensing through deep learning, enabling safe autonomous flight through cluttered environments. The system integrates directly with the standard autopilot platforms PX4 and ArduPilot via the MAVLink protocol, achieving obstacle detection with sub-100 ms latency. Its flexible camera architecture supports scalable configurations, from single-camera setups to full omnidirectional 360-degree coverage, so it can adapt to different aircraft designs. The resulting perception output is illustrated in Figure 1, where raw visual input is transformed into a depth-aware representation of the surrounding environment. This visualization highlights how ORBIT converts an ordinary camera feed into measurable spatial information that can support real-time obstacle detection.

Vision-based depth perception is not new. Researchers have been working on depth estimation for more than a decade [1] [2] [3] [4]. What distinguishes ORBIT is the integration work: the chain of decisions and optimizations that takes a research-grade neural network and turns it into a reliable, deployable, real-time system on affordable embedded hardware. The hardware-specific code compilation pipeline reduces a network that would take seconds per frame on general-purpose processors to tens of milliseconds on the Graphics Processing Unit (GPU). The GStreamer camera stack eliminates capture latency that would carry critical consequences in flight. The Gaussian-weighted strip processor collapses high-resolution depth maps into a multi-sector polar obstacle array with minimal computational overhead. The MAVLink interface communicates with the autopilots and requires no firmware modifications on the flight controller side. Each of these components exists because of a real constraint (weight, power, latency, or cost) encountered during development and testing. The result is a system that is not just academically interesting but practically usable on a real aircraft today.

ORBIT couples its depth pipeline to IMU data from the autopilot over MAVLink at 400 Hz. During aggressive maneuvers (sharp banking turns, rapid climbs, wind gust responses) the camera's field of view sweeps through the scene fast enough that a depth map computed during the sweep becomes geometrically stale by the time it reaches the autopilot. An IMU-aided temporal stabilizer uses inertial data to predict how the obstacle map shifts between frames in real time, applying motion-compensated corrections to each sector distance before transmission. This maintains metric accuracy under the demanding maneuver conditions where the baseline system operates most conservatively.
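The idea of motion compensation can be illustrated with a deliberately reduced sketch. It is not the stabilizer described above: it handles yaw only, treats the yaw rate as constant over the latency interval instead of integrating 400 Hz IMU samples, and ignores roll, pitch, and translation. The 72-sector layout, the clockwise sector ordering, and the function name `compensate_yaw` are assumptions made for illustration.

```python
import math

def compensate_yaw(distances_m, yaw_rate_rad_s, dt_s):
    """Rotate a body-frame sector array to account for yaw during dt_s.

    distances_m[i] is the obstacle distance for the sector centred at
    i * (360 / n) degrees, measured clockwise from the nose (assumed layout).
    A positive yaw rate means the nose turns clockwise (to the right).
    """
    n = len(distances_m)
    sector_width_rad = 2.0 * math.pi / n
    # Yaw accumulated since the frame was captured, in whole sectors.
    shift = round(yaw_rate_rad_s * dt_s / sector_width_rad)
    # After turning right by `shift` sectors, whatever was seen in
    # sector i + shift now lies in sector i.
    return [distances_m[(i + shift) % n] for i in range(n)]

# Hypothetical example: 72 sectors of 5 degrees, one obstacle 4 m away,
# 10 degrees to the right of the nose (sector 2).
sectors = [30.0] * 72
sectors[2] = 4.0
# 50 ms of pipeline latency while yawing right at 200 deg/s = 10 degrees.
corrected = compensate_yaw(sectors, math.radians(200.0), 0.050)
```

In this example the obstacle moves from sector 2 to sector 0, directly ahead, which is where it would be once the aircraft has finished turning toward it.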

We built ORBIT around three principles that we held without compromise throughout development. Speed, because an obstacle avoidance system that takes 500 milliseconds to respond is not an obstacle avoidance system; it is a logging system for crashes. Modularity, because the same software stack that serves a single forward-facing camera must scale to four or even eight cameras covering 360 degrees by changing configuration files, not rewriting code. And reliability, because a system trusted with autonomous flight must perform predictably under real-world conditions, not just in controlled tests. Figure 2 demonstrates ORBIT's ability to combine visual perception with depth-aware target localization. By comparing the original camera frame with the corresponding depth map, the figure shows how nearby objects can be identified, separated from the background, and interpreted as actionable spatial information.

Methodology

The system operates as a sequential pipeline of five specialized processing stages, each optimized for a single task and passing its output to its successor. This chain extends from initial frame capture through to the autopilot's reception of distance information. The overall architecture of this processing chain is shown in Figure 3. The diagram summarizes how camera inputs are captured, synchronized, processed by AI inference modules, fused into a 360-degree depth representation, post-processed into obstacle data, and finally transmitted to the autopilot through MAVLink.
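The final hand-off to the autopilot can be sketched as follows. The text does not say which MAVLink message ORBIT uses, so this sketch assumes the standard `OBSTACLE_DISTANCE` message, which carries up to 72 sector distances in centimetres. The function name, the range limits, and the 5-degree increment are illustrative assumptions. The enum values follow the MAVLink common message set.

```python
UINT16_MAX = 65535  # MAVLink convention: distance unknown / sector not used

def build_obstacle_distance_fields(distances_m, time_usec,
                                   min_m=0.2, max_m=20.0, increment_deg=5):
    """Map per-sector distances in metres to OBSTACLE_DISTANCE-style fields.

    distances_m: up to 72 values; None marks a sector with no measurement.
    """
    if len(distances_m) > 72:
        raise ValueError("OBSTACLE_DISTANCE carries at most 72 sectors")
    max_cm = int(round(max_m * 100))
    distances_cm = []
    for d in distances_m:
        if d is None:
            distances_cm.append(UINT16_MAX)
        elif d > max_m:
            distances_cm.append(max_cm + 1)  # "no obstacle" in this sector
        else:
            distances_cm.append(max(0, int(round(d * 100))))
    # Pad unused sectors so the array always has 72 entries.
    distances_cm += [UINT16_MAX] * (72 - len(distances_cm))
    return {
        "time_usec": time_usec,
        "sensor_type": 4,            # MAV_DISTANCE_SENSOR_UNKNOWN
        "distances": distances_cm,
        "increment": increment_deg,  # angular width of one sector, degrees
        "min_distance": int(round(min_m * 100)),
        "max_distance": max_cm,
        "increment_f": 0.0,          # 0 means "use the integer increment"
        "angle_offset": 0.0,         # sector 0 points straight ahead
        "frame": 12,                 # MAV_FRAME_BODY_FRD
    }

# Hypothetical sectors: 4 m obstacle, no measurement, nothing within range.
fields = build_obstacle_distance_fields([4.0, None, 35.0], time_usec=0)
```

A MAVLink library would then serialize these fields and send them over the link to the flight controller. Both PX4 and ArduPilot can consume this message without firmware changes, which matches the integration goal stated above.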

Stage 1 — Capture

We access the Camera Serial Interface (CSI) camera through a pipeline that reads directly from the Direct Memory Access (DMA) buffer, the hardware buffer where the sensor deposits raw pixel data before the operating system ever touches it. This bypasses the generic V4L2 video stack entirely. The latency saving versus a standard OpenCV VideoCapture call on the same hardware is approximately ten milliseconds per frame: small in isolation, significant when trying to meet a 100-millisecond end-to-end budget.
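As an illustration only, a capture pipeline of this kind is often written as a GStreamer launch string. The text does not name the embedded board or the GStreamer elements used, so the sketch below assumes an NVIDIA Jetson-class device, where `nvarguscamerasrc` and `nvvidconv` are the usual CSI capture and conversion elements. The function name `csi_pipeline` and the 30 fps rate are also assumptions.

```python
def csi_pipeline(sensor_id=0, width=1920, height=1080, fps=30):
    """Build a GStreamer launch string for a CSI camera (Jetson-style elements)."""
    return (
        f"nvarguscamerasrc sensor-id={sensor_id} ! "
        # Frames stay in DMA-capable NVMM memory until they are converted.
        f"video/x-raw(memory:NVMM), width={width}, height={height}, "
        f"framerate={fps}/1 ! "
        "nvvidconv ! video/x-raw, format=BGRx ! "
        "videoconvert ! video/x-raw, format=BGR ! "
        # Keep only the newest frame so stale frames never queue up.
        "appsink drop=true max-buffers=1"
    )

pipeline = csi_pipeline()
```

With an OpenCV build that includes GStreamer support, a string like this can be opened with `cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)`. The `drop=true max-buffers=1` setting on the sink is the part that matters most for latency, since it prevents the consumer from ever reading an old frame.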

Stage 2 — Preprocessing

We convert the BGR frame to RGB, resize it from 1920×1080 to 308×308 using bicubic interpolation, normalize each channel using the mean and standard deviation values that the model expects, and transpose the result to the tensor layout that the AI framework requires. The resize is the expensive operation; bicubic interpolation at this scale costs about eight milliseconds on the CPU. We preserve the aspect ratio during resizing and pad with reflected border pixels rather than zeros, because zero-padding creates artificial depth boundaries at the image edges that the model interprets as nearby surfaces.
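The preprocessing steps can be sketched with NumPy alone. Several details here are assumptions rather than facts from the text: the ImageNet mean and standard deviation, the channel-first layout with a leading batch dimension, and centred padding. To keep the sketch dependency-free, nearest-neighbour sampling stands in for the bicubic interpolation the system actually uses.

```python
import numpy as np

# Assumption: ImageNet statistics, commonly used with DINOv2-based models.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(bgr, size=308):
    """BGR uint8 (H, W, 3) -> float32 (1, 3, size, size), RGB, normalized."""
    rgb = bgr[:, :, ::-1]                      # BGR -> RGB
    h, w = rgb.shape[:2]
    scale = size / max(h, w)                   # preserve the aspect ratio
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    # Nearest-neighbour sampling as a stand-in for bicubic interpolation.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = rgb[rows][:, cols]
    # Reflect-pad to a square instead of zero-padding.
    pad_h, pad_w = size - new_h, size - new_w
    top, left = pad_h // 2, pad_w // 2
    padded = np.pad(
        resized,
        ((top, pad_h - top), (left, pad_w - left), (0, 0)),
        mode="reflect",
    )
    x = (padded.astype(np.float32) / 255.0 - MEAN) / STD
    # HWC -> CHW, then add a batch dimension.
    return np.ascontiguousarray(x.transpose(2, 0, 1))[None]

# Hypothetical input: a pure-blue 1920x1080 frame in BGR order.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
frame[:, :, 0] = 255
tensor = preprocess(frame)
```

For a 1920×1080 frame the resized image is 308×173, so 135 rows of reflected pixels are added, split between top and bottom. Because the border is mirrored image content rather than black, the padded region contains no sharp artificial edge.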

Stage 3 — Neural-Network (AI) Inference

This is the core of the system, and it deserves a careful description. The model is based on a transformer architecture. Its backbone is DINOv2 [5] ViT-Small [6], a self-supervised vision transformer with more than 20 million parameters. The depth head is a Dense Prediction Transformer decoder that upsamples the patch embeddings through four stages back to the input resolution. We convert the PyTorch [7] model to ONNX and then compile it to an engine with FP16 precision [8]. The resulting engine runs the full 24.8-million-parameter model in 40.11 milliseconds with a standard deviation of 0.16 milliseconds, a consistency that indicates the GPU execution path is fully deterministic with no scheduling jitter.
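A quick budget check puts these figures in context. The short script below uses only numbers quoted in the text: the 100 ms end-to-end target, the roughly 8 ms resize, and the 40.11 ms inference time with its 0.16 ms standard deviation. The helper name `remaining_budget` is illustrative, and costs for capture, strip processing, and transmission are left out because the text does not give them.

```python
# Figures quoted in the text, in milliseconds.
BUDGET_MS = 100.0
RESIZE_MS = 8.0           # bicubic resize on the CPU (approximate)
INFERENCE_MS = 40.11      # FP16 engine, mean latency
INFERENCE_STD_MS = 0.16   # standard deviation of inference latency

def remaining_budget(stage_costs_ms, budget_ms=BUDGET_MS):
    """Return the time left in the budget after the listed stages."""
    return budget_ms - sum(stage_costs_ms)

accounted = RESIZE_MS + INFERENCE_MS
headroom = remaining_budget([RESIZE_MS, INFERENCE_MS])
# Upper bound on frame rate if inference is the only bottleneck.
max_fps = 1000.0 / INFERENCE_MS
# Relative jitter of the inference stage, in percent.
jitter_pct = 100.0 * INFERENCE_STD_MS / INFERENCE_MS

print(f"accounted: {accounted:.2f} ms, headroom: {headroom:.2f} ms")
print(f"inference-bound rate: {max_fps:.1f} fps, jitter: {jitter_pct:.2f} %")
```

Resize and inference together account for about 48 ms, leaving roughly 52 ms of the 100 ms target for capture, calibration, strip processing, and MAVLink transmission.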

Stage 4 — Calibration & Strip Processing

The model produces relative depth values: it understands scene geometry, but its absolute scale depends on camera calibration parameters that differ from the training distribution. We apply a learned linear correction with a scale factor of 0.347 and an offset of −0.533 meters, derived by comparing model output against known-distance reference targets.
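The correction and one plausible way to derive it can be sketched as follows. The scale and offset are the values quoted above. Two things are assumptions: that the correction is applied directly to the model's raw output value, and that the fit is an ordinary least-squares regression, since the text says only that the parameters come from comparison against reference targets. The sample data is synthetic, generated from the quoted parameters, and is not a measurement.

```python
SCALE = 0.347       # scale factor quoted in the text
OFFSET_M = -0.533   # offset quoted in the text, in metres

def to_metric(raw_depth, scale=SCALE, offset_m=OFFSET_M):
    """Apply the linear correction d_metric = scale * raw + offset."""
    return scale * raw_depth + offset_m

def fit_linear(raw_values, true_distances_m):
    """Ordinary least-squares fit of true = scale * raw + offset."""
    n = len(raw_values)
    mean_x = sum(raw_values) / n
    mean_y = sum(true_distances_m) / n
    sxx = sum((x - mean_x) ** 2 for x in raw_values)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(raw_values, true_distances_m))
    scale = sxy / sxx
    offset = mean_y - scale * mean_x
    return scale, offset

# Synthetic reference targets generated from the quoted parameters,
# used here only to show that the fit recovers them.
raw = [5.0, 10.0, 20.0, 40.0]
truth = [to_metric(r) for r in raw]
fitted_scale, fitted_offset = fit_linear(raw, truth)
```

With real reference targets the fitted line would not pass through every point, and the residuals would give a direct estimate of the remaining metric error at each distance.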

DOI:10.30256/JIM.202607_(520).0012


The complete article is available in the July 2026 issue.
