Object detection in remote sensing images has gained prominence alongside advances in sensor technology and earth observation systems. Although current detection frameworks perform remarkably well on natural images, their performance degrades in remote sensing scenarios because of two inherent limitations: (1) complex background interference, which allows object features to be easily obscured by noise and reduces detection accuracy; and (2) large variation in object scales, which weakens the model’s generalization ability. To address these issues, we propose a progressive semantic-aware fusion network (ProSAF-Net). First, we design a shallow detail aggregation module (SDAM), which adaptively integrates features across different channels and scales in the early neck stage through dynamically adjusted fusion weights, exploiting shallow detail information to refine the representation of object edges and textures. Second, to integrate shallow detail information with high-level semantic abstractions, we propose a deep semantic fusion module (DSFM), which uses a progressive feature fusion mechanism to incrementally incorporate deep semantic information. This strengthens the global representation of objects and complements the rich shallow details extracted by SDAM, improving the model’s ability to distinguish objects and refine their spatial localization. Furthermore, we develop a spatial context-aware module (SCAM) that exploits both global and local contextual information to separate foreground from background and suppress interference, thereby improving detection robustness. Finally, we propose an auxiliary dynamic loss (ADL), which adaptively adjusts loss weights based on object scale and uses supplementary anchor priors to speed up parameter convergence during coordinate regression, thereby improving the model’s localization accuracy.
Extensive experiments on the RSOD, DIOR, and NWPU VHR-10 datasets demonstrate that our method outperforms other state-of-the-art methods.
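
As a rough illustration of what "dynamically adjusted fusion weights" can mean in practice, the sketch below fuses two feature vectors using softmax-normalized weights. It is only a minimal sketch under stated assumptions, not the SDAM design itself: the function names `softmax` and `weighted_fusion`, the use of flat Python lists in place of feature maps, and the choice of softmax normalization are all hypothetical and are not taken from the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(features, logits):
    """Fuse equal-length feature vectors with softmax-normalized weights.

    features: list of equal-length lists (hypothetical stand-ins for
              flattened feature maps from different scales)
    logits:   one scalar per input; in a trained network these would be
              learnable parameters, here they are plain floats (assumption)
    """
    if len(features) != len(logits):
        raise ValueError("need exactly one logit per feature input")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("all feature inputs must have the same length")
    weights = softmax(logits)
    # Element-wise weighted sum across the inputs.
    return [
        sum(w * f[i] for w, f in zip(weights, features))
        for i in range(length)
    ]


# Illustrative values only: equal logits give each input a weight of 0.5.
shallow = [1.0, 2.0, 3.0]
deep = [3.0, 2.0, 1.0]
fused = weighted_fusion([shallow, deep], [0.0, 0.0])
```

Raising one logit relative to the other shifts the fused output toward that input, which is the basic behaviour an adaptive fusion weight is meant to provide; how ProSAF-Net actually computes and applies its weights is not specified in this abstract.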