Jerry Zhu Home

                                    Spatially-Enhanced Recurrent Memory for Long-Range
                                    Mapless Navigation via End-to-End Reinforcement Learning
                                    Fan Yang1 , Per Frivik1 , David Hoeller1 , Chen Wang2 , Cesar Cadena1 , and Marco Hutter1



                                    Abstract
                                    Recent advancements in robot navigation, particularly with end-to-end learning approaches such as reinforcement
                                    learning (RL), have demonstrated remarkable efficiency and effectiveness. However, successful navigation still depends
                                    on two key capabilities: mapping and planning, whether implemented explicitly or implicitly. Classical approaches rely
                                    on explicit mapping pipelines to transform and register egocentric observations into a coherent map for the planning
                                    module. In contrast, end-to-end learning often achieves this implicitly—through recurrent neural networks (RNNs) that
                                    fuse current and historical observations into a latent space for planning. While existing architectures, such as LSTM
                                    and GRU, can capture temporal dependencies, our findings reveal a critical limitation: their inability to effectively arXiv:2506.05997v2 [cs.RO] 4 Sep 2025




                                    perform spatial memorization. This capability is essential for transforming and integrating sequential observations
                                    from varying perspectives to build spatial representations that support planning tasks. To address this, we propose
                                    Spatially-Enhanced Recurrent Units (SRUs)—a simple yet effective modification to existing RNNs—that enhance
                                    spatial memorization. To improve navigation performance, we introduce an attention-based network architecture
                                    integrated with SRUs, enabling long-range mapless navigation using a single forward-facing stereo camera. Additionally,
                                    we employ regularization techniques to facilitate robust end-to-end recurrent training via RL. Experimental results
                                    demonstrate that our approach improves long-range navigation performance by 23.5% overall compared to existing
                                    RNNs. Furthermore, when equipped with SRU memory, our method outperforms both RL baseline approaches—one
                                    relying on explicit mapping and the other on stacked historical observations—achieving overall improvements of 29.6%
                                    and 105.0%, respectively, in diverse environments that require long-horizon mapping and memorization capabilities.
                                    Finally, we address the sim-to-real gap by leveraging large-scale pretraining on synthetic depth data, enabling zero-
                                    shot transfer for deployment across diverse and complex real-world environments.


                                    Keywords
                                    Spatial Memory, End-to-End Mapless Navigation, Recurrent Neural Networks, Reinforcement Learning



                                    1   Introduction                                                 approaches are not easily deployable on smaller robotic
                                                                                                     platforms and often struggle to generalize to environments
                                    End-to-end learning for robot navigation has recently gained     beyond structured road networks. In contrast, embedded
                                    significant attention with its potential to address two major    robots often rely on end-to-end learning approaches, either
                                    challenges inherent in classical modular approaches: (a)         by imitating behaviors from datasets (Shah et al. 2023;
                                    system delays and (b) the difficulty of modeling complex         Karnan et al. 2022; Cèsar-Tondreau et al. 2021; Loquercio
                                    kinodynamic environmental interactions. These challenges         et al. 2021) or by optimizing policies through reinforcement
                                    have traditionally hindered the development of high-             learning (RL) (Wijmans et al. 2019; Zhu et al. 2017; Sur-
                                    speed platforms with intricate dynamics, such as legged-         mann et al. 2020). These methods typically employ specific
                                    wheeled robots. However, end-to-end learning approaches          network architectures, such as recurrent neural networks
                                    face their own challenges, particularly in achieving efficient   (RNNs), to implicitly learn spatial-temporal mappings (Wij-
                                    spatial mapping. Unlike classical mapping pipelines, which       mans et al. 2019, 2023). While these approaches have
                                    explicitly transform historical ego-centric observations into    demonstrated success in structured indoor environments with
                                    a coherent map frame for downstream planning, end-to-end         discretized action and observation spaces, their performance
                                    learning relies on neural networks to implicitly learn this      often diminishes in more complex, real-world scenarios that
                                    process. This requires the network to iteratively build and      involve continuous action spaces and dynamic motions.
                                    update an environmental representation of the surroundings          Recently, for real-world deployments, researchers have
                                    and understand the spatial-temporal relationships between        started integrating explicit mapping pipelines (Miki et al.
                                    observations.                                                    2022b) to fuse ego-centric observations and provide
                                       In autonomous driving, large-scale mapping mod-               environmental information to learning modules for tasks
                                    ules (Mohajerin and Rohani 2019; Mescheder et al. 2019;
                                    Wei et al. 2023; Wang et al. 2025) are trained on thou-
                                                                                                     1 Robotic Systems Lab, ETH Zurich, Zurich, Switzerland
                                    sands of hours of data, enabling robust spatial mapping
                                                                                                     2 Spatial AI & Robotics Lab, University at Buffalo, Buffalo, NY, USA
                                    with specifically designed architectures, such as occu-
                                    pancy networks (Mescheder et al. 2019) or occupancy              Corresponding author:
                                    grid maps (Mohajerin and Rohani 2019). However, such             Fan Yang, fanyang1@ethz.ch

2 ()

such as perceptive locomotion (Miki et al. 2022a) and regularizations during end-to-end RL training is crucial. navigation (Lee et al. 2024; Francis et al. 2020; Weerakoon Furthermore, to address the sim-to-real gap caused by noisy et al. 2022). This raises an important question: can end-to- depth images, we pretrain the image encoder on a large- end learning networks with implicit memory mechanisms, scale synthetic dataset and augment the data using a fully such as RNNs, match or surpass the performance of parallelized depth-noise model, adapted from Handa et al. approaches that rely on explicit mapping pipelines? (2014a); Barron and Malik (2013a); Bohg et al. (2014a). In Specifically, do RNNs have inherent limitations in learning summary, our main contributions are as follows: spatial-temporal mappings? • Addressing Spatial Mapping Limitations with While RNNs excel at capturing temporal dependencies, SRUs: We identify that standard RNNs, while effec- showcased by their success in various sequential tasks, tive in capturing temporal dependencies, can strug- such as natural language processing (Sutskever et al. gle with spatial registration of observations from 2014) and time-series prediction (Siami-Namini et al. different perspectives. To overcome this, we intro- 2019), their ability to learn spatial transformations and duce Spatially-Enhanced Recurrent Units (SRUs) that memorization remains a topic of research. RNNs are enhance the ability to learn implicit spatial transforma- designed to process sequences of data by maintaining an tions from sequences of ego-centric observations. internal state that captures temporal dependencies. However, • End-to-End Reinforcement Learning with SRUs it is not yet clear to what extent they can effectively and Attention-based Policy: We integrate the SRU learn spatial transformations and integrate observations from unit into a proposed attention-based network archi- different perspectives. Classical approaches achieve spatial tecture, enabling improved end-to-end reinforcement registration through homogeneous transformations in three- learning for long-range mapless navigation tasks using dimensional space, aligning observations into a consistent only ego-centric observations. local or global frame. For RNNs to achieve effective spatial • Large-Scale Pretraining for Zero-Shot Sim-to-Real registration, they must not only memorize sequences but also Transfer in Long-Range Mapless Navigation: By learn to transform and integrate observations across time and leveraging large-scale synthetic pretraining and a space. parallelizable depth-noise model, our system bridges In this work, we examine the spatial-temporal mem- the sim-to-real gap, enabling zero-shot deployment ory capabilities of several recurrent architectures, including on a legged-wheel platform in diverse real-world Long Short-Term Memory (LSTM) (Hochreiter and Schmid- environments, using a single forward-facing stereo huber 1997), Gated Recurrent Unit (GRU) (Cho et al. 2014), camera for long-range mapless navigation. and recent State-Space Models (SSMs) such as S4 (Gu et al. 2021) and Mamba-SSM (Gu and Dao 2023). We evaluate these models on two criteria: (i) their ability to 2 Related Works memorize temporal sequences and (ii) their capacity to The navigation and planning problem has been studied register and transform sequential observations across varying extensively for decades. Early approaches relied on classic spatial perspectives. Our findings indicate that, while these search-based methods, including Dijkstra’s algorithm and models perform well in capturing temporal dependencies, A* (Dijkstra 1959; Hart et al. 1968) operating on pre- they exhibit limitations in spatial registration, particularly discretized grids, as well as on sample-based techniques under conditions of dynamic ego-motion and rapidly chang- such as the Rapidly-exploring Random Tree (RRT) fam- ing perspectives. ily—including variants like RRT*, RRT-Connect (LaValle To address this limitation, we introduce Spatially- et al. 2001; Karaman and Frazzoli 2011; Kuffner and LaValle Enhanced Recurrent Units (SRUs), a simple yet effective 2000), etc.—and probabilistic roadmap (PRM) methods, modification to standard LSTM and GRU units that enhances such as Lazy PRM and SPARS (Kavraki et al. 1996; Bohlin their spatial registration capabilities when processing and Kavraki 2000; Dobson and Bekris 2014). While these sequences of ego-centric observations. Unlike classical techniques have achieved significant success in robotics and mapping pipelines that rely on explicit homogeneous real-world applications (Wellhausen and Hutter 2023), they transformations, our approach enables the recurrent units to depend on building or the existence of a predefined naviga- implicitly learn the transformations from varying observation tion or occupancy map. Consequently, they often struggle perspectives effectively. To further enhance the performance in unknown or dynamic environments (Yang et al. 2022a), of long-range navigation tasks, we propose an attention- particularly when planning under static-world assumptions based network architecture integrated with SRUs, allowing or when complex kinodynamic constraints are present (Webb the model to learn long-range mapless navigation policies and Berg 2012; Ortiz-Haro et al. 2024). Moreover, these using only ego-centric observations via end-to-end RL. Our classical methods typically require an additional perception experiments demonstrate improvements in spatial awareness and mapping module, and the predetermined traversability or compared to the baselines. With the SRU memory, the occupancy maps are usually based on heuristic designs rather implicit recurrent approach via RL with sparse rewards than being optimized for a specific robotic platform. promotes robust exploration in complex 3D and maze- To address these limitations, recent research has increas- like environments, outperforming the baseline that rely on ingly turned to learning-based approaches, especially for explicit mapping and memory modules. more complex robotic agents (e.g., quadrupeds or wheel- To prevent premature convergence to suboptimal strategies legged systems). For instance, recent works in imitation and fully exploit the capabilities of the proposed attention- learning leverage large-scale video data or demonstra- based recurrent structure, we find that incorporating tions (Pfeiffer et al. 2017; Bojarski et al. 2016; Loquercio

Prepared using sagej.cls F. Yang, P. Frivik, D. Hoeller, C. Wang, C. Cadena, and M. Hutter 3

et al. 2021; Shah et al. 2022, 2023) to directly map raw egocentric sensory inputs to navigation actions. Given the challenges of capturing dynamic, closed-loop interactions from purely offline data, researchers have also explored model-free reinforcement learning (RL) methods (Shi et al. 2019; Wijmans et al. 2019; Choi et al. 2019; Hoeller et al. 2021; Truong et al. 2021; Wu et al. 2021; Ruiz-Serra et al. 2022; Fu et al. 2022; Huang et al. 2023; Bhattacharya et al. 2024; Lee et al. 2024) that train navigation policies end-to- end by simulating the entire robot dynamics. By replacing the traditional perception, mapping, and planning pipeline with a tailored network—such as architectures based on recurrent networks (Wijmans et al. 2019; Choi et al. 2019; Hoeller et al. 2021; Wu et al. 2021) or Transformers with attention mechanisms (Ruiz-Serra et al. 2022; Huang et al. 2023; Bhattacharya et al. 2024; Zeng et al. 2024)—these approaches have achieved improvements in navigation tasks as well as in robotic locomotion (Miki et al. 2022a; Yang et al. 2022b; Kareer et al. 2023). A key challenge with end-to-end approaches is learning a robust state representation from the partial observations provided by egocentric sensors. Recent studies have attempted to mitigate this challenge by incorporating explicit mapping and memory mechanisms (Savinov et al. 2018; Cimurs et al. 2021; Fu et al. 2022; Lee et al. 2024) or by employing specialized network architectures like RNNs (Hoeller et al. 2021; Wijmans et al. 2019; Choi et al. 2019). However, RNNs—originally designed to capture temporal sequences in language tasks (Cho et al. 2014)—are not inherently well-suited for spatial mapping, particularly when processing sequential egocentric observations from continuously changing perspectives. For instance, while prior studies in indoor navigation have shown that spatial cues can be decoded from RNN memories, this effect is limited to employing binary contact sensing and does not extend to high-dimensional visual inputs (Wijmans et al. 2023). Moreover, recent findings suggest that variations in recurrent network architectures have minimal impact on the final task-level rewards achieved through reinforcement learning (Duarte et al. 2023). This indicates that, architectural differences, some fundamental limitations may persist across these RNN units. In this paper, we explore a key limitation of existing RNN- based architectures in addressing partial observability—their spatial memorization capabilities—highlighting their short- comings in learning spatial transformations and integrating observations from different perspectives. We then introduce Spatially-Enhanced Recurrent Units (SRUs) and demon- strate their effectiveness in improving long-range mapless navigation tasks with a specifically designed attention-based network structure via end-to-end reinforcement learning. 3 Problem Statement Consider a robot operating in a three-dimensional (3D) environment E ⊂ R3 . At each time step t, the robot is located at a configuration defined by its position and orientation in SE(3), and receives an observation its egocentric sensors. The navigation objective is defined as starting from an initial relative goal position p1 ∈ R3 in the robot’s egocentric frame and reaching a designated goal region G ⊂ R3 , such that the relative goal position satisfies ∥pt ∥ < ϵ, where ϵ > 0 represents a specified tolerance, within a finite time horizon t ≤ Tmax . The robot follows a policy π that maps its current state st to an action at . However, due to the egocentric setup, the agent’s current state st is not fully observable from a single sensor snapshot. Formally, we model the navigation task as a Partially Observable Markov Decision Process (POMDP), characterized by the tuple: (S, A, T , R, O, Z, γ), with the following components: • S: the set of all possible states of the environment. • A: the set of actions available to the robot. • T : S × A × S → [0, 1]: the state transition function, where T (s, a, s′ ) denotes the probability of transition- ing from state s to state s′ when action a is taken. • R : S × A → R: the reward function, assigning a scalar reward to each state-action pair. • O: the set of all possible observations. • Z : S × A × O → [0, 1]: the observation function, specifying the probability of receiving an observation given the current state and action. • γ ∈ [0, 1]: the discount factor that balances immediate reward and future payoffs. The action at ∈ A is executed in the robot’s local frame at time t. At each time step t, the robot receives an observation ot ∈ O and combines it with its historical observations Ht−1 to determine the current state, where Ht−1 is defined as: Ht−1 := {o1 , o2 , . . . , ot−1 }. We define a function f that fuses current and historical observations into an estimate ŝt of the unobservable state st ∈ S: ŝt = f (ot , Ht−1 ). The policy π then maps the estimated state ŝt to an action at : despite at = π(ŝt ). Due to the robot’s ego-motion, the current observation ot can be captured from a different perspective or observation frame compared to historical observations in Ht−1 . Therefore, to fuse observations from different viewpoints, the function f typically involves a spatial transformation that aligns the observations into a coherent reference frame to estimate the current state st for the policy π. In classical mapping pipelines, spatial transformations are typically achieved through homogeneous transformations that combine rotations and translations. However, in end- to-end learning approaches, the function f , which can be parameterized by neural network weights, is learned implicitly. 4 Methodology ot through 4.1 Overview To tackle the long-range navigation task, we first examine and demonstrate the limitations of existing recurrent

Prepared using sagej.cls 4 ()

architectures (e.g., LSTM, GRU, S4, and Mamba-SSM) and h̃t are the update, reset, and candidate hidden states in in a spatial-temporal memory task. We then introduce GRU, respectively. The weights W and biases b are learnable the Spatially-Enhanced Recurrent Units (SRUs). Next, parameters, and σ denotes the sigmoid activation function. we integrate SRUs into an attention-based network Note that for all the letters used in the RNN formulations, architecture to learn long-range mapless navigation via end- we use a “monospaced” font style to prevent confusion with to-end reinforcement learning. Furthermore, we discuss the the symbols used in the remainder of the paper. importance of incorporating regularization techniques to More recently, the State-Space Model (Gu et al. 2021) was prevent early overfitting, which we find to be crucial to introduced, which is inspired by the general form of state enhance SRUs’ spatial memorization. Finally, we address space models widely used in control theory. Such models the sim-to-real gap by pretraining the depth image encoder take the following form in discrete time: on large-scale synthetic depth data and incorporating a parallelizable depth-noise model, enabling zero-shot transfer xt = Ā xt−1 + B̄ ut , to real-world environments. yt = C̄ xt + D̄ ut , where xt represents the state at time t, ut the input, and yt 4.2 Background: Recurrent Neural Networks the output. The matrices Ā, B̄, C̄, and D̄ define the system Recurrent Neural Networks (RNNs) are a class of neural dynamics. networks designed to process sequential data by maintaining In Gu et al. (2020a), a so-called HiPPO matrix is proposed a hidden state that captures temporal dependencies. Given a for Ā and B̄ to optimally project historical information into sequence of inputs x := (x1 , x2 , . . . , xT ), an RNN computes the current state via a polynomial basis. Although this a sequence of hidden states h := (h1 , h2 , . . . , hT ) using the approach has led to a new family of RNN models (e.g., S4, following recursive formula: S5, Mamba-SSM) that excel at capturing long-term temporal dependencies, their emphasis on long temporal processing is ht := f (xt , ht−1 ), not the focus of this paper; therefore, we omit further details on these models. where f is a function that combines the previous hidden state ht−1 with the current input xt to compute the current hidden state ht . This is analogous to the function mentioned 4.3 Spatial Mapping Limitations in RNNs earlier, which fuses the current observation ot with the Achieving long-range mapless navigation from ego-centric historical observations Ht−1 into the estimated current state observations requires the robot to perform effective spatial ŝt . The hidden state ht captures the network’s internal mapping. In three-dimensional (3D) space, spatial mapping representation at time t, encoding information from the is commonly achieved using homogeneous transformations, entire input sequence up to that point. However, the vanilla which combine rotations and translations. A general version of RNN can suffer from gradient vanishing and representation of such a transformation is expressed as: explosion issues (Hochreiter and Schmidhuber 1997; Cho ′ p R t p et al. 2014). To address these problems, several variants have = ⊤ , 1 0 1 1 been proposed, including LSTM and GRU, as commonly used in sequential tasks nowadays. These models introduce where p and p′ are the coordinates of a point in the original gating mechanisms that control the flow of information and transformed frames, R is the rotation matrix, and t is through the network, enabling better long-term memory the translation vector. In the context of RNNs, this spatial retention and gradient flow. Those types of RNNs adopted mapping and transformation is implicitly learned through the gates and residual connections across temporal sequences, function f , which integrates the current observation ot with resulting in strong performance in various sequential tasks. the historical observations Ht−1 to estimate the current state The standard LSTM unit takes the following form: st . In this section, we assess the spatial and temporal map- it = σ(Wxi xt + Whi ht−1 + bi ), ping performance of existing recurrent structures—namely ft = σ(Wxf xt + Whf ht−1 + bf ), LSTM, GRU, and the recent S4 and Mamba-SSM—on two ot = σ(Wxo xt + Who ht−1 + bo ), fronts: (i) temporal memorization and (ii) spatial transfor- gt = tanh(Wxg xt + Whg ht−1 + bg ), mation and memorization. Consider an abstract scenario relevant to the navigation task, in which a robot is initialized ct = ft ⊙ ct−1 + it ⊙ gt , at a pose in SE(3) and moves randomly within the three- ht = ot ⊙ tanh(ct ), dimensional environment E ⊂ R3 . At each time step t, the robot receives an observation ot containing the coordinates and the GRU unit: of observed landmarks lti ∈ R3 , defined relative to the robot’s current frame (indicated by the subscript t). Each landmark zt = σ Wxz xt + Whz ht−1 + bz , is also associated with a binary categorical label ci , which rt = σ Wxr xt + Whr ht−1 + br , is temporally relevant and independent of the observation h̃t = tanh Wxh xt + Whh (rt ⊙ ht−1 ) + bh , frame. Additionally, the robot is provided with its ego- ht = (1 − zt ) ⊙ h̃t + zt ⊙ ht−1 , motion transformation matrix Mtt−1 , representing the trans- formation from the previous pose at time step t − 1 to the where it , ft , ot , and gt are the input, forget, output, and pose at time step t. This transformation enables the network cell gate activations in LSTM, respectively. Similarly, zt , rt , to align and integrate observations into a unified reference

Prepared using sagej.cls F. Yang, P. Frivik, D. Hoeller, C. Wang, C. Cadena, and M. Hutter 5

                       (a) Temporal Training Loss

                                                                                (a) Spatial Mapping with LSTM

                        (b) Spatial Training Loss

Figure 1. Training for the Spatial-temporal Memorization: (a) Temporal memorization loss shows that standard RNN units (LSTM, GRU, S4, and Mamba-SSM) effectively recall sequential information. (b) Spatial memorization loss indicates that these units struggle with accurate spatial transformations and memorization under changing observation perspectives, resulting in misaligned landmark coordinates. (b) Spatial Mapping with SRU-LSTM

frame, akin to classical homogeneous transformations. Over a sequence of T time steps, the RNN processes these observations. At the final step T , the network is evaluated simultaneously on its ability to:

 • Memorize and accurately predict the sequence of
   binary categorical labels associated with the observed
   landmarks, ensuring temporal association and order                              (c) Spatial Memory Error
   preservation (Temporal Task).
 • Transform and register the spatial coordinates of all      Figure 2. Spatial Mapping Comparison: (a) and (b) depict the
   observed landmarks into the final robot frame at t =       spatial mapping performance of LSTM and SRU-LSTM units on
   T , achieving spatial alignment and memorization of        synthetic data, respectively, as the robot follows a spiral path,
                                                              observing landmarks from different perspectives. At the end of
   their positions (Spatial Task).
                                                              the path, the robot is tasked with memorizing and transforming
                                                              the observed landmark coordinates into the final robot frame.
                                                              Numbers indicate observation time steps. (c) illustrates the The training details are provided in Appendix A. The              mean spatial memory errors (log scale) across observation step results indicate that while LSTM, GRU, S4, and Mamba-             indices, ordered from the final (15) to the initial step (1), SSM effectively encode temporal sequences and retain              averaged over various randomly generated trajectories and landmark categories, as shown in Figure 1(a), they                observations. face significant challenges in accurately memorizing and transforming landmark coordinates from sequential ego-centric observations. This limitation is reflected in the higher mean squared error (MSE) when recalling                4.4    Spatially-Enhanced Recurrent Unit (SRU) observed landmark positions during training, as depicted in       To address the limitations of spatial mapping of existing Figure 1(b).                                                      RNN units, we propose a modification to the standard LSTM

Prepared using sagej.cls 6 ()

and GRU architectures by introducing an additional spatial landmark coordinates into the final frame, as well as recalling transformation operation. This enhancement results in a the associated categories of the observed landmarks. Our new class of units, termed Spatially-Enhanced Recurrent experiments demonstrate that the SRU modification enables Units (SRUs). The added operation enables the network the network to effectively learn spatial transformations. In to implicitly learn spatial transformations, aligning and contrast, the baseline models struggle to align and memorize memorizing observations from varying perspectives while landmark coordinates observed in earlier steps, resulting in preserving robust temporal memorization capabilities. higher spatial errors, particularly for earlier observations, as The effectiveness of this approach is demonstrated depicted in Figure 2(c). However, both SRU and baseline by the training results of the spatial mapping task models achieve 100% accuracy in recalling the categories mentioned above and illustrated in Figure 1. With the of the observed landmarks. Since the temporal task results SRU modification, the network effectively transforms and are identical across all models, they are not visualized memorizes observed landmark coordinates from different Specifically in Figure 2. Furthermore, the latest recurrent perspectives, as indicated by the spatial loss curve in units, such as S4 and Mamba-SSM, which excel at long- Figure 1(b), while preserving similar temporal memorization term temporal memorization, exhibit even worse spatial performance compared to standard LSTM and GRU units, as memorization capabilities, as shown in both Figure 1(b) and shown in Figure 1(a). The design of SRUs emerged through Figure 2(c). The SRUs exhibit consistently low spatial errors iterative experimentation and analysis of spatial mapping across all observation steps, underscoring their superior performance. The final formulation draws inspiration from spatial memorization capabilities. the multiplicative form of homogeneous transformations and recent research on the use of the ”star operation” (element- 4.5 Attention-Based Recurrent Network wise multiplication) to enhance the representational capacity Architecture for Navigation of neural networks (Ma et al. 2024). The following equations detail the modifications to To leverage the SRUs in the navigation context, we propose incorporate spatial transformations into both LSTM and an attention-based recurrent network architecture for long- GRU units, ensuring a balance between spatial memorization range mapless navigation tasks using raw front-facing stereo and temporal dependency learning. In each case, we compute depth input, as illustrated in Figure 3. The network consists an additional spatial transformation term, denoted as st , of a pretrained depth image encoder, two spatial attention which acts as a mechanism to implicitly transform and align layers incorporating both self-attention and cross-attention the candidate state with the current observation’s perspective. mechanisms to enhance and compress the encoded visual For the modified LSTM, referred to as SRU-LSTM, we features, and a recurrent unit (SRU) that learns a spatial- define: temporal representation of the state by fusing the current observation with historical observations. Finally, a multilayer st = Wxs xt + bs , perceptron (MLP) head computes the actions from the gt = tanh st ⊙ Wxg xt + Whg ht−1 + bg . recurrent hidden state, outputting velocity commands for the robot’s locomotion controller. Similarly, for the modified GRU, referred to as SRU-GRU, Depth Encoder Pretraining and Simulated Perception the formulation is enhanced as follows: Noise For the depth encoder, we adopt a convolutional neu- st = Wxs xt + bs , ral network (CNN) backbone based on RegNet (Radosavovic et al. 2020), chosen for its simplicity and efficiency. h̃t = tanh st ⊙ Wxh xt + Whh (rt ⊙ ht−1 ) + bh . This is further enhanced with a Feature Pyramid Net- work (FPN) (Lin et al. 2017) to capture spatial features To further enhance spatial-temporal memorization, we across multiple scales. The encoder is pretrained for self- extend the SRU-LSTM with a refined gating mechanism (Gu reconstruction on large-scale synthetic depth image data et al. 2020b), simply referred to as SRU-Ours in the from TartanAir (Wang et al. 2020) using a variational following sections. This mechanism introduces an additional autoencoder (VAE) framework. This pretraining enables the refining function to address gating saturation issues during encoder to learn and extract robust and generalizable fea- recurrent training. The final modification, compared to the tures from depth images, facilitating effective downstream vanilla LSTM unit, is as follows: navigation learning and deployment. However, depth images st = Wxs xt + bs , captured in simulation often differ from those obtained in real-world environments due to various sensor artifacts gt = tanh st ⊙ Wxg xt + Whg ht−1 + bg , and noise. To address this sim-to-real gap, we integrate a parallelized depth-noise model, adapted from (Handa et al. rt = it ⊙ 1 − (1 − ft )2 + (1 − it ) ⊙ f2t , 2014b; Barron and Malik 2013b; Bohg et al. 2014b), which ct = rt ⊙ ct−1 + (1 − rt ) ⊙ gt . introduces configurable noise to the depth images, such as:

The effectiveness of SRU is further validated and compared • Edge Noise: Distortions at sharp depth discontinuities to the standard LSTM unit in the designed spatial mapping due to abrupt changes in the scene. task, as illustrated in Figure 2. In this task, the robot follows • Filling Noise: Blurring artifacts introduced during the a spiral path, observing landmarks from varying perspectives interpolation of missing or unregistered pixels. along its trajectory. At the end of the path, the robot • Rounding Noise: Quantization effects resulting from is tasked with memorizing and transforming the observed sensor resolution limitations, causing rounding errors.

Prepared using sagej.cls F. Yang, P. Frivik, D. Hoeller, C. Wang, C. Cadena, and M. Hutter 7

Figure 3. Attention-based recurrent network architecture for navigation: The network integrates a pretrained image encoder and an attention mechanism to compress and emphasize relevant features from encoded observations. These features, combined with proprioceptive inputs, are processed by the SRU unit, which learns spatial transformations and temporal dependencies and fuses them with historical observations to estimate the robot’s state. The state is then mapped to actions using an MLP-based head integrated with the temporally consistent (TC) dropout layer for improved robustness and generalization.

                                                                  features while suppressing less important ones. Next, Ft′
                                                                  is processed by a cross-attention layer, where the query
                                                                  is derived from the robot’s egocentric proprioceptive state
                                                                  oprop
                                                                    t —which includes linear velocity vt , angular velocity
                                                                  ωt , projected gravity nt , and the previous action at−1 —as
                                                                  well as the current relative goal position pt . This procedure
                                                                  is illustrated in Figure 3. This operation compresses the
 (a) Original Synthetic Image   (b) Image with Artificial Noise
                                                                  2-dimensional feature map into a 1-dimensional latent
                                                                  representation Figure 4. Simulated stereo depth noise: (a) Synthetic depth                                    F̂t ∈ RC×1 , image from the TartanAir dataset (Wang et al. 2020), (b) Image with augmented artificial noise. The depth-noise model
                                                                  which preserves the most relevant spatial features while introduces edge, filling, and rounding noise to the depth images,     reducing the dimensionality. This compressed feature F̂t is simulating realistic sensor artifacts.                                then concatenated with the proprioceptive state oprop    t   and
                                                                  the relative goal position pt before being passed to the
                                                                  recurrent memory unit. There, it is fused with historical
                                                                  observations to form a spatial-temporal representation of Figure 4 illustrates an example of the simulated stereo depth         the robot’s current state, integrating both exteroceptive and noise. The depth-noise model is designed for efficient batch          proprioceptive information. processing, enabling parallelized pretraining on large-scale             Figure 5 visualizes the output attention weights of the datasets and during RL with simulated depth images. For               cross-attention layer across four distinct attention heads implementation details, refer to Appendix B.                          (depicted in different colors), overlaid on the depth input. Attention Layers for Feature Compression During nav-                  It highlights how the attention mechanism focuses on depth igation, humans and animals tend to focus on the most                 features relevant to the robot’s current state. When the robot’s relevant spatial cues rather than attempting to memorize              movement direction is manually altered, the output attention all available information (Matthis et al. 2018). This selec-          weights shift accordingly, emphasizing spatial regions and tive attention enables more efficient and effective memory            obstacles in the new direction. These behaviors emerge usage. To emulate this, we combine self-attention and cross-          naturally during end-to-end learning, demonstrating the attention mechanisms in our architecture. These spatial atten-        policy’s ability to effectively acquire critical spatial cues for tion layers process high-dimensional visual inputs, extracting        navigation. the information most relevant to the robot’s current state.
                                                                  Spatially-Enhanced Recurrent Unit (SRU) The Spatially- Specifically, given the feature map encoded by the pretrained
                                                                  Enhanced Recurrent Unit (SRU), as described in Sec. 4.4, depth encoder:
                                                                  processes the compressed feature map F̂t , along with
                   Ft ∈ RC×H×W ,
                                                                  the robot’s current proprioceptive state opropt     and the each spatial feature Ftij (where i ∈ {1, . . . , H} and j ∈           previous hidden state ht−1 , to generate a spatial-temporal {1, . . . , W }) is enriched with global context via a self-          representation of the surrounding environment from the attention mechanism, resulting in a refined feature map               robot’s egocentric observations. The observation of the Ft′ . Here, H and W denote the height and width of the                current proprioceptive state oprop
                                                                                                 t    provides essential ego- feature map, respectively, while C represents the number              motion information, equivalent to the transformation matrix of channels. The self-attention mechanism computes the                Mtt−1 in Sec. 4.3. The SRU learns to implicitly perform attention weights across the spatial dimensions of the feature        spatial transformations, aligning the current observation map, enabling the network to fuse and emphasize relevant              feature map F̂t with the previous hidden state ht−1 . The

Prepared using sagej.cls 8 ()

        (a) Robot State — Turning Left           (b) Robot State — Going Straight             (c) Robot State — Turning Right

Figure 5. Visualization of cross-attention weights corresponding to different robot states over raw real-world depth input: (a) When the robot turns left, the attention weights highlight the left region, focusing on the left pillar in the depth image; (b) When the robot moves straight, the attention weights emphasize the central region, capturing both pillars; (c) When the robot turns right, the attention weights focus on the right region, concentrating on the right pillar. Distinct colors in the attention weights visualizations represent different attention heads.

resulting hidden state ht , which encapsulates the integrated adopt two reward configurations: a tight reward with a small environmental information to estimate the robot’s current σ to encourage precise goal-reaching behavior, and a loose state st , is subsequently passed through a multi-layer reward with a larger σ to promote exploration and stabilize perceptron (MLP) head to compute the robot’s action at . training through intermediate guidance. The regularization term rtreg encourages smooth behaviors 4.6 Learning Navigation with Sparse Rewards by penalizing rapid action changes and excessive joint and Regularizations accelerations. This is implemented using L1 regularization on the difference between the current action at and a The final attention-based network with the SRU is trained momentum-filtered version of previous actions am t , as end-to-end using RL to achieve long-range mapless defined below: navigation, with the objective of maximizing the cumulative reward over the episode. The reward function for the am m t = λ · at−1 + (1 − λ) · at , navigation task is designed as a combination of task-level rewards rtask , regularization rreg , and penalty rpen terms, as where λ is the momentum factor. The regularization reward follows: is then given by: rt = α1 rttask − α2 rtreg − α3 rtpen . rtreg = β1 · ∥at − am acc t ∥1 + β2 · ∥jt ∥1 , Here, α1 , α2 , and α3 are coefficients used to balance the contributions of the task-level reward, regularization, and where β1 and β2 are regularization coefficients, and jtacc penalty terms, respectively. The task-level reward rttask is the represents joint-level accelerations from the simulation reward signal that encourages the agent to reach the goal. environment. The penalty term rtpen discourages unsafe To promote exploration in complex environments, we adopt behaviors such as collisions or excessive tilt: time-based rewards, similar to Rudin et al. (2022); Zhang et al. (2024); He et al. (2024), that provide feedback at the rtpen = η1 · 1(collision) + η2 · max(0, |θt | − θsafe ) end of the episode. This approach provides a sparse reward signal, encouraging the agent to reach the goal without where η1 and η2 are penalty coefficients, θt is the robot’s being distracted by intermediate rewards. However, with a current tilt angle, and θsafe defines the safe tilt threshold. long episode length Tmax , e.g., 60 seconds, and a rewarding The reward formulation described above is consistently period Tr of only 2 seconds, the network may learn to delay applied throughout the entire RL training process, which progress until the final step. To mitigate this, we introduce is conducted end-to-end using an Asymmetric Actor- a random check during the episode with a small probability Critic (Pinto et al. 2017) setup and trained with PPO, without δcheck . This check incentivizes the agent to attempt reaching employing any additional environment or reward curricula. the goal earlier, without compromising the overall sparsity Further training details are provided in Appendix C. of the reward signal. The final reward formulation is adapted Training Regularization To mitigate overfitting and enhance from He et al. (2024) and is given as follows: robustness, we incorporate two additional regularization strategies during training. These strategies are crucial for 1(t > Tmax − Tr ∨ random < δcheck ) training a robust spatial-temporal representation with SRUs, rttask = 1 + ∥ pσt ∥2 as explained below and demonstrated in the experimental results. The regularization techniques are as follows: where 1(·) is the binary indicator function, Tr is the rewarding period, ∨ represents the logical OR operation, pt • Deep Mutual Learning (DML): As described in Xie is the relative goal position at time t with respect to the et al. (2025), DML involves training two policies current robot pose, and σ is a scaling factor controlling the simultaneously, enabling them to mutually distill reward’s spatial sensitivity. Similar to He et al. (2024), we knowledge from each other. This approach enhances

Prepared using sagej.cls F. Yang, P. Frivik, D. Hoeller, C. Wang, C. Cadena, and M. Hutter 9

   generalization and mitigates the risk of convergence       success rates (SR). Additionally, we compare the SRU
   to suboptimal solutions. The mutual distillation is        policy, integrated with our proposed network structure and
   achieved by incorporating a Kullback–Leibler (KL)          trained using recurrent RL, against two current state-of-the-
   divergence loss between the two policies, both of          art (SOTA) baselines (Huang et al. 2023; Lee et al. 2024)
   which are trained using standard proximal policy           for robot navigation with RL. Our evaluation highlights the
   optimization (PPO) (Schulman et al. 2017).                 advantages of the implicit recurrent memory provided by
 • Temporally Consistent Dropout (TC-Dropout): Adapt-         SRU in solving long-range mapless navigation tasks across
   ing from Hausknecht and Wagener (2022), we apply a         diverse environments.
   consistent dropout mask across time steps during both         Furthermore, we ablate the role of our proposed
   rollout and training, ensuring stable memory learning      spatial attention layers in compressing features from
   within the recurrent structure.                            encoded observations to improve memorization and overall
                                                              navigation performance. We compare our approach against    As shown in Figure 1, SRUs exhibit a slower convergence
                                                              the convolution and average pooling method used in rate for spatial memorization compared to learning temporal
                                                              Wijmans et al. (2019), as well as the attention mechanism dependencies, highlighting the inherent complexity and
                                                              introduced in Huang et al. (2023). We also investigate the slower pace of learning spatial transformations and forming
                                                              impact of regularization techniques on training the SRU unit spatial memory. This discrepancy can lead the network to
                                                              end-to-end in RL, evaluating their effectiveness in preventing favor easier-to-learn solutions early in training, relying on
                                                              early convergence to suboptimal solutions and enhancing temporal features while neglecting the formation of good
                                                              navigation performance. Finally, we explain and validate the spatial memorization, resulting in suboptimal performance.
                                                              pretrained image encoder’s ability to bridge the sim-to-real To address this, it is crucial to incorporate regularization
                                                              gap by demonstrating zero-shot transfer across diverse and techniques that mitigate early overfitting and promote the
                                                              complex real-world environments. exploration and development of more challenging spatial- temporal features during policy optimization. To tackle this challenge, we employ deep mutual learning (DML)              5.1    Experimental Setup strategies tailored for reinforcement learning (RL) (Xie et al.   We conduct our experiments in simulated 3D environments 2025). DML involves training multiple policies in parallel,       using NVIDIA IsaacLab (Mittal et al. 2023), which allowing them to distill knowledge from one another.              provides a realistic physics engine and fast, parallelizable This mutual distillation process enhances the network’s           simulation capabilities. The environments are designed to generalization capabilities and fosters the learning of robust    challenge the robot’s navigation capabilities and include and essential features. By regularizing each other, the           maze-like structures, randomly generated pillars, stairs, and models are less likely to converge prematurely to suboptimal      environments with negative obstacles, such as holes and pits, solutions that rely solely on easy-to-learn features, such        as shown in Figure 6. The robot is equipped with front-facing as temporal dependencies. Instead, DML encourages the             depth sensors as the only exteroceptive input, capturing the formation of spatial-temporal representations, leading to         surrounding environment from an egocentric perspective. improved overall performance. This approach is critical           Additionally, a state estimation and localization module for leveraging the full potential of the SRU network, as          provides the robot’s proprioceptive state oprop t , including demonstrated in the experimental results.                         linear and angular velocities (vt and ωt ), projected gravity    Secondly, compared to standard dropout layers, consistent      (nt ), and the relative goal position (pt ) with respect to the dropout addresses a critical issue in on-policy reinforcement     robot’s frame at time step t. Given the limited field of view learning, where standard dropout introduces inconsistent          (FoV) of the depth camera (Horizontal FoV: 105◦ , Vertical masks between the rollout and training stages (Hausknecht         FoV: 78◦ ) and a maximum range of 10 meters, the robot must and Wagener 2022). Building on this, we extend consistent         rely on its spatial-temporal mapping capabilities to navigate dropout with temporal consistency for training the recurrent      through the terrain and reach the designated goal region structure. Specifically, during the data rollout stage, we        effectively. The robot’s motion is controlled by a set of linear maintain the same dropout mask across all time steps,             and angular velocities, referred to as the action at , which ensuring temporal consistency. During the training stage,         is the output of the policy network. The navigation policy the same dropout mask is applied to the policy network.           operates at a frequency of 5 Hz. The robot is equipped with This approach promotes stable memory learning through             a learning-based locomotion controller (Lee et al. 2024), recurrent connections and enhances the robustness of the          operating at 50hz. This controller takes the action output learned policy.                                                   at from the high-level navigation policy and executes it to
                                                              control the robot. The policies are trained end-to-end using 5     Experiments                                                 reinforcement learning, without employing any distillation or
                                                              teacher-student setups. To evaluate the effectiveness of the proposed Spatially- Enhanced Recurrent Unit (SRU) and the attention-based network architecture in enhancing long-horizon robot              5.2    Comparsion with Recurrent Units navigation, we conduct experiments in both simulated and          We evaluate the performance of the proposed Spatially- real-world environments. We compare the SRU against               Enhanced Recurrent Units (SRUs) compared to standard standard LSTM and GRU units in long-range mapless                 LSTM and GRU units. Given the superior spatial navigation tasks, focusing on their performance in end-to-        memorization capability of SRUs, as demonstrated in end reinforcement learning (RL) training and navigation           Figure 1, we hypothesize that integrating SRUs will improve

Prepared using sagej.cls 10 ()

                                                                                                                            A

                                                                                                                            B

                                                                                                                            C

                                                                                                                            D Figure 6. Simulated environments used for training and testing RL-based navigation tasks: (A) Maze, (B) Random Pillars, (C) Stairs, and (D) Pits. These environments are parameterizable and can be randomly generated during both training and testing using the NVIDIA IsaacLab (Mittal et al. 2023) simulation framework.

                                                                                                       START
                                                                   GOAL          B
                                                         START  GOAL
     C           B

                         START                                                                         A

                             A

                                                                                      A

                                                                                          START
                                        GOAL

                                                                                                           GOAL
                                 MAZE                    PILLAR                           STAIR                             PIT

                                               (a) Navigation Policy with LSTM Unit

                                                       START   D                                                                                                        START
                                                                          GOAL
     C               B  GOAL
                         START

                         A

                                        GOAL                                              START

                                                                                                           GOAL
                                 MAZE                     PILLAR                           STAIR                            PIT

                                               (b) Navigation Policy with SRU Unit

Figure 7. Comparison of navigation trajectories using (a) Navigation Policy with LSTM Unit and (b) Navigation Policy with SRU-Ours. The traversed trajectories are shown in yellow. In maze environments, the LSTM policy becomes trapped in a dead-end corridor, repeatedly looping between points B and C, while the SRU policy successfully navigates through the corridor, traverses region D, and reaches the goal. In stair-like environments, the LSTM policy exhibits frequent back-and-forth movements in areas A and B, indicating unreliable spatial-temporal mapping. In pit environments, the LSTM policy fails to avoid previously encountered pits during turns at area A, whereas the SRU policy effectively recalls their locations and avoids them, even during backward motion.

the performance of navigation policies in addressing long- are equipped with the same components (attention, training range mapless navigation tasks. To test this hypothesis, we regularization, and a pretrained encoder); only the recurrent train, under same conditions, policy networks integrated network structure differs. We then evaluate their navigation with different recurrent units end-to-end using RL in the performance. simulated 3D environments shown in Figure 6. All policies

Prepared using sagej.cls F. Yang, P. Frivik, D. Hoeller, C. Wang, C. Cadena, and M. Hutter 11

                                                                 the performance in success rate compared to standard LSTM
                                                                 and GRU units.
                                                                    Figure 7 presents example traversed trajectories compari-
                                                                 son between the SRU and standard LSTM policies. In maze
                                                                 environments, the LSTM policy gets trapped in a dead-
                                                                 end corridor, looping between points B and C, while the
                                                                 SRU policy successfully passes through the corridor, demon-
                                                                 strating better spatial memorization capability. In stair-like
                                                                 environments, although the LSTM eventually reaches the
                                                                 destination, it exhibits frequent back-and-forth movements
                                                                 (areas A and B), indicating less reliable spatial-temporal
                                                                 memorization and estimation of the current state compared to
                                                                 the policy with SRU. In pit environments, the LSTM policy Figure 8. Training curve comparison between policies                 fails to avoid the previously encountered pits that are no integrated with different recurrent units: The average return from   longer visible in the current depth observation, when turning three random seeds during training. The architecture with SRU
                                                                 at area A. In contrast, the SRU policy effectively recalls the units achieves a higher return compared to the baseline LSTM and GRU units.
                                                                 locations of the pit and other previously observed obstacles,
                                                                 enabling it to avoid them during turns and backward motion.


              Navigation Success Rate - SR %
                                                                 5.3    Comparsion against RL-based Navigation
                                                                        Baselines   Model                Maze       Pillar   Stair   Pit    Overall
                                                                 Next, we compare our proposed network structure, trained   GRU                      68.1   73.6     35.7    66.7    61.0      using recurrent reinforcement learning with SRU, against   LSTM                     70.3   78.2     33.1    72.7    63.5      two state-of-the-art RL baseline methods: the Goal-guided
                                                                 Transformer-based RL approach (GTRL) (Huang et al.   SRU-GRU                  73.1   78.8     74.1    74.8    75.2
                                                                 2023) and the RL approach with Explicit Mapping and   SRU-LSTM                 75.9   76.7     79.3    74.1    76.5
                                                                 Historical Path (EMHP) (Lee et al. 2024). GTRL employs a   SRU-Ours                 76.0   81.0     82.8    75.6    78.9
                                                                 goal-guided Transformer (GoT) architecture to extract task- Table 1. Navigation success rate (SR) for policies integrated        relevant visual features from stacked historical observations, with different recurrent units across four environment types:        enabling mapless navigation using only egocentric input. Maze, Random Pillars, Stairs, and Pits. The table compares           EMHP employs an external mapping pipeline to integrate standard LSTM and GRU units with our proposed spatially              historical observations for local mapping (Miki et al. 2022b) enhanced counterparts: SRU-GRU, SRU-LSTM, and SRU-Ours.              and uses an explicit historical traversed path to address
                                                                 POMDP challenges in long-range mapless navigation. While
                                                                 explicit mapping (EMHP) can theoretically achieve high    As shown in Figure 8, the policy with the SRU memory              accuracy in spatial-temporal registration, it has two major unit is able to outperform those with standard LSTM and              drawbacks: (i) it introduces significant delays that hinder GRU units in terms of average return episode rewards                 real-time performance, especially on high-speed, agile during training, with results averaged across multiple random        platforms (Lee et al. 2024), and (ii) it relies on heuristic rules seeds. (Note: GRU training can exhibit instability, so only          (e.g., fixed context window lengths) to select information, its successful runs are included in the analysis.), Table 1          limiting its ability to capture complex spatial-temporal provides a summary of the navigation performance for                 dependencies and abstract information beyond the selected policies using different architectures. The best-performing          context window. model from each unit (determined by the highest average                 Figure 9(a) presents a comparison between the EMHP return rewards) is selected for comparison. The data is              baseline and our SRU-based approach. The EMHP policy averaged over 4800 episodes across 120 randomly generated            collects historical paths for approximately 20 meters, which environments, which are different from the training set.             is insufficient to navigate the long corridor spanning around The results are presented in terms of success rate (SR) for          30 meters. In contrast, recurrent neural networks offer an each environment. The SRU units consistently outperform              unlimited context window and can learn intricate spatial- the standard LSTM and GRU units, achieving an average                temporal dependencies optimized for the given task. This 21.8% improvement in SR across all environments with                 allows the SRU-based policy to adapt to long-horizon the SRU modification alone. Furthermore, incorporating the           navigation challenges more effectively. Moreover, our end- refined gating mechanism in the SRU-Ours model further               to-end architecture processes raw depth sensor inputs, boosts the results, achieving an overall 23.5% increase in           reducing latency and better supporting the agile and fast SR, demonstrating the effectiveness of these enhancements            motion of legged-wheel platforms during deployment. For in improving navigation performance. Notably, in stair-              the GTRL baseline, temporal history is provided by stacking like environments, where the 3D structure and significant            several past observation frames, which are then fused using occlusions pose challenges for navigation without precise            the transformer-based architecture as described in (Huang spatial memorization and registration capabilities, the              et al. 2023). Following the original approach in Huang et al. navigation policy with SRU units demonstrate over double             (2023), we use the 4-frame history in our experiments.

Prepared using sagej.cls 12 ()

However, the choice of the number of stacked frames Method SR (%) remains heuristic: a short history may miss important context, while a longer history increases computational cost GTRL (w. historical obs.) 38.2 quadratically. For a fair comparison, we retrain the GTRL EMHP (w. explicit mapping / path) 60.4 baseline, as described in Huang et al. (2023), within our GTRL* (w. SRU memory) 66.3 environment. We replace the RGB input with depth images Ours (w. SRU memory) 78.3 and utilize the same on-policy optimization method (PPO) for end-to-end RL training. To ensure consistency with the Table 2. Overall navigation success rate (SR) comparison platform utilized in Lee et al. (2024), this comparison is against baselines. Policies with SRU implicit recurrent memory conducted using the simulated wheeled ANYmal (Hutter outperform: (i) GTRL (stacked historical observations) and (ii) et al. 2016) robot model, which differs from the robot model EMHP (explicit mapping and historical path). GTRL* denotes our modified GTRL variant where stacked observations are used in the other comparisons in this paper. All policies replaced by SRU implicit memory, yielding a substantial gain are trained under identical conditions and evaluated on an over the original GTRL baseline. independent test environment set to ensure a fair comparison. Note that our policy and the GTRL baseline rely solely on a front-facing camera with a limited field of view, whereas the EMHP approach incorporates a local height scan with a Attention Configuration SR (%) similar range for environmental detection but benefits from a w/o. Attention 50.5 complete 360-degree field of view for mapping. GoT (GTRL) 68.4 The experimental results (see Table 2) indicate that our Spatial Attention (Ours) 78.9 architecture, leveraging an implicit memory representation Table 3. Navigation success rate (SR) for policies integrated within the recurrent module, outperforms both baselines. with different attention configurations. The policy with the Compared to EMHP, it achieves a 29.6% relative improve- proposed spatial attention (Ours) achieves the highest SR, ment in SR. Against GTRL, it demonstrates a remarkable outperforming (i) the baseline without attention Wijmans et al. 105.0% relative improvement. These results underscore the (2019) and (ii) the Goal-guided Transformer (GoT) Huang et al. limitations of explicit mapping or stacked-frame histories for (2023) integrated with SRU memory (GTRL approach in long-horizon navigation under limited context. Figure 9(a) Sec. 5.3), highlighting the importance of an attention illustrates a representative comparison: in a long-horizon mechanism for training mapless navigation end-to-end and the effectiveness of the proposed spatial attention structure. maze environment, the EMHP approach, despite achieving a relatively higher SR among the two baselines, is constrained by the limited horizon of its explicit memory, fails to navigate through the maze and eventually loops in a long corridor. In is extended to 120 seconds, the EMHP’s SR still declines contrast, our policy successfully traverses a dead-end corri- to below 60% at the same 40-meter distance, constrained dor and reaches the goal, demonstrating the effectiveness of by its fixed context window. Conversely, our SRU-based the SRU unit in learning implicit spatial-temporal mapping approach sustains an SR of over 70% for distances up to for long-horizon navigation tasks. To further validate the 120 meters, demonstrating the SRU’s superior ability to effectiveness of SRU for implicit spatial-temporal memo- implicitly learn spatial-temporal mappings and generalize rization in mapless navigation, we replaced GTRL’s stacked to distances beyond the training range. The baseline’s historical observations with our SRU-based recurrent mem- reliance on a fixed explicit memory window limits its ory while keeping all other components unchanged. This capacity to capture long-range dependencies, hindering its modified variant, denoted GTRL, achieves a 73.6% relative generalization in extended long-distance navigation tasks. improvement over the original GTRL baseline, increasing Furthermore, we observed that the EMHP policy struggles the SR from 38.2% to 66.3% (Table 2). This substantial to effectively learn to climb staircases unless dense reward gain highlights the advantage of SRU’s implicit recurrent guidance is provided. We believe this limitation arises from memory over heuristic stacking of historical observations the inherent difficulty explicit memory mechanisms face in capturing spatial-temporal dependencies for improved in capturing intricate spatial-temporal features, which are navigation performance. Notably, compared to GTRL, our essential for the robot to develop the maneuvers required complete SRU-based approach (Ours) achieves an additional to overcome 3D obstacles effectively. Lastly, the end-to- 18.1% relative improvement, attributable to our proposed end recurrent setup offers a simpler and more maintainable spatial attention layers. The impact of these attention layers solution compared to baseline methods. In contrast, baselines is further discussed later. rely on an external mapping pipeline and the storage of Quantitatively, while the EMHP approach achieves additional historical paths or observations for each robot, comparable navigation performance, we also analyze the which can introduce complexity and overhead during both success rate (SR) as a function of travel distance to evaluate training and real-world deployment. its long-range memorization and generalization capabilities. As shown in Figure 10, with a maximum episodic time of 60 seconds (consistent with training) and the robot’s maximum 5.4 Importance of Spatial Attention Layers speed set to 1.5 m/s, the EMHP approach’s SR drops We now examine the role of the proposed spatial attention significantly when the travel distance exceeds 40 meters. In layers in the network architecture and evaluate their impact contrast, our SRU-based approach maintains an SR of over on navigation performance. These layers are designed to 80% up to 50 meters. When the maximum episode time compress and emphasize relevant features from encoded

Prepared using sagej.cls F. Yang, P. Frivik, D. Hoeller, C. Wang, C. Cadena, and M. Hutter 13

                                                                 observations, addressing a key challenge faced by recurrent
                                                                 structures: the difficulty of retaining long-term information
                                          B
                                                                 due to the exponential decay of memory over time. By
                       C                                         selectively focusing on the most salient features, the attention
      GOAL                                                       mechanism emphasizes the most relevant spatial cues for
                                                      START
                                                                 navigation based on the robot’s state and reduces the
                                                                 information density passed into the recurrent memory at
                                                                 each step. We hypothesize that this mechanism can improve
                                               A                 the network’s memorization and navigation capabilities,
                                                                 enabling it to handle complex, long-range tasks more
                                                                 effectively.
                                                                    To test this, we conduct an ablation study by: (i)
                                                      MAZE
                                                                 removing the attention layers from our network architecture
   (a) Policy with Explicit Mapping and Historical Path (EMHP)   and replacing them with convolution followed by average
                                                                 pooling for feature compression, as implemented in Wijmans
           D
                                                                 et al. (2019), and (ii) comparing the performance of our
                                                                 proposed spatial attention layers against the Goal-guided
                       C                   B                     Transformer (GoT) architecture proposed in Huang et al.
                                                                 (2023). The GoT architecture utilizes a modified Vision
        GOAL                                          START      Transformer (ViT) that integrates the goal state as an
                                                                 additional token. It performs self-attention across both
                                                                 visual feature tokens and the goal token to extract goal-
                                               A                 relevant features. In contrast, our approach first applies
                                                                 self-attention exclusively to visual tokens to enhance
                                                                 spatial features. Subsequently, the goal and proprioceptive
                                                                 state are used as queries in the cross-attention layer
                                                      MAZE       to compress and extract the most relevant features. For
               (b) Policy with SRU Recurrent Memory
                                                                 a fair comparison in the ablation experiments, we use
                                                                 identical training settings for all approaches, integrating Figure 9. Comparison of proposed mapless method with SRU
                                                                 the SRU memories and the pretrained encoder while recurrent memory against the EMHP baseline approach in a             varying only the attention layers used to process visual maze environment. The robot’s traversed trajectory is shown in       features during RL. The GoT integrated with SRU is yellow, with traversal order marked as A, B, C, and D. (a) The       the the GTRL* approach, as described in Sec. 5.3. EMHP approach starts looping in the long corridor between            Figure 12 shows the average return rewards during points B and C, failing to navigate through the dead-end             training. The network without the attention layers exhibits corridors. (b) Our approach, with SRU recurrent memory,              significantly lower performance compared to the two policies successfully navigates from start to goal, rerouting through the dead-end corridors and reaching the goal through area D.
                                                                 utilizing attention mechanisms. Additionally, our proposed
                                                                 spatial attention layers outperform the GoT attention
                                                                 mechanism. Table 3 shows the SR performance of the
                                                                 three configurations: (i) without attention, (ii) with GoT
                                                                 attention, and (iii) with our proposed spatial attention (Ours).
                                                                 Our method achieves a 56.2% relative SR improvement
                                                                 over the no-attention baseline, highlighting the importance
                                                                 of selectively compressing and extracting spatial features
                                                                 for long-range mapless navigation when utilizing implicit
                                                                 recurrent memory. Furthermore, it achieves a 15.4% relative
                                                                 improvement (18.1% when trained with the ANYmal robot
                                                                 model, as shown in Table 2) over the policy utilizing GoT
                                                                 attention. This demonstrates that the proposed two-stage
                                                                 spatial attention mechanism more effectively extracts task-
                                                                 relevant cues, enhancing recurrent memorization and policy
                                                                 optimization. Figure 10. Success rate sorted by travel distance: comparison           Notably, the attention effect emerges naturally during between the EMHP baseline approach, which uses explicit              the end-to-end RL training without requiring additional mapping and a fixed-length historical path, and our approach,        supervision or auxiliary losses. Figure 11 illustrates the which employs the implicit recurrent memory of SRU. Our              attention weights generated by the cross-attention layer over method maintains a high success rate over longer distances           raw visual inputs in three distinct real-world deployment and extends effectively with longer episodic times. In contrast, the baseline’s success rate drops significantly for longer travel
                                                                 scenarios: an indoor office, an outdoor terrace, and a forest distances, even when the maximum episodic time is doubled,           environment. The attention weights, with four attention due to its fixed context window limitation.                          heads (depicted in different colors), dynamically emphasize
                                                                 the most relevant spatial cues, such as obstacles and

Prepared using sagej.cls 14 ()

            (a) Office Environment                  (b) Terrace Environment                     (c) Forest Environment

Figure 11. Visualization of attention weights for the cross-attention layer in three distinct real-world deployment scenarios over raw depth inputs: (a) Office environment, (b) Outdoor terrace environment, and (c) Forest environment. The attention weights dynamically highlight relevant spatial cues for navigation based on the robot’s state. The RGB images in the top corners are included for visualization purposes only.

                                                                 overly rely on easier-to-learn temporal features to solve
                                                                 navigation tasks, thereby failing to establish robust spatial-
                                                                 temporal memorization. To test this hypothesis, we conduct
                                                                 an ablation study by removing the regularization techniques,
                                                                 specifically deep mutual learning (DML), from the standard
                                                                 PPO training setup and comparing the performance against
                                                                 the setup with DML regularization.
                                                                    Figure 13 illustrates that the network without DML
                                                                 exhibits lower average return rewards during training.
                                                                 Notably, the performance difference between standard
                                                                 LSTM and SRU modifications becomes more pronounced
                                                                 when regularization techniques are applied. As shown in
                                                                 Table 4, the SR performance improves from 61.8% to Figure 12. Average training return rewards for attention             65.7% (a 6.3% increase) without DML and from 63.5% ablations (all using SRU recurrent memory): (1) without              to 78.9% (a significant 24.3% increase) with DML. This attention (w/o.) Wijmans et al. (2019); (2) Goal-guided              finding underscores that, in certain RL tasks, the network’s Transformer (GoT) attention Huang et al. (2023); and (3) the
                                                                 architecture alone may not be the sole limiting factor. proposed two-stage spatial attention (Ours). The proposed spatial attention achieves the highest returns, indicating more      Instead, the optimization process plays a critical role in effective extraction of task-relevant spatial cues for improved      fully leveraging the network’s potential, highlighting the recurrent memorization.                                              importance of effective training strategies.
                                                                    Additionally, we observe that incorporating the consistent
                                                                 dropout layer with temporal consistency into the recurrent
                                                                 training can also positively impact navigation performance, navigable free space, based on the robot’s state at the time the     as shown in Table 5. This enhancement improves the SR depth input was recorded. This highlights the effectiveness          when tested in new, randomly generated environments. of training the spatial attention mechanism end-to-end and           These findings align with the discussion in Hausknecht its ability to generalize across diverse and challenging             and Wagener (2022), which highlights the benefits of environments.                                                        using dropout in RL training to enhance the network’s
                                                                 generalization and robustness. 5.5      Training with Regularizations We evaluate the role of regularization techniques in the end-        5.6     Large-Scale Pretraining for Sim-to-Real to-end training of the recurrent network using reinforcement                 Transfer learning. Firstly, as shown in Figure 1 and discussed in             In this section, we evaluate the pretrained image encoder, Sec. 4.6, while the SRU unit effectively enhances the                trained on a large-scale synthetic dataset, for its ability network’s ability to learn implicit spatial memorization             to bridge the sim-to-real gap in real-world perception. from sequential observations, the learning curve indicates           Additionally, we assess the effectiveness of the proposed that spatial memory learning can converge significantly              depth noise model in reducing discrepancies between slower than temporal memorization. This discrepancy,                 synthetic and real-world data. To this end, we conduct combined with the inherent properties of standard policy             zero-shot transfer experiments on legged-wheel platforms optimization algorithms like PPO—which restrict deviations           across diverse real-world environments to demonstrate the from previous optimization steps—and the complex structure           generalization of our approach. of attention networks with RNNs prone to overfitting, suggests that without proper regularization, the network may         Pretrain and Depth Noise. Here, we analyze the latent converge to suboptimal strategies. Such strategies might             space distribution of encoders trained under two distinct

Prepared using sagej.cls F. Yang, P. Frivik, D. Hoeller, C. Wang, C. Cadena, and M. Hutter 15

                                                               distribution range that encompasses the features extracted
                                                               from real-world data when derived from the encoder
                                                               pretrained on large-scale synthetic data (Figure 15(a)). This
                                                               indicates that the encoder pretrained on large-scale synthetic
                                                               data effectively captures a wide range of features, enabling
                                                               it to generalize well to real-world scenarios. In contrast, the
                                                               encoder trained solely on RL images (Figure 15(b)) exhibits
                                                               a narrower latent space distribution, failing to cover many
                                                               real-world features. This suggests that an encoder trained
                                                               exclusively on simulated depth images collected during RL
                                                               navigation training may struggle to generalize effectively to
                                                               real-world data when deployed.
                                                                   Additionally, Figure 14 provides a qualitative comparison Figure 13. Training curve comparison between policies trained      of depth reconstruction using features extracted from the using PPO with deep mutual learning (DML) regularization and       same two pretrained encoders. The comparison is based on a PPO: The network with DML regularization techniques achieves
                                                               real-world stereo depth input captured during deployment. higher returns compared to the network trained with vanilla PPO.                                                               The encoder pretrained on large-scale data demonstrates
                                                               effective reconstruction of the depth image, with only minor
                                                               blurring effects (Figure 14(b)). In contrast, the encoder
                                                               trained solely on RL images struggles to reconstruct the input
               RL Training            SR %                     depth image effectively, resulting in outputs with significant
               LSTM w/o. DML           61.8                    artifacts and noise (Figure 14(c)).
               LSTM w. DML             63.5                        To quantitatively assess the distributional disparity of
                                                               features from encoders trained on different sources, we
               SRU-Ours w/o. DML       65.7                    adapt the method from Lee et al. (2018) to measure the
               SRU-Ours w. DML         78.9                    Mahalanobis distance (MD) for the latent distributions
                                                               derived from each encoder. In addition to the two pretraining Table 4. Comparison of the overall navigation success rate (SR) with and without DML regularization for policies using
                                                               sources mentioned earlier, we also analyze the latent LSTM and SRU units. DML significantly enhances SR for SRU          distribution of the encoder pretrained on large-scale synthetic (over 20%) and provides a marginal improvement for LSTM            data without noise augmentation. This allows us to evaluate (2.8%), showcasing DML’s effectiveness in unlocking SRU’s          the effectiveness of the proposed depth noise model in potential for long-range mapless navigation.                       further reducing the sim-to-real gap between synthetic depth
                                                               images and real-world stereo depth. The MDs are computed
                                                               between the latent features of real-world images and the
              RL Training            SR (%)                    latent feature distributions of RL images extracted from
                                                               each encoder. As shown in Figure 16, pretraining on large-
              SRU-Ours w/o. TC-D       77.2                    scale synthetic data effectively reduces the MD, lowering the
              SRU-Ours w. TC-D         78.9                    median from 1.15 (RL images) to 0.82 (large-scale synthetic
                                                               data without noise). This demonstrates the pretrained Table 5. Evaluation of the overall navigation success rate (SR) with and without temporally consistent dropout (TC-D): The         encoder’s effectiveness in covering the distribution of real- network with TC-D is able to maintain a similar (or even higher)   world perception inputs. Furthermore, incorporating the SR compared to the network without TC-D, while improving           proposed depth noise model further reduces the MD to robustness and generalization.                                     0.69, underscoring its role in narrowing the differences
                                                               between synthetic and real-world depth data. These results
                                                               highlight that the encoder, pretrained on large-scale data
                                                               and augmented with the proposed depth noise model, can conditions: (i) an encoder trained exclusively on simulated        effectively minimize the sim-to-real gap, enabling improved depth images generated during RL navigation training               generalization to real-world environments. (RL images), (ii) an encoder pretrained on large-scale synthetic data from Wang et al. (2020), augmented with             Real-world Tests on Legged-wheel Robot. To evaluate the proposed parallelizable depth noise model (Figure 4).          the pretrained image encoder’s with the proposed attention- To evaluate these encoders, we compare the latent features         based recurrent network’s ability to generalize across in real- extracted from their outputs using two data sources: (i) RL        world environments, we conduct several zero-shot transfer images, and (ii) real-world stereo depth images captured by        experiments, on a Unitree B2W robot with a learning- the ZEDX camera during deployment (real-world images).             based locomotion policy from RIVR. The robot is mounted This analysis highlights their differences in latent space         with a ZEDX, front-facing stereo depth sensor, and NVIDIA distributions depending on the pretraining data source used        Jetson AGX Orin for onboard compute for the policy. The for the encoder.                                                   pretrained encoder and network are directly deployed on    Figure 15 illustrates a 2D principal components analysis        the robot without any fine-tuning with real-world data. (PCA) (Dunteman 1989) projection of the latent features.           For all the test, the robot receives no prior information The latent space distribution of RL-images shows a larger          about the environment, and receives only the stereo depth

Prepared using sagej.cls 16 ()

     (a) Real-world Stereo Depth Image              (b) Reconstruction with Large-scale Pretrain   (c) Reconstruction with RL-images Pretrain

Figure 14. Comparison of depth image reconstruction using features from encoders pretrained on different data sources. (a) Original input stereo depth image from real-world deployment, captured using the ZEDX camera. (b) Reconstructed depth image using features extracted from the encoder pretrained on large-scale synthetic data with noise augmentation. (c) Reconstructed depth image using features extracted from the encoder trained exclusively on simulated images collected during RL navigation training.

                                                                                                   1.15

                                                                                                                     0.82
                                                                                                                                     0.69




   (a) Latent space distribution with large-scale data pretraining
                                                                           Figure 16. Comparison of Mahalanobis distances between the
                                                                           latent features of real-world images and the latent feature
                                                                           distributions of RL images, using encoders pretrained on
                                                                           different sources: (i) RL images, (ii) large-scale synthetic data
                                                                           without noise augmentation, and (iii) large-scale synthetic data
                                                                           with noise augmentation.




                                                                           vt and ωt , projected gravity nt , and relative goal position pt
                                                                           with respect to the robot’s frame. The robot is controlled by
                                                                           a set of linear and angular velocities, referred to as action at ,
                                                                           which is the input to the locomotion policy.
                                                                              Firstly, we conduct an experiment in an office environ-
                                                                           ment, as shown in Figure 17, to compare the navigation
      (b) Latent space distribution with RL images pretraining             performance of our policy with the SRU memory module
                                                                           against a baseline model using a standard LSTM unit. In Figure 15. Comparison of latent space distributions: (a) The                   this experiment, the robot is tasked with navigating from feature distribution from the encoder pretrained on large-scale
                                                                           one side of the office to the other while avoiding obstacles. synthetic data effectively covers the distribution of real-world data, indicating better generalization. (b) The feature distribution           To evaluate the long-term spatial-temporal memorization from the encoder trained solely on simulated data collected                    capabilities of the SRU module, several passageways are during RL fails to cover the distribution of real-world depth                  temporarily blocked, requiring the robot to backtrack and images, posing challenges in generalizing to real-world data.                  search for alternative routes to reach the goal. Additionally,
                                                                           dynamic changes are introduced by unblocking certain areas
                                                                           during navigation to further assess the robustness of the
                                                                           SRU-enhanced policy. The policy with SRU demonstrates images from the front-facing camera as the exteroceptive                       the ability to explore dead ends, navigate around obsta- input. Additionally, a LiDAR-based state estimation and                        cles, and re-evaluate its path to adapt to dynamic changes localization module (Chen et al. 2023) provide the robot’s                     in the environment (Figure 17(a)). The robot successfully proprioceptive state, including linear and angular velocities                  reaches the goal, showcasing the effectiveness of utilizing

Prepared using sagej.cls F. Yang, P. Frivik, D. Hoeller, C. Wang, C. Cadena, and M. Hutter 17

                                                                    START
                                                GOAL                (GOAL)       A
                                       C                                                                              B

                                                B
                                 A                                                    C
         START
                                                                                                                    GOAL
                                                                                                                   (START)

  A                    B                   C                        A                 B                        C

                   (a) With SRU Memory Module                                             (a) Main Hall

                                                                                                                      START
                                                                        B             GOAL                            (GOAL)
                                                                                     (START)        C

                                                                                                A

                                                                    A                 B                        C
             (b) With Standard LSTM Memory Module

Figure 17. Comparison of navigation trajectories (orange) in an office environment. A, B, and C indicate areas that the robot traverses in sequence. (a) shows that the robot using the SRU (b) Terrace memory module successfully navigates through two dead ends and reaches the goal while adapting to changes in the environment (the blocker located in area A was initially set and later removed). (b) illustrates that the baseline model with a standard LSTM fails to reach the goal and repeatedly loops A START between the dead-end areas C and B. (GOAL) B GOAL (START) C the SRU memory module to learn robust spatial-temporal memorization from sequential observations. In contrast, the baseline model with a standard LSTM fails to reach the goal A B C and repeatedly loops between dead-end areas, as shown in Figure 17(b). To further evaluate the generalization and performance of the proposed network architecture in long-horizon (c) Forest navigation tasks, we conduct experiments in a variety of real-world environments—including an indoor campus main Figure 18. Evaluation of long-range mapless navigation in hall, outdoor terrace areas, and forest environments—using diverse real-world environments: (a) Main Hall of a university, the same pretrained encoder and navigation policy (see (b) outdoor terrace, and (c) forest environment with natural Figure 18). In these experiments, the robot is tasked with obstacles. In each scenario, the robot is tasked with two navigating to a designated goal and returning to its starting separate navigation goals (memory reset between goals), point.Note that the policy is designed to maintain episodic resulting in two trajectory segments (orange and blue). Labels memory only between the start and the goal and is reset A, B, and C mark key areas traversed by the robot. when a new goal is given. The results demonstrate that the policy generalizes effectively to unseen environments, handling diverse obstacles such as walls, stairs, vegetation, bushes, and trees, as well as navigating uneven terrains. Note that the maximum start-goal distance during RL Additionally, the policy adapts to larger-scale scenarios, training is 30 meters. The figures show the trajectories of the including extended goal distances of more than 70 meters robot successfully navigating through these environments, and traversing over 100 meters, as shown in Figure 18(c). with point clouds generated from the state estimation

Prepared using sagej.cls 18 ()

module (Chen et al. 2023) provided solely for visualization. which integrate an implicit spatial transformation operation Note that, due to the absence of a dedicated mapping module into standard GRU and LSTM structures. These SRUs or loop closure mechanism, the trajectories shown may are incorporated into a novel attention-based architecture, exhibit some drift and errors. trained end-to-end via reinforcement learning, achieving long-horizon navigation tasks with a single forward- facing depth camera. Our research further highlights 6 Limitations and Future Work the importance of regularization strategies in end-to-end While the proposed SRUs in this paper demonstrate signifi- reinforcement learning frameworks. Techniques such as cant improvements in spatial-temporal learning capabilities, temporally consistent dropout and deep mutual learning are their recurrent nature remains subject to exponential memory crucial for fully leveraging SRUs’ potential and preventing decay, which can limit their ability to retain global context early overfitting. Experiments demonstrate SRUs’ superior over extended sequences. As a result, the long-range nav- navigation performance compared to standard LSTM and igation capabilities presented in this paper are centered on GRU models. Moreover, we compare our implicit recurrent local, mapless navigation using egocentric sensing. In this memory-based approach with a state-of-the-art baseline context, ”long-range” refers to planning horizons that extend that utilizes explicit mapping and historical paths. Our well beyond the local perception radius (e.g., 10 m), enabling findings illustrate the superior effectiveness of recurrent rerouting from local dead ends without reliance on an explicit memory structures for long-range mapless navigation tasks. global map. Extending this approach to global-scale navi- Additionally, through ablation studies, we demonstrate the gation—spanning kilometers or hours—would likely require role of specific design choices, particularly the spatial additional mechanisms or architectural enhancements, such attention mechanism, in enhancing overall navigation as the integration or maintenance of a global map. performance. Lastly, we analyze and address the challenge of Furthermore, while SRUs enhance the network’s capacity sim-to-real transfer for stereo depth perception by integrating for implicit spatial memorization and improve long-range large-scale pretraining. This approach enables successful navigation performance, the precise characteristics of the zero-shot transfer and robust generalization across diverse information retained and utilized during end-to-end navi- real-world environments, including indoor, outdoor, and gation training remain unclear. This highlights a broader forest scenarios, underscoring the practical applicability and challenge in explainable artificial intelligence, where under- effectiveness of our proposed methodology. standing the internal representations and decision-making processes of neural networks continues to be an active area of research (Mi et al. 2024). Future work could explore inte- Acknowledgements grating SRUs with recent advancements in foundation pre- The authors acknowledge Nikita Rudin, Takahiro Miki, training, such as DINO (Caron et al. 2021), to combine their Jonas Frey, Pascal Roth, and Chong Zhang for their valuable strengths in scene understanding with the efficiency of recur- feedback and discussions. The authors also extend their rent structures, further enhancing the policy’s performance gratitude to Marco Trentini for his assistance in conducting in complex real-world environments. Investigating auxil- real-world experiments and testing the LiDAR-inertial state iary losses or additional regularization techniques to further estimation module. Additionally, the authors recognize the leverage the potential of spatial-temporal memorization in RIVR team for their technical support with the legged-wheel SRUs could also be beneficial. Additionally, extending the robot platform utilized in this research. application of SRUs to other domains, such as robotic manip- ulation and 3D reconstruction, could unlock new possibilities and advancements in spatial-temporal learning. In summary, Declaration of Conflicting Interests while SRUs are effective, they represent a simple yet practi- The authors declared no potential conflicts of interest with cal solution—not necessarily unique or optimal—that proves respect to the research, authorship, and/or publication of this successful in our end-to-end mapless navigation context. article. More importantly, this work aims to highlight the potential of implicit spatial memory mechanisms in addressing complex navigation challenges while identifying opportunities for References further exploration and optimization in both methodology and application domains. Barron JT and Malik J (2013a) Intrinsic scene properties from a single rgb-d image. CVPR . Barron JT and Malik J (2013b) Intrinsic scene properties from a 7 Conclusion single rgb-d image. In: Proceedings of the IEEE Conference In this study, we identify and address a limitation of on Computer Vision and Pattern Recognition. pp. 17–24. existing recurrent neural network architectures in the context Bhattacharya A, Rao N, Parikh D, Kunapuli P, Wu Y, Tao Y, of navigation: while RNNs excel at modeling temporal Matni N and Kumar V (2024) Vision transformers for end-to- sequences, they are not inherently designed for spatial end vision-based quadrotor obstacle avoidance. arXiv preprint memorization or transforming observations from varying arXiv:2405.10391 . perspectives. This limitation makes them less effective in Bohg J, Romero J, Herzog A and Schaal S (2014a) Robot arm pose building the spatial representations required for mapless estimation through pixel-wise part classification. ICRA . navigation using egocentric perception. To address this, Bohg J, Romero J, Herzog A and Schaal S (2014b) Robot arm we propose Spatially-Enhanced Recurrent Units (SRUs), pose estimation through pixel-wise part classification. In: 2014

Prepared using sagej.cls F. Yang, P. Frivik, D. Hoeller, C. Wang, C. Cadena, and M. Hutter 19

 IEEE International Conference on Robotics and Automation           Gu A, Goel K and Ré C (2021) Efficiently modeling long sequences
 (ICRA). IEEE, pp. 3143–3150.                                           with structured state spaces. arXiv preprint arXiv:2111.00396 Bohlin R and Kavraki LE (2000) Path planning using lazy                     .
 prm. In: Proceedings 2000 ICRA. Millennium conference.             Gu A, Gulcehre C, Paine T, Hoffman M and Pascanu R (2020b)
 IEEE international conference on robotics and automation.              Improving the gating mechanism of recurrent neural networks.
 Symposia proceedings (Cat. No. 00CH37065), volume 1.                   In: International conference on machine learning. PMLR, pp.
 IEEE, pp. 521–528.                                                     3800–3809. Bojarski M, Del Testa D, Dworakowski D, Firner B, Flepp B,              Handa A, Whelan T, McDonald J and Davison AJ (2014a) A
 Goyal P, Jackel LD, Monfort M, Muller U, Zhang J et al.                benchmark for rgb-d visual odometry, 3d reconstruction and
 (2016) End to end learning for self-driving cars. arXiv preprint       slam. ICRA .
 arXiv:1604.07316 .                                                 Handa A, Whelan T, McDonald J and Davison AJ (2014b) A Caron M, Touvron H, Misra I, Jégou H, Mairal J, Bojanowski P and           benchmark for rgb-d visual odometry, 3d reconstruction and
 Joulin A (2021) Emerging properties in self-supervised vision          slam. In: 2014 IEEE international conference on Robotics and
 transformers. In: Proceedings of the IEEE/CVF international            automation (ICRA). IEEE, pp. 1524–1531.
 conference on computer vision. pp. 9650–9660.                      Hart P, Nilsson N and Raphael B (1968) A formal basis for Cèsar-Tondreau B, Warnell G, Stump E, Kochersberger K                      the heuristic determination of minimum cost paths. IEEE
 and Waytowich NR (2021) Improving autonomous robotic                   Transactions on Systems Science and Cybernetics 4(2): 100–
 navigation using imitation learning. Frontiers in Robotics and         107.
 AI 8: 627730.                                                      Hausknecht M and Wagener N (2022) Consistent dropout for Chen K, Nemiroff R and Lopez BT (2023) Direct lidar-                        policy gradient reinforcement learning.        arXiv preprint
 inertial odometry: Lightweight lio with continuous-time                arXiv:2202.11818 .
 motion correction. 2023 IEEE International Conference on           He T, Zhang C, Xiao W, He G, Liu C and Shi G (2024)
 Robotics and Automation (ICRA) : 3983–3989DOI:10.1109/                 Agile But Safe: Learning Collision-Free High-Speed Legged
 ICRA48891.2023.10160508.                                               Locomotion.      In: Proceedings of Robotics: Science and Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares                 Systems. Delft, Netherlands. DOI:10.15607/RSS.2024.XX.
 F, Schwenk H and Bengio Y (2014) Learning phrase                       059.
 representations using rnn encoder-decoder for statistical          Hochreiter S and Schmidhuber J (1997) Long short-term memory.
 machine translation. arXiv preprint arXiv:1406.1078 .                  Neural computation 9(8): 1735–1780. Choi J, Park K, Kim M and Seok S (2019) Deep reinforcement              Hoeller D, Wellhausen L, Farshidian F and Hutter M (2021)
 learning of navigation in a complex and crowded environment            Learning a state representation and navigation in cluttered and
 with a limited field of view. In: 2019 International Conference        dynamic environments. IEEE Robotics and Automation Letters
 on Robotics and Automation (ICRA). IEEE, pp. 5993–6000.                6(3): 5081–5088. Cimurs R, Suh IH and Lee JH (2021) Goal-driven autonomous               Huang W, Zhou Y, He X and Lv C (2023) Goal-guided transformer-
 exploration through deep reinforcement learning.           IEEE        enabled reinforcement learning for efficient autonomous
 Robotics and Automation Letters 7(2): 730–737.                         navigation. IEEE Transactions on Intelligent Transportation Dijkstra EW (1959) A note on two problems in connexion with                 Systems 25(2): 1832–1845.
 graphs. Numerische mathematik 1(1): 269–271.                       Hutter M, Gehring C, Jud D, Lauber A, Bellicoso CD, Tsounis V, Dobson A and Bekris KE (2014) Sparse roadmap spanners                       Hwangbo J, Bodie K, Fankhauser P, Bloesch M et al. (2016)
 for asymptotically near-optimal motion planning.             The       Anymal-a highly mobile and dynamic quadrupedal robot. In:
 International Journal of Robotics Research 33(1): 18–47.               2016 IEEE/RSJ international conference on intelligent robots Dozat T (2016) Incorporating nesterov momentum into adam .                  and systems (IROS). IEEE, pp. 38–44. Duarte FF, Lau N, Pereira A and Reis LP (2023) Lstm, convlstm,          Karaman S and Frazzoli E (2011) Sampling-based algorithms for
 mdn-rnn and gridlstm memory-based deep reinforcement                   optimal motion planning. The international journal of robotics
 learning. In: ICAART (2). pp. 169–179.                                 research 30(7): 846–894. Dunteman GH (1989) Principal components analysis, volume 69.            Kareer S, Yokoyama N, Batra D, Ha S and Truong J (2023) Vinl:
 Sage.                                                                  Visual navigation and locomotion over obstacles. In: 2023 Francis A, Faust A, Chiang HTL, Hsu J, Kew JC, Fiser M and Lee              IEEE International Conference on Robotics and Automation
 TWE (2020) Long-range indoor navigation with prm-rl. IEEE              (ICRA). IEEE, pp. 2018–2024.
 Transactions on Robotics 36(4): 1115–1134.                         Karnan H, Warnell G, Xiao X and Stone P (2022) Voila: Fu Z, Kumar A, Agarwal A, Qi H, Malik J and Pathak D (2022)                 Visual-observation-only imitation learning for autonomous
 Coupling vision and proprioception for navigation of legged            navigation. In: 2022 International Conference on Robotics and
 robots. In: Proceedings of the IEEE/CVF Conference on                  Automation (ICRA). IEEE, pp. 2497–2503.
 Computer Vision and Pattern Recognition. pp. 17273–17283.          Kavraki LE, Svestka P, Latombe JC and Overmars MH (1996) Gu A and Dao T (2023) Mamba: Linear-time sequence modeling                  Probabilistic roadmaps for path planning in high-dimensional
 with selective state spaces. arXiv preprint arXiv:2312.00752 .         configuration spaces. IEEE transactions on Robotics and Gu A, Dao T, Ermon S, Rudra A and Ré C (2020a) Hippo:                      Automation 12(4): 566–580.
 Recurrent memory with optimal polynomial projections.              Kuffner JJ and LaValle SM (2000) Rrt-connect: An efficient
 Advances in neural information processing systems 33: 1474–            approach to single-query path planning. In: Proceedings 2000
 1487.                                                                  ICRA. Millennium conference. IEEE international conference
                                                                        on robotics and automation. Symposia proceedings (Cat. No.

Prepared using sagej.cls 20 ()

 00CH37065), volume 2. IEEE, pp. 995–1001.                               arXiv preprint arXiv:1710.06542 . LaValle SM, Kuffner JJ, Donald B et al. (2001) Rapidly-exploring         Radosavovic I, Kosaraju RP, Girshick R, He K and Dollár P
 random trees: Progress and prospects.          Algorithmic and          (2020) Designing network design spaces. In: Proceedings
 computational robotics: new directions 5: 293–308.                      of the IEEE/CVF conference on computer vision and pattern Lee J, Bjelonic M, Reske A, Wellhausen L, Miki T and Hutter M                recognition. pp. 10428–10436.
 (2024) Learning robust autonomous navigation and locomotion         Rudin N, Hoeller D, Bjelonic M and Hutter M (2022) Advanced
 for wheeled-legged robots. Science Robotics 9(89): eadi9641.            skills by learning locomotion and local navigation end-to-end. Lee K, Lee K, Lee H and Shin J (2018) A simple unified framework             In: 2022 IEEE/RSJ International Conference on Intelligent
 for detecting out-of-distribution samples and adversarial               Robots and Systems (IROS). IEEE, pp. 2497–2503.
 attacks. Advances in neural information processing systems          Ruiz-Serra J, White J, Petrie S, Kameneva T and McCarthy C
 31.                                                                     (2022) Towards self-attention based visual navigation in the Lin TY, Dollár P, Girshick R, He K, Hariharan B and Belongie                real world. arXiv preprint arXiv:2209.07043 .
 S (2017) Feature pyramid networks for object detection. In:         Savinov N, Dosovitskiy A and Koltun V (2018) Semi-
 Proceedings of the IEEE conference on computer vision and               parametric topological memory for navigation. arXiv preprint
 pattern recognition. pp. 2117–2125.                                     arXiv:1803.00653 . Loquercio A, Kaufmann E, Ranftl R, Müller M, Koltun V and               Schulman J, Wolski F, Dhariwal P, Radford A and Klimov O
 Scaramuzza D (2021) Learning high-speed flight in the wild.             (2017) Proximal policy optimization algorithms. arXiv preprint
 Science Robotics 6(59): eabg5810.                                       arXiv:1707.06347 . Ma X, Dai X, Bai Y, Wang Y and Fu Y (2024) Rewrite the stars. In:        Shah D, Sridhar A, Bhorkar A, Hirose N and Levine S (2022) Gnm:
 Proceedings of the IEEE/CVF Conference on Computer Vision               A general navigation model to drive any robot. arXiv preprint
 and Pattern Recognition. pp. 5694–5703.                                 arXiv:2210.03370 . Matthis JS, Yates JL and Hayhoe MM (2018) Gaze and the control           Shah D, Sridhar A, Dashora N, Stachowicz K, Black K, Hirose
 of foot placement when walking in natural terrain. Current              N and Levine S (2023) ViNT: A foundation model for visual
 Biology 28(8): 1224–1233.                                               navigation. In: 7th Annual Conference on Robot Learning. Mescheder L, Oechsle M, Niemeyer M, Nowozin S and Geiger                 Shi H, Shi L, Xu M and Hwang KS (2019) End-to-end navigation
 A (2019) Occupancy networks: Learning 3d reconstruction in              strategy with deep reinforcement learning for mobile robots.
 function space. In: Proceedings of the IEEE/CVF conference              IEEE Transactions on Industrial Informatics 16(4): 2393–
 on computer vision and pattern recognition. pp. 4460–4470.              2402. Mi JX, Jiang X, Luo L and Gao Y (2024) Toward explainable                Siami-Namini S, Tavakoli N and Namin AS (2019) The
 artificial intelligence: A survey and overview on their intrinsic       performance of lstm and bilstm in forecasting time series. In:
 properties. Neurocomputing 563: 126919.                                 2019 IEEE International conference on big data (Big Data). Miki T, Lee J, Hwangbo J, Wellhausen L, Koltun V and                         IEEE, pp. 3285–3292.
 Hutter M (2022a) Learning robust perceptive locomotion for          Surmann H, Jestel C, Marchel R, Musberg F, Elhadj H and Ardani
 quadrupedal robots in the wild. Science robotics 7(62):                 M (2020) Deep reinforcement learning for real autonomous
 eabk2822.                                                               mobile robot navigation in indoor environments. arXiv preprint Miki T, Wellhausen L, Grandia R, Jenelten F, Homberger T                     arXiv:2005.13857 .
 and Hutter M (2022b) Elevation mapping for locomotion               Sutskever I, Vinyals O and Le QV (2014) Sequence to sequence
 and navigation using gpu. In: 2022 IEEE/RSJ International               learning with neural networks. Advances in neural information
 Conference on Intelligent Robots and Systems (IROS). IEEE,              processing systems 27.
 pp. 2273–2280.                                                      Truong J, Yarats D, Li T, Meier F, Chernova S, Batra D and Rai Mittal M, Yu C, Yu Q, Liu J, Rudin N, Hoeller D, Yuan JL, Singh              A (2021) Learning navigation skills for legged robots with
 R, Guo Y, Mazhar H, Mandlekar A, Babich B, State G, Hutter              learned robot embeddings. In: 2021 IEEE/RSJ International
 M and Garg A (2023) Orbit: A unified simulation framework               Conference on Intelligent Robots and Systems (IROS). IEEE,
 for interactive robot learning environments. IEEE Robotics and          pp. 484–491.
 Automation Letters 8(6): 3740–3747. DOI:10.1109/LRA.2023.           Wang W, Zhu D, Wang X, Hu Y, Qiu Y, Wang C, Hu Y, Kapoor A
 3270034.                                                                and Scherer S (2020) Tartanair: A dataset to push the limits of Mohajerin N and Rohani M (2019) Multi-step prediction of                     visual slam. In: 2020 IEEE/RSJ International Conference on
 occupancy grid maps with recurrent neural networks. In:                 Intelligent Robots and Systems (IROS). IEEE, pp. 4909–4916.
 Proceedings of the IEEE/CVF Conference on Computer Vision           Wang Y, Huang X, Sun X, Yan M, Xing S, Tu Z and Li J
 and Pattern Recognition. pp. 10600–10608.                               (2025) Uniocc: A unified benchmark for occupancy forecasting Ortiz-Haro J, Hönig W, Hartmann VN and Toussaint M (2024) idb-              and prediction in autonomous driving.           arXiv preprint
 a*: Iterative search and optimization for optimal kinodynamic           arXiv:2503.24381 .
 motion planning. IEEE Transactions on Robotics .                    Webb DJ and Berg Jvd (2012) Kinodynamic rrt*: Optimal motion Pfeiffer M, Schaeuble M, Nieto J, Siegwart R and Cadena C (2017)             planning for systems with linear differential constraints. arXiv
 From perception to decision: A data-driven approach to end-to-          preprint arXiv:1205.5088 .
 end motion planning for autonomous ground robots. In: IEEE          Weerakoon K, Sathyamoorthy AJ, Patel U and Manocha D (2022)
 International Conference on Robotics and Automation (ICRA).             Terp: Reliable planning in uneven outdoor environments
 IEEE, p. 1527–1533.                                                     using deep reinforcement learning. In: 2022 International Pinto L, Andrychowicz M, Welinder P, Zaremba W and Abbeel P                  Conference on Robotics and Automation (ICRA). IEEE, pp.
 (2017) Asymmetric actor critic for image-based robot learning.          9447–9453.

Prepared using sagej.cls F. Yang, P. Frivik, D. Hoeller, C. Wang, C. Cadena, and M. Hutter 21

Wei Y, Zhao L, Zheng W, Zhu Z, Zhou J and Lu J (2023) Appendix Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In: Proceedings of the IEEE/CVF A Training Details for Spatial-Temporal International Conference on Computer Vision. pp. 21729– Memorization Task 21740. To evaluate the spatial-temporal memorization capabilities of Wellhausen L and Hutter M (2023) Artplanner: Robust legged robot different recurrent neural network architectures, we conduct navigation in the field. Field Robotics 3: 413–434. a case study using an abstract version of the spatial-temporal Wijmans E, Kadian A, Morcos A, Lee S, Essa I, Parikh D, memorization task shown in Figure 1 and Figure 2. Below, Savva M and Batra D (2019) Dd-ppo: Learning near-perfect we provide the training details for this task. The task pointgoal navigators from 2.5 billion frames. arXiv preprint simulates a scenario where, at each time step t, the recurrent arXiv:1911.00357 . agent receives the following inputs: Wijmans E, Savva M, Essa I, Lee S, Morcos AS and Batra D (2023) Emergence of maps in the memories of blind navigation agents. • The coordinates of an observed landmark, lti , AI Matters 9(2): 8–14. represented in the robot’s current frame. • A binary categorical label, ci , associated with the Wu K, Wang H, Esfahani MA and Yuan S (2021) Learn to navigate landmark. autonomously through deep reinforcement learning. IEEE • The ego-centric motion transformation matrix, Mtt−1 , Transactions on Industrial Electronics 69(5): 5342–5352. representing the transformation from the previous Xie Z, Cao J, Zhang Q, Zhang J, Wang C and Xu R frame to the current frame. (2025) The meta-representation hypothesis. arXiv preprint arXiv:2501.02481 . These inputs are concatenated into a 1D vector, passed Yang F, Cao C, Zhu H, Oh J and Zhang J (2022a) Far planner: Fast, through a Multi-Layer Perceptron (MLP) layer, and then fed attemptable route planner using dynamic visibility update. In: into the recurrent unit. After T steps, the output MLP layer 2022 ieee/rsj international conference on intelligent robots and is tasked with: systems (iros). IEEE, pp. 9–16. Yang R, Zhang M, Hansen N, Xu H and Wang X (2022b) Learning • Regressing all observed landmark coordinates {lTi , i = vision-guided quadrupedal locomotion end-to-end with cross- 1, 2, . . . , T } with respect to the robot’s frame at the modal transformers. In: International Conference on Learning final time step T (spatial memorization task). Representations. • Predicting all observed binary labels {ci , i = Zeng KH, Zhang Z, Ehsani K, Hendrix R, Salvador J, Herrasti 1, 2, . . . , T } associated with the landmarks, which A, Girshick R, Kembhavi A and Weihs L (2024) Poliformer: are independent of the observation frame and depend Scaling on-policy rl with transformers results in masterful only on the sequential order of observations (temporal navigators. arXiv preprint arXiv:2406.20083 . memorization task). Zhang C, Jin J, Frey J, Rudin N, Mattamala M, Cadena C and Figure A.1 illustrates the network structure used for training. Hutter M (2024) Resilient legged local navigation: Learning The spatial task is optimized using the Mean Squared Error to traverse with compromised perception end-to-end. In: 2024 (MSE) loss, while the temporal task is optimized using the IEEE International Conference on Robotics and Automation Binary Cross-Entropy (BCE) loss. The network is trained (ICRA). IEEE, pp. 34–41. using the Nesterov Momentum Adam optimizer (Dozat Zhu Y, Mottaghi R, Kolve E, Lim JJ, Gupta A, Fei-Fei L and 2016) with an initial learning rate of 2 × 10−3 , which is Farhadi A (2017) Target-driven visual navigation in indoor reduced to 4 × 10−4 after 800 epochs, continuing until 1000 scenes using deep reinforcement learning. In: 2017 IEEE epochs are completed. international conference on robotics and automation (ICRA). To ensure the recurrent network does not overfit IEEE, pp. 3357–3364. or memorize specific patterns of observed landmarks and associated ego-motion trajectories, the following randomization is applied:

                                                                       • The ego-motion Mtt−1 for each step is uniformly
                                                                         sampled, with translation in the range [−2, 2] meters
                                                                         and orientation in the range [−π, π] radians.
                                                                       • The observed landmark coordinates {lTi , i =
                                                                         1, 2, . . . , T } are uniformly sampled within the
                                                                         range [−5, 5] meters relative to the observation frame,
                                                                         and the binary categorical labels {ci , i = 1, 2, . . . , T }
                                                                         are uniformly sampled from the set {0, 1}.


                                                                   B    Parallelizable Stereo Depth Perception
                                                                        Noise Implementation
                                                                   The following pseudocode B.1 injects synthetic stereo depth
                                                                   images with edge noise, filling noise, and round noise to
                                                                   simulate realistic sensor artifacts. This implementation is

Prepared using sagej.cls 22 ()

                                                                   C    Training Details for Navigation with
                                                                        Reinforcement Learning
                                                                   This section provides the training and parameter details
                                                                   for the end-to-end navigation task using reinforcement
                                                                   learning. We utilize an asymmetric actor-critic setup, training
                                                                   with the NVIDIA IsaacLab simulation framework. The
                                                                   actor processes noisy observations, including depth input Figure A.1. Network architecture for the spatial and temporal          with noise augmentation, using the spatial attention-based memorization task. At each step t, the agent receives landmark coordinates lti , a binary label ci , and ego-centric motion Mtt−1 .
                                                                   recurrent structure. In contrast, the critic has access to These inputs are concatenated, passed through an MLP layer,            additional 360-degree height scan information alongside the and processed by a recurrent unit. After T steps, the MLP head         depth input. These inputs are processed separately through is tasked with recalling, from the final hidden state of the           two attention layers, which are then concatenated before recurrent unit, all observed landmark positions {lTi } with            being passed to the SRU unit. Unlike the actor, the critic does respect to the final frame T (spatial task) and sequentially           not use noise-augmented observations. To handle the height predicting the associated labels {ci } (temporal task).                scan input for the critic, we pretrain a height scan encoder
                                                                   with the same architecture as the depth encoder, using height
                                                                   scan images collected from the RL simulation environments. used both during pretraining on synthetic depth data and               To improve the network’s generalization for handling large during online reinforcement learning to better mimic real-             distance values in long-range navigation, we convert the goal world sensor imperfections.                                            position pt ∈ R3 into a unit directional vector and a log-
                                                                   transformed distance value. This transformation allows the   Inputs: The function takes depth (RB×H×W ), a batch of               network to generalize better to varying distances. raw depth image tensors, where B indicates the batch size,                To enhance sim-to-real transfer, we introduce randomiza- and H and W represent the spatial dimensions of the depth              tion noise to the actor’s observations. The noise parameters image, specifically its height and width.                              applied during training are summarized in Table C.1. The   Parameters: The parameters include f (focal length                   critic, in contrast, receives clean observations without any of the camera), b (baseline between stereo cameras),                   noise or delay, ensuring stable and accurate value estimation filt size (local window size for filtering), τ min                     during training. These randomization strategies, combined and τ max (edge noise threshold range), ρ min and ρ max (pseudo-stereo match probability range), and                            Observation Parameter         Noise Range (U) invalid disp (value to mark dropped disparities).                             Linear Velocity (vt )             ±0.2 m/s
                                                                          Angular Velocity (ωt )           ±0.1 rad/s
                                                                          Projected Gravity (nt )            ±0.1 Algorithm B.1 Stereo Depth Noise Algorithm                                    Goal Position (pt )           ±0.5 m, ±0.1 rad Require: depth ∈ RB×H×W                                                       Observation Delay              0 ms to 600 ms   (Kmean , Ksub ) ← C OMPUTE K ERNELS(f ilt size)
        f ·b                                                       Table C.1. Noise parameters (uniform distribution U ) applied to   disp ← depth                                                         the actor’s observations during RL training for navigation.   f iltered disp ← F ILTER D ISP(disp, Kmean , Ksub )
                      f ·b   f iltered depth ← f iltered  disp   return f iltered depth                                               with the pretraining of encoders and the use of spatial
                                                                   attention-based recurrent structures, enable robust training    function C OMPUTE K ERNELS(s)                                       and effective sim-to-real transfer for the navigation policy.    Compute kernel Kmean    Compute kernel Ksub    return (Kmean , Ksub )

function F ILTER D ISP(disp, Kmean , Ksub ) ρ ∼ Uniform(ρ min, ρ max) R ∼ BernoulliMask(shape = disp, p = ρ) m ← Conv2D(disp, Kmean ) τ ∼ Uniform(τ min, τ max) M ← (|disp − m| < τ ) ∧ R v ← Quantize(disp) masked ← if M then v else invalid disp num ← Conv2D(masked, Ksub ) den ← Conv2D(M, Ksub ) + ϵ f illed ← if den > 0 then num den else masked return f illed

Prepared using sagej.cls