Optimisation of Resource Allocation in Heterogeneous Wireless Networks Using Deep Reinforcement Learning

1 African Institute for Mathematical Sciences, 2 University of Cape Town, 3 NiTheCS, Stellenbosch University, 4 INRS, Montreal, 5 Vodacom, South Africa, 6 EEE, Stellenbosch University

Abstract

Dynamic resource allocation in Open RAN (O-RAN) heterogeneous networks (HetNets) presents a complex optimisation challenge under varying user loads. We propose a Near-Real-Time RAN Intelligent Controller (Near-RT RIC) xApp utilising Deep Reinforcement Learning (DRL) to jointly optimise transmit power, bandwidth slicing, and user scheduling. Leveraging real-world network topologies, we benchmark Proximal Policy Optimisation (PPO) and Twin Delayed Deep Deterministic Policy Gradient (TD3) against standard heuristics. Our results demonstrate that the PPO-based xApp achieves a superior trade-off, reducing network energy consumption by up to \(70\%\) in dense scenarios while improving user fairness by over \(30\%\) compared to throughput-greedy baselines. These findings validate the feasibility of centralised, energy-aware AI orchestration in future 6G architectures.

System Model and Problem Formulation

We consider a downlink HetNet operating within an O-RAN architecture. The network consists of a set of base stations (BSs), \(B = \{1, \dots, N_B\}\), comprising \(N_M\) macro BSs and \(N_S\) micro BSs, which serve a set of user equipments (UEs), \(U = \{1, \dots, N_U\}\), distributed stochastically within the coverage area. The system is controlled by a centralised Near-RT RIC that hosts an xApp responsible for optimising radio resources at discrete time intervals \(t\).
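
To fix the notation in code, the following is a minimal configuration sketch written as a Python dataclass. The field names and the decision-interval default are illustrative assumptions (the O-RAN Near-RT control loop operates on a 10 ms to 1 s timescale); only the BS and user counts mirror the Cape Town deployment described below.

```python
from dataclasses import dataclass

@dataclass
class HetNetConfig:
    """Scenario parameters for the downlink O-RAN HetNet (illustrative names)."""
    n_macro: int = 3             # N_M: macro base stations
    n_micro: int = 10            # N_S: micro base stations
    n_users: int = 50            # N_U: user equipments (UEs)
    ric_interval_s: float = 0.5  # assumed xApp decision period t (Near-RT loop: 10 ms - 1 s)

    @property
    def n_bs(self) -> int:
        """N_B: total number of base stations."""
        return self.n_macro + self.n_micro
```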

We formulate the problem as a Markov decision process (MDP) \((\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})\). The agent (the xApp) interacts with the environment (the E2 nodes) as follows:

\textbf{State Space \(\mathcal{S}\)}: The state \(s_t\) aggregates the global network observables available at the RIC: \begin{equation} s_t = \left\{\mathbf{p}_{t-1}, \{\mathbf{I}_{u}^{\mathrm{est}}\}_{u \in U}, \mathbf{L}_{\mathrm{geo}}\right\}, \end{equation} where \(\mathbf{p}_{t-1}\) is the power allocation from the previous interval, \(\mathbf{I}_{u}^{\mathrm{est}}\) is the interference estimated for UE \(u\) from its channel quality indicator (CQI) reports, and \(\mathbf{L}_{\mathrm{geo}}\) encodes the fixed deployment geometry.
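
As a concrete illustration, the sketch below assembles \(s_t\) as a single flat observation vector, a common encoding for DRL agents. The array shapes and the CQI-to-interference mapping are hypothetical placeholders rather than the paper's implementation; CQI is only assumed to be reported on the standard 4-bit (0-15) scale.

```python
import numpy as np

def build_state(prev_power: np.ndarray,   # p_{t-1}: previous power allocation, shape (N_B,)
                cqi_reports: np.ndarray,  # latest CQI per UE, shape (N_U,), values in 0..15
                bs_xy: np.ndarray,        # fixed BS coordinates, shape (N_B, 2)
                ue_xy: np.ndarray         # UE coordinates, shape (N_U, 2)
                ) -> np.ndarray:
    """Assemble s_t = {p_{t-1}, {I_u^est}_{u in U}, L_geo} as one flat vector."""
    # I_u^est: crude placeholder proxy -- a low CQI is read as high interference.
    i_est = 1.0 - cqi_reports / 15.0
    # L_geo: fixed topology geometry, flattened (it does not change with t).
    l_geo = np.concatenate([bs_xy.ravel(), ue_xy.ravel()])
    return np.concatenate([prev_power, i_est, l_geo]).astype(np.float32)
```

A flat vector keeps the observation space directly compatible with standard actor-critic implementations such as PPO and TD3.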

We instantiate BS locations from real BS location data in Cape Town, provided by a local telecom operator, and place \(50\) users within the deployment polygon. The dataset includes three macro BSs and ten micro BSs. Colours in all figures follow the evaluation convention: macro BSs (red), micro BSs (blue), users (yellow).
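
The snippet below is a minimal sketch of this setup. Since the operator data is not public, placeholder coordinates and a square deployment polygon stand in for the real Cape Town layout; users are rejection-sampled inside the polygon, and the plot follows the colour convention above.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.path import Path

rng = np.random.default_rng(0)

# Placeholder geometry standing in for the (non-public) operator data.
macro_xy = rng.uniform(0.0, 2.0, size=(3, 2))   # 3 macro BSs
micro_xy = rng.uniform(0.0, 2.0, size=(10, 2))  # 10 micro BSs
polygon = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]])  # deployment area

def sample_users_in_polygon(polygon: np.ndarray, n_users: int) -> np.ndarray:
    """Rejection-sample n_users points uniformly inside a (V, 2) polygon."""
    path = Path(polygon)
    lo, hi = polygon.min(axis=0), polygon.max(axis=0)
    users = []
    while len(users) < n_users:
        pt = lo + rng.random(2) * (hi - lo)
        if path.contains_point(pt):
            users.append(pt)
    return np.array(users)

users_xy = sample_users_in_polygon(polygon, n_users=50)

plt.scatter(*macro_xy.T, c="red", marker="^", s=120, label="Macro BS")
plt.scatter(*micro_xy.T, c="blue", marker="s", s=60, label="Micro BS")
plt.scatter(*users_xy.T, c="yellow", edgecolors="black", s=25, label="Users")
plt.legend()
plt.axis("equal")
plt.show()
```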

Results

[Figure carousel, six panels: Dense Urban; Hotspot Scenario; Mixed Scenario; Sparse Suburban Scenario; Time Complexity; Mean reward across seeds.]

Related Works

This project builds on a substantial body of excellent prior work; full citations and discussion can be found in the paper.

BibTeX

@inproceedings{wirelessoptim2026,
  author    = {Giwa, Oluwaseyi and Shock, Jonathan and Du Toit, Jaco and Awodumila, Tobi},
  title     = {Optimisation of Resource Allocation in Heterogeneous Wireless Networks Using Deep Reinforcement Learning},
  booktitle = {European Conference on Networks and Communications (EuCNC) \& 6G Summit},
  year      = {2026}
}