Optimisation of Resource Allocation in Heterogeneous Wireless Networks Using Deep Reinforcement Learning

1 African Institute for Mathematical Sciences, 2 University of Cape Town, 3 NiTheCS, Stellenbosch University, 4 INRS, Montreal, 5 Vodacom, South Africa, 6 EEE, Stellenbosch University

Abstract

Dynamic resource allocation in Open RAN (O-RAN) heterogeneous networks (HetNets) presents a complex optimisation challenge under varying user loads. We propose a Near-Real-Time RAN Intelligent Controller (Near-RT RIC) xApp utilising Deep Reinforcement Learning (DRL) to jointly optimise transmit power, bandwidth slicing, and user scheduling. Leveraging real-world network topologies, we benchmark Proximal Policy Optimisation (PPO) and Twin Delayed Deep Deterministic Policy Gradient (TD3) against standard heuristics. Our results demonstrate that the PPO-based xApp achieves a superior trade-off, reducing network energy consumption by up to \(70\%\) in dense scenarios while improving user fairness by over \(30\%\) compared to throughput-greedy baselines. These findings validate the feasibility of centralised, energy-aware AI orchestration in future 6G architectures.

System Model and Problem Formulation

We consider a downlink HetNet operating within an O-RAN architecture. The network consists of a set of base stations (BSs), \(B = \{1, \dots, N_B\}\), comprising \(N_M\) macro BSs and \(N_S\) micro BSs, which serve a set of user equipments (UEs), \(U = \{1, \dots, N_U\}\), distributed stochastically within the coverage area. The system is controlled by a centralised Near-RT RIC that hosts an xApp responsible for optimising radio resources at discrete time intervals \(t\).
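
To fix the notation in code, the following is a minimal configuration sketch written as a Python dataclass. The field names and the decision-interval default are illustrative assumptions (the O-RAN Near-RT control loop operates on a 10 ms to 1 s timescale); only the BS and user counts mirror the Cape Town deployment described below.

```python
from dataclasses import dataclass

@dataclass
class HetNetConfig:
    """Scenario parameters for the downlink O-RAN HetNet (illustrative names)."""
    n_macro: int = 3             # N_M: macro base stations
    n_micro: int = 10            # N_S: micro base stations
    n_users: int = 50            # N_U: user equipments (UEs)
    ric_interval_s: float = 0.5  # assumed xApp decision period t (Near-RT loop: 10 ms - 1 s)

    @property
    def n_bs(self) -> int:
        """N_B: total number of base stations."""
        return self.n_macro + self.n_micro
```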

We formulate the problem as a Markov decision process (MDP) \((\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})\). The agent (the xApp) interacts with the environment (the E2 nodes) as follows:

\textbf{State Space \(\mathcal{S}\)}: The state \(s_t\) aggregates the global network observables available at the RIC: \begin{equation} s_t = \left\{\mathbf{p}_{t-1}, \{\mathbf{I}_{u}^{\mathrm{est}}\}_{u \in U}, \mathbf{L}_{\mathrm{geo}}\right\}, \end{equation} where \(\mathbf{p}_{t-1}\) is the power allocation from the previous interval, \(\mathbf{I}_{u}^{\mathrm{est}}\) is the interference estimated for UE \(u\) from its channel quality indicator (CQI) reports, and \(\mathbf{L}_{\mathrm{geo}}\) encodes the fixed deployment geometry.
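
As a concrete illustration, the sketch below assembles \(s_t\) as a single flat observation vector, a common encoding for DRL agents. The array shapes and the CQI-to-interference mapping are hypothetical placeholders rather than the paper's implementation; CQI is only assumed to be reported on the standard 4-bit (0-15) scale.

```python
import numpy as np

def build_state(prev_power: np.ndarray,   # p_{t-1}: previous power allocation, shape (N_B,)
                cqi_reports: np.ndarray,  # latest CQI per UE, shape (N_U,), values in 0..15
                bs_xy: np.ndarray,        # fixed BS coordinates, shape (N_B, 2)
                ue_xy: np.ndarray         # UE coordinates, shape (N_U, 2)
                ) -> np.ndarray:
    """Assemble s_t = {p_{t-1}, {I_u^est}_{u in U}, L_geo} as one flat vector."""
    # I_u^est: crude placeholder proxy -- a low CQI is read as high interference.
    i_est = 1.0 - cqi_reports / 15.0
    # L_geo: fixed topology geometry, flattened (it does not change with t).
    l_geo = np.concatenate([bs_xy.ravel(), ue_xy.ravel()])
    return np.concatenate([prev_power, i_est, l_geo]).astype(np.float32)
```

A flat vector keeps the observation space directly compatible with standard actor-critic implementations such as PPO and TD3.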

We instantiate BS locations from real BS location data in Cape Town, provided by a local telecom operator, and place \(50\) users within the deployment polygon. The dataset includes three macro BSs and ten micro BSs. Colours in all figures follow the evaluation convention: macro BSs (red), micro BSs (blue), users (yellow).
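
The snippet below is a minimal sketch of this setup. Since the operator data is not public, placeholder coordinates and a square deployment polygon stand in for the real Cape Town layout; users are rejection-sampled inside the polygon, and the plot follows the colour convention above.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.path import Path

rng = np.random.default_rng(0)

# Placeholder geometry standing in for the (non-public) operator data.
macro_xy = rng.uniform(0.0, 2.0, size=(3, 2))   # 3 macro BSs
micro_xy = rng.uniform(0.0, 2.0, size=(10, 2))  # 10 micro BSs
polygon = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]])  # deployment area

def sample_users_in_polygon(polygon: np.ndarray, n_users: int) -> np.ndarray:
    """Rejection-sample n_users points uniformly inside a (V, 2) polygon."""
    path = Path(polygon)
    lo, hi = polygon.min(axis=0), polygon.max(axis=0)
    users = []
    while len(users) < n_users:
        pt = lo + rng.random(2) * (hi - lo)
        if path.contains_point(pt):
            users.append(pt)
    return np.array(users)

users_xy = sample_users_in_polygon(polygon, n_users=50)

plt.scatter(*macro_xy.T, c="red", marker="^", s=120, label="Macro BS")
plt.scatter(*micro_xy.T, c="blue", marker="s", s=60, label="Micro BS")
plt.scatter(*users_xy.T, c="yellow", edgecolors="black", s=25, label="Users")
plt.legend()
plt.axis("equal")
plt.show()
```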

Results

[Figure carousel, six panels: Dense Urban; Hotspot Scenario; Mixed Scenario; Sparse Suburban Scenario; Time Complexity; Mean reward across seeds.]

Related Works

This project builds on a substantial body of excellent prior work; full citations and discussion can be found in the paper.

BibTeX

@inproceedings{wirelessoptim2026,
  author    = {Giwa, Oluwaseyi and Shock, Jonathan and Du Toit, Jaco and Awodumila, Tobi},
  title     = {Optimisation of Resource Allocation in Heterogeneous Wireless Networks Using Deep Reinforcement Learning},
  booktitle = {European Conference on Networks and Communications (EuCNC) \& 6G Summit},
  year      = {2026}
}