Optimisation of Resource Allocation in Heterogeneous Wireless Networks Using Deep Reinforcement Learning

1African Institute for Mathematical Sciences, 2University of Cape Town, 3NiTheCS, Stellenbosch University, 4INRS, Montreal, 5Vodacom, South Africa, 6EEE, Stellenbosch University

We model control as a Markov decision process (MDP) and the RA controller as a single agent that interacts with the HetNet environment at discrete time steps \(t \in \{0, 1, \dots, T - 1\}\). This formulation is justified because the network state at each time step captures all the information needed to determine the system's evolution under the agent's action, satisfying the Markov property.

\(\mathcal{M}=\left(\mathcal{S},\mathcal{A},\mathcal{P},R,\gamma\right)\)

where \(s_t\!\in\!\mathcal{S}\) is the state, \(a_t\!\in\!\mathcal{A}\) is the action, \(\mathcal{P}\) is the transition kernel, \(R\) is the reward function, and \(\gamma\!\in\!(0,1)\) is the discount factor.

State space: \(s_t = \left[\vec{p}_t,\ \vec{I}_t,\ \mathbf{A}_t,\ \vec{x}_t^{\,\text{BS}},\ \vec{x}_t^{\,\text{U}}\right]\), for \(t \in \{0,\dots,T-1\}\), \(\text{BS} \in \{1,\dots,N_B\}\), \(\text{U} \in \{1,\dots,N_U\}\)


Action space: \(a_t = \left\{\left(p_{\text{BS}}^{\text{adj}}, w_{\text{BS}}^{\text{alloc}}, s_{\text{U}}^{\text{score}}\right)\right\}_{\text{BS} = 1}^{N_B}\)
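To make the formulation concrete, the sketch below wires the state and action spaces above into a Gymnasium-style environment. The dimensions, box bounds, and internal helper logic are illustrative assumptions for this page, not the paper's implementation.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

N_B, N_U = 13, 50  # illustrative: 3 macro + 10 micro BSs, 50 users


class HetNetEnv(gym.Env):
    """Minimal sketch of the HetNet resource-allocation MDP."""

    def __init__(self, horizon=100):
        super().__init__()
        self.horizon = horizon
        # State s_t: BS powers, per-user interference, BS-user association
        # matrix, BS coordinates, and user coordinates (flattened).
        obs_dim = N_B + N_U + N_B * N_U + 2 * N_B + 2 * N_U
        self.observation_space = spaces.Box(-np.inf, np.inf, (obs_dim,), np.float32)
        # Action a_t: per-BS (power adjustment, bandwidth allocation, scheduling score).
        self.action_space = spaces.Box(-1.0, 1.0, (3 * N_B,), np.float32)

    def _observe(self):
        return np.concatenate([
            self.powers, self.interference, self.assoc.ravel(),
            self.bs_xy.ravel(), self.user_xy.ravel(),
        ]).astype(np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        self.powers = np.ones(N_B, np.float32)
        self.interference = np.zeros(N_U, np.float32)
        self.assoc = np.zeros((N_B, N_U), np.float32)
        self.bs_xy = self.np_random.uniform(0, 1, (N_B, 2)).astype(np.float32)
        self.user_xy = self.np_random.uniform(0, 1, (N_U, 2)).astype(np.float32)
        return self._observe(), {}

    def step(self, action):
        action = np.asarray(action).reshape(N_B, 3)
        # Apply the per-BS power adjustments (bandwidth and scheduling omitted here).
        self.powers = np.clip(self.powers + 0.1 * action[:, 0], 0.0, 1.0)
        reward = 0.0  # placeholder; see the reward sketch further below
        self.t += 1
        return self._observe(), reward, False, self.t >= self.horizon, {}
```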


Channel dynamics: the log-normal shadowing gain (\(\Psi_{\text{BS-U}}\)) and the downlink effective power gain (\(H_{\text{BS-U}}\)) of each BS-user link are given by \begin{align*} \Psi_{\text{BS-U}} = 10^{\frac{X_{\text{BS-U}}}{10}}, \quad X_{\text{BS-U}} \sim \mathcal{N}\left(0, \sigma_{\text{sh}}^{2}\right) \; \text{in dB},\\ H_{\text{BS-U}} = \frac{S_{\text{BS-U}}}{\left(d_{\text{BS-U}}\right)^{\eta} \cdot \Psi_{\text{BS-U}}}, \end{align*} where \(d_{\text{BS-U}}\) is the BS-user distance and \(\eta\) the path-loss exponent.
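A minimal numerical sketch of this channel model, assuming illustrative values for the path-loss exponent, the shadowing standard deviation, and the small-scale gain \(S_{\text{BS-U}}\) (none of which are taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)


def channel_gain(d, s=1.0, eta=3.5, sigma_sh_db=8.0, rng=rng):
    """Effective downlink power gain H_{BS-U} for link distance d (metres).

    s           : small-scale gain S_{BS-U} (1.0 = none; assumed value)
    eta         : path-loss exponent (assumed value)
    sigma_sh_db : shadowing standard deviation in dB (assumed value)
    """
    x_db = rng.normal(0.0, sigma_sh_db, size=np.shape(d))  # X ~ N(0, sigma_sh^2) in dB
    psi = 10.0 ** (x_db / 10.0)                            # log-normal shadowing Psi
    return s / (np.asarray(d) ** eta * psi)                # H = S / (d^eta * Psi)


# Example: gains for three links at 50, 120, and 300 m.
print(channel_gain(np.array([50.0, 120.0, 300.0])))
```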


SINR: \(\text{SINR}_{\text{U}} = \frac{p_{\text{BS-U}} \cdot H_{\text{BS-U}}}{\sum_{\text{BS}' \neq \text{BS}} \left(p_{\text{BS}'\text{-U}} \cdot H_{\text{BS}'\text{-U}}\right) + N_0}\)


Throughput: \(T_\text{U} = B_\text{U} \cdot \log_{2}\left(1 + \text{SINR}_\text{U}\right)\)
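As an illustration, the following sketch computes per-user SINR and Shannon throughput from a power vector and a gain matrix, assuming each user is served by exactly one BS; the array shapes and the noise value are placeholders.

```python
import numpy as np


def sinr_and_throughput(p, H, serving, bandwidth, noise=1e-13):
    """Per-user SINR and Shannon throughput.

    p         : (N_B,)      transmit power of each BS (W)
    H         : (N_B, N_U)  effective power gain of each BS-user link
    serving   : (N_U,)      index of the serving BS for each user
    bandwidth : (N_U,)      bandwidth allocated to each user (Hz)
    noise     : noise power N_0 (W, assumed value)
    """
    rx = p[:, None] * H                           # received power from every BS
    users = np.arange(H.shape[1])
    signal = rx[serving, users]                   # power from the serving BS
    interference = rx.sum(axis=0) - signal        # power from all other BSs
    sinr = signal / (interference + noise)
    throughput = bandwidth * np.log2(1.0 + sinr)  # T_U = B_U * log2(1 + SINR_U)
    return sinr, throughput
```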


Reward function: \(r_t = \kappa \cdot \sum_{\text{U} = 1}^{N_U} T_\text{U} \; - \; \beta \cdot \sum_{\text{BS} = 1}^{N_B} P_{\text{BS}} \; + \; \phi \cdot \text{Fairness}_{t}\)
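A sketch of this multi-objective reward, using Jain's index as the fairness term and illustrative weights \(\kappa\), \(\beta\), \(\phi\); both the fairness metric and the weight values are assumptions, since they are not spelled out on this page.

```python
import numpy as np


def reward(throughput, bs_power, kappa=1e-6, beta=0.1, phi=1.0):
    """r_t = kappa * sum(throughput) - beta * sum(power) + phi * fairness.

    Jain's fairness index is used here as Fairness_t (an assumption);
    kappa, beta, phi are illustrative weights, not the paper's values.
    """
    fairness = throughput.sum() ** 2 / (len(throughput) * np.square(throughput).sum() + 1e-12)
    return kappa * throughput.sum() - beta * bs_power.sum() + phi * fairness
```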


Optimisation goal: \(\pi^{*} = \arg \max_{\pi} \mathbb{E}\left[\left.\sum_{t = 0}^{T-1} \gamma^{t} r_t \,\right|\, \pi\right]\)
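In practice, this discounted-return objective is maximised by training PPO and TD3 on the environment. The snippet below is a minimal sketch using Stable-Baselines3 (an assumed implementation choice; the paper's own training setup may differ), reusing the `HetNetEnv` sketch from above.

```python
from stable_baselines3 import PPO, TD3

# Assumes the HetNetEnv sketch above is defined in (or imported into) this module.
env = HetNetEnv()

# gamma is the discount factor from the MDP objective; timestep budget is illustrative.
ppo = PPO("MlpPolicy", env, gamma=0.99, verbose=1)
ppo.learn(total_timesteps=200_000)

td3 = TD3("MlpPolicy", env, gamma=0.99, verbose=1)
td3.learn(total_timesteps=200_000)
```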

We instantiate BS locations from real base-station location data for Cape Town, provided by a local telecom operator, and place \(50\) users within the deployment polygon. The dataset comprises three macro BSs and ten micro BSs. Colours in all figures follow the evaluation convention: macro BSs (red), micro BSs (blue), users (yellow).
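A small sketch of the user placement step, assuming users are dropped uniformly at random inside the deployment polygon via rejection sampling with Shapely; the polygon coordinates below are purely illustrative, whereas the real polygon comes from the operator data.

```python
import numpy as np
from shapely.geometry import Point, Polygon


def drop_users(polygon, n_users=50, seed=0):
    """Uniformly drop users inside the deployment polygon by rejection sampling."""
    rng = np.random.default_rng(seed)
    minx, miny, maxx, maxy = polygon.bounds
    users = []
    while len(users) < n_users:
        p = Point(rng.uniform(minx, maxx), rng.uniform(miny, maxy))
        if polygon.contains(p):
            users.append((p.x, p.y))
    return np.array(users)


# Illustrative rectangular polygon (metres); not the actual deployment area.
poly = Polygon([(0, 0), (1000, 0), (1000, 800), (0, 800)])
user_xy = drop_users(poly)
```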

Abstract

Dynamic resource allocation in heterogeneous wireless networks (HetNets) is challenging for traditional methods under varying user loads and channel conditions. We propose a deep reinforcement learning (DRL) framework that jointly optimises transmit power, bandwidth, and scheduling via a multi-objective reward balancing throughput, energy efficiency, and fairness. Using real base station coordinates, we compare Proximal Policy Optimisation (PPO) and Twin Delayed Deep Deterministic Policy Gradient (TD3) against three heuristic algorithms in multiple network scenarios. Our results show that DRL frameworks outperform heuristic algorithms in optimising resource allocation in dynamic networks. These findings highlight key trade-offs in DRL design for future HetNets.

Results

Figure carousel (four panels): normalised bandwidth allocation, normalised power allocation, normalised scheduling score, and mean reward across seeds.

Related Works

This work builds on an extensive body of prior research on resource allocation and reinforcement learning for wireless networks; please see the full paper for the complete list of references and discussion.

BibTeX

@inproceedings{wirelessoptim2026,
  author    = {Giwa, Oluwaseyi and Shock, Jonathan and Du Toit, Jaco and Awodumila, Tobi},
  title     = {Optimisation of Resource Allocation in Heterogeneous Wireless Networks Using Deep Reinforcement Learning},
  booktitle = {IEEE Wireless Communications and Networking Conference},
  year      = {2026}
}