User Allocation In MEC: A DRL Approach (Paper Notes)

news 2026/2/10 17:53:44

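Paper: ICWS 2021, User Allocation in Mobile Edge Computing: A Deep Reinforcement Learning Approach

Code: User allocation in a mobile edge computing environment using reinforcement learning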

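I. Introduction

II. MOTIVATION-A. Observations Validating the Hypothesis

II. MOTIVATION-A Motivating Example

The Basic Idea of the Data-Driven Approach

III. Reinforcement Learning Allocation

RL Framework

RL Allocation Algorithm Code

IV. Deterministic Approach as BASELINE

ILP Algorithm: Integer Linear Programming (ILP) Code

Nearest Neighbourhood Greedy Algorithm Code

Overall Framework Code:

Functions Used in the Framework:

Comparison Experiments:

V. Experiments and Result Analysis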

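I. Introduction

As demand for low latency grows, edge computing (or fog computing) is becoming mainstream. State-of-the-art approaches assume that the total resource utilization of an edge server equals the sum of the resource utilization of all services served from it. In practice, the relationship between an edge server's resource utilization and the number of service requests is often highly non-linear, especially under CPU-GPU co-execution, which makes mathematical modeling of resource utilization extremely complex.

Motivation: existing algorithms for the edge user allocation (EUA) problem generally assume that a service's resource utilization is linear in the number of requests served on the edge server, i.e. that the server's total utilization is the cumulative sum of each request's resource footprint. In actual operation, resource usage is highly dynamic and hard to describe precisely with a mathematical model.

Method: the paper proposes an on-device deep reinforcement learning (DRL) framework for the EUA problem that gradually learns an appropriate resource allocation from experience and interaction with the MEC system. The DRL agent learns how many users can be served on an edge server under the service latency threshold. By observing the system parameters simultaneously, it learns the non-linear dependencies directly from the edge server.

Conclusion: the experimental evaluation shows that the DRL framework outperforms traditional deterministic methods in the number of users allocated.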

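II. MOTIVATION-A. Observations Validating the Hypothesis

The service execution time of the object detection application YOLO processing images was observed both with and without a GPU:

Non-linearity: the relationship between YOLO's execution time and the CPU and GPU parameters is non-linear. Execution time does not depend on a single parameter alone; it is also affected by many hidden factors, such as CPU/GPU availability.

Complexity: because of these multiple hidden parameters, accurately modeling the relationship between YOLO's execution time and system resources is difficult.

Why modeling service execution time is hard:

• Non-linear relationship between available processor resources and execution time: when few cores are available, removing a core has a more pronounced effect on execution time, and under high workload, adding background workload slows execution significantly.

• Variation over time: execution time varies significantly even on the same machine with the same configuration, influenced by hidden parameters such as the service invocation pattern and temperature.

• Variation across services: different services show different execution-time patterns. YOLO's execution time grows roughly linearly with the number of users, while MobileNet's is non-linear, which complicates the modeling task.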

II. MOTIVATION-A Motivating Example

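Consider seven mobile users U1, U2, ..., U7 and two edge servers e1 and e2. Each user requests one of the two services s1 and s2 available on the edge servers: users U1, U2, U4 and U6 request s1, and the remaining users request s2. Each edge server is described by a resource vector, a 4-tuple (available RAM, number of cores, CPU background workload %, GPU utilization %).

The example shows how a traditional deterministic approach leads to inefficient user-to-server allocation:

On edge server e1, a single user request for service s1 takes 3.12 s, so scaling linearly predicts 12.48 s for four users. The actual time, however, is 3.468 s.

Assume a latency threshold of 6.58 s. A deterministic method that considers only the single-request execution time assigns U1, U2 and U3 to e1: s1 gets only 2 users (3.12 s each) and s2 gets 1 user (6.32 s).

The deterministic method therefore allocates 3 users in total.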
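The arithmetic behind this example can be checked with a minimal sketch. It assumes the linear model described above (n requests cost n times the single-request time) and uses only the numbers quoted in the example; the names `linear_capacity`, `capacity` and the two time variables are illustrative, not taken from the paper or its code.

```python
import math

LATENCY_THRESHOLD = 6.58                        # seconds, from the example
SINGLE_REQUEST_TIME = {"s1": 3.12, "s2": 6.32}  # seconds on e1, from the example


def linear_capacity(single_time, threshold):
    """Largest n with n * single_time <= threshold (linear assumption)."""
    return math.floor(threshold / single_time)


# Users each service may receive on e1 under the linear assumption
capacity = {s: linear_capacity(t, LATENCY_THRESHOLD)
            for s, t in SINGLE_REQUEST_TIME.items()}

predicted_four_users = 4 * SINGLE_REQUEST_TIME["s1"]  # linear prediction
measured_four_users = 3.468                           # measured value quoted above

print(capacity)
print(predicted_four_users > LATENCY_THRESHOLD, measured_four_users <= LATENCY_THRESHOLD)
```

The linear prediction for four s1 users exceeds the threshold while the measured time stays well below it, which is the gap the data-driven approach exploits.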

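The Basic Idea of the Data-Driven Approach

The data-driven approach allocates users to servers based on measured execution times and a more accurate resource utilization model, overcoming the shortcomings of the deterministic method and allocating resources more effectively.

The execution time of four users running YOLO is below the latency threshold of 6.55. In practice, users U1, U2 and U4 can be allocated on e1 (taking only 3.35), and e2 can host U5 and U7 (taking only 6.12).

The data-driven approach allocates 5 users in total (2 more than the deterministic method) because it models resource utilization more accurately.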

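III. Reinforcement Learning Allocation

In the MEC environment $E=\{e_1,e_2,\dots,e_j\}$, each edge server $e_j$ has a coverage radius $r_j$. Mobile users $U=\{U_1,U_2,\dots,U_i\}$ within an edge server's coverage radius can request the services $S=\{s_1,s_2,\dots,s_k\}$ hosted on that server. Each edge server has a set of available resources (RAM, cores, CPU background workload %, GPU utilization %).

The allocation policy decides which edge server each user is bound to in response to dynamic factors:

(a) a new user joins an edge server's coverage area; (b) a user moves away from an edge server's coverage area; (c) a user's service request changes; (d) an edge server or mobile user goes offline.

The goal of the allocation policy is to satisfy as many service requests as possible while respecting the latency threshold $\Gamma$ for service execution. Traditional deterministic methods predict execution time from historical data, but because execution time is dynamic this can lead to over- or under-allocation of resources. The proposed RL framework learns service execution patterns from actual experience on the edge servers and optimizes user-server binding decisions in real time, so it copes better with a dynamic environment.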

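RL Framework

In the RL framework, the agent explores the environment and receives feedback on its actions, and from that feedback learns to choose better actions. RL can learn the underlying environment without large amounts of labeled data.

Here the agent interacts continuously with the edge server: it takes an action, a number of service requests are executed, and it receives a reward based on the execution footprint.

State: the edge server's resource vector together with the number of service requests.

Action: the set of service requests waiting to be executed on the edge server.

Reward computation: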

# compute latency
def get_reward(self, state, action):
    # Convert the action into two user counts, u1 and u2
    u1 = action//5 + 1
    u2 = (action+1) - (u1-1)*5

    # sample time from dataframe
    gram = state[0]
    gcores = state[1]
    gwl_c = state[2]
    gwl_g = state[3]
    gs1 = u1*100
    gs2 = u2*100

    # Look up the records that match the current state and action
    fetch_state = self.df.loc[ (self.df['ram'] == gram) & (self.df['cores']== gcores) & (self.df['workload_cpu']==gwl_c) & (self.df['workload_gpu']==gwl_g) & (self.df['users_yolo']==gs1) & (self.df['users_mnet']==gs2)]

    # Compute the reward
    if fetch_state.empty:
        # No matching record: return a large negative reward to mark this as a bad action
        return -20

    # Execution latency: randomly sample one matching record per service
    time1 = fetch_state.sample().iloc[0]['time_yolo']
    time2 = fetch_state.sample().iloc[0]['time_mnet']
    tm = max(time1, time2)  # the larger of the two latencies is compared with the threshold

    # latency_threshold is a module-level value defined elsewhere in the notebook
    if (tm <= latency_threshold):
        # reward for the change in user counts plus a base reward for the action itself (u1 + u2)
        return  0.01*(gs1 - state[4]) +  0.01*(gs2 - state[5]) + u1 + u2
    else:
        return -5 - u1 - u2
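To see the three reward branches in isolation, here is a minimal sketch that swaps the pandas lookup for a small dictionary. The table `MEASUREMENTS`, the threshold value and the function name `toy_reward` are hypothetical and exist only for illustration; the sketch also draws one record for both services, whereas the code above samples each time separately.

```python
import random

LATENCY_THRESHOLD = 6.0  # hypothetical threshold in seconds

# Hypothetical measurements: (users_yolo, users_mnet) -> [(time_yolo, time_mnet), ...]
MEASUREMENTS = {
    (100, 100): [(1.0, 2.0)],
    (200, 100): [(7.5, 2.5)],
}


def toy_reward(prev_users, action, rng=random):
    """Mirror the reward shape: -20 if no record, bonus if feasible, penalty otherwise."""
    u1 = action // 5 + 1
    u2 = action % 5 + 1            # same value as (action+1) - (u1-1)*5
    gs1, gs2 = u1 * 100, u2 * 100
    samples = MEASUREMENTS.get((gs1, gs2))
    if not samples:
        return -20                 # state/action pair never observed
    time_yolo, time_mnet = rng.choice(samples)
    if max(time_yolo, time_mnet) <= LATENCY_THRESHOLD:
        return 0.01 * (gs1 - prev_users[0]) + 0.01 * (gs2 - prev_users[1]) + u1 + u2
    return -5 - u1 - u2            # threshold violated


print(toy_reward((100, 100), 0), toy_reward((100, 100), 5), toy_reward((100, 100), 24))
```

Feasible actions earn more the more users they serve, infeasible ones are penalized in proportion to the users they would have served, and unobserved combinations get the largest penalty.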

Training the RL Agent

RL environment:

import random

import numpy as np
import pandas as pd
import gym
from gym import spaces
from gym.utils import seeding


class yolosystem(gym.Env):
    metadata = {'render.modes': ['human']}

    def __init__(self, n_actions, filename):
        super(yolosystem, self).__init__()
        self.n_actions = n_actions  # total number of actions after quantization [10, 20, 30 ...]
        self.action_space = spaces.Discrete(self.n_actions)  # discrete action indices; starts with zero
        # observation = <ram, cores, workload_cpu, workload_gpu, users_yolo, users_mnet>
        self.observation_space = spaces.Box(low=np.array([0,0,0,0,0,0]), high=np.array([11000]*6), shape=(6, ), dtype=np.int32)
        self.seed()
        self.current_obs = np.array( [3000, 2, 40, 2, 100, 100] )  # initial observation

        # Load dataset
        self.df = pd.read_csv(filename)
        # compute percentage of GPU usage from actual use
        self.df['workload_gpu'] = self.df['workload_gpu'].multiply(1/80).round(0).astype(int)  # round gpu workload
        # get unique data in set
        self.ram = self.df.ram.unique()
        self.cores = self.df.cores.unique()
        self.workload_cpu = self.df.workload_cpu.unique()
        print(self.df)  # print dataset

    def seed(self, seed=1010):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

    def step(self, action):
        assert self.action_space.contains(action)  # action should be in action space
        state = self.current_obs
        done = True  # episode ends after each action
        # compute latency from the number of users
        reward = self.get_reward(state, action)
        # print(action, reward)
        self.current_obs = self.get_random_state()  # go to a random state
        # print(self.current_obs)
        return self.current_obs, reward, done, {}  # next state, reward, episode-done, no-info

    def reset(self):
        self.current_obs = self.get_random_state()
        return self.current_obs

    def render(self, mode='human', close=False):
        print(f"Current State:<{self.current_obs}>")

    # compute latency
    def get_reward(self, state, action):
        # Convert the action into two user counts, u1 and u2
        u1 = action//5 + 1
        u2 = (action+1) - (u1-1)*5
        # sample time from dataframe
        gram = state[0]
        gcores = state[1]
        gwl_c = state[2]
        gwl_g = state[3]
        gs1 = u1*100
        gs2 = u2*100
        # Look up the records that match the current state and action
        fetch_state = self.df.loc[ (self.df['ram'] == gram) & (self.df['cores']== gcores) & (self.df['workload_cpu']==gwl_c) & (self.df['workload_gpu']==gwl_g) & (self.df['users_yolo']==gs1) & (self.df['users_mnet']==gs2)]
        if fetch_state.empty:
            # No matching record: return a large negative reward to mark this as a bad action
            return -20
        # Execution latency: randomly sample one matching record per service
        time1 = fetch_state.sample().iloc[0]['time_yolo']
        time2 = fetch_state.sample().iloc[0]['time_mnet']
        tm = max(time1, time2)  # the larger of the two latencies is compared with the threshold
        # latency_threshold is a module-level value defined elsewhere in the notebook
        if (tm <= latency_threshold):
            # reward for the change in user counts plus a base reward for the action itself (u1 + u2)
            return  0.01*(gs1 - state[4]) +  0.01*(gs2 - state[5]) + u1 + u2
        else:
            return -5 - u1 - u2

    # get to some random state after taking an action
    def get_random_state(self):
        # generate state randomly
        gram = np.random.choice(self.ram, 1)[0]
        gcores = np.random.choice(self.cores, 1)[0]
        gwl_c = np.random.choice(self.workload_cpu, 1)[0]
        # fetch the rows for this state
        fetch_state = self.df.loc[ (self.df['ram'] == gram) & (self.df['cores']== gcores) & (self.df['workload_cpu']==gwl_c) ]
        gwl_g = fetch_state.sample().iloc[0]['workload_gpu']  # fetch workload randomly
        gs1 = random.randrange(50, 550, 50)
        gs2 = random.randrange(50, 550, 50)
        return np.array( [gram, gcores, gwl_c, gwl_g, gs1, gs2] )

import os
import time

from stable_baselines3 import DQN
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.dqn import MlpPolicy

# Create log dir
log_dir = './agent_tensorboard/'
os.makedirs(log_dir, exist_ok=True)

# env is the yolosystem environment instance created earlier in the notebook
env = Monitor(env, log_dir)

# wrap it: convert the non-vectorized env into a vectorized env
env = DummyVecEnv([lambda: env])

Training

model = DQN(MlpPolicy, env, verbose=0, tensorboard_log = log_dir, exploration_fraction=0.4, learning_starts=150000,  train_freq=30, target_update_interval=30000, exploration_final_eps=0.07)

begin = time.time()
model.learn(total_timesteps=500000) #reset_num_timesteps=False
end = time.time()
training_time = end-begin
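To get a feel for what `exploration_fraction=0.4` and `exploration_final_eps=0.07` imply over 500,000 timesteps, the sketch below computes epsilon at a given step. It assumes a linear decay from an initial epsilon of 1.0 (the default `exploration_initial_eps` in stable-baselines3's DQN, as far as I know); `linear_epsilon` is an illustrative helper, not a library function.

```python
def linear_epsilon(step, total_timesteps=500_000, exploration_fraction=0.4,
                   initial_eps=1.0, final_eps=0.07):
    """Epsilon at `step`, decaying linearly over the first fraction of training."""
    decay_steps = exploration_fraction * total_timesteps  # 200,000 steps here
    if step >= decay_steps:
        return final_eps
    return initial_eps + (final_eps - initial_eps) * step / decay_steps


for step in (0, 100_000, 150_000, 200_000, 500_000):
    print(step, round(linear_epsilon(step), 4))
```

Under these assumptions, gradient updates begin at step 150,000 (`learning_starts`), when epsilon is still around 0.30, and the agent acts with the final epsilon of 0.07 for the last 300,000 steps.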

RL Allocation Algorithm Code

# Load model
def rl_algo():
    # For each server, use the RL agent to predict the server's capacity
    server_capacity = np.zeros((N, S))
    for server_id in range(N):
        state = server_state[server_id]
#         if model_type == 'lin':
        action = model_rl.predict(np.array(state), deterministic=True)
#         if model_type == 'exp':
#             action = model_exp.predict(np.array(state), deterministic=True)
        # Derive the predicted capacity for the two services (u1 and u2) from the action
        u1 = action[0]//5 + 1
        u2 = (action[0]+1) - (u1-1)*5
        server_capacity[server_id][0] = u1*100 #model output
        server_capacity[server_id][1] = u2*100 #model output

    # Sort users by the number of servers that cover them
    col1 = np.array([np.sum(ngb,axis=1)])
    col2 = np.array([np.arange(U)])
    sorted_ngb = np.concatenate((ngb, col1.T, col2.T), axis=1) #add rowsum and index column
    sorted_ngb = sorted_ngb[np.argsort(sorted_ngb[:, N])] #sort the rows based on rowsum column

    # run allocation algorithm
    rl_allocation = []
    # For each user, pick the covering server with the largest predicted capacity for the
    # requested service; if capacity remains, decrement it and record the allocation
    for i in range(U):
        server_list = np.where(sorted_ngb[i, :N] == 1)[0] # servers the user is connected to
        if len(server_list) == 0: # skip users without any server
            continue
        user_ix = int(sorted_ngb[i, N+1]) # original user index (rows were re-sorted above)
        ser = int(service[user_ix]) # which service this user requests
        choosen_server = server_list[np.argmax(server_capacity[server_list, ser])] # id of the chosen server
        if  server_capacity[choosen_server][ser] > 0: # allocate the user to choosen_server
            server_capacity[choosen_server][ser] -= 1 # reduce the server capacity
            rl_allocation.append( (user_ix, choosen_server) ) #(user, server) alloc pair
    print('RL Num of allocation: {}'.format(len(rl_allocation)))
    return rl_allocation
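The allocation step of rl_algo() can be illustrated on plain Python lists. This is a sketch under simplifying assumptions: `coverage[u]` lists the servers covering user u, `service[u]` is the requested service, and `capacity[s][k]` is the predicted remaining capacity of server s for service k. The function name and the toy data are illustrative only.

```python
def allocate_by_capacity(coverage, service, capacity):
    """Serve the least-covered users first; each goes to the covering server
    with the largest remaining capacity for the requested service."""
    order = sorted(range(len(coverage)), key=lambda u: len(coverage[u]))
    allocation = []
    for u in order:
        if not coverage[u]:
            continue                      # user is outside every coverage area
        k = service[u]
        best = max(coverage[u], key=lambda s: capacity[s][k])
        if capacity[best][k] > 0:
            capacity[best][k] -= 1        # consume one unit of predicted capacity
            allocation.append((u, best))  # (user, server) pair
    return allocation


coverage = [[0, 1], [0], [], [1]]
service = [0, 0, 1, 1]
capacity = [[1, 0], [1, 1]]
result = allocate_by_capacity(coverage, service, capacity)
print(result)
```

Handling the least-covered users first gives users with a single option a chance before users with alternatives consume that server's capacity.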

IV. Deterministic Approach as BASELINE

The execution time $\gamma_{kj}$ of service $s_k$ on edge server $e_j$ is taken as the mean of the historical service execution data, and this value determines how many users can be allocated to the edge server. The corresponding code is in allocation.ipynb, def generate_server_state(num_server):

# From the rows matching the selected GPU workload, compute the mean YOLO and MNet times
time_yolo = fetch_time['time_yolo'].mean() #average of time for particular state
time_mnet = fetch_time['time_mnet'].mean()
# Build the state from the service requests on each server
gs1 = server_service[s_id][0]
gs2 = server_service[s_id][1]
server_state.append( [gram, gcores, gwl_c, gwl_g, gs1, gs2] )
# append the gamma value of each server
gamma.append((time_yolo, time_mnet))

ILP Algorithm: Integer Linear Programming (ILP) Code

This solves the user-to-server allocation problem with the objective of maximizing the number of users covered.

from mip import Model, xsum, MAXIMIZE, BINARY, CBC

def ilp_algo():
    ## ===================================ILP with python mip
    # >> solver_name=GRB
    # >> Currently using CBC
    I = range(U) # users
    J = range(N) # servers
    alloc = Model(sense=MAXIMIZE, name="alloc", solver_name=CBC)
    alloc.verbose = 0

    def coverage(user_ix, server_ix):
        if ngb[user_ix][server_ix]==1:
            return 1
        else:
            return 0

    # U: num of users, N: num of servers
    # Binary variable matrix x, where x[i][j] says whether user i is allocated to server j
    x = [[ alloc.add_var(f"x{i}{j}", var_type=BINARY) for j in J] for i in I]

    # Objective: maximize the number of allocated users
    alloc.objective = xsum( x[i][j]  for i in I for j in J )

    # 1. Coverage constraint
    for i in I:
        for j in J:
            if not coverage(i,j):
                alloc += x[i][j] == 0

    # 2. Each user can be allocated to at most one server
    for i in I:
        alloc += xsum( x[i][j] for j in J ) <=1

    # 3. Latency constraint
    for j in J:
        alloc += xsum( gamma[j][int(service[i])]*x[i][j] for i in I ) <=latency_threshold-network_latency[j]

    #alloc.write("test-model.lp")
    #===========Start Optimization=========
    alloc.optimize(max_seconds=25) # optimize the model
    #==========ILP Ends here

    #print(f"Number of Solutions:{qoe.num_solutions}")
    ilp_allocation = [ (i,j) for i in I for j in J if x[i][j].x >= 0.99] # read out the allocation
    #print(f"Number of Solutions:{qoe.num_solutions}")
    #print(f"Objective Value:{qoe.objective_value}")
    allocated_num_users = len(ilp_allocation)
    print("ILP Allocated Num of Users: {}".format(allocated_num_users)) # number of allocated users
    # selected.sort()
    return ilp_allocation
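Read off the three constraints in ilp_algo(), the model can be written compactly as below. The notation is my own assumption rather than the paper's: $x_{ij}$ is the binary allocation variable, $k_i$ is the service requested by user $i$, $\gamma_{k_i j}$ is the mean execution time of that service on server $j$, $\ell_j$ stands for `network_latency[j]`, and $\Gamma$ is `latency_threshold`.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{i=1}^{U}\sum_{j=1}^{N} x_{ij} \\
\text{s.t.}\quad
 & x_{ij} = 0 && \text{if user } i \text{ is not covered by server } j, \\
 & \sum_{j=1}^{N} x_{ij} \le 1 && \forall i, \\
 & \sum_{i=1}^{U} \gamma_{k_i j}\, x_{ij} \le \Gamma - \ell_j && \forall j, \\
 & x_{ij} \in \{0,1\} && \forall i, j.
\end{aligned}
```

The third constraint is where the linear assumption enters: the load on server $j$ is modeled as the sum of per-request mean execution times.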

Nearest Neighbourhood Greedy Algorithm Code

def greedy_algo():
    server_capacity = np.zeros(N) # accumulated execution latency per server
    rl_allocation = []
    for user_id in range(U):
        # get the list of servers to which the user is connected
        server_ngb_list = np.where(ngb[user_id, :N] == 1)[0]
        if len(server_ngb_list) == 0: # ignore the users which are not under any server
            continue
        # find the distance from the user to each server in server_ngb_list
        dist_list = np.array([ server_ngb_list, [server.iloc[i]['geometry'].centroid.distance(user.iloc[user_id]['geometry']) for i in server_ngb_list] ])
        # sort the servers by their distance from the user
        sorted_distance_list = dist_list[ :, dist_list[1].argsort()]
        # servers arranged from least to greatest distance
        server_list = sorted_distance_list[0].astype(int)
        # allocation
        lat = 0
        for server_id in server_list:
            lat = gamma[server_id][int(service[user_id])] # execution latency of the requested service on this server
            if server_capacity[server_id]+lat <= latency_threshold-network_latency[server_id]:
                server_capacity[server_id] += lat # add the latency to the load accumulated on this server
                rl_allocation.append( (user_id, server_id) ) #(user, server) alloc pair
                break
    print('Greedy-Ngb Num of allocation: {}'.format(len(rl_allocation)))
    return rl_allocation
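Stripped of geopandas, the greedy rule reduces to "try servers from nearest to farthest and take the first one whose latency budget still fits". The sketch below assumes precomputed distances; `candidates[u]` holds (distance, server) pairs, `budget[s]` plays the role of latency_threshold - network_latency[s], and all names and numbers are illustrative.

```python
def greedy_nearest(candidates, service, gamma, budget):
    """Assign each user to the nearest covering server with enough latency budget left."""
    used = [0.0] * len(budget)            # latency already committed per server
    allocation = []
    for u, cands in enumerate(candidates):
        for _, s in sorted(cands):        # nearest server first
            cost = gamma[s][service[u]]   # mean execution time of the requested service
            if used[s] + cost <= budget[s]:
                used[s] += cost
                allocation.append((u, s))
                break                     # user is served; move on
    return allocation


candidates = [[(10.0, 0), (5.0, 1)], [(3.0, 0)], [(1.0, 1)]]
service = [0, 0, 1]
gamma = [[3.0, 6.0], [3.0, 6.0]]
budget = [6.5, 6.5]
greedy_result = greedy_nearest(candidates, service, gamma, budget)
print(greedy_result)
```

In this toy run the third user is rejected because server 1 has already committed 3.0 s and the request would need another 6.0 s, which is the kind of rejection the linear model tends to produce.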

Overall Framework Code:

Fixed number of servers, varying number of users:

For each user count, first obtain the neighbourhood matrix ngb between users and servers and compute the network latency network_latency.

Then generate the server states server_state and compute each server's gamma value with generate_server_state(num_server).

Finally, run each of the three algorithms and count the successful allocations.

import timeit

if alloc_type == 'server': # servers fixed, number of users varies
    for U in range(100, 600, 100): # 100 to 500 users
        for epoch in range(50):
            print("User:", U, 'Server:', N, 'Epoch:', epoch)
            # generate servers and users from the EUA data and determine the neighbourhood matrix
            ngb, user, server, service, server_service, network_latency = ngb_matrix(U, N, S)
            server_state, gamma = generate_server_state(N) # assign a state and gamma value to each server

            #=======ILP starts
            start = timeit.default_timer()
            ilp_aloc = ilp_algo() # call ILP algorithm
            stop = timeit.default_timer()
            execution_time_ilp = stop - start
            #========ILP ends

            #=======Greedy starts
            start = timeit.default_timer()
            greedy_aloc = greedy_algo() # call greedy algorithm
            stop = timeit.default_timer()
            execution_time_greedy = stop - start
            #========Greedy ends

            #=======RL_linear starts
            start = timeit.default_timer()
            rl_aloc = rl_algo() # call RL algorithm
            stop = timeit.default_timer()
            execution_time_rl = stop - start
            #========RL_linear ends

            #========Store results to file
            to_append = [U, N,
                         len(ilp_aloc), execution_time_ilp,
                         len(greedy_aloc), execution_time_greedy,
                         len(rl_aloc), execution_time_rl,
                        ]
            dseries = pd.Series(to_append, index = result_user.columns)
            # DataFrame.append was removed in pandas 2.0; pd.concat adds the same row
            result_user = pd.concat([result_user, dseries.to_frame().T], ignore_index=True)
            print("epoch:", epoch)
    result_user.to_csv(result_file, index=False)

Functions Used in the Framework:

1. Generating the server states and computing each server's gamma value

def generate_server_state(num_server):
    # Generate the server states and compute each server's gamma value
    df = pd.read_csv(filename_base)
    # Scale and round the GPU workload values
#     df['ram'] = df['ram'].div(1000).round(0).astype(int)
#     df['workload_cpu'] = df['workload_cpu'].div(10).round(0).astype(int)
    df['workload_gpu'] = df['workload_gpu'].multiply(1/80).round(0).astype(int) #round gpu workload
#     df['users_yolo'] = df['users_yolo'].div(100).round(0).astype(int)
#     df['users_mnet'] = df['users_mnet'].div(100).round(0).astype(int)

    # get the unique RAM, core count and CPU workload values in the dataset
    ram = df.ram.unique()
    cores = df.cores.unique()
    workload_cpu = df.workload_cpu.unique()

    server_state = [] # server states
    gamma = []
    for s_id in range(num_server):
        # for each server, randomly pick a RAM, core count and CPU workload value
        gram = np.random.choice(ram, 1)[0]
        gcores = np.random.choice(cores, 1)[0]
        gwl_c = np.random.choice(workload_cpu, 1)[0]
        # fetch the rows for this state
        fetch_state = df.loc[ (df['ram'] == gram) & (df['cores']== gcores) & (df['workload_cpu']==gwl_c) ]
        # randomly pick a GPU workload value from the matching rows
        gwl_g = fetch_state.sample().iloc[0]['workload_gpu']
        fetch_time = fetch_state.loc[ (df['workload_gpu'] == gwl_g) ]
        # from the rows matching the selected GPU workload, compute the mean YOLO and MNet times
        time_yolo = fetch_time['time_yolo'].mean() #average of time for particular state
        time_mnet = fetch_time['time_mnet'].mean()
        # build the state from the service requests on each server
        # (server_service is the module-level array returned by ngb_matrix)
        gs1 = server_service[s_id][0]
        gs2 = server_service[s_id][1]
        server_state.append( [gram, gcores, gwl_c, gwl_g, gs1, gs2] )
        gamma.append((time_yolo, time_mnet)) #append the gamma value of each server
    return server_state, gamma

2. Computing the neighbourhood matrix

#================neighbourhood Computing
def ngb_matrix(U, N, S):
    # Build the neighbourhood matrix between users and servers and compute the network latency
    # U: number of users
    # N: number of servers
    # S: number of services
    # U X N matrix
    user = load_users(U)
    server = load_servers(N)
    neighbourhood = np.zeros([U, N]) # neighbourhood matrix between users and servers
    network_latency = np.zeros(N) # network latency of each server
    latency_data = load_planetlab() # load the PlanetLab data as a latency matrix (bin size 150)

    # Check whether each user lies inside a server's buffer and compute the network latency
    for u in range(0, U):
        for n in range(0, N):
            # geometric contains test: is the user inside the server's buffer?
            if server.iloc[n].geometry.contains(user.iloc[u].geometry):
                neighbourhood[u,n]=1
                # compute the distance and assign a latency
                distance = server.iloc[n].geometry.centroid.distance(user.iloc[u].geometry)
                rep_lat = fetch_network_lat(int(distance), latency_data) # latency looked up by distance
                if network_latency[n] < rep_lat: # keep the maximum latency seen for this server
                    network_latency[n] = rep_lat
            else:
                neighbourhood[u,n]=0

    # assign each user a random service request from 0 to S-1
    service = np.zeros(U)
    for u in range(0, U):
        service[u] = random.randrange(0, S, 1)

    # count the service requests on each server
    server_service = np.zeros((N, S))
    for n in range(0, N):
        for u in range(0, U):
            if neighbourhood[u][n] == 1:
                server_service[n][int(service[u])] += 1
    return neighbourhood, user, server, service, server_service, network_latency
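The coverage test and the per-server request counts can be reproduced without geopandas. The sketch below assumes planar coordinates in metres and circular coverage with the fixed 150 radius used in load_servers(); it approximates the buffer `contains` check with a distance comparison, and the function names and coordinates are illustrative.

```python
import math


def coverage_matrix(users, servers, radius=150.0):
    """U x N matrix of 0/1 flags: 1 when the user lies within `radius` of the server."""
    return [[1 if math.dist(u, s) <= radius else 0 for s in servers] for u in users]


def requests_per_server(matrix, service, num_services):
    """For every server, count how many covered users request each service."""
    num_servers = len(matrix[0]) if matrix else 0
    counts = [[0] * num_services for _ in range(num_servers)]
    for u, row in enumerate(matrix):
        for n, covered in enumerate(row):
            if covered:
                counts[n][service[u]] += 1
    return counts


users = [(0.0, 0.0), (100.0, 0.0), (400.0, 0.0)]
servers = [(0.0, 0.0), (300.0, 0.0)]
toy_ngb = coverage_matrix(users, servers)
toy_server_service = requests_per_server(toy_ngb, service=[0, 1, 1], num_services=2)
print(toy_ngb, toy_server_service)
```

The second result is what ends up as the last two entries of each server's state vector (the request counts for the two services).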

3. Helper functions used when computing the neighbourhood matrix

import geopandas
import matplotlib.pyplot as plt

#================Load Planet Lab data
# Load the PlanetLab data and convert it to matrix form
def load_planetlab():
    # Convert to triangle
    ldata = np.loadtxt('eua/PlanetLabData_1')[np.tril_indices(490)]
    ldata = ldata[ ldata != 0] # keep the non-zero values of the lower triangle
    ldata =np.unique(ldata) # deduplicate
    np.set_printoptions(suppress=True)
    length = ldata.shape[0]
    latency_row = 150
    latency_col = (length//latency_row) #Global Data used
    ldata = np.resize(ldata, latency_col*latency_row) # resize so the data fits a 150-row matrix
    latency = ldata.reshape(latency_row,-1)
    return latency

#=================Fetch Network latency
# Look up a network latency from the latency data by distance
def fetch_network_lat(distance, latency_data):
    rep_lat = np.random.choice(latency_data[distance], size=1, replace=True) # random latency from the row for this distance
    return rep_lat/1000 # convert the latency to seconds

#===============User Data
# Load the user data and convert it to a geographic data format
def load_users(num_of_users):
    user_raw = pd.read_csv("eua/users.csv")
    user_raw = user_raw.rename_axis("UID") # name the index axis "UID", the user's unique identifier
    df = user_raw.sample(num_of_users) # randomly sample the requested number of users
    # build a GeoDataFrame with point geometries from the Longitude and Latitude columns
    gdf = geopandas.GeoDataFrame(df, geometry = geopandas.points_from_xy(df.Longitude, df.Latitude), crs = 'epsg:4326')
    user = gdf [['geometry']] # keep the geometry column
    user = user.to_crs(epsg=28355) # reproject to EPSG:28355 (GDA94 / MGA zone 55)
    #Insert additional data to dataframe
    #user = user.apply(add_data, axis=1)
    return user

#================Server Data
def load_servers(num_of_servers):
    # Load the server data and convert it to a geographic data format
    server_raw = pd.read_csv("eua/servers.csv")
    server_raw = server_raw.rename_axis("SID") # name the index axis "SID", the server's unique identifier
    df = server_raw.sample(num_of_servers) # randomly sample the requested number of servers
    gdf = geopandas.GeoDataFrame(df, geometry = geopandas.points_from_xy(df.LONGITUDE, df.LATITUDE), crs = 'epsg:4326')
    server = gdf [['geometry']] #Keep Geometry column
    server = server.to_crs(epsg=28355) #Convert to crs in Australian EPSG

    def add_radius(series):
#         radius = random.randrange(150, 250, 10)
        # give every server a buffer with a fixed radius of 150
        radius = 150
        series.geometry = series.geometry.buffer(radius)
        series['radius'] = radius
#         series['resource'] = tcomp
        return series

    server = server.apply(add_radius, axis = 1)
    return server

# Plot the distribution of users and servers on the map
def plot_data(user, server):
    # IPython magics: these two lines only work inside a notebook
    %config InlineBackend.figure_format='retina'
    %matplotlib inline
    cbd = geopandas.read_file('eua/maps', crs = {'init': 'epsg=28355'} ) #read cbd-australia location data
    fig, ax = plt.subplots(1, 1, figsize=(15,10))
    ax.set_aspect('equal')
    ax.set_xlim(319400, 322100)
    ax.set_ylim(5811900, 5813700)
    user.plot(ax=ax, marker='o', color='red', markersize=20, zorder=3, label="users")
    server.plot(ax =ax, linestyle='dashed', edgecolor='green', linewidth=1, facecolor="none", zorder=1)
    server.centroid.plot(ax=ax, marker='s', color='blue', markersize=50, zorder=2, label="server")
    cbd.plot(ax=ax, color='grey', zorder=0, alpha = 0.3);
    ax.set_title("MEC Environment(EUA): CBD Melbourne(Australia)")
    ax.legend(bbox_to_anchor=(1, 0), loc='lower left')
    plt.show()

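Results:

Comparison Experiments:

1. Using different numbers of training episodes for the RL algorithm:

rl_algo_prop() and rl_algo_und() differ only in that they use two agent models trained to different degrees; everything else is identical, so only rl_algo_prop() is shown here.

The two models are an RL agent trained for 30,000 episodes and one trained for 150,000 episodes, and the comparison is between the allocation counts they produce.

Quantization size of the action space: $\lambda$ = 2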

model_und = DQN.load("trained_agents/edge_agent_under_train")
model_prop = DQN.load("trained_agents/edge_agent_proper_train")

#Load model
def rl_algo_prop():
    # ... same as rl_algo()
    # only the model changes
    action = model_prop.predict(np.array(state), deterministic=True)
    print('Actionprop: {}'.format(action))
    u1 = (action[0]//10)*2 + 1
    u2 = (action[0]%10)*2 + 1
    server_capacity[server_id][0] = u1 #model output
    server_capacity[server_id][1] = u2 #model output
    # ... same as rl_algo()

2. Using different quantization factors for the RL algorithm:

The other two variants differ only in the model and in the action-space mapping.

rl_algo_act(), which the code describes as $\lambda$ = 5:

action = model_act.predict(np.array(state), deterministic=True)
print('Actionact: {}'.format(action))
u1 = (action[0]//5)*4 + 1 #25 action space
u2 = (action[0]%5)*4 + 1
server_capacity[server_id][0] = u1 #model output
server_capacity[server_id][1] = u2 #model output

rl_algo_thres10(), which the code describes as $\lambda$ = 100:

action = model_thres10.predict(np.array(state), deterministic=True)
print('Actionthres10: {}'.format(action))
u1 = action[0]//5 + 1
u2 = (action[0]+1) - (u1-1)*5
server_capacity[server_id][0] = u1*100 #model output
server_capacity[server_id][1] = u2*100 #model output
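The three mappings share one shape: split the flat action index into two digits, stretch each digit by a step, and optionally scale the result. The sketch below makes that explicit. The helper `decode` and its parameter names (`levels`, `step`, `scale`) are my own and do not appear in the repository; the parameter values are read off the snippets above.

```python
def decode(action, levels, step=1, scale=1):
    """Map a flat action index to (u1, u2) user counts.

    levels: number of distinct values for u2 (the divisor in the snippets)
    step:   gap between neighbouring values
    scale:  multiplier applied to the final counts
    """
    u1 = (action // levels) * step + 1
    u2 = (action % levels) * step + 1
    return u1 * scale, u2 * scale


print(decode(24, levels=10, step=2))           # mapping used in rl_algo_prop()
print(decode(24, levels=5, step=4))            # mapping used in rl_algo_act()
print(decode(24, levels=5, step=1, scale=100)) # mapping used in rl_algo_thres10()
```

For rl_algo_thres10(), `action % 5 + 1` gives the same value as the original expression `(action+1) - (u1-1)*5`.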

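My own understanding of the action-space quantization size $\lambda$ = 2 in rl_algo_prop (the variant trained for different episode counts), which may not be correct:

After the agent predicts an action, the number of service requests on the two services s1 and s2 (the action) is recovered from it, and each variant uses a different mapping. For $\lambda$ = 2 the mapping is as follows, printed for every action:

for action in range(25): # rl_algo_prop()
    u1 = (action//10)*2 + 1
    u2 = (action%10)*2 + 1
    print(f"Action: {action}, u1: {u1}, u2: {u2}")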

Action: 0, u1: 1, u2: 1
Action: 1, u1: 1, u2: 3
Action: 2, u1: 1, u2: 5
Action: 3, u1: 1, u2: 7
Action: 4, u1: 1, u2: 9
Action: 5, u1: 1, u2: 11
Action: 6, u1: 1, u2: 13
Action: 7, u1: 1, u2: 15
Action: 8, u1: 1, u2: 17
Action: 9, u1: 1, u2: 19
Action: 10, u1: 3, u2: 1
Action: 11, u1: 3, u2: 3
Action: 12, u1: 3, u2: 5
Action: 13, u1: 3, u2: 7
Action: 14, u1: 3, u2: 9
Action: 15, u1: 3, u2: 11
Action: 16, u1: 3, u2: 13
Action: 17, u1: 3, u2: 15
Action: 18, u1: 3, u2: 17
Action: 19, u1: 3, u2: 19
Action: 20, u1: 5, u2: 1
Action: 21, u1: 5, u2: 3
Action: 22, u1: 5, u2: 5
Action: 23, u1: 5, u2: 7
Action: 24, u1: 5, u2: 9

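The paper says that quantization of size $\lambda$ is used to reduce the cardinality of the action space; with $\lambda$ = 10, the new action tuple (2, 2) represents all actions in the range (11 - 20, 11 - 20) of the old action space.

So for now I take $\lambda$ = 2 to mean that one action in the quantized action space stands for two actions in the original one. In the first action, u1: 1 is the value chosen to represent u1: 1 and u1: 2 of the original space, and likewise u2: 1 represents u2: 1 and u2: 2. The next action then starts from 3.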
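This reading of the paper's description can be written down as a small sketch: a bucket of size $\lambda$ with 1-based index q covers the original values (q-1)·$\lambda$+1 through q·$\lambda$. The function names are hypothetical, and this only formalizes the interpretation above; it is not taken from the paper's code.

```python
def bucket_range(index, lam):
    """Inclusive range of original values covered by 1-based bucket `index`."""
    return ((index - 1) * lam + 1, index * lam)


def quantize(value, lam):
    """1-based bucket index that an original value (starting at 1) falls into."""
    return (value - 1) // lam + 1


print(bucket_range(2, 10))                         # the paper's (2, 2) example per dimension
print([bucket_range(q, 2)[0] for q in (1, 2, 3)])  # lower bounds for lambda = 2
```

With $\lambda$ = 2 the lower bounds of consecutive buckets are 1, 3, 5, which matches the u1 and u2 values printed for rl_algo_prop().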

But this does not explain the later cases $\lambda$ = 5 and $\lambda$ = 100:

$\lambda$ = 5:

for action in range(25): # rl_algo_act()
    u1 = (action//5)*4 + 1
    u2 = (action%5)*4 + 1
    print(f"Action: {action}, u1: {u1}, u2: {u2}")

Action: 0, u1: 1, u2: 1
Action: 1, u1: 1, u2: 5
Action: 2, u1: 1, u2: 9
Action: 3, u1: 1, u2: 13
Action: 4, u1: 1, u2: 17
Action: 5, u1: 5, u2: 1
Action: 6, u1: 5, u2: 5
Action: 7, u1: 5, u2: 9
Action: 8, u1: 5, u2: 13
Action: 9, u1: 5, u2: 17
Action: 10, u1: 9, u2: 1
Action: 11, u1: 9, u2: 5
Action: 12, u1: 9, u2: 9
Action: 13, u1: 9, u2: 13
Action: 14, u1: 9, u2: 17
Action: 15, u1: 13, u2: 1
Action: 16, u1: 13, u2: 5
Action: 17, u1: 13, u2: 9
Action: 18, u1: 13, u2: 13
Action: 19, u1: 13, u2: 17
Action: 20, u1: 17, u2: 1
Action: 21, u1: 17, u2: 5
Action: 22, u1: 17, u2: 9
Action: 23, u1: 17, u2: 13
Action: 24, u1: 17, u2: 17

$\lambda$ = 100:

for action in range(25): # rl_algo_thres10
    u1 = action//5 + 1
    u2 = (action+1) - (u1-1)*5
    print(f"Action: {action+1}, u1: {u1}, u2: {u2}")

Action: 1, u1: 1, u2: 1
Action: 2, u1: 1, u2: 2
Action: 3, u1: 1, u2: 3
Action: 4, u1: 1, u2: 4
Action: 5, u1: 1, u2: 5
Action: 6, u1: 2, u2: 1
Action: 7, u1: 2, u2: 2
Action: 8, u1: 2, u2: 3
Action: 9, u1: 2, u2: 4
Action: 10, u1: 2, u2: 5
Action: 11, u1: 3, u2: 1
Action: 12, u1: 3, u2: 2
Action: 13, u1: 3, u2: 3
Action: 14, u1: 3, u2: 4
Action: 15, u1: 3, u2: 5
Action: 16, u1: 4, u2: 1
Action: 17, u1: 4, u2: 2
Action: 18, u1: 4, u2: 3
Action: 19, u1: 4, u2: 4
Action: 20, u1: 4, u2: 5
Action: 21, u1: 5, u2: 1
Action: 22, u1: 5, u2: 2
Action: 23, u1: 5, u2: 3
Action: 24, u1: 5, u2: 4
Action: 25, u1: 5, u2: 5