batch manufacturing

Stochastic Processes

Posted by Birdie on January 15, 2024

Objectives

Write programs to solve the batch manufacturing problem

  • Formulate and solve both discounted and average cost problems
  • Implement both value iteration and policy iteration algorithms
  • Take the parameters c, K, n, p, α as inputs

Experiments and analysis

  • Run the code with different parameter inputs to show the performance of value iteration and policy iteration
  • For the discounted problem, consider Exercise 7.8
  • For the average cost problem, write code to find the threshold by stationary analysis
  • Compare the optimal policies of the discounted problem with different values of α and the average cost problem

Background

Batch Manufacturing

At each time period, the manufacturer receives an order for her product with probability p and receives no order with probability 1 − p.

At any period, she has a choice of processing all the unfilled orders in a batch or processing none at all. The maximum number of orders that can remain unfilled is n.

The cost per unfilled order at each time period is c > 0, and the setup cost to process the unfilled orders is K > 0.

The manufacturer wants to find a processing policy that minimizes the total expected discounted cost with discount factor α < 1.

Discounted Problem

Discounted problems (0 < α < 1):

$$\lim_{N\to\infty} \mathop{E}_{w_k,\,k=0,1,\dots}\left\{\sum_{k=0}^{N-1}\alpha^k g\big(x_k,\mu_k(x_k),w_k\big)\right\}$$

  • α is the discount factor
  • g(x_k, μ_k(x_k), w_k) is the cost incurred in state x_k when applying action μ_k(x_k) under the random disturbance w_k

Value Iteration

Given initial values $J_0(1),\dots,J_0(n)$, compute $J_k(i)$ with the dynamic programming recursion

$$J_{k+1}(i)=\min_{u\in U(i)}\Big[g(i,u)+\alpha\sum_{j=1}^{n}p_{ij}(u)J_k(j)\Big],\quad\forall i$$

Eventually $J_k(i)$ converges to $J^*(i)$ for every state $i$.
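
Before specializing to the batch manufacturing model, here is a minimal generic sketch of this recursion, assuming the MDP is given as a list of transition matrices P[u] and stage-cost vectors g[u] (the names and representation are my own, not taken from this report's code):

import numpy as np

def value_iteration(P, g, alpha, eps=1e-6):
    """Generic discounted value iteration (a sketch, not the report's code).

    P[u] is an (S, S) transition matrix and g[u] an (S,) stage-cost vector
    for action u; alpha in (0, 1) is the discount factor."""
    S = P[0].shape[0]
    J = np.zeros(S)
    while True:
        # Q[i, u] = g(i, u) + alpha * sum_j p_ij(u) * J(j)
        Q = np.stack([g[u] + alpha * P[u] @ J for u in range(len(P))], axis=1)
        J_new = Q.min(axis=1)
        if np.max(np.abs(J_new - J)) < eps:
            return J_new, Q.argmin(axis=1)   # estimate of J* and a greedy policy
        J = J_new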

Policy Iteration

By the Bellman equation, the optimal value function at state $i$ satisfies

$$J^*(i)=\min_{u\in U(i)}\Big[g(i,u)+\alpha\sum_{j=1}^{n}p_{ij}(u)J^*(j)\Big],\quad\forall i$$

For a given policy $\mu$, its cost $J_\mu(1),\dots,J_\mu(n)$ is the solution of

$$J_\mu(i)=g\big(i,\mu(i)\big)+\alpha\sum_{j=1}^{n}p_{ij}\big(\mu(i)\big)J_\mu(j),\quad\forall i$$

Given initial values $J_0(1),\dots,J_0(n)$, it can be computed with the recursion

$$J_{k+1}(i)=g\big(i,\mu(i)\big)+\alpha\sum_{j=1}^{n}p_{ij}\big(\mu(i)\big)J_k(j),\quad\forall i$$

which converges to $J_\mu(i)$ for every state.

A stationary policy $\mu$ is optimal if and only if $\mu(i)$ attains the minimum in the Bellman equation for every state $i$.

The policy improvement step is therefore

$$\mu^{k+1}(i)=\arg\min_{u\in U(i)}\Big[g(i,u)+\alpha\sum_{j=1}^{n}p_{ij}(u)J_{\mu^k}(j)\Big],\quad\forall i$$

which generates a sequence of improving policies and terminates with an optimal policy.
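
For comparison, a hedged sketch of policy iteration in the same generic representation; here policy evaluation solves the linear system (I − αP_μ)J_μ = g_μ exactly instead of iterating it, which is an equivalent alternative to the iterative evaluation used later in this report. State-dependent action constraints are ignored for brevity.

import numpy as np

def policy_iteration(P, g, alpha):
    """Generic discounted policy iteration with exact policy evaluation."""
    S, U = P[0].shape[0], len(P)
    mu = np.zeros(S, dtype=int)          # initial policy: action 0 everywhere
    while True:
        # Policy evaluation: solve (I - alpha * P_mu) J_mu = g_mu
        P_mu = np.array([P[mu[i]][i] for i in range(S)])
        g_mu = np.array([g[mu[i]][i] for i in range(S)])
        J = np.linalg.solve(np.eye(S) - alpha * P_mu, g_mu)
        # Policy improvement: greedy one-step lookahead on J_mu
        Q = np.stack([g[u] + alpha * P[u] @ J for u in range(U)], axis=1)
        mu_new = Q.argmin(axis=1)
        if np.array_equal(mu_new, mu):
            return J, mu                 # optimal cost and an optimal policy
        mu = mu_new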

Average Cost Problem

Average cost per stage problems

limN1NEwkk=0,1,{N1k=0g(xk,μk(xk),wk)}
  • g(xk,μk(xk),wk)表示从状态 xk 执行动作 μk(xk) 到达状态wk的代价

Value Iteration

The Bellman equation is:

$$\lambda+h(i)=\min_{u\in U(i)}\Big[g(i,u)+\sum_{j=1}^{n}p_{ij}(u)h(j)\Big],\quad\forall i$$

The unknowns are $\lambda$ and $h(i)$: $\lambda$ is the optimal average cost, and $h(i)$ is the differential cost of state $i$, i.e. the gap between the optimal cost of that state and the optimal average cost.

Given $h_k(i)$ and a fixed reference state $s$, compute

$$\lambda_k=\min_{u\in U(s)}\Big[g(s,u)+\sum_{j=1}^{n}p_{sj}(u)h_k(j)\Big]$$

$$h_{k+1}(i)=\min_{u\in U(i)}\Big[g(i,u)+\sum_{j=1}^{n}p_{ij}(u)h_k(j)\Big]-\lambda_k$$
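
A hedged sketch of this relative value iteration in the same generic representation as before (helper names are my own, not the report's code):

import numpy as np

def relative_value_iteration(P, g, s=0, eps=1e-8, max_iter=100000):
    """Generic relative value iteration for the average-cost problem.

    s is the fixed reference state; returns an estimate of (lambda, h)."""
    S, U = P[0].shape[0], len(P)
    h = np.zeros(S)
    for _ in range(max_iter):
        Q = np.stack([g[u] + P[u] @ h for u in range(U)], axis=1)
        lam = Q[s].min()              # lambda_k computed at the fixed state s
        h_new = Q.min(axis=1) - lam   # new differential costs
        if np.max(np.abs(h_new - h)) < eps:
            break
        h = h_new
    return lam, h_new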

Policy Iteration

At iteration $k$, fix the policy $\mu^k$ and evaluate it:

$$\lambda^k+h^k(i)=g\big(i,\mu^k(i)\big)+\sum_{j=1}^{n}p_{ij}\big(\mu^k(i)\big)h^k(j),\quad\forall i,\qquad h^k(n)=0$$

Policy improvement:

$$\mu^{k+1}(i)=\arg\min_{u\in U(i)}\Big[g(i,u)+\sum_{j=1}^{n}p_{ij}(u)h^k(j)\Big],\quad\forall i$$

The algorithm terminates when $\lambda^{k+1}=\lambda^k$ and $h^{k+1}(i)=h^k(i)$ for all $i$.
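
The evaluation equations above, together with the normalization $h^k(n)=0$, form a linear system that can also be solved directly. A minimal sketch, assuming the Markov chain under the fixed policy is unichain so the system is nonsingular (function and variable names are my own):

import numpy as np

def evaluate_policy_average(P_mu, g_mu):
    """Solve lambda + h(i) = g_mu(i) + sum_j p_ij h(j) with h(S-1) = 0.

    P_mu is the (S, S) transition matrix under the fixed policy and g_mu the
    (S,) stage costs; the unknowns are h(0), ..., h(S-2) and lambda."""
    S = P_mu.shape[0]
    A = np.zeros((S, S))
    A[:, :S - 1] = np.eye(S)[:, :S - 1] - P_mu[:, :S - 1]   # coefficients of h(j)
    A[:, S - 1] = 1.0                                        # coefficient of lambda
    x = np.linalg.solve(A, g_mu)
    h = np.append(x[:S - 1], 0.0)                            # enforce h(S-1) = 0
    return x[S - 1], h                                       # (lambda, h)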

Method

Markov decision process model

  • State $i\in\{0,1,\dots,n\}$: number of unfilled orders
  • Action $u\in\{0,1\}$: process (1) or not (0)

    $u\in\{0,1\}$ if $i<n$;  $u=1$ if $i=n$
  • State transition probabilities $p_{ij}(u)$:

    $p_{i1}(1)=p_{i(i+1)}(0)=p$, $p_{i0}(1)=p_{ii}(0)=1-p$ for $i<n$;  $p_{n1}(1)=p$, $p_{n0}(1)=1-p$
  • Per-stage cost:

    $g(i,1)=K$, $g(i,0)=c\,i$
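
To make the model concrete, the following sketch builds the transition matrices and per-stage costs defined above for given p, n, c, K (function and variable names are my own; this is not part of the report's code):

import numpy as np

def build_batch_mdp(p, n, c, K):
    """Transition matrices P[u] and stage costs g[u], u = 0 (wait) or 1 (process).

    States 0..n count the unfilled orders; processing clears the batch and a
    new order then arrives with probability p."""
    S = n + 1
    P = [np.zeros((S, S)), np.zeros((S, S))]
    g = [np.zeros(S), np.zeros(S)]
    for i in range(S):
        # u = 1 (process): move to 0 or 1 unfilled orders, pay the setup cost K
        P[1][i, 0], P[1][i, 1] = 1 - p, p
        g[1][i] = K
        if i < n:
            # u = 0 (wait): stay at i or grow to i + 1, pay the holding cost c*i
            P[0][i, i], P[0][i, i + 1] = 1 - p, p
            g[0][i] = c * i
        else:
            # At i = n only processing is admissible; mirror u = 1 so P[0] stays stochastic
            P[0][i] = P[1][i]
            g[0][i] = g[1][i]
    return P, g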

For the Discounted Problem

Value Iteration

Initial values

$$J_0(i)=0,\quad\forall i$$

Value iteration

$$J_{k+1}(i)=\min\big[K+\alpha(1-p)J_k(0)+\alpha p\,J_k(1),\;\; c\,i+\alpha(1-p)J_k(i)+\alpha p\,J_k(i+1)\big],\quad i=0,1,\dots,n-1$$

$$J_{k+1}(n)=K+\alpha(1-p)J_k(0)+\alpha p\,J_k(1)$$

Convergence:

$$J^*(i)=\lim_{k\to\infty}J_k(i)$$

Policy Iteration

Initial policy

$$\mu_0(i)=1,\quad\forall i$$

Policy evaluation (iterated until convergence for the fixed policy $\mu_k$)

$$J_{\mu_k}^{(m+1)}(i)=\begin{cases}K+\alpha(1-p)J_{\mu_k}^{(m)}(0)+\alpha p\,J_{\mu_k}^{(m)}(1), & \mu_k(i)=1\\ c\,i+\alpha(1-p)J_{\mu_k}^{(m)}(i)+\alpha p\,J_{\mu_k}^{(m)}(i+1), & \mu_k(i)=0\end{cases}\qquad i=0,1,\dots,n-1$$

$$J_{\mu_k}^{(m+1)}(n)=K+\alpha(1-p)J_{\mu_k}^{(m)}(0)+\alpha p\,J_{\mu_k}^{(m)}(1)$$

Policy improvement

$$\mathrm{cost}_{\mathrm{process}}(i)=K+\alpha(1-p)J_{\mu_k}(0)+\alpha p\,J_{\mu_k}(1)$$

$$\mathrm{cost}_{\mathrm{unprocess}}(i)=c\,i+\alpha(1-p)J_{\mu_k}(i)+\alpha p\,J_{\mu_k}(i+1)$$

$$\mu_{k+1}(i)=\begin{cases}1, & \mathrm{cost}_{\mathrm{process}}(i)<\mathrm{cost}_{\mathrm{unprocess}}(i)\\ 0, & \text{otherwise}\end{cases}\qquad i=0,\dots,n-1,\qquad \mu_{k+1}(n)=1$$

For the Average Cost Problem

The Bellman equation is:

$$\lambda+h(i)=\min\big[K+(1-p)h(0)+p\,h(1),\;\; c\,i+(1-p)h(i)+p\,h(i+1)\big],\quad i=0,1,\dots,n-1$$

$$\lambda+h(n)=K+(1-p)h(0)+p\,h(1)$$

Value Iteration

Initial values

$$h_0(i)=0,\quad\forall i$$

The reference state is fixed to 0: every renewal cycle starts from and returns to state 0.

Value iteration (since $K>0$, the minimum at state 0 is always attained by not processing)

$$\lambda_k=(1-p)h_k(0)+p\,h_k(1)$$

$$h_{k+1}(i)=\min\big[K+(1-p)h_k(0)+p\,h_k(1),\;\; c\,i+(1-p)h_k(i)+p\,h_k(i+1)\big]-\lambda_k,\quad i=0,1,\dots,n-1$$

$$h_{k+1}(n)=K+(1-p)h_k(0)+p\,h_k(1)-\lambda_k$$

Convergence:

$$h^*(i)=\lim_{k\to\infty}h_k(i),\qquad \lambda^*=\lim_{k\to\infty}\lambda_k$$

Policy Iteration

Initial policy

$$\mu_0(i)=1,\quad\forall i$$

Policy evaluation (iterated until convergence for the fixed policy $\mu_k$)

$$\lambda^{(m)}=(1-p)h^{(m)}(0)+p\,h^{(m)}(1)$$

$$h^{(m+1)}(i)=\begin{cases}K+(1-p)h^{(m)}(0)+p\,h^{(m)}(1)-\lambda^{(m)}, & \mu_k(i)=1\\ c\,i+(1-p)h^{(m)}(i)+p\,h^{(m)}(i+1)-\lambda^{(m)}, & \mu_k(i)=0\end{cases}\qquad i=0,1,\dots,n-1$$

$$h^{(m+1)}(n)=K+(1-p)h^{(m)}(0)+p\,h^{(m)}(1)-\lambda^{(m)}$$

Policy improvement

$$\mathrm{cost}_{\mathrm{process}}(i)=K+(1-p)h_{\mu_k}(0)+p\,h_{\mu_k}(1)$$

$$\mathrm{cost}_{\mathrm{unprocess}}(i)=c\,i+(1-p)h_{\mu_k}(i)+p\,h_{\mu_k}(i+1)$$

$$\mu_{k+1}(i)=\begin{cases}1, & \mathrm{cost}_{\mathrm{process}}(i)<\mathrm{cost}_{\mathrm{unprocess}}(i)\\ 0, & \text{otherwise}\end{cases}\qquad i=0,\dots,n-1,\qquad \mu_{k+1}(n)=1$$

Deriving the policy from the value function

Normally one would extract the policy greedily from a state-action (Q) value function. In this experiment, however, value iteration yields a state value function, which only measures how good each state is. Since the batch manufacturing problem offers only two decisions, process and unprocess, the greedy reading is: if not processing can still lead to a better state, choose unprocess; otherwise, process. Therefore, once the value function is obtained, every state whose value attains the maximum is assigned the decision process, and all other states are assigned unprocess.
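
An equivalent cross-check is a one-step greedy lookahead on the converged table, which is exactly the comparison used in the policy improvement steps above. A small sketch (the helper name is my own, not part of the report's code):

def greedy_policy(value, p, n, c, K, alpha=1.0):
    """One-step greedy lookahead on a converged value / differential-cost table.

    With alpha < 1 this is the discounted comparison; with alpha = 1 it is the
    average-cost comparison (a constant shift of the table cancels on both sides)."""
    policy = []
    for i in range(n):
        process_cost = K + alpha * ((1 - p) * value[0] + p * value[1])
        unprocess_cost = c * i + alpha * ((1 - p) * value[i] + p * value[i + 1])
        policy.append(1 if process_cost < unprocess_cost else 0)
    policy.append(1)   # state n must process
    return policy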

Implementation

Value Iteration for the Discounted Problem

def _iteration_discounted(self, p, n, c, K, alpha):
    # Current value table, initialised to the zero vector
    value = [0] * (n + 1)
    # Value table from the previous iteration, initialised to the zero vector
    last_value = [0] * (n + 1)
    # Iteration counter
    turns = 0
    while True:
        # One more iteration
        turns = turns + 1
        # Update value[0 ... n-1] with the value iteration equation
        for i in range(n):
            value[i] = min(
                K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1],
                c * i + alpha * (1 - p) * last_value[i] + alpha * p * last_value[i + 1]
            )
        # Update value[n]: processing is the only admissible action
        value[n] = K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1]
        # Converged once the two vectors differ by less than EPS componentwise
        if is_terminal(value, last_value, Value.EPS):
            break
        last_value = value.copy()
    
    return {
        "Iteration turns" : turns,
        "Value function" : format_float(value),
        "Policy" : self._get_policy(value)
    }

Policy Iteration for the Discounted Problem

# Policy evaluation: compute the value table of the given policy
def _get_value_discounted(self, p, n, c, K, alpha, policy):
    value = [0] * (n + 1)
    last_value = [0] * (n + 1)
    turns = 0
    while True:
        turns = turns + 1
        for i in range(n):
            value[i] = (
                K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1] if policy[i] else
                c * i + alpha * (1 - p) * last_value[i] + alpha * p * last_value[i + 1]
            )
        value[n] = K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1]
        if is_terminal(value, last_value, Policy.EPS):
            break
        last_value = value.copy()
    
    return value, turns

# Policy improvement: policy iteration
def _iteration_discounted(self, p, n, c, K, alpha):
    # Policy of the current round (for now, the initial policy)
    policy = [0] * n + [1]
    # Policy of the previous round
    last_policy = [0] * n + [1]
    # Counter of policy improvement rounds
    turns = 0
    # Total number of policy evaluation sweeps
    value_turns = 0
    while True:
        # One more round
        turns = turns + 1
        # Evaluate the previous policy to obtain its value table
        value, turns_i = self._get_value_discounted(p, n, c, K, alpha, last_policy)
        value_turns += turns_i

        # Compute the improved policy from the value table
        for i in range(n):
            # Cost if we process
            process_cost = K + alpha * (1 - p) * value[0] + alpha * p * value[1]
            # Cost if we do not process
            unprocess_cost = c * i + alpha * (1 - p) * value[i] + alpha * p * value[i + 1]
            # Pick the action with the smaller cost
            policy[i] = 1 if process_cost < unprocess_cost else 0
        # The action set of state n is {1}
        policy[n] = 1

        # Terminate when the policy no longer changes
        if policy == last_policy:
            break
        last_policy = policy.copy()

    return {
        "Policy improvement iteration turns" : turns,
        "Calucating value function total iteration turns" : value_turns,
        "Value function" : format_float(value),
        "Policy" : policy
    }

Value Iteration for the Average Cost Problem

def _iteration_average(self, p, n, c, K):
    # Differential cost table of the current iteration
    value_h = [0] * (n + 1)
    # Differential cost table of the previous iteration
    last_value_h = [0] * (n + 1)
    # Average cost estimate of the previous iteration
    last_lambda_k = 0
    # Iteration counter
    turns = 0
    while True:
        # One more iteration
        turns = turns + 1
        # Fix state 0 and compute the average cost estimate
        lambda_k = min(
            K + (1 - p) * last_value_h[0] + p * last_value_h[1],
            (1 - p) * last_value_h[0] + p * last_value_h[1]
        )
        # Update the differential costs of states 0 ... n-1
        for i in range(n):
            value_h[i] = min(
                K + (1 - p) * last_value_h[0] + p * last_value_h[1],
                c * i + (1 - p) * last_value_h[i] + p * last_value_h[i + 1]
            ) - lambda_k
        # Update the differential cost of state n
        value_h[n] = K + (1 - p) * last_value_h[0] + p * last_value_h[1] - lambda_k
        # Terminate once both the differential costs and lambda have converged
        if is_terminal(value_h, last_value_h, Value.EPS) and abs(last_lambda_k - lambda_k) < Value.EPS:
            break
        # Roll over to the next iteration
        last_value_h = value_h.copy()
        last_lambda_k = lambda_k
    
    value = [i + lambda_k for i in value_h]
    
    return {
        "Iteration turns" : turns,
        "Value function" : format_float(value),
        "Policy" : self._get_policy(value),
        "Optimal average cost" : format_float(lambda_k)
    }

Policy Iteration for the Average Cost Problem

# Policy evaluation: compute the average cost and the differential cost table of the given policy
def _get_value_average(self, p, n, c, K, policy):
    value_h = [0] * (n + 1)
    last_value_h = [0] * (n + 1)
    last_lambda_k = 0
    turns = 0
    while True:
        turns = turns + 1
        lambda_k = min(
            K + (1 - p) * last_value_h[0] + p * last_value_h[1],
            (1 - p) * last_value_h[0] + p * last_value_h[1]
        )
        for i in range(n):
            value_h[i] = (
                K + (1 - p) * last_value_h[0] + p * last_value_h[1] if policy[i] else
                c * i + (1 - p) * last_value_h[i] + p * last_value_h[i + 1]
            ) - lambda_k
        value_h[n] = K + (1 - p) * last_value_h[0] + p * last_value_h[1] - lambda_k
        if is_terminal(value_h, last_value_h, Policy.EPS) and abs(last_lambda_k - lambda_k) < Policy.EPS:
            break
        last_value_h = value_h.copy()
        last_lambda_k = lambda_k
    
    return lambda_k, value_h, turns

# Policy improvement: policy iteration
def _iteration_average(self, p, n, c, K):
    # Policy of the current round
    policy = [1] * (n + 1)
    # Policy of the previous round
    last_policy = [1] * (n + 1)
    # Counter of policy improvement rounds
    turns = 0
    # Total number of policy evaluation sweeps
    value_turns = 0
    while True:
        # One more round
        turns = turns + 1
        # Evaluate the previous policy: average cost and differential cost table
        lambda_k, value, turns_i = self._get_value_average(p, n, c, K, last_policy)
        value_turns += turns_i

        # Policy improvement
        for i in range(n):
            process_cost = K + (1 - p) * value[0] + p * value[1]
            unprocess_cost = c * i + (1 - p) * value[i] + p * value[i + 1]
            policy[i] = 1 if process_cost < unprocess_cost else 0
        policy[n] = 1

        # Terminate when the policy no longer changes
        if policy == last_policy:
            break
        last_policy = policy.copy()

    # Shift the differential costs by lambda to report the value function
    value_f = [i + lambda_k for i in value]

    return {
        "Policy improvement iteration turns" : turns,
        "Calucating value function total iteration turns" : value_turns,
        "Value function" : format_float(value_f),
        "Policy" : policy,
        "Optimal average cost" : format_float(lambda_k)
    }

Utility functions

# Check that every element of the lists x and y differs by less than EPS
def is_terminal(x, y, EPS):
    if x is None or y is None:
        return False

    for (xi, yi) in zip(x, y):
        if abs(xi - yi) > EPS:
            return False
    return True

# Round a list or a float to 4 decimal places
def format_float(v):
    if type(v) == type(list()):
        return [float('{:.4f}'.format(i)) for i in v]
    else:
        return round(v, 4)

# Derive the policy from the value function
def _get_policy(self, value):
    max_value = max(value)
    policy = [0 if i < max_value - Value.EPS else 1 for i in value]
    return policy

Experiments

The commands used to run the experiments are listed in the appendix.

Discounted Problem

Comparing different values of n

Used value iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 8, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 12, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Used policy iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 8, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1]

p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

p = 0.5, n = 12, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

On the one hand, as long as the other parameters are fixed and the processing threshold lies below n, the value function and the policy do not change as n grows. On the other hand, with this set of parameters value iteration and policy iteration yield the same value function and policy; although policy iteration needs fewer policy improvement rounds, the total number of policy evaluation sweeps it performs exceeds the number of iterations value iteration needs.

Comparing different values of K

Used value iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 1.0, K = 3.0, alpha = 0.9.
    Iteration turns : 91
    Value function : [10.5741, 12.9241, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 8.0, alpha = 0.9.
    Iteration turns : 98
    Value function : [18.358, 22.4377, 25.2018, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
Used policy iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 1.0, K = 3.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 378
    Value function : [10.5741, 12.9241, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741, 13.5741]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 8.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 407
    Value function : [18.358, 22.4377, 25.2018, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358, 26.358]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

When K changes, the two methods still agree; although policy iteration needs fewer policy improvement rounds, its total number of policy evaluation sweeps exceeds the number of iterations required by value iteration.

As K increases, the value of every state increases, and the resulting optimal policy lets orders accumulate for more periods before processing, which matches intuition.

Comparing different values of c

Used value iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 0.5, K = 5.0, alpha = 0.9.
    Iteration turns : 92
    Value function : [10.319, 12.6123, 14.3042, 15.2609, 15.319, 15.319, 15.319, 15.319, 15.319, 15.319, 15.319]
    Policy : [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 2.0, K = 5.0, alpha = 0.9.
    Iteration turns : 97
    Value function : [19.1242, 23.3742, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Used policy iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 0.5, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 387
    Value function : [10.3191, 12.6124, 14.3042, 15.2609, 15.3191, 15.3191, 15.3191, 15.3191, 15.3191, 15.3191, 15.3191]
    Policy : [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 2.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 398
    Value function : [19.1242, 23.3742, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242, 24.1242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

When c changes, the two methods still agree; again, policy iteration needs fewer policy improvement rounds but performs more policy evaluation sweeps in total than value iteration needs iterations.

As c increases, the value of every state increases, and the resulting optimal policy processes orders after fewer periods of waiting, which matches intuition.

Comparing different values of α

Used value iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.0.
    Iteration turns : 2
    Value function : [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
    Policy : [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.1.
    Iteration turns : 6
    Value function : [0.0617, 1.1728, 2.2839, 3.3935, 4.4769, 5.0617, 5.0617, 5.0617, 5.0617, 5.0617, 5.0617]
    Policy : [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.5.
    Iteration turns : 16
    Value function : [0.9615, 2.8845, 4.6538, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Used policy iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.0.
    Policy improvement iteration turns : 2
    Calucating value function total iteration turns : 4
    Value function : [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
    Policy : [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.1.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 18
    Value function : [0.0617, 1.1728, 2.2839, 3.3935, 4.4769, 5.0617, 5.0617, 5.0617, 5.0617, 5.0617, 5.0617]
    Policy : [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.5.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 49
    Value function : [0.9615, 2.8846, 4.6538, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615, 5.9615]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

These experiments show that, for the discounted problem, value iteration and policy iteration produce identical results and differ only in the number of iterations.

The closer α is to 1, the more weight is placed on future costs; the closer α is to 0, the less. As α increases,

  • the previous iterate has a larger influence on the current one, so convergence slows down and the number of iterations grows;
  • the previous iterate is folded into the current value function with less attenuation, so the converged value function becomes larger;
  • the policy weighs long-term costs more heavily, and since processing eventually pays off most as orders accumulate, the processing decision point moves to smaller states;
  • with α = 1 the costs would keep accumulating and the iteration would never terminate.

Comparing different values of p

Used value iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.1, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 85
    Value function : [4.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991]
    Policy : [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 95
    Value function : [14.6242, 17.8742, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242, 19.6242]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.9, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 98
    Value function : [21.1872, 23.803, 25.5072, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
Used policy iteration to solve batch manufacturing problem, formulate with discounted problem:
p = 0.1, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 284
    Value function : [4.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991, 9.4991]
    Policy : [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 297
    Value function : [14.6241, 17.8741, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241, 19.6241]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    
p = 0.9, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 401
    Value function : [21.1872, 23.8031, 25.5072, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872, 26.1872]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

These experiments again show that, for the discounted problem, value iteration and policy iteration produce identical results and differ only in the number of iterations.

As p increases, an order arrives with higher probability, so

  • waiting becomes more costly each period, yet a new order is also likely to arrive at the next period, which makes processing at the next period more worthwhile, so the processing decision point is pushed to larger states;
  • because the decision point moves back, the whole value function increases;
  • and because the value function increases, more iterations are needed to converge.

Average Cost Problem

Comparing different values of n

Used value iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 8, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 12, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
Used policy iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 8, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 5
    Calucating value function total iteration turns : 100
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 5
    Calucating value function total iteration turns : 100
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 12, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 5
    Calucating value function total iteration turns : 100
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75

A few observations:

  • when n already exceeds the waiting window, increasing n changes neither the results of value iteration and policy iteration nor the number of iterations;
  • the obtained value function confirms that the state values form a nondecreasing sequence, as expected, and value[0] equals the optimal average cost, also as expected;
  • comparing the two methods, value iteration and policy iteration yield the same value function and policy;
  • in terms of iteration counts, policy iteration is computationally more expensive.

Comparing different values of K

Used value iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 10, c = 1.0, K = 3.0, alpha = 0.9.
    Iteration turns : 5
    Value function : [1.25, 3.75, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.25
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 10, c = 1.0, K = 8.0, alpha = 0.9.
    Iteration turns : 18
    Value function : [2.3333, 7.0001, 9.6667, 10.3333, 10.3333, 10.3333, 10.3333, 10.3333, 10.3333, 10.3333, 10.3333]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.3333
Used policy iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 10, c = 1.0, K = 3.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 42
    Value function : [1.25, 3.75, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25, 4.25]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.25
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 5
    Calucating value function total iteration turns : 100
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 10, c = 1.0, K = 8.0, alpha = 0.9.
    Policy improvement iteration turns : 6
    Calucating value function total iteration turns : 251
    Value function : [2.3334, 7.0001, 9.6667, 10.3334, 10.3334, 10.3334, 10.3334, 10.3334, 10.3334, 10.3334, 10.3334]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.3334

A few observations:

  • When K changes, the two methods still agree; although policy iteration needs fewer policy improvement rounds, its total number of policy evaluation sweeps exceeds the number of iterations required by value iteration.
  • As K increases, the value of every state increases, and the optimal policy lets orders accumulate for more periods before processing, which matches intuition.
  • As K increases, the computation cost grows and convergence takes longer.

Comparing different values of c

Used value iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 10, c = 0.5, K = 5.0, alpha = 0.9.
    Iteration turns : 18
    Value function : [1.3333, 4.0, 5.6667, 6.3333, 6.3333, 6.3333, 6.3333, 6.3333, 6.3333, 6.3333, 6.3333]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.3333
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.5, n = 10, c = 2.0, K = 5.0, alpha = 0.9.
    Iteration turns : 5
    Value function : [2.25, 6.75, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.25
Used policy iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.5, n = 10, c = 0.5, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 240
    Value function : [1.375, 4.1251, 5.875, 6.625, 6.375, 6.375, 6.375, 6.375, 6.375, 6.375, 6.375]
    Policy : [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.375
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 96
    Value function : [1.8334, 5.5001, 7.1667, 6.8334, 6.8334, 6.8334, 6.8334, 6.8334, 6.8334, 6.8334, 6.8334]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.8334
    
p = 0.5, n = 10, c = 2.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 28
    Value function : [2.25, 6.75, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25, 7.25]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.25

As c increases, the value of every state increases, and the resulting optimal policy processes orders after fewer periods of waiting, which matches intuition.

Comparing different values of p

Used value iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.1, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 8
    Value function : [0.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5]
    Policy : [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 0.5
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 6
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
p = 0.9, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Iteration turns : 62
    Value function : [2.5, 5.2777, 6.9444, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.5
Used policy iteration to solve batch manufacturing problem, formulate with average_cost problem:
p = 0.1, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 3
    Calucating value function total iteration turns : 231
    Value function : [0.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5, 5.5]
    Policy : [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 0.5
    
    
p = 0.5, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 5
    Calucating value function total iteration turns : 100
    Value function : [1.75, 5.25, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75, 6.75]
    Policy : [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 1.75
    
    
p = 0.9, n = 10, c = 1.0, K = 5.0, alpha = 0.9.
    Policy improvement iteration turns : 4
    Calucating value function total iteration turns : 338
    Value function : [2.5, 5.2778, 6.9444, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5, 7.5]
    Policy : [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    Optimal average cost : 2.5

These experiments show that, for the average cost problem, value iteration and policy iteration also produce identical results, though the iteration counts differ.

As p increases, an order arrives with higher probability, so

  • waiting becomes more costly each period, yet a new order is also likely to arrive at the next period, which makes processing at the next period more worthwhile, so the processing decision point is pushed to larger states;
  • because the decision point moves back, the whole value function increases;
  • and because the value function increases, more iterations are needed to converge.

Summary

The detailed results, comparisons and analysis are given in the experiments above.

  • For the discounted problem, value iteration and policy iteration produce identical results and differ only in computation cost; the latter is more expensive because, besides policy improvement, it must also evaluate each policy, and policy evaluation is relatively costly;
  • For the average cost problem, value iteration and policy iteration also produce identical results, and again the latter is more expensive;
  • Increasing n has no effect on the results as long as the threshold lies within the range of n;
  • Increasing c increases the value of every state and reduces the computation cost; the optimal policy processes orders after fewer periods of waiting, which matches intuition;
  • Increasing K increases the value of every state and the computation cost; the optimal policy lets orders accumulate for more periods, which matches intuition;
  • Increasing α increases the influence of the previous iterate on the current one, so convergence slows down and the number of iterations grows; the previous iterate is folded into the current value function with less attenuation, so the converged value function is larger; the policy weighs long-term costs more heavily, so the processing decision point moves to smaller states;
  • Increasing p makes waiting more costly each period, but a new order is also likely to arrive next period, which makes processing at the next period more worthwhile, so the processing decision point is pushed to larger states; as a result the value function increases overall and more iterations are needed to converge.

Appendix

Code

main.py

import argparse

from algorithm import Value, Policy

if __name__ == '__main__':

    parser = argparse.ArgumentParser(description='Program to solve batch manufacturing problem.')

    parser.add_argument('--p', type=float, nargs='+', default=[0.5], help='Probability of receiving an order for the product.')
    parser.add_argument('--n', type=int, nargs='+', default=[5], help='The maximum number of orders that can remain unfilled.')
    parser.add_argument('--c', type=float, nargs='+', default=[1.0], help='The cost per unfilled order at each time period.')
    parser.add_argument('--K', type=float, nargs='+', default=[3.0], help='The setup cost to process the unfilled orders.')
    parser.add_argument('--alpha', type=float, nargs='+', default=[0.8], help='Discount factor.')
    
    parser.add_argument('--problem', type=str, nargs='+', choices=['discounted', 'average_cost'], default=['discounted', 'average_cost'], help='Problem in [discounted, average_cost].')
    parser.add_argument('--algorithm', type=str, choices=['value', 'policy'], nargs='+', default=['value', 'policy'], help='Algorithm in [value, policy] iteration.')
    
    args = parser.parse_args()

    for algorithm in args.algorithm:

        for problem in args.problem:

            method = {
                'value' : Value(problem),
                'policy' : Policy(problem)
            }.get(algorithm)
            print('Used {} iteration to solve batch manufacturing problem, formulate with {} problem:'.format(algorithm, problem))

            for p in args.p:
                for n in args.n:
                    for c in args.c:
                        for K in args.K:
                            for alpha in args.alpha:
                                result = method.iteration(p, n, c, K, alpha)
                                print('p = {}, n = {}, c = {}, K = {}, alpha = {}.'.format(p, n, c, K, alpha))
                                for key, value in result.items():
                                    print(key, ":", value)
                                print("\n")

utils.py

def is_terminal(x, y, EPS):
    
    if x is None or y is None:
        return False

    for (xi, yi) in zip(x, y):
        if abs(xi - yi) > EPS:
            return False
    return True

def format_float(v):
    if type(v) == type(list()):
        return [float('{:.4f}'.format(i)) for i in v]
    else:
        return round(v, 4)

algorithm.py

from utils import is_terminal, format_float

class Value:

    EPS = 1e-4

    def __init__(self, problem):
        self.problem = problem

    def _get_policy(self, value):
        max_value = max(value)
        policy = [0 if i < max_value - Value.EPS else 1 for i in value]
        return policy
    
    def _iteration_discounted(self, p, n, c, K, alpha):
        value = [0] * (n + 1)
        last_value = [0] * (n + 1)
        turns = 0
        while True:
            turns = turns + 1
            for i in range(n):
                value[i] = min(
                    K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1],
                    c * i + alpha * (1 - p) * last_value[i] + alpha * p * last_value[i + 1]
                )
            value[n] = K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1]
            if is_terminal(value, last_value, Value.EPS):
                break
            last_value = value.copy()
        
        return {
            "Iteration turns" : turns,
            "Value function" : format_float(value),
            "Policy" : self._get_policy(value)
        }
    
    def _iteration_average(self, p, n, c, K):
        value_h = [0] * (n + 1)
        last_value_h = [0] * (n + 1)
        lambda_k = last_lambda_k = 0
        turns = 0
        while True:
            turns = turns + 1
            lambda_k = min(
                K + (1 - p) * last_value_h[0] + p * last_value_h[1],
                (1 - p) * last_value_h[0] + p * last_value_h[1]
            )
            for i in range(n):
                value_h[i] = min(
                    K + (1 - p) * last_value_h[0] + p * last_value_h[1],
                    c * i + (1 - p) * last_value_h[i] + p * last_value_h[i + 1]
                ) - lambda_k
            value_h[n] = K + (1 - p) * last_value_h[0] + p * last_value_h[1] - lambda_k
            if is_terminal(value_h, last_value_h, Value.EPS) and abs(last_lambda_k - lambda_k) < Value.EPS:
                break
            last_value_h = value_h.copy()
            last_lambda_k = lambda_k
        
        value = [i + lambda_k for i in value_h]
        
        return {
            "Iteration turns" : turns,
            "Value function" : format_float(value),
            "Policy" : self._get_policy(value),
            "Optimal average cost" : format_float(lambda_k)
        }


    def iteration(self, p, n, c, K, alpha):
        return {
            'discounted' : self._iteration_discounted(p, n, c, K, alpha),
            'average_cost' : self._iteration_average(p, n, c, K)
        }.get(self.problem)


class Policy:

    EPS = 1e-4

    def __init__(self, problem):
        self.problem = problem
    
    def _get_value_discounted(self, p, n, c, K, alpha, policy):
        value = [0] * (n + 1)
        last_value = [0] * (n + 1)
        turns = 0
        while True:
            turns = turns + 1
            for i in range(n):
                value[i] = (
                    K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1] if policy[i] else
                    c * i + alpha * (1 - p) * last_value[i] + alpha * p * last_value[i + 1]
                )
            value[n] = K + alpha * (1 - p) * last_value[0] + alpha * p * last_value[1]
            if is_terminal(value, last_value, Policy.EPS):
                break
            last_value = value.copy()
        
        return value, turns
    
    def _get_value_average(self, p, n, c, K, policy):
        value_h = [0] * (n + 1)
        last_value_h = [0] * (n + 1)
        last_lambda_k = 0
        turns = 0
        while True:
            turns = turns + 1
            lambda_k = min(
                K + (1 - p) * last_value_h[0] + p * last_value_h[1],
                (1 - p) * last_value_h[0] + p * last_value_h[1]
            )
            for i in range(n):
                value_h[i] = (
                    K + (1 - p) * last_value_h[0] + p * last_value_h[1] if policy[i] else
                    c * i + (1 - p) * last_value_h[i] + p * last_value_h[i + 1]
                ) - lambda_k
            value_h[n] = K + (1 - p) * last_value_h[0] + p * last_value_h[1] - lambda_k
            if is_terminal(value_h, last_value_h, Policy.EPS) and abs(last_lambda_k - lambda_k) < Policy.EPS:
                break
            last_value_h = value_h.copy()
            last_lambda_k = lambda_k
        
        return lambda_k, value_h, turns
    
    def _iteration_discounted(self, p, n, c, K, alpha):
        policy = [1] * (n + 1)
        last_policy = [1] * (n + 1)
        turns = 0
        value_turns = 0
        while True:
            turns = turns + 1
            value, turns_i = self._get_value_discounted(p, n, c, K, alpha, last_policy)
            value_turns += turns_i

            for i in range(n):
                process_cost = K + alpha * (1 - p) * value[0] + alpha * p * value[1]
                unprocess_cost = c * i + alpha * (1 - p) * value[i] + alpha * p * value[i + 1]
                policy[i] = 1 if process_cost < unprocess_cost else 0
            policy[n] = 1

            if policy == last_policy:
                break
            last_policy = policy.copy()

        return {
            "Policy improvement iteration turns" : turns,
            "Calucating value function total iteration turns" : value_turns,
            "Value function" : format_float(value),
            "Policy" : policy
        }
    
    def _iteration_average(self, p, n, c, K):
        policy = [1] * (n + 1)
        last_policy = [1] * (n + 1)
        turns = 0
        value_turns = 0
        while True:
            turns = turns + 1
            lambda_k, value, turns_i = self._get_value_average(p, n, c, K, last_policy)
            value_turns += turns_i

            for i in range(n):
                process_cost = K + (1 - p) * value[0] + p * value[1]
                unprocess_cost = c * i + (1 - p) * value[i] + p * value[i + 1]
                policy[i] = 1 if process_cost < unprocess_cost else 0
            policy[n] = 1

            if last_policy == policy:
                break
            last_policy = policy.copy()
        
        value_f = [i + lambda_k for i in value]

        return {
            "Policy improvement iteration turns" : turns,
            "Calucating value function total iteration turns" : value_turns,
            "Value function" : format_float(value_f),
            "Policy" : policy,
            "Optimal average cost" : format_float(lambda_k)
        }

    def iteration(self, p, n, c, K, alpha):
        return {
            'discounted' : self._iteration_discounted(p, n, c, K, alpha),
            'average_cost' : self._iteration_average(p, n, c, K)
        }.get(self.problem)

Commands

Discounted problem experiments

Command 1.1: comparing different n

python main.py --n 8 10 12 --K 5 --c 1 --alpha 0.9 --p 0.5 --problem discounted --algorithm value policy

Command 1.2: comparing different K

python main.py --n 10 --K 3 5 8 --c 1 --alpha 0.9 --p 0.5 --problem discounted --algorithm value policy

Command 1.3: comparing different c

python main.py --n 10 --K 5 --c 0.5 1 2 --alpha 0.9 --p 0.5 --problem discounted --algorithm value policy

Command 1.4: comparing different alpha

python main.py --n 10 --K 5 --c 1 --alpha 0 0.1 0.5 0.9 --p 0.5 --problem discounted --algorithm value policy

Command 1.5: comparing different p

python main.py --n 10 --K 5 --c 1 --alpha 0.9 --p 0.1 0.5 0.9 --problem discounted --algorithm value policy

Average cost problem experiments

Command 2.1: comparing different n

python main.py --n 8 10 12 --K 5 --c 1 --alpha 0.9 --p 0.5 --problem average_cost --algorithm value policy

Command 2.2: comparing different K

python main.py --n 10 --K 3 5 8 --c 1 --alpha 0.9 --p 0.5 --problem average_cost --algorithm value policy

Command 2.3: comparing different c

python main.py --n 10 --K 5 --c 0.5 1 2 --alpha 0.9 --p 0.5 --problem average_cost --algorithm value policy

Command 2.4: comparing different p

python main.py --n 10 --K 5 --c 1 --alpha 0.9 --p 0.1 0.5 0.9 --problem average_cost --algorithm value policy

