An intermediate policy training method for an agent is provided. The method includes: selecting a source policy and a target policy from a plurality of policies, where each policy is configured to drive an agent to perform a plurality of actions to be in a plurality of states, and each state includes a plurality of physical properties; respectively selecting one of the plurality of states from the source policy and the target policy as a source state and a target state; and training an intermediate policy by reinforcement learning to transition the agent from the source state to the target state over an episode, where the reinforcement learning includes an annealing function for setting a plurality of tolerance boundaries of the plurality of physical properties, and the tolerance boundaries gradually shrink during the episode.
CROSS-REFERENCE TO RELATED APPLICATIONS
This non-provisional application claims priority under 35 U.S.C. § 119 (a) on Patent Application No(s). 202410758217.2 filed in China on Jun. 12, 2024, the entire contents of which are hereby incorporated by reference.
BACKGROUND
1. Technical Field
The present disclosure relates to reinforcement learning, and more particularly to an intermediate policy training system and method for an agent.
2. Related Art
Reinforcement Learning (RL) for quadruped locomotion has achieved state-of-the-art results in recent years. Only a few years ago, the best controllers were hand-crafted, carefully tuned methods based on kinematic models of the robot. These traditional controllers had to rely on approximations of real-world dynamics and required substantial engineering effort to be effective. Nowadays, RL-based controllers can learn optimal policies from simulation in a data-driven manner. These controllers can develop emergent behaviors not previously possible and can handle the most challenging terrains and environments. RL controllers can generate novel gaits that produce energy-efficient movement, and, with the addition of constraints, can also reproduce the visually aesthetic gaits of traditional controllers.
A high-level task is commonly broken down into multiple lower-level tasks. In this scenario, each low-level task is executed by a single, independent RL policy. However, the issue of connecting these multiple policies arises: because each policy is trained independently, switching execution from one task policy to another is not always straightforward.
A common practice for connecting multiple low-level task policies is to use a hierarchical approach, in which a high-level policy outputs activation weights for the lower-level policies. However, this becomes an issue when a new low-level policy must be added to the system: because the high-level policy is conditioned on the low-level ones, adding a new policy requires retraining the whole system. Another potential downside of this approach is that the number of parameters of the high-level policy may need to grow as the number of low-level policies increases.
SUMMARY
In light of the above descriptions, the present disclosure proposes a system and method for training an intermediate policy for an agent, thereby addressing the aforementioned issues.
According to one or more embodiments of the present disclosure, an intermediate policy training method for an agent is performed by a processor and includes the following steps: selecting a source policy and a target policy from a plurality of policies, where each policy is configured to drive an agent to perform a plurality of actions to be in a plurality of states, and each state includes a plurality of physical properties; respectively selecting one of the plurality of states from the source policy and the target policy as a source state and a target state; and training an intermediate policy by reinforcement learning to transition the agent from the source state to the target state over an episode, where the reinforcement learning includes an annealing function for setting a plurality of tolerance boundaries of the plurality of physical properties, and the plurality of tolerance boundaries gradually shrink during the episode.
According to one or more embodiments of the present disclosure, an intermediate policy training system for an agent includes a storage device and a processor. The storage device is configured to store a plurality of instructions. The processor is electrically connected to the storage device to execute the plurality of instructions and cause a plurality of operations. The plurality of operations includes: selecting a source policy and a target policy from a plurality of policies, wherein each of the plurality of policies is configured to drive an agent to perform a plurality of actions to be in a plurality of states, with each of the plurality of states comprising a plurality of physical properties; respectively selecting one of the plurality of states from the source policy and the target policy as a source state and a target state; and training an intermediate policy by reinforcement learning to transition the agent from the source state to the target state over an episode, wherein the reinforcement learning includes an annealing function for setting a plurality of tolerance boundaries of the plurality of physical properties, and the plurality of tolerance boundaries gradually shrink during the episode.
In view of the above, the intermediate policy training system and method for an agent proposed in the present disclosure have the following contributions and effects. Firstly, a generic linking mechanism, the intermediate policy, can transition between any two reinforcement learning policies as long as a target state is provided. Secondly, the intermediate policy can smoothly transition to a generic target state via a shrinking-boundaries reward function design.
BRIEF DESCRIPTION OF THE DRAWINGS
The present disclosure will become more fully understood from the detailed description given hereinbelow and the accompanying drawings which are given by way of illustration only and thus are not limitative of the present disclosure and wherein:
FIG. 1 is a block diagram of the intermediate policy training system for an agent according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of the intermediate policy training method for an agent according to an embodiment of the present disclosure;
FIG. 3 is an example schematic diagram of gradually shrinking tolerance boundaries according to an embodiment of the present disclosure; and
FIG. 4 is a schematic diagram showing a combination of the intermediate policy with other policies.
DETAILED DESCRIPTION
In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the disclosed embodiments. According to the description, claims and the drawings disclosed in the specification, one skilled in the art may easily understand the concepts and features of the present invention. The following embodiments further illustrate various aspects of the present invention, but are not meant to limit the scope of the present invention.
FIG. 1 is a block diagram of the intermediate policy training system for an agent according to an embodiment of the present disclosure. As shown in FIG. 1, the training system 10 includes a storage device 1 and a processor 3.
The storage device 1 is configured to store a plurality of instructions. In an embodiment, the storage device 1 may be implemented using at least one of the following examples: flash memory, hard disk drive (HDD), solid-state drive (SSD), dynamic random-access memory (DRAM), static random-access memory (SRAM), or other volatile or non-volatile storage media. However, the present disclosure is not limited to the examples mentioned above.
The processor 3 is electrically connected to the storage device 1 to execute the plurality of instructions and cause a plurality of operations corresponding to a plurality of steps of the intermediate policy training method for an agent according to an embodiment of the present disclosure. In an embodiment, the processor 3 may be implemented using at least one of the following examples: personal computer, network server, central processing unit (CPU), graphics processing unit (GPU), microcontroller (MCU), application processor (AP), field-programmable gate array (FPGA), application-specific integrated circuit (ASIC), system-on-a-chip (SOC), deep learning accelerator, or any electronic device with similar functionality. The present disclosure does not limit the hardware type of the processor 3.
FIG. 2 is a flowchart of the intermediate policy training method for an agent according to an embodiment of the present disclosure. In an embodiment, the agent is a Unitree A1 quadruped robot equipped with 12 Proportional-Derivative (PD) controllers, which drive a plurality of joints of the robot. However, the present disclosure does not limit the number of PD controllers and joints. The intermediate policy takes the current state and target state of the agent as inputs and generates novel transition trajectories to drive the agent towards the target state. As shown in FIG. 2, the training method includes steps S1, S2, and S3.
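For illustration only, the following sketch shows how a PD controller may convert a policy's target joint angles into joint torques; the function name pd_torques and the gain values are assumptions introduced for exposition (the gain ranges follow Table 1 below), not the A1 robot's actual control interface.

import numpy as np

def pd_torques(target_angles, joint_angles, joint_velocities, kp=50.0, kd=1.2):
    # Minimal PD-control sketch: the torque pushes each joint toward the
    # target angle while damping its velocity. kp and kd are illustrative
    # values within the ranges listed in Table 1.
    return kp * (target_angles - joint_angles) - kd * joint_velocities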
In step S1, the processor 3 selects a source policy and a target policy from a plurality of policies. Each policy is configured to drive the agent to perform a plurality of actions to be in a plurality of states. Each state includes a plurality of physical properties. In an embodiment, the plurality of physical properties includes: position, orientation, linear velocities, angular velocities, joint angles, joint velocities, and binary feet contact indicators, where the linear velocities may be estimated by fusing inertial measurement unit (IMU) readings with leg velocities during feet contacts.
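As a non-limiting illustration, the plurality of physical properties of one state may be flattened into a single vector as sketched below; the field ordering and dimensions are assumptions for exposition.

import numpy as np

def build_state(position, orientation, lin_vel, ang_vel,
                joint_angles, joint_velocities, feet_contacts):
    # Concatenate the physical properties of one state into a flat vector.
    # For a 12-joint quadruped this is 3 + 4 + 3 + 3 + 12 + 12 + 4 = 41 values;
    # the ordering here is an illustrative assumption.
    return np.concatenate([
        position,          # 3: base position
        orientation,       # 4: base orientation (quaternion, assumed)
        lin_vel,           # 3: linear velocity (e.g., fused from IMU and leg odometry)
        ang_vel,           # 3: angular velocity
        joint_angles,      # 12: one angle per joint
        joint_velocities,  # 12: one velocity per joint
        feet_contacts,     # 4: binary contact indicator per foot
    ])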
In an embodiment, the plurality of policies mentioned in step S1 are pre-prepared. In another embodiment, before step S1, the training method further includes the following steps: collecting a plurality of reference motion clips or animation data, tracking the reference motion clips in a physics simulator, and training a plurality of physics-based controllers as the plurality of policies with a motion imitation framework. The plurality of policies may be denoted as πi(a|s, gi); i∈1 . . . k, where a represents the actions executed by the PD controllers, s represents the state of the agent, gi represents the data of the reference motion clips (e.g., including four future frames), and k represents the number of policies. In the above embodiment, the motion imitation framework is implemented with Isaac Gym, and the plurality of policies are trained with PPO-clip loss on a consumer-grade laptop equipped with an Intel 8-core i7-11800H 2.3 GHz processor and an NVIDIA RTX 3070 8 GB GPU.
In an embodiment, to transfer the plurality of policies from simulation to the real world and ensure that each policy is robust enough when applied to a real-world robot, domain randomization is employed. Specifically, this involves randomizing the mass of each joint of the robot, introducing disturbance forces (applying random external forces to the agent), adding noise to sensor readings (simulating the errors in sensor readings in the real world), and randomizing terrain height and friction. The detailed parameters are shown in Table 1 below.
TABLE 1
Parameters for domain randomization. Ranges are sampled uniformly. The feet contacts value indicates the probability of zeroing out (feet not touching the ground) the contacts, per foot.

Parameter                   Value (Policy)    Value (Intermediate Policy)
Action Noise                ±0.02             ±0.02
Rigid Bodies Mass           [75%, 125%]       [95%, 105%]
P Gain (PD Controller)      [35, 65]          [45, 55]
D Gain (PD Controller)      [1.0, 1.4]        [0.9, 1.2]
Ground Friction             [0.1, 1.5]        [0.1, 1.5]
Noise - Orientation         ±0.05             ±0.06
Noise - Linear Velocity     ±0.25             ±0.25
Noise - Angular Velocity    ±0.3              ±0.3
Noise - Joint Angles        ±0.02             ±0.02
Noise - Joint Velocity      —                 ±1.5
Noise - Feet Contacts       20%               20%
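As a non-limiting illustration, the ranges of Table 1 may be applied once per episode by uniform sampling, as sketched below; the dictionary keys, the 20% contact-dropout value taken from Table 1, and the function names are assumptions for exposition rather than an Isaac Gym interface.

import numpy as np

# Illustrative per-episode randomization ranges (intermediate-policy column of Table 1).
RANDOMIZATION = {
    "rigid_body_mass_scale": (0.95, 1.05),
    "p_gain": (45.0, 55.0),
    "d_gain": (0.9, 1.2),
    "ground_friction": (0.1, 1.5),
}

def sample_domain_params(rng=np.random):
    # Uniformly sample one set of domain-randomization parameters for an episode.
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION.items()}

def add_sensor_noise(obs, rng=np.random):
    # Add uniform sensor noise (Table 1) and randomly zero out feet contacts.
    obs["orientation"] += rng.uniform(-0.06, 0.06, size=obs["orientation"].shape)
    obs["lin_vel"] += rng.uniform(-0.25, 0.25, size=obs["lin_vel"].shape)
    obs["ang_vel"] += rng.uniform(-0.3, 0.3, size=obs["ang_vel"].shape)
    obs["joint_angles"] += rng.uniform(-0.02, 0.02, size=obs["joint_angles"].shape)
    obs["joint_velocities"] += rng.uniform(-1.5, 1.5, size=obs["joint_velocities"].shape)
    # Each foot contact is zeroed out with 20% probability.
    obs["feet_contacts"] = obs["feet_contacts"] * (rng.uniform(size=obs["feet_contacts"].shape) > 0.2)
    return obs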
In step S2, the processor 3 selects a source state from the plurality of states in the source policy and a target state from the plurality of states in the target policy. In an embodiment, the source and target states can be obtained by randomly sampling from a large set of physically feasible states or animation data, with noise added to the plurality of physical properties included in the target state. The intermediate policy will learn how to drive the agent from any given state to any target state.
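One non-limiting way to realize step S2 is to sample the source and target states from pre-recorded feasible states or animation data and perturb the target, as sketched below; the function name and the noise scale are assumptions for exposition.

import numpy as np

def sample_transition_pair(source_states, target_states, noise_scale=0.05, rng=np.random):
    # Pick a random source/target state pair for one training episode.
    # source_states / target_states: arrays of physically feasible states
    # (e.g., frames of animation data); noise_scale is an illustrative value.
    s0 = source_states[rng.randint(len(source_states))]
    s_hat = target_states[rng.randint(len(target_states))].copy()
    s_hat += rng.normal(0.0, noise_scale, size=s_hat.shape)  # noise added to target properties
    return s0, s_hat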
In step S3, the processor 3 trains an intermediate policy by a reinforcement learning to transition the agent from the source state to the target state over an episode. The reinforcement learning involves an annealing function for setting a plurality of tolerance boundaries of the plurality of physical properties, and the plurality of tolerance boundaries gradually shrink during the episode.
The intermediate policy is configured to drive the agent to the target state by generating novel transition trajectories. The intermediate policy may be represented as P(at|st, ŝ), where st represents the current state of the agent and ŝ represents the target state. The current state st is defined in the same way as in the aforementioned policies and consists of the plurality of physical properties: position, orientation, linear velocities, angular velocities, joint angles, joint velocities, and binary feet contact indicators. The target state ŝ represents the state the agent should be in after the transition. In an embodiment, if the target state ŝ is derived from animation data, the binary feet contact indicator may be excluded from the plurality of physical properties included in ŝ to reduce computational load.
The intermediate policy is trained with episodic reinforcement learning in a completely independent process from the aforementioned policies. During each episode, the goal of the intermediate policy is to match the target state ŝ. The architecture of the intermediate policy P(at|st, ŝ) is a two-layer feed-forward neural network with 512 and 256 hidden units. Except for the linear output layer, each layer uses the Exponential Linear Unit (ELU) as the activation function.
In an embodiment, the intermediate policy receives an observation vector of 208 dimensions which consists of the three latest states of the agent, the actions from the past three timesteps, the target state ŝ, the centers of the tolerance boundaries at the current timestep, and a single scalar that encodes the normalized time, starting at 0.0 and increasing to 1.0 at the end of the transition period (episode). The maximum episode length is set to 2 seconds. The output of the intermediate policy is configured to adjust the plurality of PD controllers, which in turn control the angles of the plurality of joints of the agent.
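Under the architecture described above, the intermediate policy network may be sketched in PyTorch as follows; the class name and the exact assembly of the 208-dimensional observation vector are assumptions for exposition.

import torch.nn as nn

class IntermediatePolicy(nn.Module):
    # Two-layer feed-forward policy: 512 and 256 ELU hidden units, linear output.
    def __init__(self, obs_dim=208, action_dim=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 512), nn.ELU(),
            nn.Linear(512, 256), nn.ELU(),
            nn.Linear(256, action_dim),  # linear output: targets for the PD controllers
        )

    def forward(self, obs):
        # obs: three latest states, three past actions, target state ŝ,
        # current tolerance-boundary centers, and the normalized time scalar.
        return self.net(obs)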
FIG. 3 is an example schematic diagram of gradually shrinking tolerance boundaries according to an embodiment of the present disclosure. As shown in FIG. 3, the tolerance boundary B is annealed around the line that connects the source state s0 to the target state ŝ. The agent A learns to stay close to the target state ŝ while avoiding violations of the gradually shrinking tolerance boundary B. The transition duration is dynamic: an episode finishes early once the agent's state matches the target state ŝ within the tolerance boundary B, before the episode's maximum length is reached. In fact, if the intermediate policy performs well, most transitions will indeed terminate early.
The reward function incentivizes the agent A to stay within a set of tolerance boundaries B for each physical property, such as joint positions, linear velocity, angular velocity, and orientation.
If the agent A is within the tolerance boundaries B for all physical properties at the end of the episode (e.g., current state st1), the intermediate policy receives a basic reward of +1. If the agent A violates any of the plurality of tolerance boundaries B before the episode ends (e.g., current state st2), the intermediate policy receives a negative reward of −1 and the episode terminates. If the agent A reaches the target state ŝ before the episode ends (e.g., current state st3) and all physical properties are within the tolerance boundaries B, the intermediate policy receives an additional reward of 100, greater than the basic reward of 1. These values are illustrative and not intended to limit the present disclosure.
In an embodiment, the reward function for reinforcement learning includes a set of indicator functions for the shrinking hard boundaries and an energy efficiency penalization term. The hard boundaries represent mandatory constraints, with penalties imposed for violations. The indicator function is defined as follows:
β = { 100, if max(|st − ŝ| − σe) ≤ 0
    { −1, if max(|st − Ψ(s0, ŝ, t)| − σt) ≥ 0          (Equation 1)
    { 1, otherwise
where Ψ(s0, ŝ, t) denotes an annealing function, a linear function that connects the source state s0 to the target state ŝ, and σt denotes the tolerance, annealed at each timestep t according to Equation 2,
σt = σs + t^p (σe − σs)          (Equation 2)
with σs and σe respectively denoting the tolerance at the start and at the end of the transition period, and p denoting an exponent parameter (refer to Table 2) that modifies how the tolerance boundary B shrinks. As the tolerance boundary B anneals, the intermediate policy learns to match the target state ŝ closely. However, this does not guarantee that the trajectories are smooth or energy efficient. Therefore, the reward function also includes a penalization term for joint torques. In an embodiment, step S3 further includes: deducting a weighted sum of the squared torques of the joints from a reward obtained by the intermediate policy. The complete reward function is shown in Equation 3 below, where τj is the torque of the j-th joint of the agent A, and ωτ is a scalar that controls the scale of the penalization:
R = β − ωτ Σj τj²          (Equation 3)
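Equations 1 to 3 may be combined into a single per-step reward computation as sketched below; treating the tracked physical properties element-wise in one array, as well as the function names, are assumptions for exposition.

import numpy as np

def annealed_tolerance(t, sigma_s, sigma_e, p):
    # Equation 2: tolerance at normalized time t in [0, 1].
    return sigma_s + (t ** p) * (sigma_e - sigma_s)

def step_reward(s_t, s0, s_hat, t, sigma_s, sigma_e, p, torques, w_tau=1e-4):
    # Per-step reward combining Equations 1-3 (illustrative sketch).
    # s_t, s0, s_hat, sigma_s, sigma_e, p: arrays over the tracked physical properties.
    psi = s0 + t * (s_hat - s0)                     # annealing line Ψ(s0, ŝ, t)
    sigma_t = annealed_tolerance(t, sigma_s, sigma_e, p)

    if np.max(np.abs(s_t - s_hat) - sigma_e) <= 0:  # target reached within end tolerance
        beta = 100.0
    elif np.max(np.abs(s_t - psi) - sigma_t) >= 0:  # tolerance boundary violated: terminate
        beta = -1.0
    else:                                           # still inside the shrinking boundary
        beta = 1.0

    return beta - w_tau * np.sum(np.asarray(torques) ** 2)  # Equation 3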
Table 2 below shows example parameters for the tolerance boundary B and the penalization term.
TABLE 2
Tunable parameters of the intermediate policy. σs and σe indicate the tolerance at the start and end of the transition, respectively.

Component                       Value           σs      σe
CoM (center of mass) - Height   p = 2           0.35    0.02
Orientation                     p = 4           1       0.2
Linear Velocity                 p = 8           2.5     0.2
Angular Velocity                p = 8           15      0.2
Joint Angles                    p = 2           3.14    0.5
Torque Term                     ωτ = 0.0001     —       —
FIG. 4 is a schematic diagram showing a combination of the intermediate policy with other policies. As shown in FIG. 4, after training the intermediate policy P, various policies π1, π2, π3, π4, . . . , πN can be sequentially combined. For example, the agent A may start with policy π1, and at some later point, an event triggers a switch to another policy π3. To execute the transition between these two policies π1, π3, the intermediate policy P takes control of the agent A and samples a target state ŝ from the distribution of π3, using its animation data. Then, the intermediate policy P performs actions to drive the agent A close enough to the target state ŝ at the end of the transition, so that policy π3 can take over control. The transition ends with the target policy π3 taking control of the agent A. This process can be repeated indefinitely and robustly for any pair of policies in the policy library L.
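The handover shown in FIG. 4 may be orchestrated as in the following non-limiting sketch; the environment and policy interfaces (sample_target_state, act, step, within_end_tolerance) are assumptions for exposition.

def run_transition(env, intermediate_policy, target_policy, max_steps=100):
    # Drive the agent from its current state toward a state sampled from the
    # target policy's reference (animation) data, then hand over control.
    s_hat = target_policy.sample_target_state()      # target state ŝ from the target policy's data
    state = env.current_state()
    for step in range(max_steps):
        t = step / max_steps                         # normalized transition time
        action = intermediate_policy.act(state, s_hat, t)
        state = env.step(action)
        if env.within_end_tolerance(state, s_hat):   # all properties within σe: early termination
            break
    return target_policy                             # the target policy takes over control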
If the intermediate policy P is trained with a large enough distribution of source and target states, it is reasonable to assume that it should learn feasible transitions between any two policies, even new ones. For example, if the training data includes fast running actions both forward and backward (−1.0 m/s to 1.0 m/s), the agent A can also successfully transition to a slow walking action (0.5 m/s) during the testing phase, because the slow walking speed lies within the training range, even though the slow walking action itself is not in the training data. Therefore, by adding new policies and reusing the same intermediate policy P, the policy library L can be gradually and indefinitely expanded without retraining or fine-tuning.
In view of the above, the intermediate policy training system and method for an agent proposed in the present disclosure have the following contributions and effects. Firstly, a generic linking mechanism, the intermediate policy, can transition between any two reinforcement learning policies as long as a target state is provided. Secondly, the intermediate policy can smoothly transition to a generic target state via a shrinking-boundaries reward function design.