Maze Navigation using an Actor Critic Subnetwork
This simulation demonstrates the use of an Actor Critic network. The simulation takes place in a square grid with 3 rows and 3 columns. The cells in the grid are numbered from (1,1) to (3,3). The cell (3,3) contains the reward, and (1,1) is the starting cell. The network needs to learn the path from (1,1) to (3,3) to obtain the reward. Use the following steps to explore this simulation:
1. Clear the workspace (File > Clear Workspace).
2. Open the workspace file named MazeNavigation.sim located in simulations/sims/conditioning (File > Open Workspace). The following network should open in your workspace:
In this network, the bottom layer with 9 neurons represents the current position in the grid. The bottom-left neuron represents the state (1,1) and the top-right neuron represents the state (3,3). Only one unit in this layer is active at a time, representing the current state.
The top part of the network has two layers. The left layer has 4 units, representing the following four actions from left to right: "Move Up", "Move Down", "Move Left" and "Move Right". The right layer consists of two neurons. The left neuron is the adaptive critic, which generates the reward expectations. The right neuron is a target neuron, which is fed the actual reward value.
In each time step, the network receives the state and the actual reward value as input, and it generates the reward expectation and an action as output. The reward expectation is used to update the network weights, and the action is used to update the network state. Note that some actions may leave the state unchanged in certain cases - for example, if the network is currently in state (3,1), (3,2) or (3,3), then the action "Move Up" will not change the state.
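For readers who want to see the underlying idea in code, the following is a minimal, self-contained sketch of a tabular actor-critic on the same 3x3 grid. It is only an illustration: it is not the td() script and does not use the Simbrain API, and names such as GridActorCritic, ALPHA and GAMMA, as well as the specific reward value and learning rates, are assumptions chosen to mirror the description above (one active state unit, four actions, and a critic that learns reward expectations).

    // Illustrative sketch only; not the td() script and not the Simbrain API.
    import java.util.Random;

    public class GridActorCritic {
        static final int SIZE = 3;                    // 3 rows x 3 columns
        static final int N_STATES = SIZE * SIZE;      // 9 state units, one per cell
        static final int N_ACTIONS = 4;               // up, down, left, right
        static final double ALPHA = 0.1;              // learning rate (assumed)
        static final double GAMMA = 0.9;              // discount factor (assumed)

        double[] value = new double[N_STATES];             // critic: reward expectation per state
        double[][] pref = new double[N_STATES][N_ACTIONS]; // actor: action preferences per state
        Random rng = new Random(0);

        // Map (row, col), 1-based, to a state id; (1,1) -> 0, (3,3) -> 8.
        static int stateId(int row, int col) { return (row - 1) * SIZE + (col - 1); }

        // Apply an action; moves off the grid leave the state unchanged,
        // e.g. "Move Up" from row 3 keeps the agent in the same cell.
        static int step(int state, int action) {
            int row = state / SIZE + 1, col = state % SIZE + 1;
            switch (action) {
                case 0: row = Math.min(row + 1, SIZE); break; // up
                case 1: row = Math.max(row - 1, 1);    break; // down
                case 2: col = Math.max(col - 1, 1);    break; // left
                case 3: col = Math.min(col + 1, SIZE); break; // right
            }
            return stateId(row, col);
        }

        void trainEpisode() {
            int state = stateId(1, 1);                       // start in (1,1)
            int goal = stateId(3, 3);                        // reward in (3,3)
            while (state != goal) {
                int action = rng.nextInt(N_ACTIONS);         // "Random" exploration policy
                int next = step(state, action);
                double reward = (next == goal) ? 1.0 : 0.0;  // actual reward signal
                // TD error: actual reward plus discounted expectation of the next
                // state, minus the current reward expectation.
                double tdError = reward + GAMMA * value[next] - value[state];
                value[state] += ALPHA * tdError;             // critic update
                pref[state][action] += ALPHA * tdError;      // actor update (simplified)
                state = next;
            }
        }

        public static void main(String[] args) {
            GridActorCritic ac = new GridActorCritic();
            for (int episode = 0; episode < 500; episode++) ac.trainEpisode();
            System.out.println("Learned value of start state (1,1): "
                    + ac.value[GridActorCritic.stateId(1, 1)]);
        }
    }

In this sketch, the value table plays the role of the adaptive critic neuron (the learned reward expectation for each state) and the preference table plays the role of the action layer; the TD error drives both updates, which is analogous to how the reward expectation drives the weight updates in the simulation.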
3. Open a console (from the main menu, Insert > New console).
4. Type td(); in the console window. This will run the console script that trains the actor-critic network.
5. To test the network, close the console window and open a new one (Insert > New console). Double click on the network to open its property dialog. Uncheck the "Train the network" checkbox and select "None" as the exploration policy (the two exploration settings are illustrated in the sketch after these steps). Type tdtest(); in the console window. This will show the trained network at work: the state of the network at each time step is printed in the console. As mentioned earlier, the starting state is (1,1). Observe how the network moves from the start state to the goal state (3,3).
6. If you want to start all over again, click in the network window. Press "w" and "c" and then "n" and "c". Double click on the network to open its property dialog. Check "Train the network" and select the "Random" exploration policy. Close the console and go back to step 3.
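The sketch below shows what the two exploration settings used in steps 5 and 6 are assumed to mean: "Random" picks a uniformly random action, which is useful during training so that every cell and action gets visited, while "None" always takes the action with the highest learned preference. This is an illustration only; selectAction and pref are hypothetical names, not part of the Simbrain API.

    // Illustrative sketch of the assumed "Random" vs. "None" exploration policies.
    import java.util.Random;

    class ActionSelection {
        static final Random RNG = new Random();

        // pref holds the actor's learned preference for each of the four actions
        // in the current state; randomExploration corresponds to the "Random" policy.
        static int selectAction(double[] pref, boolean randomExploration) {
            if (randomExploration) {
                return RNG.nextInt(pref.length);   // "Random": explore during training
            }
            int best = 0;                          // "None": greedily follow what was learned
            for (int a = 1; a < pref.length; a++) {
                if (pref[a] > pref[best]) best = a;
            }
            return best;
        }
    }

Under this reading, training (step 4) uses random actions so the critic sees the whole grid, while testing (step 5) switches to the greedy choice so you can watch the learned path from (1,1) to (3,3).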