# Solving Klotski with Deep Reinforcement Learning (RL)

## What’s Klotski?

Klotski is a sliding block puzzle where the goal is to move a specific block to its predefined position. Recently a friend of mine from Japan introduced me to this puzzle. Being a veteran of puzzles and an RL enthusiast, I naturally dedicated an entire weekend to it.

Now this puzzle is fairly simple in terms of its computational complexity that a basic implementation of Breadth First Search (BFS) algorithm finds the shortest solution in 121 seconds, exhaustively visiting a total of 24635 unique states on my MacBook Pro (Late 2013). But, for deep RL it’s not that easy since the rewards are extremely sparse. The only time a reward is given is when the puzzle is in its solved state which is at minimum 85 levels deep from the root node. The probability of choosing a specific 85 actions long sequence at random is extremely low given there are 40 actions available at every step. So I decided to give it a shot.

## Deep RL Setup

RLllib is a pretty good framework for reinforcement learning that lets you scale your application to multiple nodes on an HPC cluster. I implemented a basic simulated environment of Klotski in Python that extends OpenAI’s Gym core class. I then use RLlib to train a PPO agent on this custom environment.

At each step, the agent gets the whole state of the board in a form of a 1D array, which is the flattened version of the following 2D array.

```
7 8 8 9
7 8 8 9
4 5 5 6
4 1 2 6
0 10 10 3
```

As it can be seen, each piece is encoded with a different number. This significantly increases the number of possible states. A simpler state representation would be the one where pieces of the same type are encoded with the same number. Since, for example, it doesn’t matter which 1X1 piece is at the bottom left, it could be either of 0, 1, 2, 3 and it wouldn’t make a difference. But since, deep RL can handle huge state space, I go with the shown state representation. Also, to humans it wouldn’t make a difference if someone were to color each piece with a different color. We would simply choose to ignore that extra information.

Note, the above BFS is run with the simpler state representation, otherwise it doesn’t halt even in 4hrs, visiting more than 350k unique states. I use the tree generated by BFS to get the depth of a given state, referred as “state-depth”.

For each state, the agent outputs an action which is an integer value from 0 to 39, where each number represents a specific piece and a specific direction. Since there are 10 pieces and 4 directions, there are 40 different actions.

## Experiments

### Vanilla PPO

Although as explained above, it’s extremely unlikely that PPO all by itself would be able to solve Klotski in a reasonable amount of time because of sparse reward problem, I anyway ran it out of curiosity with the following config.

```
{
.
.
"num_workers": 10,
"batch_mode": "complete_episodes",
"observation_filter": "MeanStdFilter",
"sample_batch_size": 64,
"sgd_minibatch_size": 512,
"num_sgd_iter": 10,
"train_batch_size": 4096
.
.
}
```

As expected it doesn’t perform well. For future comparison, the maximum state-depth value achieved by PPO is 45 with mean of around 17 during training.

### PPO with Naive Implementation of Novelty

Since the problem here is lack of rewards, we can definitely make use of intrinsic rewards or, a.k.a, self motivation or curiosity. Basically, we can reward the agent according to how novel the current state is. This motivates the agent to explore more and more states since it gets rewarded positively for visiting a novel state.

A naive implementation of that is maintaining a dictionary which keeps the record of whether a simple representation of a state has been visited or not. Since there are only about 24k of them, this approach is feasible. Next, reward the agent +0.1 if the state is not visited. And finally, flush the dictionary after every episode. Flushing the dictionary after every episode is crucial since no further reward is given for visiting a state more than once. This definitely makes the environment non-stationary, since visiting a state would sometime give a reward and other times won’t, but generally it’s not an issue. [cite]

This approach works decently well but seems to be very sensitive to the horizon value, i.e., the maximum number of steps allowed per episode. But nonetheless, it works decent enough and reaches maximum state-depth of 85 and with mean of 45. Put a video.

To be continued.

## Comments