Offline training

Screenshot of the simulation app

Yesterday, I started some training experiments on the collected data. I use reinforcement learning algorithms (Q-learning and Sarsa) with a neural network as a function approximator for the Q-values. Basically, this approach allows the agent to learn from experience in order to optimize a reward over time. The reward can be anything and, as long as it is consistent, the agent should usually be able to learn it. However, this can take time, and if the problem is too complex or the data too noisy, the learning process may suffer.
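
To make this concrete, here is a rough sketch of what such a Sarsa update with a small neural-network Q-function looks like. The state encoding, network size and constants below are illustrative assumptions, not the exact code used in these experiments.

import numpy as np

# Minimal Sarsa(0) update with a one-hidden-layer network approximating Q(s, a).
# State/action encodings and constants are illustrative assumptions.

N_ACTIONS = 36        # assumption: one action per 10 degrees of orientation
STATE_DIM = 2         # assumption: (normalized angle, normalized light reading)
N_HIDDEN = 10
LEARNING_RATE = 0.1
GAMMA = 0.9
EPSILON = 0.1

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.1, (STATE_DIM + N_ACTIONS, N_HIDDEN))
W2 = rng.normal(0.0, 0.1, (N_HIDDEN, 1))

def encode(state, action):
    """Concatenate the state with a one-hot encoding of the action."""
    one_hot = np.zeros(N_ACTIONS)
    one_hot[action] = 1.0
    return np.concatenate([state, one_hot])

def q_value(state, action):
    """Forward pass: Q(s, a) = tanh(x W1) W2."""
    h = np.tanh(encode(state, action) @ W1)
    return float(h @ W2), h

def choose_action(state):
    """Epsilon-greedy policy over the approximated Q-values."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax([q_value(state, a)[0] for a in range(N_ACTIONS)]))

def sarsa_update(s, a, reward, s_next, a_next):
    """On-policy TD(0) step: move Q(s, a) toward reward + gamma * Q(s', a')."""
    global W1, W2
    q_sa, h = q_value(s, a)
    q_next, _ = q_value(s_next, a_next)
    td_error = reward + GAMMA * q_next - q_sa
    # Gradient of Q(s, a) with respect to the two weight matrices.
    x = encode(s, a)
    grad_w2 = h[:, None]
    grad_w1 = np.outer(x, W2[:, 0] * (1.0 - h ** 2))
    W1 = W1 + LEARNING_RATE * td_error * grad_w1
    W2 = W2 + LEARNING_RATE * td_error * grad_w2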

Anyway, I used two different reward functions:

  1. in one case ("light-seeking" or "+light") the agent gets a positive reward when it sees more light from the solar cell
  2. in the other case ("light-avoiding" or "-light") the opposite happens (a small sketch of both follows)
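
Assuming the light reading is normalized to [0, 1] (the exact scaling used in the experiments may differ), the two rewards amount to something like:

def reward_plus_light(light):
    """"+light": the more light the solar cell sees, the higher the reward."""
    return light

def reward_minus_light(light):
    """"-light": the agent is rewarded for avoiding the light."""
    return 1.0 - light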

I trained it using on-policy learning (Sarsa) in a simulation built from the gathered data. I have data for a set of different angles adopted by the agent over the course of one hour (note: this is not perfectly exact, but we assumed it would do for our experiments, although we should try a better approach in the next data collections). Using these data, I built a model of the world that approximates the light perceived at a given angle and a given time. A simulation is then run in which the agent can move freely in this world and observe the consequences of its actions.
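
Here is a rough sketch of that world model, with a simple nearest-neighbour lookup standing in for the actual approximation (the real interpolation and data format may differ):

import numpy as np

class LightWorld:
    def __init__(self, times, angles, lights):
        # times in hours, angles in degrees, lights as normalized readings
        self.times = np.asarray(times, dtype=float)
        self.angles = np.asarray(angles, dtype=float)
        self.lights = np.asarray(lights, dtype=float)

    def perceived_light(self, time, angle):
        """Return the reading of the logged sample closest in (time, angle)."""
        d_angle = np.abs(self.angles - angle)
        d_angle = np.minimum(d_angle, 360.0 - d_angle)   # angles wrap at 360
        d_time = np.abs(self.times - time)
        idx = int(np.argmin(d_time + d_angle / 360.0))   # crude combined distance
        return float(self.lights[idx])

    def step(self, time, action_angle):
        """One simulation step: the agent turns to action_angle and reads the light."""
        return self.perceived_light(time, action_angle)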

I ran the simulation on three different databases:

  1. "Fake": hand-made data with 1000 data points simulating a relatively stable "peak" of light around 180 degrees (file: maquette_2011-06-07--fake--expanded_1000.flat)
  2. 2011-06-07 extract expanded: 1000 data points "expanded" (i.e. some noise added at each step) based on the first hour of data collected on June 7th; this data is less "peaky" than (1) but stays fairly stable over time (file: maquette_2011-06-07--extract--expanded_1000.flat)
  3. 2011-06-08 simfile: 100 data points generated by interpolating 10 points of data between each hour of data collected on June 8th; this data "changes" over time because during the night the max and min light angles seem to be inverted (file: maquette_2011-06-08--simfile_100.flat; a sketch of the expansion and interpolation steps follows this list)
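
The two data-preparation steps ("expanding" a small extract with noise, and interpolating intermediate profiles between hourly profiles) could look roughly like this; the noise level and interpolation scheme are assumptions:

import numpy as np

def expand(samples, n_points, noise_std=0.02, seed=0):
    """Resample `samples` (normalized light readings) to n_points,
    adding a small amount of Gaussian noise at each step."""
    rng = np.random.default_rng(seed)
    base = rng.choice(np.asarray(samples, dtype=float), size=n_points, replace=True)
    return np.clip(base + rng.normal(0.0, noise_std, n_points), 0.0, 1.0)

def interpolate_hours(hourly_profiles, steps_per_hour=10):
    """Linearly interpolate `steps_per_hour` profiles between each pair of
    consecutive hourly profiles (each profile is an array of readings per angle)."""
    profiles = [np.asarray(p, dtype=float) for p in hourly_profiles]
    out = []
    for a, b in zip(profiles[:-1], profiles[1:]):
        for t in np.linspace(0.0, 1.0, steps_per_hour, endpoint=False):
            out.append((1.0 - t) * a + t * b)
    out.append(profiles[-1])
    return out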

Results

We trained the models for 100 epochs (each epoch is a complete run through the database). In each case, we compare against a 100% random policy (that is, an agent taking a random action at each step).

1. "Fake":

With reward "+light": (with 10 hidden units, epsilon = 0.1 and learning rate = 1)

Train after 100 epochs: Epoch summary: av. rew.: 0.792673 av. mse: 0.0082737
Best epoch (94): Epoch summary: av. rew.: 0.808373 av. mse: 0.0100743
Random decisions: Epoch summary: av. rew.: 0.260371

Perfection would be around 0.9: with epsilon = 0.1 there is always a 10% chance of making a random move, so even a perfect greedy policy gets roughly 0.9 times the maximum reward (which is close to 1 here).

Here is an extract of the learning process:

# MatDataSet: 1000 examples loaded
Training started ====
Epoch: 1
Epoch 1 summary:  av. rew.: 0.166183 av. mse: 0.0162175
Epoch 2 summary:  av. rew.: 0.236203 av. mse: 0.0300084
Epoch 3 summary:  av. rew.: 0.223244 av. mse: 0.0246643
Epoch 4 summary:  av. rew.: 0.266584 av. mse: 0.0237364
Epoch 5 summary:  av. rew.: 0.243567 av. mse: 0.0251326
...
Epoch 91 summary:  av. rew.: 0.799072 av. mse: 0.00796436
Epoch 92 summary:  av. rew.: 0.788038 av. mse: 0.00968873
Epoch 93 summary:  av. rew.: 0.787857 av. mse: 0.00942879
Epoch 94 summary:  av. rew.: 0.808373 av. mse: 0.0100743
Epoch 95 summary:  av. rew.: 0.781321 av. mse: 0.00920019
Epoch 96 summary:  av. rew.: 0.783101 av. mse: 0.00865259
Epoch 97 summary:  av. rew.: 0.805089 av. mse: 0.00907472
Epoch 98 summary:  av. rew.: 0.787754 av. mse: 0.00756662
Epoch 99 summary:  av. rew.: 0.794785 av. mse: 0.0108819
Epoch 100 summary:  av. rew.: 0.792673 av. mse: 0.0082737

Very good results. See the video below.

2. 2011-06-07 extract expanded

Reward = +light:

Train after 100 epochs: av. rew.: 0.562583 av. mse: 0.00019878
Random: Epoch summary: av. rew.: 0.432124 av. mse: 0.000927744
==> GOOD

Reward = -light:

Train after 100 epochs: av. rew.: 0.562583 av. mse: 0.00019878
Random: Epoch summary: av. rew.: 0.564106 av. mse: 0.000761962
==> BAD

3. 2011-06-08 simfile

Reward = +light:

Train after 100 epochs: av. rew.: 0.168295 av. mse: 0.000806739
Best results were around 0.17
Random: av. rew.: 0.158676 av. mse: 0.00140628
==> NOT IMPRESSIVE

Reward = -light:

Train after 100 epochs: Epoch summary: av. rew.: 0.830634 av. mse: 0.00914575
Random: Epoch summary: av. rew.: 0.841084 av. mse: 0.0140829
==> VERY BAD (I don't know why, but the agent seems to get "stuck" at 360 degrees most of the time, preventing it from exploring better regions of the space)

Conclusion

As is usually the case with machine learning problems, it works well with data that is "crafted" to make learning easy for the model, but when it comes to real-life data it's another matter. So the first results are a bit disappointing, but they are not all bad either.

The best results were achieved with the "fake" data. Only in that case did the parameters (hidden units, learning rate, epsilon, lambda) seem to really matter (it was actually important to tune them properly). In the other cases, changing them didn't influence the course of the training very much. Here are the different values that were typically tried:

  • Number of hidden units: 1, 2, 5, 10, 100
  • Learning rate: 1, 0.1, 0.05, 0.01
  • Lambda: 0.1, 0.01, 0.0001
  • Epsilon: 0.1, 0.01

Using a softmax policy instead of an ε-greedy policy seemed to change the learning, but it's hard to tell how. Softmax generally seems to be "too" random to allow the agent to learn quickly.
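
For reference, a softmax (Boltzmann) policy samples actions with probability proportional to exp(Q / temperature), something like the sketch below; the temperature value is an assumption (higher temperature means more random choices, lower means nearly greedy):

import numpy as np

def softmax_action(q_values, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    rng = rng or np.random.default_rng()
    q = np.asarray(q_values, dtype=float)
    logits = (q - q.max()) / temperature        # subtract the max for stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q), p=probs))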

Next steps

Verify my code. There might still be errors in there. Among other things, the agent often seems to be "stuck" at 0 or 360 degrees, which is strange.

Make it easier. Maybe the problem is too hard to learn as it stands. Here are some suggestions:

  • Sleep during the night. At night the light values seemed to be inverted, possibly due to light coming in from the surrounding villages or the moon. This makes things very hard for the agent because it must learn that its ideal orientation is different at night than during the day, which pushes it toward an "in-between" policy that is not very good.
  • Add a preprocessed "night" binary input (for similar reasons).
  • Limit the possible actions. Instead of using 0..360 (360 actions) we could have directional actions, e.g. left or right, or ± some angle. Or possibly limit the actions available at angle x to e.g. x ± 20 (similar to the directional actions, but the implementation is different). See the sketch after this list.
  • Add action: do not move.
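
A sketch of such a restricted, directional action set: instead of choosing any angle in 0..360, the agent only turns left or right by a fixed step, or does not move. The 20-degree step is an assumption.

ACTIONS = ("left", "stay", "right")
STEP_DEGREES = 20

def apply_action(angle, action, step=STEP_DEGREES):
    """Return the new orientation after a directional action (wraps at 360)."""
    if action == "left":
        return (angle - step) % 360
    if action == "right":
        return (angle + step) % 360
    return angle  # "stay": do not move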

Advanced:

  • Exploration vs. exploitation: decrease epsilon (or the temperature, in the case of softmax) over time; see the sketch after this list.
  • Add a weight decay and a decrease constant.
  • Max likelihood vs MSE criterion for neural network.
  • Try to avoid overfitting.
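
For the first point, decreasing epsilon could follow a simple schedule like this one; the constants are assumptions:

import math

def decayed_epsilon(epoch, start=0.5, end=0.01, decay=0.05):
    """Exponential decay from `start` toward `end`; with these values epsilon
    goes from 0.5 at epoch 0 to about 0.013 by epoch 100."""
    return end + (start - end) * math.exp(-decay * epoch)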
