Yesterday, I started some training experiments on the collected data. I use reinforcement learning algorithms (Q-learning and Sarsa) with a neural network as a function approximator for the Q-values. Basically, this approach lets the agent learn from experience in order to optimize a reward over time. The reward can be anything and, as long as it is consistent, the agent should usually be able to learn to maximize it. However, this may take time, and if the problem is too complex or the data too noisy, the learning process can be hampered.
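To make the setup concrete, here is a rough sketch (in Python, not my actual code) of a Sarsa update with a small one-hidden-layer network as Q-function approximator; the layer sizes, discount factor and names are placeholders:

import numpy as np

class QNet:
    """One-hidden-layer network giving Q(s, a) for a small discrete action set (sketch only)."""
    def __init__(self, n_inputs, n_hidden, n_actions, lr=1.0):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_inputs))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_actions, n_hidden))
        self.b2 = np.zeros(n_actions)
        self.lr = lr

    def forward(self, s):
        self.h = np.tanh(self.W1 @ s + self.b1)   # hidden activations, kept for the update
        return self.W2 @ self.h + self.b2          # one Q-value per action

    def update(self, s, a, target):
        # One gradient step pushing Q(s, a) toward the TD target.
        err = self.forward(s)[a] - target
        dh = err * self.W2[a] * (1.0 - self.h ** 2)
        self.W2[a] -= self.lr * err * self.h
        self.b2[a] -= self.lr * err
        self.W1 -= self.lr * np.outer(dh, s)
        self.b1 -= self.lr * dh
        return err ** 2                            # squared TD error (presumably what "av. mse" tracks)

def sarsa_step(net, s, a, r, s_next, a_next, gamma=0.9):
    # On-policy (Sarsa) target: uses the action actually chosen in s_next.
    target = r + gamma * net.forward(s_next)[a_next]
    return net.update(s, a, target)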
Anyway, I used two different reward functions: "+light", where the reward grows with the perceived light (the agent should seek light), and "-light", where the reward grows as the perceived light decreases (the agent should seek darkness).
I trained it using an on-policy algorithm (Sarsa) in a simulation based on the gathered data. I have data for a set of different angles that the agent adopted over the course of one hour (note: this is not perfectly exact, but we assumed it would do for these experiments, although we should try a better approach in the next data collections). Using these data, I built a model of the world that approximates the perceived light at a given angle and a given time. A simulation is then run in which the agent can move freely in this world and observe the consequences of its actions.
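The world model itself is simple in spirit: look up, among the logged (time, angle, light) samples, the one closest to the agent's current time and angle. A hypothetical sketch, with made-up names and an arbitrary time scale:

import numpy as np

class LightWorld:
    """Approximate world built from logged (time, angle, light) samples (hypothetical names)."""
    def __init__(self, times, angles, lights):
        self.times = np.asarray(times, dtype=float)    # seconds since the start of the recording
        self.angles = np.asarray(angles, dtype=float)  # degrees in [0, 360)
        self.lights = np.asarray(lights, dtype=float)  # perceived light, normalised to [0, 1]

    def perceived_light(self, t, angle, time_scale=60.0):
        # Nearest logged sample, mixing time distance and circular angle distance.
        d_ang = np.abs(self.angles - angle)
        d_ang = np.minimum(d_ang, 360.0 - d_ang)
        d = np.abs(self.times - t) / time_scale + d_ang / 180.0
        return self.lights[np.argmin(d)]

    def step(self, t, angle, delta):
        # The agent turns by `delta` degrees and perceives the light at its new angle.
        new_angle = (angle + delta) % 360.0
        return new_angle, self.perceived_light(t, new_angle)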
I ran the simulation on three different databases:
Results
We trained the models for 100 epochs (each epoch is a complete run through the database). In each case, we compare with a 100% random policy (that is, the agent takes a random action at each step).
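For reference, the per-epoch numbers below ("av. rew.") can be thought of as the average reward collected over one pass through the data, and the random baseline is the same loop with uniformly random actions. A sketch, reusing the hypothetical LightWorld above (action set and step size are invented):

import numpy as np

def run_epoch(world, policy, n_steps, dt=1.0, actions=(-10.0, 0.0, +10.0)):
    """One pass through the simulated hour; returns the average reward ("av. rew.")."""
    angle, total = 0.0, 0.0
    for i in range(n_steps):
        a = policy(i * dt, angle)                 # index into the (hypothetical) action set
        angle, light = world.step(i * dt, angle, actions[a])
        total += light                            # reward = +light in this sketch
    return total / n_steps

def random_policy(t, angle, n_actions=3):
    # The 100% random baseline: a uniformly random action at every step.
    return np.random.randint(n_actions)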
1. "Fake":
With reward "+light": (with 10 hidden units, epsilon = 0.1 and learning rate = 1)
Train after 100 epochs: Epoch summary: av. rew.: 0.792673 av. mse: 0.0082737
Best epoch (94): av. rew.: 0.808373 av. mse: 0.0100743
Random decisions: Epoch summary: av. rew.: 0.260371
Perfection would be around 0.9 because, with epsilon = 0.1, there is always a 10% chance that we make a random move.
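This is the ε-greedy selection that causes that ceiling, as a sketch (qnet is the hypothetical QNet above):

import numpy as np

def epsilon_greedy(qnet, s, epsilon=0.1, n_actions=3):
    # With probability epsilon, explore; otherwise take the greedy action under the current Q.
    if np.random.rand() < epsilon:
        return int(np.random.randint(n_actions))
    return int(np.argmax(qnet.forward(s)))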
Here is an extract of the learning process:
# MatDataSet: 1000 examples loaded
Training started
==== Epoch: 1
Epoch 1 summary: av. rew.: 0.166183 av. mse: 0.0162175
Epoch 2 summary: av. rew.: 0.236203 av. mse: 0.0300084
Epoch 3 summary: av. rew.: 0.223244 av. mse: 0.0246643
Epoch 4 summary: av. rew.: 0.266584 av. mse: 0.0237364
Epoch 5 summary: av. rew.: 0.243567 av. mse: 0.0251326
...
Epoch 91 summary: av. rew.: 0.799072 av. mse: 0.00796436
Epoch 92 summary: av. rew.: 0.788038 av. mse: 0.00968873
Epoch 93 summary: av. rew.: 0.787857 av. mse: 0.00942879
Epoch 94 summary: av. rew.: 0.808373 av. mse: 0.0100743
Epoch 95 summary: av. rew.: 0.781321 av. mse: 0.00920019
Epoch 96 summary: av. rew.: 0.783101 av. mse: 0.00865259
Epoch 97 summary: av. rew.: 0.805089 av. mse: 0.00907472
Epoch 98 summary: av. rew.: 0.787754 av. mse: 0.00756662
Epoch 99 summary: av. rew.: 0.794785 av. mse: 0.0108819
Epoch 100 summary: av. rew.: 0.792673 av. mse: 0.0082737
Very good results. See the video below.
2. 2011-06-07 extract expanded
Reward = +light:
Train after 100 epochs: av. rew.: 0.562583 av. mse: 0.00019878
Random: Epoch summary: av. rew.: 0.432124 av. mse: 0.000927744
==> GOOD
Reward = -light:
Train after 100 epochs: av. rew.: 0.562583 av. mse: 0.00019878
Epoch summary: av. rew.: 0.564106 av. mse: 0.000761962
==> BAD
3. 2011-06-08 simfile
Reward = +light:
Train after 100 epochs: av. rew.: 0.168295 av. mse: 0.000806739
Best results were around 0.17
Random: av. rew.: 0.158676 av. mse: 0.00140628
==> NOT IMPRESSIVE
Reward = -light:
Train after 100 epochs: Epoch summary: av. rew.: 0.830634 av. mse: 0.00914575
Random: Epoch summary: av. rew.: 0.841084 av. mse: 0.0140829
==> VERY BAD (I don't know why, but the agent seems to get "stuck" at 360 degrees most of the time, preventing it from exploring better regions of the space)
Conclusion
As is usually the case with Machine Learning problems, it works well with data that is "crafted" to make learning easy for the model, but real-life data is another matter. So the first results are a bit disappointing, but they are not 100% bad either.
The best results were achieved with the "fake" data. Only in that case did the parameters (hidden units, learning rate, epsilon, lambda) seem to really matter (it was actually important to tune them properly). In the other cases, changing them did not influence the course of the training very much. Here are the different values that were typically tried:
Using a softmax policy instead of an ε-greedy policy seemed to change the learning, but it is hard to tell exactly how. Softmax generally seems to be "too" random to allow the agent to learn quickly.
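For comparison, a sketch of the softmax (Boltzmann) alternative; the temperature value here is arbitrary, and a high temperature behaves almost like the random policy:

import numpy as np

def softmax_policy(qnet, s, temperature=1.0):
    # Sample an action with probability proportional to exp(Q / temperature).
    q = qnet.forward(s)
    p = np.exp((q - q.max()) / temperature)   # subtract the max for numerical stability
    p /= p.sum()
    return int(np.random.choice(len(q), p=p))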
Next steps
Verify my code. There might still be errors out there. Among other things, the agent often seems to be "stuck" at 0 or 360 degrees, which is strange.
Make it easier. Maybe it's too hard a problem to learn. Here are some suggestions:
Advanced: