江端さんの忘備録(2019-05-08)

2019-05-08: Is this learning method really that good?

I finally got a simple Q-learning program running (Q-learning is a type of reinforcement learning in machine learning). I had started writing it from scratch before the holidays.

It is fun to watch the learning progress, but

"Is this learning method really that good?"

is a doubt I cannot shake.

-----

The defining feature of Q-learning is that, even for a target whose rules are completely unknown, it can automatically discover the most efficient "chain of states" simply by linking states together.

In other words, "left alone, it finds a way of solving the problem (a solution method, not a ready-made answer) by itself."
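To make the idea concrete, here is a minimal sketch of tabular Q-learning. It is not the program described in this entry: the six-state chain, the reward of 1.0 at the rightmost state, and the names `train_chain` and `greedy_policy` are all assumptions made up for illustration. The agent is never told that "right" is the correct direction; it only links states together and receives a reward at the goal.

```python
import random


def train_chain(n_states=6, episodes=500, alpha=0.1, gamma=0.9,
                epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D chain (goal = rightmost state)."""
    rng = random.Random(seed)
    moves = (-1, +1)                      # action 0 = step left, 1 = step right
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]   # Q-table: q[state][action]

    for _ in range(episodes):
        state = 0
        while state != goal:
            # Epsilon-greedy choice; ties are broken at random.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1

            # The "rules" of the world, hidden from the learner.
            next_state = min(max(state + moves[action], 0), goal)
            reward = 1.0 if next_state == goal else 0.0

            # Q-learning update: move the estimate toward
            # reward + gamma * (best value of the next state).
            best_next = 0.0 if next_state == goal else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * best_next
                                         - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Best action per non-goal state (0 = left, 1 = right)."""
    return [0 if left > right else 1 for left, right in q[:-1]]


if __name__ == "__main__":
    print(greedy_policy(train_chain()))
```

The learned policy is the "chain of states" the text refers to: the sequence of moves that reaches the reward with the least discounting.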

What must not be forgotten is the whole point: "it lets me take it easy."

In practice, however, reinforcement learning cannot be applied easily unless the target (a game, for example) has terribly simple states.

If I represent the state as is, with no ingenuity at all, three things go wrong: the solution space becomes so large that learning shows no effect whatsoever, the state list grows long and eats up memory, and with such a long state list the reward calculation takes so much time that learning does not progress.
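A rough back-of-the-envelope sketch shows how quickly a raw state representation outgrows a lookup table. The grid sizes, the four-action count, and the helper names `raw_state_count` and `table_bytes` are assumptions for illustration only; they are not figures from the program in this entry.

```python
def raw_state_count(cells, values_per_cell):
    """Number of distinct raw states when every cell can take any value."""
    return values_per_cell ** cells


def table_bytes(n_states, n_actions, bytes_per_value=8):
    """Memory for a dense Q-table holding one float per (state, action) pair."""
    return n_states * n_actions * bytes_per_value


if __name__ == "__main__":
    # A 3x3 board with 3 possible marks per cell: small enough for a table.
    small = raw_state_count(9, 3)
    print(small, table_bytes(small, 9))

    # An 8x8 grid of on/off cells: already far beyond any memory.
    big = raw_state_count(64, 2)
    print(big, table_bytes(big, 4))
```

Even the second case is a toy compared with an unprocessed picture of the real world, which is where the three problems above come from.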

In other words, it goes without saying that the amount of information in the "time-continuum world" is hopelessly large, but even a "mere state", a single slice of the world cut along a time plane, is far more than a computer can handle.

For example, take the task "make a computer understand ten seconds of my real world as states at one-second intervals." I think even this much would be impossible with every supercomputer in the world put together.

That leaves no choice but to compress the cross-section of the world, that is, the state. Without "modeling of the state", we cannot even get started.
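One common hand-made form of such compression is to bin continuous measurements into a few coarse levels. The sketch below assumes a state described by a tuple of numbers; the function name `discretize`, the bounds, and the bin counts are hypothetical choices that a person has to supply, which is exactly the modeling work in question.

```python
def discretize(observation, lows, highs, bins):
    """Map a tuple of continuous values to a tuple of coarse bin indices.

    lows/highs are the assumed value ranges and bins the number of levels
    per dimension; all three are chosen by hand, not learned.
    """
    indices = []
    for x, lo, hi, n in zip(observation, lows, highs, bins):
        clipped = min(max(x, lo), hi)             # keep x inside [lo, hi]
        idx = int((clipped - lo) / (hi - lo) * n)
        indices.append(min(idx, n - 1))           # x == hi goes to the last bin
    return tuple(indices)


if __name__ == "__main__":
    # Two measurements squeezed into a 4 x 4 grid: 16 states in total.
    print(discretize((0.5, -1.0), (0.0, -2.0), (1.0, 2.0), (4, 4)))
```

With four bins per dimension, two measurements collapse into 16 table rows, at the cost of deciding by hand which differences between situations are allowed to disappear.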

And a computer cannot pull off the feat of "modeling the state" automatically.

Above all, if we are able to model the state ourselves, there is no need for reinforcement learning at all: the model can be used directly with an existing inference method (fuzzy inference, for example).

In short, if we can model the state, there is no need to learn in the first place.

To avoid this "modeling of the state", there also seems to be a method that uses a neural network to compress the state space.

But if I have to go as far as building a neural network just to do reinforcement learning, I no longer understand what I am even doing.

-----

Well, I have written an unusually kanji-heavy text here, and when I write like this, I am usually angry.

The whole time I was coding and recoding this Q-learning program, I kept asking myself:

"The strength of reinforcement learning is that it lets me take it easy (cut corners), right?"
