Monday, January 29, 2018

January Notes

Writing this AI is a huge process in learning for me. Initially, I thought light-random playouts and MCTS would be enough to achieve strong play.

And in some regards, yes, the January version of my code does make Chain9 somewhat consistently. But there are certain things that feel "easy" for a human to figure out that my AI struggles with severely.

But first, here's a summary of the current MCTS Bot:

* Single-threaded -- While I've done a lot of research into multi-threaded programming, my bot is still single-threaded. I have a strong feeling that I can convert my bot into a multithreaded program "easily" (I've thought of a multi-threaded design from the start), but until its actually done, I guess its important to note its still single-threaded.

* Deterministic -- The main reason why I've kept it as single-threaded is because determinism is incredibly important for study. I need the bot to be repeatable. There are ways to make a multi-threaded deterministic bot, but my lock-free design does not allow for this very easily.

* L1 Cache Friendly ?? -- I don't have access to Intel's VTune. But based on some code analysis I've done, I expect the majority of the data to be processed in the L1 cache. I've done this by working with 16kb of MCTS data at a time, and ensuring my "pop and fall" algorithms use a minimum amount of RAM (on the order of ~2kb or less). So I expect all my processing to fit within the 32kb L1 cache of a typical modern systems. I basically have two MCTS algorithms: "Big" and "Little". The "Little" MCTS explores within 16kb of memory, while "Big" backpropagates the "little" results, and chooses a new node to perform "Little MCTS" upon.

Eventually, I'll properly test the L1 cache friendlyness with perftools or the like.

* 2GB Tested -- I've set the "Big" MCTS to a size as large as 2GB. This is somewhat important because I have written all memory routines manually. Its easier said than done: because MCTS never "forgets" or "deletes" data, its rather trivial to write a basic "atomic increment" memory allocator, especially since I never "free" until all the memory is used up.

Changes of Opinion:

* I no longer believe that light-random playouts is suitable for strong Puyo AI play. I'm working on a more suitable "heavy playout" idea with a better heuristic for evaluating field positions. A simple algorithm that is suitable for "trigger blocking" would be ideal.

* Multithreaded thoughts -- I've researched a LOT about multithreaded coding. A fully lock-free MCTS implementation is likely possible, but I've got a few ideas which are "mostly lock free, with spin-locks being used in corner cases". Spinlocks are surprisingly effective in some of my basic tests, and they're also not very difficult to write. (Well... you gotta read Intel's Optimization Manual to understand the "pause" instruction and understand how a spinlock might affect hyperthreads... as well as understand atomics and false-sharing of caches... but its not too difficult once you understand all that).

Multithreaded is still a barrier I have yet to cross. The main issue is that I'm primarily in a "research" phase, where speed is not yet my main concern. I'm still trying to research a good organization for the bot overall.

* MCTS Understanding -- I've learned that the core of MCTS is UCT, which is drawn from the "Multi-armed Bandit Problem". There are simpler algorithms to solve the Multi-Armed Bandit problem (such as Epsilon Greedy). If I were to start over, I'd probably use something like Epsilon Greedy instead of UCT. For now, its clear that I need a heuristic to evaluate puyo board positions to best improve my AI and perform "heavier playouts" instead.

No comments:

Post a Comment