Thursday, September 6, 2018

Developing a heuristic: Constraint Programming for Puyo

Last off, I was talking about the brute-force MCTS methodology was coming up with nice 7-chains at a reasonable pace. However, expert human Puyo players can build 15+ chains in actual combat, so while the MCTS was successful in some sense, it is clearly too weak to challenge human players still, by an order of magnitude. As such, I set off looking for a heuristic to improve MCTS.

I eventually came across constraint programming, the AI-methodology for representing problems in terms of a series of variables, domains, and constraints. A constraint solver tries to assign variables values that satisfy constraints.

For example, Sudoku can be described as 81 variables, each with domain 1-9, with the 27 constraints of:

* All different(X11, X12, X13, X14 ... X19)
* All different(X21, X22, X23 ...)
* ... repeat the above for all rows
* All different(X11, X21, X31, ... X91)
* All different(X12, X22, X32...)
* ... repeat the above for all columns
* All different (X11, X12, X13, X21, X22, X23... X33)
* ... repeat the above for all squares.

Anyway, solving Sudoku in this way is very "human" and "intuitive". Once the problem is stated in this manner, its a simple process of elimination between all variables to figure out what the values each variable should have, or if a solution even exists.

So how to represent this in terms of PuyoPuyo?

Well, applying constraints to PuyoPuyo is pretty easy. A constraint of "Doesn't_Pop" between variables would describe 4 variables with at least one value being different (or be an empty space). The domain of any particular variable would be the 4-colors or an empty spot (a total of domain-size 5). It would seem natural to describe chains in a form of "These variables don't pop" and "These variables MUST pop".

The hard part then, is not finding a solution. But estimating the NUMBER of solutions. After all, PuyoPuyo isn't about just building a chain, its about building a chain given the pieces from the random-number generator. As such, a good "base" is not just a base which has potential to become a 15, 16, or 18-chain... but also one which  has a large potential and flexibility to become a 15+ chain.

In addition, the potential to create hellfires (power-2 chains), or counter-formations, can be represented as constraints as well.

Funny story: Constraint Programming, as powerful as it is, is an NP complete problem. Finding a solution will likely take longer than polynomial time, and finding all exact solutions will take exponentially more time. The only hope is to estimate the number of solutions in some way.

But that's fine! All I'm looking for is a powerful chain-building heuristic. I'll write up a few more posts later that better describe how this Constraint Programming framework will work in my Puyo AI.

Tuesday, January 30, 2018

Research Notes + Reading List

I figure I might as well share some of the things I'm reading or watching that are relevant to the Puyo problem.

My "current reading list" has changed dramatically. A few months ago, I focused my research on OpenCL, but I am now focusing on a CPU-based agent.

APIs

* Extendable Storage Engine: https://msdn.microsoft.com/en-us/library/gg269259(v=exchg.10).aspx
* Reserving and Committing Memory: https://msdn.microsoft.com/en-us/library/windows/desktop/aa366803(v=vs.85).aspx
* YieldProcessor: https://msdn.microsoft.com/en-us/library/windows/desktop/ms687419(v=vs.85).aspx

For maximum reach, I am writing my bot in Win32, targeting Win7 and above. The ESE will provide me a concurrent file-format. I've written a custom memory management routine for my code (mostly because MCTS doesn't "delete" data until the end, at which point I can just "free" ALL the memory at once, or even just overwrite it with a new MCTS iteration).  

YieldProcessor / "Pause" Assembly instruction is important for spinlocks on SMT / Hyperthreaded systems. I wrote and tested my own spinlock, but I haven't quite gotten to use it yet. Mutexes are too slow for synchronization, Spinlock + Atomics ftw.

SPSA

* http://www.jhuapl.edu/spsa/

I'm slowly building up a number of parameters that I'll eventually need to optimize. Stochastic search is the study of optimizing these sorts of parameters. It seems like Stockfish uses SPSA to self-optimize itself.

I'll still need to build some kind of official testing framework and suite of benchmarks. For now, something like a 10,000 step single-player Puyo game is going to be the benchmark for the AI.

Multithreaded / Atomics 

* Every CPPCon talk by Pikus: https://www.youtube.com/watch?v=ZQFzMfHIxng

Pikus is an amazing speaker and explains "Atomics" exceptionally well. Above is a link to one of his 2017 talks, but I've watched all of them and understand multithreaded synchronization much better because of it.

* Another Pikus talk: https://www.youtube.com/watch?v=lVBvHbJsg5Y

* Herb Sutter's Atomic Weapons: https://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2

MCTS

* https://arxiv.org/abs/1204.5721

The Multi-Armed Bandit Problem is the cornerstone of the UCT algorithm, a piece of the MCTS algorithm overall.

Optimization

* Agner Fog: http://www.agner.org/optimize/
* Intel Optimization guide: https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

Low-level optimization guides for optimizing L1 cache and CPU cycles. I've of course got some profiling tools (Visual Studio's profiler) which is guiding me primarily, but a solid theory on how modern CPUs work is necessary to optimize. I'm staying mostly higher-level.

* Chess Programming Wiki: https://chessprogramming.wikispaces.com/

The Chess Programming Wiki has a huge number of low-level bit-twiddling tricks

Offline Reading / Books

* The Art of Computer Programming -- Knuth's signature book. I've read through it before, but Chapter 2 has a very good review of tree representations and tree-algorithms. I've reread this section, since there's a lot of tree stuff in in MCTS... trees.


Monday, January 29, 2018

January Notes

Writing this AI is a huge process in learning for me. Initially, I thought light-random playouts and MCTS would be enough to achieve strong play.

And in some regards, yes, the January version of my code does make Chain9 somewhat consistently. But there are certain things that feel "easy" for a human to figure out that my AI struggles with severely.

But first, here's a summary of the current MCTS Bot:

* Single-threaded -- While I've done a lot of research into multi-threaded programming, my bot is still single-threaded. I have a strong feeling that I can convert my bot into a multithreaded program "easily" (I've thought of a multi-threaded design from the start), but until its actually done, I guess its important to note its still single-threaded.

* Deterministic -- The main reason why I've kept it as single-threaded is because determinism is incredibly important for study. I need the bot to be repeatable. There are ways to make a multi-threaded deterministic bot, but my lock-free design does not allow for this very easily.

* L1 Cache Friendly ?? -- I don't have access to Intel's VTune. But based on some code analysis I've done, I expect the majority of the data to be processed in the L1 cache. I've done this by working with 16kb of MCTS data at a time, and ensuring my "pop and fall" algorithms use a minimum amount of RAM (on the order of ~2kb or less). So I expect all my processing to fit within the 32kb L1 cache of a typical modern systems. I basically have two MCTS algorithms: "Big" and "Little". The "Little" MCTS explores within 16kb of memory, while "Big" backpropagates the "little" results, and chooses a new node to perform "Little MCTS" upon.

Eventually, I'll properly test the L1 cache friendlyness with perftools or the like.

* 2GB Tested -- I've set the "Big" MCTS to a size as large as 2GB. This is somewhat important because I have written all memory routines manually. Its easier said than done: because MCTS never "forgets" or "deletes" data, its rather trivial to write a basic "atomic increment" memory allocator, especially since I never "free" until all the memory is used up.

Changes of Opinion:

* I no longer believe that light-random playouts is suitable for strong Puyo AI play. I'm working on a more suitable "heavy playout" idea with a better heuristic for evaluating field positions. A simple algorithm that is suitable for "trigger blocking" would be ideal.

* Multithreaded thoughts -- I've researched a LOT about multithreaded coding. A fully lock-free MCTS implementation is likely possible, but I've got a few ideas which are "mostly lock free, with spin-locks being used in corner cases". Spinlocks are surprisingly effective in some of my basic tests, and they're also not very difficult to write. (Well... you gotta read Intel's Optimization Manual to understand the "pause" instruction and understand how a spinlock might affect hyperthreads... as well as understand atomics and false-sharing of caches... but its not too difficult once you understand all that).

Multithreaded is still a barrier I have yet to cross. The main issue is that I'm primarily in a "research" phase, where speed is not yet my main concern. I'm still trying to research a good organization for the bot overall.

* MCTS Understanding -- I've learned that the core of MCTS is UCT, which is drawn from the "Multi-armed Bandit Problem". There are simpler algorithms to solve the Multi-Armed Bandit problem (such as Epsilon Greedy). If I were to start over, I'd probably use something like Epsilon Greedy instead of UCT. For now, its clear that I need a heuristic to evaluate puyo board positions to best improve my AI and perform "heavier playouts" instead.