Thursday, September 6, 2018

Developing a heuristic: Constraint Programming for Puyo

Last off, I was talking about the brute-force MCTS methodology was coming up with nice 7-chains at a reasonable pace. However, expert human Puyo players can build 15+ chains in actual combat, so while the MCTS was successful in some sense, it is clearly too weak to challenge human players still, by an order of magnitude. As such, I set off looking for a heuristic to improve MCTS.

I eventually came across constraint programming, the AI-methodology for representing problems in terms of a series of variables, domains, and constraints. A constraint solver tries to assign variables values that satisfy constraints.

For example, Sudoku can be described as 81 variables, each with domain 1-9, with the 27 constraints of:

* All different(X11, X12, X13, X14 ... X19)
* All different(X21, X22, X23 ...)
* ... repeat the above for all rows
* All different(X11, X21, X31, ... X91)
* All different(X12, X22, X32...)
* ... repeat the above for all columns
* All different (X11, X12, X13, X21, X22, X23... X33)
* ... repeat the above for all squares.

Anyway, solving Sudoku in this way is very "human" and "intuitive". Once the problem is stated in this manner, its a simple process of elimination between all variables to figure out what the values each variable should have, or if a solution even exists.

So how to represent this in terms of PuyoPuyo?

Well, applying constraints to PuyoPuyo is pretty easy. A constraint of "Doesn't_Pop" between variables would describe 4 variables with at least one value being different (or be an empty space). The domain of any particular variable would be the 4-colors or an empty spot (a total of domain-size 5). It would seem natural to describe chains in a form of "These variables don't pop" and "These variables MUST pop".

The hard part then, is not finding a solution. But estimating the NUMBER of solutions. After all, PuyoPuyo isn't about just building a chain, its about building a chain given the pieces from the random-number generator. As such, a good "base" is not just a base which has potential to become a 15, 16, or 18-chain... but also one which  has a large potential and flexibility to become a 15+ chain.

In addition, the potential to create hellfires (power-2 chains), or counter-formations, can be represented as constraints as well.

Funny story: Constraint Programming, as powerful as it is, is an NP complete problem. Finding a solution will likely take longer than polynomial time, and finding all exact solutions will take exponentially more time. The only hope is to estimate the number of solutions in some way.

But that's fine! All I'm looking for is a powerful chain-building heuristic. I'll write up a few more posts later that better describe how this Constraint Programming framework will work in my Puyo AI.

Tuesday, January 30, 2018

Research Notes + Reading List

I figure I might as well share some of the things I'm reading or watching that are relevant to the Puyo problem.

My "current reading list" has changed dramatically. A few months ago, I focused my research on OpenCL, but I am now focusing on a CPU-based agent.

APIs

* Extendable Storage Engine: https://msdn.microsoft.com/en-us/library/gg269259(v=exchg.10).aspx
* Reserving and Committing Memory: https://msdn.microsoft.com/en-us/library/windows/desktop/aa366803(v=vs.85).aspx
* YieldProcessor: https://msdn.microsoft.com/en-us/library/windows/desktop/ms687419(v=vs.85).aspx

For maximum reach, I am writing my bot in Win32, targeting Win7 and above. The ESE will provide me a concurrent file-format. I've written a custom memory management routine for my code (mostly because MCTS doesn't "delete" data until the end, at which point I can just "free" ALL the memory at once, or even just overwrite it with a new MCTS iteration).  

YieldProcessor / "Pause" Assembly instruction is important for spinlocks on SMT / Hyperthreaded systems. I wrote and tested my own spinlock, but I haven't quite gotten to use it yet. Mutexes are too slow for synchronization, Spinlock + Atomics ftw.

SPSA

* http://www.jhuapl.edu/spsa/

I'm slowly building up a number of parameters that I'll eventually need to optimize. Stochastic search is the study of optimizing these sorts of parameters. It seems like Stockfish uses SPSA to self-optimize itself.

I'll still need to build some kind of official testing framework and suite of benchmarks. For now, something like a 10,000 step single-player Puyo game is going to be the benchmark for the AI.

Multithreaded / Atomics 

* Every CPPCon talk by Pikus: https://www.youtube.com/watch?v=ZQFzMfHIxng

Pikus is an amazing speaker and explains "Atomics" exceptionally well. Above is a link to one of his 2017 talks, but I've watched all of them and understand multithreaded synchronization much better because of it.

* Another Pikus talk: https://www.youtube.com/watch?v=lVBvHbJsg5Y

* Herb Sutter's Atomic Weapons: https://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2

MCTS

* https://arxiv.org/abs/1204.5721

The Multi-Armed Bandit Problem is the cornerstone of the UCT algorithm, a piece of the MCTS algorithm overall.

Optimization

* Agner Fog: http://www.agner.org/optimize/
* Intel Optimization guide: https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

Low-level optimization guides for optimizing L1 cache and CPU cycles. I've of course got some profiling tools (Visual Studio's profiler) which is guiding me primarily, but a solid theory on how modern CPUs work is necessary to optimize. I'm staying mostly higher-level.

* Chess Programming Wiki: https://chessprogramming.wikispaces.com/

The Chess Programming Wiki has a huge number of low-level bit-twiddling tricks

Offline Reading / Books

* The Art of Computer Programming -- Knuth's signature book. I've read through it before, but Chapter 2 has a very good review of tree representations and tree-algorithms. I've reread this section, since there's a lot of tree stuff in in MCTS... trees.


Monday, January 29, 2018

January Notes

Writing this AI is a huge process in learning for me. Initially, I thought light-random playouts and MCTS would be enough to achieve strong play.

And in some regards, yes, the January version of my code does make Chain9 somewhat consistently. But there are certain things that feel "easy" for a human to figure out that my AI struggles with severely.

But first, here's a summary of the current MCTS Bot:

* Single-threaded -- While I've done a lot of research into multi-threaded programming, my bot is still single-threaded. I have a strong feeling that I can convert my bot into a multithreaded program "easily" (I've thought of a multi-threaded design from the start), but until its actually done, I guess its important to note its still single-threaded.

* Deterministic -- The main reason why I've kept it as single-threaded is because determinism is incredibly important for study. I need the bot to be repeatable. There are ways to make a multi-threaded deterministic bot, but my lock-free design does not allow for this very easily.

* L1 Cache Friendly ?? -- I don't have access to Intel's VTune. But based on some code analysis I've done, I expect the majority of the data to be processed in the L1 cache. I've done this by working with 16kb of MCTS data at a time, and ensuring my "pop and fall" algorithms use a minimum amount of RAM (on the order of ~2kb or less). So I expect all my processing to fit within the 32kb L1 cache of a typical modern systems. I basically have two MCTS algorithms: "Big" and "Little". The "Little" MCTS explores within 16kb of memory, while "Big" backpropagates the "little" results, and chooses a new node to perform "Little MCTS" upon.

Eventually, I'll properly test the L1 cache friendlyness with perftools or the like.

* 2GB Tested -- I've set the "Big" MCTS to a size as large as 2GB. This is somewhat important because I have written all memory routines manually. Its easier said than done: because MCTS never "forgets" or "deletes" data, its rather trivial to write a basic "atomic increment" memory allocator, especially since I never "free" until all the memory is used up.

Changes of Opinion:

* I no longer believe that light-random playouts is suitable for strong Puyo AI play. I'm working on a more suitable "heavy playout" idea with a better heuristic for evaluating field positions. A simple algorithm that is suitable for "trigger blocking" would be ideal.

* Multithreaded thoughts -- I've researched a LOT about multithreaded coding. A fully lock-free MCTS implementation is likely possible, but I've got a few ideas which are "mostly lock free, with spin-locks being used in corner cases". Spinlocks are surprisingly effective in some of my basic tests, and they're also not very difficult to write. (Well... you gotta read Intel's Optimization Manual to understand the "pause" instruction and understand how a spinlock might affect hyperthreads... as well as understand atomics and false-sharing of caches... but its not too difficult once you understand all that).

Multithreaded is still a barrier I have yet to cross. The main issue is that I'm primarily in a "research" phase, where speed is not yet my main concern. I'm still trying to research a good organization for the bot overall.

* MCTS Understanding -- I've learned that the core of MCTS is UCT, which is drawn from the "Multi-armed Bandit Problem". There are simpler algorithms to solve the Multi-Armed Bandit problem (such as Epsilon Greedy). If I were to start over, I'd probably use something like Epsilon Greedy instead of UCT. For now, its clear that I need a heuristic to evaluate puyo board positions to best improve my AI and perform "heavier playouts" instead.

Sunday, December 24, 2017

Monte Carlo Tree Search / Upper Confidence Bounds Introduction

Monte Carlo Tree Search is the search tree representation I decided to go for in this Puyo AI project.

First: you must have an understanding of game trees. A game AI "thinks" by exploring the Game Tree. Each node represents a board state, with the root-node being a representation of the current board. The children of this node are all of the immediate actions the AI could take. For example, in Chess, a child node could be "Pawn to E5", or in Puyo it could be "Rotate Twice and Drop in Column 3". The specifics of how this game tree is built up depends on the algorithm.

The "value" of a node is a heuristic that the programmer makes up. In chess, this could be "Queens are 9 points and Rooks are 5 points". In Go, most AIs try to evaluate a probability of which player will win from a particular board position. (This move has a 40% chance of winning, or that move has a 60% chance of winning).

In the case of Monte Carlo Tree Search, there are four steps that are tightly looped.

  1. Selection -- A process where the most "interesting" node is chosen. In the case of UCT (Upper Confidence Bounds applied to Trees), selection searches the nodes with the highest value and the fewest visits.
  2. Expansion -- Once a node is selected, a new child-node is added to the tree at that location.
  3. Simulation -- This calculates the "value" of a particular node. In the case of UCT, you play a random game all the way to the "end". For my Puyo AI, I usually cap the simulation to 20-moves (subject to future research). 
  4. Backpropagation -- Once statistics are calculated from the simulation, you change the value of the tree to match the simulation results. From there, the process begins at #1: Selection and the loop continues.

In the animation above, I assume that backpropagation is a simple "maximum" function, but every step of this process has been customized by various researchers. In any case, the growth of this tree is the AI pondering about future steps. In the case of AlphaGo, a self-trained neural network was used as both the "Selection" and "Simulation" algorithms. In the case of UCT, simple-random searches serve as simulation, while selection is a simple function. So there are lots of different ways to implement this methodology.

The overall concept of MCTS is just the above four steps, and many Google Searches will get you this far. But in my opinion, the devil is in the details... details that don't seem to be talked about too much yet. I'll elaborate on these details in a later post.

Thursday, December 21, 2017

Inaugural run

With only 14kb of RAM... to fit inside of L1 cache... I have relatively low expectations of this initial version of the PuyoAI. With that said, this is the first time I got to see my AI in action. And these results please me.

You can see the first 200-moves of my AI on PasteBin: https://pastebin.com/tBycFJyH

My "Legal Moves" check is a bit simplified: Puyos drop from the top, I don't worry about ghosted columns, and death only happens if ALL columns fill up. With that said, the AI does seem to have a reasonable amount of intelligence. It was able to play all 200 moves without killing itself, and it also managed to execute a fancy 7-chain on move 62.

See the Chainsim here: https://puyonexus.com/chainsim/chain/XCJdA



Even for a weak AI that barely even sees 2 move ahead... I think the formations demonstrated in this "first chain7" show some promise. In particular, the last "clear with garbage" was on iteration 44. So the AI spends ~18 turns building up this formation. If we look at the "before and after" shots, from iteration 44 through 62, we'll see the following:




The reds are popped in column 3/4, which leads to an interesting Yellow and Red L-shape in column 2/3.

There are "bad shapes" everywhere still: there's a "Very Bad GTR" on the right-hand side, where the Red Puyo is blocked in by the Yellow Puyos. But the GTR formation looks like it has some promise: the Yellows on the 2nd row from the bottom could turn into a real chain, and the left side of the formation looks like it can be turned into a serious tail.

Maybe I'm just being too optimistic here with the abilities of a grossly memory-constrained Puyo AI. But again, considering that this AI only had 14kB to work with, I'm very impressed that it managed to make anything that even looks remotely decent.

My next goal is to extend this AI to use 8-thread / 4-cores of my processor, and simultaneously allow this AI to grow to MB or even GB of thinking-space. At that point, I expect it will be a much better player!

Wednesday, December 20, 2017

Simple Stats

My current benchmark: 1512519 simulations in 36.074 seconds.


This called the "Little_MCTS" function 1000 times. There was a printout that indicated the total of 1512519 "visits".

Little_MCTS is my "cache-friendly" implementation of MCTS, with 16-bit "pointers" on an address space of 14kb. My plan is to run Little_MCTS on 8-threads (across 4 cores with hyperthreading), on my (kind of old) Sandy Bridge i7-2600k processor.

Sandy Bridge has 32kb of L1 data-cache (with 32kb on instructions cache). With two threads on one core running, that would use 28kb of the L1 cache, 4kb left for stack-data and other calculations. Hopefully that's enough? There's a lot of pointer-chasing in the "select" method in MCTS

The #1 "inner" function is "pop", at 56% of the time. "Pop" is my implementation of the flood-fill problem, that looks for Puyo-sets of 4. It basically implements the Puyo scoring algorithm. I've racked my brain at how to improve the speed of the "Pop" algorithm, but I haven't come up with anything faster yet.

Assuming ~100 bytes per simulation in the "full" implementation (not yet done), it would require 8GB / 100 bytes == 80-million simulations to fill it up. Assuming a 4x multiplier from 4-cores / 8-threads hyperthreaded, that's 8-minutes or so per analysis session.

Which is... way slower than most humans play. But for a deep-analysis of Puyo positions, that actually seems reasonable. I think this project, at this speed tier, can actually be useful at that speed!

Puyo AI Project

With freely available programs like Stockfish and SCID vs PC, it is possible to constantly have a super-human level analysis applied to every Chess game you play. You can find missed opportunities on both sides, and hone your chess skills to the max.

For a game like PuyoPuyo however, no such super-human AI exists. And my "Open Source Itch" is to develop some self-learning tool for myself, and the greater PuyoPuyo community. True, there are lots of training tools and tutorials out there, such as the excellent PuyoNexus chain simulator, and the community is quite helpful when discussing strategy. But there's nothing quite like a super-human AI that automatically tells you mistakes as you're playing the game.

The first step to solving this problem is to somehow create a super-human AI. But how to do it?

With Google's success with "AlphaGo" in recent years, I've grown to trust the Monte-Carlo Tree Search algorithm, although I'm still suspicious of "Deep Learning" or other neural network methodologies, at least on consumer hardware. Clearly a legion of 100TFLOP or 200TFLOP "TPUs" in Google supercomputers change the game (as AlphaGo has demonstrated), but standard hardware is limited to 1000x less performance with ~100GFLOP AVX instructions. Even if I used a more difficult GPGPU methodology, the best $800 GPUs only get 10TFLOPs, and I'd expect training and inference to take too long under those conditions.

So my initial plans are to create this PuyoPuyo AI that plays under the classical MCTS with "light random rollouts" akin to the original MCTS "MoGo" AI.

In short, expect blogposts on the following:

  • Puyo Puyo, a great little niche puzzle game
  • Brief OpenCL / GPGPU discussion, and why I'm holding off on this tech for this project
  • Monte-Carlo Tree Search, and how it relates to game trees.
  • Optimizations I'm employing to shrink down the memory-space to better fit into L1 Cache. Faster speed means more nodes explored, which will always improve the AI's strategy
  • SIMD opportunities in the MCTS algorithm.

Of course, as I learn about new opportunities, I'll write down my thoughts.