Building a tic-tac-toe neural network =D

7 min readJust now

If you just want to experiment with the network, here’s the Github repo:

Just a heads up, this article focuses on a high-level overview rather than detailed code explanations — those will be in a future tutorial. Here’s the article summary

Create training data
Train the model
Results
Experiment with the model yourself!
Extra learnings about MCTS

Creating the training data

First, I decided what the network should output. Option one was predicting a player’s win likelihood for any given position and running this evaluation on the position resulting from each possible move to determine the best one. Option two, which I chose, involved assigning scores (probabilities) to each move, reflecting how strongly the network favored them.

I then used Monte Carlo Tree Search (MCTS) to evaluate positions (check out this article for the intuition behind MCTS!). Each position and its evaluation were paired as a tuple and added to the training data list.

import itertools
import pickle
import tqdm
from game_logic import *


def valid_turns(board):
    # Input: 1D list
    if board.count(1) == board.count(-1) or board.count(1) - 1  == board.count(-1):
        return True
    return False


def outcome_1D(board):
    win_conditions= [
            [0, 1, 2], [3, 4, 5], [6, 7, 8],  # rows
            [0, 3, 6], [1, 4, 7], [2, 5, 8],  # columns
            [0, 4, 8], [2, 4, 6]  # diagonals
    ]
    for condition in win_conditions:
        square = condition[0]
        if board[square] != 0:
            win = True
            token = board[square]
            for square in condition[1:]:
                if board[square] != token:
                    win = False
                    break
            if win == True:
                return True
    return False


def player_turn(board):
    if board.count(1) == board.count(-1):
        return 1
    elif board.count(1) - 1 == board.count(-1):
        return -1  


possible_items = [1, -1, 0]
all_boards = list(list(tup) for tup in itertools.product(possible_items, repeat=9))

valid_boards = [board for board in all_boards if 0 in board]  # Ensure at least 1 empty square
valid_boards = [board for board in valid_boards if valid_turns(board)]
valid_boards = [board for board in valid_boards if not outcome_1D(board)]
boards_with_turn = [(board, player_turn(board)) for board in valid_boards]

# Convert flat boards to Board objects for MCTS to work with
boards = []
for flat_board, turn in boards_with_turn:
    flat_board = [flat_board[i] * turn for i in range(9)]  # Ensure player to move is always represented by '1' token
    board = [flat_board[i:i+3] for i in range(0, 9, 3)]
    empties = [(x, y) for y in range(3) for x in range(3) if board[y][x] == 0]
    board = Board(3, 3, 3, 1, board, empties)
    boards.append(board)

# Evaluate boards with MCTS
training_data = []
iterations = 200
for board in tqdm.tqdm(boards, desc="Game Generation Progress"):
    mcts = MCTS(board)
    move_probs = mcts.search(iterations, 10)
    training_data.append((board, move_probs))

# Save training data
with open(f'train_data/mcts{iterations}_iterations.pkl', 'wb') as f:
    pickle.dump(training_data, f)

Training the model

To decide the network architecture I just did some Googling to see what others did for similar projects (a great way to decide on architecture). For other elements like the loss function, optimizer, MCTS search strength, and train/test split, I just relied on experimentation to see what worked.

Here’s short list of things I learned

SGD requires a much larger learning rate to achieve similar results to Adam (~500X)
Larger networks require lower learning rates due to having more parameters to adjust

import pickle
import random
import tqdm
import torch
import torch.nn as nn
import torch.optim as optim
from game_logic import *


class TTTNet(nn.Module):
    def __init__(self, input_size=9, hidden_size=36, output_size=9):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, output_size)
        
    def forward(self, x):
        x = self.fc1(x)
        x = torch.relu(x)

        x = self.fc2(x)
        x = torch.relu(x)

        x = self.fc3(x)
        x = torch.sigmoid(x)
        return x


def calculate_loss(data_set, net, loss_metric, optimizer=None):
    total_loss = 0
    for board, move_probs in data_set:
        output = net(board)
        loss = loss_metric(output, move_probs)
        if optimizer: # Train the network if an optimizer is provided
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        total_loss += loss.item()
    return total_loss / len(data_set)


def calculate_accuracy(data_set, net, probability_threshold=0.1, return_failed_predictions=False):
    """
    A prediction is considered correct if the network's highest probability move
    is within the probability_threshold of the target's highest probability move.
    """
    correct = 0
    for board, move_probs in data_set:
        output = net(board)
        best_move = torch.argmax(move_probs)
        best_moves = (move_probs >= move_probs[best_move] - probability_threshold).nonzero().flatten()
        if torch.argmax(output) in best_moves:
            correct += 1
    return correct / len(data_set)


def save_model(model, hidden_size, train_test_split):
    train_pct = int(train_test_split * 100) # Convert to percentage for model path
    model_path = f'saved_models/model_{hidden_size}h_{train_pct}tr.pth'
    torch.save(model.state_dict(), model_path)
    print(f"Model saved to {model_path}")


def train_model(train_test_split, network_hidden_size, epochs=20, learn_rate=0.3):
    # Prepare training data
    with open('train_data/mcts200_iterations.pkl', 'rb') as f:
        all_data = pickle.load(f)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tensor_data = [(torch.tensor([board.state[y][x] for y in range(3) for x in range(3)], dtype=torch.float).to(device),
                    torch.tensor([move_probs[y][x] for y in range(3) for x in range(3)], dtype=torch.float).to(device))
                    for board, move_probs in all_data]
    random.shuffle(tensor_data)
    split_idx = int(len(tensor_data) * train_test_split)
    train_data = tensor_data[:split_idx]
    test_data = tensor_data[split_idx:]

    # Initialize variables for training loop
    net = TTTNet(hidden_size=network_hidden_size).to(device)
    optimizer = optim.SGD(net.parameters(), lr=learn_rate)
    loss_metric = nn.MSELoss()
    
    # Train the network
    for epoch in tqdm.tqdm(range(epochs)):
        print(f"========= Epoch {epoch} =========")
        train_loss = calculate_loss(train_data, net, loss_metric, optimizer)
        print(f"Train loss: {train_loss}")
        with torch.no_grad():
            train_accuracy = calculate_accuracy(train_data, net)
            print(f"Train accuracy: {train_accuracy}")
            if train_test_split <1.0:
                test_loss = calculate_loss(test_data, net, loss_metric)
                print(f"Test loss: {test_loss}")
                test_accuracy = calculate_accuracy(test_data, net)
                print(f"Test accuracy: {test_accuracy}")
     
    # Save the trained model
    save_model(net, network_hidden_size, train_test_split)


if __name__ == '__main__':
    train_model(0.8, 36)

Results

First let’s talk about some silly things I did when evaluating my model’s performance:

At first I penalized the network whenever its top move didn’t match with the top MCTS move. Then I started wondering why the accuracy would always peak at 80%, and it turns out that the network was being penalized in positions where there were multiple equally good moves, and it just happened to choose one other than the top MCTS move.

To fix this, I counted a prediction as correct if the model’s top move was among the top MCTS moves (moves with a probability score within 0.05 of the “best move” by MCTS).

With this change, using only 10% of the full dataset for training, the network achieved 80% accuracy when tested on unseen positions from the remaining 90% of the dataset. However when I played against the network myself, it messed up far more often than 1 in 5 moves.

After some digging, I realized it’s because the opening positions are underrepresented in the training data, so a random sample has a high chance of not including them. This is important because many tic-tac-toe games are decided within the first 1–3 moves!

Here’s a breakdown of failed predictions by game length. The percentage is calculated as the number of failed predictions divided by the total positions of that length.

Game length 0: 100.00% failure rate
Game length 1: 66.67% failure rate
Game length 2: 33.33% failure rate
Game length 3: 14.68% failure rate
Game length 4: 10.85% failure rate
Game length 5: 5.61% failure rate
Game length 6: 1.82% failure rate
Game length 7: 1.72% failure rate
Game length 8: 8.56% failure rate

As we can see, the network struggles with openings but excels in the middle game. This is because middle game decisions, like winning immediately or blocking threats, are more straightforward. Openings, however, are subtler, and the network can’t apply middle game knowledge to them.

To fix this, I’d ensure the training data has a balanced number of games across all game lengths. Alternatively, we could train on the full dataset and skip the test set if we’re evaluating by playing against the model.

After training several models, the highest accuracy I achieved was 99% with a network of 2048 neurons per hidden layer. Adding more neurons didn’t improve accuracy, so I’m unsure if it’s a neural network limitation or if I needed more training epochs or a different learning rate.

In general, the best models could force wins, get 3 in a row, and block threats. However, they made at least one mistake every 2–3 games.

That’s it for this project, but I’m curious how to enable a network to achieve perfect play without training on all tic-tac-toe positions. One thing I would try is comparing the linear network’s performance directly to a convolutional neural network, but that’s a topic for another time.

There is another issue I’d like to mention here though. It’s an issue with the training data itself.

Monte Carlo Tree Search (MCTS) often converges on a single move, even when multiple best moves exist, which can mislead the network

More on MCTS

MCTS tends to converge on a single good move, even when several are equally strong. The more iterations it runs, the more one move gets a much higher score (like 0.9) than the others, which can be misleading when multiple moves lead to a forced win or draw.

Interestingly, the network generalized better, recognizing when moves were equally good, and got penalized for being “inaccurate” when MCTS gave extreme results.

I could influence the top move by adding a starting bonus to a node, but if the move isn’t one of the best, MCTS always corrects itself. The random playouts heavily impact which move MCTS converges on, so if one move gets a higher score early on, it’s more likely to be selected.

I added a file called 4_test_mcts.py if you want to experiment with this.

Now that we’ve covered how MCTS works, let’s move on to testing and experimenting with the model yourself :)

Experiment with the model here!

Link to the code. Just download it as a zip file, extract it, and make sure you open the “SL-TicTacToe-NN” folder in your code editor so the file pathing is correct.

To experiment with the training process, run 2_train_net.py
To play against the model, or have different nets play against each other, run 3_test_net.py

Be curious and have fun!

If you’re interested in an interactive tutorial on building a model like this from scratch, stay tuned, because that is coming soon :D