tl;dr
I designed an efficient algorithm for time-constrained seed selection in competing cascade graphs.
Overview
As part of another yearly Caltech tradition, students in CS 144 are exposed to an emerging hot topic in networks: competing cascades. A competing cascade on a graph G is defined as follows:
Initialization: each node in G begins uncolored.
Seeding: each team selects k seed nodes to start with their respective color. If two teams attempt to seed the same node, it begins blank. With the colored seeds set, we now begin rounds of spreading.
During each round: each node v looks at its non-blank neighbors and totals the number of each color. If any color's total obtains a strict majority (over half) of the total sum, v takes on that color at the end of the round. Note that if v already has a color, it also "votes" for its own color with weight 1.5.
Termination: rounds continue until convergence, meaning no nodes change color during a round. Note that there is a small chance colors oscillate indefinitely; if this occurs, we stop after a randomly chosen number of rounds.
Scoring: after convergence or a manual stop, the team with the most nodes of their color wins. If both teams happen to select the exact same seed nodes, both are awarded a loss.
Note that the definition above admits cascades with an unlimited number of teams, but here we focus mostly on two-team cascades. Using the above cascade mechanics, the Pandemaniac challenge is run according to the following procedure:
Teams are given access to graphs on the course website, where each file stores the graph in adjacency list format as well as the number of seed nodes k to select.
From the moment a team downloads a graph file, they have a maximum of 5 minutes to submit a text file containing their desired seed nodes.
Each day of the challenge, a few graphs are released and each team's seeds are pitted round-robin style against all other teams'. Teams with the best overall records then face off elimination-style until a winner is crowned.
Pandemaniac is clearly a complicated game, and developing a winning strategy involves some combination of graph theory and game theory. Do we simply choose the highest-degree nodes in the graph, or use some other measure of centrality? Do we prioritize nodes the other team is unlikely to seed in order to avoid picking the same ones? The answers to these questions become apparent as we explore tools and strategies below.
Tools
The simulation code provided with the sample project files was adequate for simple testing, but we knew from the start that our strategy would likely involve many successive simulations. Thus, we sought to make the simulation code more efficient. For reference, the original simulation roughly followed the pseudocode below:
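A minimal sketch of that naive round loop, following the cascade rules above (function and variable names are ours, not the course's actual code):

```python
# Naive cascade simulator: every round, visit every node and tally the
# colors of all of its neighbors, including nodes with no colored
# neighbors at all. Illustrative sketch, not the original course code.
def run_cascade(adj, colors, max_rounds=1000):
    """adj: {node: [neighbors]}, colors: {node: color or None}."""
    for _ in range(max_rounds):
        new_colors = dict(colors)
        changed = False
        for node, neighbors in adj.items():
            votes = {}
            for nbr in neighbors:
                c = colors[nbr]
                if c is not None:
                    votes[c] = votes.get(c, 0) + 1
            # a colored node also votes for itself with weight 1.5
            if colors[node] is not None:
                votes[colors[node]] = votes.get(colors[node], 0) + 1.5
            total = sum(votes.values())
            for c, v in votes.items():
                # strict majority: over half of the total vote sum
                if v > total / 2 and c != colors[node]:
                    new_colors[node] = c
                    changed = True
        if not changed:  # convergence: no node changed this round
            break
        colors = new_colors
    return colors
```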
The first improvement simply optimized the original code to avoid unnecessary neighbor checks. In the original simulator, each round requires looking through every node in the graph as well as every neighbor of each node. This includes nodes that have no colored neighbors and no reason to be checked in the first place. The improvement we made was to look only at nodes that we'd previously marked as "needing updating." When a node updates its color, it updates the votes for its color at each of its neighbors and marks the neighbor as needing updating. Thus, we only look through the necessary nodes at each iteration. In testing, this produced roughly a 30% reduction in runtime and allowed us to start pursuing brute-force strategies with greater success.
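A sketch of this bookkeeping, again with illustrative names: each node keeps a running vote tally that color changes update in place, and only "dirty" nodes are re-examined each round.

```python
from collections import defaultdict

# "Dirty set" cascade simulator: only nodes whose neighborhood changed
# last round are re-examined. Illustrative sketch, not the actual code.
def run_cascade_dirty(adj, colors, max_rounds=1000):
    votes = {n: defaultdict(float) for n in adj}  # running tallies
    dirty = set()
    for node, c in colors.items():  # seed the tallies from initial colors
        if c is not None:
            for nbr in adj[node]:
                votes[nbr][c] += 1
                dirty.add(nbr)
    for _ in range(max_rounds):
        updates = {}
        for node in dirty:  # only check nodes that might change
            tally = dict(votes[node])
            cur = colors[node]
            if cur is not None:  # self-vote with weight 1.5
                tally[cur] = tally.get(cur, 0) + 1.5
            total = sum(tally.values())
            for c, v in tally.items():
                if v > total / 2 and c != cur:
                    updates[node] = c
        if not updates:  # convergence
            break
        dirty = set()
        for node, c in updates.items():
            old = colors[node]
            colors[node] = c
            for nbr in adj[node]:  # push the change to neighbor tallies
                if old is not None:
                    votes[nbr][old] -= 1
                votes[nbr][c] += 1
                dirty.add(nbr)
    return colors
```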
The second improvement was essentially a complete redesign of the simulation that forgoes traversing the graph altogether. Instead, it takes advantage of sparse matrix multiplication, which is much quicker. For a game with two colors, we can accomplish this by creating a vector x and a matrix M with the following setup: x_v is +1 if node v has the first color, -1 if it has the second, and 0 if it is uncolored, and M = A + 1.5I, where A is the adjacency matrix. Note that M is exactly the adjacency matrix with a superimposed diagonal of value 1.5. What the above setup accomplishes is the ability to represent a given round of the cascade as x ← sign(Mx). If this is unclear, perhaps take some time to convince yourself that it is true: with two colors, a strict majority is exactly the sign of the vote difference, and the ±1.5 self-vote ensures a colored node keeps its color when no majority forms against it. We now had a way to rapidly simulate cascade games using just a few lines of linear algebra.
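Under this reconstruction of the setup, the whole simulator collapses to a few lines. It is shown dense with plain NumPy for clarity; as described above, the real implementation used a sparse matrix for M.

```python
import numpy as np

# Matrix-based cascade simulator for two colors: x holds +1 / -1 / 0
# per node and M = A + 1.5*I, so one round is x <- sign(M @ x).
# Illustrative sketch; the actual version used sparse matrices.
def run_cascade_matrix(A, x, max_rounds=1000):
    M = A + 1.5 * np.eye(A.shape[0])
    for _ in range(max_rounds):
        # With two colors, a strict majority of votes is exactly the
        # sign of the weighted sum; a colored node's +-1.5 self-vote
        # means its row of M @ x is never exactly 0, so sign() keeps
        # its color whenever no majority forms against it.
        x_next = np.sign(M @ x)
        if np.array_equal(x_next, x):  # convergence
            break
        x = x_next
    return x
```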
After implementing most of our strategies in Python and heavily leveraging NumPy, we decided to try making it faster. The first thing we tried was PyTorch, a GPU-enabled ML library whose syntax and functionality are very similar to NumPy's. After implementing the changes, at least on our MacBooks (which were primarily what we had access to), there was only a marginal improvement over NumPy. We then tried CuPy, a NumPy-compatible library that leverages CUDA, but our Windows desktops were not able to run CuPy due to some CUDA issues, and it is not supported on Apple silicon for obvious reasons. Afterwards, we stumbled across a build of NumPy optimized for Apple silicon and immediately saw an average of over a 5x increase in processing speed over base NumPy, and a 2x improvement over the PyTorch implementation. This is ultimately what we went with. Using these improvements, we were able to increase processing for our final strategy from ~100-150k iterations to over 1 million.
Potential Strategies
Weighted: the first potential strategy came out of testing with the provided simulation. We saw that the winning strategies usually contained mostly high-centrality nodes, so we decided to build around that. In graph analysis, the three main centrality measures are degree, betweenness, and closeness centrality, defined for a node v in an n-node graph as follows: degree centrality is deg(v)/(n - 1), the fraction of other nodes adjacent to v; closeness centrality is (n - 1)/sum_u d(v, u), the reciprocal of v's average shortest-path distance to every other node; and betweenness centrality is the sum over node pairs (s, t) of sigma_st(v)/sigma_st, the fraction of shortest s-t paths that pass through v.
Nodes of high centrality are very useful because their connectedness helps them spread their color quickly through the rest of the graph. For this strategy we took each of the above centralities, applied a weight to each, and chose the nodes with the highest rank after weighting. We then iterated over all discrete combinations of centrality weights and pitted each combination against the others. Ultimately, we kept the strategy that won the most.
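The selection step can be sketched as below, assuming the centrality scores have already been computed (e.g. via networkx's degree_centrality, betweenness_centrality, and closeness_centrality); the function and parameter names are our illustration, not the actual implementation.

```python
# Weighted-centrality seed selection: score each node as a weighted sum
# of its centrality measures and take the top k. Illustrative sketch.
def weighted_seeds(centralities, weights, k):
    """centralities: {measure: {node: score}}, weights: {measure: float}."""
    nodes = next(iter(centralities.values())).keys()
    score = {
        n: sum(weights[m] * centralities[m][n] for m in weights)
        for n in nodes
    }
    # rank nodes by their combined weighted score, highest first
    return sorted(nodes, key=score.get, reverse=True)[:k]
```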
Genetic: Given that Pandemaniac is a kind of game, we took inspiration from genetic algorithms developed to learn video games in an unsupervised way. To do this, we started with a population of seedings made by randomly selecting k nodes from the graph. We then calculated the "fitness" of each strategy by pitting every strategy in the population against the others; the resulting fitness measures how many games that seeding won. We then took the top 2 strategies as "parents" and generated "offspring" in the "crossover" step by randomly choosing seeds from the combined seed sets of both parents. Finally, in the "mutation" step, a small subset of nodes in each offspring was randomly swapped out for some other node in the graph. This process repeated until the parent seeds did not change for 5 iterations, after which point we pitted the final population against each other and took the winningest seeding.
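One generation of that loop might look like the following sketch; the mutation rate and helper names are our illustrative choices, not the exact implementation.

```python
import random

# One generation of the genetic strategy: crossover draws each child's
# seeds from the union of the two parents, then mutation occasionally
# swaps a seed for a random outside node. Illustrative sketch.
def next_generation(parent_a, parent_b, nodes, k, pop_size,
                    mutation_rate=0.1):
    gene_pool = list(set(parent_a) | set(parent_b))
    offspring = []
    for _ in range(pop_size):
        # crossover: k distinct seeds drawn from both parents' seed sets
        child = set(random.sample(gene_pool, k))
        # mutation: each seed has a small chance of being replaced
        for seed in list(child):
            if random.random() < mutation_rate:
                child.remove(seed)
                child.add(random.choice(
                    [n for n in nodes if n not in child]))
        offspring.append(sorted(child))
    return offspring
```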
Clustering: This strategy was less an individual strategy and more a focusing of the weighted strategy above. The goal was to reduce the problem scope, specifically by using k-means to find the largest cluster and then applying the weighted strategy to choose the best seed nodes within it. The thinking was that if parts of the graph were under contention, we'd be best off having a stronghold within the largest cluster. In testing, this typically lost to a more generalized weighted strategy, or even to simply the highest-degree nodes.
After extensive testing with the above 3 strategies, we pitted their best results against one another to find the overall winningest strategy. This usually came from the first vein of strategies, specifically one in which degree centrality was weighted very highly. We used this empirical observation when developing our final strategy, described below, which (surprisingly) has very little in common with the above three as it takes more of a brute-force approach.
Developing a Final Strategy
We developed our final strategy in three stages, which we describe below with some associated pseudocode. Note that our actual implementations used Python's multiprocessing module to squeeze as much work as possible out of the CPU within the five minutes:
V0. Random brute force: This was our first foray into the black box we would eventually develop. The motivation behind attempting this strategy is quite simple: despite our best analytical efforts, we could rarely find seed nodes that outright beat degree centrality. Given how random it seemed, we thought it would be amusing to generate random seed nodes for 4 minutes and see if any beat degree centrality. To our surprise, this not only produced better seeds but also produced plenty of candidates within the five-minute timespan. Thus, we elected to compete with this for the first few days while we looked into optimizations.
Given that our first improvement did not involve any analytical consideration and was based on quick simulations, we decided to double down on the strategy and looked for ways to simulate even faster. We improved our simulator to use the matrix-multiplication approach described in Tools above and changed our performance metric from total nodes captured to frontier expansion rate: we measured how fast a seeding accumulated nodes by dividing the number of nodes captured by the number of rounds it took to converge. This disincentivized non-convergent or slow-growing seeds. This was met with much better success than V0, but we still had not cracked the top few teams.
V1. Directed brute force: This is when the strategy really began to get smarter. Having invested significant time in optimizing the simulation process, we shifted our focus to reducing the problem space itself. From empirical observation of the past few days' results, it appeared that most winning seeds consisted of some combination of high-degree nodes plus a few (usually 1-3) slightly lower-degree nodes. Knowing this, we developed a new approach that looked through seeds in a more promising order (thereby increasing our odds of finding a good one within five minutes). We accomplished this by first sorting nodes by degree centrality and then doing the following at each iteration: we chose a random number j, then picked k - j seed nodes from the top of the ranking as well as j seed nodes from the remaining top 25% of nodes. This produced many more high-performing seeds, further strengthening our final result because we had enough seeds to do a final layer of testing before picking which one to submit. That final step consisted of filtering out the top 100 seedings from the above process, pitting them all against each other, and submitting the one with the highest average win rate. As expected, this was met with even more success during round-robin matches, advancing us to the final round of the tournament!
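One draw of this sampler can be sketched as follows; the exact bounds on j in our implementation may have differed, so treat them as illustrative.

```python
import random

# Directed brute-force sampling: most seeds come from the very top of
# the degree ranking, and j of them (a few, following the observation
# about winning seeds) come from the remaining top 25% of nodes.
def sample_directed(nodes_by_degree, k):
    top = nodes_by_degree[:k]                      # k highest-degree nodes
    rest = nodes_by_degree[k:len(nodes_by_degree) // 4]  # rest of top 25%
    j = random.randint(0, min(3, k, len(rest)))    # lower-degree picks
    return random.sample(top, k - j) + random.sample(rest, j)
```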
V2. Greedy top-down search: As a final improvement, we made the strategy greedy in an effort to further improve the pool of high-performing seedings we iterate through. We begin with the k highest-degree nodes as both the baseline strategy and the current strategy. We then launch k processes (one per seed), where the ith process considers what would happen if we replace the ith node of the current seed set with something new, confirming that the seeding hasn't been tested yet and testing it against the baseline. If it beats the baseline, we add the seeding to a winning dictionary that tracks how much these strategies won by. After all processes finish iterating, we pool the winning dictionaries, select the best-performing seeding from the last iteration as the new current strategy, and repeat. After 4 minutes, we run the same final step of testing as in V1: the top 100 seedings from all iterations are pitted against each other, and the set of seeds with the highest average win percentage is submitted.
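A single-process sketch of one greedy pass (the real version parallelized the inner loop across k processes; beats() is a hypothetical stand-in for a head-to-head simulation returning the win margin):

```python
# One pass of the greedy top-down search: try replacing each position i
# of the current seed set with an untested candidate, record anything
# that beats the baseline, and promote the best trial. Illustrative.
def greedy_pass(current, candidates, tested, beats):
    winners = {}  # candidate seed set -> margin over the baseline
    for i in range(len(current)):
        for node in candidates:
            if node in current:
                continue
            trial = tuple(sorted(current[:i] + [node] + current[i + 1:]))
            if trial in tested:  # skip seedings we've already simulated
                continue
            tested.add(trial)
            margin = beats(trial, tuple(current))
            if margin > 0:
                winners[trial] = margin
    # the best trial of this pass becomes the next current strategy
    best = max(winners, key=winners.get, default=tuple(current))
    return best, winners
```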
Results
Out of about 32 teams, we came in second place. As it turns out, the first-place team used pretty much the same strategy we did, but without the final step. In effect, they found a pseudo-Nash equilibrium: a strategy that was excellent at beating other "good" strategies but not necessarily average strategies. This makes a lot of sense in the context of previous rounds, where we had a much better record against teams that employed surface-level strategies like degree or betweenness centrality because we prioritized average win rate instead of just wins against good strategies.
Overall, I'm extremely pleased with how we did. I'm incredibly surprised by how well a heavily optimized brute-force approach worked, especially in a competition that has been a Caltech classic for many years. Perhaps this just goes to show that sometimes simpler is better!