Skip to content

Influence Maximization

Motivation and Problem Definition

Suppose we are interested in computing the seed set which activates the most number of nodes in expectation. This problem is known as influence maximization. Given a graph \(G = (V,E)\) and a fixed budget \(k\), we wish to choose a seed set \(S \subseteq V\) such that \(|S| = k\) and the influence \(\sigma(S)\) on the graph is maximized.

Overview of CELF

A well-known algorithm for this problem is the CELF algorithm, which is an optimized version of the greedy algorithm. At a high level, this algorithm works by greedily adding nodes to the seed set which has the highest marginal gain. This is repeated until the budget is exhausted.

Computing Marginal Gains

To implement this, we first need a specialized function to compute the marginal gain of adding a node to a given seed set. In this example, we focus on the independent cascade model. CyNetDiff allows us to accomplish this task very quickly:

import typing as t
from cynetdiff.models import DiffusionModel

def compute_marginal_gain(
    model: DiffusionModel,
    new_node: int,
    seeds_list: t.List[int],
    num_trials: int = 1_000,
) -> float:
    """
    Compute the marginal gain in the spread of influence by adding a new node to the set of seed nodes,
    by summing the differences of spreads for each trial and then taking the average.

    Parameters:
    - model: The model used for simulating the spread of influence.
    - new_node: The new node to consider adding to the set of seed nodes.
    - seeds: The current set of seed nodes.
    - num_trials: The number of trials to average the spread of influence over.

    Returns:
    - The average marginal gain in the spread of influence by adding the new node.
    """
    seeds = set(seeds_list)
    original_spread = 0
    new_spread = 0
    # If no seeds at the beginning, original spread is always just zero.
    # Prevents wasted work in cases where the seed set is empty.
    if len(seeds) > 0:
        model.set_seeds(seeds)

        for _ in range(num_trials):
            model.reset_model()
            model.advance_until_completion()
            original_spread += model.get_num_activated_nodes()

    new_seeds = seeds.union({new_node})
    model.set_seeds(new_seeds)

    for _ in range(num_trials):
        model.reset_model()
        model.advance_until_completion()
        new_spread += model.get_num_activated_nodes()

    # Avoid floating point division until the very end.
    return (new_spread - original_spread) / num_trials

Main Algorithm

With this function, implementing the rest of the algorithm is straightforward.

from tqdm import tqdm, trange
import heapq


def celf(
    model: DiffusionModel, n: int, k: int, num_trials: int = 1_000
) -> t.Tuple[t.Set[int], t.List[float]]:
    """
    Input: graph object, number of seed nodes
    Output: optimal seed set, resulting spread, time for each iteration
    Code adapted from this blog post:
    https://hautahi.com/im_greedycelf
    """

    # Run the CELF algorithm
    marg_gain = []

    print("Computing marginal gains.")
    # First, compute all marginal gains
    for node in trange(n):
        marg_gain.append(
            (
                -compute_marginal_gain(
                    model,
                    node,
                    set(),
                    num_trials,
                ),
                node,
            )
        )

    # Convert to heap
    heapq.heapify(marg_gain)

    max_mg, selected_node = heapq.heappop(marg_gain)
    S = [selected_node]
    spread = -max_mg
    spreads = [spread]

    print("Greedily selecting nodes.")
    # Greedily select remaining nodes
    for _ in trange(k - 1):
        while True:
            current_mg, current_node = heapq.heappop(marg_gain)
            new_mg_neg = -compute_marginal_gain(
                model,
                current_node,
                S,
                num_trials,
            )

            if new_mg_neg > current_mg:
                break
            else:
                heapq.heappush(marg_gain, (current_mg, current_node))

        spread += -new_mg_neg
        S.append(current_node)
        spreads.append(spread)

    # Return the maximizing set S and the increasing spread values.
    return S, spreads

Running the Algorithm

To see this in action, we can create a model as before and run our algorithm. We'll be using the trivalency weighting scheme to set our activation probabilities:

import networkx as nx
from cynetdiff.utils import networkx_to_ic_model, set_activation_random_sample

n = 5_000
# Randomly generate the graph
ic_graph = nx.random_regular_graph(7, n).to_directed()
# Set activation probabilites
set_activation_random_sample(ic_graph, {0.1, 0.01, 0.001})
# Create corresponding model
celf_ic_model, _ = networkx_to_ic_model(ic_graph)

num_seeds = 20
# Get best seed set returned by the algorithm
celf_ic_seeds, ic_marg_gains = celf(celf_ic_model, n, num_seeds)

Trying Different Models

The celf function also works with the Linear Threshold diffusion model:

from cynetdiff.utils import networkx_to_lt_model

n = 5_00
# Randomly generate the graph
lt_graph = nx.random_regular_graph(7, n).to_directed()

# Set weighting scheme
celf_lt_model, _ = networkx_to_lt_model(lt_graph)

num_seeds = 20
# Get best seed set returned by the algorithm
celf_lt_seeds, lt_marg_gains = celf(celf_lt_model, n, num_seeds)