layout | title |
---|---|
post | Influence Maximization |
Identification of influential nodes in a network has important practical uses. A good example is "viral marketing", a strategy that uses existing social networks to spread and promote a product. A well-engineered viral marketing campaign identifies the most influential customers and convinces them to adopt and endorse the product, which then spreads through the social network like a virus.
The key question is how to find the most influential set of nodes. To answer it, we will first look at two classical cascade models, the Linear Threshold Model and the Independent Cascade Model.
Then, we will develop a method to find the most influential node set in the Independent Cascade Model.
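In the Linear Threshold Model, we have the following setup: a node $$v$$ has a threshold $$\theta_{v}$$ chosen uniformly at random from $$[0,1]$$, and each neighbor $$w$$ influences $$v$$ with a weight $$b_{v,w}$$ such that
$$ \sum_{w\text{ neighbor of } v} b_{v,w}\leq 1 $$
A node $$v$$ becomes active when the total weight of its active neighbors reaches its threshold:
$$ \sum_{w\text{ active neighbor of } v} b_{v,w}\geq\theta_{v} $$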
The following figure demonstrates the process:
(A) node V is activated and influences W and U by 0.5 and 0.2, respectively; (B) W becomes activated and influences X and U by 0.5 and 0.3, respectively; (C) U becomes activated and influences X and Y by 0.1 and 0.2, respectively; (D) X becomes activated and influences Y by 0.2; no more nodes can be activated; process stops.
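The walkthrough above can be reproduced with a minimal Python sketch. The edge weights come from the figure description; the thresholds are not given in the text, so the values below are assumptions chosen so that the run matches panels (A) to (D). The function name `linear_threshold` and the dictionary layout are hypothetical.

```python
def linear_threshold(in_weights, thresholds, seeds):
    """Run the Linear Threshold process until no new node activates.

    in_weights[v][w] is b_{v,w}, the influence of neighbor w on node v.
    """
    active = set(seeds)
    while True:
        # A node activates once the weights of its active neighbors
        # reach its threshold.
        newly_active = {
            v
            for v, neighbors in in_weights.items()
            if v not in active
            and sum(b for w, b in neighbors.items() if w in active) >= thresholds[v]
        }
        if not newly_active:
            return active
        active |= newly_active


# Weights b_{v,w} taken from the figure description.
in_weights = {
    "W": {"V": 0.5},
    "U": {"V": 0.2, "W": 0.3},
    "X": {"W": 0.5, "U": 0.1},
    "Y": {"U": 0.2, "X": 0.2},
}
# Assumed thresholds (not given in the text).
thresholds = {"W": 0.4, "U": 0.45, "X": 0.55, "Y": 0.5}

final_active = linear_threshold(in_weights, thresholds, seeds={"V"})
print(sorted(final_active))
```

With these assumed thresholds, Y stays inactive because its active neighbors contribute only 0.2 + 0.2, which is below 0.5.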
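In the Independent Cascade Model, we model the influence (activation) of nodes with probabilities on a directed graph: each edge $$(u,v)$$ carries a probability $$p_{uv}$$, and when node $$u$$ becomes active it gets one chance to activate each inactive out-neighbor $$v$$, succeeding with probability $$p_{uv}$$.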
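A single run of such a cascade might look like the following sketch. It assumes the graph is stored as a dictionary from each node to a list of `(neighbor, probability)` pairs; the function name `independent_cascade` and the toy graph are hypothetical.

```python
import random


def independent_cascade(graph, seeds, rng):
    """Simulate one cascade and return the set of activated nodes.

    graph[u] is a list of (v, p_uv) pairs for the directed edges u -> v.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            # Each newly active node gets one attempt per out-neighbor.
            for v, p in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


# Toy graph: a always activates b, never activates c; b activates d half the time.
toy_graph = {
    "a": [("b", 1.0), ("c", 0.0)],
    "b": [("d", 0.5)],
    "d": [("e", 1.0)],
}
cascade = independent_cascade(toy_graph, seeds={"a"}, rng=random.Random(0))
print(sorted(cascade))
```

Because activation is random, repeated runs with different random states can return different sets for the same seeds.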
Note:
*Red-colored nodes a and b are active. The two green areas enclose the nodes activated by a and b respectively, i.e. $$X_{a}$$ and $$X_{b}$$.*
Note:
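The influence maximization problem is then an optimization problem: choose the set $$S$$ of $$k$$ seed nodes that maximizes $$f(S)$$, the expected size of the cascade started from $$S$$: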
$$ \max_{S \text{ of size }k}f(S) $$
This problem is NP-hard [Kempe et al. 2003]. However, a greedy approximation algorithm, Hill Climbing, gives a solution $$S$$ with the following approximation guarantee:
$$ f(S)\geq(1-\frac{1}{e})f(OPT) $$
where $$OPT$$ is the globally optimal solution.
Algorithm: start from $$S_{0}=\emptyset$$; at each step $$i$$, pick and activate the node $$u$$ with the largest marginal gain, i.e. the $$u$$ that attains $$\max_{u}f(S_{i-1}\cup\{u\})$$, and set $$S_{i}=S_{i-1}\cup\{u\}$$.
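As a minimal sketch of this greedy loop, the code below takes the objective as an arbitrary Python callable `f`. The toy objective used here is a plain set-coverage count that stands in for the expected cascade size; the names `hill_climbing`, `toy_sets`, and `toy_coverage` are hypothetical.

```python
def hill_climbing(nodes, f, k):
    """Greedily pick k nodes, each time adding the node that maximizes f."""
    chosen = []
    for _ in range(k):
        candidates = [u for u in nodes if u not in chosen]
        if not candidates:
            break
        # Maximizing f(S + u) is the same as maximizing the marginal gain,
        # because f(S) is fixed within a step.
        best = max(candidates, key=lambda u: f(set(chosen) | {u}))
        chosen.append(best)
    return chosen


# Toy objective: number of items covered by the chosen sets.
toy_sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def toy_coverage(S):
    return len(set().union(*(toy_sets[u] for u in S)))


seed_set = hill_climbing(list(toy_sets), toy_coverage, k=2)
print(seed_set, toy_coverage(seed_set))
```

In this toy run the first pick is `c` (covers 4 items) and the second is `a` (adds 3 new items), for a total coverage of 7.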
Claim: Hill Climbing produces a solution that has the approximation guarantee $$f(S)\geq(1-\frac{1}{e})f(OPT)$$.
Definition of Monotone: if $$f(\emptyset)=0$$ and $$f(S)\leq f(T)$$ for all $$S\subseteq T$$, then $$f(\cdot)$$ is monotone.
Definition of Submodular: if $$f(S\cup \{u\})-f(S)\geq f(T\cup\{u\})-f(T)$$ for any node $$u$$ and any $$S\subseteq T$$, then $$f(\cdot)$$ is submodular.
Theorem [Nemhauser et al. 1978]:{% include sidenote.html id='note-nemhauser-theorem' note='also see this handout' %} if $$f(\cdot)$$ is monotone and submodular, then the $$S$$ obtained by greedily adding $$k$$ elements that maximize marginal gains satisfies
$$ f(S)\geq(1-\frac{1}{e})f(OPT) $$
Given this theorem, we need to prove that the expected cascade size function $$f(\cdot)$$ is monotone and submodular.
It is clear that the function $$f(\cdot)$$ is monotone based on the definition of $$f(\cdot)$${% include sidenote.html id='note-monotone' note='If no nodes are active, then the influence is 0. That is $$f(\emptyset)=0$$. Because activating more nodes will never hurt the influence, $$f(U)\leq f(V)$$ if $$U\subseteq V$$.' %}, and we only need to prove $$f(\cdot)$$ is submodular.
Fact 1 of Submodular Functions: $$f(S)=\mid \cup_{k\in S}X_{k}\mid$$ is submodular, where $$X_{k}$$ is a set. Intuitively, the more sets you already have, the less new "area" a newly added set $$X_{k}$$ provides.
Fact 2 of Submodular Functions: if $$f_{i}(\cdot)$$ are submodular and $$c_{i}\geq0$$, then $$F(\cdot)=\sum_{i}c_{i} f_{i}(\cdot)$$ is also submodular. That is, a non-negative linear combination of submodular functions is a submodular function.
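Both facts can be checked by brute force on a small example. The sketch below enumerates every pair $$S\subseteq T$$ over a three-element ground set and tests the diminishing-returns inequality directly; the sets and weights are made up for illustration, and the helper names are hypothetical. A function that is not submodular, $$\mid S\mid^{2}$$, is included for contrast.

```python
from itertools import combinations


def coverage(sets, S):
    """Size of the union of the sets indexed by S."""
    return len(set().union(*(sets[k] for k in S)))


def is_submodular(f, ground):
    """Check f(S + u) - f(S) >= f(T + u) - f(T) for all S within T and all u."""
    subsets = [
        frozenset(c)
        for r in range(len(ground) + 1)
        for c in combinations(ground, r)
    ]
    for T in subsets:
        for S in subsets:
            if not S <= T:
                continue
            for u in ground:
                if f(S | {u}) - f(S) < f(T | {u}) - f(T):
                    return False
    return True


ground = ["a", "b", "c"]
sets_1 = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}}
sets_2 = {"a": {1}, "b": {1, 2}, "c": {2, 3, 4}}


def f1(S):
    return coverage(sets_1, S)


def f2(S):
    return coverage(sets_2, S)


def combined(S):
    # Non-negative linear combination, as in Fact 2.
    return 2 * f1(S) + 0.5 * f2(S)


print(is_submodular(f1, ground), is_submodular(combined, ground))
print(is_submodular(lambda S: len(S) ** 2, ground))
```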
Proof that $$f(\cdot)$$ is Submodular: we run many simulations on graph G (see sidenote 1). In the simulated world $$i$$, node $$v$$ has an activation set $$X^{i}_{v}$$, so $$f_{i}(S)=\mid\cup_{v\in S}X^{i}_{v}\mid$$ is the size of the cascade of $$S$$ in world $$i$$. By Fact 1, $$f_{i}(S)$$ is submodular. The expected influence set size $$f(S)=\frac{1}{\mid I\mid}\sum_{i\in I}f_{i}(S)$$ is then also submodular by Fact 2. QED.
Evaluation of $$f(S)$$ and Approximation Guarantee of Hill Climbing In Practice: how to evaluate $$f(S)$$ exactly is still an open question. In practice, estimating it by simulating a number of possible worlds gives a good enough evaluation [Kempe et al. 2003].
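A rough sketch of this estimate is shown below. It assumes a possible world is sampled by keeping each edge independently with its activation probability, so that $$f_{i}(S)$$ is the number of nodes reachable from $$S$$ in world $$i$$, and $$f(S)$$ is approximated by the average over the sampled worlds. The function names and the toy edge list are hypothetical.

```python
import random


def sample_world(edges, rng):
    """Keep each edge (u, v, p) independently with probability p."""
    world = {}
    for u, v, p in edges:
        if rng.random() < p:
            world.setdefault(u, []).append(v)
    return world


def reachable(world, seeds):
    """Nodes reachable from the seeds in one sampled world (seeds included)."""
    seen = set(seeds)
    stack = list(seeds)
    while stack:
        u = stack.pop()
        for v in world.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def estimate_influence(edges, seeds, num_worlds, rng):
    """Average cascade size of the seeds over num_worlds sampled worlds."""
    total = 0
    for _ in range(num_worlds):
        total += len(reachable(sample_world(edges, rng), seeds))
    return total / num_worlds


toy_edges = [("a", "b", 1.0), ("b", "c", 0.5), ("a", "d", 0.0)]
estimate = estimate_influence(toy_edges, {"a"}, num_worlds=2000, rng=random.Random(0))
print(estimate)
```

For this toy graph every world activates `a` and `b`, and `c` joins in about half of the worlds, so the estimate should land near 2.5.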
Time complexity of Hill Climbing
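To find the node $$u$$ that maximizes $$f(S_{i-1}\cup\{u\})$$ (see the algorithm above), we evaluate the influence of each of the $$n$$ candidate nodes; each evaluation runs $$R$$ simulations, and each simulation takes $$O(m)$$ time, where $$m$$ is the number of edges.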
We will do this $$k$$ (number of nodes to be selected) times. Therefore, the time complexity of Hill Climbing is $$O(k\cdot n \cdot m \cdot R)$$, which is slow. We can use sketches [Cohen et al. 2014] to speed up the evaluation of $$X_{u}$$ by reducing the evaluation time from $$O(m)$$ to $$O(1)$${% include sidenote.html id='note-evaluate-influence' note='Besides sketches, there are other proposed approaches for efficiently evaluating the influence function: approximation by hypergraphs [Borgs et al. 2012], approximating Riemann sum [Lucier et al. 2015], sparsification of influence networks [Mathioudakis et al. 2011], and heuristics, such as degree discount [Chen et al. 2009].'%}.
Single Reachability Sketches
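Intuition: each node is given a value drawn uniformly at random from $$[0,1]$$, and the rank of $$v$$ is the smallest value among the nodes that $$v$$ can reach in $$G^{i}$$. If $$v$$ can reach a large number of nodes, then its rank is likely to be small. Hence, the rank of node $$v$$ can be used to estimate the influence of node $$v$$ in $$G^{i}$$.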
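The sketch below illustrates this intuition under one assumption that should be read as ours rather than the source's: every node receives an independent uniform random value, and the rank of $$v$$ is the minimum value over the nodes reachable from $$v$$ (including $$v$$ itself) in one simulated world. The function names are hypothetical, and turning a rank into a numeric influence estimate is not shown.

```python
import random


def reachable_from(world, source):
    """Nodes reachable from source in one simulated world, source included."""
    seen = {source}
    stack = [source]
    while stack:
        u = stack.pop()
        for v in world.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def single_sketch_ranks(world, nodes, rng):
    """Rank of v = smallest random value among the nodes v can reach."""
    value = {v: rng.random() for v in nodes}
    return {v: min(value[w] for w in reachable_from(world, v)) for v in nodes}


# One simulated world given as an adjacency list of the edges that fired.
world = {"a": ["b", "c"], "b": ["c"]}
ranks = single_sketch_ranks(world, ["a", "b", "c"], random.Random(0))
print(ranks)
```

Here `a` reaches everything `b` reaches, and `b` reaches everything `c` reaches, so the ranks can only decrease as the reachable set grows.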
However, influence estimation based on Single Reachability Sketches (i.e. a single simulation of $$G$$) is inaccurate. To make a more accurate estimate, we need to build sketches based on many simulations{% include sidenote.html id='note-sketches' note='This is similar to taking an average of $$f_{i}(S)$$ in sidenote 1, but in this case, it is achieved by using Combined Reachability Sketches.' %}, which leads to the Combined Reachability Sketches.
Combined Reachability Sketches
In Combined Reachability Sketches, we simulate several possible worlds and, for each node $$u$$, keep the smallest $$c$$ values among the nodes that $$u$$ can reach across all the possible worlds.
Construct Combined Reachability Sketches:
Run Greedy for Influence Maximization:
Note: using Combined Reachability Sketches does not provide an approximation guarantee on the true expected influence but an approximation guarantee with respect to the possible worlds considered.