CS224W Course Notes
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

154 lines
11 KiB

  1. ---
  2. layout: post
  3. title: Influence Maximization
  4. ---
  5. ## Motivation
  6. Identification of influential nodes in a network has important practical uses. A good example is "viral marketing", a strategy that uses existing social networks to spread and promote a product. A well-engineered viral marking compaign will identify the most influential customers, convince them to adopt and endorse the product, and then spread the product in the social network like a virus.
  7. The key question is how to find the most influential set of nodes? To answer this question, we will first look at two classical cascade models:
  8. - Linear Threshold Model
  9. - Independent Cascade Model
  10. Then, we will develop a method to find the most influential node set in the Independent Cascade Model.
  11. ## Linear Threshold Model
  12. In the Linear Threshold Model, we have the following setup:
  13. - A node $$v$$ has a random threshold $$\theta_{v} \sim U[0,1]$$
  14. - A node $$v$$ influenced by each neighbor $$w$$ according to a weight $$b_{v,w}$$, such that
  15. $$
  16. \sum_{w\text{ neighbor of v }} b_{v,w}\leq 1
  17. $$
  18. - A node $$v$$ becomes active when at least $$\theta_{v}$$ fraction of its neighbors are active. That is
  19. $$
  20. \sum_{w\text{ active neighbor of v }} b_{v,w}\geq\theta_{v}
  21. $$
  22. The following figure demonstrates the process:
  23. ![linear_threshold_model_demo](../assets/img/influence_maximization_linear_threshold_model_demo.png?style=centerme)
  24. *(A) node V is activated and influences W and U by 0.5 and 0.2, respectively; (B) W becomes activated and influences X and U by 0.5 and 0.3, respectively; (C) U becomes activated and influences X and Y by 0.1 and 0.2, respectively; (D) X becomes activated and influences Y by 0.2; no more nodes can be activated; process stops.*
  25. ## Independent Cascade Model
  26. In this model, we model the influences (activation) of nodes based on probabilities in a directed graph:
  27. - Given a directed finite graph $$G=(V, E)$$
  28. - Given a node set $$S$$ starts with a new behavior (e.g. adopted new product and we say they are active)
  29. - Each edge $$(v, w)$$ has a probability $$p_{vw}$$
  30. - If node $$v$$ becomes active, it gets one chance to make $$w$$ active with probability $$p_{vw}$$
  31. - Activation spread through the network
  32. Note:
  33. - Each edge fires only once
  34. - If $$u$$ and $$v$$ are both active and link to $$w$$, it does not matter which tries to activate $$w$$ first
  35. ## Influential Maximization (of the Independent Cascade Model)
  36. ### Definitions
  37. - **Most influential Set of size $$k$$** ($$k$$ is a user-defined parameter) is a set $$S$$ containing $$k$$ nodes that if activated, produces the largest expected{% include sidenote.html id='note-most-influential-set' note='Why "expected cascade size"? Due to the stochastic nature of the Independent Cascade Model, node activation is a random process, and therefore, $$f(S)$$ is a random variable. In practice, we would like to compute many random simulations and then obtain the expected value $$f(S)=\frac{1}{\mid I\mid}\sum_{i\in I}f_{i}(S)$$, where $$I$$ is a set of simulations.' %} cascade size $$f(S)$$.
  38. - **Influence set $$X_{u}$$ of node $$u$$** is the set of nodes that will be eventually activated by node $$u$$. An example is shown below.
  39. ![influence_set](../assets/img/influence_maximization_influence_set.png?style=centerme)
  40. *Red-colored nodes a and b are active. The two green areas enclose the nodes activated by a and b respectively, i.e. $$X_{a}$$ and $$X_{b}$$.*
  41. Note:
  42. - It is clear that $$f(S)$$ is the size of the union of $$X_{u}$$: $$f(S)=\mid\cup_{u\in S}X_{u}\mid$$.
  43. - Set $$S$$ is more influential, if $$f(S)$$ is larger
  44. ### Problem Setup
  45. The influential maximization problem is then an optimization problem:
  46. $$
  47. \max_{S \text{ of size }k}f(S)
  48. $$
  49. This problem is NP-hard [[Kempe et al. 2003]](https://www.cs.cornell.edu/home/kleinber/kdd03-inf.pdf). However, there is a greedy approximation algorithm--**Hill Climbing** that gives a solution $$S$$ with the following approximation guarantee:
  50. $$
  51. f(S)\geq(1-\frac{1}{e})f(OPT)
  52. $$
  53. where $$OPT$$ is the globally optimal solution.
  54. ### Hill Climbing
  55. **Algorithm:** at each step $$i$$, activate and pick the node $$u$$ that has the largest marginal gain $$\max_{u}f(S_{i-1}\cup\{u\})$$:
  56. - Start with $$S_{0}=\{\}$$
  57. - For $$i=1...k$$
  58. - Activate node $$u\in V\setminus S_{i-1}$$ that $$\max_{u}f(S_{i-1}\cup\{u\})$$
  59. - Let $$S_{i}=S_{i-1}\cup\{u\}$$
  60. **Claim:** Hill Climbing produces a solution that has the approximation guarantee $$f(S)\geq(1-\frac{1}{e})f(OPT)$$.
  61. ### Proof of the Approximation Guarantee of Hill Climbing
  62. **Definition of Monotone:** if $$f(\emptyset)=0$$ and $$f(S)\leq f(T)$$ for all $$S\subseteq T$$, then $$f(\cdot)$$ is monotone.
  63. **Definition of Submodular:** if $$f(S\cup \{u\})-f(S)\geq f(T\cup\{u\})-f(T)$$ for any node $$u$$ and any $$S\subseteq T$$, then $$f(\cdot)$$ is submodular.
  64. **Theorem [Nemhauser et al. 1978]:**{% include sidenote.html id='note-nemhauser-theorem' note='also see this [handout](http://web.stanford.edu/class/cs224w/handouts/CS224W_Influence_Maximization_Handout.pdf)' %} if $$f(\cdot)$$ is **monotone** and **submodular**, then the $$S$$ obtained by greedily adding $$k$$ elements that maximize marginal gains satisfies
  65. $$
  66. f(S)\geq(1-\frac{1}{e})f(OPT)
  67. $$
  68. Given this theorem, we need to prove that the largest expected cascade size function $$f(\cdot)$$ is monotone and submodular.
  69. **It is clear that the function $$f(\cdot)$$ is monotone based on the definition of $$f(\cdot)$${% include sidenote.html id='note-monotone' note='If no nodes are active, then the influence is 0. That is $$f(\emptyset)=0$$. Because activating more nodes will never hurt the influence, $$f(U)\leq f(V)$$ if $$U\subseteq V$$.' %}, and we only need to prove $$f(\cdot)$$ is submodular.**
  70. **Fact 1 of Submodular Functions:** $$f(S)=\mid \cup_{k\in S}X_{k}\mid$$ is submodular, where $$X_{k}$$ is a set. Intuitively, the more sets you already have, the less new "area", a newly added set $$X_{k}$$ will provide.
  71. **Fact 2 of Submodular Functions:** if $$f_{i}(\cdot)$$ are submodular and $$c_{i}\geq0$$, then $$F(\cdot)=\sum_{i}c_{i} f_{i}(\cdot)$$ is also submodular. That is a non-negative linear combination of submodular functions is a submodular function.
  72. **Proof that $$f(\cdot)$$ is Submodular**: we run many simulations on graph G (see sidenote 1). For the simulated world $$i$$, the node $$v$$ has an activation set $$X^{i}_{v}$$, then $$f_{i}(S)=\mid\cup_{v\in S}X^{i}_{v}\mid$$ is the size of the cascades of $$S$$ for world $$i$$. Based on Fact 1, $$f_{i}(S)$$ is submodular. The expected influence set size $$f(S)=\frac{1}{\mid I\mid}\sum_{i\in I}f_{i}(S)$$ is also submodular, due to Fact 2. QED.
  73. **Evaluation of $$f(S)$$ and Approximation Guarantee of Hill Climbing In Practice:** how to evaluate $$f(S)$$ is still an open question. The estimation achieved by simulating a number of possible worlds is a good enough evaluation [[Kempe et al. 2003]](https://www.cs.cornell.edu/home/kleinber/kdd03-inf.pdf):
  74. - Estimate $$f(S)$$ by repeatedly simulating $$\Omega(n^{\frac{1}{\epsilon}})$$ possible worlds, where $$n$$ is the number of nodes and $$\epsilon$$ is a small positive real number
  75. - It achieves $$(1\pm \epsilon)$$-approximation to $$f(S)$$
  76. - Hill Climbing is now a $$(1-\frac{1}{e}-\epsilon)$$-approximation
  77. ### Speed-up Hill Climbing by Sketch-Based Algorithms
  78. **Time complexity of Hill Climbing**
  79. To find the node $$u$$ that $$\max_{u}f(S_{i-1}\cup\{u\})$$ (see the algorithm above):
  80. - we need to evaluate the $$X_{u}$$ (the influence set) of each of the remaining nodes which has the size of $$O(n)$$ ($$n$$ is the number of nodes in $$G$$)
  81. - for each evaluation, it takes $$O(m)$$ time to flip coins for all the edges involved ($$m$$ is the number of edges in $$G$$)
  82. - we also need $$R$$ simulations to estimate the influence set ($$R$$ is the number of simulations/possible worlds)
  83. We will do this $$k$$ (number of nodes to be selected) times. Therefore, the time complexity of Hill Climbing is $$O(k\cdot n \cdot m \cdot R)$$, which is slow. We can use **sketches** [[Cohen et al. 2014]](https://www.microsoft.com/en-us/research/wp-content/uploads/2014/08/skim_TR.pdf) to speed up the evaluation of $$X_{u}$$ by reducing the evaluation time from $$O(m)$$ to $$O(1)$${% include sidenote.html id='note-evaluate-influence' note='Besides sketches, there are other proposed approaches for efficiently evaluating the influence function: approximation by hypergraphs [[Borgs et al. 2012]](https://arxiv.org/pdf/1212.0884.pdf), approximating Riemann sum [[Lucier et al. 2015]](https://people.seas.harvard.edu/~yaron/papers/localApproxInf.pdf), sparsification of influence networks [[Mathioudakis et al. 2011]](https://chato.cl/papers/mathioudakis_bonchi_castillo_gionis_ukkonen_2011_sparsification_influence_networks.pdf), and heuristics, such as degree discount [[Chen et al. 2009]](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/weic-kdd09_influence.pdf).'%}.
  84. **Single Reachability Sketches**
  85. - Take a possible world $$G^{i}$$ (i.e. one simulation of the graph $$G$$ using the Independent Cascade Model)
  86. - Give each node a uniform random number $$\in [0,1]$$
  87. - Compute the **rank** of each node $$v$$, which is the **minimum** number among the nodes that $$v$$ can reach in this world.
  88. *Intuition: if $$v$$ can reach a large number of nodes, then its rank is likely to be small. Hence, the rank of node $$v$$ can be used to estimate the influence of node $$v$$ in $$G^{i}$$.*
  89. However, influence estimation based on Single Reachability Sketches (i.e. single simulation of $$G$$ ) is inaccurate. To make a more accurate estimate, we need to build sketches based on many simulations{% include sidenote.html id='note-sketches' note='This is similar to take an average of $$f_{i}(S)$$ in sidenote 1, but in this case, it is achieved by using Combined Reachability Sketches.' %}, which leads to the Combined Reachability Sketches.
  90. **Combined Reachability Sketches**
  91. In Combined Reachability Sketches, we simulate several possible worlds and keep the smallest $$c$$ values among the nodes that $$u$$ can reach in all the possible worlds.
  92. - Construct Combined Reachability Sketches:
  93. - Generate a number of possible worlds
  94. - For node $$u$$, assign uniformly distributed random numbers $$r^{i}_{v}\in[0,1]$$ to all $$(v, i)$$ pairs, where $$v$$ is the node in $$u$$'s reachable nodes set in the world $$i$$.
  95. - Take the $$c$$ smallest $$r^{i}_{v}$$ as the Combined Reachability Sketches
  96. - Run Greedy for Influence Maximization:
  97. - Whenever the greedy algorithm asks for the node with the largest influence, pick node $$u$$ that has the smallest value in its sketch.
  98. - After $$u$$ is chosen, find its influence set $$X^{i}_{u}$$, mark the $$(v, i)$$ as infected and remove their $$r^{i}_{v}$$ from the sketches of other nodes.
  99. Note: using Combined Reachability Sketches does not provide an approximation guarantee on the true expected influence but an approximation guarantee with respect to the possible worlds considered.