Friday, November 21, 2014

Google’s search algorithm combined precomputed PageRank scores with text-matching scores to obtain an overall ranking score for each webpage. The PageRank algorithm assigns a PageRank score to each webpage by modeling the behavior of an idealized random Web surfer [1, 2]. This Internet user starts at a randomly chosen webpage and then repeatedly selects, at random, a link from the current webpage to another webpage.

To model the activity of the random Web surfer, the PageRank algorithm represents the link structure of the Web as a directed graph.

The process for determining PageRank begins by expressing the directed Web graph as the $n \times n$ hyperlink matrix H, where n is the number of webpages. For example, for a four-page Web:

$$H = \begin{bmatrix} 0.0 & 1.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 1.0 & 0.0 \\ 0.5 & 0.0 & 0.0 & 0.5 \\ 0.0 & 0.0 & 0.0 & 0.0 \end{bmatrix}$$

If webpage i has $l_i$ $\geq$ 1 links to other webpages and webpage i links to webpage j, then the element in row i and column j of H is $H_{ij}$ = $\frac{1}{l_i}$. Otherwise, $H_{ij}$ = 0.
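As a sketch (not Google's code), H for the four-page example above can be built from a link list; the zero-indexed page numbering here is an illustrative assumption:

```python
import numpy as np

# Hypothetical four-page Web matching the example H above (0-indexed):
# page 0 links to 1; page 1 links to 2; page 2 links to 0 and 3;
# page 3 has no outlinks (a dangling node).
links = {0: [1], 1: [2], 2: [0, 3], 3: []}
n = len(links)

H = np.zeros((n, n))
for i, outlinks in links.items():
    if outlinks:                        # H_ij = 1/l_i; dangling rows stay zero
        H[i, outlinks] = 1.0 / len(outlinks)

print(H)
```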

A webpage with no links to other webpages is called a dangling node (in the example, webpage 4). A dangling node makes the corresponding row of H all zeros, so H is not row-stochastic. One way to fix this is to assume that from a dangling node the surfer jumps to any webpage with probability 1/n. The resulting matrix is S = H + dw, where d is a column vector that identifies dangling nodes, meaning $d_{i}$ = 1 if $l_{i}$ = 0 and $d_{i}$ = 0 otherwise; and w = ($w_{1}$ $w_{2}$ . . . $w_{n}$) is a row vector with $w_{j}$ $\geq$ 0 for all 1 $\leq$ j $\leq$ n and $\sum\limits_{j=1}^n w_{j} = 1$. The most popular choice for w is the uniform row vector:

$$w = \begin{bmatrix} 0.25 & 0.25 & 0.25 & 0.25 \end{bmatrix}$$
$$S = \begin{bmatrix} 0.0 & 1.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 1.0 & 0.0 \\ 0.5 & 0.0 & 0.0 & 0.5 \\ 0.0 & 0.0 & 0.0 & 0.0 \end{bmatrix} + \begin{bmatrix} 0.0 \\ 0.0 \\ 0.0 \\ 1.0 \end{bmatrix} \begin{pmatrix} 0.25 & 0.25 & 0.25 & 0.25 \end{pmatrix}$$
$$S = \begin{bmatrix} 0.0 & 1.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 1.0 & 0.0 \\ 0.5 & 0.0 & 0.0 & 0.5 \\ 0.25 & 0.25 & 0.25 & 0.25 \end{bmatrix}$$
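The dangling-node fix can be sketched in a few lines of Python (NumPy assumed); d is detected from the zero rows of H:

```python
import numpy as np

# The 4x4 hyperlink matrix from the example; the last row is a dangling node.
H = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.5, 0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0, 0.0]])
n = H.shape[0]

d = (H.sum(axis=1) == 0).astype(float)  # d_i = 1 exactly for dangling rows
w = np.full(n, 1.0 / n)                 # uniform choice for w
S = H + np.outer(d, w)                  # S = H + dw

print(S)
```

Every row of S now sums to 1, so S is a proper transition (row-stochastic) matrix.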

To model the overall behaviour of a random Web surfer, Google forms the matrix G = $\alpha$S + (1 - $\alpha$)1v, where 0 $\leq$ $\alpha$ $<$ 1 is a scalar called the damping factor, 1 is the column vector of ones, and v is a row probability distribution vector called the personalization vector. The damping factor models a surfer who, with probability 1 - $\alpha$, moves to a webpage chosen according to v rather than by following a link.

$$G = \alpha S + (1 - \alpha) 1v$$
$$G = 0.85 \begin{bmatrix} 0.0 & 1.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 1.0 & 0.0 \\ 0.5 & 0.0 & 0.0 & 0.5 \\ 0.25 & 0.25 & 0.25 & 0.25 \end{bmatrix} + (1 - 0.85) \begin{bmatrix} 1.0 \\ 1.0 \\ 1.0 \\ 1.0 \end{bmatrix} \begin{pmatrix} 0.25 & 0.25 & 0.25 & 0.25 \end{pmatrix}$$
$$G = \begin{bmatrix} 3/80 & 71/80 & 3/80 & 3/80 \\ 3/80 & 3/80 & 71/80 & 3/80 \\ 37/80 & 3/80 & 3/80 & 37/80 \\ 1/4 & 1/4 & 1/4 & 1/4 \end{bmatrix}$$
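A quick numerical check of G (a sketch; NumPy assumed):

```python
import numpy as np

# Form G = alpha*S + (1 - alpha)*1v for the four-page example.
alpha = 0.85
S = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.5, 0.0, 0.0, 0.5],
              [0.25, 0.25, 0.25, 0.25]])
n = S.shape[0]
v = np.full(n, 1.0 / n)               # uniform personalization vector

G = alpha * S + (1 - alpha) * np.outer(np.ones(n), v)
print(G)
```

Each zero entry of S becomes $(1-\alpha)/n = 0.15/4 = 3/80$, and each row of G still sums to 1.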

To compute the PageRank score of each webpage, we must solve the eigensystem $\pi$G = $\pi$. Given a starting vector $\pi^{(0)}$, e.g. $\pi^{(0)}$ = v, the power method calculates successive iterates

$$\pi^{(1)} = \pi^{(0)}G$$$$\pi^{(2)} = \pi^{(1)}G$$$$\vdots$$$$\pi^{(k + 1)} = \pi^{(k)}G, \quad k = 0, 1, 2, \ldots$$

until some convergence criterion is satisfied, i.e. until the sequence settles down and $\pi^{(k+1)}$ is nearly equal to $\pi^{(k)}$. The resulting vector is the PageRank vector.
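A minimal power-method sketch (NumPy assumed; the tolerance and iteration cap are illustrative choices, not values from the paper):

```python
import numpy as np

def pagerank(G, tol=1e-10, max_iter=1000):
    """Iterate pi <- pi*G from the uniform vector until it settles down."""
    n = G.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        pi_next = pi @ G
        if np.abs(pi_next - pi).sum() < tol:   # convergence criterion
            break
        pi = pi_next
    return pi_next

# Google matrix of the four-page example (uniform v folds into a scalar shift).
alpha, n = 0.85, 4
S = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.5, 0.0, 0.0, 0.5],
              [0.25, 0.25, 0.25, 0.25]])
G = alpha * S + (1 - alpha) / n
pi = pagerank(G)
print(pi)
```

For the four-page example this converges to approximately (0.214, 0.265, 0.308, 0.214).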

\noindent\textbf{2. Explain how the eigenvalue problem is solved. You don’t need to rewrite what is in the paper, just put down what you understand.}

As explained above, to compute the PageRank score of each webpage we must solve the eigensystem $\pi$G = $\pi$. A stationary probability vector $\pi$ is defined as a vector that does not change under application of the transition matrix; that is, it is a left eigenvector of the matrix G associated with eigenvalue 1. We use the power method to calculate this eigenvector. Given a starting vector $\pi^{(0)}$, e.g. $\pi^{(0)}$ = v, the power method calculates successive iterates

$$\pi^{(1)} = \pi^{(0)}G$$$$\pi^{(2)} = \pi^{(1)}G$$$$\vdots$$$$\pi^{(k + 1)} = \pi^{(k)}G, \quad k = 0, 1, 2, \ldots$$

until the convergence criterion is satisfied, i.e. until $\pi^{(k+1)}$ is nearly equal to $\pi^{(k)}$; the result is the PageRank vector. Because every entry of G is positive, the Perron-Frobenius theorem guarantees that G has a unique stationary vector with eigenvalue 1. We find that eigenvector and normalize it so that the sum of its entries is 1. Then we use this vector to rank the webpages: the page with the largest entry is ranked first, and so on.

$$\pi^{(0)} = \begin{bmatrix} 0.250 & 0.250 & 0.250 & 0.250 \end{bmatrix}$$$$\pi^{(1)} = \begin{bmatrix} 0.196 & 0.303 & 0.303 & 0.196 \end{bmatrix}$$$$\pi^{(2)} = \begin{bmatrix} 0.208 & 0.246 & 0.337 & 0.208 \end{bmatrix}$$$$\pi^{(3)} = \begin{bmatrix} 0.225 & 0.258 & 0.291 & 0.225 \end{bmatrix}$$$$\vdots$$$$\pi^{(28)} = \begin{bmatrix} 0.213 & 0.264 & 0.307 & 0.213 \end{bmatrix}$$$$\pi^{(29)} = \begin{bmatrix} 0.213 & 0.264 & 0.307 & 0.213 \end{bmatrix}$$

$$\text{Ranking} = \begin{bmatrix} 3 & 2 & 1 & 3 \end{bmatrix}$$

I made these calculations in MATLAB; the iteration converges after 29 iterations. Ranking the entries of $\pi^{(29)}$ from largest to smallest gives the PageRank ordering: page 3 first, page 2 second, and pages 1 and 4 tied for third.

\noindent\textbf{3. Starting with a 5x5 Markov matrix, apply the algorithm described in Section 5 of the paper to find the eigenvector associated with eigenvalue 1.}

Let's assume that $\alpha$ = 0.85 and that we have a Web structure like this:

$$H = \begin{bmatrix} 0.0 & 0.5 & 0.0 & 0.0 & 0.5 \\ 0.0 & 0.0 & 1.0 & 0.0 & 0.0 \\ 0.5 & 0.0 & 0.0 & 0.5 & 0.0 \\ 0.25 & 0.25 & 0.25 & 0.0 & 0.25 \\ 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \end{bmatrix} v = \begin{bmatrix} 0.20 \\ 0.20 \\ 0.20 \\ 0.20 \\ 0.20 \end{bmatrix}$$

As you can see, there is a problem with this matrix: the last row has no links to other webpages, which means the graph contains a dangling node. To handle it, we set the jump probability to each webpage to 1/n (the most popular choice). The resulting matrix is S = H + dw.

$$S = \begin{bmatrix} 0.0 & 0.5 & 0.0 & 0.0 & 0.5 \\ 0.0 & 0.0 & 1.0 & 0.0 & 0.0 \\ 0.5 & 0.0 & 0.0 & 0.5 & 0.0 \\ 0.25 & 0.25 & 0.25 & 0.0 & 0.25 \\ 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \end{bmatrix} + \begin{bmatrix} 0.0 \\ 0.0 \\ 0.0 \\ 0.0 \\ 1.0 \end{bmatrix} \begin{pmatrix} 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \end{pmatrix}$$
$$S = \begin{bmatrix} 0.0 & 0.5 & 0.0 & 0.0 & 0.5 \\ 0.0 & 0.0 & 1.0 & 0.0 & 0.0 \\ 0.5 & 0.0 & 0.0 & 0.5 & 0.0 \\ 0.25 & 0.25 & 0.25 & 0.0 & 0.25 \\ 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \end{bmatrix} v = \begin{bmatrix} 0.20 \\ 0.20 \\ 0.20 \\ 0.20 \\ 0.20 \end{bmatrix}$$

Even just multiplying a huge vector by a huge matrix could take a while. But for the particular matrix G, we can compute $\pi$G much more easily than it might first appear. Note that

$$G = \alpha S + (1 - \alpha) 1v$$$$G = 0.85 \begin{bmatrix} 0.0 & 0.5 & 0.0 & 0.0 & 0.5 \\ 0.0 & 0.0 & 1.0 & 0.0 & 0.0 \\ 0.5 & 0.0 & 0.0 & 0.5 & 0.0 \\ 0.25 & 0.25 & 0.25 & 0.0 & 0.25 \\ 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \end{bmatrix} + (1 - 0.85) \begin{bmatrix} 1.0 \\ 1.0 \\ 1.0 \\ 1.0 \\ 1.0 \end{bmatrix} \begin{pmatrix} 0.2 & 0.2 & 0.2 & 0.2 & 0.2 \end{pmatrix}$$
$$G = \begin{bmatrix} 0 & 0.425 & 0 & 0 & 0.425 \\ 0 & 0 & 0.850 & 0 & 0 \\ 0.425 & 0 & 0 & 0.425 & 0 \\ 0.212 & 0.212 & 0.212 & 0 & 0.212 \\ 0.170 & 0.170 & 0.170 & 0.170 & 0.170 \end{bmatrix} + 0.15 \begin{bmatrix} 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \\ 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \\ 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \\ 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \\ 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \end{bmatrix}$$
$$G = \begin{bmatrix} 0 & 0.425 & 0 & 0 & 0.425 \\ 0 & 0 & 0.850 & 0 & 0 \\ 0.425 & 0 & 0 & 0.425 & 0 \\ 0.212 & 0.212 & 0.212 & 0 & 0.212 \\ 0.170 & 0.170 & 0.170 & 0.170 & 0.170 \end{bmatrix} + \begin{bmatrix} 0.03 & 0.03 & 0.03 & 0.03 & 0.03 \\ 0.03 & 0.03 & 0.03 & 0.03 & 0.03 \\ 0.03 & 0.03 & 0.03 & 0.03 & 0.03 \\ 0.03 & 0.03 & 0.03 & 0.03 & 0.03 \\ 0.03 & 0.03 & 0.03 & 0.03 & 0.03 \end{bmatrix}$$
$$G = \begin{bmatrix} 0.030 & 0.455 & 0.030 & 0.030 & 0.455 \\ 0.030 & 0.030 & 0.880 & 0.030 & 0.030 \\ 0.455 & 0.030 & 0.030 & 0.455 & 0.030 \\ 0.242 & 0.242 & 0.242 & 0.030 & 0.242 \\ 0.200 & 0.200 & 0.200 & 0.200 & 0.200 \end{bmatrix}$$
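The remark above that multiplying by G is cheap can be checked numerically: since the entries of $\pi$ sum to 1, $\pi(1v) = v$, so $\pi G = \alpha(\pi S) + (1-\alpha)v$ and G never has to be formed explicitly (a sketch; NumPy assumed):

```python
import numpy as np

alpha = 0.85
S = np.array([[0.0, 0.5, 0.0, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.5, 0.0],
              [0.25, 0.25, 0.25, 0.0, 0.25],
              [0.2, 0.2, 0.2, 0.2, 0.2]])
n = S.shape[0]
v = np.full(n, 1.0 / n)
pi = v.copy()                                   # pi(0) = v

# Full dense product versus the structured shortcut:
G = alpha * S + (1 - alpha) * np.outer(np.ones(n), v)
dense = pi @ G
shortcut = alpha * (pi @ S) + (1 - alpha) * v   # uses pi.sum() == 1
print(dense, shortcut)
```

In practice S is also never formed densely; $\pi S = \pi H + (\pi \cdot d)w$ keeps the sparsity of H, which is what makes the method feasible at Web scale.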

We use the power method to calculate the eigenvector. Given a starting vector $\pi^{(0)}$, e.g. $\pi^{(0)}$ = v, the power method calculates successive iterates

$$\pi^{(0)} = \begin{bmatrix} 0.200 & 0.200 & 0.200 & 0.200 & 0.200 \end{bmatrix}$$$$\pi^{(1)} = \begin{bmatrix} 0.191 & 0.191 & 0.276 & 0.149 & 0.191 \end{bmatrix}$$$$\pi^{(2)} = \begin{bmatrix} 0.211 & 0.175 & 0.257 & 0.180 & 0.175 \end{bmatrix}$$$$\pi^{(3)} = \begin{bmatrix} 0.207 & 0.188 & 0.247 & 0.169 & 0.188 \end{bmatrix}$$$$\vdots$$$$\pi^{(10)} = \begin{bmatrix} 0.205 & 0.185 & 0.254 & 0.169 & 0.185 \end{bmatrix}$$$$\pi^{(11)} = \begin{bmatrix} 0.205 & 0.184 & 0.254 & 0.169 & 0.184 \end{bmatrix}$$$$\text{Ranking} = \begin{bmatrix} 2 & 3 & 1 & 4 & 3 \end{bmatrix}$$

All these calculations were made in MATLAB.
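The MATLAB computation can be reproduced with a short Python sketch (NumPy assumed; not the original script):

```python
import numpy as np

alpha = 0.85
S = np.array([[0.0, 0.5, 0.0, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0, 0.5, 0.0],
              [0.25, 0.25, 0.25, 0.0, 0.25],
              [0.2, 0.2, 0.2, 0.2, 0.2]])
n = S.shape[0]
G = alpha * S + (1 - alpha) / n      # uniform v folds into a scalar shift

pi = np.full(n, 1.0 / n)             # pi(0) = v
for _ in range(100):                 # more than enough iterations to converge
    pi = pi @ G                      # power iteration

# Rank pages by descending PageRank score (1 = highest).
order = np.argsort(-pi)
print(np.round(pi, 3), order + 1)
```

The iterates match the values above: page 3 ranks first, page 1 second, pages 2 and 5 tie for third, and page 4 is last.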

REFERENCES