Friday, November 21, 2014

How does Google rank Web pages?

Google’s search algorithm combines precomputed PageRank scores with text-matching scores to obtain an overall ranking score for each webpage. The PageRank algorithm assigns a PageRank score to each webpage by modeling the behavior of an idealized random Web surfer [1, 2]. This Internet user randomly chooses a webpage to view from the listing of available webpages. Then, the surfer randomly selects a link from that webpage to another webpage.

To model the activity of the random Web surfer, the PageRank algorithm represents the link structure of the Web as a directed graph.


The process for determining PageRank begins by expressing the directed Web graph as the \(n \times n\) hyperlink matrix H, where n is the number of webpages. For a four-page example:



$$ H = \begin{bmatrix}
0.0 & 1.0  & 0.0  & 0.0 \\
0.0 & 0.0  & 1.0  & 0.0 \\
0.5 & 0.0  & 0.0  & 0.5 \\
0.0 & 0.0  & 0.0  & 0.0 \end{bmatrix} $$

If webpage i has \(l_i \geq 1\) links to other webpages and webpage i links to webpage j, then the element in row i and column j of H is \(H_{ij} = \frac{1}{l_i}\). Otherwise, \(H_{ij} = 0\).

A webpage with no links to other webpages is called a dangling node; its row of H is all zeros, so H is not row-stochastic. The fix is the matrix S = H + dw, where d is a column vector that identifies dangling nodes, meaning \(d_{j} = 1\) if \(l_{j} = 0\) and \(d_{j} = 0\) otherwise; and w = (\(w_{1}\) \(w_{2}\) . . . \(w_{n}\)) is a row vector with \(w_{j} \geq 0\) for all \(1 \leq j \leq n\) and \(\sum\limits_{j=1}^n w_{j} = 1\). There are several options for choosing w; the most popular is to set each probability to 1/n, i.e. the uniform row vector

$$ w = \begin{bmatrix} 0.25 & 0.25  & 0.25  & 0.25 \end{bmatrix} $$
$$ S = \begin{bmatrix}
0.0 & 1.0 & 0.0 & 0.0 \\
0.0 & 0.0 & 1.0 & 0.0 \\
0.5 & 0.0 & 0.0 & 0.5 \\
0.0 & 0.0 & 0.0 & 0.0 \end{bmatrix} +
\begin{bmatrix} 0.0 \\ 0.0 \\ 0.0 \\ 1.0 \end{bmatrix}
\begin{pmatrix} 0.25 & 0.25 & 0.25 & 0.25 \end{pmatrix}
$$
$$ S = \begin{bmatrix}
0.0 & 1.0  & 0.0  & 0.0 \\
0.0 & 0.0  & 1.0  & 0.0 \\
0.5 & 0.0  & 0.0  & 0.5 \\
0.25 & 0.25  & 0.25  & 0.25 \end{bmatrix}
$$
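As a quick sketch of this construction in Python/NumPy (the original calculations in this post were done in MATLAB; NumPy is an assumption here), the dangling-node fix for the four-page example looks like:

```python
import numpy as np

# Hyperlink matrix H of the four-page example: row i holds 1/l_i in
# column j when page i links to page j, and 0 otherwise.
H = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],  # page 4 has no outlinks: a dangling node
])
n = H.shape[0]

# d flags dangling rows; w is the uniform row vector (the popular choice).
d = (H.sum(axis=1) == 0).astype(float).reshape(n, 1)
w = np.full((1, n), 1.0 / n)

S = H + d @ w  # S = H + dw; every row of S now sums to 1
```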

To model the overall behaviour of a random Web surfer, Google forms the matrix G = \(\alpha\)S + (1 - \(\alpha\))1v, where 0 \(\leq\) \(\alpha\) \(<\) 1 is a scalar, 1 is the column vector of ones, and v is a row probability distribution vector called the personalization vector. The damping factor \(\alpha\) in the Google matrix means that, with probability 1 - \(\alpha\), the random Web surfer moves to a different webpage by some means other than selecting a link.

$$ G = \alpha S + (1 - \alpha) 1v $$
$$
G = 0.85 \begin{bmatrix}
0.0 & 1.0  & 0.0  & 0.0 \\
0.0 & 0.0  & 1.0  & 0.0 \\
0.5 & 0.0  & 0.0  & 0.5 \\
0.25 & 0.25  & 0.25  & 0.25 \end{bmatrix} +
(1 - 0.85)
\begin{bmatrix} 1.0 \\ 1.0 \\ 1.0 \\ 1.0  \end{bmatrix}
\begin{pmatrix} 0.25 & 0.25  & 0.25  & 0.25  \end{pmatrix}
$$
$$
G = \begin{bmatrix}
3/80 & 71/80 & 3/80 & 3/80 \\
3/80 & 3/80 & 71/80 & 3/80 \\
37/80 & 3/80 & 3/80 & 37/80 \\
0.25 & 0.25  & 0.25  & 0.25 \end{bmatrix}
$$
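In the same NumPy sketch (using the \(\alpha\) = 0.85 and uniform v from the example above), forming the Google matrix is one line:

```python
import numpy as np

alpha = 0.85  # damping factor used in the example above
S = np.array([
    [0.0,  1.0,  0.0,  0.0],
    [0.0,  0.0,  1.0,  0.0],
    [0.5,  0.0,  0.0,  0.5],
    [0.25, 0.25, 0.25, 0.25],
])
n = S.shape[0]
v = np.full((1, n), 1.0 / n)  # uniform personalization vector
ones = np.ones((n, 1))

G = alpha * S + (1 - alpha) * (ones @ v)  # G = alpha*S + (1 - alpha)*1v
```

Each zero entry of S becomes (1 - 0.85)/4 = 3/80, and e.g. a 1.0 entry of S becomes 0.85 + 3/80 = 71/80, matching the matrix above.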

To compute the PageRank score of each webpage, we must find the exact solution of the eigensystem \(\pi\)G = \(\pi\). Given a starting vector \(\pi^{(0)}\), e.g. \(\pi^{(0)}\) = v, the power method calculates successive iterates

$$ \pi^{(1)} = \pi^{(0)}G $$
$$ \pi^{(2)} = \pi^{(1)}G $$
$$ \vdots $$
$$ \pi^{(k + 1)} = \pi^{(k)}G,\ where\ k = 0, 1, 2, \ldots $$

until some convergence criterion is satisfied, i.e. until \(\pi^{(k+1)}\) is nearly equal to \(\pi^{(k)}\). The limiting vector is the PageRank vector.
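A minimal power-method loop in the same NumPy sketch (the tolerance and iteration cap are my own choices, not from the paper):

```python
import numpy as np

def pagerank_power(G, tol=1e-10, max_iter=1000):
    """Iterate pi <- pi G from the uniform starting vector until
    successive iterates differ by less than tol (1-norm)."""
    n = G.shape[0]
    pi = np.full(n, 1.0 / n)      # pi^(0) = v, the uniform vector
    for _ in range(max_iter):
        pi_next = pi @ G          # pi^(k+1) = pi^(k) G
        if np.abs(pi_next - pi).sum() < tol:
            return pi_next
        pi = pi_next
    return pi
```

Applied to the 4x4 Google matrix above, this converges to the vector \(\pi^{(29)}\) shown below.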

2. Explain how the eigenvalue problem is solved. You don’t need to rewrite what is in the paper, just put down what you understand.

As explained above, to compute the PageRank score of each webpage we must solve the eigensystem \(\pi\)G = \(\pi\). A stationary probability vector \(\pi\) is a vector that does not change under application of the transition matrix; that is, it is a left eigenvector of the matrix G associated with eigenvalue 1. We use the power method to calculate this eigenvector. Given a starting vector \(\pi^{(0)}\), e.g. \(\pi^{(0)}\) = v, the power method calculates successive iterates

$$ \pi^{(1)} = \pi^{(0)}G $$
$$ \pi^{(2)} = \pi^{(1)}G $$
$$ \vdots $$
$$ \pi^{(k + 1)} = \pi^{(k)}G,\ where\ k = 0, 1, 2, \ldots $$

until the convergence criterion is satisfied, i.e. until \(\pi^{(k+1)}\) is nearly equal to \(\pi^{(k)}\); the limiting vector is the PageRank vector. Since every entry of G is positive, G has a unique probability eigenvector with eigenvalue 1. We find that eigenvector and normalize it so that the sum of its entries is 1. Then we use this vector to rank the webpages: the page with the largest entry is ranked first, and so on.

$$\pi^{(0)} = \begin{bmatrix} 0.250 & 0.250 & 0.250 & 0.250 \end{bmatrix} $$
$$\pi^{(1)} = \begin{bmatrix} 0.196 & 0.303 & 0.303 & 0.196 \end{bmatrix} $$
$$\pi^{(2)} = \begin{bmatrix} 0.208 & 0.246 & 0.337 & 0.208 \end{bmatrix} $$
$$\pi^{(3)} = \begin{bmatrix} 0.225 & 0.258 & 0.291 & 0.225 \end{bmatrix} $$
$$\vdots$$
$$\pi^{(28)} = \begin{bmatrix} 0.213 & 0.264 & 0.307 & 0.213 \end{bmatrix} $$
$$\pi^{(29)} = \begin{bmatrix} 0.213 & 0.264 & 0.307 & 0.213 \end{bmatrix} $$

$$Ranking\ vector = \begin{bmatrix} 3 & 2 & 1 & 3 \end{bmatrix} $$

I did these calculations in MATLAB; after 29 iterations the vector has converged, and ranking the entries of \(\pi^{(29)}\) from largest to smallest gives the ordering of the pages: page 3 first, page 2 second, and pages 1 and 4 tied for third.
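To turn the converged scores into a ranking where tied pages (such as 1 and 4) share a rank, one can use a small helper like this (my own sketch, not from the paper):

```python
import numpy as np

def rank_pages(pi, tol=1e-6):
    """Dense ranking: the largest score gets rank 1, scores equal
    within tol share a rank, and the next distinct score gets the
    next integer rank."""
    order = np.argsort(-pi)              # page indices, best first
    ranks = np.empty(len(pi), dtype=int)
    prev_score, rank = None, 0
    for page in order:
        if prev_score is None or abs(pi[page] - prev_score) > tol:
            rank += 1                    # new, strictly smaller score
        ranks[page] = rank
        prev_score = pi[page]
    return ranks
```

For \(\pi^{(29)}\) = (0.213, 0.264, 0.307, 0.213) this returns (3, 2, 1, 3): page 3 first, page 2 second, pages 1 and 4 tied for third.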

3. Starting with a 5x5 Markov matrix, apply the algorithm described in Section 5 of the paper to find the eigenvector associated with eigenvalue 1.

Let's assume that \(\alpha\) = 0.85 and that we have a Web structure like this:



$$ H = \begin{bmatrix}
0.0 & 0.5  & 0.0  & 0.0  & 0.5 \\
0.0 & 0.0  & 1.0  & 0.0  & 0.0 \\
0.5 & 0.0  & 0.0  & 0.5  & 0.0 \\
0.25 & 0.25 & 0.25 & 0.0 & 0.25 \\
0.0 & 0.0 & 0.0 & 0.0 & 0.0 \end{bmatrix}
v = \begin{bmatrix} 0.20 \\ 0.20 \\ 0.20  \\ 0.20  \\ 0.20  \end{bmatrix}
$$

As you can see, there is a problem in this matrix: the last row has no links to other webpages, which means the graph contains a dangling node. To handle it, we set each of that page's transition probabilities to 1/n (the most popular choice). The resulting matrix is S = H + dw.

$$ S = \begin{bmatrix}
0.0 & 0.5  & 0.0  & 0.0  & 0.5 \\
0.0 & 0.0  & 1.0  & 0.0  & 0.0 \\
0.5 & 0.0  & 0.0  & 0.5  & 0.0 \\
0.25 & 0.25 & 0.25 & 0.0 & 0.25 \\
0.0 & 0.0  & 0.0  & 0.0  & 0.0 \end{bmatrix} +
\begin{bmatrix} 0.0 \\ 0.0 \\ 0.0 \\ 0.0 \\ 1.0  \end{bmatrix}
\begin{pmatrix} 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \end{pmatrix}
$$
$$ S = \begin{bmatrix}
0.0 & 0.5 & 0.0 & 0.0 & 0.5 \\
0.0 & 0.0 & 1.0 & 0.0 & 0.0 \\
0.5 & 0.0 & 0.0 & 0.5 & 0.0 \\
0.25 & 0.25 & 0.25 & 0.0 & 0.25 \\
0.20 & 0.20 & 0.20 & 0.20 & 0.20 \end{bmatrix}
v = \begin{bmatrix} 0.20 \\ 0.20 \\ 0.20 \\ 0.20 \\ 0.20 \end{bmatrix}
$$

Even just multiplying a huge vector by a huge matrix could take a while. But for the particular matrix G, we can compute \(\pi\)G much more easily than it might first appear. Note that

$$ G = \alpha S + (1 - \alpha) 1v $$$$ G = 0.85 \begin{bmatrix}
0.0 & 0.5 & 0.0 & 0.0 & 0.5 \\
0.0 & 0.0 & 1.0 & 0.0 & 0.0 \\
0.5 & 0.0 & 0.0 & 0.5 & 0.0 \\
0.25 & 0.25 & 0.25 & 0.0 & 0.25 \\
0.20 & 0.20 & 0.20 & 0.20 & 0.20 \end{bmatrix} +
(1 - 0.85)
\begin{bmatrix} 1.0 \\ 1.0 \\ 1.0 \\ 1.0 \\ 1.0 \end{bmatrix}
\begin{pmatrix} 0.2 & 0.2 & 0.2 & 0.2 & 0.2 \end{pmatrix}
$$
$$
 G = \begin{bmatrix}
0 & 0.425 & 0 & 0 & 0.425 \\
0 & 0 & 0.850 & 0 & 0 \\
0.425 & 0 & 0 & 0.425 & 0 \\
0.2125 & 0.2125 & 0.2125 & 0 & 0.2125 \\
0.170 & 0.170 & 0.170 & 0.170 & 0.170 \end{bmatrix} +
0.15
\begin{bmatrix}
0.20 & 0.20 & 0.20 & 0.20 & 0.20 \\
0.20 & 0.20 & 0.20 & 0.20 & 0.20 \\
0.20 & 0.20 & 0.20 & 0.20 & 0.20 \\
0.20 & 0.20 & 0.20 & 0.20 & 0.20 \\
0.20 & 0.20 & 0.20 & 0.20 & 0.20 \end{bmatrix}
$$
$$
 G = \begin{bmatrix}
0 & 0.425 & 0 & 0 & 0.425 \\
0 & 0 & 0.850 & 0 & 0 \\
0.425 & 0 & 0 & 0.425 & 0 \\
0.2125 & 0.2125 & 0.2125 & 0 & 0.2125 \\
0.170 & 0.170 & 0.170 & 0.170 & 0.170 \end{bmatrix} +
\begin{bmatrix}
0.03 & 0.03 & 0.03 & 0.03 & 0.03 \\
0.03 & 0.03 & 0.03 & 0.03 & 0.03 \\
0.03 & 0.03 & 0.03 & 0.03 & 0.03 \\
0.03 & 0.03 & 0.03 & 0.03 & 0.03 \\
0.03 & 0.03 & 0.03 & 0.03 & 0.03 \end{bmatrix}
$$
$$
G = \begin{bmatrix}
0.030 & 0.455 & 0.030 & 0.030 & 0.455 \\
0.030 & 0.030 & 0.880 & 0.030 & 0.030 \\
0.455 & 0.030 & 0.030 & 0.455 & 0.030 \\
0.2425 & 0.2425 & 0.2425 & 0.030 & 0.2425 \\
0.200 & 0.200 & 0.200 & 0.200 & 0.200 \end{bmatrix}
$$
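The remark above about multiplying by G cheaply can be made concrete: since S = H + dw and the entries of \(\pi\) sum to 1, one power-method step needs only the sparse matrix H, because \(\pi G = \alpha\pi H + \alpha(\pi d)w + (1-\alpha)v\). A minimal sketch (function and variable names are my own):

```python
import numpy as np

def step_sparse(pi, H, d, w, v, alpha=0.85):
    """One power-method step pi <- pi G computed without forming the
    dense G, using pi G = alpha*pi*H + alpha*(pi . d)*w + (1-alpha)*v.
    Valid because S = H + dw and the entries of pi sum to 1."""
    return alpha * (pi @ H) + alpha * np.dot(pi, d) * w + (1 - alpha) * v
```

In practice this is what makes PageRank feasible: H has only a handful of nonzeros per row, while G is completely dense.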

We use the power method to calculate the eigenvector. Given a starting vector \(\pi^{(0)}\), e.g. \(\pi^{(0)}\) = v, the power method calculates successive iterates

$$\pi^{(0)} = \begin{bmatrix} 0.200 & 0.200 & 0.200 & 0.200 & 0.200 \end{bmatrix} $$
$$\pi^{(1)} = \begin{bmatrix} 0.191 & 0.191 & 0.276 & 0.149 & 0.191 \end{bmatrix} $$
$$\pi^{(2)} = \begin{bmatrix} 0.211 & 0.175 & 0.257 & 0.180 & 0.175 \end{bmatrix} $$
$$\pi^{(3)} = \begin{bmatrix} 0.207 & 0.188 & 0.247 & 0.169 & 0.188 \end{bmatrix} $$
$$\vdots$$
$$\pi^{(10)} = \begin{bmatrix} 0.205 & 0.185 & 0.254 & 0.169 & 0.185 \end{bmatrix} $$
$$\pi^{(11)} = \begin{bmatrix} 0.205 & 0.184 & 0.254 & 0.169 & 0.184 \end{bmatrix} $$
$$Ranking\ vector = \begin{bmatrix} 2 & 3 & 1 & 4 & 3 \end{bmatrix} $$

All these calculations were done in MATLAB.
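The whole 5x5 computation can also be reproduced end-to-end in a short Python/NumPy script (a sketch standing in for the MATLAB run; the tolerance is my own choice):

```python
import numpy as np

alpha = 0.85
H = np.array([
    [0.0,  0.5,  0.0,  0.0,  0.5],
    [0.0,  0.0,  1.0,  0.0,  0.0],
    [0.5,  0.0,  0.0,  0.5,  0.0],
    [0.25, 0.25, 0.25, 0.0,  0.25],
    [0.0,  0.0,  0.0,  0.0,  0.0],   # dangling node
])
n = H.shape[0]

# Patch the dangling node, then form the Google matrix.
d = (H.sum(axis=1) == 0).astype(float)
w = v = np.full(n, 1.0 / n)
S = H + np.outer(d, w)
G = alpha * S + (1 - alpha) * np.outer(np.ones(n), v)

# Power method from the uniform vector.
pi = np.full(n, 1.0 / n)
for _ in range(1000):
    pi_next = pi @ G
    done = np.abs(pi_next - pi).sum() < 1e-10
    pi = pi_next
    if done:
        break
```

This converges to \(\pi \approx (0.206, 0.185, 0.255, 0.170, 0.185)\), agreeing with \(\pi^{(11)}\) above to three truncated decimals and giving the same ordering: page 3 first, page 1 second, pages 2 and 5 tied for third, page 4 last.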



REFERENCES

[1] Google Toolbar PageRank, www.abcseo.com/seo-book/toolbar-google.htm

[2] Google Corp. Information: Google Milestones, www.google.com/corporate/history.html
