To model the activity of a random Web surfer, the PageRank algorithm represents the link structure of the Web as a directed graph:
H = \begin{bmatrix} 0.0 & 1.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 1.0 & 0.0 \\ 0.5 & 0.0 & 0.0 & 0.5 \\ 0.0 & 0.0 & 0.0 & 0.0 \end{bmatrix}
If webpage i has l_i \geq 1 links to other webpages and webpage i links to webpage j, then the element in row i and column j of H is H_{ij} = \frac{1}{l_i}. Otherwise, H_{ij} = 0.
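As a minimal MATLAB sketch (my own, not from the paper), H can be built from a 0/1 adjacency matrix A, where A(i,j) = 1 if webpage i links to webpage j:

A = [0 1 0 0; 0 0 1 0; 1 0 0 1; 0 0 0 0];  % the example graph above
l = sum(A, 2);                    % l(i) = number of outlinks of page i
H = zeros(size(A));
for i = 1:size(A, 1)
    if l(i) >= 1
        H(i, :) = A(i, :) / l(i); % H(i,j) = 1/l_i on each outlink
    end                           % rows with no outlinks stay all zero
end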
If a webpage has no links to other webpages, it is called a dangling node. To fix this, the matrix is adjusted to S = H + dw, where d is a column vector that identifies dangling nodes, meaning d_{j} = 1 if l_{j} = 0 and d_{j} = 0 otherwise; and w = (w_{1}, w_{2}, \ldots, w_{n}) is a row vector with w_{j} \geq 0 for all 1 \leq j \leq n and \sum\limits_{j=1}^n w_{j} = 1. There are several options for w; the most popular choice is the uniform row vector, which sends the surfer to every webpage with probability 1/n:
w = \begin{bmatrix} 0.25 & 0.25 & 0.25 & 0.25 \end{bmatrix}
S = \begin{bmatrix} 0.0 & 1.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 1.0 & 0.0 \\ 0.5 & 0.0 & 0.0 & 0.5 \\ 0.0 & 0.0 & 0.0 & 0.0 \end{bmatrix} + \begin{bmatrix} 0.0 \\ 0.0 \\ 0.0 \\ 1.0 \end{bmatrix} \begin{pmatrix} 0.25 & 0.25 & 0.25 & 0.25 \end{pmatrix}
S = \begin{bmatrix} 0.0 & 1.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 1.0 & 0.0 \\ 0.5 & 0.0 & 0.0 & 0.5 \\ 0.25 & 0.25 & 0.25 & 0.25 \end{bmatrix}
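Continuing the MATLAB sketch, the dangling-node fix is a rank-one update (here with the uniform choice of w from the text):

n = size(H, 1);
d = double(l == 0);      % column vector flagging dangling nodes
w = ones(1, n) / n;      % uniform row vector, w_j = 1/n
S = H + d * w;           % each dangling row is replaced by w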
To model the overall behaviour of a random Web surfer, Google forms the matrix G = \alpha S + (1 - \alpha)1v, where 0 \leq \alpha < 1 is a scalar, 1 is the column vector of ones, and v is a row probability distribution vector called the personalization vector. The damping factor \alpha means that with probability 1 - \alpha the random Web surfer moves to a different webpage by some means other than selecting a link, e.g. by typing a URL directly.
G = \alpha S + (1 - \alpha) 1v
G = 0.85 \begin{bmatrix} 0.0 & 1.0 & 0.0 & 0.0 \\ 0.0 & 0.0 & 1.0 & 0.0 \\ 0.5 & 0.0 & 0.0 & 0.5 \\ 0.25 & 0.25 & 0.25 & 0.25 \end{bmatrix} + (1 - 0.85) \begin{bmatrix} 1.0 \\ 1.0 \\ 1.0 \\ 1.0 \end{bmatrix} \begin{pmatrix} 0.25 & 0.25 & 0.25 & 0.25 \end{pmatrix}
G = \begin{bmatrix} 3/80 & 71/80 & 3/80 & 3/80 \\ 3/80 & 3/80 & 71/80 & 3/80 \\ 37/80 & 3/80 & 3/80 & 37/80 \\ 1/4 & 1/4 & 1/4 & 1/4 \end{bmatrix}
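In MATLAB the same construction is one line (a sketch, again assuming the uniform personalization vector):

alpha = 0.85;
v = ones(1, n) / n;                            % uniform personalization vector
G = alpha * S + (1 - alpha) * ones(n, 1) * v;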
To compute the PageRank score of each webpage, we must solve the eigensystem \pi G = \pi, where \pi is a row vector. Given a starting vector \pi^{(0)}, e.g. \pi^{(0)} = v, the power method calculates the successive iterates
\pi^{(1)} = \pi^{(0)}G
\pi^{(2)} = \pi^{(1)}G
\vdots
\pi^{(k + 1)} = \pi^{(k)}G,\quad k = 0, 1, 2, \ldots
until some convergence criterion is satisfied, i.e. until \pi^{(k+1)} is nearly equal to \pi^{(k)}. The limiting vector is the PageRank vector.
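A minimal MATLAB sketch of this iteration (the tolerance 1e-8 and the iteration cap are my own choices, not from the paper):

pik = v;                            % starting vector pi^(0) = v
for k = 1:1000                      % iteration cap (assumed)
    piNext = pik * G;               % pi^(k+1) = pi^(k) G
    if norm(piNext - pik, 1) < 1e-8 % convergence test (assumed)
        break
    end
    pik = piNext;
end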
\noindent\textbf{2. Explain how the eigenvalue problem is solved. You don’t need to rewrite what is in the paper, just put down what you understand.}
As explained above, to compute the PageRank score of each webpage we must solve the eigensystem \pi G = \pi. A stationary probability vector \pi is a vector that does not change under application of the transition matrix; that is, it is a left eigenvector of the matrix G associated with the eigenvalue 1. We use the power method to calculate this eigenvector. Given a starting vector \pi^{(0)}, e.g. \pi^{(0)} = v, the power method calculates the successive iterates
\pi^{(1)} = \pi^{(0)}G
\pi^{(2)} = \pi^{(1)}G
\vdots
\pi^{(k + 1)} = \pi^{(k)}G,\quad k = 0, 1, 2, \ldots
until the convergence criterion is satisfied, i.e. until \pi^{(k+1)} is nearly equal to \pi^{(k)}; the result is the PageRank vector. Because G is a positive stochastic matrix, it has a unique such eigenvector (this is the Perron-Frobenius theorem). We find that eigenvector and normalize it so that the sum of its entries is 1. Then we use this vector to rank the webpages: the page with the largest entry is ranked first, and so on.
\pi^{(0)} = \begin{bmatrix} 0.250 & 0.250 & 0.250 & 0.250 \end{bmatrix}
\pi^{(1)} = \begin{bmatrix} 0.196 & 0.303 & 0.303 & 0.196 \end{bmatrix}
\pi^{(2)} = \begin{bmatrix} 0.208 & 0.246 & 0.337 & 0.208 \end{bmatrix}
\pi^{(3)} = \begin{bmatrix} 0.225 & 0.258 & 0.291 & 0.225 \end{bmatrix}
\vdots
\pi^{(28)} = \begin{bmatrix} 0.213 & 0.264 & 0.307 & 0.213 \end{bmatrix}
\pi^{(29)} = \begin{bmatrix} 0.213 & 0.264 & 0.307 & 0.213 \end{bmatrix}
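As a cross-check (my own, not part of the paper's algorithm), MATLAB can also find this left eigenvector directly, since the left eigenvectors of G are the right eigenvectors of G':

[V, D] = eig(G');
[~, idx] = max(real(diag(D)));  % locate the eigenvalue 1
p = real(V(:, idx))';           % the corresponding left eigenvector of G
p = p / sum(p)                  % scale so the entries sum to 1

Up to rounding, this reproduces \pi^{(29)} above.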
Ranking\ vector = \begin{bmatrix} 3 & 2 & 1 & 3 \end{bmatrix}
I did these calculations in MATLAB; after 29 iterations the iterates agree to three decimal places, so we can read the PageRank ranking off \pi^{(29)}: page 3 is ranked first, page 2 second, and pages 1 and 4 tie for third.
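For completeness, here is one way (my own sketch) to turn the scores into ranks in MATLAB, with tied scores sharing a rank:

scores = [0.213 0.264 0.307 0.213];  % entries of pi^(29)
[~, ~, ranks] = unique(-scores);     % dense ranking, largest score first
ranks(:)'                            % prints 3 2 1 3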
\noindent\textbf{3. Starting with a 5x5 Markov matrix, apply the algorithm described in Section 5 of the paper to find the eigenvector associated with eigenvalue 1.}
Let us assume \alpha = 0.85 and a web structure like this:
H = \begin{bmatrix} 0.0 & 0.5 & 0.0 & 0.0 & 0.5 \\ 0.0 & 0.0 & 1.0 & 0.0 & 0.0 \\ 0.5 & 0.0 & 0.0 & 0.5 & 0.0 \\ 0.25 & 0.25 & 0.25 & 0.0 & 0.25 \\ 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \end{bmatrix} \qquad v = \begin{pmatrix} 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \end{pmatrix}
As you can see, there is a problem with this matrix: the last row has no links to other webpages, so the graph contains a dangling node. To handle it, we replace that row with the uniform distribution, assigning probability 1/n to every webpage (the most popular choice). The resulting matrix is S = H + dw.
S = \begin{bmatrix} 0.0 & 0.5 & 0.0 & 0.0 & 0.5 \\ 0.0 & 0.0 & 1.0 & 0.0 & 0.0 \\ 0.5 & 0.0 & 0.0 & 0.5 & 0.0 \\ 0.25 & 0.25 & 0.25 & 0.0 & 0.25 \\ 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \end{bmatrix} + \begin{bmatrix} 0.0 \\ 0.0 \\ 0.0 \\ 0.0 \\ 1.0 \end{bmatrix} \begin{pmatrix} 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \end{pmatrix}
S = \begin{bmatrix} 0.0 & 0.5 & 0.0 & 0.0 & 0.5 \\ 0.0 & 0.0 & 1.0 & 0.0 & 0.0 \\ 0.5 & 0.0 & 0.0 & 0.5 & 0.0 \\ 0.25 & 0.25 & 0.25 & 0.0 & 0.25 \\ 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \end{bmatrix} \qquad v = \begin{pmatrix} 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \end{pmatrix}
Even just multiplying a huge vector by a huge matrix could take a while, but for the particular matrix G we can compute \pi G much more easily than it might first appear. Note that
G = \alpha S + (1 - \alpha) 1v
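An aside (my own derivation, not spelled out in the paper's example): since \pi^{(k)} sums to 1, \pi 1 = 1, so \pi G = \alpha \pi S + (1 - \alpha)v = \alpha \pi H + \alpha(\pi d)w + (1 - \alpha)v. Each power-method step therefore only needs the sparse matrix H, never the dense G; in MATLAB, with H, d, w, v as above:

piNext = alpha * (pik * H) + alpha * (pik * d) * w + (1 - alpha) * v;

For our small 5x5 example, though, we can simply form G in full: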
G = 0.85 \begin{bmatrix} 0.0 & 0.5 & 0.0 & 0.0 & 0.5 \\ 0.0 & 0.0 & 1.0 & 0.0 & 0.0 \\ 0.5 & 0.0 & 0.0 & 0.5 & 0.0 \\ 0.25 & 0.25 & 0.25 & 0.0 & 0.25 \\ 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \end{bmatrix} + (1 - 0.85) \begin{bmatrix} 1.0 \\ 1.0 \\ 1.0 \\ 1.0 \\ 1.0 \end{bmatrix} \begin{pmatrix} 0.2 & 0.2 & 0.2 & 0.2 & 0.2 \end{pmatrix}
G = \begin{bmatrix} 0 & 0.425 & 0 & 0 & 0.425 \\ 0 & 0 & 0.850 & 0 & 0 \\ 0.425 & 0 & 0 & 0.425 & 0 \\ 0.2125 & 0.2125 & 0.2125 & 0 & 0.2125 \\ 0.170 & 0.170 & 0.170 & 0.170 & 0.170 \end{bmatrix} + 0.15 \begin{bmatrix} 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \\ 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \\ 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \\ 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \\ 0.20 & 0.20 & 0.20 & 0.20 & 0.20 \end{bmatrix}
G = \begin{bmatrix} 0 & 0.425 & 0 & 0 & 0.425 \\ 0 & 0 & 0.850 & 0 & 0 \\ 0.425 & 0 & 0 & 0.425 & 0 \\ 0.2125 & 0.2125 & 0.2125 & 0 & 0.2125 \\ 0.170 & 0.170 & 0.170 & 0.170 & 0.170 \end{bmatrix} + \begin{bmatrix} 0.03 & 0.03 & 0.03 & 0.03 & 0.03 \\ 0.03 & 0.03 & 0.03 & 0.03 & 0.03 \\ 0.03 & 0.03 & 0.03 & 0.03 & 0.03 \\ 0.03 & 0.03 & 0.03 & 0.03 & 0.03 \\ 0.03 & 0.03 & 0.03 & 0.03 & 0.03 \end{bmatrix}
G = \begin{bmatrix} 0.030 & 0.455 & 0.030 & 0.030 & 0.455 \\ 0.030 & 0.030 & 0.880 & 0.030 & 0.030 \\ 0.455 & 0.030 & 0.030 & 0.455 & 0.030 \\ 0.2425 & 0.2425 & 0.2425 & 0.030 & 0.2425 \\ 0.200 & 0.200 & 0.200 & 0.200 & 0.200 \end{bmatrix}
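A quick sanity check (my own) that G really is a transition matrix, i.e. that every row sums to 1; here S and v are the 5x5 versions defined above:

G = alpha * S + (1 - alpha) * ones(5, 1) * v;  % the exact, unrounded G
max(abs(sum(G, 2) - 1))                        % ~ machine precision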
We use the power method to calculate the eigenvector. Given a starting vector \pi^{(0)}, e.g. \pi^{(0)} = v, the power method calculates successive iterates
\pi^{(0)} = \begin{bmatrix} 0.200 & 0.200 & 0.200 & 0.200 & 0.200 \end{bmatrix}
\pi^{(1)} = \begin{bmatrix} 0.191 & 0.191 & 0.276 & 0.149 & 0.191 \end{bmatrix}
\pi^{(2)} = \begin{bmatrix} 0.211 & 0.175 & 0.257 & 0.180 & 0.175 \end{bmatrix}
\pi^{(3)} = \begin{bmatrix} 0.207 & 0.188 & 0.247 & 0.169 & 0.188 \end{bmatrix}
\vdots
\pi^{(10)} = \begin{bmatrix} 0.205 & 0.185 & 0.254 & 0.169 & 0.185 \end{bmatrix}
\pi^{(11)} = \begin{bmatrix} 0.205 & 0.184 & 0.254 & 0.169 & 0.184 \end{bmatrix}
Ranking\ vector = \begin{bmatrix} 2 & 3 & 1 & 4 & 3 \end{bmatrix}
All of these calculations were done in MATLAB; after 11 iterations the iterates agree to three decimal places, so page 3 is ranked first, page 1 second, pages 2 and 5 tie for third, and page 4 is last.