MACM 316 – Computing Assignment 6

Due Date: Friday March 20 at 11:00pm.

Submission Instructions: You must upload one .pdf file in Crowdmark that consists of two pages:

page 1 is your report which should fit all discussions, data and figures into a single page; and page 2 is a

listing of your code. The deadline is 11:00pm on the due date. The actual due time is set to 11:05pm

and if Crowdmark indicates that you submitted late, you will be assigned a grade of 0 on this assignment.

Your TA has emailed you a Crowdmark link that you should save since it will allow you to upload your

completed assignments.

? Please review the Guidelines for Assignments carefully.

? Acknowledge any collaborations or assistance from colleagues/TAs/instructor.

? If you have any questions about Matlab or aspects of this assignment, then you are strongly encouraged

to attend tutorials and drop-in workshops.

Background on Google PageRank (summarized from lectures)

When you do a Google search, the resulting web pages are listed in a ranked order of importance or

significance. The method used to compute this ranking is called the PageRank algorithm, and is the

basis for Google’s financial success. At its core, this algorithm takes the n × n connectivity matrix G for

all web pages on the internet (currently numbered at n > 60 billion pages) and computes an eigenvector

x that contains the ranks of each page. In this week’s computing assignment, you will explore the process

of computing the PageRank vector for a few simple web graphs.

Ranking Web Pages

In lectures you saw a brief description of how the PageRank is computed, which is repeated here. Imagine

surfing the web from page to page by randomly choosing an outgoing link from your current page to move

to the next. You may occasionally get stuck at a dead end (with no outgoing links) or within a cycle of

connected pages. In these cases you might simply “bail out” by picking some other page at random, and

then continue from there. Roughly, a page is considered to have high rank if other pages with high rank

link to it. Let’s expand on what we mean by rank.

The n × n web connectivity matrix G has entries defined as

gij =(1, if there is a link to page i from page j0, otherwise

G is enormous but extremely sparse. Suppose that page j points to a total of cj other pages, then it’s

clear that cj =Pn

i=1 gij (the number of outlinks) is just the j

th column sum of G. Similarly, the row sum

ri =Pn

j=1 gij gives the number of inlinks to page i. A small example with n = 6 pages is shown below.

Next, define the rank of a page j to be xj and propose a ranking system in which page j contributes

an equal share of its rank, xj

cj, to each of the pages it points to. We denote for any page i the set of pages

that link to it by Bi. Then the rank of this page is just

xi =Xj∈Bixjcj

which can be written in matrix form as x = GDx where D is the n × n diagonal matrix with entries

djj =1cj. The matrix A = GD with entries aij =

gij

cj

is shown for the small-web example above. The

equation x = Ax has a familiar form, in which our PageRank vector x is an eigenvector of A corresponding

to the eigenvalue 1!

This is the simplest situation only, and we still need to explain how to deal with the two problem

cases mentioned earlier:

Dead-end nodes: In the example on the right, there is no escaping page 3,

which corresponds to a zero column (no outlinks, c3 = 0). A reasonable

solution is to simply jump to a random page, which means replacing all

entries in the column by 1

n. Then the matrix A becomes

aij =(gijcj, if cj 6= 01n, if cj = 0

Cyclic paths: For the example on the right, one can get trapped in the cycle

2 → 3 → 4 → 5 → 2 → . . . . To permit escape, we turn our systematic

web traversal into a random walk in which we occasionally skip the usual

page selection and instead jump to some arbitrary page in the web. To

be more precise, we follow links as usual with some probability p (say

p = 0.85), but then with probability 1 ? p we jump to another randomly

selected page (p is sometimes called a “teleport probability”). This leads

to a modified formula for A:

In summary, finding the PageRank vector x reduces to solving the eigenvector problem x = Ax where

A can be written compactly in matrix form as

and e = (1, 1, . . . , 1)T being the n-vector containing all ones.

Computing the PageRank Vector with Matlab

A simple approach for solving eigenvalue problems is called the power method and takes the form of

a fixed point iteration x

(k) = Ax(k?1) for k = 1, 2, . . . , given some initial guess x

(0). With no prior

knowledge about the ranks, we choose x

(0) =1ne which assumes that all pages are equally ranked. And

that’s it!1 The algorithm described above is fairly easy to implement in Matlab, using the code below

(posted on Canvas as mypagerank.m):

% Estimate the PageRank for a small web connectivity

% matrix using the power method.

p = 0.85; % "teleport" probability

% Set up the connectivity matrix using index

% vectors that define the from->to connections

% between pages numbers 1 to n.

n = 6; % number of pages

ii = [2 4 3 6 1 4 6 2 5 2 6 3]; % "to" index

jj = [1 1 2 2 3 3 3 4 4 5 5 6]; % "from" index

G = sparse(ii, jj, 1, n, n);

c = full(sum(G)); % column sums

e = ones(n,1);

k = find(c~=0); % indices of zero columns

D = sparse(k, k, 1./c(k), n, n);

z = ((1-p)*(c~=0) + (c==0)) / n;

A = p*G*D + e*z; % PageRank matrix

x = e/n; % initial guess

xold = zeros(n,1);

niter = 0;

while norm(x-xold) > 0.0001,

niter = niter + 1;

xold = x;

x = A*x;

end

x, niter

bar(x), shg

This code returns the PageRank vector x = (0.1038, 0.1693, 0.2781 ??????

, 0.1479, 0.0879, 0.2131)T

, from which

it’s easy to see that page 3 receives the top ranking.

1This is a simplified version of the power method that applies to the very special case where A has positive entries,

kAk1 = 1, kx

(0)k1 = 1 and the eigenvalue is 1. Under these conditions, we’re guaranteed to have a unique positive real

solution x, and the power method converges to this solution, albeit sometimes slowly. For more general eigenvalue problems,

the power method requires requires a few extra modifications.

Your Computing Assignment – Exploring PageRank

1. Run the mypagerank code on the given 6-page example for various values of the teleport parameter

p lying between 0 and 1. How do changes in p impact the top-ranked web page, and the speed of

convergence of the power method iterations? What happens to the rankings as p → 0, and can you

explain why this might be so?

2. Modify the code for the larger set of n = 18 web pages and links pictured below:

(a) Compute the PageRank for each of the 18 pages using p = 0.85, and report the results of

the ranking as a bar plot (you may find Matlab’s sort command useful for picking out the

highest-ranked page).

(b) Repeat with p = 0.5. Which page ranks increase, and which decrease?

(c) Repeat with p = 0.98, and notice that several pages have a rank that is close to zero. Is there

anything about the link structure of this web graph that might explain why this happens? If

you were able to add a single link between any two web pages that would most increase the

reputation of the lowest-ranked pages, which would it be? Explain your choice.

版权所有：编程辅导网 2018 All Rights Reserved 联系方式：QQ:99515681 电子信箱：99515681@qq.com

免责声明：本站部分内容从网络整理而来，只供参考！如有版权问题可联系本站删除。