MACM 316 – Computing Assignment 6
Due Date: Friday March 20 at 11:00pm.
Submission Instructions: You must upload one .pdf file in Crowdmark that consists of two pages:
page 1 is your report which should fit all discussions, data and figures into a single page; and page 2 is a
listing of your code. The deadline is 11:00pm on the due date. The actual due time is set to 11:05pm
and if Crowdmark indicates that you submitted late, you will be assigned a grade of 0 on this assignment.
Your TA has emailed you a Crowdmark link that you should save since it will allow you to upload your
completed assignments.
❼ Please review the Guidelines for Assignments carefully.
❼ Acknowledge any collaborations or assistance from colleagues/TAs/instructor.
❼ If you have any questions about Matlab or aspects of this assignment, then you are strongly encouraged
to attend tutorials and drop-in workshops.
Background on Google PageRank (summarized from lectures)
When you do a Google search, the resulting web pages are listed in a ranked order of importance or
significance. The method used to compute this ranking is called the PageRank algorithm, and is the
basis for Google’s financial success. At its core, this algorithm takes the n × n connectivity matrix G for
all web pages on the internet (currently numbered at n > 60 billion pages) and computes an eigenvector
x that contains the ranks of each page. In this week’s computing assignment, you will explore the process
of computing the PageRank vector for a few simple web graphs.
Ranking Web Pages
In lectures you saw a brief description of how the PageRank is computed, which is repeated here. Imagine
surfing the web from page to page by randomly choosing an outgoing link from your current page to move
to the next. You may occasionally get stuck at a dead end (with no outgoing links) or within a cycle of
connected pages. In these cases you might simply “bail out” by picking some other page at random, and
then continue from there. Roughly, a page is considered to have high rank if other pages with high rank
link to it. Let’s expand on what we mean by rank.
The n × n web connectivity matrix G has entries defined as
gij =(1, if there is a link to page i from page j0, otherwise
G is enormous but extremely sparse. Suppose that page j points to a total of cj other pages, then it’s
clear that cj =Pn
i=1 gij (the number of outlinks) is just the j
th column sum of G. Similarly, the row sum
ri =Pn
j=1 gij gives the number of inlinks to page i. A small example with n = 6 pages is shown below.
Next, define the rank of a page j to be xj and propose a ranking system in which page j contributes
an equal share of its rank, xj
cj, to each of the pages it points to. We denote for any page i the set of pages
that link to it by Bi. Then the rank of this page is just
xi =Xj∈Bixjcj
which can be written in matrix form as x = GDx where D is the n × n diagonal matrix with entries
djj =1cj. The matrix A = GD with entries aij =
gij
cj
is shown for the small-web example above. The
equation x = Ax has a familiar form, in which our PageRank vector x is an eigenvector of A corresponding
to the eigenvalue 1!
This is the simplest situation only, and we still need to explain how to deal with the two problem
cases mentioned earlier:
Dead-end nodes: In the example on the right, there is no escaping page 3,
which corresponds to a zero column (no outlinks, c3 = 0). A reasonable
solution is to simply jump to a random page, which means replacing all
entries in the column by 1
n. Then the matrix A becomes
aij =(gijcj, if cj 6= 01n, if cj = 0
Cyclic paths: For the example on the right, one can get trapped in the cycle
2 → 3 → 4 → 5 → 2 → . . . . To permit escape, we turn our systematic
web traversal into a random walk in which we occasionally skip the usual
page selection and instead jump to some arbitrary page in the web. To
be more precise, we follow links as usual with some probability p (say
p = 0.85), but then with probability 1 − p we jump to another randomly
selected page (p is sometimes called a “teleport probability”). This leads
to a modified formula for A:
In summary, finding the PageRank vector x reduces to solving the eigenvector problem x = Ax where
A can be written compactly in matrix form as
and e = (1, 1, . . . , 1)T being the n-vector containing all ones.
Computing the PageRank Vector with Matlab
A simple approach for solving eigenvalue problems is called the power method and takes the form of
a fixed point iteration x
(k) = Ax(k−1) for k = 1, 2, . . . , given some initial guess x
(0). With no prior
knowledge about the ranks, we choose x
(0) =1ne which assumes that all pages are equally ranked. And
that’s it!1 The algorithm described above is fairly easy to implement in Matlab, using the code below
(posted on Canvas as mypagerank.m):
% Estimate the PageRank for a small web connectivity
% matrix using the power method.
p = 0.85; % "teleport" probability
% Set up the connectivity matrix using index
% vectors that define the from->to connections
% between pages numbers 1 to n.
n = 6; % number of pages
ii = [2 4 3 6 1 4 6 2 5 2 6 3]; % "to" index
jj = [1 1 2 2 3 3 3 4 4 5 5 6]; % "from" index
G = sparse(ii, jj, 1, n, n);
c = full(sum(G)); % column sums
e = ones(n,1);
k = find(c~=0); % indices of zero columns
D = sparse(k, k, 1./c(k), n, n);
z = ((1-p)*(c~=0) + (c==0)) / n;
A = p*G*D + e*z; % PageRank matrix
x = e/n; % initial guess
xold = zeros(n,1);
niter = 0;
while norm(x-xold) > 0.0001,
niter = niter + 1;
xold = x;
x = A*x;
end
x, niter
bar(x), shg
This code returns the PageRank vector x = (0.1038, 0.1693, 0.2781 ✿✿✿✿✿✿
, 0.1479, 0.0879, 0.2131)T
, from which
it’s easy to see that page 3 receives the top ranking.
1This is a simplified version of the power method that applies to the very special case where A has positive entries,
kAk1 = 1, kx
(0)k1 = 1 and the eigenvalue is 1. Under these conditions, we’re guaranteed to have a unique positive real
solution x, and the power method converges to this solution, albeit sometimes slowly. For more general eigenvalue problems,
the power method requires requires a few extra modifications.
Your Computing Assignment – Exploring PageRank
1. Run the mypagerank code on the given 6-page example for various values of the teleport parameter
p lying between 0 and 1. How do changes in p impact the top-ranked web page, and the speed of
convergence of the power method iterations? What happens to the rankings as p → 0, and can you
explain why this might be so?
2. Modify the code for the larger set of n = 18 web pages and links pictured below:
(a) Compute the PageRank for each of the 18 pages using p = 0.85, and report the results of
the ranking as a bar plot (you may find Matlab’s sort command useful for picking out the
highest-ranked page).
(b) Repeat with p = 0.5. Which page ranks increase, and which decrease?
(c) Repeat with p = 0.98, and notice that several pages have a rank that is close to zero. Is there
anything about the link structure of this web graph that might explain why this happens? If
you were able to add a single link between any two web pages that would most increase the
reputation of the lowest-ranked pages, which would it be? Explain your choice.
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。