
CS 3640: Introduction to Networks and Their Applications [Fall 2018]

Assignment 4 | Web Scraping: Record and Replay

Instructor: Rishab Nithyanand | Office hours: Wednesday 9-10 am or by appointment

Teaching assistant: Md. Kowsar Hossain | Office hours: Monday 1:30-2:30 pm

Released on: October 25th | Due on: November 9th (11:59:59 pm)

Maximum score: 100 | Value towards final grade: 13%

Groups

Group ID | Group Hawk IDs
1  | ['jblue', 'kzhang24', 'zluo1', 'ywang391', 'susmerano']
2  | ['xchen117', 'jstoltz', 'jpflint', 'godkin']
3  | ['mcagley', 'kdzhou', 'lye1', 'okueter', 'yitzhou']
4  | ['msmith3', 'zzhang103', 'yonghfan', 'tnlowry']
5  | ['mfmrphy', 'jmagri', 'trjns', 'jpthiede', 'uupadhyay']
6  | ['dstutz', 'cweiske', 'hrunning', 'nicgoh']
7  | ['awestemeier', 'nsonalkar', 'bzhang22', 'tsimonson']
8  | ['xiaosong', 'jdhatch', 'tgoodmn', 'apatrck']
9  | ['atran4', 'ymann', 'bchoskins', 'hpen']
10 | ['apizzimenti', 'jglowacki', 'xxing2', 'yzheng19']
11 | ['gongyzhou', 'ywang455', 'shangwchen', 'ppeterschmidt']
12 | ['sklemm', 'weigui', 'lburden', 'gmich']

Learning goals

This assignment is intended to familiarize you with the HTTP protocol. HTTP is (arguably) the most important application-level protocol on the Internet today: the Web runs on HTTP, and increasingly other applications use HTTP as well (including BitTorrent, streaming video, Facebook and Twitter social APIs, etc.). You will also get very familiar with webdrivers and the HAR format.


Download your VM

VM link:
https://drive.google.com/file/d/1rwdZkCJS8fVLwpNLEYgAUYJUrwvBQ6jx/view?usp=sharing

- Extract the tar.gz file that you just downloaded.
- Download VirtualBox (it's free and open source) from here: www.virtualbox.org.
- Open VirtualBox and create a new machine. See instructions below!
- The system will boot up Ubuntu 18.04. Your username and password on this system are both "cs3640".

Virtual machine setup

- Click "New".
- Type in the name of your new virtual machine. Select "Linux" as Type and "Ubuntu (64-bit)" as Version. Click Next.
- Allocate at least 2 GB (2048 MB) of RAM to your VM.
- When you're asked about creating a disk, click the "Use an existing virtual hard disk file" option. The disk you should select is "cs3640-assignment-4.vmdk", located in the folder you just extracted.
- That's it! Now, every time you need to boot up the VM, just open VirtualBox and select your VM.


Task I: Crawling the Web (30 points)

As part of your first task, you will learn how to programmatically scrape the Web by instrumenting a Web browser such as Chrome or Firefox to automatically load webpages for you and record content from these webpages.

Specifically, you will do the following:

- [15 points] You will write a program that will read an input file containing a list of URLs and open a Web browser to visit each of the URLs in sequence.
  - You might consider using the Selenium Webdriver Python API to do this. I've used it for years and it's always been (in my opinion) the best webdriver out there.
    - https://selenium-python.readthedocs.io/getting-started.html
  - Create your own input text file named "url_list" which contains a list of URLs that the browser must visit in order, one URL per line.
- [15 points] Your program should also record all HTTP requests issued by the browser and all HTTP responses from the web servers. These requests and responses should be saved in the HTTP Archive (HAR) format.
  - All recorded HAR files should be saved in a "har-data" folder in the same working directory. This folder will contain one folder for each URL crawled.
    - All HAR files generated while crawling a URL should be placed in that URL's folder. For example, you will have a ./har-data/<url> folder which will contain the HAR files recorded while visiting <url>.
  - There are many different ways to generate a HAR file. The easiest is to use a tool which sits between the network and your browser and records this information for you (a minimal sketch of this approach follows this list). Another way is to use browser extensions such as Firebug + NetExport or HAR Export Trigger to do this for you (this is more complicated).
    - https://browsermob-proxy-py.readthedocs.io
    - https://selenium-python.readthedocs.io/faq.html
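Below is a minimal sketch of the proxy-based approach, assuming Python 3, Selenium with Firefox/geckodriver, and a local BrowserMob Proxy installation. The binary path, HAR file name, and per-URL folder naming are illustrative assumptions, not requirements of the assignment.

    # web-scraper.py -- a minimal sketch, assuming Firefox + geckodriver and a
    # local BrowserMob Proxy install; paths and option values are illustrative.
    import json
    import os
    import sys
    from urllib.parse import urlparse

    from browsermobproxy import Server          # pip install browsermob-proxy
    from selenium import webdriver              # pip install selenium


    def crawl(url_list_path):
        # Start the proxy that sits between Firefox and the network and
        # records every request/response pair into a HAR structure.
        server = Server("/opt/browsermob-proxy/bin/browsermob-proxy")  # assumed install path
        server.start()
        proxy = server.create_proxy()

        # Point Firefox at the proxy so all traffic is observable.
        profile = webdriver.FirefoxProfile()
        profile.set_proxy(proxy.selenium_proxy())
        driver = webdriver.Firefox(firefox_profile=profile)

        with open(url_list_path) as f:
            urls = [line.strip() for line in f if line.strip()]

        for url in urls:
            # One HAR capture per URL, stored under ./har-data/<url>/.
            proxy.new_har(url, options={"captureHeaders": True, "captureContent": True})
            driver.get(url if url.startswith("http") else "http://" + url)

            # Folder naming here uses the host-name; adjust to your exact <url> convention.
            out_dir = os.path.join("har-data", urlparse(driver.current_url).hostname or url)
            os.makedirs(out_dir, exist_ok=True)
            with open(os.path.join(out_dir, "capture.har"), "w") as out:
                json.dump(proxy.har, out)

        driver.quit()
        server.stop()


    if __name__ == "__main__":
        crawl(sys.argv[1])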


Task I submission instructions:

- You will submit a single file named "task-1.zip". This is the zipped version of a folder named "task-1".
  - This folder will contain a file named "web-scraper.py". This file should take as input the path to the file containing the list of URLs.
  - This folder will contain a file "requirements.txt" which will contain a list of python packages that need to be installed to make your code work.
  - This folder will contain a file "install.sh" which will contain bash code to automatically install any system tools/packages required to make your code work.
  - You will not submit your "har-data" folder.
- Here is how we will evaluate this task:
  - We will run "sudo pip install -r requirements.txt".
  - We will then run "sudo chmod +x install.sh; ./install.sh".
  - We will then run "python web-scraper.py url_list", where "url_list" will be supplied by us -- expect it to at least contain www.google.com.
  - We will then check if the "har-data" folder created by your program contains the HAR files expected to be generated while crawling each of the URLs on our list.


Task II: Parsing HAR files (30 points)

As part of your second task, you will learn how to parse files in the HAR format. This will also get you very familiar with the HTTP protocol and the request/response message types.

Specifically, you will do the following:

- [15 points] You will write a program to parse HAR files and extract a list of all the requested URLs, corresponding HTTP response status codes, response content types/sizes, and host-names.
  - The input to your program will be the path to a folder containing HAR files.
  - The output generated by your program will be a collection of JSON files -- one per host-name seen in the HTTP requests present in your input HAR files. These should be generated in a folder named "parsed-requests".
    - For example: if "file-1.har" has requests to the host-names "github.io" and "godaddy.com", and "file-2.har" has requests to the host-names "github.io" and "google.com", your program will generate three JSON files: "github.io", "google.com", and "godaddy.com". These files will contain the extracted information for each request made to the corresponding host.
- [15 points] Your program should also parse the generated HAR files and recreate the "html", "png", and "svg" content contained in all the HTTP responses (a minimal sketch covering both parts follows this list).
  - The output generated by your program should be a folder named "parsed-objects" containing the "html", "png", and "svg" objects received from the HTTP responses in each HAR file.
    - There should be one folder per observed host-name in "parsed-objects". Continuing from the above example, there should be a "parsed-objects/github.io" folder containing all "html", "png", and "svg" objects loaded from "github.io".
    - You might find it useful to know that you can decode the images (from base64 text) as follows: base64 -d text.txt > decoded.png
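Below is a minimal sketch of such a parser, assuming Python 3 and standard HAR 1.2 JSON as produced in Task I. The per-host output naming and the fallback object file names are illustrative assumptions.

    # har-parser.py -- a minimal sketch, assuming Python 3 and HAR 1.2 JSON.
    import base64
    import json
    import os
    import sys
    from collections import defaultdict
    from urllib.parse import urlparse


    def parse_har_folder(har_dir):
        per_host = defaultdict(list)

        for name in os.listdir(har_dir):
            if not name.endswith(".har"):
                continue
            with open(os.path.join(har_dir, name)) as f:
                har = json.load(f)

            for entry in har["log"]["entries"]:
                request, response = entry["request"], entry["response"]
                host = urlparse(request["url"]).hostname or "unknown"
                content = response.get("content", {})

                # Per-request summary for the parsed-requests/<host> JSON files.
                per_host[host].append({
                    "url": request["url"],
                    "status": response["status"],
                    "content_type": content.get("mimeType", ""),
                    "content_size": content.get("size", 0),
                })

                # Recreate html/png/svg objects under parsed-objects/<host>/.
                mime, body = content.get("mimeType", ""), content.get("text")
                if body is None:
                    continue
                for kind in ("html", "png", "svg"):
                    if kind in mime:
                        obj_dir = os.path.join("parsed-objects", host)
                        os.makedirs(obj_dir, exist_ok=True)
                        fname = os.path.basename(urlparse(request["url"]).path) or "index." + kind
                        raw = (base64.b64decode(body)
                               if content.get("encoding") == "base64"
                               else body.encode())
                        with open(os.path.join(obj_dir, fname), "wb") as out:
                            out.write(raw)
                        break

        os.makedirs("parsed-requests", exist_ok=True)
        for host, records in per_host.items():
            with open(os.path.join("parsed-requests", host), "w") as out:
                json.dump(records, out, indent=2)


    if __name__ == "__main__":
        parse_har_folder(sys.argv[1])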


Task II submission instructions:

- You will submit a single file named "task-2.zip". This is the zipped version of a folder named "task-2".
  - This folder will contain a file named "har-parser.py". This file should take as input the path to the folder containing the HAR files to be analyzed.
  - This folder will contain a file "requirements.txt" which will contain a list of python packages that need to be installed to make your code work.
  - This folder will contain a file "install.sh" which will contain bash code to automatically install any system tools/packages required to make your code work.
- Here is how we will evaluate this task:
  - We will run "sudo pip install -r requirements.txt".
  - We will then run "sudo chmod +x install.sh; ./install.sh".
  - We will then run "python har-parser.py task-1/har-data/www.google.com".
  - We will then check if the "parsed-requests" folder created by your program contains the expected output.
  - We will then check if the "parsed-objects" folder created by your program contains the "html", "png", and "svg" objects loaded from each of the web hosts.


Task III: Re-serving web content (30 points)

As part of your third task, you will learn how to serve web content and replicate the HTTP request and response process in an emulated network.

Specifically, you will do the following:

- [15 points] You will write a program to create a mininet virtual network containing one host named "client" and one "server" for each host-name contained in the input folder. The input folder will have one sub-folder for each host-name, containing the objects expected to be served by the web server mimicking that host. Each of your emulated "web server" hosts will run an actual web server which is expected to serve these objects. Each server will write the list of files that it is able to serve to disk. (A minimal sketch of this part follows this list.)
  - The input to your program will be the path to a "parsed-objects" folder derived from Task II.
  - Output: each of the mininet web servers (one for each host seen in the "parsed-objects" folder) will create a file named <host-name> listing the files that it is able to serve. This output needs to be stored in a folder named "servable-content".
- [15 points] Your program will automatically rewrite the HTTP requests exiting from your client as follows: if the request is for an object that is already being served by one of your emulated web servers, then the HTTP request should be rewritten to fetch that object from that web server. Other objects may be fetched from the Web as normal (i.e., do not rewrite the request URLs of objects not available through your emulated servers).
  - For example: if the client is loading "www.google.com" and the "index.html" file is already available through your emulated "google.com" web host, then the object should be fetched from there instead.
  - The input to your program will be the path to a file containing a list of URLs, one URL per line.


  - The output should be the responses to the GET <url> requests. These responses should be written into a "get-responses" folder. Files should be named by their index in the input file.
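Below is a minimal sketch of the topology-and-serving half, assuming Python 3, the mininet Python API, and Python's built-in http.server as the per-host web server. The controller choice, host names, port, and folder layout are illustrative assumptions, and mininet itself must be run as root.

    # emulated-web.py -- a minimal sketch of the topology half, assuming the
    # mininet Python API and Python's built-in http.server; names are illustrative.
    import os
    import sys

    from mininet.net import Mininet
    from mininet.node import OVSController


    def build_network(parsed_objects_dir):
        net = Mininet(controller=OVSController)
        net.addController("c0")
        switch = net.addSwitch("s1")

        client = net.addHost("client")
        net.addLink(client, switch)

        # One emulated web server per host-name seen in parsed-objects/.
        servers = {}
        for i, hostname in enumerate(sorted(os.listdir(parsed_objects_dir))):
            server = net.addHost("srv%d" % i)
            net.addLink(server, switch)
            servers[hostname] = server

        net.start()

        os.makedirs("servable-content", exist_ok=True)
        for hostname, server in servers.items():
            content_dir = os.path.abspath(os.path.join(parsed_objects_dir, hostname))

            # Record which files this server can serve (the spec asks each server
            # to write this itself; the controlling script does it here for brevity).
            with open(os.path.join("servable-content", hostname), "w") as out:
                out.write("\n".join(sorted(os.listdir(content_dir))) + "\n")

            # Serve the host's objects with a simple HTTP server on port 80.
            server.cmd("cd %s && python3 -m http.server 80 &" % content_dir)

        return net, servers


    if __name__ == "__main__":
        net, servers = build_network(sys.argv[1])
        # The request-rewriting half for sys.argv[2] (the url_list) would go here:
        # one workable design maps each known host-name to its emulated server's IP
        # (servers[hostname].IP()) before the client issues its GET requests, and
        # leaves all other requests untouched.
        net.stop()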

Task III submission instructions:

- You will submit a single file named "task-3.zip". This is the zipped version of a folder named "task-3".
  - This folder will contain a file named "emulated-web.py". This file should take as input the path to a root "parsed-objects" folder.
  - This folder will contain a file "requirements.txt" which will contain a list of python packages that need to be installed to make your code work.
  - This folder will contain a file "install.sh" which will contain bash code to automatically install any system tools/packages required to make your code work.
- Here is how we will evaluate this task:
  - We will run "sudo pip install -r requirements.txt".
  - We will then run "sudo chmod +x install.sh; ./install.sh".
  - We will then run "python emulated-web.py ./input/parsed-objects ./input/url_list".
  - We will then check if the "servable-content" folder contains the expected set of files/content based on the supplied "./input/parsed-objects" folder.
  - We will then monitor the outgoing HTTP requests from your client to see if the URLs are being rewritten as expected and if the responses in "get-responses" match the expected output.


Task IV: The credit reel (10 points)

As always, you will get 10 points for submitting a well-formatted credit reel. This should be in a file named "credit-reel.txt". Follow the same instructions as in the previous assignments.

Submission instructions

Each group is to submit a single zip file (which will contain 3 zip files -- 1 per task). The submissions are due on ICON at 23:59:59 on November 9th, 2018. The last submission made by a team member before midnight on the due date will be the one graded, unless ALL team members let the TA and me know that they want another submission to be graded (the late penalty applies if a submission made past the due date is chosen).

Late submissions

I am being generous in the amount of time allotted to this assignment to account for difficulties in scheduling meetings, etc. There will be no extensions of the due date under any circumstances. If a submission is received past the due date, the late policy detailed on the course webpage will apply.

Team-mate feedback

Each team member may also send me an email (rishab-nithyanand@uiowa.edu) with the subject "Feedback: Assignment 4, Group N" detailing their experience working with each of their team-mates. For each team member, tell me at least one good thing and one thing they could improve. These will be anonymized and released to each individual at the end of the term. It's important to know how to work well in a team, and early feedback before you move on to bigger and better things is always helpful. Sending feedback for all 4 assignments will fetch you a 4% bonus at the end of the term. Note: sending with an incorrect subject line means that the email will not get forwarded to the right inbox.

