CS 3640: Introduction to Networks and Their Applications [Fall 2018]
Assignment 4 | Web Scraping: Record and Replay
Instructor: Rishab Nithyanand | Office hours: Wednesday 9-10 am or by appointment
Teaching assistant: Md. Kowsar Hossain | Office hours: Monday 1:30-2:30 pm
Released on: October 25th | Due on: November 9th (11:59:59 pm)
Maximum score: 100 | Value towards final grade: 13%
Groups
Group ID Group Hawk IDs
1 ['jblue', 'kzhang24', 'zluo1', 'ywang391', 'susmerano']
2 ['xchen117', 'jstoltz', 'jpflint', 'godkin']
3 ['mcagley', 'kdzhou', 'lye1', 'okueter', 'yitzhou']
4 ['msmith3', 'zzhang103', 'yonghfan', 'tnlowry']
5 ['mfmrphy', 'jmagri', 'trjns', 'jpthiede', 'uupadhyay']
6 ['dstutz', 'cweiske', 'hrunning', 'nicgoh']
7 ['awestemeier', 'nsonalkar', 'bzhang22', 'tsimonson']
8 ['xiaosong', 'jdhatch', 'tgoodmn', 'apatrck']
9 ['atran4', 'ymann', 'bchoskins', 'hpen']
10 ['apizzimenti', 'jglowacki', 'xxing2', 'yzheng19']
11 ['gongyzhou', 'ywang455', 'shangwchen', 'ppeterschmidt']
12 ['sklemm', 'weigui', 'lburden', 'gmich']
Learning goals
This assignment is intended to familiarize you with the HTTP protocol. HTTP is
(arguably) the most important application-level protocol on the Internet today: the Web
runs on HTTP, and increasingly other applications use HTTP as well (including
BitTorrent, streaming video, Facebook and Twitter social APIs, etc.). You will also get
very familiar with webdrivers and the HAR format.
Download your VM
VM link:
https://drive.google.com/file/d/1rwdZkCJS8fVLwpNLEYgAUYJUrwvBQ6jx/view?usp=sharing
- Extract the tar.gz file that you just downloaded.
- Download Virtual Box (it's free and open source) from here: www.virtualbox.org.
- Open Virtual Box and create a new machine. See instructions below!
- The system will boot up Ubuntu 18.04. Your username and password on this system
are both "cs3640".
Virtual machine setup
- Click "New".
- Type in the name of your new virtual machine. Select "Linux" as Type and "Ubuntu 64
bit" as Version. Click next.
- Allocate at least 2GB (2048MB) RAM to your VM.
- When you're asked about creating a disk, click the "Use an existing virtual hard disk
file" option. The disk you should select is "cs3640-assignment-4.vmdk" located in the
folder you just extracted.
- That's it! Now every time you need to boot up the VM just open Virtual Box and select
your VM.
Task I: Crawling the Web (30 points)
As part of your first task, you will learn how to programmatically scrape the Web by
instrumenting a Web browser such as Chrome or Firefox to automatically load
webpages for you and record content from these webpages.
Specifically, you will do the following:
- [15 points] You will write a program that will read an input file containing a list of
URLs and open a Web browser to visit each of the URLs in sequence.
- You might consider using the Selenium Webdriver Python API to do this.
I’ve used it for years and it’s always been (in my opinion) the best
webdriver out there.
- https://selenium-python.readthedocs.io/getting-started.html
- Create your own input text file named “url_list” which contains a list of
URLs that the browser must visit in order. One URL per line.
- [15 points] Your program should also record all HTTP requests being issued by
the browser and HTTP responses from the web servers. These requests and
responses should be saved in the HTTP Archive (HAR) format.
- All recorded HAR files should be saved in a “har-data” folder contained in
the same working directory. This folder will contain one folder for each
URL crawled.
- All HAR files generated while crawling this URL should be placed in
this folder. For example, you will have a ./har-data/<url> folder
which will contain the HAR files recorded while visiting <url>.
- There are many different ways to generate a HAR file. The easiest is to
use a tool which sits between the network and your browser
and records this information for you (see the links below). Another way is to use browser
extensions such as Firebug + NetExport or HAR Export Trigger to do this
for you (this is more complicated).
- https://browsermob-proxy-py.readthedocs.io
- https://selenium-python.readthedocs.io/faq.html
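As a rough illustration of the two steps above, the following Python 3 sketch reads the URL list, visits each URL, and writes one HAR file per URL under "har-data". It is only one possible approach, assuming Selenium with Chrome and a BrowserMob Proxy instance that records the traffic. The helper names (read_urls, folder_for, crawl, start_browser), the file name "capture.har", the folder-naming scheme, and the browsermob_path argument are hypothetical choices and not requirements of the assignment.

```python
import json
import os
import re


def read_urls(path):
    """Return the non-empty lines of the URL list, in order."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def folder_for(url):
    """Turn a URL into a folder name. Assumed scheme: drop 'http(s)://',
    then replace characters that are unsafe in file names with '_'."""
    name = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url).rstrip("/")
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)


def crawl(urls, driver, proxy, out_dir="har-data"):
    """Visit each URL with `driver` while `proxy` records a HAR for it."""
    for url in urls:
        target = url if "://" in url else "http://" + url
        proxy.new_har(url)      # start a fresh capture for this page
        driver.get(target)      # load the page in the browser
        folder = os.path.join(out_dir, folder_for(url))
        os.makedirs(folder, exist_ok=True)
        # "capture.har" is a hypothetical file name.
        with open(os.path.join(folder, "capture.har"), "w") as handle:
            json.dump(proxy.har, handle)


def start_browser(browsermob_path):
    """Start BrowserMob Proxy and a Chrome instance routed through it.
    Imports are local so the helpers above work without these packages."""
    from browsermobproxy import Server
    from selenium import webdriver

    server = Server(browsermob_path)   # path to the browsermob-proxy binary
    server.start()
    proxy = server.create_proxy()
    options = webdriver.ChromeOptions()
    options.add_argument("--proxy-server=" + proxy.proxy)
    options.add_argument("--ignore-certificate-errors")  # proxy re-signs HTTPS
    driver = webdriver.Chrome(options=options)
    return server, proxy, driver
```

In a real run you would call start_browser with the location of your BrowserMob Proxy install, pass the resulting driver and proxy to crawl(read_urls(sys.argv[1]), driver, proxy), and finish with driver.quit() and server.stop(). Because crawl only needs objects with get, new_har, and har, it can be exercised with simple stand-ins before a browser is involved.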
Task I submission instructions:
- You will submit a single file named “task-1.zip”. This is the zipped version of a
folder named “task-1”.
- This folder will contain a file named “web-scraper.py”. This file should
take as input the path to the file containing the list of URLs.
- This folder will contain a file “requirements.txt” which will contain a list of
python packages that need to be installed to make your code work.
- This folder will contain a file “install.sh” which will contain bash code to
automatically install any system tools/packages required to make your
code work.
- You will not submit your “har-data” folder.
- Here is how we will evaluate this task:
- We will run “sudo pip install -r requirements.txt”
- We will then run “sudo chmod +x install.sh; ./install.sh”
- We will then run “python web-scraper.py url_list” where “url_list” will be
supplied by us -- expect it to at least contain www.google.com.
- We will then check if the “har-data” folder created by your program
contains the HAR files expected to be generated while crawling each of
the URLs on our list.
Task II: Parsing HAR files (30 points)
As part of your second task, you will learn how to parse files in the HAR format. This
will also get you very familiar with the HTTP protocol and the request/response
message types.
Specifically, you will do the following:
- [15 points] You will write a program to parse HAR files and extract a list of all the
requested URLs, corresponding HTTP response status codes, response content
types/sizes, and host-names.
- The input to your program will be the path to a folder containing HARs.
- The output generated by your program will be a collection of JSON files
-- 1 per host-name seen in the HTTP requests present in your input HAR
files. These should be generated in a folder named “parsed-requests”.
- For example: If “file-1.har” has requests to the host-names
“github.io” and “godaddy.com” and “file-2.har” has requests to the
host-names “github.io” and “google.com”, your program will
generate 3 json files: “github.io”, “google.com”, and
“godaddy.com”. These files will contain the extracted information
for each request made to the corresponding host.
- [15 points] Your program should also parse the generated HAR files and
recreate “html”, “png”, and “svg” content contained in all the HTTP responses.
- The output generated by your program should be a folder named
“parsed-objects” containing the “html”, “png”, and “svg” received from
the HTTP responses in each HAR file.
- There should be one folder per observed host-name in “parsed-objects”.
Continuing from the above example, there should be a
“parsed-objects/github.io” folder containing all “html”, “png”, and
“svg” objects loaded from “github.io”.
- You might find it useful to know that you can decode the images
(from base64 text) as follows: base64 -d text.txt > decoded.png
(this is the flag on the Ubuntu VM; some macOS versions use -D instead).
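The two parsing steps above can be sketched in Python 3 as follows. This is a minimal outline and not a reference solution. It relies on the HAR 1.2 layout, where each item in log.entries carries request.url, response.status, and response.content with mimeType, size, text, and an optional encoding of "base64". The function names, the JSON field names in the output, the "unknown" fallback host, and the rule of naming each object after the last segment of its URL path (so later objects with the same name overwrite earlier ones) are all assumptions made for illustration.

```python
import base64
import json
import os
from collections import defaultdict
from urllib.parse import urlparse

# MIME types to recreate, mapped to the file extension used on disk.
WANTED = {"text/html": ".html", "image/png": ".png", "image/svg+xml": ".svg"}


def load_entries(har_dir):
    """Yield every entry from every .har file in har_dir."""
    for name in sorted(os.listdir(har_dir)):
        if name.endswith(".har"):
            with open(os.path.join(har_dir, name), encoding="utf-8") as handle:
                for entry in json.load(handle)["log"]["entries"]:
                    yield entry


def host_of(url):
    """Host-name of a URL; 'unknown' is a hypothetical fallback."""
    return urlparse(url).hostname or "unknown"


def summarize(entries):
    """Group request/response summaries by host-name."""
    by_host = defaultdict(list)
    for entry in entries:
        content = entry["response"].get("content", {})
        url = entry["request"]["url"]
        by_host[host_of(url)].append({
            "url": url,
            "status": entry["response"]["status"],
            "content_type": content.get("mimeType", ""),
            "content_size": content.get("size", 0),
        })
    return by_host


def body_bytes(content):
    """Return the response body as bytes, or None if the HAR has no body."""
    text = content.get("text")
    if text is None:
        return None
    if content.get("encoding") == "base64":
        return base64.b64decode(text)
    return text.encode("utf-8")


def object_name(url, extension):
    """Name an object after the last URL path segment (assumed scheme)."""
    base = os.path.basename(urlparse(url).path) or "index"
    return base if base.lower().endswith(extension) else base + extension


def write_outputs(entries, requests_dir="parsed-requests",
                  objects_dir="parsed-objects"):
    """Write one JSON file per host and recreate html/png/svg bodies."""
    entries = list(entries)
    os.makedirs(requests_dir, exist_ok=True)
    for host, rows in summarize(entries).items():
        with open(os.path.join(requests_dir, host), "w") as handle:
            json.dump(rows, handle, indent=2)
    for entry in entries:
        content = entry["response"].get("content", {})
        mime = content.get("mimeType", "").split(";")[0].strip().lower()
        data = body_bytes(content)
        if mime in WANTED and data is not None:
            url = entry["request"]["url"]
            folder = os.path.join(objects_dir, host_of(url))
            os.makedirs(folder, exist_ok=True)
            with open(os.path.join(folder, object_name(url, WANTED[mime])), "wb") as handle:
                handle.write(data)
```

A call such as write_outputs(load_entries(sys.argv[1])) would then produce both output folders in the current working directory. Whether a body is stored as base64 or as plain text depends on the tool that recorded the HAR, which is why body_bytes checks the encoding field instead of assuming one form.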
Task II submission instructions:
- You will submit a single file named “task-2.zip”. This is the zipped version of a
folder named “task-2”.
- This folder will contain a file named “har-parser.py”. This file should take
as input the path to the folder containing the HAR files to be analyzed.
- This folder will contain a file “requirements.txt” which will contain a list of
python packages that need to be installed to make your code work.
- This folder will contain a file “install.sh” which will contain bash code to
automatically install any system tools/packages required to make your
code work.
- Here is how we will evaluate this task:
- We will run “sudo pip install -r requirements.txt”
- We will then run “sudo chmod +x install.sh; ./install.sh”
- We will then run “python har-parser.py
task-1/har-data/www.google.com”
- We will then check if the “parsed-requests” folder created by your
program contains the expected output.
- We will then check if the “parsed-objects” folder created by your program
contains the “html”, “png”, and “svg” objects loaded from each of the
web hosts.
Task III: Re-serving web content (30 points)
As part of your third task, you will learn how to serve web content and replicate the
HTTP request and response process in an emulated network.
Specifically, you will do the following:
- [15 points] You will write a program to create a mininet virtual network
containing one host named “client” and one “server” for each host-name
contained in the input folder. The input folder will have one sub-folder for each
host-name which contains the objects expected to be served by the webserver
mimicking that host. Each of your emulated “web server” hosts will run an actual
web server which is expected to serve these objects. Each server will write the
list of files that they are able to serve to disk.
- The input to your program will be the path to a “parsed-objects” folder
derived from Task II.
- Output: Each of the mininet web servers (1 for each host seen in the
“parsed-objects” folder) will create a file named <host-name> listing the
files that they are able to serve. This output needs to be stored in a folder
named “servable-content”.
- [15 points] Your program will automatically rewrite the HTTP requests exiting
from your client as follows: If the request is for an object that is already being
served by one of our emulated web servers, then the HTTP request should be
re-written to fetch that object from that web server. Other objects may be
fetched from the Web as normal (i.e., do not re-write the request URLs of
objects not available through your emulated servers).
- For example: If the client is loading “www.google.com” and the
“index.html” file is already available through our emulated “google.com”
web host, then the object should be fetched from there instead.
- The input to your program will be the path to a file containing a list of
URLs, one URL per line.
- The output should be the responses to the GET <url> requests. These
responses should be written into a “get-responses” folder. Files should
be named by their index in the input file.
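The bookkeeping behind the two steps above (listing what each emulated server can serve, and deciding whether a request should be redirected) can be sketched in Python 3 without any Mininet code. This is an illustrative outline under stated assumptions: server_ips is a hypothetical dictionary mapping each host-name to the IP address of its emulated server (in practice you would obtain these from your Mininet hosts), objects are assumed to be stored under the last segment of their URL path with "index.html" standing in for an empty path, and a request for "www.google.com" is assumed to match a "google.com" folder as in the example above. Creating the Mininet hosts and starting a web server on each is left out.

```python
import os
from urllib.parse import urlparse, urlunparse


def list_servable(parsed_objects_dir):
    """Map each host-name sub-folder to the sorted list of files in it."""
    servable = {}
    for host in sorted(os.listdir(parsed_objects_dir)):
        folder = os.path.join(parsed_objects_dir, host)
        if os.path.isdir(folder):
            servable[host] = sorted(os.listdir(folder))
    return servable


def write_servable(servable, out_dir="servable-content"):
    """Write one file per host-name listing what its server can serve."""
    os.makedirs(out_dir, exist_ok=True)
    for host, files in servable.items():
        with open(os.path.join(out_dir, host), "w") as handle:
            handle.write("\n".join(files) + "\n")


def rewrite_url(url, servable, server_ips):
    """Point url at an emulated server if that server holds the object;
    otherwise return it unchanged so it is fetched from the Web."""
    parts = urlparse(url if "://" in url else "http://" + url)
    hostname = parts.hostname or ""
    wanted = os.path.basename(parts.path) or "index.html"
    for host, files in servable.items():
        matches = hostname == host or hostname.endswith("." + host)
        if matches and wanted in files and host in server_ips:
            # Keep path and query, swap in the emulated server's address.
            return urlunparse(parts._replace(scheme="http",
                                             netloc=server_ips[host]))
    return url
```

For instance, with servable = {"google.com": ["index.html"]} and the hypothetical mapping server_ips = {"google.com": "10.0.0.2"}, a request for "www.google.com" is redirected to the emulated server, while a request for an object that server does not hold is returned untouched.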
Task III submission instructions:
- You will submit a single file named “task-3.zip”. This is the zipped version of a
folder named “task-3”.
- This folder will contain a file named “emulated-web.py”. This file should
take as input the path to a root “parsed-objects” folder and the path to a
file containing the list of URLs.
- This folder will contain a file “requirements.txt” which will contain a list of
python packages that need to be installed to make your code work.
- This folder will contain a file “install.sh” which will contain bash code to
automatically install any system tools/packages required to make your
code work.
- Here is how we will evaluate this task:
- We will run “sudo pip install -r requirements.txt”
- We will then run “sudo chmod +x install.sh; ./install.sh”
- We will then run “python emulated-web.py ./input/parsed-objects
./input/url_list”
- We will then check if the “servable-content” folder contains the expected
set of files/content based on the supplied “./input/parsed-objects” folder.
- We will then monitor the outgoing HTTP requests from your client to see
if the URLs are being rewritten as expected and if the responses in
“get-responses” match the expected output.
Task IV: The credit reel (10 points)
As always, you will get 10 points for submitting a well-formatted credit reel. This should
be in a file named “credit-reel.txt”. Follow the same instructions as the previous
assignments.
Submission instructions
Each group is to submit a single zip file (which will contain 3 zip files -- 1 per task). The
submissions are due on ICON at 23:59:59 on November 9th, 2018. The last submission
submitted by a team member before midnight on the due date will be the one graded
unless ALL team members let the TA and me know that they want another submission
to be graded (the late penalty will apply if a submission made past the due date is chosen).
Late submissions
I am being generous in the amount of time allotted to this assignment to account for
difficulties in scheduling meetings, etc. There will be no extensions of the due date
under any circumstances. If a submission is received past the due date, the late policy
detailed on the course webpage will apply.
Team-mate feedback
Each team member may also send me an email (rishab-nithyanand@uiowa.edu) with
subject "Feedback: Assignment 4, Group N" detailing their experience working with
each of their team-mates. For each team member, tell me at least one good thing and
one thing they could improve. These will be anonymized and released to each
individual at the end of the term. It's important to know how to work well in a team and
early feedback before you move on to bigger and better things is always helpful.
Sending feedback for all 4 assignments will earn you a 4% bonus at the end of the
term. Note: Sending with an incorrect subject line means that the email will not get
forwarded to the right inbox.