
GEOG0051 Mining Social and Geographic Datasets


1 Overview of Tasks

The coursework for the module consists of two separate tasks. The first concerns analysing mobility patterns in the Gowalla Cambridge (GC) dataset, and the second concerns a machine learning task analysing a venue-review dataset. Although each of these tasks will have sub-prompts to be answered, your responses to each of them should take the form of a coherent report addressing all of these prompts, rather than discrete paragraphs specifically answering individual prompts. Literature can be used to give context to the study. Finally, any datasets that you require will be uploaded to the Assessment tab on the course Moodle page.

1.1 Submission format

Students should submit a report through Turnitin on the course Moodle page, under the 'Assessment' tab, containing a description and analysis of the methods taken and results obtained. The report should be a PDF document with text of font size 11 or 12, written fully in complete sentences (i.e. not using bullet points), and of a maximum length of 2,500 words, which you are free to divide in any way between your responses for the two tasks. The word count includes the title, headings, sub-headings, introduction, conclusion and captions of figures or tables, but excludes the coursework cover page and the bibliography (list of references) at the end of the document. The report should not contain actual code.

The maximum number of figures is 10 in total (multiple sub-figures used to make the same point are allowed) and the

relevance of these figures should be explained in your write-up.

The code developed by the student should be submitted using a separate submission link available on the course Moodle page, in a single ZIP (compressed) file. The code can be submitted either as Jupyter notebook(s), i.e. .ipynb files, or as .py files, but they must be contained within one ZIP file. The report should not contain any of the code or functions used, as these are in the code submission itself. The submission deadline is noon on the 24th of April, 2023. Further details on the submission procedures will be available on Moodle.

Note: FAILURE TO INCLUDE YOUR FULL NOTEBOOKS/CODE WILL INCUR A 7-POINT PENALTY.

1.2 Queries

All related queries must be posted on the Moodle forum; this is largely to address a likely overlap in questions that students may have, and so that all students will benefit from any clarification that is given.

Questions seeking clarification about, for instance, the wording of the task briefs or the format of submission will be answered. However, as this is an assessed piece of work, you may not ask questions that pertain directly to the coursework itself, e.g. "Is analysis X the best way to answer question 1a?" For the same reason, any collaboration or discussion of the coursework with anyone is strictly prohibited. The rules for plagiarism apply, and any cases of suspected plagiarism of other works, published or not, will be taken very seriously.

The deadline for any questions to be asked and answered is noon on the 17th of April, 2023, i.e. one week before the submission deadline (24th of April, 2023).


2 Mobility Patterns Analysis in Cambridge

For the first task, you will be analysing the mobility patterns of users of Gowalla, a now-defunct online geo-social network from a decade ago. On Gowalla, users were able to check in at different locations across the course of the day. The dataset provided to you (available on Moodle) is a subset of Gowalla users located in Cambridge, UK, from the Stanford Network Analysis Project at Stanford University. The data has been anonymised (personal identifiers removed); however, you could still trace the locations of particular individuals from their check-in locations.

For further information, the entire dataset is available at https://snap.stanford.edu/data/loc-gowalla.html.

2.1 Format of Data

The variables contained in the dataset (which should be self-explanatory), provided in a .csv file, are:

- User ID, or the unique identifier of the user, e.g. 196514
- check-in-date, e.g. 2010-07-24
- check-in-time, e.g. 13:45:06
- latitude, e.g. 53.3648119
- longitude, e.g. -2.2723465833
- loc id, or the unique identifier of the location, e.g. 145064

2.2 Analysis Prompts

2.2.1 Visualise individual check-in locations

Visualise the check-in locations of the GC dataset for users with User IDs [75027] and [102829] using the GeoPandas/Folium libraries, or any library that enables mapping. Comment briefly on your findings on the locations visited by the two users. You should also comment briefly on the privacy implications of this type of analysis. [Note: This task primarily serves to help familiarise you with the dataset; we advise not to spend too long on it.]
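
By way of illustration, a minimal Folium sketch along the following lines could produce such a map. The file name gowalla_cambridge.csv and the column headers user_id, latitude and longitude are assumptions and should be replaced with the actual headers of the provided .csv file.

```python
import pandas as pd
import folium

# Assumed file name and column headers; adjust to match the provided CSV.
df = pd.read_csv("gowalla_cambridge.csv")

# Centre the map roughly on Cambridge, UK.
m = folium.Map(location=[52.2053, 0.1218], zoom_start=13)

colours = {75027: "blue", 102829: "red"}
for user_id, colour in colours.items():
    checkins = df[df["user_id"] == user_id]
    for _, row in checkins.iterrows():
        folium.CircleMarker(
            location=[row["latitude"], row["longitude"]],
            radius=4,
            color=colour,
            fill=True,
            popup=f"user {user_id}",
        ).add_to(m)

m.save("checkins_75027_102829.html")  # open in a browser to inspect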

2.2.2 Provide Characterisation of the Gowalla dataset

Provide a characterisation of the data available for user [75027] on 30/01/2010 and for user [102829] on 24/05/2010, by visualising the shortest paths (on the street network) between each pair of consecutive stop-points for the user, using the OSMnx library. Then, summarise your answers in a table in your report and compute, for each user:

- the maximum displacement (i.e. the maximum distance between two consecutive locations they moved between);
- the average displacement (i.e. the average distance between two consecutive locations/check-ins);
- the total distance travelled on the day.

Note: All distances should be described in network distance (driving or walking), i.e. the distances of paths along the street network, rather than crow-fly distances that take no account of the street network.
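
As a rough illustration of one possible workflow (not the required one), the sketch below uses OSMnx (assuming version 1.x or later) together with NetworkX. The file name user_75027_day.csv is a hypothetical placeholder for one user's check-ins on the chosen day, already sorted by check-in time, and the choice of a walking network is an assumption; a driving network could equally be used.

```python
import osmnx as ox
import networkx as nx
import pandas as pd

# Walking street network for Cambridge, UK (network type is an assumption).
G = ox.graph_from_place("Cambridge, England, UK", network_type="walk")

# Hypothetical pre-filtered file: one user's check-ins for one day, sorted by time,
# with 'latitude' and 'longitude' columns.
checkins = pd.read_csv("user_75027_day.csv")

# Snap each check-in to its nearest street-network node.
nodes = ox.distance.nearest_nodes(
    G, X=checkins["longitude"].values, Y=checkins["latitude"].values
)

# Network distance (metres) between each pair of consecutive check-ins.
displacements = [
    nx.shortest_path_length(G, u, v, weight="length")
    for u, v in zip(nodes[:-1], nodes[1:])
]

print("maximum displacement (m):", max(displacements))
print("average displacement (m):", sum(displacements) / len(displacements))
print("total distance travelled (m):", sum(displacements))

# Visualise the consecutive shortest-path routes on the street network
# (skipping pairs that snap to the same node).
routes = [
    nx.shortest_path(G, u, v, weight="length")
    for u, v in zip(nodes[:-1], nodes[1:])
    if u != v
]
fig, ax = ox.plot_graph_routes(G, routes)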

2.2.3 Comparative analysis of check-in frequencies and network centrality

Describe the general pattern of user check-ins in the Gowalla dataset in relation to closeness centrality measures for the City of

Cambridge, UK, using whatever visual aids you see as fitting to your analysis. Comment on any observable trends which you

find most noticeable and/or interesting.
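
One possible way to obtain closeness centrality values for the Cambridge street network is sketched below with OSMnx and NetworkX; the choice of the driving network and of exact (rather than approximate) centrality are assumptions, and the check-in frequencies would still need to be aggregated or overlaid separately for the comparison.

```python
import osmnx as ox
import networkx as nx

# Street network for Cambridge, UK (network type is an assumption).
G = ox.graph_from_place("Cambridge, England, UK", network_type="drive")

# Closeness centrality of every intersection, using edge length (metres) as distance.
# Note: exact closeness on the full network can take several minutes to compute.
closeness = nx.closeness_centrality(G, distance="length")
nx.set_node_attributes(G, closeness, "closeness")

# Colour intersections by closeness centrality for visual comparison with
# the spatial distribution of Gowalla check-ins.
node_colours = ox.plot.get_node_colors_by_attr(G, "closeness", cmap="plasma")
fig, ax = ox.plot_graph(G, node_color=node_colours, node_size=8, edge_linewidth=0.5)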

2.2.4 Urban Planning Application Question

Imagine that you are taking the role of a consultant to the authorities in Cambridge responsible for urban planning. Choose one of the following urban features and propose a new location where you would build that feature: museum, shopping mall, fire station, community park or kindergarten. Use the outputs of your analysis from the task above (2.2.3) and any relevant knowledge of the local area to justify your decision. [Note: You do not have to do any further analysis/visualisation by code. However, if you feel that your response could benefit from further analysis, you can choose to briefly describe what accompanying analysis you would undertake.]


3 Machine Learning Analysis with Venue Review Data in Calgary, Canada

For this second task, we would like you to analyse a dataset that contains review data for different venues in the city of Calgary, Canada. With the help of several machine learning techniques that we have learnt in the course, you will be tasked with distilling insights from this social media dataset. Two of its notable features are the geocoding of every reviewed venue and the availability of a considerable amount of text data, which lend the dataset to spatial and text analysis techniques respectively.

As a prelude to the analysis prompts below, have a brief think about some of these questions: What can we discover about the

venue review data? Are there any spatial patterns that can be extracted from the data? Can we build a machine learning model

that predicts review rating for unseen data points using the text of the reviews?

3.1 Format of Data

The variables contained in the dataset, provided in a .csv file, are:

- 'business id', unique identifier of the premise
- 'Name', name of the premise
- 'latitude', 'longitude', i.e. the locational attributes of the venue
- 'review count', or the number of reviews the venue has been given
- 'categories', general category of establishment that a venue falls under (Note: this variable is rather messy and requires cleaning to be used)
- 'hours', or the opening hours of the venue
- 'review id', unique identifier of the review
- 'user id', unique identifier of the individual who left the review
- 'stars y', individual rating of the venue
- 'useful', 'funny', 'cool', i.e. tags for the review (similar to the number of "Likes" for a review)
- 'text', text of the review
- 'date', i.e. the date of the review

3.2 Analysis Prompts

3.2.1 Loading and cleaning the textual dataset

In a realistic context, most text datasets are messy in their raw forms. They require considerable data cleaning before any analysis can be conducted and, not unlike data cleaning for non-textual datasets, this includes the removal of invalid data, missing values and outliers. In this first prompt you will be required to complete the tasks stated below to prepare the dataset for subsequent analysis.

- Load and understand the dataset.
- Think about which attributes you will use / focus on (in subsequent prompts) and check their data distributions.
- Pre-process the text review data and create a new column in the data frame to hold the cleaned review data. Some of the steps to consider are: removal of numbers, punctuation, short words and stopwords, lemmatising words, etc.

Note that while there are no immediate outputs from this prompt that you will be assessed on, you will be assessed on the process of data cleaning that you detail in your report. Furthermore, the quality of your data cleaning for a text analysis task will strongly impact your outputs, and thus you should spend a reasonable proportion of your time on this task.
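
A minimal cleaning sketch with pandas and NLTK is given below. The file name calgary_reviews.csv and the output column name clean_text are hypothetical, while the 'text' column follows the variable list in 3.1 (the exact header should be checked against the provided file).

```python
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

df = pd.read_csv("calgary_reviews.csv")  # hypothetical file name

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_review(text: str) -> str:
    """Lower-case, strip numbers/punctuation, drop short words and stopwords, lemmatise."""
    text = str(text).lower()
    text = re.sub(r"[^a-z\s]", " ", text)                 # remove numbers and punctuation
    tokens = [t for t in text.split() if len(t) > 2]      # remove short words
    tokens = [t for t in tokens if t not in stop_words]   # remove stopwords
    tokens = [lemmatizer.lemmatize(t) for t in tokens]    # lemmatise
    return " ".join(tokens)

# New column holding the cleaned review text.
df["clean_text"] = df["text"].apply(clean_review)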


3.2.2 Build a supervised learning model for text analysis

The objective of this sub-task is to build a supervised learning model that predicts the polarity (positive or negative) of the venue from the data, based on the different features of each review included in the dataset. Positive polarity here is defined as a venue rating of 4 or more stars, and negative polarity as a venue rating of 3 or fewer stars. You can choose a subset of venues to analyse, for example based on the general category (use) that a venue falls under. You can use a combination of text and non-text features, and below are some guidelines that you could follow:

- Firstly, tokenise the pre-processed review text data to give bag-of-words features that can be used in your model.
- Create a polarity score from the stars rating.
- Split the dataset (e.g. into train and test sets).
- Train and compare the efficacy of no fewer than two machine learning models in predicting the polarity. You can decide what you would like to vary.
- Report the model results (on the out-of-sample test set).
- Discuss and interpret the results you obtained.
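
A possible scikit-learn pipeline is sketched below, assuming the cleaned review text from 3.2.1 is held in a clean_text column and the star rating in a stars_y column (both names are assumptions to be checked against your data frame); logistic regression and naive Bayes are used here purely as examples of two models to compare.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Polarity label: 1 for ratings of 4 stars or more, 0 for 3 stars or fewer.
df["polarity"] = (df["stars_y"] >= 4).astype(int)

# Bag-of-words (TF-IDF weighted) features from the cleaned review text.
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df["clean_text"])
y = df["polarity"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train and compare two models on the held-out (out-of-sample) test set.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "multinomial naive Bayes": MultinomialNB(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))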

3.2.3 Geospatial analysis and visualisation of review data

Having explored the dataset, its constituent variables and coverage above, the objective of this sub-task is for you to visualise any spatial patterns that emerge from the data that you find interesting. This task is intentionally open-ended and leaves you with some choice. To achieve this, you should:

- Choose 1 or 2 variables that you wish to explore from the list of variables available in the dataset (including any variables you generated in 3.2.2).
- Use either or both of the GeoPandas and Folium libraries in Python to produce up to 3 visualisations.
- Comment on the spatial distributions of the 1-2 variables you chose, any trends or outliers that emerge, and whether they have any notable implications.

Note: You may use any subset of the dataset instead of the entire dataset, but comment on why you chose this subset.
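
As one example among many, the sketch below maps the mean star rating per venue with GeoPandas; the column names follow the variable list in 3.1 (underscores in the actual headers are an assumption) and should be adjusted if they differ.

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Mean star rating per venue (assumed columns: business_id, latitude, longitude, stars_y).
venues = (
    df.groupby(["business_id", "latitude", "longitude"], as_index=False)["stars_y"].mean()
)

gdf = gpd.GeoDataFrame(
    venues,
    geometry=gpd.points_from_xy(venues["longitude"], venues["latitude"]),
    crs="EPSG:4326",
)

# Point map of venues coloured by mean rating; a Folium map would work equally well.
ax = gdf.plot(column="stars_y", cmap="RdYlGn", markersize=10, legend=True, figsize=(8, 8))
ax.set_title("Mean review rating per venue, Calgary")
plt.show()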

3.2.4 Business Intelligence Application Question

Imagine that you are taking the role of a restaurant owner in Calgary, and select a location where you would like to open your restaurant. Use the outputs of your analysis from the tasks above and any relevant knowledge of the local area to justify your decision. [Note: You do not have to do any further analysis/visualisation. However, if you feel that your response could benefit from further analysis, you can choose to briefly describe what accompanying analysis you would undertake.]

3.2.5 Extra task (Optional)

For extra marks, you could choose ONE of EITHER:

(a) Use a pretrained neural word embedding method (e.g. word2vec) for the supervised learning task and compare the results with the bag-of-words features, OR

(b) Apply topic modelling (e.g. LDA) to the text data and give a characterisation of each of the topics that your topic model generates. Comment briefly on whether these characterisations were roughly what you expected beforehand.
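
For option (b), one possible sketch uses scikit-learn's LatentDirichletAllocation on the cleaned text from 3.2.1 (gensim's LdaModel would be an equally valid choice); the clean_text column and the choice of 10 topics are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# LDA works on raw term counts rather than TF-IDF weights.
count_vec = CountVectorizer(max_features=5000)
counts = count_vec.fit_transform(df["clean_text"])

lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(counts)

# Characterise each topic by its ten highest-weighted terms.
terms = count_vec.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:10]]
    print(f"topic {topic_idx}: {', '.join(top_terms)}")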


