BIG DATA SCIENCE
HW- 3 : Topic Modeling
Topic modeling is a natural language processing technique used to uncover hidden themes or topics within a large collection of documents. It involves identifying patterns in the words and phrases used across the corpus to group related documents together, aiding in tasks like document clustering, summarization, and content recommendation.
You are given 32 documents in PDF format. Your task is to extract important themes from the given corpus.
I. Perform the following tasks:
1. Data Cleaning: [10 points]
a. Convert PDFs to text, and extract the title and abstract from each document.
b. Programmatically preprocess the texts, which involves removing stop words, removing special or unusual characters, lowercasing, lemmatization, etc.
** You may have to do some manual preprocessing.
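As an illustration only (not the required toolchain), here is a minimal sketch of steps 1a and 1b using pdfplumber (linked under Additional Resources) and NLTK. The folder name "papers/" and the title/abstract heuristic are assumptions; expect to patch them per document, which is where the manual preprocessing comes in.

import re
import glob
import pdfplumber
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def pdf_to_text(path):
    # Concatenate the text of every page in one PDF
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)

def preprocess(text):
    # Lowercase, strip special characters, drop stop words, lemmatize
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = [lemmatizer.lemmatize(t) for t in text.split()
              if t not in stop_words and len(t) > 2]
    return " ".join(tokens)

docs = []
for path in sorted(glob.glob("papers/*.pdf")):   # assumed folder name
    raw = pdf_to_text(path)
    # Crude heuristic: title = first non-empty line; abstract = text between
    # "abstract" and the next heading. Expect to fix edge cases manually.
    lines = [ln for ln in raw.splitlines() if ln.strip()]
    title = lines[0] if lines else ""
    m = re.search(r"abstract(.*?)(introduction|keywords|1\.)", raw, re.I | re.S)
    abstract = m.group(1).strip() if m else ""
    docs.append({"path": path, "title": title,
                 "text": preprocess(title + " " + abstract)})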
2. Embeddings [10 points]
Embeddings are dense vector representations of data, designed to capture meaningful relationships and patterns within the data's high-dimensional space. They are widely used in machine learning and natural language processing to enhance the performance of various tasks, such as word similarity, document classification, image recognition, and knowledge graph analysis. Embeddings transform raw data into a form that is more amenable to mathematical operations and meaningful comparisons, thereby enabling improved data understanding and predictive modeling.
Some widely used embedding techniques include:
a. Word Embeddings [Word2Vec, GloVe, FastText]
b. Document Embeddings [Doc2Vec]
c. Sentence and Text Embeddings [BERT, Universal Sentence Encoder, GPT]
For this assignment, use any two pre-trained embedding models from distinct embedding techniques.
2.1 The output from this task should be a vector representation (embedding) of each document your algorithm processed. These embeddings will be used in the subsequent steps.
2.2 Pick 5 random documents. For each, find the 3 most similar papers (using cosine similarity on the embeddings) and provide a short analysis of whether the results are relevant when you actually read the documents. Add a summary of your findings in your_netID.txt.
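One possible way to satisfy this task is sketched below: averaged pre-trained GloVe word vectors (a word-embedding technique) paired with a pre-trained Sentence-BERT model (a sentence/text-embedding technique). The model names "glove-wiki-gigaword-100" and "all-MiniLM-L6-v2", and the docs list from the task 1 sketch, are assumptions; any two pre-trained models from distinct families are acceptable.

import random
import numpy as np
import gensim.downloader as api
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

texts = [d["text"] for d in docs]     # cleaned title+abstract from task 1
titles = [d["title"] for d in docs]

# (a) Word-embedding technique: average pre-trained GloVe vectors per document
glove = api.load("glove-wiki-gigaword-100")
def avg_glove(text):
    vecs = [glove[w] for w in text.split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(glove.vector_size)
glove_emb = np.vstack([avg_glove(t) for t in texts])

# (b) Sentence/text-embedding technique: pre-trained Sentence-BERT
sbert = SentenceTransformer("all-MiniLM-L6-v2")
sbert_emb = sbert.encode(texts)

# 2.2 -- for 5 random documents, list the 3 most similar papers by cosine similarity
sim = cosine_similarity(sbert_emb)    # repeat with glove_emb to compare the models
for i in random.sample(range(len(texts)), 5):
    nearest = np.argsort(sim[i])[::-1][1:4]   # index 0 is the document itself
    print(titles[i], "->", [titles[j] for j in nearest])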
3. Dimensionality reduction [10 + 10 points]
Since the input embeddings are often high-dimensional, clustering becomes difficult due to the curse of dimensionality. Apply any dimensionality reduction technique (PCA, SVD, etc.) to reduce the embeddings to a new space of 30 dimensions.
For this task, you'll need to install Apache Spark and initiate a Spark session, utilizing its MLlib library for data processing. Although Apache Spark is designed for Big Data and might be overkill for a small dataset, gaining familiarity with it can be beneficial for future large-scale projects. You can find more info by going through the links attached below.
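A minimal sketch of this task, assuming the Sentence-BERT embeddings from the previous sketch and choosing PCA from Spark MLlib (SVD or another reducer would also satisfy the task); the column names are arbitrary:

from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("topic-modeling").getOrCreate()

# Wrap each document embedding as an MLlib dense vector
rows = [(int(i), Vectors.dense(vec.tolist())) for i, vec in enumerate(sbert_emb)]
df = spark.createDataFrame(rows, ["doc_id", "features"])

# Reduce the embeddings to a 30-dimensional space
pca = PCA(k=30, inputCol="features", outputCol="pca_features")
model = pca.fit(df)
reduced = model.transform(df).select("doc_id", "pca_features")
reduced.show(5, truncate=False)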
4. Clustering [20 points]
Cluster the reduced embeddings into groups of similar embeddings to extract your topics. You can use any clustering technique of your choice (KMeans, DBSCAN, Agglomerative Clustering, etc.). If you are using algorithms like KMeans, justify the choice of the number of clusters (hyperparameter) using the Elbow method or Silhouette Score. You may need to convert your Spark DataFrame into a pandas DataFrame for this task (a rough sketch of the full workflow follows item b below).
a. Save the following information in a CSV file [cluster_info.csv]: Cluster number (integer), Representative documents (string: title+abstract)
**Hint: For representative documents: Calculate the cluster centroids and pick 3 documents closest to the centroid.
b. Visualize the clusters created using any visualization tool (matplotlib, plotly, etc.). This includes reducing the dimensionality of the document embeddings to a 2D space. [clusters.png/jpeg]
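A rough sketch of this task, assuming the Spark DataFrame reduced from task 3 and the titles list from the task 2 sketch; the k range, the use of silhouette scores instead of an elbow plot, and the plotting details are all choices, not requirements:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA as SKPCA
from sklearn.metrics import silhouette_score

# Spark DataFrame -> pandas DataFrame -> NumPy matrix of reduced embeddings
pdf = reduced.toPandas().sort_values("doc_id")
X = np.vstack(pdf["pca_features"].apply(lambda v: v.toArray()))

# Justify the number of clusters with silhouette scores over a small range of k
scores = {k: silhouette_score(X, KMeans(n_clusters=k, random_state=0).fit_predict(X))
          for k in range(2, 9)}
best_k = max(scores, key=scores.get)

km = KMeans(n_clusters=best_k, random_state=0).fit(X)
labels = km.labels_

# (a) Representative documents: the 3 documents closest to each cluster centroid
cluster_rows = []
for c in range(best_k):
    idx = np.where(labels == c)[0]
    dists = np.linalg.norm(X[idx] - km.cluster_centers_[c], axis=1)
    reps = idx[np.argsort(dists)[:3]]
    # The spec asks for title+abstract strings; titles are used here for brevity
    cluster_rows.append({"cluster": c,
                         "representative_documents": [titles[i] for i in reps]})
pd.DataFrame(cluster_rows).to_csv("cluster_info.csv", index=False)

# (b) Visualize the clusters in a 2D projection
xy = SKPCA(n_components=2).fit_transform(X)
plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap="tab10")
plt.title("Document clusters (2D projection)")
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.colorbar(label="cluster")
plt.savefig("clusters.png")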
5. Class-based TF-IDF [20 points]
To obtain themes from each cluster, we will apply TF-IDF at the cluster level (not the document level). Create a class-based TF-IDF matrix. For example, if there are 5 clusters, then you will have 5 entries (rows) in your TF-IDF matrix. (In other words, you are combining the documents in a particular cluster and treating them as one document in the corpus.)
Extract the top six keywords from each cluster. Append this information to the cluster_info.csv file. [Cluster number, representative documents, top keywords]
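A sketch of one way to build the class-based TF-IDF matrix with scikit-learn, assuming texts, labels, and best_k from the earlier sketches (substitute your own cluster assignments if you took a different route):

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Combine the documents of each cluster into one "class document"
class_docs = [" ".join(texts[i] for i in np.where(labels == c)[0])
              for c in range(best_k)]

# One TF-IDF row per cluster, not per document
vectorizer = TfidfVectorizer()
ctfidf = vectorizer.fit_transform(class_docs)      # shape: (n_clusters, vocab_size)
terms = np.array(vectorizer.get_feature_names_out())

# Top six keywords per cluster, appended to cluster_info.csv
info = pd.read_csv("cluster_info.csv")
info["top_keywords"] = [list(terms[np.argsort(ctfidf[c].toarray().ravel())[::-1][:6]])
                        for c in range(best_k)]
info.to_csv("cluster_info.csv", index=False)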
II. Based on your understanding, answer the following questions:
1. What is the curse of dimensionality? [10 points]
2. Explain in brief the embedding techniques you have used. [10 points]
3. From the two embedding models used, which model performs better? Why do you think so? [10 points]
Submission Instructions:
Submit the following files in a zip file [your_netID.zip]:
a) cluster_info.csv
Format: Cluster Number, Representative documents, Top Keywords
Example of an entry: 1, ['title+abstract from doc 1', 'title+abstract from doc 2', 'title+abstract from doc 3'], ['Keyword1', 'Keyword2', 'Keyword3', 'Keyword4', 'Keyword5', 'Keyword6']
b) clusters.png/jpeg
Clearly show document embeddings in each cluster in 2D space.
[Title, legend, axes]
c) Text file or Word document (your_netID.txt)
Answers to theory questions and problem results.
d) Code files (.py/.ipynb)
Additional Resources:
1. PDF to text converter: https://pypi.org/project/pdfplumber/0.1.2/
2. Apache Spark : https://spark.apache.org/docs/latest/
3. MLlib: https://spark.apache.org/docs/latest/ml-guide.html
4. https://www.analyticsvidhya.com/blog/2020/11/introduction-to-spark-mllib-for-big-data-and-machine-learning/
5. How to start a Spark session: https://medium.com/@dipan.saha/pyspark-made-easy-day-2-execute-pyspark-on-google-colabs-f3e57da946a
HPC ACCESS:
Follow the links below if you need HPC access. You won't need it for this assignment, since it doesn't require any heavy computing resources and can be done in your local Jupyter Notebook or Google Colab, but if you're considering using HPC for your projects, this is a great opportunity to begin and practice. Below, I've included links and information on how to gain access and get started.
NYU HPC Home page: https://sites.google.com/nyu.edu/nyu-hpc/home?authuser=0
Getting an HPC account: https://sites.google.com/nyu.edu/nyu-hpc/accessing-hpc/getting-and-renewing-an-account?authuser=0
You can add Professor Bari’s name for sponsorship. If you have any further doubts, feel free to email me at [email protected] (Kartik Kanotra).