Graphical Models for Complex Health Data (P8124 Fall 2024)
Due date: Friday December 20th, 8pm
FINAL PROJECT DESCRIPTION
This is an open-ended data analysis project in which you can gain some experience applying methods based on graphical models to real data. Several data sources are made available below. You may use your own data only if you obtain permission from Professor Malinsky. (You must also have permission to use the data; publicly available data is OK if it is of high quality. If you want to use data not listed below, please email Prof. Malinsky by 11/29.)
You should define your own analysis objective: what do you want to learn or estimate from this data? What scientific or inferential question do you want to address? You can use whatever methods you like, as long as they are related to the graphical methods we discuss in the course. There is a bit of leeway here: you need not limit yourself to methods we’ve directly discussed (e.g., you may use a graphical structure learning algorithm we have not mentioned, an estimation procedure that we have not explicitly gone over in class, a different Monte Carlo sampling method, etc.), but whatever methods you use should be clearly related to the course material. (In particular: do not simply train a neural net, or use K-means clustering, or some other standard machine learning (ML) method you’ve learned outside this course and call it a day! This would not be acceptable. You may, however, use other ML methods not discussed in class in conjunction with graphical methods.) You may use whatever software is publicly available to do your analysis, or implement things yourself.
You will write a short paper in the style of an ML conference paper. Remember to clearly introduce your problem, data, methods, approach, and results. A good way to organize your paper is into the following sections: Introduction, Data, Methods, Results, Discussion. (You do not need to discuss “related work,” though you are welcome to use already published work as inspiration, as long as you cite it. Pure re-implementations of already published work are to be avoided.) You should justify all your analysis choices: how you chose tuning parameters, why you chose certain parametric forms or model classes, etc. Make sure your work is reproducible, so that someone using the same data could implement your method and achieve the same results.
Requirements:
1) Specify, based on background knowledge, problem design, or learning from the data, at least one graphical model that will serve as the basis of your analysis. It can be any kind of graphical model that you think is appropriate. Then, perform some sort of “inference” task based on the model. Here I mean “inference” broadly construed: test some scientific hypothesis, estimate some scientifically meaningful parameter, use the graph somehow for prediction/classification, perform causal effect estimation (if appropriate), etc.
2) You should compare at least two approaches/methods/settings. That is, you should consider what someone else might do alternatively to your proposal, try it, and compare results. How you do this is up to you: the important thing is that you try more than one thing.
3) Write a paper using the formatting guidelines of the NeurIPS conference. Formatting guidelines and a sample LaTeX document/style file can be found here: https://neurips.cc/Conferences/2021/PaperInformation/StyleFiles (Note: do not anonymize your submission!)
4) Minimum length: 4 pages. Maximum length: 8 pages, including all tables and figures (not including references). You may include additional supplementary material if you wish but the grading will be based entirely on the content of the main paper, and supplementary material will probably not be examined at all.
5) A project proposal of approximately 500 words is due December 6th at 8pm (on Courseworks). Describe which data, methods, and software you intend to use, and the goal of your analysis. This does not need to contain all the details, but it should be clear that you have a well-formed idea and a rough plan for executing it. The final project should also be submitted via Courseworks. Late submissions will not be accepted.
Note: there is a list of GM-related software packages in R here: https://cran.r-project.org/web/views/GraphicalModels.html This is of course not a complete list; new packages are added all the time, and many are on GitHub or other sites. However, this list is a good place to start.
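As an illustration of requirements 1 and 2, here is a minimal sketch (not a recommended analysis) that learns two undirected graphs from the same data and compares them. It assumes roughly Gaussian data with more samples than variables. One graph thresholds marginal correlations, the other thresholds partial correlations obtained from the inverse sample covariance. The function names, the threshold of 0.1, and the synthetic chain X0 -> X1 -> X2 are all illustrative assumptions.

```python
import numpy as np

def correlation_graph(X, threshold):
    """Edge i-j when the absolute marginal correlation exceeds threshold."""
    corr = np.corrcoef(X, rowvar=False)
    adj = np.abs(corr) > threshold
    np.fill_diagonal(adj, False)
    return adj

def partial_correlation_graph(X, threshold):
    """Edge i-j when the absolute partial correlation exceeds threshold.

    Uses the inverse sample covariance, so it needs n_samples > n_variables.
    """
    precision = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)  # rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj)
    adj = np.abs(pcor) > threshold
    np.fill_diagonal(adj, False)
    return adj

def edge_disagreements(adj_a, adj_b):
    """Number of undirected edges present in exactly one of the two graphs."""
    upper = np.triu_indices_from(adj_a, k=1)
    return int(np.sum(adj_a[upper] != adj_b[upper]))

# Synthetic data from the chain X0 -> X1 -> X2 (illustrative only).
rng = np.random.default_rng(0)
n = 5000
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
X = np.column_stack([x0, x1, x2])

adj_marginal = correlation_graph(X, threshold=0.1)
adj_partial = partial_correlation_graph(X, threshold=0.1)
print("marginal-correlation edges:", int(np.triu(adj_marginal, k=1).sum()))
print("partial-correlation edges: ", int(np.triu(adj_partial, k=1).sum()))
print("edges that differ:         ", edge_disagreements(adj_marginal, adj_partial))
```

In this toy example the marginal-correlation graph connects X0 and X2, because they are correlated through X1. The partial-correlation graph omits that edge, because X0 and X2 are conditionally independent given X1. In a real project the threshold (or any regularization parameter) would need to be justified, as requested above.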
AVAILABLE DATA
1) fMRI dataset. Download here: https://rutgers.box.com/s/imgbdaqhzlkbunf52ia8xfmiu97b05xh
This data set contains fMRI brain scan results for individuals with Autism Spectrum Disorder (ASD) as well as “neurotypical” controls. It was taken from the Autism Brain Imaging Data Exchange (ABIDE):
http://fcon_1000.projects.nitrc.org/indi/abide/abide_I.html
Specifically, we have the Carnegie Mellon University dataset, where the age range of subjects is between 19 and 40. The preprocessing pipeline used is NIAK with 160 Regions of Interest (ROIs):
http://preprocessed-connectomes-project.org/abide/Pipelines.html
On the above website you can see the preprocessing steps performed by NIAK (last column in the comparison tables). The last section of the website describes the parcellation used to obtain the ROIs: “Dosenbach 160.”
There are two diagnostic categories of patients: 14 ASD individuals and 13 controls. The individuals all have the same number of ROIs/variables (columns), but potentially different numbers of samples (rows), because some samples are dropped by the data preprocessing to remove artifacts of head motion (akin to outliers), etc. The data was sampled every 2 seconds, and the size of the voxels is 3mm x 3mm x 3mm. There is a file called “phenotypic_CMU.csv” which has some metadata on the individuals in the data set, including which diagnostic category each individual belongs to (1=ASD, 2=control). Note: due to a bug there is an extra ROI (column 161) in this data, which you should remove before analysis. Also, the files are really .csv files, but the file extension was dropped. You might need to manually rename each data file as “CMU a 005[….]1D.csv” so that your computer has an easier time opening and reading the files.
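The cleanup step described above (dropping the extra column 161) might be sketched as follows. This is only a sketch under stated assumptions: it assumes each subject file is comma-delimited numeric text with no header row, which you should verify against the actual files. The function name `load_subject` is hypothetical, and the in-memory file below is a stand-in for a real data file.

```python
import io
import numpy as np

N_ROIS = 160  # Dosenbach 160 parcellation, per the data description

def load_subject(path_or_file):
    """Load one subject's time series (rows = time points, columns = ROIs)
    and keep only the first 160 columns, dropping the spurious column 161.

    Assumption: comma-delimited numeric values, no header row.
    """
    data = np.loadtxt(path_or_file, delimiter=",", ndmin=2)
    if data.shape[1] < N_ROIS:
        raise ValueError(f"expected at least {N_ROIS} columns, got {data.shape[1]}")
    return data[:, :N_ROIS]

# Stand-in for a real subject file: 3 time points, 161 columns.
fake_file = io.StringIO(
    "\n".join(",".join(str(r * 1000 + c) for c in range(161)) for r in range(3))
)
ts = load_subject(fake_file)
print(ts.shape)  # (3, 160)
```

Because `np.loadtxt` accepts a path as well as an open file object, the same function can be pointed at a data file on disk without renaming it first.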
Alternative: the fMRI data from the Dajani et al. (2017) paper is available here: https://github.com/cheninstitutecaltech/Caltech_DATASAI_Neuroscience_23/blob/main/07_20_23_day9_causal_modeling/code/solutions/exercise3.ipynb (instructions described in a Python notebook)
2) Genetics data. The data used by Wang et al. (2016) in their “FastGGM” paper is available here: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-1425/
Another source of genetics data is here: https://jhubiostatistics.shinyapps.io/recount/ Note: Use the “TCGA” data and focus on a specific tissue (e.g., “lung”).
Note: the TCGA data is relatively “clean” RNA-seq data which is easy to download (instructions on the above website). However, it is not very easy to understand if you don’t already have some familiarity with data of this type. So, if you have no experience with such data, I would advise against using it.
3) X-ray image data. The ChestX-ray8 dataset from NIH contains >100K chest x-ray images of >30K unique patients, along with radiologist labels to indicate 14 common pathologies of the thorax. The data (both raw images and labels/other metadata) are available here: https://nihcc.app.box.com/v/ChestXray-NIHCC
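If you want to model dependencies among the pathology labels themselves, one possible first step is to turn the per-image labels into a binary indicator matrix with one column per pathology. The sketch below assumes each image's labels arrive as a single string with pathologies separated by "|" and with "No Finding" marking images without any pathology. That format is an assumption to check against the metadata file you download, and the function name `labels_to_indicators` and the example strings are illustrative.

```python
import numpy as np

def labels_to_indicators(label_strings, sep="|", ignore=("No Finding",)):
    """Convert label strings into (names, Y), where Y[i, j] = 1 if image i
    carries pathology names[j].

    Assumptions: labels within a string are separated by `sep`, and entries
    listed in `ignore` are placeholders rather than pathologies.
    """
    names = sorted(
        {lab for s in label_strings for lab in s.split(sep) if lab not in ignore}
    )
    index = {name: j for j, name in enumerate(names)}
    Y = np.zeros((len(label_strings), len(names)), dtype=int)
    for i, s in enumerate(label_strings):
        for lab in s.split(sep):
            if lab in index:
                Y[i, index[lab]] = 1
    return names, Y

# Illustrative label strings (not taken from the real data).
example = ["Cardiomegaly|Effusion", "No Finding", "Effusion"]
names, Y = labels_to_indicators(example)
print(names)  # ['Cardiomegaly', 'Effusion']
print(Y)
```

The resulting matrix could then serve as input to a structure learning method for binary variables.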
GRADING RUBRIC
Meeting the basic requirements: 40 points
Clearly stated objectives: 10 points
Appropriateness of methods for the stated task(s): 10 points
Adequate description of the methods used (including all modeling choices, any tuning parameters, parametric forms, etc): 20 points
Informative presentation of the results: 10 points
Clear and understandable writing: 10 points
(Creativity/novelty: up to 5 points extra)
Total: 100 points