Submission materials
Please submit the following to make grading simpler:
-A single zip file (called LastName_FirstName_A4.zip) containing:
oA report (word document or pdf) in scientific style describing the analysis you have done (Brief introduction, Methods, Results, Discussion and any References), including embedded images, and chunks of code showing how each part was produced. Note, it is not expected that you write a lot of text here about the biology of the analysis, it is more important to describe what analysis you did, and the results you generated.
oThe output CSV files from steps 1 & 2
oYour code and any additional plots you have produced in pdf format (i.e. as supplementary information), which don’t easily fit into a report format (non-compulsory).
oDo not include the input files in the zip file (to save space) – so please setup your code to read the files using the names of the original files, so we can run some tests easily.
Background and materials
On VITAL, you will find two Excel sheets, pertaining to a gene expression (microarray study) investigating whether there are changes in expression of genes within individuals suffering from autism spectrum disorders, compared with healthy controls: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7329
I have downloaded the quantitative values (matrix sheet in tab-separated text from here: ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE7nnn/GSE7329/matrix/), and opened it in Excel. I saved it as Excel xlsx format and put on VITAL, but you can work with the original data if you prefer, or save the Excel version as csv for data analysis.
The first 76 rows are various meta-data describing what is contained within the file. When you read this into python using pandas, you can tell it to ignore these rows (starting with !), using the comments keyword (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
Figure 1 Data contained with the gene expression matrix file.
The genes measured by microarray in column 1 are only identified by an integer code. To work out which genes these correspond to, you need data about the microarray used, which can be downloaded from here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL1708 (Download full table – “GPL1708-20418.txt”). I have also done this and then loaded the file into Excel, and put on VITAL if you prefer to use this version (PlatformIDs.xlsx)
Figure 2 Platform IDs to cross-reference from gene expression data.
Assignment instructions:
Part 1 – Calculate differential expression for genes (40%)
You are going to process the gene expression data, using Pandas in python.
Read the two sheets in (from text files, csv files or direct from Excel based on your own preference)
You should then make a single data frame, containing the result of combining the two sheets using the ID and IDREF columns, noting that there are a few IDs not present in the data sheet, which need to be handled. The new data frame should have the same number of rows as the data sheet (~44000), but around 50 columns (32 from GSE7329_series_matrix.csv, 19 from PlatformIDs.csv)
First, you need find out which genes are differentially expressed:
oRow 48 of the matrix sheet tells us about how we can group samples:
Group 1: Autism with a chromosomal duplication (columns 2-8 in Excel)
Group 2: Autism with fragile X mutation (columns 9-16 in Excel)
Control group: The rest of the columns
oGene expression values are shown as a log ratio between two channels, where the numerator is the value from the sample, the denominator is a pooled sample from 14 controls – then converted to a log2 scale
oTo calculate p-values for each gene, within each of the three groups, you should calculate a one-sample t-test, with an expected population mean of zero. The following code will help
import numpy as np
from scipy import stats
log_ratios = np.array([0.7,0.8,1.1])
ttest = stats.ttest_1samp(log_ratios,0.0)
print(str(ttest.pvalue))
For cases where every sample has a value of 0.0, you will get an error. In these cases, you should write code to set the p-value to 1
Within each of the three groups, now calculate the differentially expressed gene set, with p-value < 0.05 and where gene expression (mean of log2 values from the file) > 0.5 or <-0.5.
Add a Boolean (true/false) column to the data frame for each group (three new columns) for whether it has passed the threshold
Write out a CSV file containing the following columns only and one row per input ID (gene):
oNAME,GB_ACC,GENE_SYMBOL,Group1_pvalue, Group2_pvalue,Control_pvalue
Part 2 – Analysis and visualisation of results (40%)
You should produce a “volcano plot” for each of the three comparisons. This is a scatter plot (one point per gene) where the x-axis is log2 expression values and the y-axis is the -10* log10 (p-value). You can find plenty of examples online of a good volcano plot. For top marks, you could colour significantly changing genes in a different colour, and/or add gene names as labels to the genes with most significant p-values.
First find the genes differentially expressed (DE) in the both of the autism groups, but not in the control groups. To demonstrate you can do this, write out this sub-set of data to a new CSV file (same columns as above)
Repeat for the genes differentially expressed in either of the autism groups but not in the control groups
Add a table to your report showing summary counts of how many genes are in each of these categories.
For these two subsets of data, make some plots of the expression values across replicates for the significant genes e.g. as scatter plots, box plots, bar charts or line plots (your choice of what best represents the data)
One of the categories of data in the PlatformIDs sheet is GO annotations – see if you can make a bar chart to show GO categories that DE genes (from three groups) are matched against
oThis is a method of summarising the functions of your changing genes. You could just do counts of genes in each category. For top marks, you could attempt to calculate enrichment factors for GO categories. EF = (a / b) / (A/B)
a = number of DE genes mapping to a given category
b = total DE genes
A = total genes mapping to a given category
B = total genes
oEF counts are usually converted to a log2 scale before plotting
oSave the result as a figure to be embedded in your report
Part 3 – Advanced analysis options (20%) – not for the faint hearted!
In the paper from which the data set originates (https://academic.oup.com/hmg/article-lookup/doi/10.1093/hmg/ddm116, Figure 1), they do some clustering on the data, to produce a heat map and a PCA plot. These can be done using pandas from first principles (although it is certainly not as easy as doing this in R), or using other libraries.
For bonus marks on this assignment, see if you can do any advanced analysis on the data to cluster genes based on their expression profile, produce a heat map and/or a PCA plot. You might also choose to use RandomForest as demonstrated in Lecture 9 to search for sets of genes that can classify samples. If you try this, you may need to a) transpose your data frame, so that genes are columns, rather than rows, b) select a small number genes as features for RandomForest. You could select only those genes that are already differentially expressed (Feature Selection is a whole data science topic in its own right and beyond the scope of what we teach here).
Important note – there is code or libraries to be found online that can do these things. If you use code online, this is okay but make sure you reference it carefully, and you will get more marks if you demonstrate you understand what you are doing rather than just copy-paste some code from elsewhere.
版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。