
Date: 2024-08-27 04:13

Assignment 2

BS6202

Attached to this assignment are data on the gene expression profiles of lymphoblastoid cells.

Dataset Description

1. “data.csv” – Gene expression profiles with rows representing the genes and columns the samples.

2. “meta_data.csv” – Metadata corresponding to the gene expression profiles, with rows representing the samples and columns the various clinical attributes such as age, treatment status, etc.

Task 1: Cluster the samples using the gene expression profile and evaluate the goodness of your clustering. Also, describe the rationale behind choosing a specific clustering algorithm.

We first apply PCA and then cluster the samples with K-means. The data.csv file contains the gene expression values for each person sampled. PCA is principal component analysis. It reduces a large data set to fewer dimensions while preserving its main patterns and trends, which is needed for data this complex. K-means is an algorithm that groups unlabeled data into clusters. If we use Python to produce the diagrams, there should be two plots, each with PC1 on the x-axis and PC2 on the y-axis. The samples are colored according to the cluster they are assigned to. The k-means value obtained is around 0.05. PC1 ranges from almost -40 to 80 on the x-axis and PC2 ranges from -40 to 100 on the y-axis.
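The PCA-then-K-means approach described above, together with one possible measure of clustering quality, could be sketched as follows. This is a minimal illustration and not the analysis behind the numbers quoted above: the expression matrix is synthetic (two made-up groups of 20 samples), the helper names `pca_scores`, `kmeans` and `silhouette` are hypothetical, and the silhouette score is an assumed choice of goodness measure that the answer itself does not name.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project samples (rows of X) onto the top principal components."""
    Xc = X - X.mean(axis=0)                       # centre each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)         # variance ratio per component
    return scores, explained[:n_components]

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)             # nearest centroid per sample
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

def silhouette(X, labels):
    """Mean silhouette score; assumes at least two non-empty clusters."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    values = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum() - 1                   # exclude the sample itself
        if n_same == 0:
            values.append(0.0)
            continue
        a = D[i, same].sum() / n_same             # mean distance within own cluster
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        values.append((b - a) / max(a, b))
    return float(np.mean(values))

# Synthetic stand-in for data.csv: genes in rows, samples in columns.
rng = np.random.default_rng(42)
n_genes = 50
group_a = rng.normal(0.0, 1.0, size=(n_genes, 20))
group_b = rng.normal(3.0, 1.0, size=(n_genes, 20))
expr = np.hstack([group_a, group_b])

X = expr.T                                        # samples must be rows for PCA
scores, explained = pca_scores(X, n_components=2)
labels, _ = kmeans(scores, k=2)
sil = silhouette(scores, labels)
print("explained variance ratio:", explained)
print("silhouette score:", sil)
```

With the real files, the matrix would be read from data.csv and transposed in the same way, since the genes are in rows there. A silhouette score close to 1 suggests well-separated clusters, while a value near 0 suggests overlapping ones.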

Task 2: Create a predictive model to predict “sex” using the given gene expression profile and evaluate your predictive model. Also, describe the rationale behind choosing a specific predictive algorithm.

For Task 2, the data sets also contain personal information such as sex. We again apply PCA first and then K-means. The data.csv file contains the gene expression values for each person sampled. PCA (principal component analysis) reduces a large data set to fewer dimensions while preserving its main patterns and trends, which is needed for data this complex, and K-means groups the unlabeled data into clusters. I used Python to read the gene expression data and calculated the average values of the genes for each sex. We can then check which sex's average values a sample is closer to and assign that sex to the sample. In my diagram, PC1 on the x-axis accounts for around 80% of the variance and PC2 on the y-axis for around 8%.
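The idea of comparing each sample with the per-sex average expression values corresponds to a nearest-centroid classifier, which could be sketched as below. This is a minimal sketch under stated assumptions: the data are synthetic, the three shifted "sex-linked" genes are invented for illustration, the function names `fit_centroids` and `predict` are hypothetical, and hold-out accuracy is an assumed choice of evaluation.

```python
import numpy as np

def fit_centroids(X, y):
    """Average expression profile (centroid) for each class label."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(X, classes, centroids):
    """Assign each sample to the class whose centroid is closest."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

# Synthetic stand-in for the real data: samples in rows, genes in columns.
rng = np.random.default_rng(0)
n_samples, n_genes = 60, 30
sex = np.array(["F", "M"] * (n_samples // 2))
X = rng.normal(0.0, 1.0, size=(n_samples, n_genes))
X[sex == "M", :3] += 4.0          # three hypothetical sex-linked genes

# Hold out a third of the samples so the model is scored on unseen data.
idx = rng.permutation(n_samples)
train, test = idx[:40], idx[40:]

classes, centroids = fit_centroids(X[train], sex[train])
pred = predict(X[test], classes, centroids)
accuracy = float(np.mean(pred == sex[test]))
print("hold-out accuracy:", accuracy)
```

Scoring on held-out samples matters here because a classifier evaluated on the samples used to compute the averages would look better than it really is. With the real data, the labels would come from the “sex” column of meta_data.csv.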

You may perform the above tasks in your groups using a variety of methods and strategies. However, each person is to take this preliminary analysis and develop and refine it further. Write it up as a short 2-4 page report and submit it individually.

 

 

 

 

