联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp

您当前位置:首页 >> Algorithm 算法作业Algorithm 算法作业

日期:2019-04-30 10:16

Introduction

We are interested in the segmentation of a sequence of DNA over a genome. A sequence of DNA is

composed of four bases “A”, “C”, “G” and “T”. The objective is to automatically segment the DNA

sequence into regions that are AT-rich, and regions that are CG-rich. The data come from the R

package seqinr.

#install.packages("seqinr") # if needed

library("seqinr")

ff <- system.file("sequences/ECOUNC.fsa", package = "seqinr")

seq <- read.fasta(ff)

dnaseq = seq$ECOUNC

dnaseq[1:50] # DNA sequence

## [1] "a" "a" "a" "g" "c" "a" "a" "a" "t" "a" "a" "a" "a" "t" "t" "t" "a"

## [18] "a" "t" "t" "t" "t" "t" "a" "t" "c" "a" "a" "a" "a" "a" "a" "a" "t"

## [35] "c" "a" "t" "a" "a" "a" "a" "a" "a" "t" "t" "g" "a" "c" "c" "g"

# Convert letters to numbers

y =c()

y[which(dnaseq=="a")] = 1

y[which(dnaseq=="c")] = 2

y[which(dnaseq=="g")] = 3

y[which(dnaseq=="t")] = 4

y[1:50]

## [1] 1 1 1 3 2 1 1 1 4 1 1 1 1 4 4 4 1 1 4 4 4 4 4 1 4 2 1 1 1 1 1 1 1 4 2

## [36] 1 4 1 1 1 1 1 1 4 4 3 1 2 2 3

n = length(y)

plot(y[1:1500],xlab="Genome Index", ylab="Base", , yaxt="n")

ytick<-seq(1, 4, by=1)

axis(side=2, at=ytick, labels = FALSE)

text(par("usr")[1], ytick,

labels = c('A','C', 'G', 'T'), srt = 45, pos = 2, xpd = TRUE)

0 500 1000 1500

Genome Index

We consider a hidden Markov model (X1, Y1,...,Xn, Yn) where (Y1,...,Yn) is the DNA sequence,

with Yt 2 {1, 2, 3, 4} (1 = A, 2 = C, 3 = G, 4 = T) and Xt 2 {1, 2} are the hidden states, taking value

1 (AT-rich) or 2 (CG-rich). We define the state transition probabilities as

(1)

where p = 0.995 and uniform initial probability mass function.

The emission probabilities are given by Pr(Yt = j | Xt = i) = Bi,j where

B = 0.3 0.2 0.2 0.3

0.25 0.25 0.3 0.2. (2)

1 Exercise 1

(a) Use the viterbi R function provided in the lecture notes to calculate the maximum a posteriori

(MAP) estimate of the hidden state sequence

xb1:n = arg max x1:n2{1,2}n Pr(X1:n = x1:n|y1:n).

Check the numerical value of the message mt within that function. Discuss any numerical issue

arising. Propose and implement a modification of that R function to resolve this numerical issue,

and report the obtained MAP segmentation.

(b) Run the R function alpha recursion given in the lecture notes on the dataset and plot the filtering

probability mass function Pr(Xt = k|y1:t) over the genome index t for each class k = 1, 2. Again,

what numerical issue arise with this implementation?

(c) Create a novel function alpha2 recursion which takes the same inputs (y,mu,A,B) and returns the

filtering pmfs at time t = 1,...,n, using the following two-steps recursion, for t= 2,...,nt(xt) := p(xt, yt|y1:t1)= p(yt|xt)X

xt1

p(xt|xt1)p(xt1|y1:t1)

p(xt|y1:t) = t(xt)

with 1(x1) = p(y1|x1)p(x1) and p(x1|y1) = P1(x1)

x 1(x) .

(d) Run the function alpha2 recursion on the data, and plot the resulting filtering pmf over the

genome.

2 Exercise 2

(a) Obtain the MAP segmentation for di?erent values of p. Discuss how the segmentation relates

to the value of this parameter.

(b) Noting that

log(p(y1:n)) = log(p(y1)) +Xn

modify the function alpha2 recursion so that it also returns the loglikelihood.

(c) Using a grid-search approach, find the approximate value of p that maximizes the loglikelihood.

Briefly discuss any alternative approach that you could use to find the maximum likelihood

estimate of p (you do not need to implement this alternative approach).


版权所有:编程辅导网 2021 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp