\documentclass{article}
\begin{document}

\noindent 
{\bf 18.417 - Introduction to Computational Molecular Biology}
\hfill{\bf PS 1}\\
{\bf Bonnie Berger and Manolis Kamvysselis} \hfill September 6, 2001\\
\noindent

\begin{center}
{\Large \bf Problem Set 1}\\
{\large \bf Due Date: Thursday, September 20}\\
\end{center}
\vspace{0.25in}

In this and subsequent problem sets, you will need to run some commands
on Athena from the course locker, which you can get access to with the
command
%
\[{\tt athena\%\ add\ 18.417} \]
%
If you don't want to type this every time you work on 18.417, put the
line
%
\[{\tt add\ 18.417}\]
%
followed by a newline in your {\tt \~{}/.environment} file.

\begin{enumerate}
\vspace{0.25in}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item{
%
Suppose we sequenced a small bacterium and found the following really
short gene: {\tt ATGGCAAGAAGCGCAACAACGGCGTGTAAGAGTTAA}

Hand-translate the sequence, assuming the bacterium uses the same
genetic code as we do ({\it Hint:} ``Why can't we all just \ldots {\it
get along?}'')
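
A hand translation can be checked with a short script.  The following is
a minimal sketch in Python~3, not part of the course code: the names
{\tt CODON\_TABLE} and {\tt translate} are made up for this
illustration, and the table assumes the standard genetic code, with
{\tt *} standing for a stop codon.

\begin{verbatim}
```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons of the standard genetic code, listed in
# TCAG order (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate DNA codon by codon, ignoring a trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

print(translate("ATGTTTTAA"))  # Met, Phe, stop: MF*
```
\end{verbatim}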

There are 64 codons ($4^3$), of which 61 encode amino acids.
However, the standard genetic code specifies only 20 amino acids, so
many amino acids can be encoded in more than one way.  Write the most
divergent sequence you can construct that translates to the same amino
acid sequence, using the genetic code in reverse.  What percent identity
do you find at the nucleotide level?  At the amino acid level?
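
Percent identity between two sequences of equal length is the fraction
of positions at which they agree.  As a rough sketch (the function name
{\tt percent\_identity} is hypothetical, and the two six-base strings
are made-up examples that both encode Ala-Leu under the standard code):

\begin{verbatim}
```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)

print(percent_identity("GCATTA", "GCGCTG"))  # 3 of 6 positions agree: 50.0
```
\end{verbatim}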

How many possible sequences translate to the same protein?  R.~A.~Bowen
has written a reverse-translator that you may find helpful for this.
You can use it through the web at:
%
\[{\tt http://arbl.cvmbs.colostate.edu/molkit/rtranslate/}\]
%
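
The count can also be reasoned out directly: each residue can be encoded
by any of its synonymous codons, so the number of encodings is the
product of the per-residue codon counts.  A minimal sketch in Python~3,
assuming the standard genetic code (the names {\tt DEGENERACY} and
{\tt count\_encodings} are invented for this example; whether to include
the stop codon is up to the caller):

\begin{verbatim}
```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Number of codons that encode each amino acid (and the stop signal '*').
DEGENERACY = Counter(CODON_TABLE.values())

def count_encodings(protein):
    """Number of distinct DNA sequences that translate to `protein`."""
    total = 1
    for aa in protein:
        if aa not in DEGENERACY:
            raise ValueError("unknown residue: %r" % aa)
        total *= DEGENERACY[aa]
    return total

print(count_encodings("MW"))   # Met and Trp have one codon each: 1
print(count_encodings("AL*"))  # 4 * 6 * 3 = 72
```
\end{verbatim}
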
(Optional) Why does the last nucleotide in a codon triplet have very
little influence on the amino acid residue the codon translates to?

}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item{
%
Score the alignment
%
\[
\begin{array}{l}
\mbox{\tt GA-CGGATTAG}\\
\mbox{\tt GATCGGAATAG}
\end{array}
\]

Give a match a score of $+1$, a mismatch $0$, and a gap $-1$.
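
Scoring a fixed alignment just means adding up one score per column.
The sketch below illustrates this on a made-up pair of rows; the
function name {\tt score\_alignment} and its keyword arguments are
assumptions of this example, with defaults set to the scores stated
above.

\begin{verbatim}
```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Sum column scores of a gapped pairwise alignment ('-' is a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

print(score_alignment("AC-GT", "ACTGA"))  # 3 matches, 1 gap, 1 mismatch: 2
```
\end{verbatim}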

}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item{
%
Find the optimal global alignment (and the resulting score)
between GAGC and CCG, with the scoring system used in class (i.e., $+1$
for a match, $-1$ for a mismatch, and $-2$ for a gap).
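
For checking a hand-computed table, global alignment by dynamic
programming (the Needleman-Wunsch recurrence) can be sketched as
follows.  This is an illustrative Python~3 version, not the course's
{\tt alignment.py}; the name {\tt global\_align} is made up, and the
traceback returns only one of possibly several optimal alignments.

\begin{verbatim}
```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # F[i][j] is the best score for aligning a[:i] with b[:j].
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Trace back from the bottom-right corner.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return F[len(a)][len(b)], "".join(out_a[::-1]), "".join(out_b[::-1])

print(global_align("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```
\end{verbatim}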

}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item{
%
Find the best local alignment between ATACTCTCCTAAG and
GACTCGTAACGTAT,  with the scoring system used in class. Also turn in the
entire scoring matrix which the dynamic programming algorithm would
compute (writing this out should take less than 15 minutes). If you need
to break ties between local alignments with the same score, choose the
longer subsequences.
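
A hand-built local alignment matrix can be compared against a small
program.  The sketch below fills in a Smith-Waterman matrix in Python~3,
assuming the scores from the previous problem ($+1$, $-1$, $-2$); the
names {\tt local\_matrix} and {\tt best\_local\_score} are hypothetical,
the input strings are a toy example, and tie-breaking between equally
good alignments is not handled.

\begin{verbatim}
```python
def local_matrix(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman scoring matrix; every entry is at least 0."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
    return H

def best_local_score(a, b):
    """Largest entry of the matrix, i.e. the best local alignment score."""
    return max(max(row) for row in local_matrix(a, b))

for row in local_matrix("TACG", "ACGA"):
    print(" ".join("%2d" % v for v in row))
print(best_local_score("TACG", "ACGA"))  # shared substring ACG: 3
```
\end{verbatim}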

}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item{
{\bf Getting genomes:} (Optional)
%
This is just an example to show you how to get access to the genomes
yourself, if you want to.  Go to the webpage
%
{\tt http://www.ncbi.nlm.nih.gov/Entrez/} using Netscape or some other
browser.  Click on the ``Genome'' link.  In the search field at the top
of the resulting page, enter ``GAL4'' and hit return.  This will
take you to a page with a list of matches.  Click on the second match, and
then click on the link {\tt NC\_001148}.

}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item{
%
{\bf Playing with genomes:} Once you've {\tt add}'ed the {\tt 18.417}
locker, run the command
%
{\tt idle -e /mit/18.417/share/problem\_code/ps1.py \&}.  This will
bring up a Python file that you can run by holding down the control key
and pressing the F5 key.  It counts how many times each nucleotide is
used in the genes on the first chromosome of baker's yeast,
{\it S.~cerevisiae}.  It uses the functions in
%
{\tt /mit/18.417/share/problem\_code/Sequence.py}.

If you know the name of a gene on this chromosome, you can look at it
with commands like these:

{\tt 
\begin{verbatim}
print Chromosome.genes['FUN26'].gene
print Chromosome.genes['FUN26'].protein
print Chromosome.genes['FUN26'].CDS().data
print Chromosome.genes['FUN26'].CDS().reverse_complement().data
print Sequence.translate(Chromosome.genes['FUN26'].CDS())
\end{verbatim}
}

}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item{
%
Write a script similar to the {\tt ps1.py} script that counts how often
each codon is used in the genes on the first chromosome.  ({\it Hint:}
For each coding sequence, instead of counting individual nucleotides,
modify your code to count triplets of nucleotides (codons).)  Now group
the 64 possible codons by the amino acid they translate to.  What do you
notice?  Is {\it S.~cerevisiae} indifferent to which codon it uses to
encode each amino acid?  Extra points for noticing a correlation between
the codon preferences you discover and the frequency of G's and C's that
you computed in the previous question. :)
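
The counting step in the hint might look roughly like the sketch below.
It is written in Python~3 and works on plain strings rather than on the
course's {\tt Chromosome} object, whose interface is not assumed here;
the function name {\tt count\_codons} and the two toy sequences are made
up, and grouping the counts by amino acid is left out.

\begin{verbatim}
```python
from collections import Counter

def count_codons(coding_sequences):
    """Tally in-frame codons over an iterable of coding-sequence strings."""
    counts = Counter()
    for cds in coding_sequences:
        cds = cds.upper()
        usable = len(cds) - len(cds) % 3  # drop a trailing partial codon
        counts.update(cds[i:i + 3] for i in range(0, usable, 3))
    return counts

# Two made-up toy sequences standing in for real genes.
toy = ["ATGGCTGCTTAA", "ATGGCAGCTTGA"]
for codon, n in sorted(count_codons(toy).items()):
    print("%s %d" % (codon, n))
```
\end{verbatim}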

}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item{
%
{\bf Alignment implementations:} Have a look at the program in \\
%
{\tt/mit/18.417/share/problem\_code/alignment.py}.  The code in \\
%
{\tt /mit/18.417/share/problem\_code/tests/alignment.py} shows how to
use it.  Currently this gives the best {\it global} alignment between
two sequences.  Modify it so that it gives the best {\it local}
alignment.

(Optional) Genes in different species that are descended from a common
ancestral gene are called {\it orthologous.}  Find two genes that
probably come from a common ancestor (e.g., a human and a mouse gene
that generate proteins with identical functions) and use your program to
align their coding sequences.  Within the aligned codon triplets, which
of the three
nucleotides is the least likely to match?  If you didn't already know
the frame in which the sequences are translated, how might you guess it
from the alignments?
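
One way to look at this is to tally mismatches separately for each of
the three codon positions.  The sketch below does so in Python~3 under
simplifying assumptions: the two sequences are already aligned without
gaps, have equal length, and start in frame.  The function name and the
two nine-base strings are made up for illustration.

\begin{verbatim}
```python
def mismatches_by_codon_position(cds_a, cds_b):
    """Count mismatches at codon positions 1, 2 and 3 of two equal-length,
    gap-free, in-frame coding sequences."""
    if len(cds_a) != len(cds_b):
        raise ValueError("sequences must have the same length")
    counts = [0, 0, 0]
    for i, (x, y) in enumerate(zip(cds_a, cds_b)):
        if x != y:
            counts[i % 3] += 1
    return counts

print(mismatches_by_codon_position("GCATTAGGT", "GCGCTGGGC"))  # [1, 0, 3]
```
\end{verbatim}

Running the same tally with the sequences shifted by one or two bases
gives the counts for the other two candidate frames.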

}

\end{enumerate}

\end{document}
