Assignment #4
Due date: 12/4/04
Go to the Pfam database site and
open the page for the globin family.
Download the "seed" alignment file, which contains a multiple alignment
similar to the one in the book. Estimate a profile HMM from this
alignment and use it to scan, as described next, the
UniProt/SWISS-PROT database, which you can download
in FASTA format (28MB).
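As a starting point for the estimation step, here is a minimal sketch of how the match states and their emission probabilities could be derived from an alignment. It is not the required method: the rule "a column is a match state if fewer than half of its entries are gaps", the add-one pseudocount, the function names, and the toy alignment are all assumptions made for illustration. Transition probabilities and insert-state emissions are left out.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GAP_CHARS = "-."


def match_columns(alignment, max_gap_fraction=0.5):
    """Return the indices of columns treated as match states.

    Assumption: a column is a match state when its gap fraction is
    below max_gap_fraction (a common heuristic, not the only choice).
    """
    n_seqs = len(alignment)
    columns = []
    for j in range(len(alignment[0])):
        gaps = sum(1 for seq in alignment if seq[j] in GAP_CHARS)
        if gaps / n_seqs < max_gap_fraction:
            columns.append(j)
    return columns


def match_emissions(alignment, columns, pseudocount=1.0):
    """Estimate emission probabilities for each match column.

    Counts are smoothed by adding `pseudocount` to every amino acid
    (an assumed smoothing scheme), so no probability is zero.
    """
    emissions = []
    for j in columns:
        counts = Counter(
            seq[j].upper() for seq in alignment if seq[j].upper() in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        emissions.append(
            {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}
        )
    return emissions


# Hypothetical toy alignment: seven sequences, ten columns.
toy = [
    "VGA--HAGEY",
    "V----NVDEV",
    "VEA--DVAGH",
    "VKG------D",
    "VYS--TYETS",
    "FNA--NIPKH",
    "IAGADNGAGV",
]

cols = match_columns(toy)
probs = match_emissions(toy, cols)
print(cols)
```

The two mostly-gap columns in the toy alignment are dropped, and the remaining columns become match states. For the real seed file you would first have to parse the downloaded alignment into equal-length strings.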
For each sequence whose log-likelihood ratio (LLR) is greater than 0, look for the longest word
in its title line that contains the string "globin". That word could be
null (no such word), "Globin", or something else. For example, for GLBB_OLIMA, the word is
Hemoglobin:
>GLBB_OLIMA (Q7M419) Hemoglobin, extracellular, major globin chain b
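The word lookup can be sketched as follows. Two things here are assumptions, not part of the assignment text: words are taken to be runs of letters and digits, and the match ignores case (which seems intended, since "Globin" is listed as a possible word). The function name is hypothetical, and the null word is represented by an empty string.

```python
import re


def longest_globin_word(title, needle="globin"):
    """Return the longest word in `title` containing `needle`, or "".

    Assumptions: words are runs of letters/digits, and matching is
    case-insensitive. Ties go to the first such word in the title.
    """
    words = re.findall(r"[A-Za-z0-9]+", title)
    hits = [w for w in words if needle in w.lower()]
    return max(hits, key=len, default="")


title = ">GLBB_OLIMA (Q7M419) Hemoglobin, extracellular, major globin chain b"
print(longest_globin_word(title))
```

Both "Hemoglobin" and "globin" match in this title, and the longer one is kept.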
Finally, compute and output the median LLR of the sequences associated with each word in your
dictionary (including the null word).
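One possible way to organize the final step is to collect the scores per word in a dictionary and take the median of each list. The sketch below assumes you already have (word, LLR) pairs for the sequences that scored above 0. The function name and the scores shown are made up for illustration, and the empty string stands for the null word.

```python
from collections import defaultdict
from statistics import median


def median_score_by_word(scored_words):
    """Group LLR scores by word and return the median for each word."""
    scores = defaultdict(list)
    for word, llr in scored_words:
        scores[word].append(llr)
    return {word: median(values) for word, values in scores.items()}


# Hypothetical (word, LLR) pairs; "" is the null word.
hits = [
    ("Hemoglobin", 120.5),
    ("Hemoglobin", 98.0),
    ("Hemoglobin", 101.2),
    ("Myoglobin", 75.0),
    ("", 3.0),
    ("", 0.5),
]

medians = median_score_by_word(hits)
for word, score in sorted(medians.items()):
    print(repr(word), score)
```

Note that `statistics.median` averages the two middle values when a word has an even number of scores.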