Project 2

BIOL/CMPU 353 - Bioinformatics
Smith and Schwarz
Spring 2012

Assigned: Thu, Feb 9
Due: Tue, Feb 14

DNA: Playing with Regular expressions

This assignment is loosely based on the first assignment. Instead of using index and substring functions, you will use regular expressions (Regex) to produce the sample output.
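As a rough illustration of the difference, the index/substr split into upstream and gene regions can be expressed as a single pattern match with two capture groups. This is only a sketch of one possible approach, not the required solution; the variable names ($sequence, $upstream, $gene) are hypothetical, and the sequence is the one from the sample output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variable names; the sequence is the one used in the sample output.
my $sequence = uc("cgccatataatgctcgtccgcgcccta");

# ^(.*?)    lazily captures everything before the first ATG (the upstream region)
# (ATG.*)$  captures the start codon and everything after it (the gene region)
my ($upstream, $gene) = ("", "");
if ($sequence =~ /^(.*?)(ATG.*)$/) {
    ($upstream, $gene) = ($1, $2);
    print "Upstream sequence is: $upstream\n";
    print "Gene sequence is: $gene\n";
} else {
    # Unlike index(), which returns -1, a failed match simply evaluates to false.
    print "No ATG start codon found\n";
}
```

The lazy quantifier (.*?) matters here: a greedy (.*) would capture up to the last ATG in the sequence rather than the first.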

It may be easier to start with your Perl script from the first assignment than to start from scratch. Alternatively, you can copy/paste our solution to Project 1, listed below.

Sample Output

A sample output is shown below. Your program’s output need not match it exactly, but you should print the information in the same order, and your answers should agree.

+++++++++ Upstream and Genic Report ++++++++++++++++

Starting sequence is: cgccatataatgctcgtccgcgcccta 
Converted to uppercase: CGCCATATAATGCTCGTCCGCGCCCTA 

Length of starting sequence is: 27 
----------------------------------------------------

Upstream sequence is: CGCCATATA 

Gene sequence is: ATGCTCGTCCGCGCCCTA

Codon 1 = ATG
Codon 2 = CTC
Codon 3 = GTC
----------------------------------------------------

Upstream length (bp): 9
Gene length (bp): 18

----------------------------------------------------
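The codon lines in the sample output could be produced with a global match in list context, which returns every three-base capture in order. The following is a minimal sketch under that assumption; the names $gene and @codons are hypothetical, and the gene string is copied from the sample output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Gene sequence copied from the sample output (hypothetical variable name).
my $gene = "ATGCTCGTCCGCGCCCTA";

# In list context, a /g match returns all captured groups,
# so @codons holds each complete three-base codon, in order.
my @codons = $gene =~ /([ACGT]{3})/g;

# Print only the first three codons, as in the sample output.
for my $i (0 .. 2) {
    last if $i > $#codons;    # stop early if the gene has fewer than three codons
    print "Codon ", $i + 1, " = $codons[$i]\n";
}
```

Because the pattern requires exactly three bases per capture, any leftover one or two bases at the end of a sequence are simply not matched.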

What to Submit:

Your completed program is due on Tue, Feb 14. Submit your program electronically, using the submit353 script.

Starter Code (Solution from Project 1)

If you would like to start by modifying the code from Project 1, here is some help getting started...

#!/usr/bin/perl
use strict;
use warnings;
#================================================================
#
# BIOL/CMPU-353 
# Spring 2012
# Project 1 Solution
#
# Summary: This Perl program isolates the upstream and genic
#          regions of a sequence. A report is printed, a sample
#          of which is shown below:
#
#          (you paste a sample of your program's output here)
#
# Programmer: Marc Smith
#
# Date Last Modified:
# 02/11/2008 -- started program
#
#===============================================================
 
print "+++++++++ Upstream and Genic Report ++++++++++++++++\n\n";
 
my $someSequence; # upstream and start of a gene ...
 
$someSequence = "cgccatataatgctcgtccgcgcccta";
 
print "Starting sequence is: $someSequence \n";
 
# convert all nucleotides to uppercase
$someSequence = uc($someSequence);
print "Converted to uppercase: $someSequence \n\n";
 
my $seqLength = length($someSequence);
print "Length of starting sequence is: $seqLength \n";
 
print "----------------------------------------------------\n\n";
 
# get the position of the start codon "ATG"
my $ATGPosition = index($someSequence, "ATG");
my $codon2Pos = $ATGPosition + 3;
my $codon3Pos = $ATGPosition + 6;
 
# get the second and third codons (the first is the ATG start codon itself)
my $codon2 = substr($someSequence, $codon2Pos, 3);
my $codon3 = substr($someSequence, $codon3Pos, 3);
 
 
print "ATG start codon begins in position (bp) ", 
       $ATGPosition+1, "\n";
print "    followed by codon $codon2 in position (bp) ", 
       $codon2Pos+1, "\n";
print "    followed by codon $codon3 in position (bp) ", 
       $codon3Pos+1, "\n\n";
 
print "----------------------------------------------------\n\n";
 
my $upStream;
$upStream = substr($someSequence, 0, $ATGPosition);
 
print "Upstream sequence is: $upStream \n\n";
 
my $upStreamLength = length($upStream);
 
print "Upstream length (bp): $upStreamLength \n\n";
print "----------------------------------------------------\n\n";
 
my $genicSeq = substr($someSequence, $ATGPosition);
my $genicSeqLen = length($genicSeq);
 
print "Gene sequence is: $genicSeq\n\n";
print "Gene length (bp): $genicSeqLen\n\n";
 
print "----------------------------------------------------\n\n";
 
my $reverseCompSeq = reverse($someSequence);
$reverseCompSeq =~ tr/ACTG/TGAC/;
 
print "Gene + Strand: $someSequence\n\n";
print "Gene - Strand: $reverseCompSeq\n\n";
 
print "----------------------------------------------------\n\n";
 
my $origSeqHilighted = lc($someSequence);
$origSeqHilighted =~ s/atg/ATG/;
print "Original sequence highlighted: $origSeqHilighted \n\n";
 
print "----------------------------------------------------\n\n";
 
my $numA = $upStream =~ tr/A/A/;
my $numT = $upStream =~ tr/T/T/;
 
print "Measures of AT-richness:\n";
print "\tA:\t$numA\n";
print "\tT:\t$numT\n\n";
 
print "----------------------------------------------------\n\n";

You should copy/paste the above code into jEdit, then save it via secure FTP on junior:

  • from the File System Browser window that pops up when you go to save, navigate from your home directory to your course directory (bioinf), then
  • create a new directory named project2, then
  • navigate into your newly-created project directory to save your file, finally
  • save your program and give it a good name.

Once your program works properly, use the submit353 script to submit your program electronically. As a reminder, here’s what to do:

  • from your ssh connection to junior, change directories to your course directory: cd ~/bioinf
  • use the submit353 command to submit your project2 directory (or whatever you named your project directory, if different from project2):

    submit353 project2

courses/cs353-201201/assigns/assign02.txt · Last modified: 2012/02/07 12:42 by mlsmith