A DNA Sequence Class in Perl
Using Perl’s object-oriented and text-manipulation features
By Lincoln Stein
Lincoln develops databases, applications, and user interfaces for the Human Genome Project at Cold Spring Harbor Laboratory in Long Island, NY.
Obtaining this information means pushing around massive amounts of data; estimates quickly run into the terabytes. This in turn requires sophisticated software engineering, fault-tolerant information systems, and rapid application development.
In this article, I describe a Perl library for manipulating DNA and RNA sequences. In the course of examining this library, you’ll see how Perl’s object-oriented features work together to create an elegant API. And hopefully, you’ll learn a little biology as well.
DNA, RNA, and Proteins
The stuff of the genome is deoxyribonucleic acid (DNA), a long thin molecule that is usually compactly coiled into the chromosomes of our cells. DNA consists of four distinct subunits, called “nucleotide bases,” which are repeated across its entire length. The four bases have been assigned the convenient single-letter names A, G, C, and T, abbreviations for their longer chemical names.
In DNA, the nucleotide bases are linked together into long chains that can be written down as an ASCII string. Figure 1(a), for example, is a DNA sequence consisting of 39 nucleotides.
DNA doesn’t usually float around the cell in its single-stranded form. Instead it spends its time in a stable double-stranded form, the famous “double helix.” In the double-stranded form, each nucleotide is paired with another nucleotide. Because of their chemical nature, A always pairs with T, and G pairs with C. Written down in a text representation, the double-stranded form of this short sequence looks like Figure 1(b).
Because the nucleotide bases are paired, they are often referred to as base pairs (bp). I’ve labeled the left end of the top strand 5′ and the right end 3′. On the bottom strand, the numbering of the two ends is reversed. This numbering system is related to the way that DNA is put together chemically. Here, the only significance of this is that it emphasizes that DNA strands are directional. The two strands are often arbitrarily labeled the “plus” and “minus” strands to distinguish them.
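The pairing rules and the opposite direction of the two strands can be sketched in a few lines of Perl. This is only an illustration, not code from the library: the subroutine names complement() and reverse_complement() are assumptions made for this example, and the input is assumed to be an uppercase DNA string.

```perl
use strict;
use warnings;

# Swap each base for its pairing partner: A<->T and G<->C (hypothetical helper).
sub complement {
    my $seq = shift;
    (my $comp = $seq) =~ tr/ACGT/TGCA/;
    return $comp;
}

# Read the opposite strand in its own 5' to 3' direction:
# reverse the string, then complement every base.
sub reverse_complement {
    my $seq = shift;
    my $rev = reverse $seq;    # scalar context reverses the string
    return complement($rev);
}

my $plus = 'ATGTTCCGA';
print "plus   5' ", $plus, " 3'\n";
print "minus  3' ", complement($plus), " 5'\n";
print "minus  5' ", reverse_complement($plus), " 3'\n";
```

The second line of output shows the minus strand as it sits underneath the plus strand; the third shows the same strand written in the conventional 5′ to 3′ direction.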
DNA can do just two things: It can replicate, and it can be transcribed into RNA. The replication process is the key to both cell replication and to propagation of the species. The two strands of DNA unwind like a zipper, and each strand dictates the assembly of its complementary second strand. Schematically, the process looks like Figure 1(c).
More interesting is the transcription and translation process. Along its length, DNA encodes the instructions for many thousands of proteins, everything from the crystalline protein of the eye lens to the enzymes that make up the digestive juices of the gastrointestinal tract. These protein coding regions, separated from each other by large tracts of DNA of unknown function, are in fact genes.
To make a protein from the DNA sequence of a gene, the cell performs two phases of chemical transformation. In the first phase, the gene is transcribed into ribonucleic acid (RNA). RNA is like DNA in many ways, but instead of being double-stranded it usually exists in single-stranded form. In addition, instead of being composed of the four bases A, G, C, and T, RNA has no T, but uses a different nucleotide abbreviated as U.
To transcribe RNA, DNA unwinds just a bit in the region of an activated gene, and the nucleotide sequence of the DNA is read off by enzymes that synthesize an RNA copy of the gene. Sometimes the plus DNA strand is transcribed, and sometimes the minus strand, depending on whether the gene is oriented right to left or left to right.
Represented in text form, an RNA strand looks just like its parent DNA strand except for the substitution of U for every T. Figure 2(a) depicts our example DNA in RNA form.
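In text form, transcription amounts to a single character substitution, which Perl’s tr/// operator handles directly. The sketch below assumes uppercase input; dna_to_rna() and rna_to_dna() are hypothetical names chosen for illustration, not methods of the library.

```perl
use strict;
use warnings;

# Text-level "transcription": replace every T with U (hypothetical helper).
sub dna_to_rna {
    my $dna = shift;
    (my $rna = $dna) =~ tr/T/U/;
    return $rna;
}

# The inverse mapping: replace every U with T.
sub rna_to_dna {
    my $rna = shift;
    (my $dna = $rna) =~ tr/U/T/;
    return $dna;
}

print dna_to_rna('ATGTTCCGA'), "\n";    # prints AUGUUCCGA
print rna_to_dna('AUGUUCCGA'), "\n";    # prints ATGTTCCGA
```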
Unlike DNA (which never leaves the nucleus of the cell), RNA is free to travel through the nuclear envelope into the cellular cytoplasm. Once there, the RNA is translated into a protein. Like RNA and DNA, proteins are also long strands of repeating units. However, instead of there being only four units, proteins are made up of 20 different “amino acid” subunits. Proteins fold into complex structures dictated by the order of their amino acids. The folding determines the protein’s structure and function.
As with the nucleotide bases, biologists use one-letter abbreviations for the amino acids. Protein sequences use the letters A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y. Because there just aren’t enough letters in the Latin alphabet to go around, the protein alphabet overlaps with the nucleotide alphabet, but don’t let that confuse you. An A found in a nucleic acid sequence has nothing to do with the A of a protein sequence.
Because only four RNA bases must dictate the order of 20 amino acids, there is obviously more to protein translation than the simple one-to-one encoding that takes place during transcription. In fact, the protein translation machinery uses a three-letter code to translate RNA into protein. During translation, the RNA is divided into three-letter groups called “codons,” as in Figure 2(b).
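Dividing an RNA string into codons is a natural fit for a global regular expression match. The following sketch uses a hypothetical codons() subroutine and assumes that reading starts at the first character; any leftover one or two bases at the end are simply dropped.

```perl
use strict;
use warnings;

# Split an RNA string into consecutive three-letter codons (hypothetical helper).
# Trailing bases that do not fill a whole codon are ignored.
sub codons {
    my $rna = shift;
    my @codons = $rna =~ /(.{3})/g;
    return @codons;
}

print join(' ', codons('AUGUUCCGAAAAAAG')), "\n";    # prints AUG UUC CGA AAA AAG
```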
The codons are used as a template to synthesize a protein sequence, using a little lookup table that’s hardwired into the biological machinery. AUG becomes the amino acid M, UUC becomes F, CGA becomes R, and so forth. Our example DNA is translated into a 12-amino-acid protein in Figure 2(c).
There are two things to notice in this example. One is that certain amino acids are encoded by several different codons. For example, the amino acid K is encoded by both AAA and AAG. This should be expected from the fact that there are 64 possible codons, and only 20 amino acids for them to encode. The other thing to notice is that certain codons (three in all) don’t encode any amino acids. Instead they are “stop codons,” which tell the translation machinery to stop translating and release the finished protein. Generally, the RNA molecule extends farther to the left and right than the protein it encodes (I’ve glossed over this fact for simplicity of illustration). Like the stop codons, the AUG codon is special because it tells the protein translation machinery with which codon to begin.
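The lookup table, the start codon, and the stop codons can be put together in a short translation routine. This is a minimal sketch under stated assumptions, not the library’s translation method: translate_rna() is a hypothetical name, the table is the standard genetic code written with “*” for the three stop codons, translation is assumed to begin at the first AUG in the string, and unrecognized codons are reported as X.

```perl
use strict;
use warnings;

# Build the 64-entry codon table from the standard genetic code.
# The amino acid string follows the conventional U, C, A, G base ordering.
my @bases  = qw(U C A G);
my $aminos = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
my $i = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon_table{"$b1$b2$b3"} = substr($aminos, $i++, 1);
        }
    }
}

# Translate from the first AUG up to (not including) the first stop codon.
sub translate_rna {
    my $rna   = uc shift;
    my $start = index($rna, 'AUG');
    return '' if $start < 0;              # no start codon, no protein
    my $protein = '';
    for (my $pos = $start; $pos + 3 <= length $rna; $pos += 3) {
        my $aa = $codon_table{ substr($rna, $pos, 3) };
        $aa = 'X' unless defined $aa;     # codon with a non-ACGU character
        last if $aa eq '*';               # stop codon: release the protein
        $protein .= $aa;
    }
    return $protein;
}

print translate_rna('AUGUUCCGAAAAAAGUAA'), "\n";    # prints MFRKK
```

Both AAA and AAG come out as K in this example, and the final UAA ends translation without contributing a letter.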
A Sequence Class Library for Perl
A lot of genome informatics involves splicing, dicing, and processing long strings of DNA sequences. I created a library of Perl routines specialized for dealing with DNA (available electronically; see “Resource Center,” page 5), with a small class hierarchy like Figure 3.
Sequence::Generic is an abstract class that implements a few generic methods that all biological sequences share, such as a method for determining the sequence’s length and a method for concatenating two sequences together. Sequence::Nucleotide is a subclass of Sequence::Generic that adds support for DNA- and RNA-specific operations. One of these new operations is the reverse complementation method, which transforms one strand of DNA into its complement; another is a method to translate RNA into protein.
Sequence::Nucleotide::Subsequence is a descendant of Sequence::Nucleotide. Because the chunks of DNA that need to be analyzed are usually quite long (100,000 bp is not unusual), it’s typical to work with one subregion at a time. A Subsequence represents a subregion of a longer sequence.
The Sequence::Alignment class is a utility class that stores information about how two similar sequences are related. It is useful for figuring out how a smaller sequence fits into a larger one.
For completeness, there should also be a Sequence::Protein class descended from Sequence::Generic, but that was too much to squeeze into this article. Instead of returning a real Sequence::Protein object, the method that translates RNA into proteins just returns a simple character string.
The Sequence::Generic Class
Sequence::Generic (Listing One) defines three methods that are intended to be overridden by child classes: new(), seq(), and type(). The new() method is the object constructor. It does nothing but call the croak() function from the Carp package to abort the program with an error message. This prevents the generic class from being instantiated. The seq() method is a low-level routine that returns the raw sequence information as a text string. This method also croaks, in case Sequence::Generic is subclassed without the seq() method being overridden. The type() method returns a human-readable string describing the type of the sequence, and is intended for debugging work. It’s intended to return something like “DNA,” “RNA,” or “Protein.” In the abstract class, this method returns “Generic Sequence.”
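The abstract-class pattern described here can be sketched as follows. This is not Listing One; it is a minimal reconstruction from the description above. The error message wording, the Demo::Sequence subclass, and its hash-based storage are all assumptions made for illustration.

```perl
use strict;
use warnings;

{
    package Sequence::Generic;    # sketch only, not the article's Listing One
    use Carp;

    # Abstract constructor: refuse to build a bare generic sequence.
    sub new  { croak "Sequence::Generic is abstract; instantiate a subclass" }

    # Subclasses must override seq() to return the raw sequence string.
    sub seq  { croak "seq() must be overridden by a subclass" }

    # Human-readable description, mainly for debugging.
    sub type { return 'Generic Sequence' }
}

{
    package Demo::Sequence;       # hypothetical concrete subclass
    our @ISA = ('Sequence::Generic');

    sub new {
        my ($class, $seq) = @_;
        return bless { seq => $seq }, $class;
    }
    sub seq  { my $self = shift; return $self->{seq}; }
    sub type { return 'DNA' }
}

my $dna = Demo::Sequence->new('GATTACA');
print $dna->type, ': ', $dna->seq, "\n";

# The generic constructor croaks, so wrap the call in eval to observe it.
eval { Sequence::Generic->new('GATTACA') };
print "generic constructor refused: $@" if $@;
```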
The remainder of the methods are generic ones that will work with almost any biological sequence. One of these is length(), which returns the length of the sequence data; see Example 1. By convention, Perl methods are invoked with a reference to the object as the first argument on the subroutine argument list. This method begins with the idiom my $self = shift. The effect of this statement is to shift the object off the argument list and to copy it into a local variable named $self. The method then invokes the object’s seq() method with the Perl method invocation syntax $self->seq and passes the result to the Perl string function length() (this is a normal function call, not a method call). The result is then returned to the caller.
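A length() method along these lines might look like the sketch below. It is a reconstruction from the description, not the article’s Example 1; Demo::Sequence is a hypothetical stand-in class. The built-in is written as CORE::length here, an assumption of this sketch, so that Perl does not have to disambiguate it from the method of the same name.

```perl
use strict;
use warnings;

{
    package Demo::Sequence;    # hypothetical stand-in for a concrete sequence class

    sub new {
        my ($class, $seq) = @_;
        return bless { seq => $seq }, $class;
    }

    sub seq {
        my $self = shift;
        return $self->{seq};
    }

    sub length {
        my $self = shift;                   # the object arrives as the first argument
        return CORE::length($self->seq);    # method call feeding a plain function call
    }
}

my $seq = Demo::Sequence->new('GATTACA');
print $seq->length, "\n";    # prints 7
```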
Another method defined in this file, concatenate() (see Example 2), concatenates two Sequence::Generic objects together or concatenates a Sequence::Generic object with a string, returning a new sequence object as the result.
In addition to its object reference, the method takes two arguments. The first is the new sequence to concatenate to the current one. The second argument is a flag that indicates whether the new sequence is to be prepended (true) or appended (false). concatenate() is usually called via operator overloading, and the Perl overload machinery actually takes care of setting up the two arguments.
The method first checks whether the new sequence is an object by calling the Perl built-in ref(), which returns the class name for objects and an empty string (a false value) for nonreferences. If ref() indicates that the new sequence is an object, concatenate() checks whether it belongs to Sequence::Generic or one of its subclasses by using the built-in isa() method. The __PACKAGE__ token is replaced by Perl with the name of the current package, which avoids having to hardcode the name of the class. If the object is not a subclass, the routine aborts with an error message. Otherwise, it recovers the sequence as a string by calling its seq() method. If the $new_seq argument isn’t an object at all, the method treats it as a string.
The last statement of this method uses the Perl built-in concatenation operator “.” to combine the sequence strings together in the order dictated by the $prepend flag. The concatenated string is passed to the object’s new() constructor to create a new Sequence object, which is returned to the caller. Because concatenate() will be called from a subclass of Sequence::Generic, the new() constructor that gets called will belong to the subclass, not to Sequence::Generic. In Perl there is no strong distinction between constructors and object methods, which may be a source of confusion for C++ and Java programmers.
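The steps just described can be sketched as follows. This is a reconstruction from the prose, not the article’s Example 2. The error messages and the Demo::Sequence subclass are assumptions; in particular, the subclass constructor accepts either a class name or an existing object as its invocant, which is what allows $self->new() to work.

```perl
use strict;
use warnings;

{
    package Sequence::Generic;    # sketch only, not the article's listing
    use Carp;

    sub new { croak "Sequence::Generic is abstract" }
    sub seq { croak "seq() must be overridden" }

    sub concatenate {
        my $self = shift;
        my ($new_seq, $prepend) = @_;
        my $new_str;
        if (ref $new_seq) {
            # An object: insist that it is some kind of Sequence::Generic.
            croak "argument is not a sequence object"
                unless $new_seq->isa(__PACKAGE__);
            $new_str = $new_seq->seq;
        }
        else {
            $new_str = $new_seq;    # a plain string
        }
        my $joined = $prepend ? $new_str . $self->seq
                              : $self->seq . $new_str;
        return $self->new($joined);    # dispatches to the subclass constructor
    }
}

{
    package Demo::Sequence;       # hypothetical concrete subclass
    our @ISA = ('Sequence::Generic');

    sub new {
        my ($proto, $seq) = @_;
        my $class = ref($proto) || $proto;    # works as class or object method
        return bless { seq => $seq }, $class;
    }
    sub seq { my $self = shift; return $self->{seq}; }
}

my $left  = Demo::Sequence->new('AAA');
my $right = Demo::Sequence->new('GGG');
print $left->concatenate('CCC')->seq, "\n";       # prints AAACCC
print $left->concatenate('CCC', 1)->seq, "\n";    # prints CCCAAA
print $left->concatenate($right)->seq, "\n";      # prints AAAGGG
```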
Perl lets you overload many of its built-in operators so that when they are applied to objects they invoke a method call rather than take their default actions. I overload three different operators in the Sequence::Generic class (Example 3). For example, by binding the “.” operator to concatenate(), each of the constructions in Example 4 will work in the natural way.
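To illustrate the binding of “.” to concatenate(), here is a minimal sketch using Perl’s overload pragma. Only the “.” operator is shown, since it is the one named in the text; the other overloaded operators and the expressions of Example 4 are not reproduced, and the Demo::Sequence class and the expressions at the end are assumptions. The overload machinery calls the handler with the object, the other operand, and a flag that is true when the operands were swapped, which lines up with the $prepend argument.

```perl
use strict;
use warnings;

{
    package Sequence::Generic;    # sketch only, not the article's listing
    use Carp;

    sub new { croak "Sequence::Generic is abstract" }
    sub seq { croak "seq() must be overridden" }

    sub concatenate {
        my ($self, $new_seq, $prepend) = @_;
        my $new_str = $new_seq;
        if (ref $new_seq) {
            croak "argument is not a sequence object"
                unless $new_seq->isa(__PACKAGE__);
            $new_str = $new_seq->seq;
        }
        my $joined = $prepend ? $new_str . $self->seq
                              : $self->seq . $new_str;
        return $self->new($joined);
    }

    # Bind the "." operator to concatenate(); subclasses inherit the binding.
    use overload '.' => \&concatenate;
}

{
    package Demo::Sequence;       # hypothetical concrete subclass
    our @ISA = ('Sequence::Generic');

    sub new {
        my ($proto, $seq) = @_;
        my $class = ref($proto) || $proto;
        return bless { seq => $seq }, $class;
    }
    sub seq { my $self = shift; return $self->{seq}; }
}

my $left  = Demo::Sequence->new('AAA');
my $right = Demo::Sequence->new('GGG');

my $appended  = $left . 'CCC';     # object . string
my $prepended = 'CCC' . $left;     # string . object (operands swapped)
my $joined    = $left . $right;    # object . object

print $appended->seq,  "\n";    # prints AAACCC
print $prepended->seq, "\n";    # prints CCCAAA
print $joined->seq,    "\n";    # prints AAAGGG
```

Each expression returns a new sequence object rather than a plain string, so the result can be concatenated again in the same way.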
The Sequence::Nucleotide Class
Sequence::Nucleotide (Listing Two) is a dual-purpose class that represents both DNA and RNA. Because DNA can be transformed into RNA and vice versa simply by exchanging Ts and Us, I store the data as DNA and transform it into an RNA form on demand.
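The store-as-DNA, convert-on-demand idea can be sketched like this. It is not Listing Two: the class name Demo::Nucleotide, the dna() and rna() accessors, and the hash-based storage are hypothetical, chosen only to show the design.

```perl
use strict;
use warnings;

{
    package Demo::Nucleotide;    # hypothetical class, for illustration only

    # Accept either DNA or RNA text, but always store the DNA spelling.
    sub new {
        my ($class, $seq) = @_;
        (my $dna = uc $seq) =~ tr/U/T/;
        return bless { dna => $dna }, $class;
    }

    # The stored form is returned as is.
    sub dna {
        my $self = shift;
        return $self->{dna};
    }

    # The RNA form is computed each time it is requested.
    sub rna {
        my $self = shift;
        (my $rna = $self->{dna}) =~ tr/T/U/;
        return $rna;
    }
}

my $gene = Demo::Nucleotide->new('AUGUUCCGA');
print $gene->dna, "\n";    # prints ATGTTCCGA
print $gene->rna, "\n";    # prints AUGUUCCGA
```

Keeping a single internal representation means every other method only has to handle one alphabet.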