class Bio::FastaFormat
Treats a FASTA formatted entry, such as:
  >id and/or some comments                    <== definition line
  ATGCATGCATGCATGCATGCATGCATGCATGCATGC        <== sequence lines
  ATGCATGCATGCATGCATGCATGCATGCATGCATGC
  ATGCATGCATGC
The leading '>' can be omitted, and a trailing '>' is removed automatically.
Examples
  fasta_string = <<END_OF_STRING
  >gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]
  MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI
  VRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ
  NLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP
  IFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP
  INRISARRAAIHPYFQES
  END_OF_STRING

  f = Bio::FastaFormat.new(fasta_string)

  f.entry #=> ">gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]\n"+
  # "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI\n"+
  # "VRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ\n"+
  # "NLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP\n"+
  # "IFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP\n"+
  # "INRISARRAAIHPYFQES\n"
Methods related to the name of the sequence
A larger range of methods for dealing with Fasta definition lines can be found in FastaDefline, accessed through the #identifiers method.
  f.entry_id    #=> "gi|398365175"
  f.definition  #=> "gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]"
  f.identifiers #=> Bio::FastaDefline instance
  f.accession   #=> "NP_009718"
  f.accessions  #=> ["NP_009718"]
  f.acc_version #=> "NP_009718.3"
  f.comment     #=> nil
Methods related to the actual sequence
  f.seq #=> "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES"
  f.data #=> "\nMSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNI\nVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQ\nNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKP\nIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDP\nINRISARRAAIHPYFQES\n"
  f.length #=> 298
  f.aaseq #=> "MSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEGVPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYMEGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNLKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGCIFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFPQWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES"
  f.aaseq.composition #=> {"M"=>5, "S"=>15, "G"=>21, "E"=>16, "L"=>36, "A"=>17, "N"=>8, "Y"=>13, "K"=>22, "R"=>20, "V"=>18, "T"=>7, "D"=>23, "P"=>17, "Q"=>10, "I"=>23, "H"=>7, "F"=>12, "C"=>4, "W"=>4}
  f.aalen #=> 298
A less structured fasta entry
  f.entry #=> ">abc 123 456\nASDF\n"

  f.entry_id    #=> "abc"
  f.definition  #=> "abc 123 456"
  f.comment     #=> nil
  f.accession   #=> nil
  f.accessions  #=> []
  f.acc_version #=> nil

  f.seq    #=> "ASDF"
  f.data   #=> "\nASDF\n"
  f.length #=> 4
  f.aaseq  #=> "ASDF"
  f.aaseq.composition #=> {"A"=>1, "S"=>1, "D"=>1, "F"=>1}
  f.aalen  #=> 4
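The composition hash shown for "ASDF" is a per-residue count. As a rough illustration that does not use BioRuby at all, the same tally can be sketched in plain Ruby. The helper name `residue_composition` is hypothetical and not part of the library.

```ruby
# Hypothetical helper (not part of BioRuby): counts how often each
# residue letter occurs in a plain sequence string.
def residue_composition(seq)
  seq.each_char.each_with_object(Hash.new(0)) do |residue, counts|
    counts[residue] += 1
  end
end

composition = residue_composition("ASDF")
# Each of A, S, D and F maps to 1, as in the f.aaseq.composition example.
```

This only mirrors the shape of the result; the library's own method may differ in details such as key ordering or the handling of ambiguous residues.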
References
- FASTA format (Wikipedia): en.wikipedia.org/wiki/FASTA_format
Constants
- DELIMITER
Entry delimiter in flatfile text.
- DELIMITER_OVERRUN
(Integer) excess read size included in DELIMITER.
Attributes
- data
The sequence lines in text.
- definition
The comment line of the FASTA formatted data.
Public Class Methods
Stores the comment and sequence information from one entry of the FASTA format string. If the argument contains more than one entry, only the first entry is used.
  # File lib/bio/db/fasta.rb, line 131
  def initialize(str)
    @definition = str[/.*/].sub(/^>/, '').strip  # 1st line
    @data = str.sub(/.*/, '')                    # rests
    @data.sub!(/^>.*/m, '')                      # remove trailing entries for sure
    @entry_overrun = $&
  end
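To see why only the first entry survives, the three regular-expression steps of the constructor can be tried on a two-entry string. The sketch below copies those steps into a standalone method so it runs without BioRuby; `split_first_entry` and the sample input are made up for illustration.

```ruby
# Standalone re-statement of the constructor's steps (hypothetical helper,
# not a BioRuby method). Returns [definition, data, overrun].
def split_first_entry(str)
  definition = str[/.*/].sub(/^>/, '').strip  # first line, without '>'
  data = str.sub(/.*/, '')                    # everything after the first line
  data.sub!(/^>.*/m, '')                      # cut from the next '>' line to the end
  overrun = $&                                # the text that was cut, or nil
  [definition, data, overrun]
end

definition, data, overrun =
  split_first_entry(">abc 123\nACGT\nTTGA\n>def 456\nGGCC\n")
# definition is "abc 123"
# data       is "\nACGT\nTTGA\n"
# overrun    is ">def 456\nGGCC\n"
```

With a single-entry string the last substitution does not match, so the overrun value is nil and the data is left unchanged.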
Public Instance Methods
Returns the length of the sequence as a Bio::Sequence::AA.
  # File lib/bio/db/fasta.rb, line 221
  def aalen
    self.aaseq.length
  end
Returns the sequence as a Bio::Sequence::AA object.
  # File lib/bio/db/fasta.rb, line 216
  def aaseq
    Sequence::AA.new(seq)
  end
Returns accession number with version.
  # File lib/bio/db/fasta.rb, line 277
  def acc_version
    identifiers.acc_version
  end
Returns an accession number.
  # File lib/bio/db/fasta.rb, line 265
  def accession
    identifiers.accession
  end
Parses the FASTA defline (using the identifiers method) and returns the accession numbers as an array of strings.
  # File lib/bio/db/fasta.rb, line 272
  def accessions
    identifiers.accessions
  end
Returns comments.
  # File lib/bio/db/fasta.rb, line 195
  def comment
    seq
    @comment
  end
Returns the stored entry as a FASTA-formatted string (same as #to_s).
  # File lib/bio/db/fasta.rb, line 139
  def entry
    @entry = ">#{@definition}\n#{@data.strip}\n"
  end
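Because the definition line is stored without its '>' and the data is stripped before output, differently written inputs end up as the same entry text. A minimal sketch, assuming only the constructor and entry logic shown on this page, with a hypothetical standalone helper `normalize_entry` in place of the class:

```ruby
# Hypothetical helper combining the constructor's parsing with the
# formatting done by #entry; it does not require BioRuby.
def normalize_entry(str)
  definition = str[/.*/].sub(/^>/, '').strip
  data = str.sub(/.*/, '').sub(/^>.*/m, '')
  ">#{definition}\n#{data.strip}\n"
end

with_marker    = normalize_entry(">abc 123 456\nASDF\n\n")
without_marker = normalize_entry("abc 123 456\nASDF")
trailing_mark  = normalize_entry("abc 123 456\nASDF\n>")
# All three are ">abc 123 456\nASDF\n".
```

The third call shows the case where the next entry's '>' was read along with the current one: it is dropped together with anything that follows it.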
Parses the FASTA defline (using the identifiers method) and returns a possibly unique identifier as a string.
  # File lib/bio/db/fasta.rb, line 251
  def entry_id
    identifiers.entry_id
  end
Parses the FASTA defline (using the identifiers method) and returns the GI, locus, accession, or accession with version number. If an entry has several such IDs, only the first one is returned. It returns a string or nil.
  # File lib/bio/db/fasta.rb, line 260
  def gi
    identifiers.gi
  end
Parses the FASTA defline and extracts IDs. IDs are NSIDs (NCBI standard FASTA sequence identifiers) or “:”-separated IDs. It returns a Bio::FastaDefline instance.
  # File lib/bio/db/fasta.rb, line 241
  def identifiers
    unless defined?(@ids) then
      @ids = FastaDefline.new(@definition)
    end
    @ids
  end
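For intuition about what an NSID-style defline contains, the example defline used earlier on this page can be taken apart by hand. This is a deliberately simplified sketch in plain Ruby: the pairwise split below happens to fit this one defline and is an assumption, not how Bio::FastaDefline actually parses (real deflines come in many layouts).

```ruby
# Simplified illustration only; Bio::FastaDefline handles far more cases.
defline = "gi|398365175|ref|NP_009718.3| Cdc28p [Saccharomyces cerevisiae S288c]"

# Separate the identifier block from the free-text description.
id_part, description = defline.split(/\s+/, 2)

# Split on '|' (the trailing empty field is dropped) and pair tag with value.
ids = id_part.split('|').each_slice(2).to_h

entry_id    = "gi|#{ids['gi']}"                 # "gi|398365175"
acc_version = ids['ref']                        # "NP_009718.3"
accession   = acc_version.sub(/\.\d+\z/, '')    # "NP_009718"
```

The three values match the results listed for f.entry_id, f.acc_version and f.accession in the examples, which is all this sketch is meant to show.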
Returns sequence length.
  # File lib/bio/db/fasta.rb, line 201
  def length
    seq.length
  end
Returns locus.
  # File lib/bio/db/fasta.rb, line 282
  def locus
    identifiers.locus
  end
Returns the length of the sequence as a Bio::Sequence::NA.
  # File lib/bio/db/fasta.rb, line 211
  def nalen
    self.naseq.length
  end
Returns the sequence as a Bio::Sequence::NA object.
  # File lib/bio/db/fasta.rb, line 206
  def naseq
    Sequence::NA.new(seq)
  end
Executes FASTA/BLAST search by using a Bio::Fasta or a Bio::Blast factory object.
  #!/usr/bin/env ruby
  require 'bio'

  factory = Bio::Fasta.local('fasta34', 'db/swissprot.f')
  flatfile = Bio::FlatFile.open(Bio::FastaFormat, 'queries.f')
  flatfile.each do |entry|
    p entry.definition
    result = entry.fasta(factory)
    result.each do |hit|
      print "#{hit.query_id} : #{hit.evalue}\t#{hit.target_id} at "
      p hit.lap_at
    end
  end
  # File lib/bio/db/fasta.rb, line 162
  def query(factory)
    factory.query(entry)
  end
Returns a joined sequence line as a String.
  # File lib/bio/db/fasta.rb, line 169
  def seq
    unless defined?(@seq)
      unless /\A\s*^\#/ =~ @data then
        @seq = Sequence::Generic.new(@data.tr(" \t\r\n0-9", '')) # lazy clean up
      else
        a = @data.split(/(^\#.*$)/)
        i = 0
        cmnt = {}
        s = []
        a.each do |x|
          if /^# ?(.*)$/ =~ x then
            cmnt[i] ? cmnt[i] << "\n" << $1 : cmnt[i] = $1
          else
            x.tr!(" \t\r\n0-9", '') # lazy clean up
            i += x.length
            s << x
          end
        end
        @comment = cmnt
        @seq = Bio::Sequence::Generic.new(s.join(''))
      end
    end
    @seq
  end
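The method above has two paths: a plain clean-up that deletes whitespace and digits, and a comment-aware path taken when the data begins with a '#' line. The sketch below restates both in standalone Ruby so they can be tried without BioRuby. The helper names and the sample data are hypothetical, and plain strings stand in for Bio::Sequence::Generic.

```ruby
# Hypothetical stand-in for the simple path: drop spaces, tabs,
# line breaks and digits (e.g. position numbers).
def clean_sequence(data)
  data.tr(" \t\r\n0-9", '')
end

# Hypothetical stand-in for the comment path. Returns [sequence, comments],
# where comments maps the sequence offset at which each '#' line appeared
# to its text.
def split_comments(data)
  comments = {}
  pieces = []
  offset = 0
  data.split(/(^\#.*$)/).each do |chunk|
    if /^# ?(.*)$/ =~ chunk
      text = $1
      if comments[offset]
        comments[offset] << "\n" << text   # several comments at one offset
      else
        comments[offset] = text
      end
    else
      cleaned = clean_sequence(chunk)
      offset += cleaned.length
      pieces << cleaned
    end
  end
  [pieces.join, comments]
end

plain = clean_sequence("\n  1 ACGT ACGT\n 11 TTGA\n")
# plain is "ACGTACGTTTGA"

sequence, comments = split_comments("\n# first domain\nACGT\n# second domain\nTTGA\n")
# sequence is "ACGTTTGA"
# comments is { 0 => "first domain", 4 => "second domain" }
```

Keying the comments by offset keeps track of where in the joined sequence each comment line belonged, which is why #comment calls seq first.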
Returns sequence as a Bio::Sequence object.
Note: For efficiency reasons, modifying the returned Bio::Sequence object might also change the sequence or definition in this FastaFormat object, although this is not guaranteed.
  # File lib/bio/db/fasta.rb, line 232
  def to_biosequence
    Bio::Sequence.adapter(self, Bio::Sequence::Adapter::FastaFormat)
  end