class Bio::GFF::GFF3::Record::Gap

Bio:GFF::GFF3::Record::Gap is a class to store data of “Gap” attribute.

Constants

Code

Code is a class to store length of single-letter code.

Attributes

data[R]

Internal data. Users must not use it.

Public Class Methods

new(str = nil) click to toggle source

Creates a new Gap object.


Arguments:

  • str: a formatted string, or nil.

# File lib/bio/db/gff.rb, line 1275
def initialize(str = nil)
  if str then
    @data = str.split(/ +/).collect do |x|
      if /\A([A-Z])([0-9]+)\z/ =~ x.strip then
        Code.new($1.intern, $2.to_i)
      else
        warn "ignored unknown token: #{x}.inspect" if $VERBOSE
        nil
      end
    end
    @data.compact!
  else
    @data = []
  end
end
new_from_sequences_na(reference, target, gap_regexp = /[^a-zA-Z]/) click to toggle source

Creates a new Gap object from given sequence alignment.

Note that sites of which both reference and target are gaps are silently removed.


Arguments:

  • reference: reference sequence (nucleotide sequence)

  • target: target sequence (nucleotide sequence)

  • gap_regexp: regexp to identify gap

# File lib/bio/db/gff.rb, line 1391
def self.new_from_sequences_na(reference, target,
                               gap_regexp = /[^a-zA-Z]/)
  gap = self.new
  gap.instance_eval { 
    __initialize_from_sequences_na(reference, target,
                                   gap_regexp)
  }
  gap
end
new_from_sequences_na_aa(reference, target, gap_regexp = /[^a-zA-Z]/, space_regexp = /\s/, forward_frameshift_regexp = /\>/, reverse_frameshift_regexp = /\ click to toggle source

Creates a new Gap object from given sequence alignment.

Note that sites of which both reference and target are gaps are silently removed.

For incorrect alignments that break 3:1 rule, gap positions will be moved inside codons, unwanted gaps will be removed, and some forward or reverse frameshift will be inserted.

For example,

atgg-taagac-att
M  V  K  -  I

is treated as:

atggt<aagacatt
M  V  K  >>I

Incorrect combination of frameshift with frameshift or gap may cause undefined behavior.

Forward frameshifts are recomended to be indicated in the target sequence. Reverse frameshifts can be indicated in the reference sequence or the target sequence.

Priority of regular expressions:

space > forward/reverse frameshift > gap

Arguments:

  • reference: reference sequence (nucleotide sequence)

  • target: target sequence (amino acid sequence)

  • gap_regexp: regexp to identify gap

  • space_regexp: regexp to identify space character which is completely ignored

  • forward_frameshift_regexp: regexp to identify forward frameshift

  • reverse_frameshift_regexp: regexp to identify reverse frameshift

# File lib/bio/db/gff.rb, line 1587
def self.new_from_sequences_na_aa(reference, target,
                                  gap_regexp = /[^a-zA-Z]/,
                                  space_regexp = /\s/,
                                  forward_frameshift_regexp = /\>/,
                                  reverse_frameshift_regexp = /\</)
  gap = self.new
  gap.instance_eval { 
    __initialize_from_sequences_na_aa(reference, target,
                                      gap_regexp,
                                      space_regexp,
                                      forward_frameshift_regexp,
                                      reverse_frameshift_regexp)
  }
  gap
end
parse(str) click to toggle source

Same as new(str).

# File lib/bio/db/gff.rb, line 1292
def self.parse(str)
  self.new(str)
end

Public Instance Methods

==(other) click to toggle source

If self == other, returns true. otherwise, returns false.

# File lib/bio/db/gff.rb, line 1615
def ==(other)
  if other.class == self.class and
      @data == other.data then
    true
  else
    false
  end
end
process_sequences_na(reference, target, gap_char = '-') click to toggle source

Processes nucleotide sequences and returns gapped sequences as an array of sequences.

Note for forward/reverse frameshift: Forward/Reverse_frameshift is simply treated as gap insertion to the target/reference sequence.


Arguments:

  • reference: reference sequence (nucleotide sequence)

  • target: target sequence (nucleotide sequence)

  • gap_char: gap character

# File lib/bio/db/gff.rb, line 1715
def process_sequences_na(reference, target, gap_char = '-')
  s_ref, s_tgt = dup_seqs(reference, target)

  s_ref, s_tgt = __process_sequences(s_ref, s_tgt,
                                     gap_char, gap_char,
                                     1, 1,
                                     gap_char, gap_char)

  if $VERBOSE and s_ref.length != s_tgt.length then
    warn "returned sequences not equal length"
  end
  return s_ref, s_tgt
end
process_sequences_na_aa(reference, target, gap_char = '-', space_char = ' ', forward_frameshift = '>', reverse_frameshift = '<') click to toggle source

Processes sequences and returns gapped sequences as an array of sequences. reference must be a nucleotide sequence, and target must be an amino acid sequence.

Note for reverse frameshift: Reverse_frameshift characers are inserted in the reference sequence. For example, alignment of “Gap=M3 R1 M2” is:

atgaagat<aatgtc
M  K  I  N  V

Alignment of “Gap=M3 R3 M3” is:

atgaag<<<attaatgtc
M  K  I  I  N  V

Arguments:

  • reference: reference sequence (nucleotide sequence)

  • target: target sequence (amino acid sequence)

  • gap_char: gap character

  • space_char: space character inserted to amino sequence for matching na-aa alignment

  • forward_frameshift: forward frameshift character

  • reverse_frameshift: reverse frameshift character

# File lib/bio/db/gff.rb, line 1752
def process_sequences_na_aa(reference, target,
                            gap_char = '-',
                            space_char = ' ',
                            forward_frameshift = '>',
                            reverse_frameshift = '<')
  s_ref, s_tgt = dup_seqs(reference, target)
  s_tgt = s_tgt.gsub(/./, "\\0#{space_char}#{space_char}")
  ref_increment = 3
  tgt_increment = 1 + space_char.length * 2
  ref_gap = gap_char * 3
  tgt_gap = "#{gap_char}#{space_char}#{space_char}"
  return __process_sequences(s_ref, s_tgt,
                             ref_gap, tgt_gap,
                             ref_increment, tgt_increment,
                             forward_frameshift,
                             reverse_frameshift)
end
to_s() click to toggle source

string representation

# File lib/bio/db/gff.rb, line 1604
def to_s
  @data.collect { |x| x.to_s }.join(" ")
end

Private Instance Methods

__initialize_from_sequences_na(reference, target, gap_regexp = /[^a-zA-Z]/) click to toggle source

(private method) Parses given reference-target sequence alignment and initializes self. Existing data will be erased.

# File lib/bio/db/gff.rb, line 1322
def __initialize_from_sequences_na(reference, target,
                                   gap_regexp = /[^a-zA-Z]/)
  
  data_ref = __scan_gap(reference, gap_regexp, :I, :M)
  data_tgt = __scan_gap(target,    gap_regexp, :D, :M)
  data = []

  while !data_ref.empty? and !data_tgt.empty?
    ref = data_ref.shift
    tgt = data_tgt.shift
    if ref.length > tgt.length then
      x = Code.new(ref.code, ref.length - tgt.length)
      data_ref.unshift x
      ref.length = tgt.length
    elsif ref.length < tgt.length then
      x = Code.new(tgt.code, tgt.length - ref.length)
      data_tgt.unshift x
      tgt.length = ref.length
    end
    case ref.code
    when :M
      if tgt.code == :M then
        data.push ref
      elsif tgt.code == :D then
        data.push tgt
      else
        raise 'Bug: should not reach here.'
      end
    when :I
      if tgt.code == :M then
        data.push ref
      elsif tgt.code == :D then
        # This site is ignored,
        # because both reference and target are gap
      else
        raise 'Bug: should not reach here.'
      end
    end
  end #while

  # rest of data_ref
  len = 0
  data_ref.each do |r|
    len += r.length if r.code == :M
  end
  data.push Code.new(:D, len) if len > 0

  # rest of data_tgt
  len = 0
  data_tgt.each do |t|
    len += t.length if t.code == :M
  end
  data.push Code.new(:I, len) if len > 0

  @data = data
  true
end
__initialize_from_sequences_na_aa(reference, target, gap_regexp = /[^a-zA-Z]/, space_regexp = /\s/, forward_frameshift_regexp = /\>/, reverse_frameshift_regexp = /\ click to toggle source

(private method) Parses given reference(nuc)-target(amino) sequence alignment and initializes self. Existing data will be erased.

# File lib/bio/db/gff.rb, line 1451
def __initialize_from_sequences_na_aa(reference, target,
                                      gap_regexp = /[^a-zA-Z]/,
                                      space_regexp = /\s/,
                                      forward_frameshift_regexp =
                                      /\>/,
                                      reverse_frameshift_regexp =
                                      /\</)

  data = []
  sc_ref = StringScanner.new(reference)
  sc_tgt = StringScanner.new(target)

  re_one = /./mn

  while !sc_tgt.eos?
    if len = sc_tgt.skip(space_regexp) then
      # ignored
    elsif len = sc_tgt.skip(forward_frameshift_regexp) then
      cur = __push_code_to_data(cur, data, :F, len)
      len.times { sc_ref.scan(re_one) }

    elsif len = sc_tgt.skip(reverse_frameshift_regexp) then
      cur = __push_code_to_data(cur, data, :R, len)
      pos = sc_ref.pos
      pos -= len
      if pos < 0 then
        warn "Incorrect reverse frameshift" if $VERBOSE
        pos = 0
      end
      sc_ref.pos = pos

    elsif len = sc_tgt.skip(gap_regexp) then
      len.times do
        ref_gaps, ref_fs = __scan_codon(sc_ref,
                                        gap_regexp,
                                        space_regexp,
                                        forward_frameshift_regexp,
                                        reverse_frameshift_regexp)
        case ref_gaps
        when 3
          # both ref and tgt are gap. ignored the site
        when 2, 1
          # forward frameshift inserted
          ref_fs += (3 - ref_gaps)
        when 0
          cur = __push_code_to_data(cur, data, :D, 1)
        else
          raise 'Bug: should not reach here'
        end
        if ref_fs < 0 then
          cur = __push_code_to_data(cur, data, :R, -ref_fs)
        elsif ref_fs > 0 then
          cur = __push_code_to_data(cur, data, :F, ref_fs)
        end
      end #len.times
    elsif len = sc_tgt.skip(re_one) then
      # always 1-letter
      ref_gaps, ref_fs = __scan_codon(sc_ref,
                                      gap_regexp,
                                      space_regexp,
                                      forward_frameshift_regexp,
                                      reverse_frameshift_regexp)
      case ref_gaps
      when 3
        cur = __push_code_to_data(cur, data, :I, 1)
      when 2, 1, 0
        # reverse frameshift inserted when gaps exist
        ref_fs -= ref_gaps
        # normal site
        cur = __push_code_to_data(cur, data, :M, 1)
      else
        raise 'Bug: should not reach here'
      end
      if ref_fs < 0 then
        cur = __push_code_to_data(cur, data, :R, -ref_fs)
      elsif ref_fs > 0 then
        cur = __push_code_to_data(cur, data, :F, ref_fs)
      end
    else
      raise 'Bug: should not reach here'
    end
  end #while

  if sc_ref.rest_size > 0 then
    rest = sc_ref.scan(/.*/mn)
    rest.gsub!(space_regexp, '')
    rest.gsub!(forward_frameshift_regexp, '')
    rest.gsub!(reverse_frameshift_regexp, '')
    rest.gsub!(gap_regexp, '')
    len = rest.length.div(3)
    cur = __push_code_to_data(cur, data, :D, len) if len > 0
    len = rest.length % 3
    cur = __push_code_to_data(cur, data, :F, len) if len > 0
  end

  @data = data
  self
end
__process_sequences(s_ref, s_tgt, ref_gap, tgt_gap, ref_increment, tgt_increment, forward_frameshift, reverse_frameshift) click to toggle source

(private method) insert gaps refers to the gap rule inside the object

# File lib/bio/db/gff.rb, line 1638
def __process_sequences(s_ref, s_tgt,
                        ref_gap, tgt_gap,
                        ref_increment, tgt_increment,
                        forward_frameshift,
                        reverse_frameshift)
  p_ref = 0
  p_tgt = 0
  @data.each do |c|
    #$stderr.puts c.inspect
    #$stderr.puts "p_ref=#{p_ref} s_ref=#{s_ref.inspect}"
    #$stderr.puts "p_tgt=#{p_tgt} s_tgt=#{s_tgt.inspect}"
    case c.code
    when :M # match
      p_ref += c.length * ref_increment
      p_tgt += c.length * tgt_increment
    when :I # insert a gap into the reference sequence
      begin
        s_ref[p_ref, 0] = ref_gap * c.length
      rescue IndexError
        raise 'reference sequence too short'
      end
      p_ref += c.length * ref_increment
      p_tgt += c.length * tgt_increment
    when :D # insert a gap into the target (delete from reference)
      begin
        s_tgt[p_tgt, 0] =  tgt_gap * c.length
      rescue IndexError
        raise 'target sequence too short'
      end
      p_ref += c.length * ref_increment
      p_tgt += c.length * tgt_increment
    when :F # frameshift forward in the reference sequence
      begin
        s_tgt[p_tgt, 0] = forward_frameshift * c.length
      rescue IndexError
        raise 'target sequence too short'
      end
      p_ref += c.length
      p_tgt += c.length
    when :R # frameshift reverse in the reference sequence
      p_rev_frm = p_ref - c.length
      if p_rev_frm < 0 then
        raise 'too short reference sequence, or too many reverse frameshifts'
      end
      begin
        s_ref[p_rev_frm, 0] = reverse_frameshift * c.length
      rescue IndexError
        raise 'reference sequence too short'
      end
      
    else
      warn "ignored #{c.to_s.inspect}" if $VERBOSE
    end
  end

  if s_ref.length < p_ref then
    raise 'reference sequence too short'
  end
  if s_tgt.length < p_tgt then
    raise 'target sequence too short'
  end
  return s_ref, s_tgt
end
__push_code_to_data(cur, data, code, len) click to toggle source

(private method) internal use only

# File lib/bio/db/gff.rb, line 1437
def __push_code_to_data(cur, data, code, len)
  if cur and cur.code == code then
    cur.length += len
  else
    cur = Code.new(code, len)
    data.push cur
  end
  return cur
end
__scan_codon(sc_ref, gap_regexp, space_regexp, forward_frameshift_regexp, reverse_frameshift_regexp) click to toggle source

(private method) scans a codon or gap in reference sequence

# File lib/bio/db/gff.rb, line 1403
def __scan_codon(sc_ref, 
                 gap_regexp, space_regexp,
                 forward_frameshift_regexp,
                 reverse_frameshift_regexp)
  chars = []
  gap_count = 0
  fs_count = 0

  while chars.size < 3 + fs_count and char = sc_ref.scan(/./mn)
    case char
    when space_regexp
      # ignored
    when forward_frameshift_regexp
      # next char is forward frameshift
      fs_count += 1
    when reverse_frameshift_regexp
      # next char is reverse frameshift
      fs_count -= 1
    when gap_regexp
      chars.push char
      gap_count += 1
    else
      chars.push char
    end
  end #while
  if chars.size < (3 + fs_count) then
    gap_count += (3 + fs_count) - chars.size
  end
  return gap_count, fs_count
end
__scan_gap(str, gap_regexp = /[^a-zA-Z]/, code_i = :I, code_m = :M) click to toggle source

(private method) Scans gaps and returns an array of Code objects

# File lib/bio/db/gff.rb, line 1298
def __scan_gap(str, gap_regexp = /[^a-zA-Z]/,
               code_i = :I, code_m = :M)
  sc = StringScanner.new(str)
  data = []
  while len = sc.skip_until(gap_regexp)
    mlen = len - sc.matched_size
    data.push Code.new(code_m, mlen) if mlen > 0
    g = Code.new(code_i, sc.matched_size)
    while glen = sc.skip(gap_regexp)
      g.length += glen
    end
    data.push g
  end
  if sc.rest_size > 0 then
    m = Code.new(code_m, sc.rest_size)
    data.push m
  end
  data
end
dup_seqs(*arg) click to toggle source

duplicates sequences

# File lib/bio/db/gff.rb, line 1625
def dup_seqs(*arg)
  arg.collect do |s|
    begin
      s = s.seq
    rescue NoMethodError
    end
    s.dup
  end
end