class EmailReplyParser::Email

An Email instance represents a parsed body String.

Constants

COMMON_REPLY_HEADER_REGEXES
EMPTY
MULTI_LINE_SIGNATURE_REGEX

Line optionally starts with whitespace, contains two or more hyphens or underscores, and ends with optional whitespace. Example: '---' or '___' or '--- '

ONE_LINE_SIGNATURE_REGEX

Line optionally starts with whitespace, followed by one hyphen, followed by a word character. Example: '-Sandro'
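
The two signature descriptions above can be tried out with stand-in patterns. This is a minimal sketch: the regexes below are written from the prose descriptions only and are assumptions, not the library's actual constants, and all names prefixed with SKETCH or sketch_ are hypothetical.

```ruby
# Hypothetical stand-ins derived from the descriptions above; the values of
# the real MULTI_LINE_SIGNATURE_REGEX and ONE_LINE_SIGNATURE_REGEX may differ.
SKETCH_MULTI_LINE_SIG = /\A\s*[-_]{2,}\s*\z/ # optional whitespace, 2+ hyphens/underscores, optional whitespace
SKETCH_ONE_LINE_SIG   = /\A\s*-\w/           # optional whitespace, one hyphen, then a word character

# Returns true if the line looks like the start of a signature under
# either of the two sketch patterns.
def sketch_signature_start?(line)
  SKETCH_MULTI_LINE_SIG.match?(line) || SKETCH_ONE_LINE_SIG.match?(line)
end

sketch_signature_start?("--")             # => true
sketch_signature_start?("  ___  ")        # => true
sketch_signature_start?("-Sandro")        # => true
sketch_signature_start?("- item")         # => false (hyphen followed by a space)
sketch_signature_start?("Hello -- world") # => false (hyphens are not at the start)
```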

ORIGINAL_MESSAGE_SIGNATURE_REGEX
QUOTE_HEADER_LABELS

TODO: refactor out into an i18n.yml file. Supports English, French, Es-Mexican, Pt-Brazilian. Maps a label to a label-group.

SENT_FROM_REGEX

No block-quotes (> or <), followed by up to three words, followed by “Sent from my”. Example: “Sent from my iPhone 3G”

SIGNATURE_REGEX

Attributes

fragments[R]

Emails have an Array of Fragments.

Public Class Methods

new()
# File lib/email_reply_parser.rb, line 62
def initialize
  @fragments = []
end

Public Instance Methods

read(text, from_address = "")

Splits the given text into a list of Fragments. This is roughly done by reversing the text and parsing from the bottom to the top. This way we can check for 'On <date>, <author> wrote:' lines above quoted blocks.

text - A String email body.
from_address - From address of the email (optional).

Returns this same Email instance.
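
To illustrate the bottom-to-top pass described above, here is a minimal sketch of the reverse-then-scan idea using StringScanner. It is a simplified stand-in rather than the parser itself, and sketch_lines_bottom_up is a hypothetical helper name.

```ruby
require 'strscan'

# Returns the lines of +text+ ordered from the bottom of the message to the
# top. The whole text is reversed, scanned line by line, and each scanned
# line is reversed again so it reads normally.
def sketch_lines_bottom_up(text)
  scanner = StringScanner.new(text.reverse)
  lines = []
  while (line = scanner.scan_until(/\n/))
    lines << line.chomp.reverse
  end
  # Whatever is left after the last newline is the first line of the email.
  rest = scanner.rest
  lines << rest.reverse unless rest.empty?
  lines
end

body = "Thanks!\n\nOn Monday, Ann wrote:\n> Ping?"
sketch_lines_bottom_up(body)
# => ["> Ping?", "On Monday, Ann wrote:", "", "Thanks!"]
```

Because the quoted line is seen first and the "wrote:" line immediately after it, a parser working in this order can attach the header to the quoted block it introduces.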

# File lib/email_reply_parser.rb, line 81
def read(text, from_address = "")
  # parse out the from name if one exists and save for use later
  @from_name_raw = parse_raw_name_from_address(from_address)
  @from_name_normalized = normalize_name(@from_name_raw)
  @from_email = parse_email_from_address(from_address)

  text = normalize_text(text)

  # The text is reversed initially due to the way we check for hidden
  # fragments.
  text = text.reverse

  # This determines if any 'visible' Fragment has been found.  Once any
  # visible Fragment is found, stop looking for hidden ones.
  @found_visible = false

  # This instance variable points to the current Fragment.  If the matched
  # line fits, it should be added to this Fragment.  Otherwise, finish it
  # and start a new Fragment.
  @fragment = nil

  # Use the StringScanner to pull out each line of the email content.
  @scanner = StringScanner.new(text)
  while line = @scanner.scan_until(/\n/n)
    scan_line(line)
  end

  # Be sure to parse the last line of the email.
  if (last_line = @scanner.rest.to_s).size > 0
    scan_line(last_line, true)
  end

  # Finish up the final fragment.  Finishing a fragment will detect any
  # attributes (hidden, signature, reply), and join each line into a
  # string.
  finish_fragment

  @scanner = @fragment = nil

  # Now that parsing is done, reverse the order.
  @fragments.reverse!
  self
end
visible_text()

Public: Gets the combined text of the visible fragments of the email body.

Returns a String.
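
A small sketch of how the visible text is assembled, using a stand-in fragment type so the example runs on its own. SketchFragment and sketch_visible_text are hypothetical names; only the select/join/rstrip shape mirrors the method.

```ruby
# Stand-in for a parsed fragment: some text plus a hidden flag.
SketchFragment = Struct.new(:text, :hidden) do
  def hidden?
    hidden
  end

  def to_s
    text
  end
end

# Joins the non-hidden fragments with newlines and trims trailing whitespace.
def sketch_visible_text(fragments)
  fragments.reject(&:hidden?).map(&:to_s).join("\n").rstrip
end

fragments = [
  SketchFragment.new("Go fish!\n", false),
  SketchFragment.new("> -- \n> Player 1", true),
  SketchFragment.new("--\nPlayer 2", true)
]
sketch_visible_text(fragments) # => "Go fish!"
```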

# File lib/email_reply_parser.rb, line 69
def visible_text
  fragments.select{|f| !f.hidden?}.map{|f| f.to_s}.join("\n").rstrip
end

Private Instance Methods

finish_fragment()

Builds the fragment string, after all lines have been added. It also checks to see if this Fragment is hidden. The hidden Fragment check reads from the bottom to the top.

Any quoted Fragments or signature Fragments are marked hidden if they are below any visible Fragments. Visible Fragments are expected to contain original content by the author. If they are below a quoted Fragment, then the Fragment should be visible to give context to the reply.

some original text (visible)

> do you have any two's? (quoted, visible)

Go fish! (visible)

> --
> Player 1 (quoted, hidden)

--
Player 2 (signature, hidden)
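
The hiding rule illustrated above can be sketched as a single pass from the last fragment to the first. This is a simplified model under stated assumptions: SketchPart and sketch_mark_hidden are hypothetical, and a fragment's type is reduced to a single symbol.

```ruby
# Stand-in fragment: text, a kind (:text, :quoted or :signature) and a flag.
SketchPart = Struct.new(:text, :kind, :hidden)

# Walks the fragments bottom-up. Quoted, signature and blank fragments are
# hidden until the first fragment with original content is reached; after
# that, nothing else is hidden.
def sketch_mark_hidden(parts)
  found_visible = false
  parts.reverse_each do |part|
    next if found_visible
    if %i[quoted signature].include?(part.kind) || part.text.strip.empty?
      part.hidden = true
    else
      found_visible = true
    end
  end
  parts
end

parts = [
  SketchPart.new("some original text", :text, false),
  SketchPart.new("> do you have any two's?", :quoted, false),
  SketchPart.new("Go fish!", :text, false),
  SketchPart.new("> --\n> Player 1", :quoted, false),
  SketchPart.new("--\nPlayer 2", :signature, false)
]
sketch_mark_hidden(parts).map(&:hidden)
# => [false, false, false, true, true]
```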
# File lib/email_reply_parser.rb, line 383
def finish_fragment
  if @fragment
    @fragment.finish
    if !@found_visible
      if @fragment.quoted? || @fragment.signature? ||
          @fragment.reply_header? || @fragment.to_s.strip == EMPTY
        @fragment.hidden = true
      else
        @found_visible = true
      end
    end
    @fragments << @fragment
  end
  @fragment = nil
end
generate_regexp_for_name()

Generates a regexp which allows for additional words or initials between first and last names.
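
As an illustration of what such a pattern accepts, the sketch below builds the same kind of regexp from a name passed in as an argument. sketch_name_regexp is a hypothetical standalone helper; the real method reads the name from instance state.

```ruby
# Joins the parts of a "First Last" name with a separator that tolerates
# extra word characters, dots and whitespace between them.
def sketch_name_regexp(normalized_name)
  separator = '[\w.\s]*'
  Regexp.new(normalized_name.split(" ").join(separator), Regexp::IGNORECASE)
end

pattern = sketch_name_regexp("John Smith")
pattern.match?("John Smith")    # => true
pattern.match?("john q. smith") # => true  (middle initial, different case)
pattern.match?("Smith, John")   # => false (parts must appear in order)
```

Note that the name parts are interpolated without escaping, so a regexp metacharacter inside a name (for example the dot in "J. Smith") is treated as a pattern rather than a literal.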

# File lib/email_reply_parser.rb, line 355
def generate_regexp_for_name
  name_parts = @from_name_normalized.split(" ")
  seperator = '[\w.\s]*'
  regexp = Regexp.new(name_parts.join(seperator), Regexp::IGNORECASE)
end
line_is_reply_header?(line)

Detects if a given line is a common reply header.

line - A String line of text from the email.

Returns true if the line is a valid header, or false.

# File lib/email_reply_parser.rb, line 337
def line_is_reply_header?(line)
  COMMON_REPLY_HEADER_REGEXES.each do |regex|
    return true if line =~ regex
  end
  false
end
line_is_signature_name?(line)

Detects whether the sender's normalized name (@from_name_normalized) makes up a large part of the given line, which suggests the line is the beginning of a signature.

line - A String line of text from the email.

Returns a truthy value if the line matches the sender's name and the name is more than 25% of the line's length, or a falsy value otherwise.
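
The length-ratio check can be seen in isolation in the sketch below. sketch_signature_name_line? is a hypothetical standalone version that takes the name as a parameter instead of reading instance state.

```ruby
# True when the line matches the sender's name (allowing extra words or
# initials between the parts) and the name is more than a quarter of the line.
def sketch_signature_name_line?(line, name)
  return false if name.empty?

  pattern = Regexp.new(name.split(" ").join('[\w.\s]*'), Regexp::IGNORECASE)
  pattern.match?(line) && (name.size.to_f / line.size) > 0.25
end

name = "John Smith"
sketch_signature_name_line?("John Smith", name)               # => true  (10 / 10)
sketch_signature_name_line?("John Q. Smith, Acme Corp", name) # => true  (10 / 24)
sketch_signature_name_line?(
  "I talked to John Smith yesterday about the quarterly budget report", name
)                                                             # => false (10 / 66)
```

The ratio keeps ordinary sentences that merely mention the sender from being treated as the start of a signature.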

# File lib/email_reply_parser.rb, line 349
def line_is_signature_name?(line)
  regexp = generate_regexp_for_name()
  @from_name_normalized != "" && (line =~ regexp) && ((@from_name_normalized.size.to_f / line.size) > 0.25)
end
make_name_first_then_last(name)
# File lib/email_reply_parser.rb, line 227
def make_name_first_then_last(name)
  split_name = name.split(',')
  if split_name[0].include?(" ")
    split_name[0].to_s
  else
    split_name[1].strip + " " + split_name[0].strip
  end
end
multiline_quote_header_in_fragment?()

Returns true if the current block in the current fragment has a multiline quote header, false otherwise.

The quote header we're looking for is mainly generated by Outlook clients. It's considered a quote header if the first 4 folded lines have one of the following forms:

label: some text

*label:* some text

where a line like this:

label: some text

possibly indented text that belongs to the previous line

is folded into:

label: some text possibly indented text that belongs to the previous line

and where label is a value from QUOTE_HEADER_LABELS that appears only once in the first 4 lines and where each group of a label is represented at most once.
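
The rule above can be exercised with a standalone sketch. SKETCH_LABELS is a small hypothetical label map used only for illustration (the real QUOTE_HEADER_LABELS covers more labels and languages), and the sketch assumes the block is given as a String in top-to-bottom order.

```ruby
# Hypothetical label => label-group map, for illustration only.
SKETCH_LABELS = {
  "from" => :from, "sent" => :date, "date" => :date,
  "to" => :to, "subject" => :subject
}.freeze

# True once four distinct label groups are seen at the top of the block,
# allowing continuation (folded) lines after the first label.
def sketch_multiline_quote_header?(block)
  folding = false
  seen_groups = []
  block.split("\n").each do |line|
    match = line.match(/\A\s*\*?([^:]+):(\s|\*)/)
    group = match && SKETCH_LABELS[match[1].downcase]
    if group
      return false if seen_groups.include?(group) # each group at most once
      return true if seen_groups.length == 3      # this is the fourth group
      seen_groups << group
      folding = true
    elsif !folding
      return false                                # text before any label
    end
    # Otherwise the line is folded into the previous label's text.
  end
  false
end

sketch_multiline_quote_header?("From: Ann\nSent: Monday\nTo: Bob\nSubject: Cards")
# => true
sketch_multiline_quote_header?("From: Ann\nTo: Bob;\n  Carol\nSent: Monday\nSubject: Cards")
# => true (the indented line is folded into "To:")
sketch_multiline_quote_header?("Hi Bob\nFrom: Ann")
# => false
```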

# File lib/email_reply_parser.rb, line 300
def multiline_quote_header_in_fragment?
  folding = false
  label_groups = []
  @fragment.current_block.split("\n").each do |line|
    if line =~ /\A\s*\*?([^:]+):(\s|\*)/
      label = QUOTE_HEADER_LABELS[$1.downcase]
      if label
        return false if label_groups.include?(label)
        return true if label_groups.length == 3
        label_groups << label
        folding = true
      elsif !folding
        return false
      end
    elsif !folding
      return false
    else
      folding = true
    end
  end
  return false
end
normalize_name(name)

Normalize a name to First Last

name - name to normalize.

Returns a String.
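
For illustration, here is a standalone sketch of the "Last, First" to "First Last" conversion. sketch_normalize_name is a hypothetical helper; unlike the listed source, it also guards against a name with nothing after the comma, which is an addition made for the example.

```ruby
# Converts "Last, First" to "First Last". Names without a comma are
# returned unchanged. If the part before the comma already contains a
# space (e.g. "John Smith, Jr."), that part is taken as the full name.
def sketch_normalize_name(name)
  return name unless name.include?(",")

  parts = name.split(",").map(&:strip)
  first_part = parts[0].to_s
  second_part = parts[1].to_s
  # Guard added for this sketch: nothing usable after the comma.
  return first_part if first_part.include?(" ") || second_part.empty?

  "#{second_part} #{first_part}"
end

sketch_normalize_name("Smith, John")     # => "John Smith"
sketch_normalize_name("John Smith")      # => "John Smith"
sketch_normalize_name("John Smith, Jr.") # => "John Smith"
```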

# File lib/email_reply_parser.rb, line 219
def normalize_name(name)
  if name.include?(',')
    make_name_first_then_last(name)
   else
    name
  end
end
normalize_text(text)

Normalizes text so it is easier to parse.

text - text to normalize

Returns a String
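
The three normalization steps can be tried on small inputs with the sketch below. sketch_normalize_text is a hypothetical non-mutating variant that skips the binary-encoding step; the patterns are the ones shown in the source listing.

```ruby
# Returns a normalized copy of +text+ without modifying the argument.
def sketch_normalize_text(text)
  # 1. Normalize line endings.
  text = text.gsub("\r\n", "\n")

  # 2. Join a reply header that a mail client wrapped over several lines,
  #    unless the match spans a blank line (then it is not a header).
  match = text.match(/^(On\s(.+)wrote:)$/m)
  if match && !match[1].include?("\n\n")
    text = text.sub(match[1]) { match[1].gsub("\n", " ") }
  end

  # 3. Put an empty line before a line of eight or more underscores.
  text.gsub(/([^\n])(?=\n_{7}_+)$/m, "\\1\n")
end

sketch_normalize_text("Hi\r\nthere")
# => "Hi\nthere"
sketch_normalize_text("Reply\n\nOn Jan 1,\nJoe wrote:\n> hi")
# => "Reply\n\nOn Jan 1, Joe wrote:\n> hi"
sketch_normalize_text("My reply\n__________\nOld message")
# => "My reply\n\n__________\nOld message"
```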

# File lib/email_reply_parser.rb, line 171
def normalize_text(text)
  # in 1.9 we want to operate on the raw bytes
  text = text.dup.force_encoding('binary') if text.respond_to?(:force_encoding)

  # Normalize line endings.
  text.gsub!("\r\n", "\n")

  # Check for multi-line reply headers. Some clients break up
  # the "On DATE, NAME <EMAIL> wrote:" line into multiple lines.
  if match = text.match(/^(On\s(.+)wrote:)$/m)
    # Remove all newlines from the reply header, as long as it contains no
    # double newline; if it does, we have grabbed something that is not
    # actually a reply header.
    text.gsub! match[1], match[1].gsub("\n", " ") unless match[1] =~ /\n\n/
  end

  # Some users may reply directly above a line of underscores.
  # In order to ensure that these fragments are split correctly,
  # make sure that all lines of underscores are preceded by
  # at least two newline characters.
  text.gsub!(/([^\n])(?=\n_{7}_+)$/m, "\\1\n")

  text
end
parse_email_from_address(address)
# File lib/email_reply_parser.rb, line 209
def parse_email_from_address(address)
  match = address.match /<(.*)>/
  match ? match[1] : address
end
parse_name_from_address(address)

Parse a person's name from an e-mail address

address - A String email address, optionally preceded by a display name.

Returns a String.
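
The sketch below shows how a From value splits into a raw display name and an email address, using the two patterns from the source listings. sketch_parse_address is a hypothetical helper that returns both parts in a Hash for convenience.

```ruby
# Splits a From value into its display name (before any normalization)
# and its email address.
def sketch_parse_address(address)
  name  = address[/^["']*([\w\s,]+)["']*\s*</, 1].to_s.strip
  email = address[/<(.*)>/, 1] || address
  { name: name, email: email }
end

sketch_parse_address("John Smith <john@example.com>")
# => {name: "John Smith", email: "john@example.com"}
sketch_parse_address('"Smith, John" <john@example.com>')
# => {name: "Smith, John", email: "john@example.com"}
sketch_parse_address("john@example.com")
# => {name: "", email: "john@example.com"}
```

A raw name such as "Smith, John" would then be passed through name normalization to obtain "John Smith".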

# File lib/email_reply_parser.rb, line 200
def parse_name_from_address(address)
  normalize_name(parse_raw_name_from_address(address))
end
parse_raw_name_from_address(address)
# File lib/email_reply_parser.rb, line 204
def parse_raw_name_from_address(address)
  match = address.match(/^["']*([\w\s,]+)["']*\s*</)
  match ? match[1].strip.to_s : EMPTY
end
scan_line(line, last = false)

Scans the given line of text and determines which fragment it belongs to.
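
The core grouping decision, whether a line continues the current fragment or starts a new one, can be sketched with quoting as the only criterion. sketch_group_by_quoting is a hypothetical, much-reduced model: it leaves out the handling of reply headers, blank lines, signatures and multiline quote headers that the real method performs.

```ruby
# Groups consecutive lines that agree on whether they start with ">".
def sketch_group_by_quoting(lines)
  lines.slice_when { |a, b| a.start_with?(">") != b.start_with?(">") }.to_a
end

sketch_group_by_quoting(["Hi,", "> quoted 1", "> quoted 2", "Thanks"])
# => [["Hi,"], ["> quoted 1", "> quoted 2"], ["Thanks"]]
```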

# File lib/email_reply_parser.rb, line 239
def scan_line(line, last = false)
  line.chomp!("\n")
  line.reverse!
  line.rstrip!

  # Mark the current Fragment as a signature if the current line is empty
  # and the Fragment starts with a common signature indicator.
  # Mark the current Fragment as a quote if the current line is empty
  # and the Fragment starts with a multiline quote header.
  scan_signature_or_quote if @fragment && line == EMPTY

  # We're looking for leading `>`'s to see if this line is part of a
  # quoted Fragment.
  is_quoted = !!(line =~ /^>+/n)

  # Note that a common reply header also counts as part of the quoted
  # Fragment, even though it doesn't start with `>`.
  unless @fragment &&
      ((@fragment.quoted? == is_quoted) ||
       (@fragment.quoted? && (line_is_reply_header?(line) || line == EMPTY)))
    finish_fragment
    @fragment = Fragment.new
    @fragment.quoted = is_quoted
  end

  @fragment.add_line(line)
  scan_signature_or_quote if last
end
scan_signature_or_quote()
# File lib/email_reply_parser.rb, line 268
def scan_signature_or_quote
  if signature_line?(@fragment.lines.first)
    @fragment.signature = true
    finish_fragment
  elsif multiline_quote_header_in_fragment?
    @fragment.quoted = true
    finish_fragment
  end
end
signature_line?(line)

Detects if a given line is the beginning of a signature

line - A String line of text from the email.

Returns true if the line is the beginning of a signature, or false.

# File lib/email_reply_parser.rb, line 328
def signature_line?(line)
  line =~ SIGNATURE_REGEX || line_is_signature_name?(line)
end