class EmailReplyParser::Email
An Email instance represents a parsed body String.
Constants
- COMMON_REPLY_HEADER_REGEXES
- EMPTY
- MULTI_LINE_SIGNATURE_REGEX
Line optionally starts with whitespace, contains two or more hyphens or underscores, and ends with optional whitespace. Example: '--' or '__' or '-- '
- ONE_LINE_SIGNATURE_REGEX
Line optionally starts with whitespace, followed by one hyphen, followed by a word character. Example: '-Sandro'
- ORIGINAL_MESSAGE_SIGNATURE_REGEX
- QUOTE_HEADER_LABELS
TODO: refactor out into an i18n.yml file. Supports English, French, Es-Mexican, Pt-Brazilian. Maps a label to a label-group.
- SENT_FROM_REGEX
No block-quotes (> or <), followed by up to three words, followed by “Sent from my”. Example: “Sent from my iPhone 3G”
- SIGNATURE_REGEX
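The two signature descriptions above can be sketched as regular expressions. The patterns and constant names below are hypothetical reconstructions from the prose only; the library's actual constants are not shown on this page and may be written differently.

```ruby
# Hypothetical reconstructions from the prose descriptions;
# not the library's actual constants.
MULTI_LINE_SIGNATURE_SKETCH = /\A\s*[-_]{2,}\s*\z/ # e.g. '--', '__', '-- '
ONE_LINE_SIGNATURE_SKETCH   = /\A\s*-\w/            # e.g. '-Sandro'

["--", "__", "-- ", "-Sandro", "- item"].each do |line|
  multi = !!(line =~ MULTI_LINE_SIGNATURE_SKETCH)
  one   = !!(line =~ ONE_LINE_SIGNATURE_SKETCH)
  puts "#{line.inspect}: multi=#{multi} one=#{one}"
end
```

A list item such as '- item' matches neither sketch, because the hyphen is followed by a space rather than a word character.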
Attributes
Emails have an Array of Fragments.
Public Class Methods
# File lib/email_reply_parser.rb, line 62
def initialize
  @fragments = []
end
Public Instance Methods
Splits the given text into a list of Fragments. This is roughly done by reversing the text and parsing from the bottom to the top. This way we can check for 'On <date>, <author> wrote:' lines above quoted blocks.
text - A String email body.
from_address - A String from address of the email (optional).
Returns this same Email instance.
# File lib/email_reply_parser.rb, line 81
def read(text, from_address = "")
  # parse out the from name if one exists and save for use later
  @from_name_raw = parse_raw_name_from_address(from_address)
  @from_name_normalized = normalize_name(@from_name_raw)
  @from_email = parse_email_from_address(from_address)

  text = normalize_text(text)

  # The text is reversed initially due to the way we check for hidden
  # fragments.
  text = text.reverse

  # This determines if any 'visible' Fragment has been found. Once any
  # visible Fragment is found, stop looking for hidden ones.
  @found_visible = false

  # This instance variable points to the current Fragment. If the matched
  # line fits, it should be added to this Fragment. Otherwise, finish it
  # and start a new Fragment.
  @fragment = nil

  # Use the StringScanner to pull out each line of the email content.
  @scanner = StringScanner.new(text)
  while line = @scanner.scan_until(/\n/n)
    scan_line(line)
  end

  # Be sure to parse the last line of the email.
  if (last_line = @scanner.rest.to_s).size > 0
    scan_line(last_line, true)
  end

  # Finish up the final fragment. Finishing a fragment will detect any
  # attributes (hidden, signature, reply), and join each line into a
  # string.
  finish_fragment

  @scanner = @fragment = nil

  # Now that parsing is done, reverse the order.
  @fragments.reverse!
  self
end
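The reverse-and-scan step can be tried in isolation. The following is a minimal sketch, not the library's code: the helper name `lines_bottom_up` is hypothetical, and it only shows how reversing the text and scanning with StringScanner yields the original lines from the bottom of the email to the top.

```ruby
require 'strscan'

# Hypothetical helper: returns the lines of +text+ in bottom-to-top order
# by reversing the whole text, scanning up to each newline, and reversing
# every scanned line back into readable order.
def lines_bottom_up(text)
  scanner = StringScanner.new(text.reverse)
  lines = []
  while (line = scanner.scan_until(/\n/))
    lines << line.chomp("\n").reverse
  end
  # Whatever is left after the last newline is the first line of the email.
  rest = scanner.rest
  lines << rest.reverse unless rest.empty?
  lines
end

p lines_bottom_up("Hi there\n> quoted\n-- \nJane")
```

Seeing the last line first is what lets a parser decide that a trailing signature or quote is hidden before it reaches the visible text above it.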
Public: Gets the combined text of the visible fragments of the email body.
Returns a String.
# File lib/email_reply_parser.rb, line 69
def visible_text
  fragments.select{|f| !f.hidden?}.map{|f| f.to_s}.join("\n").rstrip
end
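To illustrate the filtering that visible_text performs, here is a small self-contained sketch. `SketchFragment` is a hypothetical stand-in for the library's Fragment class and only models the two methods the filter relies on.

```ruby
# Hypothetical stand-in for Fragment: just text plus a hidden flag.
SketchFragment = Struct.new(:text, :hidden) do
  def hidden?
    hidden
  end

  def to_s
    text
  end
end

# Joins the non-hidden fragments and trims trailing whitespace.
def visible_text_sketch(fragments)
  fragments.reject(&:hidden?).map(&:to_s).join("\n").rstrip
end

fragments = [
  SketchFragment.new("Hi", false),
  SketchFragment.new("> quoted", true),
  SketchFragment.new("Bye\n", false)
]
puts visible_text_sketch(fragments)
```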
Private Instance Methods
Builds the fragment string, after all lines have been added. It also checks to see if this Fragment is hidden. The hidden Fragment check reads from the bottom to the top.
Any quoted Fragments or signature Fragments are marked hidden if they are below all visible Fragments. Visible Fragments are expected to contain original content by the author. A quoted Fragment that has a visible Fragment below it stays visible, to give context to the reply.

    some original text (visible)

    > do you have any two's? (quoted, visible)

    Go fish! (visible)

    > --
    > Player 1 (quoted, hidden)

    --
    Player 2 (signature, hidden)
# File lib/email_reply_parser.rb, line 383
def finish_fragment
  if @fragment
    @fragment.finish
    if !@found_visible
      if @fragment.quoted? || @fragment.signature? || @fragment.reply_header? || @fragment.to_s.strip == EMPTY
        @fragment.hidden = true
      else
        @found_visible = true
      end
    end
    @fragments << @fragment
  end
  @fragment = nil
end
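The bottom-to-top hidden check can be sketched on its own. This is a simplified illustration under stated assumptions: `Piece` and `mark_hidden` are hypothetical names, and the reply-header case handled by the real method is left out.

```ruby
# Hypothetical stand-in for a finished Fragment.
Piece = Struct.new(:text, :quoted, :signature, :hidden)

# Walks the pieces from the bottom up. Until the first piece with original
# content is found, quoted, signature, and empty pieces are marked hidden.
def mark_hidden(pieces_top_down)
  found_visible = false
  pieces_top_down.reverse_each do |piece|
    next if found_visible
    if piece.quoted || piece.signature || piece.text.strip.empty?
      piece.hidden = true
    else
      found_visible = true
    end
  end
  pieces_top_down
end

pieces = [
  Piece.new("some original text", false, false),
  Piece.new("> do you have any two's?", true, false),
  Piece.new("Go fish!", false, false),
  Piece.new("> --\n> Player 1", true, false),
  Piece.new("--\nPlayer 2", false, true)
]
mark_hidden(pieces).each { |p| puts "#{p.hidden ? 'hidden ' : 'visible'} #{p.text.inspect}" }
```

Only the two pieces below "Go fish!" end up hidden; the quoted question above it stays visible because a visible piece was already found.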
Generates a regexp that allows for additional words or initials between the first and last names.
# File lib/email_reply_parser.rb, line 355
def generate_regexp_for_name
  name_parts = @from_name_normalized.split(" ")
  seperator = '[\w.\s]*'
  regexp = Regexp.new(name_parts.join(seperator), Regexp::IGNORECASE)
end
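A standalone sketch of the same idea, taking the name as an argument instead of reading an instance variable. The function name `name_regexp` and the sample names are hypothetical.

```ruby
# Joins the name parts with a pattern that tolerates extra words,
# initials, and periods between them; matching is case-insensitive.
def name_regexp(normalized_name)
  parts = normalized_name.split(" ")
  Regexp.new(parts.join('[\w.\s]*'), Regexp::IGNORECASE)
end

regexp = name_regexp("Jane Doe")
p "jane Q. doe" =~ regexp   # middle initial is tolerated
p "Jane - Doe" =~ regexp    # a hyphen is not in the separator class
```

Like the original, this sketch does not escape the name parts, so a name containing regexp metacharacters would change the pattern.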
Detects if a given line is a common reply header.
line - A String line of text from the email.
Returns true if the line is a valid header, or false.
# File lib/email_reply_parser.rb, line 337
def line_is_reply_header?(line)
  COMMON_REPLY_HEADER_REGEXES.each do |regex|
    return true if line =~ regex
  end
  false
end
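Detects whether the normalized from name (@from_name_normalized) makes up a large part of a given line, which suggests the line is the beginning of a signature. The name must match the line and account for more than 25% of the line's length.
line - A String line of text from the email.
Returns a truthy value if the from name is a large part of the line, or a falsy value otherwise.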
# File lib/email_reply_parser.rb, line 349
def line_is_signature_name?(line)
  regexp = generate_regexp_for_name()
  @from_name_normalized != "" && (line =~ regexp) && ((@from_name_normalized.size.to_f / line.size) > 0.25)
end
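The length-ratio test is easiest to see with concrete lines. Below is a minimal sketch with the from name passed in explicitly; `signature_name_line?` is a hypothetical name, and the 0.25 threshold is taken from the source above.

```ruby
# True when +from_name+ matches +line+ and accounts for more than a
# quarter of the line's length.
def signature_name_line?(line, from_name)
  return false if from_name.empty?
  regexp = Regexp.new(from_name.split(" ").join('[\w.\s]*'), Regexp::IGNORECASE)
  !!(line =~ regexp) && (from_name.size.to_f / line.size) > 0.25
end

p signature_name_line?("Jane Doe", "Jane Doe")
p signature_name_line?("Thanks for the update, Jane Doe will follow up with you next week", "Jane Doe")
```

A line that is mostly the sender's name is treated as a signature start, while a long sentence that merely mentions the name is not.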
# File lib/email_reply_parser.rb, line 227
def make_name_first_then_last(name)
  split_name = name.split(',')
  if split_name[0].include?(" ")
    split_name[0].to_s
  else
    split_name[1].strip + " " + split_name[0].strip
  end
end
Returns true if the current block in the current fragment has a multiline quote header, false otherwise.
The quote header we're looking for is mainly generated by Outlook clients. It's considered a quote header if the first 4 folded lines have one of the following forms:
label: some text
*label:* some text
where a line like this:
label: some text
possibly indented text that belongs to the previous line
is folded into:
label: some text possibly indented text that belongs to the previous line
and where label is a value from QUOTE_HEADER_LABELS that appears only once in the first 4 lines and where each group of a label is represented at most once.
# File lib/email_reply_parser.rb, line 300
def multiline_quote_header_in_fragment?
  folding = false
  label_groups = []
  @fragment.current_block.split("\n").each do |line|
    if line =~ /\A\s*\*?([^:]+):(\s|\*)/
      label = QUOTE_HEADER_LABELS[$1.downcase]
      if label
        return false if label_groups.include?(label)
        return true if label_groups.length == 3
        label_groups << label
        folding = true
      elsif !folding
        return false
      end
    elsif !folding
      return false
    else
      folding = true
    end
  end
  return false
end
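The check can be exercised against a plain string. The sketch below is an illustration only: `SKETCH_LABELS` is a hypothetical, English-only stand-in for QUOTE_HEADER_LABELS (whose real contents are not shown on this page), and the block is passed in as an argument.

```ruby
# Hypothetical label map: label text => label group.
SKETCH_LABELS = {
  "from" => :from, "sent" => :date, "date" => :date,
  "to" => :to, "subject" => :subject
}.freeze

# True once a fourth distinct label group is seen, with unlabeled lines
# allowed only as continuations (folds) of an earlier labeled line.
def multiline_quote_header?(block)
  folding = false
  groups = []
  block.split("\n").each do |line|
    if line =~ /\A\s*\*?([^:]+):(\s|\*)/
      group = SKETCH_LABELS[Regexp.last_match(1).downcase]
      if group
        return false if groups.include?(group)
        return true if groups.length == 3
        groups << group
        folding = true
      elsif !folding
        return false
      end
    elsif !folding
      return false
    end
  end
  false
end

p multiline_quote_header?("From: a@example.com\nSent: Monday\nTo: b@example.com\nSubject: Hi")
p multiline_quote_header?("From: a@example.com\nFrom: c@example.com\nTo: b@example.com\nSubject: Hi")
```

The second call returns false because the "from" group appears twice.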
Normalize a name to First Last
name - A String name to normalize.
Returns a String.
# File lib/email_reply_parser.rb, line 219
def normalize_name(name)
  if name.include?(',')
    make_name_first_then_last(name)
  else
    name
  end
end
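A compact sketch of the "Last, First" to "First Last" normalization, combining the two methods above into one hypothetical function, `first_then_last`. As an assumption of this sketch (not the library's behavior), a name without two comma-separated parts is returned unchanged.

```ruby
# Reorders "Last, First" into "First Last". If the part before the comma
# already contains a space (e.g. "Jane Doe, PhD"), that part is kept as is.
def first_then_last(name)
  parts = name.split(',').map(&:strip)
  return name if parts.length < 2          # sketch-only guard
  return parts[0] if parts[0].include?(" ")
  "#{parts[1]} #{parts[0]}"
end

p first_then_last("Doe, Jane")
p first_then_last("Jane Doe, PhD")
p first_then_last("Jane Doe")
```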
Normalizes text so it is easier to parse.
text - A String text to normalize.
Returns a String.
# File lib/email_reply_parser.rb, line 171
def normalize_text(text)
  # in 1.9 we want to operate on the raw bytes
  text = text.dup.force_encoding('binary') if text.respond_to?(:force_encoding)

  # Normalize line endings.
  text.gsub!("\r\n", "\n")

  # Check for multi-line reply headers. Some clients break up
  # the "On DATE, NAME <EMAIL> wrote:" line into multiple lines.
  if match = text.match(/^(On\s(.+)wrote:)$/m)
    # Remove all new lines from the reply header, as long as we don't have any double newline.
    # If we do, then we have grabbed something that is not actually a reply header.
    text.gsub! match[1], match[1].gsub("\n", " ") unless match[1] =~ /\n\n/
  end

  # Some users may reply directly above a line of underscores.
  # In order to ensure that these fragments are split correctly,
  # make sure that all lines of underscores are preceded by
  # at least two newline characters.
  text.gsub!(/([^\n])(?=\n_{7}_+)$/m, "\\1\n")

  text
end
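The three normalization steps can be observed on small inputs. This sketch reuses the regular expressions from the source above but skips the binary-encoding step; the function name `normalize_for_parsing` is hypothetical.

```ruby
def normalize_for_parsing(text)
  text = text.dup

  # 1. Normalize line endings.
  text.gsub!("\r\n", "\n")

  # 2. Collapse an "On ... wrote:" header that was wrapped across lines,
  #    unless the match spans a blank line.
  if (match = text.match(/^(On\s(.+)wrote:)$/m))
    text.gsub!(match[1], match[1].gsub("\n", " ")) unless match[1] =~ /\n\n/
  end

  # 3. Put a blank line before any line of eight or more underscores.
  text.gsub!(/([^\n])(?=\n_{7}_+)$/m, "\\1\n")

  text
end

p normalize_for_parsing("Hi\r\nthere")
p normalize_for_parsing("On Tue, Jan 1,\nJane <jane@example.com> wrote:\n> hi")
p normalize_for_parsing("Reply\n________\nold message")
```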
# File lib/email_reply_parser.rb, line 209
def parse_email_from_address(address)
  match = address.match /<(.*)>/
  match ? match[1] : address
end
Parses a person's name from an e-mail address and normalizes it to First Last.
address - A String e-mail address, optionally with a display name.
Returns a String.
# File lib/email_reply_parser.rb, line 200
def parse_name_from_address(address)
  normalize_name(parse_raw_name_from_address(address))
end

# File lib/email_reply_parser.rb, line 204
def parse_raw_name_from_address(address)
  match = address.match(/^["']*([\w\s,]+)["']*\s*</)
  match ? match[1].strip.to_s : EMPTY
end
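The two address regular expressions above can be tried standalone. The function names `raw_name` and `email_of` and the sample address are hypothetical; the patterns are copied from the source.

```ruby
# Extracts the display name that precedes "<...>", or "" if there is none.
def raw_name(address)
  match = address.match(/^["']*([\w\s,]+)["']*\s*</)
  match ? match[1].strip : ""
end

# Extracts the part inside "<...>", or returns the address unchanged.
def email_of(address)
  match = address.match(/<(.*)>/)
  match ? match[1] : address
end

address = '"Doe, Jane" <jane@example.com>'
p raw_name(address)
p email_of(address)
p raw_name("jane@example.com")
```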
Scans the given line of text and determines which fragment it belongs to.
# File lib/email_reply_parser.rb, line 239
def scan_line(line, last = false)
  line.chomp!("\n")
  line.reverse!
  line.rstrip!

  # Mark the current Fragment as a signature if the current line is empty
  # and the Fragment starts with a common signature indicator.
  # Mark the current Fragment as a quote if the current line is empty
  # and the Fragment starts with a multiline quote header.
  scan_signature_or_quote if @fragment && line == EMPTY

  # We're looking for leading `>`'s to see if this line is part of a
  # quoted Fragment.
  is_quoted = !!(line =~ /^>+/n)

  # Note that a common reply header also counts as part of the quoted
  # Fragment, even though it doesn't start with `>`.
  unless @fragment &&
      ((@fragment.quoted? == is_quoted) ||
       (@fragment.quoted? && (line_is_reply_header?(line) || line == EMPTY)))
    finish_fragment
    @fragment = Fragment.new
    @fragment.quoted = is_quoted
  end

  @fragment.add_line(line)
  scan_signature_or_quote if last
end

# File lib/email_reply_parser.rb, line 268
def scan_signature_or_quote
  if signature_line?(@fragment.lines.first)
    @fragment.signature = true
    finish_fragment
  elsif multiline_quote_header_in_fragment?
    @fragment.quoted = true
    finish_fragment
  end
end
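The core grouping rule in scan_line, starting a new fragment whenever the quoted status changes, can be sketched in a few lines. This is a deliberate simplification: it works top-down, ignores reply headers, empty lines, and signatures, and `group_by_quoting` is a hypothetical name.

```ruby
# Splits the text into runs of consecutive lines that share the same
# quoted status (leading ">" or not).
def group_by_quoting(text)
  text.lines.map(&:chomp)
      .slice_when { |a, b| a.start_with?(">") != b.start_with?(">") }
      .map { |run| { quoted: run.first.start_with?(">"), lines: run } }
end

group_by_quoting("Hi\n> a\n> b\nBye").each do |group|
  puts "#{group[:quoted] ? 'quoted  ' : 'unquoted'} #{group[:lines].inspect}"
end
```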
Detects if a given line is the beginning of a signature
line - A String line of text from the email.
Returns true if the line is the beginning of a signature, or false.
# File lib/email_reply_parser.rb, line 328
def signature_line?(line)
  line =~ SIGNATURE_REGEX || line_is_signature_name?(line)
end