ClassifierReborn::Hasher

Constants

CORPUS_SKIP_WORDS

Public Instance Methods

clean_word_hash(str)

Returns a word hash containing only stemmed words, with punctuation and symbols stripped out.

# File lib/classifier-reborn/extensions/hasher.rb, line 28
def clean_word_hash(str)
  word_hash_for_words(str.gsub(/[^\w\s]/, "").split)
end
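
To illustrate the tokenization step on its own, the sketch below applies the same regular expression to a made-up sample sentence. The stemming and skip-word filtering performed afterwards by word_hash_for_words are deliberately left out, so this shows only what gets passed along.

```ruby
# Strip every character that is neither a word character nor whitespace,
# then split on whitespace, as clean_word_hash does before counting.
# The sample sentence is invented for illustration.
sample = "Hello, world! It's a well-known test."
tokens = sample.gsub(/[^\w\s]/, "").split

p tokens
# => ["Hello", "world", "Its", "a", "wellknown", "test"]
```

Note that apostrophes and hyphens vanish without leaving a space, so "It's" and "well-known" each collapse into a single token.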

without_punctuation(str)

Removes common punctuation symbols, returning a new string. Most symbols are replaced with a space; apostrophes and hyphens are deleted outright. E.g.,

without_punctuation("Hello (greeting's), with {braces} < >...?")
=> "Hello  greetings   with  braces         "

# File lib/classifier-reborn/extensions/hasher.rb, line 15
def without_punctuation(str)
  str.tr(',?.!;:"@#$%^&*()_=+[]{}\|<>/`~', " ").tr("'\-", "")
end
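
As a rough illustration of the two tr passes, the sketch below copies the method body into a standalone definition (so it runs without the library loaded) and feeds it an invented string containing apostrophes and hyphens.

```ruby
# Standalone copy of the method body, for illustration only.
def without_punctuation(str)
  # First pass: replace listed symbols with a space.
  # Second pass: an empty replacement makes tr delete ' and -.
  str.tr(',?.!;:"@#$%^&*()_=+[]{}\|<>/`~', " ").tr("'\-", "")
end

cleaned = without_punctuation("don't use well-known e-mail, ok?")
p cleaned.split
# => ["dont", "use", "wellknown", "email", "ok"]
```

Splitting the result hides the leftover runs of spaces, which is usually what a caller wants before counting words.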

word_hash(str)

Returns a Hash of symbols => ints. Each word in the string is stemmed, interned, and mapped to its frequency in the document. Runs of punctuation are counted as well, as separate entries.

# File lib/classifier-reborn/extensions/hasher.rb, line 21
def word_hash(str)
  word_hash   = clean_word_hash(str)
  symbol_hash = word_hash_for_symbols(str.gsub(/[\w]/, " ").split)
  word_hash.merge(symbol_hash)
end
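
The punctuation half of word_hash may be easier to follow in isolation. The sketch below, using an invented sample string, shows how blanking out every word character leaves only runs of punctuation, which are then counted under their interned form. It does not call the library and omits the word half entirely.

```ruby
# Replace every word character with a space so that only runs of
# punctuation survive, then split those runs into tokens.
sample = "Wait... what?! Really?!"
symbol_tokens = sample.gsub(/[\w]/, " ").split

# Count each token under its interned (Symbol) form.
symbol_counts = Hash.new(0)
symbol_tokens.each { |token| symbol_counts[token.intern] += 1 }

p symbol_tokens
# => ["...", "?!", "?!"]
p symbol_counts[:"?!"]
# => 2
```

In word_hash these counts are merged into the word counts, so a single Hash holds both kinds of entry.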
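
word_hash_for_symbols(words)

Counts the occurrences of each token in words, returning a Hash of interned tokens => ints. No filtering or stemming is applied.

# File lib/classifier-reborn/extensions/hasher.rb, line 43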
def word_hash_for_symbols(words)
  d = Hash.new(0)
  words.each do |word|
    d[word.intern] += 1
  end
  return d
end
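
The counting idiom here relies on Hash.new(0), which supplies 0 for keys that have not been seen yet. The sketch below is a standalone copy of the method (so it runs without the library) with a made-up token list, and also shows that reading a missing key does not store it.

```ruby
# Standalone copy of the method, for illustration only.
def word_hash_for_symbols(words)
  d = Hash.new(0) # unseen keys read as 0, so += 1 works on first sight
  words.each do |word|
    d[word.intern] += 1
  end
  d
end

counts = word_hash_for_symbols(%w[! ? ! ...])

p counts[:"!"]        # => 2
p counts[:";"]        # => 0 (default value, nothing stored)
p counts.key?(:";")   # => false
```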
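
word_hash_for_words(words)

Counts the occurrences of each word in words. Words are downcased in place; those found in CORPUS_SKIP_WORDS, or of two characters or fewer, are skipped. The rest are stemmed and interned.

# File lib/classifier-reborn/extensions/hasher.rb, line 32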
def word_hash_for_words(words)
  d = Hash.new(0)
  words.each do |word|
    word.downcase!
    if ! CORPUS_SKIP_WORDS.include?(word) && word.length > 2
      d[word.stem.intern] += 1
    end
  end
  return d
end
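
The filtering rules can be tried out with a simplified variant. In the sketch below, count_words and its skip_words default are hypothetical stand-ins (the contents of CORPUS_SKIP_WORDS are not shown on this page), stemming is omitted because it needs an external stemmer, and the input strings are not mutated.

```ruby
# Simplified, hypothetical variant of word_hash_for_words:
# no stemming, and a made-up skip list in place of CORPUS_SKIP_WORDS.
def count_words(words, skip_words: %w[the and with])
  counts = Hash.new(0)
  words.each do |word|
    word = word.downcase # non-mutating, unlike downcase! in the original
    # Skip stop words and anything of two characters or fewer.
    next if skip_words.include?(word) || word.length <= 2
    counts[word.intern] += 1
  end
  counts
end

counts = count_words(%w[The cat and THE Cat sat on it])

p counts[:cat]  # => 2
p counts[:sat]  # => 1
p counts.size   # => 2
```

Using a non-mutating downcase is a design choice of this sketch only: the original calls downcase! and so alters the strings it is given.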

Generated with the Darkfish Rdoc Generator 2.