Parent: String

Author: Lucas Carlson (lucas@rufy.com)
Copyright: Copyright (c) 2005 Lucas Carlson
License: LGPL


These are extensions to the String class to provide convenience methods for the Classifier package.

Constants

CORPUS_SKIP_WORDS

Public Instance Methods

clean_word_hash()

Returns a word hash without extra punctuation or short symbols, containing only stemmed words.

# File lib/classifier/extensions/word_hash.rb, line 24
def clean_word_hash
  word_hash_for_words(gsub(/[^\w\s]/, "").split)
end
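
As a rough illustration of the tokenization step in clean_word_hash, the sketch below applies the same gsub and split to a sample string. The helper name clean_tokens is hypothetical and not part of the Classifier package. The stemming and counting performed by word_hash_for_words are not shown on this page, so they are left out.

```ruby
# Hypothetical helper: mirrors only the gsub/split step of clean_word_hash.
# Stemming and counting (word_hash_for_words) are not reproduced here.
def clean_tokens(text)
  # Drop every character that is neither a word character nor whitespace,
  # then split the remainder on whitespace.
  text.gsub(/[^\w\s]/, "").split
end

p clean_tokens("Hello, world! It's (almost) done.")
# => ["Hello", "world", "Its", "almost", "done"]
```

Note that the apostrophe is deleted rather than replaced by a space, so "It's" becomes the single token "Its".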
paragraph_summary( count=1, separator=" [...] " )
# File lib/classifier/lsi/summary.rb, line 10
def paragraph_summary(count = 1, separator = " [...] ")
  perform_lsi(split_paragraphs, count, separator)
end
split_paragraphs()
# File lib/classifier/lsi/summary.rb, line 18
def split_paragraphs
  split(/(\n\n|\r\r|\r\n\r\n)/) # TODO: make this less primitive
end
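
Because the pattern in split_paragraphs is wrapped in a capture group, String#split keeps the matched separators in the returned array. The sketch below shows this with the same pattern in a standalone helper. The name paragraph_pieces is hypothetical and used only for illustration.

```ruby
# Hypothetical standalone helper using the same pattern as split_paragraphs.
def paragraph_pieces(text)
  # The parentheses form a capture group, so String#split returns the
  # separators as array elements alongside the paragraphs.
  text.split(/(\n\n|\r\r|\r\n\r\n)/)
end

p paragraph_pieces("First paragraph.\n\nSecond paragraph.")
# => ["First paragraph.", "\n\n", "Second paragraph."]
```

A caller that wants only the paragraph text would need to discard the separator elements, for example by rejecting strings that are empty after strip.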
split_sentences()
# File lib/classifier/lsi/summary.rb, line 14
def split_sentences
  split(/(\.|\!|\?)/) # TODO: make this less primitive
end
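
The TODO above is easiest to appreciate with a concrete input. The sketch below reuses the split_sentences pattern in a hypothetical helper named sentence_pieces. It shows that the terminators come back as separate elements and that any period, including a decimal point, counts as a sentence boundary.

```ruby
# Hypothetical standalone helper using the same pattern as split_sentences.
def sentence_pieces(text)
  # Splits on ".", "!" and "?"; the capture group keeps each terminator
  # as its own element in the result.
  text.split(/(\.|\!|\?)/)
end

p sentence_pieces("It works. Does it? Yes!")
# => ["It works", ".", " Does it", "?", " Yes", "!"]

p sentence_pieces("Pi is 3.14")
# => ["Pi is 3", ".", "14"]
```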
summary( count=10, separator=" [...] " )
# File lib/classifier/lsi/summary.rb, line 6
def summary(count = 10, separator = " [...] ")
  perform_lsi(split_sentences, count, separator)
end
without_punctuation()

Removes common punctuation symbols, returning a new string. Most symbols are replaced with a space; apostrophes and hyphens are deleted outright. E.g.,

"Hello (greeting's), with {braces} < >...?".without_punctuation
=> "Hello  greetings   with  braces         "

# File lib/classifier/extensions/word_hash.rb, line 13
def without_punctuation
  tr(',?.!;:"@#$%^&*()_=+[]{}\|<>/`~', " ").tr("'\-", "")
end
word_hash()

Returns a Hash of strings => ints. Each word in the string is stemmed and interned, and maps to its frequency in the document.

# File lib/classifier/extensions/word_hash.rb, line 19
def word_hash
  word_hash_for_words(gsub(/[^\w\s]/, "").split + gsub(/[\w]/, " ").split)
end
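
Unlike clean_word_hash, word_hash passes two token lists to word_hash_for_words: the words with punctuation stripped, and the runs of punctuation left after blanking out every word character. The sketch below shows only that token-building step. The helper name word_hash_tokens is hypothetical, and the stemming, interning and counting done by word_hash_for_words are omitted because that method is not shown on this page.

```ruby
# Hypothetical helper: builds the same token list that word_hash hands to
# word_hash_for_words, without stemming, interning or counting.
def word_hash_tokens(text)
  words   = text.gsub(/[^\w\s]/, "").split # words with punctuation removed
  symbols = text.gsub(/[\w]/, " ").split   # punctuation runs left after blanking words
  words + symbols
end

p word_hash_tokens("Hello, world!")
# => ["Hello", "world", ",", "!"]
```

Adjacent punctuation characters stay together as one token, so an ellipsis such as "..." is returned as a single symbol.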
