class Ai4r::Classifiers::NaiveBayes

= Introduction

This is an implementation of a Naive Bayesian Classifier without any
specialisation (such as for text classification).
Probabilities P(a_i | v_j) are estimated using m-estimates, hence the
m parameter, which is set through set_parameters.
The estimation looks like this:

(n_c + mp) / (n + m)

the variables are:
n = the number of training examples for which v = v_j
n_c = number of examples for which v = v_j and a = a_i
p = a priori estimate for P(a_i | v_j)
m = the equivalent sample size

The classifier stores the conditional probabilities in an array named @pcp, indexed in this form:
@pcp[attributes][values][classes]

This kind of estimator is useful when the training data set is relatively small.
If the data set is big enough, set m to 0, which is also the default value.
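
The m-estimate above can be sketched in a few lines of Ruby. The helper name m_estimate and the sample numbers are assumptions made for illustration; they are not part of Ai4r.

```ruby
# Hypothetical helper, not part of Ai4r: computes the m-estimate
# (n_c + m * p) / (n + m) from the formula above.
def m_estimate(n_c, n, p, m)
  (n_c + m * p).to_f / (n + m)
end

# With m = 0 the estimate is the raw relative frequency n_c / n.
raw = m_estimate(2, 5, 0.5, 0)

# With m = 3 and p = 0.5 the estimate is pulled toward the prior p.
smoothed = m_estimate(2, 5, 0.5, 3)

# A value never seen with this class (n_c = 0) still gets a non-zero estimate.
unseen = m_estimate(0, 5, 0.5, 3)
```

With m = 0 a count of zero gives a probability of zero, which wipes out the whole product for that class; a positive m avoids this on small data sets.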

For further details on Bayes' theorem and the Naive Bayes classifier, see:
http://en.wikipedia.org/wiki/Naive_Bayesian_classification
http://en.wikipedia.org/wiki/Bayes%27_theorem

= Parameters

* :m => Optional. Default value is set to 0. It may be set to a value greater than 0 when
the size of the dataset is relatively small

= How to use it

  data = DataSet.new.load_csv_with_labels "bayes_data.csv"
  b = NaiveBayes.new.
    set_parameters({:m=>3}).
    build data
  b.eval(["Red", "SUV", "Domestic"])

Public Class Methods

new()
# File lib/ai4r/classifiers/naive_bayes.rb, line 63
def initialize
  @m = 0
  @class_counts = []
  @class_prob = [] # stores the probability of the classes
  @pcc = [] # stores the number of instances divided into attribute/value/class
  @pcp = [] # stores the conditional probabilities of the values of an attribute
  @klass_index = {} # hashmap for quick lookup of all the used klasses and their indices
  @values = {} # hashmap for quick lookup of all the values
end

Public Instance Methods

build(data)

counts the values of the attribute instances and calculates the class probabilities and the conditional probabilities. The data parameter has to be an instance of Ai4r::Data::DataSet.

# File lib/ai4r/classifiers/naive_bayes.rb, line 105
def build(data)
  raise 'Error instance must be passed' unless data.is_a?(Ai4r::Data::DataSet)
  raise 'Data should not be empty' if data.data_items.length == 0

  initialize_domain_data(data)
  initialize_klass_index
  initialize_pc
  calculate_probabilities

  self
end
eval(data)

You can evaluate new data, predicting its category. e.g.

b.eval(["Red", "SUV", "Domestic"])
  => 'No'
# File lib/ai4r/classifiers/naive_bayes.rb, line 77
def eval(data)
  prob = @class_prob.dup
  prob = calculate_class_probabilities_for_entry(data, prob)
  index_to_klass(prob.index(prob.max))
end
get_probability_map(data)

Calculates the class probabilities for the entry data. data has to be an array of the same dimension as the training data minus the class column. Returns a map containing all classes as keys: {Class_1 => probability, Class_2 => probability2 … } Each probability is <= 1 and of type Float. e.g.

b.get_probability_map(["Red", "SUV", "Domestic"])
  => {"Yes"=>0.4166666666666667, "No"=>0.5833333333333334}
# File lib/ai4r/classifiers/naive_bayes.rb, line 92
def get_probability_map(data)
  prob = @class_prob.dup
  prob = calculate_class_probabilities_for_entry(data, prob)
  prob = normalize_class_probability prob
  probability_map = {}
  prob.each_with_index { |p, i| probability_map[index_to_klass(i)] = p }

  probability_map
end
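
The shape of the returned map can be illustrated without the library. The sketch below uses made-up, unnormalised class scores and a made-up class index; it only mirrors the final steps of get_probability_map (normalise, then key each probability by class name).

```ruby
# Toy values, invented for illustration only.
klass_index = { 'Yes' => 0, 'No' => 1 } # class name => index, like @klass_index
scores = [0.02, 0.06]                   # unnormalised score per class index

# Divide every score by the total so the probabilities sum to 1.
total = scores.inject(0.0) { |sum, s| sum + s }

probability_map = {}
scores.each_with_index do |score, i|
  probability_map[klass_index.key(i)] = score / total
end
```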

Private Instance Methods

build_array(index)

builds an array of the form: array[values][classes], with every entry initialised to 0

# File lib/ai4r/classifiers/naive_bayes.rb, line 186
def build_array(index)
  domains = Array.new(@domains[index].length)
  domains.map do
    Array.new @klasses.length, 0
  end
end
calculate_class_probabilities()
# File lib/ai4r/classifiers/naive_bayes.rb, line 213
def calculate_class_probabilities
  @data_items.each do |entry|
    @class_counts[klass_index(entry.klass)] += 1
  end

  @class_counts.each_with_index do |k, index|
    @class_prob[index] = k.to_f / @data_items.length
  end
end
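
As a rough standalone illustration of the same idea, the class priors are the class counts divided by the number of training entries. The labels below are invented, and the sketch uses hashes keyed by class name rather than the index-based arrays of the method above.

```ruby
# Invented class labels of five training entries.
labels = %w[Yes No No Yes No]

# Count how often each class occurs.
class_counts = Hash.new(0)
labels.each { |label| class_counts[label] += 1 }

# Prior probability of each class: count / number of entries.
class_prob = class_counts.transform_values { |count| count.to_f / labels.length }
```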
calculate_class_probabilities_for_entry(data, prob)

calculates the klass probability of a data entry. As usual, the prior probability of a class is multiplied by the conditional probability of every attribute value given that class; this is repeated for every class. Attribute values that were not seen during training are skipped.

# File lib/ai4r/classifiers/naive_bayes.rb, line 131
def calculate_class_probabilities_for_entry(data, prob)
  0.upto(prob.length - 1) do |prob_index|
    data.each_with_index do |att, index|
      next if value_index(att, index).nil?
      prob[prob_index] *= @pcp[index][value_index(att, index)][prob_index]
    end
  end
  
  prob
end
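
The multiplication can be tried out on a toy model. Everything below (the probabilities, the value indices and the helper name score_entry) is an assumption made for illustration; only the [attributes][values][classes] layout and the skipping of unknown values follow the method above.

```ruby
# Toy model, invented for illustration (not taken from any real data set).
class_prob = [0.5, 0.5]                  # prior for class index 0 and 1
values = [{ 'Red' => 0, 'Yellow' => 1 }, # attribute 0: value => index
          { 'SUV' => 0, 'Sports' => 1 }] # attribute 1: value => index
pcp = [
  [[0.6, 0.4], [0.4, 0.6]],              # attribute 0: pcp[0][value][class]
  [[0.2, 0.6], [0.8, 0.4]]               # attribute 1: pcp[1][value][class]
]

# Hypothetical helper: returns one unnormalised score per class.
def score_entry(entry, class_prob, values, pcp)
  class_prob.each_with_index.map do |prior, k|
    entry.each_with_index.inject(prior) do |acc, (att, a)|
      v = values[a][att]
      v.nil? ? acc : acc * pcp[a][v][k]  # unknown values leave the score unchanged
    end
  end
end

scores = score_entry(%w[Red SUV], class_prob, values, pcp)
with_unknown = score_entry(%w[Green SUV], class_prob, values, pcp)
```

Here scores holds 0.5 * 0.6 * 0.2 for class 0 and 0.5 * 0.4 * 0.6 for class 1, so class 1 wins. In with_unknown the value "Green" contributes no factor at all.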
calculate_conditional_probabilities()

calculates the conditional probability and stores it in the @pcp-array

# File lib/ai4r/classifiers/naive_bayes.rb, line 233
def calculate_conditional_probabilities
  @pcc.each_with_index do |attributes, a_index|
    attributes.each_with_index do |values, v_index|
      values.each_with_index do |klass, k_index|
        @pcp[a_index][v_index][k_index] = (klass.to_f + @m * @class_prob[k_index]) / (@class_counts[k_index] + @m)
      end
    end
  end
end
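
A standalone sketch of the same computation on invented counts may help to see what ends up in @pcp. It applies the expression from the method above to a single attribute with two values and two classes; all variable names and numbers are local to the example.

```ruby
m = 3                      # equivalent sample size
class_counts = [5, 5]      # training entries per class
class_prob = [0.5, 0.5]    # class priors, used as p in the m-estimate
pcc = [[[3, 2], [2, 3]]]   # counts: pcc[attribute][value][class]

# Turn every count into an m-estimated conditional probability.
pcp = pcc.map do |attribute|
  attribute.map do |value|
    value.each_with_index.map do |count, k|
      (count.to_f + m * class_prob[k]) / (class_counts[k] + m)
    end
  end
end
```

For example, pcp[0][0][0] is (3 + 3 * 0.5) / (5 + 3).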
calculate_probabilities()

counts the occurrences of each class, and the instances of each value of each attribute together with the assigned class. It also calculates the class probabilities and the conditional probabilities.

# File lib/ai4r/classifiers/naive_bayes.rb, line 205
def calculate_probabilities
  @klasses.each { |dl| @class_counts[klass_index(dl)] = 0 }

  calculate_class_probabilities
  count_instances
  calculate_conditional_probabilities
end
count_instances()

counts the instances of a certain value of a certain attribute and the assigned class

# File lib/ai4r/classifiers/naive_bayes.rb, line 224
def count_instances
  @data_items.each do |item|
    0.upto(@data_labels.length - 1) do |dl_index|
      @pcc[dl_index][value_index(item[dl_index], dl_index)][klass_index(item.klass)] += 1
    end
  end
end
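
The counting step can be sketched on a few invented rows. For brevity this sketch keys a Hash by [attribute index, value, class] instead of using the nested @pcc arrays, so it shows the idea rather than the library's data structure.

```ruby
# Invented training rows: attribute values followed by the class label.
rows = [%w[Red SUV Yes], %w[Red Sports No], %w[Yellow SUV No]]

# counts[[attribute_index, value, klass]] => number of matching rows
counts = Hash.new(0)
rows.each do |row|
  attributes = row[0...-1]
  klass = row.last
  attributes.each_with_index do |value, a_index|
    counts[[a_index, value, klass]] += 1
  end
end
```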
index_to_klass(index)

returns the name of the class for the given index, or nil when the index is not found

# File lib/ai4r/classifiers/naive_bayes.rb, line 156
def index_to_klass(index)
  @klass_index.has_value?(index) ? @klass_index.key(index) : nil
end
initialize_domain_data(data)
# File lib/ai4r/classifiers/naive_bayes.rb, line 119
def initialize_domain_data(data)
  @domains = data.build_domains
  @data_items = data.data_items.map { |item| DataEntry.new(item[0...-1], item.last) }
  @data_labels = data.data_labels[0...-1]
  @klasses = @domains.last.to_a
end
initialize_klass_index()

initializes @values and @klass_index; maps each value and each class to a unique index

# File lib/ai4r/classifiers/naive_bayes.rb, line 161
def initialize_klass_index
  @klasses.each_with_index do |dl, index|
    @klass_index[dl] = index
  end

  0.upto(@data_labels.length - 1) do |index|
    @values[index] = {}
    @domains[index].each_with_index do |d, d_index|
      @values[index][d] = d_index
    end
  end
end
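
The two lookup tables can be pictured with a small standalone sketch. The domains below are invented and held in plain arrays, with the class column last; the variable names mirror the instance variables above but are local to the example.

```ruby
# Invented domains: one list of possible values per column, class column last.
domains = [%w[Red Yellow], %w[SUV Sports], %w[Yes No]]

# Class name => index, built from the last domain.
klass_index = domains.last.each_with_index.to_h

# Per attribute column: value => index.
values = domains[0...-1].map { |domain| domain.each_with_index.to_h }
```

A lookup such as values[1]['Sports'] then yields the position of that value within its attribute, and a value that is not in the domain yields nil.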
initialize_pc()

initializes the two arrays for storing the counts and the conditional probabilities of the attributes

# File lib/ai4r/classifiers/naive_bayes.rb, line 195
def initialize_pc
  0.upto(@data_labels.length - 1) do |index|
    @pcc << build_array(index)
    @pcp << build_array(index)
  end
end
klass_index(klass)

returns the index of a class

# File lib/ai4r/classifiers/naive_bayes.rb, line 175
def klass_index(klass)
  @klass_index[klass]
end
normalize_class_probability(prob)

normalises the array of probabilities so the sum of the array equals 1

# File lib/ai4r/classifiers/naive_bayes.rb, line 143
def normalize_class_probability(prob)
  prob_sum = sum(prob)
  prob_sum > 0 ?
    prob.map { |prob_entry| prob_entry / prob_sum } :
    prob
end
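
The guard against a zero sum matters when every class score is 0, since dividing would then produce NaN. A minimal standalone sketch of that behaviour, with a hypothetical helper name normalize and invented inputs:

```ruby
# Hypothetical standalone version of the normalisation step.
def normalize(prob)
  total = prob.inject(0.0) { |sum, p| sum + p }
  # Only divide when there is something to normalise.
  total > 0 ? prob.map { |p| p / total } : prob
end

normalized = normalize([1, 3])     # scaled so the entries sum to 1
all_zero = normalize([0.0, 0.0])   # returned unchanged, no division by zero
```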
sum(array)

sums an array up; returns a number of type Float

# File lib/ai4r/classifiers/naive_bayes.rb, line 151
def sum(array)
  array.inject(0.0) { |b, i| b + i }
end
value_index(value, dl_index)

returns the index of a value, depending on the attribute index

# File lib/ai4r/classifiers/naive_bayes.rb, line 180
def value_index(value, dl_index)
  @values[dl_index][value]
end