Parse the KEGG ‘taxonomy’ file, which describes the taxonomic classification of organisms.
The KEGG ‘taxonomy’ file is available at
# File lib/bio/db/kegg/taxonomy.rb, line 26
def initialize(filename, orgs = [])
  # Stores the taxonomic tree as a linked list (implemented in Hash), so
  # every node needs to have a unique name (key) to work correctly
  @tree = Hash.new

  # Also stores the taxonomic tree as a list of arrays (full path)
  @path = Array.new

  # Also stores all leaf nodes (organism codes) of every intermediate node
  @leaves = Hash.new

  # tentative name for the root node (use accessor to change)
  @root = 'Genes'

  hier = Array.new
  level = 0
  label = nil

  # File.foreach closes the file once iteration finishes
  File.foreach(filename) do |line|
    next if line.strip.empty?

    # line for taxonomic hierarchy (indent according to the number of # marks)
    if line[/^#/]
      level = line[/^#+/].length
      label = line[/[A-Za-z].*/]
      hier[level] = sanitize(label)

    # line for organism name (unify different strains of a species)
    else
      tax, org, name, desc = line.chomp.split("\t")
      if orgs.nil? or orgs.empty? or orgs.include?(org)
        species, strain, = name.split('_')
        # (0) Grouping of the strains of the same species.
        #  If the name of species is the same as the previous line,
        #  add the species to the same species group.
        #   ex. Gamma/enterobacteria has a large number of organisms,
        #       so sub grouping of strains is needed for E.coli strains etc.
        #
        # However, if the species name is already used, need to avoid
        # collision of species name as the current implementation stores
        # the tree as a Hash, which may cause an infinite loop.
        #
        # (1) If species name == the intermediate node of other lineage
        #  Add '_sp' to the species name to avoid the conflict (1-1), and if
        #  'species_sp' is already taken, use 'species_strain' instead (1-2).
        #   ex. Bacteria/Proteobacteria/Beta/T.denitrificans/tbd
        #       Bacteria/Proteobacteria/Epsilon/T.denitrificans_ATCC33889/tdn
        #    -> Bacteria/Proteobacteria/Beta/T.denitrificans/tbd
        #       Bacteria/Proteobacteria/Epsilon/T.denitrificans_sp/tdn
        #
        # (2) If species name == the intermediate node of the same lineage
        #  Add '_sp' to the species name to avoid the conflict.
        #   ex. Bacteria/Cyanobacteria/Cyanobacteria_CYA/cya
        #       Bacteria/Cyanobacteria/Cyanobacteria_CYB/cya
        #       Bacteria/Proteobacteria/Magnetococcus/Magnetococcus_MC1/mgm
        #    -> Bacteria/Cyanobacteria/Cyanobacteria_sp/cya
        #       Bacteria/Cyanobacteria/Cyanobacteria_sp/cya
        #       Bacteria/Proteobacteria/Magnetococcus/Magnetococcus_sp/mgm
        sp_group = "#{species}_sp"
        if @tree[species]
          if hier[level+1] == species
            # case (0)
          else
            # case (1-1)
            species = sp_group
            # case (1-2)
            if @tree[sp_group] and hier[level+1] != species
              species = name
            end
          end
        else
          if hier[level] == species
            # case (2)
            species = sp_group
          end
        end
        # 'hier' is an array of the taxonomic tree + species and strain name.
        #  ex. [nil, Eukaryotes, Fungi, Ascomycetes, Saccharomycetes] +
        #      [S_cerevisiae, sce]
        hier[level+1] = species # sanitize(species)
        hier[level+2] = org
        ary = hier[1, level+2]
        warn ary.inspect if $DEBUG
        add_to_tree(ary)
        add_to_leaves(ary)
        add_to_path(ary)
      end
    end
  end
  return tree
end
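The line-by-line parsing above may be easier to follow on a tiny input. The following is a minimal standalone sketch, not the library's implementation: `parse_paths` and `SAMPLE` are hypothetical names, the sample lines (including the `T1`/`T2` identifiers) are made up for illustration, and the `sanitize` call and the species-name collision handling are deliberately left out.

```ruby
# Made-up sample in the layout the parser reads: '#'-prefixed hierarchy
# lines and tab-separated organism lines (tax, org, name, desc).
SAMPLE = <<~TEXT
  # Eukaryotes
  ## Fungi
  T1\tsce\tS.cerevisiae\tSaccharomyces cerevisiae
  ## Plants
  T2\tath\tA.thaliana\tArabidopsis thaliana
TEXT

# Hypothetical helper: returns one full path per organism line.
def parse_paths(text)
  hier = []
  level = 0
  paths = []
  text.each_line do |line|
    next if line.strip.empty?

    if line.start_with?('#')
      # The number of leading '#' marks gives the depth of the label.
      level = line[/^#+/].length
      hier[level] = line[/[A-Za-z].*/]
    else
      _tax, org, name, _desc = line.chomp.split("\t")
      species, = name.split('_')
      hier[level + 1] = species
      hier[level + 2] = org
      # Slice from index 1 because hier[0] is never assigned.
      paths << hier[1, level + 2]
    end
  end
  paths
end

parse_paths(SAMPLE).each { |path| puts path.join(' / ') }
```

Each organism line yields the labels currently recorded for levels 1 to `level`, followed by the species name and the organism code.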
Add a new path [node, subnode, subsubnode, …, leaf] under the root node and store the leaf node in an Array for every node on the path.
# File lib/bio/db/kegg/taxonomy.rb, line 140
def add_to_leaves(ary)
  leaf = ary.last
  ary.each do |node|
    @leaves[node] ||= Array.new
    @leaves[node] << leaf
  end
end
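To illustrate what the leaves index ends up holding, here is a small standalone sketch. It assumes a variant of the method that takes the Hash as an explicit argument instead of using `@leaves`, so it runs without the library; the two paths are illustrative.

```ruby
# Standalone variant: the Hash is passed in rather than read from @leaves.
def add_to_leaves(leaves, ary)
  leaf = ary.last
  ary.each do |node|
    leaves[node] ||= []
    leaves[node] << leaf
  end
  leaves
end

leaves = {}
add_to_leaves(leaves, %w[Eukaryotes Fungi S.cerevisiae sce])
add_to_leaves(leaves, %w[Eukaryotes Plants A.thaliana ath])

p leaves['Eukaryotes'] # => ["sce", "ath"]
p leaves['Fungi']      # => ["sce"]
p leaves['sce']        # => ["sce"]
```

Note that the leaf is also indexed under its own name, because the loop covers every element of the path.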
Add a new path [node, subnode, subsubnode, …, leaf] under the root node and store the path itself in an Array.
# File lib/bio/db/kegg/taxonomy.rb, line 150
def add_to_path(ary)
  @path << ary
end
Add a new path [node, subnode, subsubnode, …, leaf] under the root node; every intermediate node stores its child nodes as a Hash.
# File lib/bio/db/kegg/taxonomy.rb, line 129
def add_to_tree(ary)
  parent = @root
  ary.each do |node|
    @tree[parent] ||= Hash.new
    @tree[parent][node] = nil
    parent = node
  end
end
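The resulting structure is a flat Hash in which each node name maps to a Hash of its children, which is why node names have to be unique. A minimal standalone sketch, assuming a variant that takes the tree and root name as arguments instead of using `@tree` and `@root`:

```ruby
# Standalone variant: tree and root are passed in explicitly.
def add_to_tree(tree, root, ary)
  parent = root
  ary.each do |node|
    tree[parent] ||= {}
    tree[parent][node] = nil
    parent = node
  end
  tree
end

tree = {}
add_to_tree(tree, 'Genes', %w[Eukaryotes Fungi sce])
add_to_tree(tree, 'Genes', %w[Eukaryotes Plants ath])

p tree['Genes'].keys      # => ["Eukaryotes"]
p tree['Eukaryotes'].keys # => ["Fungi", "Plants"]
p tree.key?('sce')        # => false
```

Leaf nodes never become keys of the outer Hash, so a missing entry is what marks a node as a leaf.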
Compact the intermediate nodes of the resulting taxonomic tree.
- If a child node has only one child node (a grandchild), the children of that grandchild become the children of the child node. ex. Plants / Monocotyledons / grass family / osa --> Plants / Monocotyledons / osa
# File lib/bio/db/kegg/taxonomy.rb, line 161
def compact(node = root)
  # if the node has children
  if subnodes = @tree[node]
    # obtain grandchildren for each child
    subnodes.keys.each do |subnode|
      if subsubnodes = @tree[subnode]
        # if the number of grandchild nodes is 1
        if subsubnodes.keys.size == 1
          # obtain the name of the grandchild node
          subsubnode = subsubnodes.keys.first
          # obtain the children of the grandchild node
          if subsubsubnodes = @tree[subsubnode]
            # make the children of the grandchild node children of the child node
            @tree[subnode] = subsubsubnodes
            # delete grandchild node
            @tree[subnode].delete(subsubnode)
            warn "--- compact: #{subsubnode} is replaced by #{subsubsubnodes}" if $DEBUG
            # re-examine this child in case its new grandchild also needs
            # compacting ('retry' is only valid inside rescue in Ruby 1.9+)
            redo
          end
        end
      end
      # repeat recursively
      compact(subnode)
    end
  end
end
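The compaction rule can be tried on a plain Hash. This is a standalone sketch under a few assumptions: the method takes the tree as an argument instead of using `@tree`, it uses `redo` to re-examine the same child after a splice, and the sample lineages (the Eudicotyledons branch in particular) are illustrative.

```ruby
# Standalone sketch of the compaction step on a plain Hash.
def compact(tree, node)
  subnodes = tree[node]
  return unless subnodes

  subnodes.keys.each do |subnode|
    subsubnodes = tree[subnode]
    if subsubnodes && subsubnodes.size == 1
      only_child = subsubnodes.keys.first
      if (grandchildren = tree[only_child])
        # Splice out the single intermediate node, then re-check this child.
        tree[subnode] = grandchildren
        redo
      end
    end
    compact(tree, subnode)
  end
end

tree = {
  'Genes' => { 'Plants' => nil },
  'Plants' => { 'Monocotyledons' => nil, 'Eudicotyledons' => nil },
  'Monocotyledons' => { 'grass family' => nil },
  'grass family' => { 'osa' => nil },
  'Eudicotyledons' => { 'ath' => nil }
}
compact(tree, 'Genes')

p tree['Monocotyledons'].keys # => ["osa"]
p tree['Plants'].keys         # => ["Monocotyledons", "Eudicotyledons"]
```

Plants keeps both children because it has more than one; only the single-child link through 'grass family' is removed. The spliced-out node's own entry stays in the Hash but is no longer reachable from the root.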
Traverse the taxonomic tree by depth-first search, starting from the given (root or intermediate) node.
# File lib/bio/db/kegg/taxonomy.rb, line 224
def dfs(parent, &block)
  if children = @tree[parent]
    yield parent, children
    children.keys.each do |child|
      dfs(child, &block)
    end
  end
end
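A short standalone sketch of the traversal order, assuming a variant that takes the tree Hash as an argument instead of using `@tree`; the sample tree is illustrative. Only nodes that have children are yielded, so leaves appear as keys of the `children` Hash rather than as `parent`.

```ruby
# Standalone variant of the depth-first traversal on a plain Hash.
def dfs(tree, parent, &block)
  children = tree[parent]
  return unless children

  yield parent, children
  children.each_key { |child| dfs(tree, child, &block) }
end

TREE = {
  'Genes' => { 'Eukaryotes' => nil },
  'Eukaryotes' => { 'Fungi' => nil, 'Plants' => nil },
  'Fungi' => { 'sce' => nil },
  'Plants' => { 'ath' => nil }
}.freeze

dfs(TREE, 'Genes') do |parent, children|
  puts "#{parent}: #{children.keys.join(', ')}"
end
```

With this sample the block sees Genes, Eukaryotes, Fungi and Plants, in that order.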
Similar to the dfs method, but also passes the current nesting level to the iterator.
# File lib/bio/db/kegg/taxonomy.rb, line 235
def dfs_with_level(parent, &block)
  @level ||= 0
  if children = @tree[parent]
    yield parent, children, @level
    @level += 1
    children.keys.each do |child|
      dfs_with_level(child, &block)
    end
    @level -= 1
  end
end
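One use of the level is to render the tree as an indented outline. The sketch below is a standalone variant that passes the level as a method argument instead of keeping it in `@level`, which is a different design from the listing above; `outline` is a hypothetical helper and the sample tree is illustrative.

```ruby
# Standalone variant: the nesting level travels as an argument.
def dfs_with_level(tree, parent, level = 0, &block)
  children = tree[parent]
  return unless children

  yield parent, children, level
  children.each_key { |child| dfs_with_level(tree, child, level + 1, &block) }
end

# Hypothetical helper: one indented line per node, leaves included.
def outline(tree, root)
  lines = []
  dfs_with_level(tree, root) do |parent, children, level|
    lines << "#{'  ' * level}#{parent}"
    children.each_key do |child|
      lines << "#{'  ' * (level + 1)}#{child}" unless tree[child]
    end
  end
  lines
end

TREE = {
  'Genes' => { 'Eukaryotes' => nil },
  'Eukaryotes' => { 'Fungi' => nil, 'Plants' => nil },
  'Fungi' => { 'sce' => nil },
  'Plants' => { 'ath' => nil }
}.freeze

puts outline(TREE, 'Genes')
```

Because leaves are never yielded as `parent`, the helper prints them from the `children` Hash of their parent.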
# File lib/bio/db/kegg/taxonomy.rb, line 123
def organisms(group)
  @leaves[group]
end
Reduce the leaf nodes of the resulting taxonomic tree.
- If a parent node has only one leaf node, replace the parent node with that leaf node. ex. Plants / Monocotyledons / osa --> Plants / osa
# File lib/bio/db/kegg/taxonomy.rb, line 196
def reduce(node = root)
  # if the node has children
  if subnodes = @tree[node]
    # obtain grandchildren for each child
    subnodes.keys.each do |subnode|
      if subsubnodes = @tree[subnode]
        # if the number of grandchild node is 1
        if subsubnodes.keys.size == 1
          # obtain the name of the grandchild node
          subsubnode = subsubnodes.keys.first
          # if the grandchild node is a leaf node
          unless @tree[subsubnode]
            # make the grandchild node as a child node
            @tree[node].update(subsubnodes)
            # delete child node
            @tree[node].delete(subnode)
            warn "--- reduce: #{subnode} is replaced by #{subsubnode}" if $DEBUG
          end
        end
      end
      # repeat recursively
      reduce(subnode)
    end
  end
end
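The reduction rule can likewise be tried on a plain Hash. This standalone sketch assumes a variant that takes the tree as an argument instead of using `@tree`; the organism codes under Eudicotyledons are illustrative.

```ruby
# Standalone sketch of the reduction step on a plain Hash.
def reduce(tree, node)
  subnodes = tree[node]
  return unless subnodes

  subnodes.keys.each do |subnode|
    subsubnodes = tree[subnode]
    if subsubnodes && subsubnodes.size == 1
      only_child = subsubnodes.keys.first
      unless tree[only_child]
        # The only child is a leaf: hang it directly under the current node.
        tree[node].update(subsubnodes)
        tree[node].delete(subnode)
      end
    end
    reduce(tree, subnode)
  end
end

tree = {
  'Genes' => { 'Plants' => nil },
  'Plants' => { 'Monocotyledons' => nil, 'Eudicotyledons' => nil },
  'Monocotyledons' => { 'osa' => nil },
  'Eudicotyledons' => { 'ath' => nil, 'aly' => nil }
}
reduce(tree, 'Genes')

p tree['Plants'].keys.sort # => ["Eudicotyledons", "osa"]
```

Monocotyledons is replaced by its single leaf, while Eudicotyledons is kept because it has two leaves.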