
Module Halberd.clues.analysis

Utilities for clue analysis.
Function Summary
list analyze(clues)
Draw conclusions from the clues obtained during the scanning phase.
dict classify(seq, *classifiers)
Classify a sequence according to one or several criteria.
tuple clusters(clues, step)
Finds clusters of clues.
list deltas(xs)
Computes the differences between the elements of a sequence of integers.
list diff_fields(clues)
Study differences between fields.
list filter_proxies(clues, maxdelta)
Detect and merge clues pointing to a proxy cache on the remote end.
str get_digest(clue)
Returns the specified clue's digest.
int hits(clues)
Compute the total number of hits in a sequence of clues.
  ignore_changing_fields(clues)
Tries to detect and ignore MIME fields with ever changing content.
Clue merge(clues)
Merges a sequence of clues into one.
  reanalyze(clues, analyzed, threshold)
Identify and ignore changing header fields.
list sections(classified, sects)
Returns sections (and their items) from a nested dict.
list of slice slices(start, xs)
Returns slices of a given sequence separated by the specified indices.
  sort_clues(clues)
Sorts clues according to their time difference.
list uniq(clues)
Return a list of unique clues.

Function Details

analyze(clues)

Draw conclusions from the clues obtained during the scanning phase.
Parameters:
clues - Unprocessed clues obtained during the scanning stage.
           (type=list)
Returns:
Coherent list of clues identifying real web servers.
           (type=list)

classify(seq, *classifiers)

Classify a sequence according to one or several criteria.

Each item is stored in a nested dictionary, using the classifiers as key generators (all of them must be callable objects).

In the following example we classify a list of clues according to their digest and their time difference.
>>> a, b, c = Clue(), Clue(), Clue()
>>> a.diff, b.diff, c.diff = 1, 2, 2
>>> a.info['digest'] = 'x'
>>> b.info['digest'] = c.info['digest'] = 'y'
>>> get_diff = lambda x: x.diff
>>> classified = classify([a, b, c], get_digest, get_diff)
>>> digests = classified.keys()
>>> digests.sort()  # We sort these so doctest won't fail.
>>> for digest in digests:
...     print digest
...     for diff in classified[digest].keys():
...         print ' ', diff
...         for clue in classified[digest][diff]:
...             if clue is a: print '    a'
...             elif clue is b: print '    b'
...             elif clue is c: print '    c'
...

x
  1
    a
y
  2
    b
    c
Parameters:
seq - A sequence to classify.
           (type=list or tuple)
classifiers - A sequence of callables, each returning a specific field of the items contained in seq.
           (type=list or tuple)
Returns:
A nested dictionary in which the keys are the fields obtained by applying the classifiers to the items in the specified sequence.
           (type=dict)
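Based solely on the behavior shown in the doctest above, classify could be sketched as follows. This is a hypothetical re-implementation, not the actual Halberd source: it nests one dictionary level per classifier and accumulates the items themselves at the innermost level.

```python
def classify(seq, *classifiers):
    """Sketch: build a nested dict keyed by each classifier's return value."""
    classified = {}
    for item in seq:
        node = classified
        # Every classifier but the last adds one level of nesting.
        for classifier in classifiers[:-1]:
            node = node.setdefault(classifier(item), {})
        # The innermost level accumulates the matching items in a list.
        node.setdefault(classifiers[-1](item), []).append(item)
    return classified
```

With a single classifier this degenerates to a plain grouping, e.g. classify([0, 1, 2, 3], lambda x: x % 2) yields {0: [0, 2], 1: [1, 3]}.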

clusters(clues, step=3)

Finds clusters of clues.

A cluster is a group of at most step clues whose time differences differ by at most 1 second from one another.
Parameters:
clues - A sequence of clues to analyze
           (type=list or tuple)
step - Maximum difference between the time differences of the cluster's clues.
           (type=int)
Returns:
A sequence with merged clusters.
           (type=tuple)
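The grouping step described above can be sketched on plain integers. This is an illustrative helper (the name cluster_diffs and the exact semantics are assumptions, not Halberd's code): sorted time differences are collected into runs where consecutive values are at most 1 second apart, capped at step values per run; the real function would then merge each run's clues.

```python
def cluster_diffs(diffs, step=3):
    """Sketch: group sorted time differences into runs where consecutive
    values differ by at most 1, with at most `step` values per run."""
    groups = []
    for d in sorted(diffs):
        # Extend the current run only if it has room and d is close enough.
        if groups and len(groups[-1]) < step and d - groups[-1][-1] <= 1:
            groups[-1].append(d)
        else:
            groups.append([d])
    return groups
```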

deltas(xs)

Computes the differences between the elements of a sequence of integers.
>>> deltas([-1, 0, 1])
[1, 1]

>>> deltas([1, 1, 2, 3, 5, 8, 13])
[0, 1, 1, 2, 3, 5]
Parameters:
xs - A sequence of integers.
           (type=list)
Returns:
A list of differences between consecutive elements of xs.
           (type=list)
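The doctests above fully pin down the behavior, so a minimal implementation sketch is straightforward: pairwise differences between consecutive elements.

```python
def deltas(xs):
    """Differences between consecutive elements of a sequence of integers."""
    return [b - a for a, b in zip(xs, xs[1:])]
```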

diff_fields(clues)

Study differences between fields.
Parameters:
clues - Clues to analyze.
           (type=list)
Returns:
Fields which were found to be different among the analyzed clues.
           (type=list)

filter_proxies(clues, maxdelta=3)

Detect and merge clues pointing to a proxy cache on the remote end.
Parameters:
clues - Sequence of clues to analyze
           (type=list)
maxdelta - Maximum difference allowed between a clue's time difference and the previous one.
           (type=int)
Returns:
Sequence in which irrelevant clues pointing to proxy caches have been filtered out.
           (type=list)

get_digest(clue)

Returns the specified clue's digest.

This function is usually passed as a parameter to classify so it can separate clues according to their digest (among other fields).
Returns:
The digest of a clue's parsed headers.
           (type=str)

hits(clues)

Compute the total number of hits in a sequence of clues.
Parameters:
clues - Sequence of clues.
           (type=list)
Returns:
Total hits.
           (type=int)

ignore_changing_fields(clues)

Tries to detect and ignore MIME fields with ever changing content.

Some servers include fields whose content varies with time, randomly, etc. Such fields are likely to alter the clue's digest and interfere with analyze, producing many false positives and rendering the scan useless. This function detects those fields and recalculates each clue's digest so the clues can be safely analyzed again.
Parameters:
clues - Sequence of clues.
           (type=list or tuple)
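One plausible way to detect such fields, sketched below on plain header dictionaries rather than Clue objects (the helper name, the dict representation, and the threshold are all assumptions for illustration): a field whose value differs across most of the clues is treated as changing.

```python
from collections import defaultdict

def changing_fields(headers_per_clue, threshold=0.8):
    """Sketch: flag MIME fields whose value differs in most clues.

    headers_per_clue is a list of {field: value} dicts, one per clue.
    """
    values = defaultdict(set)
    for headers in headers_per_clue:
        for field, value in headers.items():
            values[field].add(value)
    total = len(headers_per_clue)
    # A field taking a distinct value in >= threshold of the clues is
    # considered ever-changing and should be excluded from the digest.
    return [f for f, vs in values.items() if len(vs) / total >= threshold]
```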

merge(clues)

Merges a sequence of clues into one.

A new clue will store the total count of the clues.

Note that each Clue starts with a count of 1:
>>> a, b, c = Clue(), Clue(), Clue()
>>> sum([x.getCount() for x in [a, b, c]])
3

>>> a.incCount(5), b.incCount(11), c.incCount(23)
(None, None, None)

>>> merged = merge((a, b, c))
>>> merged.getCount()
42

>>> merged == a
True
Parameters:
clues - A sequence containing all the clues to merge into one.
           (type=list or tuple)
Returns:
The result of merging all the passed clues into one.
           (type=Clue)
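The counting behavior in the doctest can be reproduced with a minimal stand-in. The Clue class below is a hypothetical stub (the real one lives elsewhere in Halberd.clues and carries digests, headers, etc.); merge simply copies the first clue and assigns it the aggregate count.

```python
import copy

class Clue:
    """Hypothetical minimal stand-in for Halberd's Clue class."""
    def __init__(self):
        self._count = 1          # each clue starts with a count of 1

    def getCount(self):
        return self._count

    def incCount(self, n=1):
        self._count += n

def merge(clues):
    """Sketch: copy the first clue and give it the total count."""
    merged = copy.copy(clues[0])
    merged._count = sum(c.getCount() for c in clues)
    return merged
```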

reanalyze(clues, analyzed, threshold)

Identify and ignore changing header fields.

After the initial analysis one must check that there aren't as many real servers as obtained clues. If there were, it could be a sign that something is wrong: each clue differs from the others because one or more MIME header fields change unexpectedly.
Parameters:
clues - Raw sequence of clues.
           (type=list)
analyzed - Result from the first analysis phase.
           (type=list)
threshold - Minimum clue-to-realserver ratio in order to trigger field inspection.
           (type=float)

sections(classified, sects=None)

Returns sections (and their items) from a nested dict.

See also: classify
Parameters:
classified - Nested dictionary.
           (type=dict)
sects - Accumulator used by the recursion. It should not be specified by the caller.
           (type=list)
Returns:
A list of lists, where each item is a subsection of the nested dictionary.
           (type=list)

slices(start, xs)

Returns slices of a given sequence separated by the specified indices.

If we wanted the slices necessary to split range(20) into sub-sequences of 5 items each, we'd do:
>>> seq = range(20) 
>>> indices = [5, 10, 15]
>>> for piece in slices(0, indices):
...     print seq[piece]
[0, 1, 2, 3, 4]
[5, 6, 7, 8, 9]
[10, 11, 12, 13, 14]
[15, 16, 17, 18, 19]
Parameters:
start - Index of the first element of the sequence we want to partition.
           (type=int)
xs - Sequence of indexes where 'cuts' must be made.
           (type=list)
Returns:
A sequence of slice objects suitable for splitting a list as specified.
           (type=list of slice)
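Judging from the doctest (four pieces from three cut indices), the function builds one slice per interval plus a final open-ended slice. A minimal sketch consistent with that output, assuming nothing beyond the documented behavior:

```python
def slices(start, xs):
    """Sketch: slice objects covering [start, xs[0]), [xs[0], xs[1]), ...,
    plus a final open-ended slice from the last cut index onward."""
    bounds = [start] + list(xs)
    result = [slice(a, b) for a, b in zip(bounds, bounds[1:])]
    result.append(slice(bounds[-1], None))
    return result
```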

sort_clues(clues)

Sorts clues according to their time difference.

uniq(clues)

Return a list of unique clues.

This is needed when merging clues coming from different sources. Clues with the same time diff and digest are not discarded, they are merged into one clue with the aggregated number of hits.
Parameters:
clues - A sequence containing the clues to analyze.
           (type=list)
Returns:
Filtered sequence of clues where no clue has the same digest and time difference.
           (type=list)
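The merge-on-duplicate behavior described above can be sketched on plain dictionaries. This is an illustration only (real clues are Clue objects, not dicts, and the 'count' key is an assumption standing in for the hit count): clues sharing a (digest, diff) pair are collapsed into one entry with their hits aggregated.

```python
def uniq(clues):
    """Sketch: deduplicate clues on (digest, diff), aggregating hit counts.

    Each clue is represented here as a dict with 'digest', 'diff'
    and 'count' keys.
    """
    seen = {}
    for clue in clues:
        key = (clue['digest'], clue['diff'])
        if key in seen:
            seen[key]['count'] += clue['count']   # aggregate the hits
        else:
            seen[key] = dict(clue)                # keep an independent copy
    return list(seen.values())
```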

Generated by Epydoc 2.1 on Wed Jul 18 22:25:57 2007 http://epydoc.sf.net