DataparkSearch Engine 4.46 reference manual: The Web searching software
Chapter 8. Searching documents
By default, DataparkSearch sorts results first by relevance and second by popularity rank.
The relevance of every found document is calculated as 100% multiplied by the cosine of the angle between the weight vector for the request and the weight vector for the found document. The number of vector coordinates is equal to the number of word forms in the search query multiplied by the number of sections defined in indexer.conf. Every coordinate corresponds to a word in the search query that matches one of the document sections. The value of this coordinate depends on the weight of that section, defined by the wf parameter (see Section 8.1.3), and on what the word is: exactly the same as in the search query, a word form of it, or a synonym. One more coordinate is equal to the average distance between the searched words in the document. For the query vector this coordinate is equal to 0.
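The cosine measure described above can be illustrated with a minimal sketch. The function name `relevance_percent` and the sample coordinate values are assumptions made for illustration only; the real coordinates depend on the wf weights, on word-form and synonym matching, and on the configure-time switches, none of which are modelled here.

```python
import math

def relevance_percent(query_vec, doc_vec):
    """Return 100 * cos(angle) between two equal-length weight vectors."""
    if len(query_vec) != len(doc_vec):
        raise ValueError("vectors must have the same number of coordinates")
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    norm_d = math.sqrt(sum(d * d for d in doc_vec))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # no overlap can be measured against an all-zero vector
    return 100.0 * dot / (norm_q * norm_d)

# Hypothetical vectors: 2 query words x 2 sections, plus the distance coordinate.
query = [1.0, 1.0, 1.0, 1.0, 0.0]  # the distance coordinate is 0 for the query
doc_a = [1.0, 1.0, 1.0, 1.0, 0.0]  # every word matches in every section
doc_b = [1.0, 0.0, 1.0, 0.0, 0.5]  # matches in one section only, words further apart

rel_a = relevance_percent(query, doc_a)
rel_b = relevance_percent(query, doc_b)
```

In this sketch a document whose vector points in the same direction as the query vector scores 100, and any non-zero distance coordinate or missing section match lowers the score.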
Since sections are defined only in the indexer.conf file, use the NumSections command in searchd.conf or in search.htm to specify the number of sections used. By default, this value is 256. Note that NumSections does not affect document ordering, only the relevance value.
Table 8-3. Configure-time parameters to tune relevance calculation (switches for configure)
Option | Description |
---|---|
--enable-fullrel | Enables the full version of the relevance calculation. Default: disabled (the fast relevance calculation is used). |
--disable-reldistance | Disables accounting for the average word distance in the relevance calculation. By default, this accounting is enabled. |
--disable-relposition | Disables accounting for the position of the first query word in the relevance calculation. By default, this accounting is enabled. |
--disable-relwrdcount | Disables accounting for word counts in the relevance calculation. By default, this accounting is enabled. |
--with-bestpos=NUM | Specifies NUM as the best value of the first word position in a found document. Default: 4. |
--with-bestwrdcnt=NUM | Specifies NUM as the best number of occurrences of each query word in a found document. Default: 11. |
--with-distfactor=NUM | Specifies NUM as the factor applied to the average word distance in the relevance calculation. Default: 0.2. |
--with-posfactor=NUM | Specifies NUM as the factor applied to the difference between the first query word position in a found document and the best position specified by the --with-bestpos option. Default: 0.5. |
--with-wrdcntfactor=NUM | Specifies NUM as the factor applied to the difference between the count of query words in a found document and the best value specified by the --with-bestwrdcnt option. Default: 0.4. |
--with-wrdunifactor=NUM | Specifies NUM as the factor applied to the deviation of query word counts from a uniform distribution. Default: 1.5. |
DataparkSearch supports two methods of popularity rank calculation. The method used in previous versions is called "Goo", and the new method is called "Neo". By default, the Goo method is used. To select the desired PopRank calculation method, use the PopRankMethod command:
PopRankMethod Neo
You need to enable link collection with the CollectLinks yes command in your indexer.conf file for the Neo method and for full functionality of the Goo method. This slows indexing down a bit. By default, link collection is not enabled.
If you place the PopRankSkipSameSite yes command in the indexer.conf file, indexer will take only inter-site links (i.e. links from a page on one site to a page on another site) into account for popularity rank calculation.
You may assign an initial value for a page's popularity rank using the DP.PopRank META tag (see Section 4.3).
For the Goo method, the popularity rank calculation is made in two stages. At the first stage, the value of the Weight parameter of every server is divided by the number of links from this server. This gives the weight of one link from this server. At the second stage, for every page the weights of all links pointing to this page are summed. This sum is the popularity rank of the page.
By default, the value of the Weight parameter is equal to 1 for all indexed servers. You may change this value with the Weight command in the indexer.conf file, or directly in the server table if you load the server configuration from this table.
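The two-stage calculation above can be sketched as follows. This is only an illustration of the arithmetic: the function name `link_sum_poprank`, the `(source, target)` link pairs and the `site_of` mapping are hypothetical and do not reflect how indexer stores links internally.

```python
from collections import defaultdict

def link_sum_poprank(links, site_of, site_weight=None):
    """links: iterable of (source_url, target_url); site_of: url -> server name.

    site_weight maps a server name to its Weight value (1 when not given).
    """
    site_weight = site_weight or {}
    links = list(links)

    # Stage 1: weight of one link = server Weight / number of links from that server.
    out_count = defaultdict(int)
    for src, _ in links:
        out_count[site_of[src]] += 1
    link_weight = {site: site_weight.get(site, 1.0) / count
                   for site, count in out_count.items()}

    # Stage 2: rank of a page = sum of the weights of all links pointing to it.
    rank = defaultdict(float)
    for src, dst in links:
        rank[dst] += link_weight[site_of[src]]
    return dict(rank)

# Hypothetical data: server "a" has two outgoing links, server "b" has one.
site_of = {"a/1": "a", "a/2": "a", "b/1": "b"}
links = [("a/1", "b/1"), ("a/2", "b/1"), ("b/1", "a/1")]

default_rank = link_sum_poprank(links, site_of)
boosted_rank = link_sum_poprank(links, site_of, {"a": 2.0})
```

With the default Weight of 1, each link from server "a" carries 0.5, so page "b/1" receives a rank of 1.0. Raising the Weight of "a" to 2 doubles the rank that "b/1" receives.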
If you place the PopRankFeedBack yes command in the indexer.conf file, indexer will calculate site weights before the page rank calculation. To do that, indexer calculates the sum of the popularity ranks of all pages from the same site. If this sum is greater than 1, the site weight is set to this sum; otherwise, the site weight is set to 1.
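A minimal sketch of the site weight rule above, assuming page ranks are available as a plain dictionary. The names `feedback_site_weights`, `page_rank` and `site_of` are hypothetical.

```python
def feedback_site_weights(page_rank, site_of):
    """Return site -> weight: the sum of page ranks if it exceeds 1, else 1."""
    totals = {}
    for url, rank in page_rank.items():
        site = site_of[url]
        totals[site] = totals.get(site, 0.0) + rank
    return {site: (total if total > 1 else 1.0) for site, total in totals.items()}

# Hypothetical ranks: site "a" sums to 0.5, site "b" sums to 2.5.
page_rank = {"a/1": 0.25, "a/2": 0.25, "b/1": 2.5}
site_of = {"a/1": "a", "a/2": "a", "b/1": "b"}
weights = feedback_site_weights(page_rank, site_of)
```

Here site "a" keeps the minimum weight of 1, while site "b" gets a weight equal to the sum of its page ranks.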
If you place the PopRankUseTracking yes command in the indexer.conf file, indexer will calculate the site weight as the number of tracked queries restricted to this site.
If you place the PopRankUseShowCnt yes command in the search.htm (or searchd.conf) file, then for every result shown to the user the corresponding url.shows value is increased by 1, provided the relevance of this result is greater than or equal to the value specified by the PopRankShowCntRatio command (default value is 25.0). If you place PopRankUseShowCnt yes in the indexer.conf file, indexer will add to the URL's PopularityRank the value of url.shows multiplied by the value specified by the PopRankShowCntWeight command (default value is 0.01).
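The show-count bookkeeping above amounts to two small steps, sketched here with in-memory dictionaries standing in for the url.shows column. The function names are hypothetical; the default arguments mirror the documented defaults of PopRankShowCntRatio and PopRankShowCntWeight.

```python
def record_shows(results, shows, ratio=25.0):
    """Search side: count a show for every result whose relevance >= ratio.

    results: list of (url, relevance) pairs shown to the user.
    shows:   dict url -> show count, updated in place and returned.
    """
    for url, relevance in results:
        if relevance >= ratio:
            shows[url] = shows.get(url, 0) + 1
    return shows

def add_show_bonus(pop_rank, shows, weight=0.01):
    """Indexer side: add shows * weight to each URL's popularity rank."""
    return {url: rank + shows.get(url, 0) * weight
            for url, rank in pop_rank.items()}

# Hypothetical data: "u1" is shown twice above the threshold, "u2" once below it.
shows = record_shows([("u1", 80.0), ("u2", 10.0), ("u1", 25.0)], {})
ranks = add_show_bonus({"u1": 1.0, "u2": 0.5}, shows)
```

In this example only "u1" accumulates shows, so only its rank is increased.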
The Neo method assumes that all pages are neurons and that links between pages are links between neurons. This makes it possible to use an error back-propagation algorithm to train this neural network. The popularity rank of a page is the activity level of the corresponding neuron.
You may use the PopRankNeoIterations command to specify the number of iterations of the Neo Popularity Rank calculation. Default value is 3.
By default, the Neo Popularity Rank is calculated along with indexing. To speed up indexing, you may postpone the Popularity Rank calculation using the PopRankPostpone command:
PopRankPostpone yes
Then you may calculate the Neo Popularity Rank after indexing in the same way as for the Goo method, i.e.: indexer -TR
Please note that for boolean searches with two or more words, you have to enter the operators (&, |, ~) explicitly. I.e. it is necessary to enter "a & book" instead of "a book" (without the quotation marks).
The Crosswords feature assigns the words between <a href="xxx"> and </a> also to the document this link leads to. It works in SQL database mode and is not supported in cache mode. To enable Crosswords, use the CrossWords yes command in indexer.conf and search.htm.
The Summary Extraction Algorithm (SEA) builds a summary from the three most relevant sentences of each indexed document, if the document consists of six or more sentences. To enable this feature, add this command to your sections.conf file:
Section sea x y
where x is the section number and y is the maximum length of this section's value. Leave y as 0 if you do not want to show the summary in result pages. If you specify a non-zero y, you may use the $(sea) meta-variable in your search template to show the summary in result pages.
Related configuration directives:
The SEASentenceMinLength command specifies the minimum length of a sentence to be used in summary construction by the SEA. Default value: 32.
The SEASentences command specifies the maximum number of sentences, each with a length greater than or equal to the value defined by the SEASentenceMinLength command, that are used in summary construction by the SEA. Default value: 64. Since the cost of calculating the summary with the SEA grows nonlinearly with the number of sentences (this affects only indexing), you may adjust this value according to the desired indexing performance.
This algorithm of automatic summary construction is based on ideas of Rada Mihalcea described in the paper Rada Mihalcea and Paul Tarau, An Algorithm for Language Independent Single and Multiple Document Summarization, in Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), Korea, October 2005.
Differences in DataparkSearch's SEA:
Initial weights for graph edges are calculated as a measure of similarity between the 3-gram distributions of the corresponding sentences.
All initial values for graph vertices are equal to the same initial value (1 / (number of sentences + 1) in the current implementation).
The Neo PopRank algorithm is used as the ranking algorithm to iterate the values assigned to the vertices.
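The graph initialisation described in the first two points can be sketched as follows. The source does not say which similarity measure is used or whether the 3-grams are taken over characters or words, so this sketch assumes character 3-grams compared by cosine similarity; all function names are hypothetical. The ranking iteration itself (Neo PopRank) is not shown.

```python
import math
from collections import Counter

def trigram_counts(sentence):
    """Count character 3-grams in a sentence (assumption: character-level)."""
    text = sentence.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def similarity(a, b):
    """Cosine similarity between two 3-gram count distributions (assumed measure)."""
    dot = sum(a[gram] * b[gram] for gram in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_sentence_graph(sentences):
    """Return (edges, vertices): edge weights per sentence pair, initial vertex values."""
    grams = [trigram_counts(s) for s in sentences]
    n = len(sentences)
    edges = {(i, j): similarity(grams[i], grams[j])
             for i in range(n) for j in range(i + 1, n)}
    vertices = [1.0 / (n + 1)] * n  # same initial value for every sentence
    return edges, vertices

sentences = [
    "The indexer collects links between pages.",
    "The indexer collects links between pages.",
    "Summaries are built from whole sentences.",
]
edges, vertices = build_sentence_graph(sentences)
```

Identical sentences get an edge weight of 1 under this assumed measure, and every vertex starts at 1 / (number of sentences + 1).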
After indexing a document collection with this section defined, you may use the $(sea) meta-variable in your template to show the summary for a search result.