SoftComplete Development

Fuzzy Search technique versus Stemming

As mentioned below, our full-text search and retrieval tools are based on the fuzzy principle. Fuzzy search is used there as a more advanced alternative to search and retrieval based on stemming. What is the difference between the two approaches?

When searching, a stemmer reduces each query word to its root form (stem) and returns the results that contain this stem. For example, for the query 'specially' a stemming algorithm will typically find "special", "specialize", "specializing" and other words that reduce to the same stem; exactly which forms are grouped together depends on the stemmer's rules. However, if the query word contains an accidental misspelling such as 'spesial' or 'spetial', a search engine based on stemming will show zero results. Stemmers can also miss legitimate forms: for the root "use", for example, a stemmer may match "user" and "useful" but not "using" or "usage".
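
The behaviour described above can be illustrated with a deliberately tiny suffix-stripping stemmer. This is only a sketch: the suffix list, the function names and the vocabulary are invented for the illustration, and real stemmers use far more elaborate rules.

```python
# Toy suffix-stripping stemmer (illustrative only, not a real stemming algorithm).
SUFFIXES = ("izing", "ized", "ize", "ly", "s")  # tried in this order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_search(query: str, vocabulary: list[str]) -> list[str]:
    """Return the vocabulary words whose stem equals the stem of the query."""
    target = toy_stem(query)
    return [word for word in vocabulary if toy_stem(word) == target]


VOCABULARY = ["special", "specially", "specialize", "specializing", "spectrum"]

print(stem_search("specially", VOCABULARY))  # all four "special..." forms
print(stem_search("spesial", VOCABULARY))    # [] because the stems differ
```

The misspelled query produces the stem "spesial", which equals no stem in the vocabulary, so nothing is returned.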

The fuzzy technique uses approximate full-text search and retrieval. It matches results for a search query regardless of word form or of spelling mistakes, wherever in the word they occur. It will therefore retrieve "special" whether your query is "spesial", "spetial" or "spizial". Results are ordered by relevance and degree of similarity.
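
One common way to implement approximate matching is edit (Levenshtein) distance. The sketch below is a generic illustration and not necessarily the algorithm our tools use; the function names, the vocabulary and the 0.7 threshold are assumptions chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuzzy_search(query: str, vocabulary: list[str], min_similarity: float = 0.7):
    """Return (word, similarity) pairs at or above the threshold, best first."""
    scored = [(word, similarity(query.lower(), word.lower())) for word in vocabulary]
    hits = [pair for pair in scored if pair[1] >= min_similarity]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


VOCABULARY = ["special", "specially", "spectrum", "user"]

for query in ("spesial", "spetial", "spizial"):
    print(query, fuzzy_search(query, VOCABULARY))
```

With this threshold each of the three misspelled queries retrieves "special", and lowering the threshold admits more distant words such as "specially".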

Another significant advantage of fuzzy search over stemming is that approximate matching can be applied to multilanguage search, while a stemmer cannot handle text in more than one language. Stemmers exist for most widely spoken languages, but each one works with a single language only, so it cannot be used to index text in another language. This is very inconvenient when working with texts that contain citations, passages, remarks and other material in different languages. It is also impossible to apply stemming-based full-text search, retrieval or indexing to text in a language for which no stemmer exists. Creating your own stemmer takes a lot of time and investment and requires profound linguistic knowledge. Furthermore, it is usually very hard to fit a stemming algorithm to the specifics of a language, because inflection rules differ greatly between languages while most stemmers rely on reducing a word to its root. In many languages inflection changes the root itself (for example "man" and "men" in English).

The language neutrality of the fuzzy technique makes it the best existing solution to this problem. Its full-text search, retrieval and indexing support different languages simultaneously, and there is no need to adapt it to a specific language or alphabet. Because they are based on Unicode, fuzzy solutions are linguistically universal. The only requirement is that the text be written from left to right.
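
To illustrate language neutrality, here is a minimal sketch that compares strings purely as sequences of Unicode characters, using the Dice coefficient over character bigrams. The technique is swapped in for the illustration and is not claimed to be the one used in our products; the function names and the word list are hypothetical.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Count overlapping two-character sequences of any Unicode text."""
    text = text.casefold()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient over character bigrams, in [0, 1]."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:  # both strings are shorter than two characters
        return 1.0 if a.casefold() == b.casefold() else 0.0
    overlap = sum((grams_a & grams_b).values())
    return 2 * overlap / total


def best_match(query: str, vocabulary: list[str]) -> str:
    """Return the vocabulary word most similar to the query."""
    return max(vocabulary, key=lambda word: dice_similarity(query, word))


# One vocabulary, three languages, no language-specific rules.
MIXED = ["special", "Straße", "библиотека"]

for query in ("spesial", "Strase", "библеотека"):
    print(query, "->", best_match(query, MIXED))
```

The same two functions handle English, German and Russian words because they never look at anything except the characters themselves.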


FuzzySearch Library

The FuzzySearch library functions perform both exact and approximate string comparison and matching, where approximate matching tolerates various kinds of mismatches, typing errors and misspellings. The degree of approximate matching is adjustable: you can set the minimum matching percentage a result must reach to be found and shown. All functions are optimized for search speed and oriented toward processing natural-language text. The library supports ANSI and Unicode strings.
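
The idea of an adjustable matching percentage can be sketched as follows. This is not the FuzzySearch library API: the example uses Python's standard difflib module as a stand-in, and the names match_percent, find_matches and min_percent are hypothetical.

```python
from difflib import SequenceMatcher


def match_percent(a: str, b: str) -> float:
    """Similarity of two strings as a percentage (100.0 means identical)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def find_matches(query: str, candidates: list[str], min_percent: float = 100.0):
    """Return (candidate, percent) pairs scoring at least min_percent, best first."""
    scored = [(c, match_percent(query, c)) for c in candidates]
    hits = [pair for pair in scored if pair[1] >= min_percent]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


WORDS = ["special", "specially", "spectrum"]

print(find_matches("special", WORDS))                  # exact matching
print(find_matches("spesial", WORDS))                  # exact matching, no hits
print(find_matches("spesial", WORDS, min_percent=80))  # approximate matching
print(find_matches("spesial", WORDS, min_percent=70))  # looser threshold, more hits
```

A threshold of 100 behaves like exact comparison, while lower values progressively admit results with more mismatches.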


AlphaTIX Library

AlphaTIX is a powerful, fast, scalable and easy-to-use full-text indexing and retrieval library that will meet your application's indexing and retrieval needs. AlphaTIX indexing technology provides high indexing performance, very fast query processing and the ability to index very large data sets in minimal time, even with minimal memory resources.

The main, unique feature that sets AlphaTIX apart from comparable developer solutions is its Fuzzy Search Technology (approximate search). It lets you give your application a search engine that retrieves text containing mistakes and mismatches and also determines its similarity percentage. You can enable or disable this feature: depending on your needs, you can require strict correspondence or set a custom similarity percentage for matching search results. By setting the similarity percentage you control the relevance and flexibility of your search.

Unlike other similar libraries, AlphaTIX does not use stemming. Instead it uses the fuzzy technique, which makes it possible to create search engines with cross-language retrieval support without integrating an additional stemmer for each language (as systems that use stemming require). This means that when you work with multilingual texts and documents you need no additional language support tools. Unicode and ANSI support gives you the flexibility to find not only regular English words but also words in many other languages, and to work with multiple character sets. Your retrieval system is thus not restricted to a single language or alphabet.

Thanks to its approximate search feature, AlphaTIX can flexibly index words and phrases in different languages within the same document. You can misspell or mistype your query, and it will not prevent you from finding the words or phrases you are looking for.

AlphaTIX's flexibility and the ability to set the similarity degree of search results let your users narrow a search or find a wide variety of words and phrases with varying similarity to the query. Built-in support for the vector space model makes it possible to rank these search results by their similarity to the query.
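
For readers unfamiliar with the vector space model, the following is a generic textbook sketch of ranking by cosine similarity between term-count vectors. How AlphaTIX implements the model internally is not described here, so every name and the sample documents below are assumptions made for the illustration.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Bag-of-words vector: term -> number of occurrences."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query: str, documents: list[str]) -> list[str]:
    """Return documents sharing at least one term with the query, most similar first."""
    query_vector = term_vector(query)
    scored = [(cosine(query_vector, term_vector(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored if score > 0]


DOCS = [
    "fuzzy search library",
    "full text indexing and retrieval library",
    "stemming algorithm for english",
]

print(rank("fuzzy library", DOCS))
```

The first document shares both query terms and so ranks above the second, which shares only one; the third shares none and is omitted.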
