If your are planning to implement a custom document library application using SQL Server and .NET, full-text search comes handy to implement content search on your document. Storing your documents in SQL Server FileSteam gives you many advantages. You can implement Google-like search on your document as well. In PART I of this post I tried to provide some insight into some of these features. In Part II, I will show you how to configure SQL Server to enable filestream and do full-text search and I will also include a sample .NET application.
SQL Server 2008 provides the functionality for applications and users to issue full-text queries against character-based data in SQL Server tables. SQL Server supports full text indexing on columns with any of the following data types: char, varchar, nchar, nvarchar, text, ntext, image, xml, varbinary, or varbinary(max). Full-text queries perform linguistic searches against text data in full-text indexes by operating on words and phrases based on rules of a particular language such as English. Full-text queries can include simple words and phrases or multiple forms of a word or phrase.
Full-text search supports more than 50 languages, such as English, Spanish, Chinese, Japanese, Arabic and German. For each supported language, SQL Server provides language-specific linguistic components, including a word breaker and stemmer and an empty thesaurus file. For each full-text language, SQL Server also provides a file in which you can optionally define language-specific synonyms to extend the scope of search queries (a thesaurus file).
For writing full-text queries, SQL Server provides a set of full-text predicates (CONTAINS and FREETEXT) and rowset-valued functions (CONTAINSTABLE and FREETEXTTABLE). Using these, applications and users can perform a variety of types of full-text searches, such as searching on a single word or phrase (and optionally ranking the result set), searching on a word or phrase close to another word or phrase, or searching on synonymous forms of a specific word.
SQL Server File Stream
FILESTREAM enables SQL Server-based applications to store unstructured data, such as documents and images, on the file system. Applications can leverage the rich streaming APIs and performance of the file system and at the same time maintain transactional consistency between the unstructured data and corresponding structured data. FILESTREAM integrates the SQL Server Database Engine with an NTFS file system by storing varbinary(max) binary large object (BLOB) data as files on the file system. Transact-SQL statements can insert, update, query, search, and back up FILESTREAM data. Win32 file system interfaces provide streaming access to the data. FILESTREAM uses the NT system cache for caching file data. This helps reduce any effect that FILESTREAM data might have on Database Engine performance. The SQL Server buffer pool is not used; therefore, this memory is available for query processing.
FILESTREAM is not automatically enabled when you install or upgrade SQL Server. You must enable FILESTREAM by using SQL Server Configuration Manager and SQL Server Management Studio. To use FILESTREAM, you must create or modify a database to contain a special type of filegroup. Then, create or modify a table so that it contains a varbinary(max) column with the FILESTREAM attribute. After you complete these tasks, you can use Transact-SQL and Win32 to manage the FILESTREAM data.
When you use FILESTREAM to store binary large object (BLOB) data, you can use Win32 APIs to work with the files. To support working with FILESTREAM BLOB data in Win32 applications, SQL Server provides couple of functions and some APIs.
PathName returns a path as a token to a BLOB. An application uses this token to obtain a Win32 handle and operate on BLOB data.
GET_FILESTREAM_TRANSACTION_CONTEXT() returns a token that represents the current transaction of a session. An application uses this token to bind FILESTREAM file system streaming operations to the transaction.
The OpenSqlFilestream API obtains a Win32 file handle. The application uses the handle to stream the FILESTREAM data.
All FILESTREAM data container access is performed in a SQL Server transaction. Transact-SQL statements can be executed in the same transaction to maintain consistency between SQL data and FILESTREAM data.
SQL Server Full-Text Engine
The SQL Server Full-Text Engine is a full-text indexing and search engine. In SQL Server 2008, the Full-Text Engine has been fully integrated into the Database Engine. The Full-Text Engine now resides in the SQL Server process rather than residing in a separate process like in the previous versions of SQL Servers. Integrating the Full-Text Engine into the Database Engine has improved full-text manageability, optimization of mixed query, and overall performance. For each instance of SQL Server, there is a dedicated instance of the Full-Text Engine, including dedicated components such as word breakers and filters, resources such as memory, and configuration such as service-level settings at the instance level.
Full-text search uses the following components of the SQL Server process:
User tables: These tables contain the data to be full-text indexed.
Crawl Thread (Full-text gatherer): The full-text gatherer works with the full-text crawl threads. It is responsible for scheduling and driving the population of full-text indexes, and also for monitoring full-text catalogs.
Thesaurus files: These files contain synonyms of search terms
Stoplist objects: Stoplist objects contain a list of common words that are not useful for the search.
SQL Server query processor: The query processor compiles and executes SQL queries. If a SQL query includes a full-text search query, the query is sent to the Full-Text Engine, both during compilation and during execution. The query result is matched against the full-text index.
Full-Text Engine: The Full-Text Engine in SQL Server is now fully integrated with the query processor. The Full-Text Engine compiles and executes full-text queries. As part of query execution, the Full-Text Engine might receive input from the thesaurus and stoplist. In SQL Server 2008 and later versions, the Full-Text Engine for SQL Server runs inside the SQL Server query processor.
Indexer: The index writer builds the structure that is used to store the indexed tokens.
Filter daemon manager: The filter daemon manager is responsible for monitoring the status of the Full-Text Engine filter daemon host.
The filter daemon host is a process that is started by the Full-Text Engine. It runs the following full-text search components, which are responsible for accessing, filtering, and word breaking data from tables, as well as for word breaking and stemming the query input:
Protocol handler: One of its responsibilities is to gather data from the columns being full-text indexed and pass it to the filter daemon host, which will apply filtering and word breaker as required.
Filters: Some data types require filtering before the data in a document can be full-text indexed, including data in varbinary, varbinary(max), image, or xml columns. The filter used for a given document depends on its document type. For example, different filters are used for Microsoft Word (.doc) documents, Microsoft Excel (.xls) documents, and XML (.xml) documents. Then the filter extracts chunks of text from the document, removing embedded formatting and retaining the text and, potentially, information about the position of the text. The result is a stream of textual information.
Word breakers and stemmers: A word breaker is a language-specific component that finds word boundaries based on the lexical rules of a given language (word breaking). Each word breaker is associated with a language-specific stemmer component that conjugates verbs and performs inflectional expansions. At indexing time, the filter daemon host uses a word breaker and stemmer to perform linguistic analysis on the textual data from a given table column. The language that is associated with a table column in the full-text index determines which word breaker and stemmer are used for indexing the column.
Full Text Catalog
The most commonly used indexes in a SQL Server database are clustered and non-clustered indexes that are organized in a B-tree structure. You can create these types of indexes on most columns in a table or a view, except those columns configured with large object (LOB) data types, such as text and varchar(max). Although this limitation is not a problem in many cases, there will be times when you’ll want to query such column types. However, without indexes defined on the columns, the query engine must perform a full table scan to locate the necessary data.
A full-text index is made up of word tokens that are derived from the text being indexed. For example, if the indexed text contains the phrase “tables can include indexes,” the full-text index would contain four tokens: “tables,” “can,” “include,” and “indexes.” Because the list of tokens can be easily searched, full-text queries can quickly locate the necessary records.
A full-text catalog provides a mechanism for organizing full-text indexes. Each catalog can contain zero or more indexes, but each index can be associated with only one catalog. Catalogs are implemented differently in SQL Server 2005 and 2008:
SQL Server 2005: A full-text catalog is a physical structure that must reside on the local hard drive associated with the SQL Server instance. Each catalog is part of a specific filegroup. If no filegroup is specified when the catalog is created, the default filegroup is used.
SQL Server 2008: A full-text catalog is a logical concept that refers to a group of full-text indexes. The catalog is not associated with a filegroup.
To prevent a full-text index from becoming bloated, SQL Server has a mechanism that discards commonly occurring strings that do not help the search. These discarded strings are called stopwords (aka noise words). During index creation, the Full-Text Engine omits stopwords from the full-text index. This means that full-text queries will not search on stopwords.
When a cell in a varbinary(max), or image column contains a document with a supported document-file extension, full-text search uses a filter to interpret the binary data. The filter, which implements the iFilter interface, extracts the textual information from the document and submits it for indexing. To identify the filters included in an instance of SQL Server, use the sp_help_fulltext_system_components (Transact-SQL) stored procedure, which returns information for the registered word-breakers, filter, and protocol handlers.
Many document types can be stored in a single varbinary(max), or image column. For each document, SQL Server chooses the correct filter based on the file extension. Because the file extension is not visible when the file is stored in a varbinary(max), or image column, the file extension must be stored in a separate column in the table, called a type column. This type column can be of any character-based data type and contains the document file extension, such as .doc for a Microsoft Word document.
When creating a full-text index on a varbinary(max), or image column you must identify a corresponding type column that has the extension information so that SQL Server knows which filter to use. The IDs of the full-text indexed column and its associated type column can be found using the sys.full-text_index_columns catalog view.
After the varbinary(max), or image column is full-text indexed, it can be queried using the search predicates CONTAINS and FREETEXT.
Thesaurus provides the ability to define synonyms and acronyms and use the synonyms in full text searches. With SQL commands like CONTAINS, CONTAINSTABLE, the thesaurus is used to identify expressions or replacements for the searched terms. The thesaurus files are in XML format with a global file (tsGLOBAL.xml) and then 18 language specific files. By default all of the files have sample XML that is commented out. So by default no synonyms are setup when a full text search is issued.
The files are located in SQL_Server_install_path\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\FTDATA\.
The thesaurus file includes a <diacritics_sensitive> tag. This element contains an integer value that controls accent sensitivity. For example, suppose you specify the pattern “café” to be replaced by other patterns in a Full-Text Search query. If the thesaurus file is accent-insensitive, Full-Text Search replaces the patterns “café” and “cafe”. If the thesaurus file is accent-sensitive, Full-Text Search replaces only the pattern “café”. Note that this setting can only be applied one time in the file, and applies to all the search patterns in the file.
In this post, I tried to explain the filestream and full-text search feature in SQL Server. In SQL Sever 2008, there is great improvement in these features compared to older versions of SQL Server. One of the limitations with full text search is lack of support for soundex search, but we can use Google spell checker or some of the third-party tools such as aspNetSpell or the RADSpell from telerik.
Stay tuned for the PART II .