BEFORE Triggers: Best practices for Unicode column supporting mixed languages and searching

(I posted this to sqlserver.server newsgroup but did not get a helpful
response)
Globalization gurus,
The problem is storing in a single column character data from mixed
languages and then providing a search capability to find the best match
given a search string in some arbitrary language. The column, of course, is
Unicode data type with some collation. What is recommendation for table
design for efficiency?
As an example, consider an international directory of business names and a
function to search for a name "like" <some string> where <some string> is
user input in any language. Since collation determines comparison rules it
seems the appropriate collation is one that best matches the language of the
search string. Further, to facilitate matching it seems appropriate to relax
restrictions such as case and accent sensitivity. That is a strict binary
comparison is not "user friendly." However there are performance
implications when the collation of the search string does not match the
collation of the database column.
Has anyone solved this problem or know of a good discussion?
Thank you,
SethSeth,
Hmm... This seem a very good fit for SQL Server 2005 Full-Text Search (FTS).
Note, that I'm NOT recommending SQL Server 2000 FTS for this as there have
been specific SQL FTS enhancements for these requirements that are not in
SQL 2000. Specifically, in SQL Server 2005, you can define multiple
languages in one Unicode column and then do CONTAINS or FREETEXT and specify
the language (via LCID) on a per query basis, for example:
select * from table where
contains(*,'formsof(inflectional,"englishword")',language 1033) OR
contains(*,'formsof(inflectional,"chineseword")',language 2052)
The above query will find the englishword or chineseword and inflectional
variations from the same FT-enable column. You can also use this query with
CONTAINSTABLE or FREETEXTTABLE and use the RANK value. As for table design,
that depend upon what you're storing and what combined column you want
returned to your searchers.
You can also control the accent sensitivity via
"CREATE FULLTEXT CATALOG <FT_Catalog_Name> WITH ACCENT_SENSITIVITY = [OFF |
ON]
However, FTS is not case sensitive, but based upon research in to how people
search, case sensitivity is usually viewed as a hindrance as most internet
search engines are case-insensitive as well. Why do you need this level of
control?
Feel free to email me directly if you'd like to continue this discuss in
more details.
Thanks,
John
--
SQL Full Text Search Blog
http://spaces.msn.com/members/jtkane/
"Seth" <seth@.no*spam.com> wrote in message
news:Orp0SUb2FHA.2472@.TK2MSFTNGP12.phx.gbl...
> (I posted this to sqlserver.server newsgroup but did not get a helpful
> response)
> Globalization gurus,
> The problem is storing in a single column character data from mixed
> languages and then providing a search capability to find the best match
> given a search string in some arbitrary language. The column, of course,
> is
> Unicode data type with some collation. What is recommendation for table
> design for efficiency?
> As an example, consider an international directory of business names and a
> function to search for a name "like" <some string> where <some string> is
> user input in any language. Since collation determines comparison rules it
> seems the appropriate collation is one that best matches the language of
> the
> search string. Further, to facilitate matching it seems appropriate to
> relax
> restrictions such as case and accent sensitivity. That is a strict binary
> comparison is not "user friendly." However there are performance
> implications when the collation of the search string does not match the
> collation of the database column.
> Has anyone solved this problem or know of a good discussion?
> Thank you,
> Seth
>
>|||If anyone has ideas on non-FTS solutions I would appreciate that as well.
John,
Good idea!
I have read about FTS enhancements in SQL Server 2005 but I admit to having
an aversion to FTS and dismissed it as an option prematurely. It is
definitely worth consideration.
Seth
--
"John Kane" wrote in message news:OVefrPe2FHA.1140@.tk2msftngp13.phx.gbl...
Seth,
Hmm... This seem a very good fit for SQL Server 2005 Full-Text Search (FTS).
Note, that I'm NOT recommending SQL Server 2000 FTS for this as there have
been specific SQL FTS enhancements for these requirements that are not in
SQL 2000. Specifically, in SQL Server 2005, you can define multiple
languages in one Unicode column and then do CONTAINS or FREETEXT and specify
the language (via LCID) on a per query basis, for example:
select * from table where
contains(*,'formsof(inflectional,"englishword")',language 1033) OR
contains(*,'formsof(inflectional,"chineseword")',language 2052)
The above query will find the englishword or chineseword and inflectional
variations from the same FT-enable column. You can also use this query with
CONTAINSTABLE or FREETEXTTABLE and use the RANK value. As for table design,
that depend upon what you're storing and what combined column you want
returned to your searchers.
You can also control the accent sensitivity via
"CREATE FULLTEXT CATALOG <FT_Catalog_Name> WITH ACCENT_SENSITIVITY = [OFF |
ON]
However, FTS is not case sensitive, but based upon research in to how people
search, case sensitivity is usually viewed as a hindrance as most internet
search engines are case-insensitive as well. Why do you need this level of
control?
Feel free to email me directly if you'd like to continue this discuss in
more details.
Thanks,
John
--
SQL Full Text Search Blog
http://spaces.msn.com/members/jtkane/
"Seth" <seth@.no*spam.com> wrote in message
news:Orp0SUb2FHA.2472@.TK2MSFTNGP12.phx.gbl...
> (I posted this to sqlserver.server newsgroup but did not get a helpful
> response)
> Globalization gurus,
> The problem is storing in a single column character data from mixed
> languages and then providing a search capability to find the best match
> given a search string in some arbitrary language. The column, of course,
> is
> Unicode data type with some collation. What is recommendation for table
> design for efficiency?
> As an example, consider an international directory of business names and a
> function to search for a name "like" <some string> where <some string> is
> user input in any language. Since collation determines comparison rules it
> seems the appropriate collation is one that best matches the language of
> the
> search string. Further, to facilitate matching it seems appropriate to
> relax
> restrictions such as case and accent sensitivity. That is a strict binary
> comparison is not "user friendly." However there are performance
> implications when the collation of the search string does not match the
> collation of the database column.
> Has anyone solved this problem or know of a good discussion?
> Thank you,
> Seth
>
>

Thursday, March 22, 2012

Best practices for Unicode column supporting mixed languages and searching

No comments:

Post a Comment

BEFORE Triggers

Blog Archive

About Me