Study and Design of Chinese Concept-Based Search Engine

This paper proposes a new kind of Chinese concept-based search engine and presents its theoretical model, working mechanism, and design procedure. Its kernel is the knowledge database and a new weighting algorithm for computing the weights of HTML tags. Using these two techniques has improved not only the exactness of the index database but also the accuracy of users' queries. As a result, the precision ratio and recall ratio of the search engine have been improved substantially.

Keywords-search engine; concept; weighting; intelligence; knowledge database.

I. INTRODUCTION

The rapid development and extensive popularization of the Internet is driving search engines to evolve rapidly. But most search engines are based on keywords. That is to say, they cannot distinguish homographs and cannot associate a keyword with its synonyms. Search engines have already become a new field of research and development, especially Chinese search engines, because of the complexity of Chinese semantics. To address this, the paper puts forward a new kind of Chinese intelligent search engine that is based on concepts (1) and a knowledge database (2). It can distinguish homographs, can associate keywords with their synonyms, and can discard high-frequency words such as "t . X " etc. These words are frequent but insignificant, and they waste a lot of storage space. What's more, it uses a new algorithm for computing the weights of HTML tags. This algorithm considers all kinds of tags, for instance "TITLE, H, P, B", and the weight of each tag. If a tag is important, then its weight is high. These weights are derived from many experiments and a theoretical foundation. Therefore it has greatly enhanced the performance of the search engine.

transport protocol, the Robot ransacks the whole WWW space, including all hyperlinks in Web pages, to collect Web page information, and stores that information in the Web page database. We can then analyze and process the Web pages' information. Compared with other Robots, this Robot can discover dead links and find newly added links. It stays synchronized with all Internet resources.
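The Robot's traversal can be illustrated with a minimal sketch. It is not the authors' implementation: the breadth-first strategy, the function name `crawl`, the `fetch` callback, and the `known` set of previously stored URLs are all assumptions made for illustration, and a small in-memory dictionary stands in for the Web so that the example runs without network access. A URL whose fetch fails is recorded as a dead link, and a live URL absent from the previous crawl is reported as newly added.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

# Hypothetical fetcher: returns the outgoing links of a page,
# or None when the page cannot be retrieved (a dead link).
Fetcher = Callable[[str], Optional[List[str]]]


def crawl(
    seeds: Iterable[str],
    fetch: Fetcher,
    known: Iterable[str] = (),
) -> Tuple[Dict[str, List[str]], Set[str], Set[str]]:
    """Breadth-first traversal of the link graph starting from the seed URLs.

    Returns (pages, dead, new):
      pages maps every live URL reached to its outgoing links,
      dead  is the set of URLs that could not be fetched,
      new   is the set of live URLs that were not in `known`.
    """
    start = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    queue = deque(start)
    seen: Set[str] = set(start)
    pages: Dict[str, List[str]] = {}
    dead: Set[str] = set()

    while queue:
        url = queue.popleft()
        links = fetch(url)
        if links is None:
            dead.add(url)  # unreachable page: record it as a dead link
            continue
        pages[url] = links
        for link in links:
            if link not in seen:  # visit each URL at most once
                seen.add(link)
                queue.append(link)

    new = set(pages) - set(known)
    return pages, dead, new


# Toy "Web" (hypothetical data): page "d" is linked to but does not exist.
web = {"a": ["b", "c"], "b": ["a", "d"], "c": []}
pages, dead, new = crawl(["a"], web.get, known={"a", "b"})
print(sorted(pages), sorted(dead), sorted(new))
```

In a real Robot, `fetch` would download the page over the network and extract its hyperlinks, and `known` would come from the Web page database built by the previous crawl.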
There are two ways of obtaining preliminary URLs. One is for the Robot to collect them itself at regular intervals, and the other is user referral.

B. Indexer

The indexer's goal is to build the Web index database, which can be retrieved by the Search Module. The index database is the soul of the search engine1. In some respects, it is the index database that determines the quality of a search engine. Therefore, the design of the indexer is important and pivotal. To embody this principle, two measures are introduced: the knowledge database technique and the new weighting algorithm. Because the knowledge database contains a great deal of parasitological, semantic and lexical knowledge, as well as common sense, language material, a word database, a statistical table for reverse words frequency (STRWF), etc., sentences and words can be segmented more exactly than before, and the segmented words are more expressive. These words then stand for the meaning of the Web page. Generally, an indexer builds a Web page index record by automatically picking up characteristic information or labels that can express the theme of the Web page, such as the Web page title, Web address, hyperlinks, personal names, organization names, place names, and some of the leading words in the Web page. For example, the weighting measure adopted by the AltaVista Web site is shown in Table 1. It can be seen from Table 1 that AltaVista considered only the 'title' tag and no other HTML tags. Obviously, this weighting measure is one-sided. Compared with AltaVista,
