

Research and Implementation of Intelligent Web Information Extraction Technology

Published: 2018-01-14 08:37

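  Article keywords: Research and Implementation of Intelligent Web Information Extraction Technology. Source: University of Electronic Science and Technology of China, 2009 master's thesis. Document type: degree thesis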


  Related topics: information extraction, rule generator, template generator, incremental/multi-page processing


[Abstract]: With China's rapid economic development, growing investment in national information infrastructure, and rising living standards, the Internet has reached every aspect of daily life and become an indispensable part of work and leisure. Obtaining information from the Web quickly and effectively has therefore become an important research topic. The information on the Web is highly varied, however, and page structures change frequently. Most pages also carry noise such as advertisements, navigation elements, and hot links, which makes the task considerably harder for researchers. Current information extraction techniques still have many shortcomings: they can handle only one type of web page, the extracted information is coarse-grained, accuracy conflicts with efficiency, manual intervention is not well balanced against automated operation, and incremental information processing is not supported. A new extraction method is urgently needed to address these problems, and that need motivates this work.

This thesis adopts a template-based information extraction algorithm. A rule generator first identifies the delimiters of the target entities on a web page, a template generator then writes these delimiter markers into a template, and finally an information extractor uses the template to extract the relevant information from the site. The specific contributions and key techniques are as follows:

1. By analyzing the page structure of the target sites, including layout patterns and the distribution of tags, and drawing on current information extraction techniques from China and abroad, the thesis devises a template format intended to describe any web page structure, together with a scheme for configuring templates automatically.

2. An information extractor is designed that reads a template and extracts information according to its configuration. Two algorithms are added to this process. The incremental/multi-page processing algorithm handles content on one topic that is spread across several pages, which requires merging, and topic pages whose content is updated over time, which requires incremental extraction. The deduplication algorithm handles similar or identical topics that are repeated across sites.

3. Results are stored in structured form: the relevant information is extracted according to the template configuration and saved as structured records. The extraction system is also designed to be dynamically extensible: templates are configured as requirements change, with no changes to the code.

The thesis proposes a template-based extraction scheme that can automatically extract information from many types of web pages, and develops a corresponding system, IWIES. Experimental results show that, compared with common Web information extraction methods, the scheme achieves faster extraction as well as higher precision and recall.
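The template-driven extraction and incremental processing described in the abstract can be illustrated with a small sketch. The abstract does not give the actual IWIES template format or algorithms, so everything below is an assumption made for illustration: the `TEMPLATE` layout (a start and end delimiter per field), the function names `extract`, `fingerprint`, and `extract_incremental`, and the use of a content hash to skip records that were already extracted. Hash-based skipping is a simple stand-in, not the thesis's deduplication algorithm.

```python
import hashlib
import re

# Hypothetical template: each field maps to a (start, end) delimiter pair.
# In the thesis these delimiters would come from the rule generator.
TEMPLATE = {
    "title": ("<h1>", "</h1>"),
    "body": ('<div class="content">', "</div>"),
}


def extract(html, template):
    """Return a dict of field -> text found between each delimiter pair."""
    record = {}
    for field, (start, end) in template.items():
        # Non-greedy match between the two literal delimiters.
        pattern = re.escape(start) + r"(.*?)" + re.escape(end)
        match = re.search(pattern, html, re.DOTALL)
        record[field] = match.group(1).strip() if match else None
    return record


def fingerprint(record):
    """Hash a record's content so identical records can be recognized."""
    joined = "\x1f".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def extract_incremental(pages, template, seen):
    """Extract from each page, keeping only records not seen before.

    `seen` is a set of fingerprints that persists between runs, so a
    later run returns only new or changed records.
    """
    fresh = []
    for html in pages:
        record = extract(html, template)
        key = fingerprint(record)
        if key not in seen:
            seen.add(key)
            fresh.append(record)
    return fresh


# Usage: the navigation block is noise and is ignored by the template.
page = (
    '<html><h1>Hello</h1><div class="nav">ads</div>'
    '<div class="content">Body text</div></html>'
)
seen = set()
first = extract_incremental([page, page], TEMPLATE, seen)
second = extract_incremental([page], TEMPLATE, seen)
print(first)
print(second)
```

Changing what is extracted means editing `TEMPLATE` rather than the functions, which mirrors the extensibility goal stated in the abstract. Literal delimiter matching is deliberately simple and will stop at the first closing delimiter, so nested tags of the same kind would need a real HTML parser.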
[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2009
[Classification number]: TP391.1

