Research on Hot Topic Detection and Tracking Techniques for Weibo
Published: 2018-10-23 20:31
[Abstract]: Topic detection and tracking identifies the most-discussed topics in massive volumes of data and follows how those topics develop in subsequent information, helping people cope with an increasingly severe information-explosion problem. It saves users time, lets them follow events as they unfold, and provides data support for public-opinion monitoring, so it has significant practical and security value. As more and more users publish information and discuss topics on Weibo, displaying hot topics has gradually become an important function of the platform. Because Weibo is highly immediate, breaking news spreads quickly on it, and for high-impact news events the number of users who report, repost, and comment is very large, so Weibo often reacts before traditional news media. Taking these characteristics into account, this thesis filters out invalid posts and then designs and implements a hot-topic detection and tracking method for Weibo. The main work is as follows. 1) The characteristics of Weibo are analyzed and invalid posts are filtered out. Weibo users form a complex population that covers a wide range and differs greatly, and the content they post is heterogeneous. User features, including follower count and the number of posts published per day, are analyzed to filter out advertising accounts and zombie accounts. Post content is analyzed to filter out the large volume of posts that contribute nothing to topics, such as merchant promotions, content shared by users, and activities users take part in. The word-segmented post data are analyzed to filter out posts containing too many or too few words, which removes meaningless overly short texts and overly long texts with excessive repetition. Together these steps effectively remove invalid posts and reduce computational complexity.
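The filtering step in item 1) can be illustrated with a minimal sketch. The abstract names the signals (follower count, posts per day, promotional content, word count after segmentation) but gives no thresholds, keyword lists, or direction of the cut-offs, so every constant, the `Post` type, and the `is_valid` function below are hypothetical and are not the thesis's actual rules.

```python
from dataclasses import dataclass
from typing import List

# All values below are illustrative assumptions, not taken from the thesis.
MIN_FOLLOWERS = 10          # fewer followers: treated here as a likely zombie account
MAX_POSTS_PER_DAY = 50.0    # more posts per day: treated here as a likely ad account
MIN_TOKENS = 3              # shorter posts: too little content to carry a topic
MAX_TOKENS = 60             # longer posts: assumed to be repetitive filler
PROMO_MARKERS = {"sale", "giveaway", "shared-from"}  # hypothetical promo tokens


@dataclass
class Post:
    user_followers: int
    user_posts_per_day: float
    tokens: List[str]  # the post text after word segmentation


def is_valid(post: Post) -> bool:
    """Return True if the post passes every filter in this sketch."""
    if post.user_followers < MIN_FOLLOWERS:
        return False
    if post.user_posts_per_day > MAX_POSTS_PER_DAY:
        return False
    if not (MIN_TOKENS <= len(post.tokens) <= MAX_TOKENS):
        return False
    if PROMO_MARKERS & set(post.tokens):
        return False
    return True
```

Applying `is_valid` to each post before clustering keeps only the posts that survive all four checks, which is where the reduction in computational cost described in the abstract would come from.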
2) A hot-topic detection algorithm for Weibo based on temporal characteristics is designed and implemented. Posts are processed in chronological order. Preliminary topic detection uses an improved Single-Pass clustering algorithm, with improvements to the similarity calculation and to the topic-vector update, which incorporates user influence. The FP-Growth frequent-itemset mining algorithm is then used to mine frequent feature-word sets and correct errors made by the Single-Pass (SP) algorithm, and an improved K-MEDOIDS algorithm clusters the frequent feature-word sets to extract the final topics. These steps improve both computational efficiency and topic-detection accuracy. 3) A multi-query-vector adaptive topic tracking algorithm based on temporal characteristics is designed and implemented. Based on how the number of posts is distributed over time, posts are grouped into time periods and processed in chronological order. The topics of each period are compared by similarity with every topic in every existing topic group; depending on a threshold, a topic either joins an existing topic group or starts a new one, and the topic vector is updated adaptively when a topic joins a group. This tracks topic development effectively, improves accuracy, and reduces topic drift.
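The Single-Pass step in item 2) can be sketched as follows. This is the textbook Single-Pass scheme with plain cosine similarity over term-frequency vectors, not the improved similarity measure of the thesis, which the abstract does not specify. The influence-weighted centroid update, the `threshold` default, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(posts: List[Tuple[List[str], float]],
                threshold: float = 0.3) -> List[dict]:
    """Cluster posts in one pass.

    posts: (tokens, influence) pairs, already sorted by increasing time.
    influence: a positive per-user weight (an assumed stand-in for the
    thesis's user-influence term).
    """
    topics: List[dict] = []
    for i, (tokens, influence) in enumerate(posts):
        vec = Counter(tokens)  # raw term frequencies
        best, best_sim = None, 0.0
        for topic in topics:
            sim = cosine(vec, topic["centroid"])
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is not None and best_sim >= threshold:
            # Assumed update rule: add the post's terms scaled by influence.
            for t, w in vec.items():
                best["centroid"][t] = best["centroid"].get(t, 0.0) + influence * w
            best["members"].append(i)
        else:
            # No existing topic is similar enough: start a new one.
            topics.append({
                "centroid": {t: influence * w for t, w in vec.items()},
                "members": [i],
            })
    return topics
```

The assign-or-create decision against a threshold is the same pattern the abstract describes for the tracking step in item 3), where a period's topic either joins an existing topic group or starts a new one.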
[Degree-granting institution]: Southeast University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.1;TP393.092
Article ID: 2290384
Article link: http://www.wukwdryxk.cn/wenyilunwen/guanggaoshejilunwen/2290384.html