[1] CHEN Deyi, ZHANG Hongyi, LIU Cailing, et al. Classification of Chinese Text Harmful Information Based on Keywords Strategy and Convolutional Neural Network[J]. Journal of Jimei University (Natural Science), 2020, 25(5): 392-400.
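Classification of Chinese Text Harmful Information Based on Keywords Strategy and Convolutional Neural Network
Journal of Jimei University (Natural Science) [ISSN: 1007-7405 / CN: 35-1186/N]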
- Volume: 25
- Issue: No. 5, 2020
- Pages: 392-400
- Section: Mathematical Sciences and Information Engineering
- Publication date: 2020-09-30
Article Info
- Title: Classification of Chinese Text Harmful Information Based on Keywords Strategy and Convolutional Neural Network
- Author(s): CHEN Deyi1; ZHANG Hongyi1; LIU Cailing1; ZHANG Guangbin2
- Affiliations: 1. School of Optoelectronic and Communication Engineering, Xiamen University of Technology, Xiamen 361024, Fujian; 2. Xiamen Meiya Pico Information Co., Ltd., Xiamen 361005, Fujian
- Keywords: word embedding; segmentation term frequency-document frequency (STF-DF); feature word set; Word2Vec model; convolutional neural network (CNN)
- CLC number: -
- DOI: -
- Document code: A
- Abstract: The rapid development of the internet and big data technology has made a wide range of Chinese text information far easier to access, but it has also greatly increased the risk that harmful information spreads through Chinese text. Traditional text processing methods based on vector representations are mainly designed for English text. To address these problems, a novel Chinese text classification framework is proposed. In this framework, a word vector model is first built with Word2Vec. Keywords with strong category-discriminating ability are then selected using segmentation term frequency-document frequency (STF-DF), and a convolutional neural network (CNN) suited to Chinese text classification is constructed to perform the classification. Experimental results show that the framework reaches accuracies of 94.51% and 95.04% on the THUCNews and Fudan University Chinese text data sets, respectively, and a recall of 99.70% on a real-world harmful information data set, which verifies the effectiveness and practical value of the proposed framework.
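The keyword-selection step in the abstract can be illustrated with a small sketch. The abstract does not give the STF-DF formula, so the score below is a hypothetical stand-in (per-class term frequency times per-class document frequency, damped by document frequency in other classes), not the authors' definition. The function names `select_keywords` and `filter_tokens` and the toy English tokens are also assumptions. Real Chinese input would first need word segmentation.

```python
from collections import Counter, defaultdict


def select_keywords(docs, labels, top_k=2):
    """Return the top_k highest-scoring words for each class.

    Stand-in score (NOT the paper's STF-DF, which the abstract does not define):
        score(w, c) = tf(w, c) * df(w, c) / (1 + df(w, other classes))
    """
    tf = defaultdict(Counter)  # class -> word -> total occurrences
    df = defaultdict(Counter)  # class -> word -> number of documents containing it
    for tokens, label in zip(docs, labels):
        tf[label].update(tokens)
        df[label].update(set(tokens))

    keywords = {}
    for c in tf:
        scores = {}
        for word in tf[c]:
            # Document frequency of the word across all other classes.
            df_other = sum(df[o][word] for o in df if o != c)
            scores[word] = tf[c][word] * df[c][word] / (1 + df_other)
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[c] = ranked[:top_k]
    return keywords


def filter_tokens(tokens, keyword_set):
    """Keep only tokens that belong to the selected keyword set."""
    return [t for t in tokens if t in keyword_set]


# Toy, already-segmented documents (hypothetical data).
docs = [
    ["win", "prize", "now"],
    ["win", "cash", "now"],
    ["meeting", "at", "noon"],
    ["meeting", "notes", "now"],
]
labels = ["harmful", "harmful", "normal", "normal"]

per_class = select_keywords(docs, labels, top_k=1)
vocabulary = {w for words in per_class.values() for w in words}
print(per_class)
print(filter_tokens(["win", "a", "meeting"], vocabulary))
```

In a pipeline like the one described, the filtered tokens would then be mapped to Word2Vec vectors and passed to the CNN. Those stages are omitted here because the abstract does not specify the network configuration.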
Last Update:
2020-11-04