Transactions of the Chinese Society for Agricultural Machinery, 2017, 48(10): 202-208

Diet Health Text Classification Based on word2vec and LSTM

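Authors:
Zhao Ming (赵明); Du Huifang (杜会芳); Dong Cuicui (董翠翠); Chen Changsong (陈长松)
Affiliations:
The Third Research Institute of the Ministry of Public Security; College of Information and Electrical Engineering, China Agricultural University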
Keywords:
text classification; word2vec; word vectors; long short-term memory network; K-means++
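Abstract:
To classify diet-related text efficiently, a classification model based on word2vec and a long short-term memory network (LSTM) was built. Given the characteristics of food encyclopedia entries and diet-health texts, word2vec was first used to produce word vectors that carry semantic information, which avoids the sparse data representation and curse of dimensionality caused by traditional methods. K-means++ clustering on semantic relations was then applied to improve the quality of the training data. Text vectors built with word2vec served as the initial input to the LSTM, which was trained as a classifier that extracts features automatically and labels each text as a dietary recommendation ("suitable") or a dietary prohibition ("avoid"). Experiments on 48 000 documents gave a classification accuracy of 98.08%, higher than the results obtained with numerical text representations such as tf-idf and bag-of-words and with classifiers based on the support vector machine (SVM) and the convolutional neural network (CNN). The results indicate that the method can classify diet texts automatically with high quality and help people make effective use of healthy-eating information.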
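The clustering step named in the abstract can be illustrated with a minimal sketch of K-means++ seeding followed by ordinary K-means iterations. This is not the authors' code: the six words and their 2-D vectors are hypothetical stand-ins for word2vec embeddings, and the function names `kmeans_pp_init` and `kmeans` are chosen here for illustration only.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    # Index of the center closest to the point.
    return min(range(len(centers)), key=lambda j: squared_distance(point, centers[j]))

def kmeans_pp_init(points, k, rng):
    """K-means++ seeding: each new center is drawn with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        total = sum(d2)
        if total == 0:
            # Every point coincides with a center; fall back to a uniform draw.
            centers.append(rng.choice(points))
            continue
        threshold = rng.random() * total
        cumulative = 0.0
        for p, w in zip(points, d2):
            cumulative += w
            if cumulative > threshold:
                centers.append(p)
                break
        else:
            # Guard against floating-point rounding at the upper end.
            centers.append(points[-1])
    return centers

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [nearest(p, centers) for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical toy "embeddings"; real word2vec vectors have far more dimensions.
words = ["beneficial", "recommended", "nourishing", "avoid", "harmful", "forbidden"]
vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]

labels, centers = kmeans(vectors, k=2)
for cluster in range(2):
    print(cluster, [w for w, lab in zip(words, labels) if lab == cluster])
```

In this toy setting the words related to "suitable" fall into one cluster and the words related to "avoid" into the other; which cluster receives index 0 depends on the random seed.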
English title:
Diet Health Text Classification Based on word2vec and LSTM
English keywords:
text classification; word2vec; word embedding; long short-term memory network; K-means++
English abstract:
With the growth of the Internet, online information is expanding rapidly. Text is the main form of information on the network and exists in massive quantities, and this includes text about diet. Diet information is closely related to people's health, so classifying such text automatically is important for helping people make effective use of healthy-eating information. To classify food text efficiently, a classification model based on word2vec and LSTM was proposed. According to the characteristics of food text in encyclopedias and diet text on health websites, word2vec was used to produce word embeddings that carry semantic information, which solved the sparse-representation and curse-of-dimensionality problems faced by traditional methods. Word2vec combined with K-means++ was used to cluster keywords for both the "suitable" and the "avoid" categories, which enlarged the set of relevant words in the classification dictionaries. These words were used to derive rules that improved the quality of the training data. Document vectors were then constructed from word2vec as the initial input to the long short-term memory network (LSTM). LSTM places the input layer and hidden layers of the neural network inside a protected memory cell. Through its "gate" structures, which use sigmoid and tanh functions, information is removed from or added to the cell state. This gives the LSTM model a "memory" that makes good use of textual context, which is significant for text classification. Experiments were performed with 48 000 documents. The classification accuracy was 98.08%, higher than that of methods based on tf-idf and bag-of-words text vector representations. Two other classification algorithms, the support vector machine (SVM) and the convolutional neural network (CNN), both based on word2vec, were also tested. The proposed model outperformed these competing methods by several percentage points. The results indicate that the method can automatically classify dietary texts with high quality and help people make good use of healthy-diet information.
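The gate mechanism described in the abstract can be sketched as a single LSTM time step. The code below uses the standard textbook LSTM formulation, which is an assumption here: the abstract does not give the exact variant, layer sizes, or weights used in the paper. The names `lstm_step`, `make_params`, and the parameter keys are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, vector):
    # Row-wise dot product plus bias: W @ vector + b.
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def make_params(hidden, inputs, value=0.0):
    """Hypothetical helper: constant-valued weights and biases for all gates."""
    width = hidden + inputs
    params = {}
    for gate in ("f", "i", "g", "o"):
        params["W" + gate] = [[value] * width for _ in range(hidden)]
        params["b" + gate] = [value] * hidden
    return params

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a standard LSTM cell on plain Python lists."""
    z = h_prev + x  # concatenate previous hidden state and current input
    f = [sigmoid(a) for a in affine(params["Wf"], params["bf"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(params["Wi"], params["bi"], z)]    # input gate
    g = [math.tanh(a) for a in affine(params["Wg"], params["bg"], z)]  # candidate values
    o = [sigmoid(a) for a in affine(params["Wo"], params["bo"], z)]    # output gate
    # The forget gate removes old information; the input gate adds new information.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # The output gate decides how much of the cell state is exposed.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Run a short sequence of hypothetical 2-D word vectors through the cell.
params = make_params(hidden=1, inputs=2, value=0.1)
h, c = [0.0], [0.0]
for word_vector in ([0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]):
    h, c = lstm_step(word_vector, h, c, params)
print(h, c)
```

In a classifier of the kind the abstract describes, the final hidden state `h` would typically feed an output layer that predicts the "suitable" or "avoid" label; that layer and the training procedure are omitted from this sketch.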
