Transactions of the Chinese Society for Agricultural Machinery
2019
(6)
280-287
Research on a Parallel Classification Method for Forestry Text Based on XGBoost in the Spark Framework
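Authors:
崔晓晖 (CUI Xiaohui); 师栋瑜 (SHI Dongyu); 陈志泊 (CHEN Zhibo); 许福 (XU Fu)
Keywords:
forestry text; text classification; big data analysis; Spark; XGBoost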
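Abstract:
The convergence of "Internet Plus" technology and forestry has produced a massive volume of forestry-related text awaiting mining, while research on forestry text classification remains immature. To address this, web crawler technology was used to collect forestry-related texts from the internet, the classification labels were rebuilt on the basis of the enriched corpus, and an XGBoost parallelization method based on the Spark computing framework was proposed to classify forestry texts. Under cross-validation, the parallel XGBoost classification algorithm achieved an accuracy of 0.9234, with a lowest per-class F1 of 0.8604 and a highest of 0.9984; its training speedup ratios on data sets of 21,000, 42,000, and 84,000 records were 2.13, 3.47, and 3.82, respectively. The results show that the classification model built on these labels adapts well to the forestry-related text currently on the internet. The accuracy of the parallel XGBoost algorithm implemented on Spark was significantly better than that of four other parallelized machine learning algorithms (naive Bayes, GBDT decision tree, BP neural network, and ELM neural network), and its execution efficiency was far higher than that of the single-machine version. The larger the data volume, the higher the speedup ratio, so the method can effectively handle real-time, accurate classification of massive forestry texts.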
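The accuracy and per-class F1 figures reported in the abstract are standard classification metrics. As a minimal sketch of how such figures are computed from true and predicted labels, the Python below uses plain dictionaries; the function names, the label names, and the toy data are hypothetical illustrations and are not taken from the paper.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, where F1 = 2 * P * R / (P + R) for each label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correctly predicted as class t
        else:
            fp[p] += 1          # wrongly predicted as class p
            fn[t] += 1          # a class-t sample that was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Toy labels (hypothetical, for illustration only).
y_true = ["pest", "pest", "policy", "policy"]
y_pred = ["pest", "policy", "policy", "policy"]

f1_by_class = per_class_f1(y_true, y_pred)
print(accuracy(y_true, y_pred), f1_by_class)
print(min(f1_by_class.values()), max(f1_by_class.values()))
```

Reporting the minimum and maximum per-class F1 alongside overall accuracy, as the abstract does, shows whether a high accuracy hides a weak class.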
Authors:
CUI Xiaohui; SHI Dongyu; CHEN Zhibo; XU Fu (College of Information Science and Technology, Beijing Forestry University)
Keywords:
forestry text; text classification; big data analysis; Spark; XGBoost
Abstract:
The convergence of computer technology and forestry has produced a large number of forestry texts awaiting exploration, but related research has two shortcomings. First, the classification labels in existing classification systems are not set scientifically, so classification models struggle to classify texts found on the internet. Second, classification algorithms are mostly trained in a single-machine environment without regard to parallelism, so they cannot handle real large-scale classification problems. It is therefore both practical and urgent to design more scientific classification labels and to classify forestry texts on the Spark framework. A new crawler technology was used to collect forestry-related texts, and the labels were reconstructed with reference to the existing forestry information retrieval system to improve the adaptability of classification models. An XGBoost parallelization method was then implemented on Spark, with training and prediction computed in the RDD programming model. Under cross-validation, the accuracy of the parallel XGBoost algorithm reached 0.9234; the lowest F1-measure was 0.8604 and the highest was 0.9984. Training on data sets of 21,000, 42,000, and 84,000 records gave speedup ratios of 2.13, 3.47, and 3.82, respectively. The results showed that the new classification labels were set more scientifically and that the system adapted better to the forestry-related texts currently on the internet. The precision and recall of the XGBoost algorithm were significantly better than those of four other Spark-based parallel algorithms (naive Bayes, gradient boosting decision tree, back-propagation neural network, and extreme learning machine), and the algorithm ran more efficiently than the stand-alone version. Moreover, the speedup ratio increased with the amount of data, which indicates that the method is well suited to real-time, accurate classification of massive forestry texts.
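The speedup ratios in the abstract can be read with a short sketch. It assumes the usual definition of speedup, single-machine training time divided by parallel training time; the abstract does not give the formula, the timings, or the cluster size, so the `speedup` helper and its example timings are hypothetical. Only the three reported ratios come from the abstract.

```python
def speedup(t_single, t_parallel):
    """Speedup ratio under the assumed definition: T_single / T_parallel."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("timings must be positive")
    return t_single / t_parallel


# Speedup ratios reported in the abstract, keyed by data set size (records).
REPORTED_SPEEDUPS = {21_000: 2.13, 42_000: 3.47, 84_000: 3.82}


def grows_with_data(speedups_by_size):
    """True if the speedup strictly increases as the data set gets larger."""
    ordered = [speedups_by_size[size] for size in sorted(speedups_by_size)]
    return all(a < b for a, b in zip(ordered, ordered[1:]))


def gain_per_step(speedups_by_size):
    """Ratio of each speedup to the one measured on the next smaller data set."""
    ordered = [speedups_by_size[size] for size in sorted(speedups_by_size)]
    return [b / a for a, b in zip(ordered, ordered[1:])]


# Hypothetical timings in seconds, purely to show the formula.
print(speedup(t_single=100.0, t_parallel=25.0))

print(grows_with_data(REPORTED_SPEEDUPS))
print(gain_per_step(REPORTED_SPEEDUPS))
```

On the reported figures the speedup rises with every doubling of the data, but the second doubling adds less than the first, so the gain from more data flattens as the data set grows.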