DMman (Data Mining Youth) -- An example of calling Weka from your own algorithm to implement text classification

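[Weka] An example of calling Weka from your own algorithm to implement text classification
Original posts

Posted by 数据挖掘青年 (Data Mining Youth) on 2007/7/4 17:47:57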

1 Introduction: embedded machine learning. This is a small data mining program that calls Weka from your own code to classify text. It has little practical value on its own, but it helps in understanding and using Weka. The example comes from "Data Mining: Practical Machine Learning Tools and Techniques", 2nd edition (I think it is in the third chapter from the end). You can download the book from http://blogger.org.cn/blog/message.asp?name=DMman#23691 to read the detailed explanation of the algorithm. The code is commented in detail; the comments are in English but fairly simple. Below is a brief introduction to using the example, and interested readers can study it further.

2 Function: a small program that classifies text using the J48 classifier in Weka. The text files are preprocessed with Weka's StringToWordVector filter.

3 Note: weka.jar must be on your classpath, otherwise the program will not compile.

4 Usage: command-line arguments:

    -t  path to the text file
    -m  path to your model file
    -c  optional, the class (hit or miss)

If -c is given, the text is used for training; otherwise the text is classified by the model and its class (hit or miss) is printed. The model is built dynamically, so the first command you run must specify -c in order to create the model.

1) Build the model

    >java MessageClassifier -t data/1.bmp -m myModel -c hit

You can see that myModel has been created. Then keep training the model. The more text instances you use, the better the model classifies.

    >java MessageClassifier -t data/2.bmp -m myModel -c hit
    >java MessageClassifier -t data/1.gif -m myModel -c miss
    ......

2) Classify with the model

Once the model exists, you can use it to classify text files, for example:

    >java MessageClassifier -t data/2.gif -m myModel

3) You can keep improving the model by running commands that supply the -c argument.

Source file MessageClassifier.java

    /**
     * Java program for classifying text messages into two classes.
     */
    import weka.core.Attribute;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.FastVector;
    import weka.core.Utils;
    import weka.classifiers.Classifier;
    import weka.classifiers.trees.J48;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;
    import java.io.*;

    public class MessageClassifier implements Serializable {

        /* The training data gathered so far. */
        private Instances m_Data = null;

        /* The filter used to generate the word counts. */
        private StringToWordVector m_Filter = new StringToWordVector();

        /* The actual classifier. */
        private Classifier m_Classifier = new J48();

        /* Whether the model is up to date. */
        private boolean m_UpToDate;

        /**
         * Constructs empty training dataset.
         */
        public MessageClassifier() throws Exception {
            String nameOfDataset = "MessageClassificationProblem";

            // Create vector of attributes.
            FastVector attributes = new FastVector(2);

            // Add attribute for holding messages.
            attributes.addElement(new Attribute("Message", (FastVector) null));

            // Add class attribute.
            FastVector classValues = new FastVector(2);
            classValues.addElement("miss");
            classValues.addElement("hit");
            attributes.addElement(new Attribute("Class", classValues));

            // Create dataset with initial capacity of 100, and set index of class.
            m_Data = new Instances(nameOfDataset, attributes, 100);
            m_Data.setClassIndex(m_Data.numAttributes() - 1);
        }

        /**
         * Updates data using the given training message.
         */
        public void updateData(String message, String classValue) throws Exception {
            // Make message into instance.
            Instance instance = makeInstance(message, m_Data);

            // Set class value for instance.
            instance.setClassValue(classValue);

            // Add instance to training data.
            m_Data.add(instance);
            m_UpToDate = false;
        }

        /**
         * Classifies a given message.
         */
        public void classifyMessage(String message) throws Exception {
            // Check whether classifier has been built.
            // (The posted code had this throw commented out, leaving an empty block.)
            if (m_Data.numInstances() == 0) {
                throw new Exception("No classifier available.");
            }

            // Check whether classifier and filter are up to date.
            if (!m_UpToDate) {
                // Initialize filter and tell it about the input format.
                m_Filter.setInputFormat(m_Data);

                // Generate word counts from the training data.
                Instances filteredData = Filter.useFilter(m_Data, m_Filter);

                // Rebuild classifier.
                m_Classifier.buildClassifier(filteredData);
                m_UpToDate = true;
            }

            // Make separate little test set so that message
            // does not get added to string attribute in m_Data.
            Instances testset = m_Data.stringFreeStructure();

            // Make message into test instance.
            Instance instance = makeInstance(message, testset);

            // Filter instance.
            m_Filter.input(instance);
            Instance filteredInstance = m_Filter.output();

            // Get index of predicted class value.
            double predicted = m_Classifier.classifyInstance(filteredInstance);

            // Output class value.
            System.err.println("Message classified as : " +
                m_Data.classAttribute().value((int) predicted));
        }

        /**
         * Method that converts a text message into an instance.
         */
        private Instance makeInstance(String text, Instances data) {
            // Create instance of length two.
            Instance instance = new Instance(2);

            // Set value for message attribute
            Attribute messageAtt = data.attribute("Message");
            instance.setValue(messageAtt, messageAtt.addStringValue(text));

            // Give instance access to attribute information from the dataset.
            instance.setDataset(data);
            return instance;
        }

        /**
         * Main method.
         */
        public static void main(String[] options) {
            try {
                // Read message file into string.
                String messageName = Utils.getOption('t', options);
                if (messageName.length() == 0) {
                    throw new Exception("Must provide name of message file.");
                }
                FileReader m = new FileReader(messageName);
                StringBuffer message = new StringBuffer();
                int l;
                while ((l = m.read()) != -1) {
                    message.append((char) l);
                }
                m.close();

                // Check if class value is given.
                String classValue = Utils.getOption('c', options);

                // If model file exists, read it, otherwise create new one.
                String modelName = Utils.getOption('m', options);
                if (modelName.length() == 0) {
                    throw new Exception("Must provide name of model file.");
                }
                MessageClassifier messageCl;
                try {
                    ObjectInputStream modelInObjectFile =
                        new ObjectInputStream(new FileInputStream(modelName));
                    messageCl = (MessageClassifier) modelInObjectFile.readObject();
                    modelInObjectFile.close();
                } catch (FileNotFoundException e) {
                    messageCl = new MessageClassifier();
                }

                // Check if there are any options left
                Utils.checkForRemainingOptions(options);

                // Process message.
                if (classValue.length() != 0) {
                    messageCl.updateData(message.toString(), classValue);
                } else {
                    messageCl.classifyMessage(message.toString());
                }

                // Save message classifier object.
                ObjectOutputStream modelOutObjectFile =
                    new ObjectOutputStream(new FileOutputStream(modelName));
                modelOutObjectFile.writeObject(messageCl);
                modelOutObjectFile.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Download the source code: 文本分类算法.rar
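To make the preprocessing step less of a black box, here is a rough sketch of the bag-of-words idea behind a filter like StringToWordVector: each message becomes a vector with one column per vocabulary word. This is only an illustration in plain Java and is not Weka's implementation. The class BagOfWordsSketch and its methods are hypothetical names made up for this sketch, and the tokenizer (lower-casing and splitting on anything that is not a-z) is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper, not part of Weka: a toy bag-of-words vectorizer. */
public class BagOfWordsSketch {

    /** Lower-cases the text and splits it on runs of characters other than a-z. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Maps every word seen in the training messages to a column index (alphabetical order). */
    public static Map<String, Integer> buildVocabulary(List<String> messages) {
        TreeSet<String> words = new TreeSet<>();
        for (String message : messages) {
            words.addAll(tokenize(message));
        }
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        for (String word : words) {
            vocabulary.put(word, vocabulary.size());
        }
        return vocabulary;
    }

    /** Counts how often each vocabulary word occurs in the message; unknown words are ignored. */
    public static int[] vectorize(String message, Map<String, Integer> vocabulary) {
        int[] counts = new int[vocabulary.size()];
        for (String token : tokenize(message)) {
            Integer column = vocabulary.get(token);
            if (column != null) {
                counts[column]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> training = Arrays.asList("cheap pills, cheap offer", "meeting at noon");
        Map<String, Integer> vocabulary = buildVocabulary(training);
        System.out.println(vocabulary.keySet());
        System.out.println(Arrays.toString(vectorize("cheap meeting tomorrow", vocabulary)));
    }
}
```

A new message is vectorized against the vocabulary built from the training messages, which is roughly why the program keeps the filter together with the training data and rebuilds both when new training text arrives.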
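The main method keeps its state between runs with Java serialization: read the model file if it exists, otherwise start from an empty object, then write the object back. The sketch below isolates that load, update, save cycle using only java.io. PersistentModelSketch is a hypothetical stand-in that merely counts examples per class; it is not a classifier and not part of Weka.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a trainable model: it only counts training examples per class. */
public class PersistentModelSketch implements Serializable {

    private static final long serialVersionUID = 1L;

    private final Map<String, Integer> labelCounts = new HashMap<>();

    /** Records one more training example for the given class label. */
    public void update(String label) {
        labelCounts.merge(label, 1, Integer::sum);
    }

    /** Returns how many examples have been recorded for the label. */
    public int count(String label) {
        return labelCounts.getOrDefault(label, 0);
    }

    /** Reads the model from the file, or returns an empty model if the file does not exist. */
    public static PersistentModelSketch loadOrCreate(File file)
            throws IOException, ClassNotFoundException {
        if (!file.exists()) {
            return new PersistentModelSketch();
        }
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (PersistentModelSketch) in.readObject();
        }
    }

    /** Writes the whole object to the file, replacing any previous content. */
    public void save(File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(this);
        }
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("model", ".ser");
        file.delete(); // start without a model file, like the very first run

        // Each loop iteration mimics one separate command-line run with -c.
        for (String label : new String[] {"hit", "hit", "miss"}) {
            PersistentModelSketch model = loadOrCreate(file);
            model.update(label);
            model.save(file);
        }

        PersistentModelSketch model = loadOrCreate(file);
        System.out.println("hit=" + model.count("hit") + " miss=" + model.count("miss"));
        file.delete();
    }
}
```

Because the whole object is written back on every run, the training data accumulates in the model file across invocations, which matches the usage where each command with -c adds one more example.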
