Running Nutch automatically (Windows)

[Apache(jakarta)] Running Nutch automatically (Windows)
Software Technology

Posted by lhwork on 2006/12/13 16:04:58

1: A script for invoking Nutch under Windows. It makes automatic runs possible without using Cygwin to emulate Linux. Below are the scripts for calling Nutch on Windows XP.

nutch.bat

    @cmd /V:on /c %~dp0nutch1.bat %*

nutch1.bat

    @echo on
    rem *********************************************************************
    rem * A script to launch nutch on Windows 2000/XP System.
    rem *
    rem * Written by babatu
    rem * babatu@gmail.com  blog: blog.babatu.com
    rem *
    rem * Because delayed environment is used, cmd /V:on should be used to
    rem * run this script.
    rem *********************************************************************
    if "%OS%"=="Windows_NT" @setlocal
    if "%OS%"=="WINNT" @setlocal

    if "%1" == "" goto :msg
    goto :begin

    :msg
    echo "Usage: nutch COMMAND"
    echo "where COMMAND is one of:"
    echo "  crawl        one-step crawler for intranets"
    echo "  readdb       read / dump crawl db"
    echo "  readlinkdb   read / dump link db"
    echo "  inject       inject new urls into the database"
    echo "  generate     generate new segments to fetch"
    echo "  fetch        fetch a segment's pages"
    echo "  parse        parse a segment's pages"
    echo "  segread      read / dump segment data"
    echo "  updatedb     update crawl db from segments after fetching"
    echo "  invertlinks  create a linkdb from parsed segments"
    echo "  index        run the indexer on parsed segments and linkdb"
    echo "  merge        merge several segment indexes"
    echo "  dedup        remove duplicates from a set of segment indexes"
    echo "  plugin       load a plugin and run one of its classes main()"
    echo "  server       run a search server"
    echo " or"
    echo "  CLASSNAME    run the class named CLASSNAME"
    echo "Most commands print help when invoked w/o parameters."
    pause
    goto :end

    :begin
    rem %~dp0 is the expanded pathname of the current script under NT
    set DEFAULT_NUTCH_HOME=%~dp0..
    rem set DEFAULT_NUTCH_HOME=..
    rem fall back to DEFAULT_NUTCH_HOME when NUTCH_HOME is not set
    if "%NUTCH_HOME%"=="" set NUTCH_HOME=%DEFAULT_NUTCH_HOME%
    set DEFAULT_NUTCH_HOME=
    echo %NUTCH_HOME%

    rem set _USE_CLASSPATH=yes
    if "%CLASSPATH%"=="" (set CLASSPATH=%JAVA_HOME%\lib\tools.jar) ELSE set CLASSPATH=%CLASSPATH%;%JAVA_HOME%\lib\tools.jar
    set CLASSPATH=%CLASSPATH%;%NUTCH_HOME%\conf;
    echo %CLASSPATH%
    echo before other

    rem for developers, add plugins, job & test code to CLASSPATH
    if exist %NUTCH_HOME%\build\plugins set CLASSPATH=%CLASSPATH%;%NUTCH_HOME%\build
    for /R %NUTCH_HOME%\build %%i in (nutch*.job) do set CLASSPATH=!CLASSPATH!;%%i
    if exist %NUTCH_HOME%\build\test\classes set CLASSPATH=%CLASSPATH%;%NUTCH_HOME%\build\test\classes

    rem for releases, add Nutch job to CLASSPATH
    for /R %NUTCH_HOME% %%i in (nutch*.job) do set CLASSPATH=!CLASSPATH!;%%i

    rem add plugins to classpath
    if exist %NUTCH_HOME%\plugins set CLASSPATH=%CLASSPATH%;%NUTCH_HOME%

    rem add libs to CLASSPATH
    for /R %NUTCH_HOME%\lib %%f in (*.jar) do set CLASSPATH=!CLASSPATH!;%%f
    echo %CLASSPATH%

    rem translate command
    if "%1"=="crawl" set CLASS=org.apache.nutch.crawl.Crawl
    if "%1"=="inject" set CLASS=org.apache.nutch.crawl.Injector
    if "%1"=="generate" set CLASS=org.apache.nutch.crawl.Generator
    if "%1"=="fetch" set CLASS=org.apache.nutch.fetcher.Fetcher
    if "%1"=="parse" set CLASS=org.apache.nutch.parse.ParseSegment
    if "%1"=="readdb" set CLASS=org.apache.nutch.crawl.CrawlDbReader
    if "%1"=="readlinkdb" set CLASS=org.apache.nutch.crawl.LinkDbReader
    if "%1"=="segread" set CLASS=org.apache.nutch.segment.SegmentReader
    if "%1"=="updatedb" set CLASS=org.apache.nutch.crawl.CrawlDb
    if "%1"=="invertlinks" set CLASS=org.apache.nutch.crawl.LinkDb
    if "%1"=="index" set CLASS=org.apache.nutch.indexer.Indexer
    if "%1"=="dedup" set CLASS=org.apache.nutch.indexer.DeleteDuplicates
    if "%1"=="merge" set CLASS=org.apache.nutch.indexer.IndexMerger
    if "%1"=="plugin" set CLASS=org.apache.nutch.plugin.PluginRepository
    rem no quotes here: cmd would pass single quotes through to java literally
    if "%1"=="server" set CLASS=org.apache.nutch.searcher.DistributedSearch$Server
    if "%CLASS%"=="" set CLASS=%1

    rem %* still contains the command name, so collect the remaining arguments
    set ARGS=
    shift
    :collect
    if "%~1"=="" goto :run
    set ARGS=%ARGS% %1
    shift
    goto :collect

    :run
    "%JAVA_HOME%\bin\java" -cp "%CLASSPATH%" %CLASS% %ARGS%

    if "%OS%"=="Windows_NT" @endlocal
    if "%OS%"=="WINNT" @endlocal
    :end

2: Write a maintenance script and run it on a schedule

    #!/bin/bash

    # Set JAVA_HOME to reflect your systems java configuration
    export JAVA_HOME=/usr/lib/j2sdk1.5-sun

    # Start the index update: select only the 1000 top-scoring records
    # and create a new segment from them
    bin/nutch generate crawl.mydomain/db crawl.mydomain/segments -topN 1000

    # Get the name of the newest segment directory
    s=`ls -d crawl.mydomain/segments/2* | tail -1`
    echo Segment is $s

    bin/nutch fetch $s
    bin/nutch updatedb crawl.mydomain/db $s
    bin/nutch analyze crawl.mydomain/db 5
    bin/nutch index $s

    # Remove duplicate records
    bin/nutch dedup crawl.mydomain/segments crawl.mydomain/tmpfile

    # Merge segments to prevent too many open files exception in Lucene
    # (this merges everything into one new segment)
    bin/nutch mergesegs -dir crawl.mydomain/segments -i -ds
    s=`ls -d crawl.mydomain/segments/2* | tail -1`
    echo Merged Segment is $s

    rm -rf crawl.mydomain/index

The script above is the approach to use when the contents of the urls file have not changed. If new URLs have been added to the urls file, run the following command before running generate:

    bin/nutch inject crawl.mydomain/db -urlfile urls

If generate is run without the -topN parameter, the crawl only processes newly added URLs and URLs or pages that were not fetched earlier for some other reason. So my feeling is that alternating between script 1 and the script modified according to 2 should give good results.

Excerpted from http://www.lvban.com/blog/computer/index.html

3: Running Nutch automatically with shell scripts requires a shell. Here is another option, a quick Python version of the nutch script.

    import glob
    import os
    import subprocess
    import sys

    # The Nutch command script
    #
    # Environment Variables
    #
    #   NUTCH_JAVA_HOME The java implementation to use. Overrides JAVA_HOME.
    #
    #   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
    #                   Default is 1000.
    #
    #   NUTCH_OPTS      Extra Java runtime options.
    #
    # ported to python by Ben Ogle (ogle dot ben [at] gmail)
    # does not handle links.

    thisdir = os.getcwd()
    cpsep = ":"
    if os.name == "nt":
        cpsep = ";"

    if len(sys.argv) == 1:
        print("Usage: python nutch.py COMMAND")
        print("where COMMAND is one of:")
        print("  crawl        one-step crawler for intranets")
        print("  readdb       read / dump crawl db")
        print("  mergedb      merge crawldb-s, with optional filtering")
        print("  readlinkdb   read / dump link db")
        print("  inject       inject new urls into the database")
        print("  generate     generate new segments to fetch")
        print("  fetch        fetch a segment's pages")
        print("  parse        parse a segment's pages")
        print("  segread      read / dump segment data")
        print("  mergesegs    merge several segments, with optional filtering and slicing")
        print("  updatedb     update crawl db from segments after fetching")
        print("  invertlinks  create a linkdb from parsed segments")
        print("  mergelinkdb  merge linkdb-s, with optional filtering")
        print("  index        run the indexer on parsed segments and linkdb")
        print("  merge        merge several segment indexes")
        print("  dedup        remove duplicates from a set of segment indexes")
        print("  plugin       load a plugin and run one of its classes main()")
        print("  server       run a search server")
        print(" or")
        print("  CLASSNAME    run the class named CLASSNAME")
        print("Most commands print help when invoked w/o parameters.")
        sys.exit(1)

    command = sys.argv[1]
    #print("COMMAND: " + command)

    # assumes the script is started from the bin directory
    nutch_home = thisdir + "/.."

    java_home = os.getenv("NUTCH_JAVA_HOME")
    if java_home is not None:
        # os.setenv does not exist; assigning to os.environ sets the variable
        os.environ["JAVA_HOME"] = java_home
        print(java_home)

    java_home = os.getenv("JAVA_HOME")
    if java_home is None:
        print("Error: JAVA_HOME is not set.")
        sys.exit(1)

    # not used below: the command line starts "java" from the PATH instead
    java = java_home + "/bin/java.exe"

    java_heap_max = "-Xmx1000m"
    nutch_heap_sz = os.getenv("NUTCH_HEAPSIZE")
    if nutch_heap_sz is not None:
        java_heap_max = "-Xmx" + nutch_heap_sz + "m"
    #print(java_heap_max)

    classpath = nutch_home + "/conf"
    # tools.jar ships with the JDK, so it is looked up under JAVA_HOME
    # (the script as posted used nutch_home here)
    classpath = classpath + cpsep + java_home + "/lib/tools.jar"

    # for developers, add plugins, job & test code to CLASSPATH
    if os.path.exists(nutch_home + "/build/plugins"):
        classpath = classpath + cpsep + nutch_home + "/build/plugins"

    flist = glob.glob(nutch_home + "/build/nutch-*.job")
    for l in flist:
        classpath = classpath + cpsep + l

    if os.path.exists(nutch_home + "/build/test/classes"):
        classpath = classpath + cpsep + nutch_home + "/build/test/classes"

    flist = glob.glob(nutch_home + "/nutch-*.job")
    for l in flist:
        classpath = classpath + cpsep + l

    if os.path.exists(nutch_home + "/plugins"):
        classpath = classpath + cpsep + nutch_home + "/plugins"

    flist = glob.glob(nutch_home + "/lib/*.jar")
    for l in flist:
        classpath = classpath + cpsep + l

    flist = glob.glob(nutch_home + "/lib/jetty-ext/*.jar")
    for l in flist:
        classpath = classpath + cpsep + l
    #print(classpath)

    nutch_log_dir = os.getenv("NUTCH_LOG_DIR")
    if nutch_log_dir is None:
        nutch_log_dir = nutch_home + "/logs"

    nutch_log_file = os.getenv("NUTCH_LOGFILE")
    if nutch_log_file is None:
        nutch_log_file = "hadoop.log"

    # NUTCH_OPTS is split on whitespace into separate java options
    nutch_opts = os.getenv("NUTCH_OPTS", "").split()
    nutch_opts.append("-Dhadoop.log.dir=" + nutch_log_dir)
    nutch_opts.append("-Dhadoop.log.file=" + nutch_log_file)

    # figure out which class to run
    theclass = command
    if command == "crawl":
        theclass = "org.apache.nutch.crawl.Crawl"
    elif command == "inject":
        theclass = "org.apache.nutch.crawl.Injector"
    elif command == "generate":
        theclass = "org.apache.nutch.crawl.Generator"
    elif command == "fetch":
        theclass = "org.apache.nutch.fetcher.Fetcher"
    elif command == "parse":
        theclass = "org.apache.nutch.parse.ParseSegment"
    elif command == "readdb":
        theclass = "org.apache.nutch.crawl.CrawlDbReader"
    elif command == "mergedb":
        theclass = "org.apache.nutch.crawl.CrawlDbMerger"
    elif command == "readlinkdb":
        theclass = "org.apache.nutch.crawl.LinkDbReader"
    elif command == "segread":
        theclass = "org.apache.nutch.segment.SegmentReader"
    elif command == "mergesegs":
        theclass = "org.apache.nutch.segment.SegmentMerger"
    elif command == "updatedb":
        theclass = "org.apache.nutch.crawl.CrawlDb"
    elif command == "invertlinks":
        theclass = "org.apache.nutch.crawl.LinkDb"
    elif command == "mergelinkdb":
        theclass = "org.apache.nutch.crawl.LinkDbMerger"
    elif command == "index":
        theclass = "org.apache.nutch.indexer.Indexer"
    elif command == "dedup":
        theclass = "org.apache.nutch.indexer.DeleteDuplicates"
    elif command == "merge":
        theclass = "org.apache.nutch.indexer.IndexMerger"
    elif command == "plugin":
        theclass = "org.apache.nutch.plugin.PluginRepository"
    elif command == "server":
        # passed as a list element below, so no shell ever expands "$Server"
        theclass = "org.apache.nutch.searcher.DistributedSearch$Server"

    # everything after the command name goes to the Java class unchanged
    args = sys.argv[2:]

    # The posted script built one command string for os.system and noted that
    # Windows rejected a quoted full path to java. An argument list avoids
    # the quoting problem; "java" is still taken from the PATH.
    cmdtorun = ["java", java_heap_max] + nutch_opts + ["-classpath", classpath, theclass] + args
    #print(cmdtorun)
    sys.exit(subprocess.call(cmdtorun))

Source: http://www.cnblogs.com/abob/archive/2006/09/14/503713.html
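The maintenance script in part 2 relies on a shell for one step in particular: picking the newest segment directory with ls and tail. A minimal Python sketch of that step is shown here, so the same cycle could be driven on Windows without a shell. The function names latest_segment and update_commands are hypothetical, the crawl.mydomain layout is taken from the script in part 2, and the generate, analyze, dedup and mergesegs steps are left out for brevity.

```python
import os


def latest_segment(segments_dir):
    """Return the path of the newest segment directory, or None if there is none.

    Segment directories are named by timestamp, so the last name in sorted
    order is the newest. This mirrors `ls -d segments/2* | tail -1`.
    """
    names = sorted(
        name for name in os.listdir(segments_dir)
        if name.startswith("2")
        and os.path.isdir(os.path.join(segments_dir, name))
    )
    if not names:
        return None
    return os.path.join(segments_dir, names[-1])


def update_commands(crawl_dir, segment, nutch="bin/nutch"):
    """Build the fetch / updatedb / index commands for one segment.

    Each command is an argument list. The subcommands and their order are
    copied from the maintenance script; nothing is executed here.
    """
    db = crawl_dir + "/db"
    return [
        [nutch, "fetch", segment],
        [nutch, "updatedb", db, segment],
        [nutch, "index", segment],
    ]
```

Each list returned by update_commands can be handed to subprocess.call after generate has created a new segment. On Windows the nutch argument would need to point at nutch.bat or at the Python launcher above, which is left as an assumption here.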


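How can Nutch be made to run automatically (Windows)?
Software Technology

宝贝 (guest) commented on 2007/3/26 16:58:51

Hello: I am working with Nutch at the moment, but I cannot figure out how to make it run automatically or how to maintain it. Please advise, and thanks in advance! My email is zhangzhimin2008@gmail.com; if you have any material on this, please send it to me. How exactly should the nutch.bat script be written? Thanks!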
