Nutch 2.1 with Cassandra backend: generate error
I chose Cassandra as the backend and started playing with Nutch.
With a small subset of the DMOZ URLs (~50k), everything (inject, generate, fetch) runs fine.
However, after I injected the whole DMOZ URL set (~3.5M) and tried to generate a fetch list, I got the following error, which is reproducible on my system:
~/software/nutch_dmoz/local$ ./bin/nutch generate -topn 1000
generatorjob: selecting best-scoring urls due fetch.
generatorjob: starting
generatorjob: filtering: true
generatorjob: topn: 1000
generatorjob: java.lang.runtimeexception: job failed: name=generate: 1366905487-307733671, jobid=job_local_0001
    at org.apache.nutch.util.nutchjob.waitforcompletion(nutchjob.java:54)
    at org.apache.nutch.crawl.generatorjob.run(generatorjob.java:191)
    at org.apache.nutch.crawl.generatorjob.generate(generatorjob.java:213)
    at org.apache.nutch.crawl.generatorjob.run(generatorjob.java:241)
    at org.apache.hadoop.util.toolrunner.run(toolrunner.java:65)
    at org.apache.nutch.crawl.generatorjob.main(generatorjob.java:249)
logs/hadoop.log:
2013-04-25 17:58:07,986 info crawl.generatorjob - generatorjob: selecting best-scoring urls due fetch.
2013-04-25 17:58:08,007 info crawl.generatorjob - generatorjob: starting
2013-04-25 17:58:08,007 info crawl.generatorjob - generatorjob: filtering: true
2013-04-25 17:58:08,007 info crawl.generatorjob - generatorjob: topn: 1000
2013-04-25 17:58:08,570 info connection.cassandrahostretryservice - downed host retry service started queue size -1 , retry delay 10 s
2013-04-25 17:58:08,660 info service.jmxmonitor - registering jmx me.prettyprint.cassandra.service_test cluster:servicetype=hector,monitortype=hector
2013-04-25 17:58:09,029 warn util.nativecodeloader - unable load native-hadoop library platform... using builtin-java classes where applicable
2013-04-25 17:58:09,403 info mapreduce.gorarecordreader - gora.buffer.read.limit = 10000
2013-04-25 17:58:09,435 info plugin.pluginrepository - plugins: looking in: /home/sethunder/software/nutch_dmoz/local/plugins
2013-04-25 17:58:09,560 info plugin.pluginrepository - plugin auto-activation mode: [true]
2013-04-25 17:58:09,560 info plugin.pluginrepository - registered plugins:
2013-04-25 17:58:09,560 info plugin.pluginrepository - nutch core extension points (nutch-extensionpoints)
2013-04-25 17:58:09,560 info plugin.pluginrepository - regex url normalizer (urlnormalizer-regex)
2013-04-25 17:58:09,560 info plugin.pluginrepository - cyberneko html parser (lib-nekohtml)
2013-04-25 17:58:09,560 info plugin.pluginrepository - opic scoring plug-in (scoring-opic)
2013-04-25 17:58:09,560 info plugin.pluginrepository - basic url normalizer (urlnormalizer-basic)
2013-04-25 17:58:09,560 info plugin.pluginrepository - tika parser plug-in (parse-tika)
2013-04-25 17:58:09,560 info plugin.pluginrepository - html parse plug-in (parse-html)
2013-04-25 17:58:09,560 info plugin.pluginrepository - basic indexing filter (index-basic)
2013-04-25 17:58:09,560 info plugin.pluginrepository - anchor indexing filter (index-anchor)
2013-04-25 17:58:09,560 info plugin.pluginrepository - http framework (lib-http)
2013-04-25 17:58:09,561 info plugin.pluginrepository - regex url filter (urlfilter-regex)
2013-04-25 17:58:09,561 info plugin.pluginrepository - regex url filter framework (lib-regex-filter)
2013-04-25 17:58:09,561 info plugin.pluginrepository - pass-through url normalizer (urlnormalizer-pass)
2013-04-25 17:58:09,561 info plugin.pluginrepository - http protocol plug-in (protocol-http)
2013-04-25 17:58:09,561 info plugin.pluginrepository - registered extension-points:
2013-04-25 17:58:09,561 info plugin.pluginrepository - nutch url normalizer (org.apache.nutch.net.urlnormalizer)
2013-04-25 17:58:09,561 info plugin.pluginrepository - nutch protocol (org.apache.nutch.protocol.protocol)
2013-04-25 17:58:09,561 info plugin.pluginrepository - parse filter (org.apache.nutch.parse.parsefilter)
2013-04-25 17:58:09,561 info plugin.pluginrepository - nutch url filter (org.apache.nutch.net.urlfilter)
2013-04-25 17:58:09,561 info plugin.pluginrepository - nutch indexing filter (org.apache.nutch.indexer.indexingfilter)
2013-04-25 17:58:09,561 info plugin.pluginrepository - nutch content parser (org.apache.nutch.parse.parser)
2013-04-25 17:58:09,561 info plugin.pluginrepository - nutch scoring (org.apache.nutch.scoring.scoringfilter)
2013-04-25 17:58:09,582 info crawl.fetchschedulefactory - using fetchschedule impl: org.apache.nutch.crawl.defaultfetchschedule
2013-04-25 17:58:09,582 info crawl.abstractfetchschedule - defaultinterval=2592000
2013-04-25 17:58:09,582 info crawl.abstractfetchschedule - maxinterval=7776000
2013-04-25 17:58:11,046 info regex.regexurlnormalizer - can't find rules scope 'generate_host_count', using default
2013-04-25 18:01:02,936 warn mapred.fileoutputcommitter - output path null in cleanup
2013-04-25 18:01:02,936 warn mapred.localjobrunner - job_local_0001
java.lang.arrayindexoutofboundsexception
2013-04-25 18:01:03,412 error crawl.generatorjob - generatorjob: java.lang.runtimeexception: job failed: name=generate: 1366905487-307733671, jobid=job_local_0001
    at org.apache.nutch.util.nutchjob.waitforcompletion(nutchjob.java:54)
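Since most of that log is plugin registration noise, a small filter that keeps only the warn and error entries makes it easier to read. The following is a minimal sketch, assuming the `timestamp level logger - message` layout seen in the log above; the function names (`split_entries`, `problems`, `problems_in_file`) are made up for illustration and are not part of Nutch or Hadoop.

```python
import re

# Start of one log entry: "YYYY-MM-DD HH:MM:SS,mmm LEVEL logger - "
# (layout assumed from the log excerpt above)
ENTRY = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+)\s+(\S+) - "
)


def split_entries(text):
    """Split raw log text into (timestamp, level, logger, message) tuples.

    Everything between one entry header and the next is treated as the
    message, so multi-line stack traces stay attached to their entry.
    """
    matches = list(ENTRY.finditer(text))
    entries = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        message = text[m.end():end].strip()
        entries.append((m.group(1), m.group(2).lower(), m.group(3), message))
    return entries


def problems(text, levels=("warn", "error", "fatal")):
    """Return only the entries whose level is in `levels` (case-insensitive)."""
    return [entry for entry in split_entries(text) if entry[1] in levels]


def problems_in_file(path):
    """Convenience wrapper: read a log file and return its problem entries."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return problems(fh.read())
```

Calling `problems_in_file("logs/hadoop.log")` on a log like the one above should leave only the warn and error entries, including the one from mapred.localjobrunner that carries the exception.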
As far as I can see, I did not run out of disk space. The /tmp partition has 250 GB of free space, and the partition Cassandra is running on has 2.5 TB of free space. Is there a possibility to increase the verbosity? Also, I wonder why the ArrayIndexOutOfBoundsException does not say which bound it tried to access, or anything else. The keyspace webpage exists, and I can access it with cassandra-cli. Here is the output of readdb -stats:
~/software/nutch_dmoz/local$ ./bin/nutch readdb -stats
webtable statistics start
statistics webtable:
min score: 55.0
retry 0: 3576393
jobs: {db_stats-job_local_0001={jobid=job_local_0001, jobname=db_stats, counters={file input format counters ={bytes_read=0}, map-reduce framework={map_output_materialized_bytes=1609, map_input_records=3576393, reduce_shuffle_bytes=0, spilled_records=858, map_output_bytes=189548829, committed_heap_bytes=1521614848, cpu_milliseconds=0, split_raw_bytes=1010, combine_input_records=14305902, reduce_input_records=114, reduce_input_groups=114, combine_output_records=444, physical_memory_bytes=0, reduce_output_records=114, virtual_memory_bytes=0, map_output_records=14305572}, filesystemcounters={file_bytes_read=910481, file_bytes_written=1028473}, file output format counters ={bytes_written=2421}}}}
max score: 1.0
total urls: 3576393
status 0 (null): 3576393
avg score: 1.0
webtable statistics: done
min score: 55.0
retry 0: 3576393
jobs: {db_stats-job_local_0001={jobid=job_local_0001, jobname=db_stats, counters={file input format counters ={bytes_read=0}, map-reduce framework={map_output_materialized_bytes=1609, map_input_records=3576393, reduce_shuffle_bytes=0, spilled_records=858, map_output_bytes=189548829, committed_heap_bytes=1521614848, cpu_milliseconds=0, split_raw_bytes=1010, combine_input_records=14305902, reduce_input_records=114, reduce_input_groups=114, combine_output_records=444, physical_memory_bytes=0, reduce_output_records=114, virtual_memory_bytes=0, map_output_records=14305572}, filesystemcounters={file_bytes_read=910481, file_bytes_written=1028473}, file output format counters ={bytes_written=2421}}}}
max score: 1.0
total urls: 3576393
status 0 (null): 3576393
avg score: 1.0