Nutch 2.1 with Cassandra backend: generate error


I chose the Cassandra backend and started playing with Nutch.

With a small subset of the DMOZ URLs (~50k), inject, generate, and fetch all run fine.

However, after I injected the whole DMOZ URL set (~3.5M) and tried to generate a fetch list, I got the following error, which is reproducible on my system:

~/software/nutch_dmoz/local$ ./bin/nutch generate -topN 1000
GeneratorJob: selecting best-scoring urls due fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: topn: 1000
GeneratorJob: java.lang.RuntimeException: job failed: name=generate: 1366905487-307733671, jobid=job_local_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:191)
    at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:213)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:241)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:249)

logs/hadoop.log:

2013-04-25 17:58:07,986 INFO  crawl.GeneratorJob - GeneratorJob: selecting best-scoring urls due fetch.
2013-04-25 17:58:08,007 INFO  crawl.GeneratorJob - GeneratorJob: starting
2013-04-25 17:58:08,007 INFO  crawl.GeneratorJob - GeneratorJob: filtering: true
2013-04-25 17:58:08,007 INFO  crawl.GeneratorJob - GeneratorJob: topn: 1000
2013-04-25 17:58:08,570 INFO  connection.CassandraHostRetryService - downed host retry service started queue size -1, retry delay 10 s
2013-04-25 17:58:08,660 INFO  service.JmxMonitor - registering jmx me.prettyprint.cassandra.service_test cluster:servicetype=hector,monitortype=hector
2013-04-25 17:58:09,029 WARN  util.NativeCodeLoader - unable load native-hadoop library platform... using builtin-java classes where applicable
2013-04-25 17:58:09,403 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2013-04-25 17:58:09,435 INFO  plugin.PluginRepository - plugins: looking in: /home/sethunder/software/nutch_dmoz/local/plugins
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository - plugin auto-activation mode: [true]
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository - registered plugins:
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         nutch core extension points (nutch-extensionpoints)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         regex url normalizer (urlnormalizer-regex)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         cyberneko html parser (lib-nekohtml)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         opic scoring plug-in (scoring-opic)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         basic url normalizer (urlnormalizer-basic)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         tika parser plug-in (parse-tika)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         html parse plug-in (parse-html)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         basic indexing filter (index-basic)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         anchor indexing filter (index-anchor)
2013-04-25 17:58:09,560 INFO  plugin.PluginRepository -         http framework (lib-http)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         regex url filter (urlfilter-regex)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         regex url filter framework (lib-regex-filter)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         pass-through url normalizer (urlnormalizer-pass)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         http protocol plug-in (protocol-http)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository - registered extension-points:
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         nutch url normalizer (org.apache.nutch.net.URLNormalizer)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         nutch protocol (org.apache.nutch.protocol.Protocol)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         parse filter (org.apache.nutch.parse.ParseFilter)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         nutch url filter (org.apache.nutch.net.URLFilter)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         nutch indexing filter (org.apache.nutch.indexer.IndexingFilter)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         nutch content parser (org.apache.nutch.parse.Parser)
2013-04-25 17:58:09,561 INFO  plugin.PluginRepository -         nutch scoring (org.apache.nutch.scoring.ScoringFilter)
2013-04-25 17:58:09,582 INFO  crawl.FetchScheduleFactory - using fetchschedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-04-25 17:58:09,582 INFO  crawl.AbstractFetchSchedule - defaultinterval=2592000
2013-04-25 17:58:09,582 INFO  crawl.AbstractFetchSchedule - maxinterval=7776000
2013-04-25 17:58:11,046 INFO  regex.RegexURLNormalizer - can't find rules scope 'generate_host_count', using default
2013-04-25 18:01:02,936 WARN  mapred.FileOutputCommitter - output path null in cleanup
2013-04-25 18:01:02,936 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.ArrayIndexOutOfBoundsException
2013-04-25 18:01:03,412 ERROR crawl.GeneratorJob - GeneratorJob: java.lang.RuntimeException: job failed: name=generate: 1366905487-307733671, jobid=job_local_0001
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)

As far as I can see, I did not run out of disk space. The /tmp partition has 250 GB of free space, and the partition Cassandra is running on has 2.5 TB free. Is there a way to increase the verbosity? Also, I wonder why the ArrayIndexOutOfBoundsException does not tell me which bound it tried to access; it reports nothing at all. The webpage keyspace exists, and I can access it with cassandra-cli. Here is the output of readdb -stats:

~/software/nutch_dmoz/local$ ./bin/nutch readdb -stats
webtable statistics start
statistics webtable:
min score:  55.0
retry 0:    3576393
jobs:   {db_stats-job_local_0001={jobid=job_local_0001, jobname=db_stats, counters={file input format counters ={bytes_read=0}, map-reduce framework={map_output_materialized_bytes=1609, map_input_records=3576393, reduce_shuffle_bytes=0, spilled_records=858, map_output_bytes=189548829, committed_heap_bytes=1521614848, cpu_milliseconds=0, split_raw_bytes=1010, combine_input_records=14305902, reduce_input_records=114, reduce_input_groups=114, combine_output_records=444, physical_memory_bytes=0, reduce_output_records=114, virtual_memory_bytes=0, map_output_records=14305572}, filesystemcounters={file_bytes_read=910481, file_bytes_written=1028473}, file output format counters ={bytes_written=2421}}}}
max score:  1.0
total urls: 3576393
status 0 (null):    3576393
avg score:  1.0
webtable statistics: done
min score:  55.0
retry 0:    3576393
jobs:   {db_stats-job_local_0001={jobid=job_local_0001, jobname=db_stats, counters={file input format counters ={bytes_read=0}, map-reduce framework={map_output_materialized_bytes=1609, map_input_records=3576393, reduce_shuffle_bytes=0, spilled_records=858, map_output_bytes=189548829, committed_heap_bytes=1521614848, cpu_milliseconds=0, split_raw_bytes=1010, combine_input_records=14305902, reduce_input_records=114, reduce_input_groups=114, combine_output_records=444, physical_memory_bytes=0, reduce_output_records=114, virtual_memory_bytes=0, map_output_records=14305572}, filesystemcounters={file_bytes_read=910481, file_bytes_written=1028473}, file output format counters ={bytes_written=2421}}}}
max score:  1.0
total urls: 3576393
status 0 (null):    3576393
avg score:  1.0
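
On the bare ArrayIndexOutOfBoundsException in hadoop.log: a minimal sketch of what such an exception normally carries. The class and method names below (AioobeMessageDemo, messageFor) are made up for illustration and are not part of Nutch or Gora. Normally the message names the offending index. One possible explanation for the empty exception above, and this is only an assumption about this setup, is that HotSpot may drop the message and stack trace of implicit exceptions thrown repeatedly from JIT-compiled code. Starting the JVM with -XX:-OmitStackTraceInFastThrow turns that behaviour off.

```java
public class AioobeMessageDemo {

    // Hypothetical helper: reads data[index] from a fresh int array of the
    // given length and returns the exception message if the read is out of
    // bounds, or the element value (as text) if it is in bounds.
    static String messageFor(int length, int index) {
        int[] data = new int[length];
        try {
            return String.valueOf(data[index]);
        } catch (ArrayIndexOutOfBoundsException e) {
            // The exact wording differs between JDK versions, but the
            // message of a freshly thrown exception includes the index.
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("message: " + messageFor(3, 5));
    }
}
```

If the flag brings back the full stack trace in logs/hadoop.log, the failing frame should show whether the exception comes from Nutch, Gora, or the Cassandra client.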

