org.ilrt.inkling.app
Class Scutter

java.lang.Object
  |
  +--org.ilrt.inkling.app.Scutter

public class Scutter
extends java.lang.Object

A Scutter RDF harvester (see http://rdfweb.org/topic/ScutterSpec)

todo:

- fix robots, forcelocal, etc etc

- sometimes gets stuck - timeout? - delete a thread when too old
(tmp fix - random waiting)
- randomise urls list and remove duplicates
(ok)
- also robots.txt
- check local copy first
- add some more options
- force check local copy
- set db and driver
- check scutterplan name out and in
- make choosable start and endfile.

- exclude some urls
- store url info in SQLdatabase
- check if visited before


Field Summary
static java.lang.String DISALLOW
           
 boolean forcelocal
           
 
Constructor Summary
Scutter()
           
Scutter(java.lang.String db)
          db is a JDBC database URL.
 
Method Summary
 void addSingleUrl(java.lang.String url)
          adds a single url to the store and a scutterplan called 'scutter.new2.rdf'
 boolean checkAvoid(java.lang.String url)
          Makes sure this one is not a url to avoid scuttering.
 void checkDone(int i)
          Checks if all the threads are finished and can save the scutterplan
 boolean checkModified(ScutterURLData data)
          ETag methods, from http://www.hackdiary.com/archives/000028.html by Matt Biddulph
 java.lang.String getdb()
           
 java.lang.String getDriver()
           
 java.util.Vector getPlan()
           
 java.util.Vector getThreads()
           
static void main(java.lang.String[] args)
           
 java.util.Vector readScutter(java.lang.String uri)
          Reads in a scutterplan, creating a ScutterURLData for each url, and returning a Vector of these.
 boolean robotSafe(java.lang.String u)
          robots.txt handling
 void runScutter()
          Reads a scutterplan and runs a scutter over it.
 void saveScutter(java.util.Vector newplan, java.lang.String filen)
          Saves Scutterplan with the filename specified
 void setdb(java.lang.String db)
           
 void setDriver(java.lang.String driver)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

forcelocal

public boolean forcelocal

DISALLOW

public static final java.lang.String DISALLOW
Constructor Detail

Scutter

public Scutter()

Scutter

public Scutter(java.lang.String db)
db is a JDBC database url, e.g. jdbc:postgresql://localhost:5432/codepict2?auth=password&user=postgres&password=notneeded or jdbc:mysql://127.0.0.1:3306/codepict2?user=mysql
Method Detail

getPlan

public java.util.Vector getPlan()

main

public static void main(java.lang.String[] args)

runScutter

public void runScutter()
Reads a scutterplan (e.g. http://swordfish.rdfweb.org/discovery/2001/08/codepict/scutterplan.rdf) and runs a scutter over it, in a randomised order and with 10 threads.
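
The randomised, 10-thread run described above might look roughly like the following. This is only a sketch and not the Scutter source: the class ScutterRunSketch and its methods prepare and run are hypothetical names, the example.org URLs are placeholders, and "visiting" a URL just records it instead of fetching RDF. The bounded wait is one possible answer to the "sometimes gets stuck" item in the todo list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ScutterRunSketch {

    // Hypothetical helper: drop duplicate URLs, then shuffle what is left.
    public static List<String> prepare(List<String> urls) {
        List<String> unique = new ArrayList<>(new LinkedHashSet<>(urls));
        Collections.shuffle(unique);
        return unique;
    }

    // Hypothetical helper: visit every URL using a fixed pool of 10 workers.
    // A real harvester would fetch and parse each URL; here it is only recorded.
    public static List<String> run(List<String> urls) {
        ConcurrentLinkedQueue<String> visited = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (String url : prepare(urls)) {
            pool.execute(() -> visited.add(url));
        }
        pool.shutdown();
        try {
            // Bound the wait so that a stuck worker cannot block the run forever.
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        List<String> plan = Arrays.asList(
            "http://example.org/a.rdf",
            "http://example.org/b.rdf",
            "http://example.org/a.rdf");
        System.out.println(run(plan).size() + " URLs visited");
    }
}
```

The order of the returned list varies between runs, both because of the shuffle and because workers finish at different times.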

checkDone

public void checkDone(int i)
Checks if all the threads are finished and can save the scutterplan
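
One way to detect that every worker has finished is a shared counter, sketched below. The class CheckDoneSketch and its methods are hypothetical, and the meaning of the int argument to checkDone is not documented here, so this sketch does not try to reproduce it. Saving the scutterplan is represented by setting a flag.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CheckDoneSketch {

    private final int total;
    private final AtomicInteger finished = new AtomicInteger();
    private volatile boolean saved = false;

    public CheckDoneSketch(int total) {
        this.total = total;
    }

    // Called once by each worker when it finishes.
    // Returns true only for the call that completes the whole set.
    public boolean workerFinished() {
        if (finished.incrementAndGet() == total) {
            saved = true; // a real harvester would write the scutterplan here
            return true;
        }
        return false;
    }

    public boolean isSaved() {
        return saved;
    }
}
```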

checkAvoid

public boolean checkAvoid(java.lang.String url)
Makes sure this one is not a url to avoid scuttering. It would be nice to expand this to use filters for certain types of file, e.g. image files, foaf files.
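
The file-type filter suggested above could be sketched as a suffix check on the URL path. Everything here is an assumption made for illustration: the class AvoidFilterSketch, the method shouldAvoid (which returns true for URLs to skip and may not match the return convention of checkAvoid), and the list of suffixes.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class AvoidFilterSketch {

    // Assumed suffix list; the real class may apply different rules.
    static final List<String> AVOID_SUFFIXES =
        Arrays.asList(".jpg", ".jpeg", ".png", ".gif");

    // Returns true when the URL should be skipped.
    public static boolean shouldAvoid(String url) {
        String path;
        try {
            // getPath() ignores the query string and fragment.
            path = new URI(url).getPath();
        } catch (URISyntaxException e) {
            return true; // this sketch skips URLs it cannot parse
        }
        if (path == null) {
            return false; // opaque URIs such as mailto: have no path
        }
        String lower = path.toLowerCase(Locale.ROOT);
        for (String suffix : AVOID_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }
}
```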

readScutter

public java.util.Vector readScutter(java.lang.String uri)
Reads in a scutterplan, creating a ScutterURLData for each url, and returning a Vector of these.

saveScutter

public void saveScutter(java.util.Vector newplan,
                        java.lang.String filen)
Saves Scutterplan with the filename specified

addSingleUrl

public void addSingleUrl(java.lang.String url)
adds a single url to the store and a scutterplan called 'scutter.new2.rdf'

checkModified

public boolean checkModified(ScutterURLData data)
ETag methods, from http://www.hackdiary.com/archives/000028.html by Matt Biddulph
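
An ETag check generally works by sending the stored ETag in an If-None-Match header and treating a 304 response as "not modified". The sketch below illustrates that general technique with java.net.HttpURLConnection; it is not the code from the linked article or from Scutter. The class EtagCheckSketch and both method signatures are hypothetical, and they take plain strings because the fields of ScutterURLData are not documented here.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class EtagCheckSketch {

    // Decision step, kept free of network access so it can be tested alone.
    // A 304 status, or an unchanged ETag, means the stored copy is current.
    // Any other response (including errors) counts as modified in this sketch.
    public static boolean isModified(int status, String storedEtag, String responseEtag) {
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
            return false;
        }
        if (storedEtag != null && storedEtag.equals(responseEtag)) {
            return false;
        }
        return true;
    }

    // Network step: send the stored ETag and apply the rule above.
    public static boolean checkModified(String url, String storedEtag) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("HEAD");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        if (storedEtag != null) {
            conn.setRequestProperty("If-None-Match", storedEtag);
        }
        try {
            return isModified(conn.getResponseCode(), storedEtag,
                              conn.getHeaderField("ETag"));
        } finally {
            conn.disconnect();
        }
    }
}
```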

setdb

public void setdb(java.lang.String db)

setDriver

public void setDriver(java.lang.String driver)

getdb

public java.lang.String getdb()

getDriver

public java.lang.String getDriver()

getThreads

public java.util.Vector getThreads()

robotSafe

public boolean robotSafe(java.lang.String u)
robots.txt handling
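
A simplified form of robots.txt handling is sketched below: collect the Disallow paths that apply to all user agents, then test a URL path against them by prefix. The class RobotsSketch and its methods are hypothetical, and the value given to DISALLOW is an assumption, since the value of the field in Scutter is not documented here. Allow lines, wildcards, and agent-specific groups are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RobotsSketch {

    // Assumed value; only the field name comes from the class documentation.
    static final String DISALLOW = "Disallow:";
    static final String USER_AGENT = "User-agent:";

    // Collects Disallow paths from groups that include "User-agent: *".
    public static List<String> disallowedPaths(String robotsTxt) {
        List<String> paths = new ArrayList<>();
        boolean inWildcardGroup = false;
        boolean prevWasAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            int hash = raw.indexOf('#'); // strip trailing comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) {
                continue;
            }
            String lower = line.toLowerCase(Locale.ROOT);
            if (lower.startsWith(USER_AGENT.toLowerCase(Locale.ROOT))) {
                boolean star = line.substring(USER_AGENT.length()).trim().equals("*");
                // Consecutive User-agent lines belong to the same group.
                inWildcardGroup = prevWasAgent ? (inWildcardGroup || star) : star;
                prevWasAgent = true;
                continue;
            }
            prevWasAgent = false;
            if (inWildcardGroup && lower.startsWith(DISALLOW.toLowerCase(Locale.ROOT))) {
                String path = line.substring(DISALLOW.length()).trim();
                if (!path.isEmpty()) { // an empty Disallow forbids nothing
                    paths.add(path);
                }
            }
        }
        return paths;
    }

    // Returns true when no collected prefix matches the given URL path.
    public static boolean robotSafe(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```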