java.lang.Object
  de.dbsystems.simplescrape.Scraper
public class Scraper extends java.lang.Object implements java.util.Iterator
Central class of this package. It supports compact descriptions of what to look for in a given web page: instead of checking in code whether "this and this is followed by that and that", one provides expressions describing the tokens to watch for. This class is not thread-safe.
Constructor Summary | |
---|---|
Scraper()
Empty constructor, does nothing. |
|
Scraper(java.io.InputStream input)
Convenience constructor that takes an InputStream to be parsed. |
Method Summary | |
---|---|
void |
advance(int howFar)
Advance within the current file. |
int |
available()
Returns how many more elements remain in the current file, counted from the current position. |
AbstractHTMLToken |
get(int index)
Returns the element at the given index. |
java.util.List<AbstractHTMLToken> |
getForms()
Returns all elements in this document that are relevant to forms. |
java.lang.String |
getNextContent(java.lang.String tagName)
Returns the text contained in the next tag of a specified kind. |
HTMLTag |
getNextTag()
Returns the next HTMLTag in the current file. |
HTMLTag |
getNextTag(int fromHere)
Returns the next HTMLTag in the current file, starting from a given location. |
TextToken |
getNextText(boolean skipEmpty)
Returns the next TextToken in the current file. |
TextToken |
getNextText(int fromHere,
boolean skipEmpty)
Returns the next TextToken in the current file, starting from a given location. |
int |
getPosition()
Returns the current position as an index into the list of tokens. |
Tokenizer |
getTokenizer()
Returns the currently used tokenizer. |
boolean |
hasNext()
Returns whether more elements can be retrieved using the next() method. |
int |
indexOf(AbstractHTMLToken searchToken,
ScrapeOptions options)
Searches in the current data for a token as provided. |
int |
indexOf(int startHere,
AbstractHTMLToken searchToken,
ScrapeOptions options)
Searches in the current data for a token as provided. |
java.lang.Object |
next()
Returns the next token in the current file. |
void |
printToFile(java.util.List<AbstractHTMLToken> tokens,
java.lang.String filename)
Convenience method for printing a list of tokens to a file. |
void |
printToFile(java.lang.String filename)
Convenience method for printing HTML-content to a file. |
void |
remove()
Removes the element currently pointed at from the file. |
void |
reset()
Sets the marker to the first token. |
int |
searchTokens(int startHere,
java.util.Vector<AbstractHTMLToken> searchElements,
ScrapeOptions options)
Like searchTokenChain(Vector |
int |
searchTokens(java.util.Vector<AbstractHTMLToken> searchElements,
ScrapeOptions options)
Searches in the current data for a sequence of tokens as provided. |
void |
setPosition(int current)
Set the current position to be used for subsequent searches. |
void |
setTokenizerAndParse(Tokenizer tokenizer)
Sets the tokenizer to be used for this scraping session. |
int |
size()
Returns the total number of elements in the current file. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public Scraper()
public Scraper(java.io.InputStream input)
input - An InputStream to be parsed.
Method Detail |
---|
public int getPosition()
public void setPosition(int current)
current - The new position.
public void reset()
public Tokenizer getTokenizer()
public void setTokenizerAndParse(Tokenizer tokenizer)
tokenizer - The tokenizer.
public java.lang.String getNextContent(java.lang.String tagName)
tagName - The tag name to be searched for.
public int indexOf(AbstractHTMLToken searchToken, ScrapeOptions options)
searchToken - The token to be searched for. Must not be null.
options - The options to be used for the search.
See Also: ScrapeOptions
public int indexOf(int startHere, AbstractHTMLToken searchToken, ScrapeOptions options)
startHere - The position from which the search should start.
searchToken - The token to be searched for. Must not be null.
options - The options to be used for the search.
See Also: ScrapeOptions
public int searchTokens(java.util.Vector<AbstractHTMLToken> searchElements, ScrapeOptions options)
searchElements - The sequence of elements to be searched for. Must not be null.
options - The options to be used for the search.
See Also: ScrapeOptions
public int searchTokens(int startHere, java.util.Vector<AbstractHTMLToken> searchElements, ScrapeOptions options)
startHere - The position from which the search should start.
searchElements - The sequence of elements to be searched for. Must not be null.
options - The options to be used for the search.
See Also: ScrapeOptions
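The idea behind searching for a sequence of tokens from a given start position can be illustrated without the library. The following is a minimal sketch, not the Scraper implementation: it uses plain strings instead of AbstractHTMLToken, ignores ScrapeOptions entirely, and assumes that "not found" is reported as -1 (the page above does not state the return convention). The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in, NOT part of de.dbsystems.simplescrape.
final class SequenceSearchSketch {

    // Returns the index of the first occurrence of `sequence` in `tokens`
    // at or after `startHere`, or -1 if there is none (assumed convention).
    static int indexOfSequence(int startHere, List<String> tokens, List<String> sequence) {
        if (sequence == null) {
            throw new IllegalArgumentException("sequence must not be null");
        }
        if (startHere < 0 || startHere > tokens.size()) {
            return -1;
        }
        // Search only the tail of the list, then translate back to an absolute index.
        List<String> tail = tokens.subList(startHere, tokens.size());
        int relative = Collections.indexOfSubList(tail, sequence);
        return relative < 0 ? -1 : startHere + relative;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("<td>", "Price", "</td>", "<td>", "42", "</td>");
        List<String> wanted = Arrays.asList("<td>", "42");
        System.out.println(indexOfSequence(0, tokens, wanted)); // prints 3
        System.out.println(indexOfSequence(4, tokens, wanted)); // prints -1
    }
}
```

Searching the tail and adding the offset back keeps the returned index comparable with positions used elsewhere, such as a value passed to setPosition(int).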
public boolean hasNext()
Specified by: hasNext in interface java.util.Iterator
public int size()
public int available()
public java.lang.Object next()
Specified by: next in interface java.util.Iterator
public void remove()
Specified by: remove in interface java.util.Iterator
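To make the cursor-style methods (getPosition, reset, size, available, hasNext, next, remove, advance) easier to picture, here is a small sketch of one possible reading of their summaries. It is a hypothetical stand-in over plain strings, not the Scraper source. In particular, it assumes that available() equals size() minus the current position, that advance(int) clamps to the valid range, and that remove() deletes the element at the current position, as the summary "Removes the element currently pointed at" suggests (this differs from the usual java.util.Iterator contract).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in, NOT part of de.dbsystems.simplescrape.
final class TokenCursorSketch implements Iterator<String> {
    private final List<String> tokens;
    private int position = 0;

    TokenCursorSketch(List<String> tokens) {
        // Copy so that remove() never touches the caller's list.
        this.tokens = new ArrayList<String>(tokens);
    }

    int size() { return tokens.size(); }

    // Assumption: elements remaining from the current position onward.
    int available() { return tokens.size() - position; }

    int getPosition() { return position; }

    void reset() { position = 0; }

    // Assumption: the position is clamped to the range [0, size()].
    void advance(int howFar) {
        position = Math.max(0, Math.min(tokens.size(), position + howFar));
    }

    @Override
    public boolean hasNext() { return position < tokens.size(); }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return tokens.get(position++);
    }

    // Assumption: removes the element at the current position.
    @Override
    public void remove() {
        if (!hasNext()) {
            throw new IllegalStateException("no element at the current position");
        }
        tokens.remove(position);
    }
}
```

A caller would typically loop with hasNext() and next(), and call reset() before starting a second pass over the same tokens.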
public void advance(int howFar)
howFar - How many elements should be skipped.
public TextToken getNextText(boolean skipEmpty)
skipEmpty - true: TextToken elements containing only whitespace and line breaks are skipped; false: the next TextToken is returned regardless of content.
public TextToken getNextText(int fromHere, boolean skipEmpty)
fromHere - Starting index for the search.
skipEmpty - true: TextToken elements containing only whitespace and line breaks are skipped; false: the next TextToken is returned regardless of content.
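The effect of the skipEmpty flag can be sketched in isolation. The example below is an illustration under stated assumptions, not the library's code: text tokens are modelled as plain strings, "empty" is taken to mean that String.trim() leaves nothing behind, and a result of -1 stands for "no further text token" (the page above does not say what the real method returns in that case). The class and method names are hypothetical.

```java
import java.util.List;

// Hypothetical stand-in, NOT part of de.dbsystems.simplescrape.
final class SkipEmptySketch {

    // Returns the index of the next text at or after `fromHere`.
    // With skipEmpty == true, texts made up only of whitespace and
    // line breaks are passed over. Returns -1 if nothing qualifies.
    static int nextTextIndex(List<String> texts, int fromHere, boolean skipEmpty) {
        for (int i = Math.max(fromHere, 0); i < texts.size(); i++) {
            String text = texts.get(i);
            if (!skipEmpty || !text.trim().isEmpty()) {
                return i;
            }
        }
        return -1;
    }
}
```

With skipEmpty set to false, the first candidate is returned even if it is only indentation between two tags, which is why the flag matters when scraping pretty-printed HTML.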
public HTMLTag getNextTag()
public HTMLTag getNextTag(int fromHere)
fromHere - Starting index for the search.
public AbstractHTMLToken get(int index)
index - The index of the element to be retrieved.
public java.util.List<AbstractHTMLToken> getForms()
public void printToFile(java.lang.String filename) throws java.io.IOException
filename - The name (and path) of the file to write to.
Throws: java.io.IOException
public void printToFile(java.util.List<AbstractHTMLToken> tokens, java.lang.String filename) throws java.io.IOException
filename - The name (and path) of the file to write to.
Throws: java.io.IOException