de.dbsystems.simplescrape
Class Scraper

java.lang.Object
  extended by de.dbsystems.simplescrape.Scraper
All Implemented Interfaces:
java.util.Iterator

public class Scraper
extends java.lang.Object
implements java.util.Iterator

Central class of this package. It supports compact descriptions of the things to find in a given web page: instead of checking in code whether "this and this is followed by that and that", one can provide expressions describing the tokens to look for. This class is not thread-safe.

Since:
04.04.2007
Author:
Ronald Bieber, DB Systems GmbH

Constructor Summary
Scraper()
          Empty constructor, does nothing.
Scraper(java.io.InputStream input)
          Convenience constructor.
 
Method Summary
 void advance(int howFar)
          Advance within the current file.
 int available()
          Returns how many more elements remain in the current file, counted from the current position.
 AbstractHTMLToken get(int index)
          Returns the element at the given index.
 java.util.List<AbstractHTMLToken> getForms()
          Returns all Elements in this document that are relevant to forms.
 java.lang.String getNextContent(java.lang.String tagName)
          Returns the text contained in the next tag of a specified kind.
 HTMLTag getNextTag()
          Returns the next HTMLTag in the current file.
 HTMLTag getNextTag(int fromHere)
          Returns the next HTMLTag in the current file starting from a given location.
 TextToken getNextText(boolean skipEmpty)
          Returns the next TextToken in the current file.
 TextToken getNextText(int fromHere, boolean skipEmpty)
          Returns the next TextToken in the current file starting from a given location.
 int getPosition()
          Returns the current position as an index into the list of tokens.
 Tokenizer getTokenizer()
          Returns the currently used tokenizer.
 boolean hasNext()
          Returns whether more elements can be retrieved using next().
 int indexOf(AbstractHTMLToken searchToken, ScrapeOptions options)
          Searches in the current data for a token as provided.
 int indexOf(int startHere, AbstractHTMLToken searchToken, ScrapeOptions options)
          Searches in the current data for a token as provided.
 java.lang.Object next()
          Returns the next HTML token in the current file.
 void printToFile(java.util.List<AbstractHTMLToken> tokens, java.lang.String filename)
          Convenience method for printing a list of tokens to a file.
 void printToFile(java.lang.String filename)
          Convenience method for printing the HTML content to a file.
 void remove()
          Removes the element currently pointed at from the file.
 void reset()
          Sets the marker to the first token.
 int searchTokens(int startHere, java.util.Vector<AbstractHTMLToken> searchElements, ScrapeOptions options)
          Like searchTokens(Vector, ScrapeOptions), but with a configurable starting point for the search.
 int searchTokens(java.util.Vector<AbstractHTMLToken> searchElements, ScrapeOptions options)
          Searches in the current data for a sequence of tokens as provided.
 void setPosition(int current)
          Sets the current position to be used for subsequent searches.
 void setTokenizerAndParse(Tokenizer tokenizer)
          Sets the tokenizer to be used by this scraper and parses its input.
 int size()
          Returns the total number of elements in the current file.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Scraper

public Scraper()
Empty constructor, does nothing.


Scraper

public Scraper(java.io.InputStream input)
Convenience constructor. Takes the input stream, wraps a Tokenizer around it and parses the stream completely. Afterwards, the input stream can be closed.

Parameters:
input - An InputStream to be parsed.
Method Detail

getPosition

public int getPosition()
Returns the current position as an index into the list of tokens.

Returns:
The current position.

setPosition

public void setPosition(int current)
Sets the current position to be used for subsequent searches.

Parameters:
current - The new position.

reset

public void reset()
Sets the marker to the first token. Subsequent searches start there (unless specified otherwise).


getTokenizer

public Tokenizer getTokenizer()
Returns the currently used tokenizer.

Returns:
The tokenizer.

setTokenizerAndParse

public void setTokenizerAndParse(Tokenizer tokenizer)
Sets the tokenizer to be used by this scraper. The complete HTML file is read and split into elements immediately, so the input stream can be closed afterwards.

Parameters:
tokenizer - The tokenizer.

getNextContent

public java.lang.String getNextContent(java.lang.String tagName)
Returns the text contained in the next tag of a specified kind. If a result is found, the current marker of the scraper is pushed forward and points to the closing tag of the found container. Typical uses search for a table cell, div container, paragraph or heading. Other tags that appear between the opening and closing tag are ignored, but the text they contain is concatenated into the result.

Parameters:
tagName - The tag name to be searched for.
Returns:
The text, or null if no matching tag could be found.
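
The behaviour described for getNextContent can be pictured with a small stand-alone model. This is only a sketch: the Tok class and the nextContent helper are hypothetical stand-ins and not part of this package, and the real Scraper may differ in details such as case handling or nested containers of the same kind.

```java
import java.util.Arrays;
import java.util.List;

class NextContentSketch {
    /** Hypothetical token: a tag such as "td" or "/td", or a piece of text. */
    static final class Tok {
        final String tag;   // tag name, null for text tokens
        final String text;  // text content, null for tag tokens
        private Tok(String tag, String text) { this.tag = tag; this.text = text; }
        static Tok tag(String name) { return new Tok(name, null); }
        static Tok text(String content) { return new Tok(null, content); }
    }

    final List<Tok> tokens;
    int position = 0; // stand-in for the scraper's current marker

    NextContentSketch(List<Tok> tokens) { this.tokens = tokens; }

    /** Collects all text between the next opening and closing tag of the given kind. */
    String nextContent(String tagName) {
        String closing = "/" + tagName;
        for (int open = position; open < tokens.size(); open++) {
            if (!tagName.equals(tokens.get(open).tag)) {
                continue; // not the opening tag we are looking for
            }
            StringBuilder content = new StringBuilder();
            for (int i = open + 1; i < tokens.size(); i++) {
                Tok t = tokens.get(i);
                if (closing.equals(t.tag)) {
                    position = i; // marker ends up on the closing tag
                    return content.toString();
                }
                if (t.text != null) {
                    content.append(t.text); // nested tags are ignored, their text is kept
                }
            }
            return null; // assumption: an unclosed container counts as "not found"
        }
        return null; // no opening tag found, marker unchanged
    }

    public static void main(String[] args) {
        NextContentSketch s = new NextContentSketch(Arrays.asList(
                Tok.tag("td"), Tok.text("Hello "), Tok.tag("b"),
                Tok.text("world"), Tok.tag("/b"), Tok.tag("/td")));
        System.out.println(s.nextContent("td")); // text of the cell, nested <b> ignored
        System.out.println(s.position);          // index of the closing tag
        System.out.println(s.nextContent("h1")); // no heading in the input
    }
}
```

Callers should check the result for null before using it, since the marker only moves when a container was found.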

indexOf

public int indexOf(AbstractHTMLToken searchToken,
                   ScrapeOptions options)
Searches in the current data for a token as provided. The search starts at the current position as returned by getPosition(). Regardless of options.advance, the current marker is only pushed forward if the requested token was found.

Parameters:
searchToken - The token to be searched for. Must not be null.
options - The options to be used for the search.
Returns:
An index pointing to the found token, or -1 if the token could not be found.
See Also:
ScrapeOptions

indexOf

public int indexOf(int startHere,
                   AbstractHTMLToken searchToken,
                   ScrapeOptions options)
Searches in the current data for a token as provided. The search starts at startHere rather than at the current position. Regardless of options.advance, the current marker is only pushed forward if the requested token was found.

Parameters:
startHere - The position from which the search should start.
searchToken - The token to be searched for. Must not be null.
options - The options to be used for the search.
Returns:
An index pointing to the found token, or -1 if the token could not be found.
See Also:
ScrapeOptions

searchTokens

public int searchTokens(java.util.Vector<AbstractHTMLToken> searchElements,
                        ScrapeOptions options)
Searches in the current data for a sequence of tokens as provided. The search starts at the current position as returned by getPosition(). Regardless of options.advance, the current marker is only pushed forward if the requested pattern was found.

Parameters:
searchElements - The sequence of elements to be searched for. Must not be null.
options - The options to be used for the search.
Returns:
An index pointing to the first element after the found sequence of elements, or -1 if the sequence could not be found.
See Also:
ScrapeOptions
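
The return-value convention of searchTokens (index of the first element after the match, or -1) can be illustrated with a stand-alone sketch. Everything in it is an assumption made for illustration: tokens are plain strings, the match is required to be contiguous, the boolean parameter stands in for options.advance, and the marker is assumed to move to the returned index. The real ScrapeOptions may offer matching modes that this model does not cover.

```java
import java.util.Arrays;
import java.util.List;

class SearchTokensSketch {
    final List<String> tokens;
    int position = 0; // stand-in for the scraper's current marker

    SearchTokensSketch(List<String> tokens) { this.tokens = tokens; }

    /**
     * Looks for a contiguous run of tokens equal to the pattern, starting at the
     * current position. Returns the index just behind the match, or -1.
     */
    int searchTokens(List<String> pattern, boolean advance) {
        int len = pattern.size();
        for (int start = position; start + len <= tokens.size(); start++) {
            if (tokens.subList(start, start + len).equals(pattern)) {
                int after = start + len;
                if (advance) {
                    position = after; // only moved on success
                }
                return after;
            }
        }
        return -1; // not found: marker stays where it was
    }

    public static void main(String[] args) {
        SearchTokensSketch s = new SearchTokensSketch(Arrays.asList(
                "<tr>", "<td>", "Price", "</td>", "<td>", "42", "</td>", "</tr>"));
        int after = s.searchTokens(Arrays.asList("<td>", "Price", "</td>"), true);
        System.out.println(after);                 // index of the second "<td>"
        System.out.println(s.tokens.get(after + 1)); // the value cell's text
        System.out.println(s.searchTokens(Arrays.asList("<table>"), true)); // no match
    }
}
```

Because the returned index already points behind the matched label, the value that follows it can be read directly from there.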

searchTokens

public int searchTokens(int startHere,
                        java.util.Vector<AbstractHTMLToken> searchElements,
                        ScrapeOptions options)
Like searchTokens(Vector, ScrapeOptions), but with a configurable starting point for the search.

Parameters:
startHere - The position from which the search should start.
searchElements - The sequence of elements to be searched for. Must not be null.
options - The options to be used for the search.
Returns:
The index pointing to the first element after the found sequence of elements, or -1 if the sequence could not be found.
See Also:
ScrapeOptions

hasNext

public boolean hasNext()
Returns whether more elements can be retrieved using next(). Warning: other methods such as getNextTag() or getNextText() may still find nothing and return null even though this method returns true, because the remaining elements need not include a tag or a text token.

Specified by:
hasNext in interface java.util.Iterator
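
The warning above suggests looping on the null result of the typed getters rather than on hasNext(). The following stand-alone sketch shows why; it is a hypothetical model in which tokens are plain strings and anything starting with "<" counts as a tag, not the Scraper implementation itself.

```java
import java.util.Arrays;
import java.util.List;

class HasNextCaveatSketch {
    final List<String> tokens;
    int position = 0; // stand-in for the scraper's current marker

    HasNextCaveatSketch(List<String> tokens) { this.tokens = tokens; }

    /** True as long as any token, of whatever kind, is left. */
    boolean hasNext() { return position < tokens.size(); }

    /** Returns the next tag and moves behind it, or returns null if no tag is left. */
    String nextTag() {
        for (int i = position; i < tokens.size(); i++) {
            if (tokens.get(i).startsWith("<")) {
                position = i + 1;
                return tokens.get(i);
            }
        }
        return null; // assumption: the marker is left unchanged on failure
    }

    /** Counts tags by looping on the null result instead of on hasNext(). */
    static int countTags(List<String> tokens) {
        HasNextCaveatSketch s = new HasNextCaveatSketch(tokens);
        int count = 0;
        while (s.nextTag() != null) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("<p>", "text", "</p>", "trailing text");
        HasNextCaveatSketch s = new HasNextCaveatSketch(tokens);
        s.nextTag();
        s.nextTag();
        // One text token is left, so hasNext() is true, yet there is no further tag.
        System.out.println(s.hasNext() + " " + s.nextTag());
        System.out.println(countTags(tokens));
    }
}
```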

size

public int size()
Returns the total number of elements in the current file.


available

public int available()
Returns how many more elements remain in the current file, counted from the current position.


next

public java.lang.Object next()
Returns the next HTML token in the current file. This can be any kind of token, including TextToken elements that contain only whitespace. Calling this method advances the current marker by one.

Specified by:
next in interface java.util.Iterator
Returns:
The next HTML token.

remove

public void remove()
Removes the element currently pointed at from the file.

Specified by:
remove in interface java.util.Iterator

advance

public void advance(int howFar)
Advances the current position within the file. The position may move beyond the last element in the file.

Parameters:
howFar - How many elements should be skipped.

getNextText

public TextToken getNextText(boolean skipEmpty)
Returns the next TextToken in the current file. If the search succeeds, the current position advances to just behind the found element.

Parameters:
skipEmpty - If true, TextToken elements containing only whitespace and line breaks are skipped; if false, the next TextToken is returned regardless of its content.
Returns:
The next TextToken, or null if none could be found.

getNextText

public TextToken getNextText(int fromHere,
                             boolean skipEmpty)
Returns the next TextToken in the current file starting from a given location. The current position in the file does not change.

Parameters:
fromHere - Starting index for the search.
skipEmpty - If true, TextToken elements containing only whitespace and line breaks are skipped; if false, the next TextToken is returned regardless of its content.
Returns:
The next TextToken, or null if none could be found.
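
The effect of skipEmpty can be sketched with a stand-alone model. The helper below is hypothetical: it treats tokens as plain strings, regards anything starting with "<" as a tag, and returns the text itself instead of a TextToken. It mirrors the position-preserving variant, so no marker is involved.

```java
import java.util.Arrays;
import java.util.List;

class NextTextSketch {
    /**
     * Returns the first text token at or after fromHere, or null if there is none.
     * With skipEmpty set, whitespace-only text tokens are passed over.
     */
    static String nextText(List<String> tokens, int fromHere, boolean skipEmpty) {
        for (int i = Math.max(fromHere, 0); i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.startsWith("<")) {
                continue; // a tag, not text
            }
            if (skipEmpty && t.trim().isEmpty()) {
                continue; // whitespace and line breaks only
            }
            return t;
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("<td>", "  \n", "</td>", "<td>", "42", "</td>");
        System.out.println(nextText(tokens, 0, true));          // first non-blank text
        System.out.println(nextText(tokens, 0, false).length()); // the blank token is returned
        System.out.println(nextText(tokens, 5, true));          // nothing left to find
    }
}
```

Pages with indented markup usually contain many whitespace-only text tokens between tags, which is the case skipEmpty is meant to hide.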

getNextTag

public HTMLTag getNextTag()
Returns the next HTMLTag in the current file. The current position advances to just behind that tag.

Returns:
The next HTMLTag, or null if none could be found.

getNextTag

public HTMLTag getNextTag(int fromHere)
Returns the next HTMLTag in the current file starting from a given location. The current position in the file does not change.

Parameters:
fromHere - Starting index for the search.
Returns:
The next HTMLTag, or null if none could be found.

get

public AbstractHTMLToken get(int index)
Returns the element at the given index.

Parameters:
index - The index of the element to be retrieved.
Returns:
The requested element, or null if index is out of range.

getForms

public java.util.List<AbstractHTMLToken> getForms()
Returns all elements in this document that are relevant to forms. This includes the form tags and all input, select and option tags. The position in the file does not change.

Returns:
The elements for all forms in the document.

printToFile

public void printToFile(java.lang.String filename)
                 throws java.io.IOException
Convenience method for printing the HTML content to a file.

Parameters:
filename - The name (and path) of the file to write to.
Throws:
java.io.IOException

printToFile

public void printToFile(java.util.List<AbstractHTMLToken> tokens,
                        java.lang.String filename)
                 throws java.io.IOException
Convenience method for printing a list of tokens to a file.

Parameters:
tokens - The list of tokens to be printed.
filename - The name (and path) of the file to write to.
Throws:
java.io.IOException