Package de.dbsystems.simplescrape

The web-scraping package enables quick programmatic extraction of information from HTML pages.


Class Summary
AbstractHTMLToken Common superclass for all tokens that can be found in an HTML file.
HTMLComment Class for holding HTML comments.
HTMLTag Represents tags in HTML files.
HTMLTagAttributes Defines a class for parsing and storing the attributes of an HTML tag.
HTTPHelper Helper methods for acquiring the contents of a webpage; useful in simple cases and for testing.
RegExTextToken A TextToken whose content is treated as a regular expression.
ScrapeOptions Preferences for scraping a file.
Scraper Central class for this package.
TextToken Represents tokens containing text data in an HTML file.
Tokenizer Splits an input stream into HTML tokens.
XMLHelper Class for holding HTML-comments.
 

Package de.dbsystems.simplescrape Description

The web-scraping package enables quick programmatic extraction of information from HTML pages.

Package Specification

Current State

The package is currently a usable alpha version: it is not yet feature-complete, but it can already be used (at your own risk, of course).

Typical Usage

Usage examples can be found in the JUnit test cases under /test/.../

Simple-Scrape is expected to be used programmatically, as follows:

  1. Acquire the content of a webpage.
    This can be done in any way the programmer sees fit. In simple cases and for testing, the methods in HTTPHelper can be used to acquire the contents.
  2. Feed the InputStream to the Scraper.
    The simplest way to do this is to instantiate a new Scraper with
    new Scraper()
    The InputStream is read completely and can be closed afterwards if necessary.
  3. Search and retrieve content.
    Use the scraper's indexOf and searchTokens methods to look for specific parts of the file. Several methods exist for retrieving tokens.
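
Step 1 can be sketched with standard Java classes alone. The class AcquireContentSketch and its methods openUrl, openInMemory and readFully are hypothetical names introduced for this illustration; they are not part of this package and do not use HTTPHelper. Because this page does not show the arguments of the Scraper constructor or the signatures of indexOf and searchTokens, the hand-over in step 2 appears only as a comment, and readFully merely stands in for a consumer that reads the stream to its end.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of de.dbsystems.simplescrape.
final class AcquireContentSketch {

    /** Step 1, variant A: open a stream on a live page (not called in main, to avoid network access). */
    static InputStream openUrl(String address) {
        try {
            return URI.create(address).toURL().openStream();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Step 1, variant B: wrap HTML that is already in memory, which is handy in tests. */
    static InputStream openInMemory(String html) {
        return new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
    }

    /** Stand-in for a consumer that reads the stream completely and decodes it as UTF-8. */
    static String readFully(InputStream in) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // try-with-resources closes the stream once it has been consumed.
        try (InputStream in = openInMemory("<html><body><p>Hello</p></body></html>")) {
            // Step 2 would hand `in` to the Scraper here; consult the Scraper
            // class documentation for the exact constructor arguments.
            System.out.println(readFully(in));
        }
    }
}
```

Closing the stream in a try-with-resources block matches the note in step 2 that the InputStream can be closed once it has been read completely.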

Requirements

This project was developed using Eclipse 3.2, and the files .project and .classpath reflect that origin.

Suggestions for enhancements