de.dbsystems.simplescrape

Package de.dbsystems.simplescrape

The web-scraping package enables quick programmatic extraction of information from HTML pages.

Class Summary
AbstractHTMLToken	Common superclass for all tokens that can be found in an HTML-file.
HTMLComment	Class for holding HTML-comments.
HTMLTag	Represents tags in HTML-files.
HTMLTagAttributes	Defines a class for parsing and storing the attributes of an HTML tag.
HTTPHelper	Helper methods for acquiring the contents of a web page (useful in simple cases and for testing).
RegExTextToken	A TextToken whose content is treated as a regular expression.
ScrapeOptions	Preferences for scraping a file.
Scraper	Central class for this package.
TextToken	Represents tokens containing text data in an HTML-file.
Tokenizer	Splits an input stream into HTML tokens.
XMLHelper	Class for holding HTML-comments.
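
To make the token model behind these classes more concrete, here is a minimal, self-contained sketch of splitting HTML into comment, tag and text tokens. It is not the package's Tokenizer and does not use its API; the names TokenSketch, Kind and Token are hypothetical, and the splitting rules are deliberately naive (no handling of scripts, CDATA or '>' inside attribute values).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of de.dbsystems.simplescrape.
class TokenSketch {

    enum Kind { COMMENT, TAG, TEXT }

    static final class Token {
        final Kind kind;
        final String content;

        Token(Kind kind, String content) {
            this.kind = kind;
            this.content = content;
        }

        @Override
        public String toString() {
            return kind + "[" + content + "]";
        }
    }

    /** Splits the input into consecutive comment, tag and text tokens. */
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<Token>();
        int pos = 0;
        while (pos < html.length()) {
            Kind kind;
            int end;
            if (html.startsWith("<!--", pos)) {
                // A comment runs up to and including the closing "-->".
                int close = html.indexOf("-->", pos + 4);
                end = (close < 0) ? html.length() : close + 3;
                kind = Kind.COMMENT;
            } else if (html.charAt(pos) == '<') {
                // A tag runs up to and including the next '>'.
                int close = html.indexOf('>', pos + 1);
                end = (close < 0) ? html.length() : close + 1;
                kind = Kind.TAG;
            } else {
                // Text runs up to the next '<' or the end of the input.
                int next = html.indexOf('<', pos);
                end = (next < 0) ? html.length() : next;
                kind = Kind.TEXT;
            }
            tokens.add(new Token(kind, html.substring(pos, end)));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token token : tokenize("<p>Hi<!-- note --></p>")) {
            System.out.println(token);
        }
    }
}
```

The three kinds loosely correspond to HTMLComment, HTMLTag and TextToken in the table above; how the real classes divide the work is not specified here.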

Package de.dbsystems.simplescrape Description

The web-scraping package enables quick programmatic extraction of information from HTML pages.

Package Specification

Current State

Simple-Scrape is currently a usable alpha version: it is not yet feature-complete, but it can already be used (at your own risk, of course).

Typical Usage

Usage examples can be found in the JUnit test cases under /test/.../

Simple-Scrape is expected to be used programmatically, in three steps:

1. Acquire the content of a web page.
   This can be done in any way the programmer sees fit. In simple cases and for testing, the methods in HTTPHelper can be used to acquire the content.
2. Feed the InputStream to the Scraper.
   The simplest way to do this is to instantiate a new Scraper with
   new Scraper()
   The InputStream is read completely and can be closed afterwards if necessary.
3. Search and retrieve content.
   Use the Scraper's indexOf and searchTokens methods to look for specific parts of the file. Several methods exist for retrieving tokens.
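
The first two steps above can be sketched with the standard library alone. The example below shows two ways of obtaining an InputStream (from a URL and from an in-memory string for tests) and what "read completely, then close" looks like for the caller. The class PageSource and its methods are hypothetical helpers, not part of Simple-Scrape, and the point where the stream would be handed to the Scraper is only marked in a comment because the constructor arguments are not documented here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Simple-Scrape.
class PageSource {

    /** Opens a plain stream on a web page (no timeouts or redirects handled). */
    static InputStream openPage(String address) throws IOException {
        return URI.create(address).toURL().openStream();
    }

    /** Wraps an in-memory HTML string as an InputStream, handy for tests. */
    static InputStream fromString(String html) {
        return new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
    }

    /** Reads the stream to its end, as a consumer of the whole page would. */
    static String readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = fromString("<html><body><p>Hello</p></body></html>");
        try {
            // In real use, the stream would be passed to the Scraper here
            // instead of being read by this stand-in method.
            String html = readFully(in);
            System.out.println(html.length() + " characters read");
        } finally {
            // The caller still owns the stream and closes it afterwards.
            in.close();
        }
    }
}
```

Using an in-memory stream keeps test cases independent of the network, while openPage covers the simple live case.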

Requirements

Java 1.5 or higher
Log4J (tested with version 1.2.8)
    Log4J can be obtained from http://logging.apache.org/log4j/docs/
For unit testing: JUnit 4
    JUnit 4 is already included if you have Eclipse 3.2 or higher

This project was developed using Eclipse 3.2 and the files .project and .classpath reflect that origin.

Suggestions for enhancements

More support for using forms
More search capabilities:
    Different search options per search element
    Support for XPath(-like) expressions
Graphical developer support: point-and-click construction of complex dialogs across multiple pages
(Semi-)automatic support for creating web services (REST and SOAP) for scraped results (in and out support!)
Make it thread-safe
Cookie support