Simple-Scrape

Introduction

Simple-Scrape is a compact web-scraping library that gives Java programs programmatic access to HTML code. It requires no additional techniques beyond Java itself, which keeps it easy to use.

What is Web-Scraping?

Web scraping means loading HTML data from a website (potentially not your own), extracting specific data from it, and then using that data for your own purposes.

Please note that just because a web page is available on the internet, it is not necessarily OK to scrape any content you like. Some sites specifically deny you the right to do so, and others will simply not appreciate someone else taking advantage of their services and effort. It is therefore good practice to ask for permission before setting up this kind of automated data extraction.

The two main problems with web scraping are that a) HTML tends to be rather unstructured and syntactically unsound, and b) the HTML code of someone else's website may change at any time without prior notice, breaking your scraping code. Using a framework doesn't eliminate these problems, but it makes scraping more stable and easier to adapt when needed.
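To make the extraction step and problem b) concrete, here is a minimal sketch that uses only the JDK. It deliberately does not use the Simple-Scrape API; the class name PriceScrapeSketch, the method extractPrice and the sample markup are all made up for illustration. The HTML is held in a string so that the example runs without network access.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriceScrapeSketch {

    // Hypothetical pattern: matches <span class="price">...</span>
    // and captures the trimmed text between the tags.
    private static final Pattern PRICE = Pattern.compile(
            "<span\\s+class=\"price\">\\s*([^<]+?)\\s*</span>",
            Pattern.CASE_INSENSITIVE);

    // Returns the first price found in the given HTML, or an empty Optional.
    public static Optional<String> extractPrice(String html) {
        Matcher m = PRICE.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Note the unclosed <p>: real-world HTML is often syntactically unsound.
        String original =
                "<html><body><p>Ticket: <span class=\"price\">19.90 EUR</span></body></html>";
        // The same page after an assumed redesign by the site owner.
        String redesigned =
                "<html><body><p>Ticket: <div class=\"cost\">19.90 EUR</div></body></html>";

        System.out.println(extractPrice(original));   // the price is found
        System.out.println(extractPrice(redesigned)); // nothing is found any more
    }
}
```

The second call returns an empty result although the price is still visible on the page: the markup changed, and the extraction rule no longer applies. Keeping such rules in one place, as in the PRICE constant here, is one way to make them easier to adapt.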

Other Frameworks

Simple-Scrape is not the only framework that can be used for web scraping. Depending on your environment and your needs, another framework may suit you better. There are frameworks ...

for other programming languages.
I don't know much about those, as I prefer Java.
commercial frameworks.
If you have the money, these may be a good option.
bigger Java frameworks.
Other frameworks (for example WebHarvest, HTML Parser and Switchboard) tend to offer wider functionality, but also require you to be familiar with other techniques such as XML, XSLT, XPath and the like. Even though this can be considered a much cleaner approach than the one taken by Simple-Scrape, I believe there is enough room for a straightforward, easy-to-use framework like Simple-Scrape.

Documentation

Right now, the documentation is contained entirely in the generated JavaDoc. Take a good look at the package comment and the methods of de.dbsystems.simplescrape.Scraper.

Licence

Simple-Scrape has been released under the LGPL (GNU Lesser General Public License).

Contact

Simple-Scrape was developed by DB Systel GmbH, the ICT service provider of Deutsche Bahn. If you want to contact the author, you can do so at the address ronald.bieber (-at-) bahn.de.