How many times have you extracted data from another website to include in your own site? I guess many times. In this Web 2.0 era we all mix different types of content to make our own. Sometimes you grab data from youtube.com, sometimes from amazon.com, and then you create a mashup. All of this needs a little data-parsing knowledge, and you also need to interact with an HTTP resource. Many PHP coders do this with extensions such as curl and DOMDocument, which is quite tedious. To solve this problem I wrote some classes a long time ago, and I have now put them on http://github.com/shiplu. The classes I am talking about can be found at http://github.com/shiplu/dxtool.
“dxtool” stands for data extraction tools. It's very easy to use. Here is the README file from GitHub.
Requirements
- php5
- php5-curl extension
- php5-json extension (already included with php5)
Features
- Extract data from any HTTP resource
- Use simple regular expressions to extract data
- Hassle-free HTTP transactions
- Supports cookies (via curl)
- Can cache HTTP responses
Here is an example of how to use it.
<?php
require 'DataExtractor.php';
require 'WebGet.php';
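$google_new_feed = 'http://news.google.com/news?pz=1&cf=all&ned=in&hl=en&output=rss';
// Fetch the feed over HTTP
$w = new WebGet();
$content = $w->requestContent($google_new_feed);
// Each property assigned on the extractor is a named regular expression
$dx = new DataExtractor($content);
// The trailing "a" appears to request all matches (see the array in the output below)
$dx->titles = '|title>([^<]+)</title|a';
// Without it, only the first match seems to be returned
$dx->rsstitle = '|title>([^<]+)</title|';
$data = $dx->extractArray();
print_r($data);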
?>
If you run it, you'll see output like the following. Your output will differ, because the Google News RSS feed changes over time.
Array
(
[titles] => Array
(
[0] => Top Stories - Google News
[1] => Top Stories - Google News
[2] => Draft of Lokpal Bill discussed informally at cabinet meet - Hindustan Times
[3] => cabinet clears Food Security Bill - Hindustan Times
[4] => Obama hails Havel's 'moral leadership' and 'dignity' - AFP
[5] => Philippines struggles to cope after storm leaves 650 dead - Telegraph.co.uk
[6] => Civil nuclear liability rules balanced, India to Russia - Hindustan Times
[7] => 'Include Muslims, expand OBC quota' - Hindustan Times
[8] => Sadhbhavana's success answer to Gujarat detractors: Narendra Modi - Daily News & Analysis
[9] => Female protestor's beating sparks Egypt outrage - Telegraph.co.uk
[10] => Romney says US withdrawal from Iraq 'precipitous' - AFP
[11] => Come clean on Chidambaram's alleged favours to ex-client: BJP to Centre - The Hindu
)
[rsstitle] => Top Stories - Google News
)
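For comparison, here is a rough sketch of the same kind of extraction done with PHP's built-in preg_match() and preg_match_all(), without dxtool. It is only an illustration and does not show how dxtool is implemented internally. The $content string is a made-up sample that stands in for the downloaded feed, so the snippet runs without a network connection.

```php
<?php
// Hypothetical sample standing in for the fetched feed (not real Google News data)
$content = '<rss><channel><title>Sample Feed</title>'
    . '<item><title>First story</title></item>'
    . '<item><title>Second story</title></item>'
    . '</channel></rss>';

// Same pattern as in the dxtool example, using | as the delimiter
$pattern = '|title>([^<]+)</title|';

// Collect every match: comparable to the "titles" key above
preg_match_all($pattern, $content, $all);
$titles = $all[1];

// Keep only the first match: comparable to the "rsstitle" key above
$rsstitle = preg_match($pattern, $content, $first) ? $first[1] : null;

print_r(array('titles' => $titles, 'rsstitle' => $rsstitle));
```

Doing it by hand means keeping track of the match arrays and capture-group indexes yourself, on top of the HTTP request, which is the bookkeeping the two classes are meant to hide.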