How many times have you extracted data from another website to include in your own site? I guess many times. In this Web 2.0 era we all mix different types of content to make our own. Sometimes you grab data from youtube.com, sometimes from amazon.com, and then you create a mashup. All of this needs a little data parsing knowledge, and you also need to interact with an HTTP resource. Many PHP coders do this with extensions such as curl and DOMDocument. The job is quite tedious. To solve this problem I wrote some classes a long time ago, and I have now put them on http://github.com/shiplu. The classes I am talking about can be found at http://github.com/shiplu/dxtool.
“dxtool” stands for data extraction tools. It's very easy to use. Here is the README file from GitHub.
Requirements
- php5
- php5-curl extension
- php5-json extension (already included with php5)
Features
- Extract data from any HTTP resource
- Use simple regular expressions to extract data
- Hassle-free HTTP transactions
- Supports cookies (via curl)
- Can cache HTTP responses
Here is an example of how to use it.
<?php
require 'DataExtractor.php';
require 'WebGet.php';

$google_new_feed = 'http://news.google.com/news?pz=1&cf=all&ned=in&hl=en&output=rss';

$w = new WebGet();
$content = $w->requestContent($google_new_feed);

$dx = new DataExtractor($content);
$dx->titles = '|title>([^<]+)</title|a';
$dx->rsstitle = '|title>([^<]+)</title|';

$data = $dx->extractArray();
print_r($data);
?>
If you run it you'll see output like this. Your output will differ, because the Google News RSS feed changes over time.
Array
(
    [titles] => Array
        (
            [0] => Top Stories - Google News
            [1] => Top Stories - Google News
            [2] => Draft of Lokpal Bill discussed informally at cabinet meet - Hindustan Times
            [3] => cabinet clears Food Security Bill - Hindustan Times
            [4] => Obama hails Havel's 'moral leadership' and 'dignity' - AFP
            [5] => Philippines struggles to cope after storm leaves 650 dead - Telegraph.co.uk
            [6] => Civil nuclear liability rules balanced, India to Russia - Hindustan Times
            [7] => 'Include Muslims, expand OBC quota' - Hindustan Times
            [8] => Sadhbhavana's success answer to Gujarat detractors: Narendra Modi - Daily News & Analysis
            [9] => Female protestor's beating sparks Egypt outrage - Telegraph.co.uk
            [10] => Romney says US withdrawal from Iraq 'precipitous' - AFP
            [11] => Come clean on Chidambaram's alleged favours to ex-client: BJP to Centre - The Hindu
        )

    [rsstitle] => Top Stories - Google News
)
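To see what the two rules in the example are doing, here is a minimal sketch that uses only PHP's built-in preg functions. It is not dxtool's actual implementation. Judging from the output above, the trailing "a" on the titles rule returns every match, while the rule without it returns only the first one. The $content string is a made-up stand-in for a fetched feed, not real Google News data.

```php
<?php
// Hypothetical stand-in for a fetched RSS document (not real feed data).
$content = '<rss><channel><title>Example Feed</title>'
         . '<item><title>First story</title></item>'
         . '<item><title>Second story</title></item>'
         . '</channel></rss>';

// Same pattern as in the dxtool example, with "|" as the regex delimiter.
$pattern = '|title>([^<]+)</title|';

// Every match: roughly what the "titles" rule (with the trailing "a") gives.
preg_match_all($pattern, $content, $all);
$titles = $all[1]; // captures of the first parenthesised group

// First match only: roughly what the "rsstitle" rule gives.
preg_match($pattern, $content, $first);
$rsstitle = isset($first[1]) ? $first[1] : null;

print_r(array('titles' => $titles, 'rsstitle' => $rsstitle));
```

As in the real output, the feed's own title shows up both as the first entry of titles and as rsstitle, because the same pattern is used for both rules.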
Hi,
Good Post! Very informative, glad that you are going to continue writing things like this!
Hi,
I read your post. I really appreciate your experience, and I will gain good knowledge from it as well.
informative, thanks. Another data extraction video http://goo.gl/PJScH6
How can I extract the same data if user authentication is involved?
On my local server I have PHP 5.6 with curl enabled.
I need to retrieve XML data from an external website that requires authentication.
$header = array('Content-Type: application/xml', 'Accept: application/xml');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'url');
curl_setopt($ch, CURLOPT_USERPWD, "username:password");
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
$result = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$error = curl_error($ch);
curl_close($ch);
But the code above did not fetch any data. It returns 302 every time.
I tried adding
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $max > 0);
curl_setopt($ch, CURLOPT_MAXREDIRS, $max);
where $max is something like 10. Then it returns an access denied error.
But when I open the same URL directly in a browser and enter the username and password, it shows me the result.
I have been completely lost trying to figure this out for two days.
Any help would be highly appreciated.
Thanks in advance.
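A 302 that turns into "access denied" once redirects are followed may mean the credentials or session cookies are not surviving the redirect. This is only a guess, since the thread does not show the server's behaviour. By default cURL sends the username and password only to the original host, and it keeps no cookies unless told to. The sketch below sets the options that cover both cases. The function names, the placeholder URL and the cookie file path are hypothetical.

```php
<?php
// Hypothetical helper: builds cURL options for an authenticated GET
// that should survive redirects.
function auth_get_options($url, $user, $pass, $cookieFile)
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER     => array('Accept: application/xml'),
        // Let cURL pick whichever auth scheme the server asks for.
        CURLOPT_HTTPAUTH       => CURLAUTH_ANY,
        CURLOPT_USERPWD        => $user . ':' . $pass,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 10,
        // Keep sending credentials after a redirect to another host.
        // Only use this when the redirect targets are trusted.
        CURLOPT_UNRESTRICTED_AUTH => true,
        // Store cookies set along the redirect chain and send them back.
        CURLOPT_COOKIEJAR      => $cookieFile,
        CURLOPT_COOKIEFILE     => $cookieFile,
    );
}

// Hypothetical wrapper: performs the request and reports where it ended up.
function fetch_with_auth($url, $user, $pass, $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, auth_get_options($url, $user, $pass, $cookieFile));
    $body = curl_exec($ch);
    $result = array(
        'body'      => $body,
        'status'    => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'final_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        'error'     => curl_error($ch),
    );
    curl_close($ch);
    return $result;
}

// Example call with placeholder values:
// print_r(fetch_with_auth('url', 'username', 'password', '/tmp/cookies.txt'));
```

If final_url turns out to be an HTML login page, the site probably uses a form-based login rather than HTTP authentication. In that case the login form would have to be submitted first, and these options alone would not be enough.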