2014-12-23 21:28:26 -05:00
Web scraper
===========
The web scraper is useful for feeds that display only a summary of each article: it can download and parse the full content from the original website.
How does the content grabber work?
----------------------------------
1. Try the XPath rules defined for the domain name first (see `PicoFeed\Rules\`)
2. Try to find the text content by using common attributes for class and id
3. Finally, if nothing is found, the feed content is displayed

**The best results are obtained with an XPath rules file.**
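
The three steps above can be sketched with PHP's DOM extension. This is only an illustration of the fallback order, not PicoFeed's actual implementation: the function `extractContent()` and the list of generic class and id candidates are assumptions made for this example.

```php
<?php

/**
 * Illustrative fallback chain (hypothetical helper, not part of PicoFeed).
 */
function extractContent(string $html, array $bodyRules, string $feedContent): string
{
    libxml_use_internal_errors(true); // tolerate malformed real-world HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // Step 1: XPath rules written for the domain come first.
    // Step 2: generic guesses based on common class and id values (assumed list).
    $candidates = array_merge($bodyRules, array(
        '//*[@id="content"]',
        '//*[contains(@class, "article")]',
        '//*[contains(@class, "post")]',
    ));

    foreach ($candidates as $query) {
        $nodes = $xpath->query($query);

        if ($nodes !== false && $nodes->length > 0) {
            return $dom->saveHTML($nodes->item(0));
        }
    }

    // Step 3: nothing found, keep the content provided by the feed
    return $feedContent;
}

$html = '<html><body><div id="menu">Menu</div>'
      . '<div class="story-body"><p>Full text</p></div></body></html>';

// A matching rule returns the full article markup
echo extractContent($html, array('//div[@class="story-body"]'), 'Summary'), PHP_EOL;

// No rule and no generic candidate matches: the feed summary is kept
echo extractContent($html, array('//div[@class="missing"]'), 'Summary'), PHP_EOL;
```

Putting the domain rules ahead of the generic candidates reflects why a dedicated rules file gives the best results: the generic guesses are only reached when no rule matches.
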
Standalone usage
----------------
2015-04-28 18:08:42 +02:00
Fetch remote content:
```php
<?php

use PicoFeed\Config\Config;
use PicoFeed\Scraper\Scraper;

$config = new Config;

$grabber = new Scraper($config);
$grabber->setUrl($url);
$grabber->execute();

// Get raw HTML content
echo $grabber->getRawContent();

// Get relevant content
echo $grabber->getRelevantContent();

// Get filtered relevant content
echo $grabber->getFilteredContent();

// Return true if there is relevant content
var_dump($grabber->hasRelevantContent());
```
Parse HTML content:
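```php
<?php
use PicoFeed\Config\Config;
use PicoFeed\Scraper\Scraper;
$config = new Config;
// $html contains markup that has already been downloaded
$grabber = new Scraper($config);
$grabber->setRawContent($html);
$grabber->execute();
```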
Fetch full item contents during feed parsing
--------------------------------------------
Before parsing the items, call the method `$parser->enableContentGrabber()`:
```php
<?php

use PicoFeed\Reader\Reader;
use PicoFeed\PicoFeedException;

try {
    $reader = new Reader;

    // Return a resource
    $resource = $reader->download('http://www.egscomics.com/rss.php');

    // Return the right parser instance according to the feed format
    $parser = $reader->getParser(
        $resource->getUrl(),
        $resource->getContent(),
        $resource->getEncoding()
    );

    // Enable content grabber before parsing items
    $parser->enableContentGrabber();

    // Return a Feed object
    $feed = $parser->execute();
}
catch (PicoFeedException $e) {
    // Do Something...
}
```
Enabling the content scraper slows feed parsing down.
**For each item a new HTTP request is made** and the downloaded HTML is parsed with XML/XPath.
Configuration
-------------
### Enable content grabber for items
- Method name: `enableContentGrabber()`
- Default value: false (also fetch content if no rule file exists)
- Argument value: bool (true: scrape only web pages that have a rule file)
```php
$parser->enableContentGrabber(false);
```
### Ignore item URLs for the content grabber
- Method name: `setGrabberIgnoreUrls()`
- Default value: empty (fetch all item URLs)
- Argument value: array (list of item URLs to ignore)
```php
$parser->setGrabberIgnoreUrls(['http://foo', 'http://bar']);
```
How to write a grabber rules file?
----------------------------------
Add a PHP file to the directory `PicoFeed\Rules`. The filename must be the domain name followed by the `.php` extension.
Example with the BBC website, `www.bbc.co.uk.php`:
```php
<?php
return array(
    'grabber' => array(
        '%.*%' => array(
            'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
            'body' => array(
                '//div[@class="story-body"]',
            ),
            'strip' => array(
                '//script',
                '//form',
                '//style',
                '//*[@class="story-date"]',
                '//*[@class="story-header"]',
                '//*[@class="story-related"]',
                '//*[contains(@class, "byline")]',
                '//*[contains(@class, "story-feature")]',
                '//*[@id="video-carousel-container"]',
                '//*[@id="also-related-links"]',
                '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
            )
        )
    )
);
```
Each rule file can contain multiple rules, so that links to different URLs of the same website can be handled differently. The first-level key is a regex, which is matched against the full path of the URL (path and query string) using **preg_match**. For example, for **http://www.bbc.co.uk/news/world-middle-east-23911833?test=1** the string that is matched is **/news/world-middle-east-23911833?test=1**.
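
To make the matching concrete, here is a minimal sketch that derives the matched string from a URL with `parse_url()` and tests it with `preg_match()`. The helper `fullPath()` and the pattern `%^/sport.*%` are hypothetical and only serve the example; PicoFeed's own URL handling may differ in detail.

```php
<?php

/**
 * Return the path plus the query string of a URL (hypothetical helper).
 */
function fullPath(string $url): string
{
    $parts = parse_url($url);
    $path = $parts['path'] ?? '/';
    $query = isset($parts['query']) ? '?'.$parts['query'] : '';

    return $path.$query;
}

$subject = fullPath('http://www.bbc.co.uk/news/world-middle-east-23911833?test=1');

// The scheme and the host are dropped, the query string is kept
echo $subject, PHP_EOL;

// The catch-all pattern of the BBC rule file matches any path
var_dump(preg_match('%.*%', $subject) === 1);

// A pattern anchored on another section of the site does not match
var_dump(preg_match('%^/sport.*%', $subject) === 1);
```
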
Each rule has the following keys:
* **body**: An array of XPath expressions selecting the content to extract from the page
* **strip**: An array of XPath expressions selecting the elements to remove from the matched content
* **test_url**: The URL of a matching page, used to test the grabber
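
How `body` and `strip` interact can be illustrated with PHP's DOM extension. The sketch below is an assumption about the general technique rather than PicoFeed's code: `applyRule()` is a hypothetical helper, and it removes the `strip` nodes from the whole document before extracting the `body` nodes, which gives the same result as removing them from the matched content.

```php
<?php

/**
 * Apply the "body" and "strip" expressions of a rule to an HTML page.
 * Hypothetical helper written for this example.
 */
function applyRule(string $html, array $rule): string
{
    libxml_use_internal_errors(true); // tolerate malformed real-world HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // Remove unwanted nodes first so they are absent from the extracted body
    foreach ($rule['strip'] as $query) {
        // Copy the node list before removing nodes, as a precaution
        foreach (iterator_to_array($xpath->query($query)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    // Concatenate every node matched by the "body" expressions
    $content = '';

    foreach ($rule['body'] as $query) {
        foreach ($xpath->query($query) as $node) {
            $content .= $dom->saveHTML($node);
        }
    }

    return $content;
}

$html = '<html><body><div class="story-body"><p>Text</p><script>track();</script>'
      . '<span class="story-date">Today</span></div><div id="footer">Footer</div></body></html>';

$out = applyRule($html, array(
    'body' => array('//div[@class="story-body"]'),
    'strip' => array('//script', '//*[@class="story-date"]'),
));

// Only the story body remains, without the script and the date
echo $out, PHP_EOL;
```
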
Don't forget to send a pull request or open a ticket to share your contribution with everybody.
**A more complex example**:
Let's say you want to extract the div with the id **video** if the article points to a URL like **http://comix.com/videos/423**, the div with the id **audio** if the article points to a URL like **http://comix.com/podcasts/5**, and the div with the id **content** for all other links. The following rule file fits that requirement and would be stored in a file called **lib/PicoFeed/Rules/comix.com.php**:
```php
<?php
return array(
    'grabber' => array(
        '%^/videos.*%' => array(
            'test_url' => 'http://comix.com/videos/423',
            'body' => array(
                '//div[@id="video"]',
            ),
            'strip' => array()
        ),
        '%^/podcasts.*%' => array(
            'test_url' => 'http://comix.com/podcasts/5',
            'body' => array(
                '//div[@id="audio"]',
            ),
            'strip' => array()
        ),
        '%.*%' => array(
            'test_url' => 'http://comix.com/blog/1',
            'body' => array(
                '//div[@id="content"]',
            ),
            'strip' => array()
        )
    )
);
```
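
A rule file like this one can be sanity-checked by confirming that every `test_url` is selected by its own pattern. The sketch below assumes that the patterns are tried in order and that the first match wins, which would explain why the catch-all `%.*%` comes last. `selectRule()` is a hypothetical helper, and the `body` and `strip` keys are left out for brevity.

```php
<?php

/**
 * Return the pattern of the first rule matching the URL, or null.
 * Hypothetical helper; assumes that the first matching pattern wins.
 */
function selectRule(array $rules, string $url): ?string
{
    $parts = parse_url($url);
    $subject = ($parts['path'] ?? '/').(isset($parts['query']) ? '?'.$parts['query'] : '');

    foreach ($rules['grabber'] as $pattern => $rule) {
        if (preg_match($pattern, $subject) === 1) {
            return $pattern;
        }
    }

    return null;
}

// Same patterns and test URLs as the comix.com rule file
$rules = array(
    'grabber' => array(
        '%^/videos.*%' => array('test_url' => 'http://comix.com/videos/423'),
        '%^/podcasts.*%' => array('test_url' => 'http://comix.com/podcasts/5'),
        '%.*%' => array('test_url' => 'http://comix.com/blog/1'),
    ),
);

// A rule is "shadowed" when an earlier pattern captures its test URL
foreach ($rules['grabber'] as $pattern => $rule) {
    $selected = selectRule($rules, $rule['test_url']);
    $status = $selected === $pattern ? 'ok' : 'shadowed';

    printf("%s -> %s (%s)\n", $rule['test_url'], $selected, $status);
}
```

Under that assumption, moving `%.*%` to the top of the array would shadow the two specific rules, so the most specific patterns should be listed first.
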
List of content grabber rules
-----------------------------
Rules are stored inside the directory [lib/PicoFeed/Rules](https://github.com/fguillot/picoFeed/tree/master/lib/PicoFeed/Rules).