Full article download
=====================
For feeds that provide only a summary, the full content can be downloaded directly from the original website.

How does the content grabber work?
----------------------------------
1. Try the rules (XPath patterns) defined for the domain name first
2. Try to find the text content using common `class` and `id` attribute values
3. Finally, if nothing is found, display the feed content
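The lookup order above can be sketched roughly as follows. This is only an illustration, not the grabber's actual code: the function names `firstMatch` and `grabContent`, the `$rulesByDomain` array and the generic candidate patterns in step 2 are all assumptions made for the example.

```php
<?php

// Hypothetical helper: evaluates the XPath patterns in order and returns the
// HTML of the first matching node, or null when no pattern matches.
function firstMatch(DOMXPath $xpath, array $patterns): ?string
{
    foreach ($patterns as $pattern) {
        $nodes = $xpath->query($pattern);
        if ($nodes !== false && $nodes->length > 0) {
            $node = $nodes->item(0);
            return (string) $node->ownerDocument->saveHTML($node);
        }
    }

    return null;
}

// Hypothetical sketch of the three-step lookup described above.
function grabContent(string $html, string $url, array $rulesByDomain, string $feedContent): string
{
    // Collect parser warnings internally instead of printing them,
    // since real-world HTML is rarely well-formed.
    libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // Step 1: rules registered for the domain name of the article.
    $domain = (string) parse_url($url, PHP_URL_HOST);
    if (isset($rulesByDomain[$domain]['body'])) {
        $content = firstMatch($xpath, $rulesByDomain[$domain]['body']);
        if ($content !== null) {
            return $content;
        }
    }

    // Step 2: common class and id values (this short list is an assumption).
    $candidates = array(
        '//*[@id="content"]',
        '//*[contains(@class, "article")]',
    );
    $content = firstMatch($xpath, $candidates);
    if ($content !== null) {
        return $content;
    }

    // Step 3: nothing was found, so keep the content provided by the feed.
    return $feedContent;
}
```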

The content downloader uses a fake user agent, currently Google Chrome on Mac OS X.

However, the content grabber does not work well with every website.

**The best results are obtained with an XPath rules file.**

How to write a grabber rules file?
----------------------------------
Add a PHP file to the `rules` directory. The filename must be the domain name followed by the suffix `.php`.

Example for the BBC website, `www.bbc.co.uk.php`:
```php
<?php

return array(
    'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
    'body' => array(
        '//div[@class="story-body"]',
    ),
    'strip' => array(
        '//script',
        '//form',
        '//style',
        '//*[@class="story-date"]',
        '//*[@class="story-header"]',
        '//*[@class="story-related"]',
        '//*[contains(@class, "byline")]',
        '//*[contains(@class, "story-feature")]',
        '//*[@id="video-carousel-container"]',
        '//*[@id="also-related-links"]',
        '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
    )
);
```
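To get a feel for what the `body` and `strip` patterns in such a file do, here is a minimal sketch built on PHP's DOM extension. The `applyRules` function is hypothetical and is not the grabber's API; the order used here (strip first, then select the body) is also an assumption.

```php
<?php

// Hypothetical helper: removes the nodes matched by the `strip` patterns,
// then returns the HTML of the first node matched by a `body` pattern.
function applyRules(string $html, array $rules): string
{
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    foreach ($rules['strip'] ?? array() as $pattern) {
        $nodes = $xpath->query($pattern);
        if ($nodes === false) {
            continue; // Invalid XPath expression, skip it.
        }

        // Take a snapshot of the matches before modifying the tree.
        foreach (iterator_to_array($nodes) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    foreach ($rules['body'] ?? array() as $pattern) {
        $nodes = $xpath->query($pattern);
        if ($nodes !== false && $nodes->length > 0) {
            return (string) $dom->saveHTML($nodes->item(0));
        }
    }

    // No `body` pattern matched.
    return '';
}
```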
Currently, only `body`, `strip` and `test_url` are supported.

Don't forget to send a pull request or open a ticket to share your contribution with everybody.

List of content grabber rules
-----------------------------
[List of rules included by default](https://github.com/miniflux/miniflux/tree/master/vendor/fguillot/picofeed/lib/PicoFeed/Rules).