Update readme
parent e77b785263
commit 14d67d85e8
@@ -279,3 +279,65 @@ It would be awesome for everybody :)
- Line indentation: 4 spaces
- Line endings: Unix
- File encoding: UTF-8

### How does the content grabber work?

1. Try the rules first (XPath patterns) for the domain name (see `PicoFeed\Rules\`)
2. Try to find the text content by using common attributes for class and id
3. Fall back to Readability if no content is found
4. Finally, if nothing is found, the feed content is displayed

The content downloader uses a fake user agent, currently Google Chrome on Mac OS X.

However, the content grabber doesn't work well with every website.
**The best results are obtained with an XPath rules file.**

There is a PHP script inside PicoFeed to import Fivefilters rules, but I don't use it because most of these patterns are out of date.

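The four-step order above can be pictured as a chain of strategies tried one after another. The sketch below is only an illustration of that idea: `grabContent` and the three placeholder closures are hypothetical names, not PicoFeed's actual API.

```php
<?php
// Hypothetical sketch of the fallback order; these names are not PicoFeed's API.

/**
 * Try each strategy in order and return the first non-empty result.
 * If every strategy comes back empty, return the content shipped in the feed.
 *
 * @param callable[] $strategies  Each receives the page HTML and returns a string
 * @param string     $html        Downloaded page
 * @param string     $feedContent Content taken from the feed itself (step 4)
 */
function grabContent(array $strategies, string $html, string $feedContent): string
{
    foreach ($strategies as $strategy) {
        $content = trim($strategy($html));
        if ($content !== '') {
            return $content; // the first strategy that finds something wins
        }
    }

    return $feedContent; // nothing found: display the feed content
}

// Placeholder strategies standing in for steps 1 to 3.
$strategies = array(
    'rules'       => function (string $html): string { return ''; },
    'attributes'  => function (string $html): string { return ''; },
    'readability' => function (string $html): string { return '<p>Extracted text</p>'; },
);

echo grabContent($strategies, '<html>...</html>', 'Feed summary'), "\n";
```

Here the first two placeholders return nothing, so the third one supplies the result. If all three returned an empty string, the function would return `'Feed summary'`.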
### How to write a grabber rules file

Add a PHP file to the directory `PicoFeed\Rules`. The filename must be the domain name with a `.php` extension.

Example with the BBC website, `www.bbc.co.uk.php`:

    <?php
    return array(
        // Sample article URL used to test the rule
        'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
        // XPath expressions that select the article content
        'body' => array(
            '//div[@class="story-body"]',
        ),
        // XPath expressions for elements to remove from the content
        'strip' => array(
            '//script',
            '//form',
            '//style',
            '//*[@class="story-date"]',
            '//*[@class="story-header"]',
            '//*[@class="story-related"]',
            '//*[contains(@class, "byline")]',
            '//*[contains(@class, "story-feature")]',
            '//*[@id="video-carousel-container"]',
            '//*[@id="also-related-links"]',
            '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
        )
    );

Currently, only `body`, `strip` and `test_url` are supported.

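To check a rules array against a saved page before contributing it, something like the following could be used. It is a minimal sketch built on PHP's standard `DOMDocument` and `DOMXPath` classes. The helper name `applyRules` is hypothetical, and stripping before selecting the body is an assumption, not necessarily how PicoFeed does it.

```php
<?php
// Hypothetical helper, not PicoFeed's own code: applies a rules array to an HTML page.

function applyRules(string $html, array $rules): string
{
    // Real-world HTML is rarely valid, so collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Remove every node matched by a 'strip' pattern.
    foreach ($rules['strip'] ?? array() as $pattern) {
        $matches = $xpath->query($pattern);
        if ($matches === false) {
            continue; // malformed XPath expression
        }
        // Copy the matches into an array first so removal does not disturb iteration.
        foreach (iterator_to_array($matches) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Return the markup of the first 'body' pattern that matches something.
    foreach ($rules['body'] ?? array() as $pattern) {
        $matches = $xpath->query($pattern);
        if ($matches !== false && $matches->length > 0) {
            return $dom->saveHTML($matches->item(0));
        }
    }

    return ''; // no 'body' pattern matched
}

$rules = array(
    'body'  => array('//div[@class="story-body"]'),
    'strip' => array('//script', '//*[@class="story-date"]'),
);

$html = '<html><body><div class="story-body"><p>Hello</p><script>alert(1)</script>'
      . '<span class="story-date">today</span></div></body></html>';

echo applyRules($html, $rules), "\n";
```

The output should keep the `story-body` element and its paragraph while dropping the script and the date. An empty string means no `body` pattern matched, which is the case where the grabber would move on to its next fallback.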
Don't forget to send a pull request or open a ticket to share your contribution with everybody.

### List of content grabber rules

**If you want to add new rules, just open a ticket and I will do it.**

- *.blog.lemonde.fr
- *.blog.nytimes.com
- *.nytimes.com
- *.slate.com
- *.wsj.com
- rue89.com
- www.bbc.co.uk
- www.cnn.com
- www.egscomics.com
- www.lemonde.fr
- www.numerama.com
- www.slate.fr