Update readme

This commit is contained in:
Frédéric Guillot 2013-08-31 11:27:21 -04:00
parent e77b785263
commit 14d67d85e8
1 changed files with 62 additions and 0 deletions

View File

@ -279,3 +279,65 @@ It would be awesome for everybody :)
- Line indentation: 4 spaces
- Line endings: Unix
- File encoding: UTF-8
### How the content grabber works?
1. Try with rules first (xpath patterns) for the domain name (see `PicoFeed\Rules\`)
2. Try to find the text content by using common attributes for class and id
3. Fallback to Readability if no content is found
4. Finally, if nothing is found, the feed content is displayed
The content downloader use a fake user agent, actually Google Chrome under Mac Os X.
However the content grabber doesn't work very well with all websites.
**The best results are obtained with Xpath rules file.**
There is a PHP script inside PicoFeed to import Fivefilters rules, but I dont' use it because almost of these patterns are not up to date.
### How to write a grabber rules file?
Add a PHP file to the directory `PicoFeed\Rules`, the filename must be the domain name:
Example with the BBC website, `www.bbc.co.uk.php`:
<?php
return array(
'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
'body' => array(
'//div[@class="story-body"]',
),
'strip' => array(
'//script',
'//form',
'//style',
'//*[@class="story-date"]',
'//*[@class="story-header"]',
'//*[@class="story-related"]',
'//*[contains(@class, "byline")]',
'//*[contains(@class, "story-feature")]',
'//*[@id="video-carousel-container"]',
'//*[@id="also-related-links"]',
'//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
)
);
Actually, only `body`, `strip` and `test_url` are supported.
Don't forget to send a pull request or a ticket to share your contribution with everybody,
### List of content grabber rules
**If you want to add new rules, just open a ticket and I will do it.**
- *.blog.lemonde.fr
- *.blog.nytimes.com
- *.nytimes.php
- *.slate.com
- *.wsj.com
- rue89.com
- www.bbc.co.uk
- www.cnn.com
- www.egscomics.com
- www.lemonde.fr
- www.numerama.com
- www.slate.fr