diff --git a/README.markdown b/README.markdown
index d096a2b..16a286c 100644
--- a/README.markdown
+++ b/README.markdown
@@ -279,3 +279,65 @@ It would be awesome for everybody :)
 - Line indentation: 4 spaces
 - Line endings: Unix
 - File encoding: UTF-8
+
+### How does the content grabber work?
+
+1. Try the rules (XPath patterns) defined for the domain name first (see `PicoFeed\Rules\`)
+2. Try to find the text content by using common values of the `class` and `id` attributes
+3. Fall back to Readability if no content is found
+4. Finally, if nothing is found, the feed content is displayed
+
+The content downloader uses a fake user agent, currently Google Chrome on Mac OS X.
+
+However, the content grabber doesn't work well with every website.
+**The best results are obtained with an XPath rules file.**
+
+There is a PHP script inside PicoFeed to import FiveFilters rules, but I don't use it because most of these patterns are out of date.
+
+### How to write a grabber rules file?
+
+Add a PHP file to the directory `PicoFeed\Rules`; the filename must be the domain name.
+
+Example with the BBC website, `www.bbc.co.uk.php`:
+
+    <?php
+    return array(
+        'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
+        'body' => array(
+            '//div[@class="story-body"]',
+        ),
+        'strip' => array(
+            '//script',
+            '//form',
+            '//style',
+            '//*[@class="story-date"]',
+            '//*[@class="story-header"]',
+            '//*[@class="story-related"]',
+            '//*[contains(@class, "byline")]',
+            '//*[contains(@class, "story-feature")]',
+            '//*[@id="video-carousel-container"]',
+            '//*[@id="also-related-links"]',
+            '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
+        )
+    );
+
+Currently, only `body`, `strip` and `test_url` are supported.
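+
+To make the role of `body` and `strip` more concrete, here is a minimal sketch of how such a rules array could be applied to a page with PHP's `DOMDocument` and `DOMXPath`. The function `apply_grabber_rules()` is a hypothetical helper written for this example, not PicoFeed's actual implementation, and the order used here (strip first, then extract the body) is an assumption.
+
+```php
+<?php
+// Hypothetical helper (not part of PicoFeed): applies the 'strip' and
+// 'body' XPath patterns of a rules array to an HTML string.
+function apply_grabber_rules($html, array $rules)
+{
+    $dom = new DOMDocument();
+    // Collect libxml warnings internally instead of printing them
+    // (real-world markup is often malformed).
+    libxml_use_internal_errors(true);
+    $dom->loadHTML($html);
+    libxml_clear_errors();
+
+    $xpath = new DOMXPath($dom);
+
+    // Remove every node matched by a 'strip' pattern.
+    $strip = isset($rules['strip']) ? $rules['strip'] : array();
+    foreach ($strip as $pattern) {
+        $nodes = $xpath->query($pattern);
+        if ($nodes === false) {
+            continue; // invalid XPath expression
+        }
+        // Copy the matches to an array before removing them from the tree.
+        foreach (iterator_to_array($nodes) as $node) {
+            if ($node->parentNode !== null) {
+                $node->parentNode->removeChild($node);
+            }
+        }
+    }
+
+    // Return the markup of the first 'body' pattern that matches.
+    $body = isset($rules['body']) ? $rules['body'] : array();
+    foreach ($body as $pattern) {
+        $nodes = $xpath->query($pattern);
+        if ($nodes !== false && $nodes->length > 0) {
+            $content = '';
+            foreach ($nodes as $node) {
+                $content .= $dom->saveHTML($node);
+            }
+            return $content;
+        }
+    }
+
+    return ''; // nothing matched: a caller would try the next strategy
+}
+
+// Usage with a made-up page and a shortened version of the BBC rules.
+$rules = array(
+    'body' => array('//div[@class="story-body"]'),
+    'strip' => array('//script', '//*[@class="story-date"]'),
+);
+$html = '<html><body><div class="story-body"><p>Text</p>'
+    . '<span class="story-date">today</span><script>x()</script>'
+    . '</div></body></html>';
+echo apply_grabber_rules($html, $rules);
+```
+
+An empty return value stands for "no content found", which is where the other strategies (common attributes, Readability, feed content) would take over.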
+
+Don't forget to send a pull request or open a ticket to share your contribution with everybody.
+
+### List of content grabber rules
+
+**If you want to add new rules, just open a ticket and I will do it.**
+
+- *.blog.lemonde.fr
+- *.blog.nytimes.com
+- *.nytimes.com
+- *.slate.com
+- *.wsj.com
+- rue89.com
+- www.bbc.co.uk
+- www.cnn.com
+- www.egscomics.com
+- www.lemonde.fr
+- www.numerama.com
+- www.slate.fr