Update readme

2013-08-31 11:27:21 -04:00 · 2013-08-31 11:27:21 -04:00 · 14d67d85e8
commit 14d67d85e8
parent e77b785263
1 changed files with 62 additions and 0 deletions
--- a/README.markdown
+++ b/README.markdown
@@ -279,3 +279,65 @@ It would be awesome for everybody :)
 - Line indentation: 4 spaces
 - Line endings: Unix
 - File encoding: UTF-8
+
+### How does the content grabber work?
+
+1. Try the XPath rules for the domain name first (see `PicoFeed\Rules\`)
+2. Try to find the text content by using common `class` and `id` attributes
+3. Fall back to Readability if no content is found
+4. Finally, if nothing is found, the feed content is displayed
+
+The content downloader uses a fake user agent, currently Google Chrome on Mac OS X.
+
+However, the content grabber doesn't work well with every website.
+**The best results are obtained with an XPath rules file.**
+
+There is a PHP script inside PicoFeed to import Fivefilters rules, but I don't use it because most of these patterns are not up to date.
+
+### How do I write a grabber rules file?
+
+Add a PHP file to the directory `PicoFeed\Rules`. The filename must be the domain name followed by `.php`.
+
+Example with the BBC website, `www.bbc.co.uk.php`:
+
+    <?php
+    return array(
+        'test_url' => 'http://www.bbc.co.uk/news/world-middle-east-23911833',
+        'body' => array(
+            '//div[@class="story-body"]',
+        ),
+        'strip' => array(
+            '//script',
+            '//form',
+            '//style',
+            '//*[@class="story-date"]',
+            '//*[@class="story-header"]',
+            '//*[@class="story-related"]',
+            '//*[contains(@class, "byline")]',
+            '//*[contains(@class, "story-feature")]',
+            '//*[@id="video-carousel-container"]',
+            '//*[@id="also-related-links"]',
+            '//*[contains(@class, "share") or contains(@class, "hidden") or contains(@class, "hyper")]',
+        )
+    );
+
+Currently, only `body`, `strip` and `test_url` are supported.
+
+Don't forget to send a pull request or open a ticket to share your contribution with everybody.
+
+### List of content grabber rules
+
+**If you want to add new rules, just open a ticket and I will do it.**
+
+- *.blog.lemonde.fr
+- *.blog.nytimes.com
+- *.nytimes.com
+- *.slate.com
+- *.wsj.com
+- rue89.com
+- www.bbc.co.uk
+- www.cnn.com
+- www.egscomics.com
+- www.lemonde.fr
+- www.numerama.com
+- www.slate.fr
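
To illustrate how a rules file such as the `www.bbc.co.uk.php` example might be applied to a downloaded page, here is a minimal sketch using PHP's built-in `DOMDocument` and `DOMXPath`. The function name `apply_rules` is hypothetical and is not part of PicoFeed. The order of operations (strip first, then take the first matching `body` pattern) is an assumption, not necessarily what the grabber does internally.

```php
<?php
// Hypothetical helper, NOT PicoFeed's actual implementation.
// Applies a rules array (same shape as the README example) to an HTML string
// and returns the HTML of the first node matched by a 'body' pattern.
function apply_rules(string $html, array $rules): string
{
    $dom = new DOMDocument();

    // Real-world pages are rarely valid; collect libxml errors instead of printing warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Remove every node matched by a 'strip' pattern.
    foreach ($rules['strip'] ?? [] as $pattern) {
        $nodes = $xpath->query($pattern);
        if ($nodes === false) {
            continue; // invalid XPath expression, skip it
        }
        // Copy the node list to an array before mutating the tree.
        foreach (iterator_to_array($nodes) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Return the first node matched by a 'body' pattern (assumed behaviour).
    foreach ($rules['body'] ?? [] as $pattern) {
        $nodes = $xpath->query($pattern);
        if ($nodes !== false && $nodes->length > 0) {
            return $dom->saveHTML($nodes->item(0));
        }
    }

    // Nothing matched: the caller would move on to the next strategy.
    return '';
}
```

An empty return value stands for "no content found with rules", which is the point where the grabber would fall back to the other strategies listed in the README (common attributes, Readability, then the feed content).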