Crawly: Micro crawler for Python
Crawly is a Python library that lets you crawl websites and extract data
from them using a simple API.
Crawly works by combining a handful of tools into a small library
(~350 lines of code) that fetches a website's HTML, crawls it (follows links),
and extracts data from each page.
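This fetch, follow-links, extract loop can be sketched in plain Python. The code below is not Crawly's actual implementation, just a minimal stdlib illustration of the idea; the function names (`extract_links`, `crawl`) and the pluggable `fetch` callable are illustrative assumptions.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl: fetch a page, queue its links, repeat.

    `fetch` is any callable mapping a URL to its HTML; Crawly uses
    requests for this step.
    """
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        pages[url] = html
        queue.extend(extract_links(html))
    return pages
```

Passing `fetch` as a parameter keeps the loop testable without touching the network; a real crawler would plug in an HTTP client there.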
Libraries used:
- requests: a Python HTTP library, used by Crawly to fetch website HTML.
It takes care of connection pooling, is easily configurable,
and supports many features, including SSL, cookies, persistent sessions,
and automatic content decoding.
- gevent: the engine responsible for Crawly's speed; gevent lets you run
concurrent code using green threads.
- lxml: a fast, easy-to-use Python library used to parse the fetched HTML
and make data extraction easy.
- logging: the Python standard library module used to log information;
also easily configurable.