Crawly: Micro crawler for Python
Crawly is a Python library that lets you crawl websites and extract data
from them using a simple API.
Crawly works by combining a handful of tools into a small library
(~350 lines of code) that fetches a website's HTML, crawls it (follows links),
and extracts data from each page.
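This fetch, follow-links, extract loop can be sketched in plain Python. The code below is not Crawly's actual implementation, just a minimal stdlib illustration of the idea; the function names (`extract_links`, `crawl`) and the pluggable `fetch` callable are illustrative assumptions.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl: fetch a page, queue its links, repeat.

    `fetch` is any callable mapping a URL to its HTML; Crawly uses
    requests for this step.
    """
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        pages[url] = html
        queue.extend(extract_links(html))
    return pages
```

Passing `fetch` as a parameter keeps the loop testable without touching the network; a real crawler would plug in an HTTP client there.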
Libraries used:
- requests: a Python HTTP library, used by Crawly to fetch website HTML.
It takes care of connection pooling, is easily configurable,
and supports many features, including SSL, cookies, persistent sessions,
and automatic content decoding.
- gevent: the engine responsible for Crawly's speed; gevent lets you run
concurrent code using green threads.
- lxml: a fast, easy-to-use Python library used to parse the fetched HTML
and make data extraction easy.
- logging: the Python standard library module used to log information;
also easily configurable.