Minicrawler parses URLs, executes HTTP (and HTTP/2) requests while handling cookies, network connection management and SSL/TLS protocols. By default it follows redirect locations and returns the full response, the final URL, parsed cookies and more. It is designed to handle many requests in parallel in a single thread: it multiplexes connections and runs the read/write communication asynchronously. The whole Minicrawler suite is licensed under the AGPL license.
The suite also includes a WHATWG URL Standard compliant parsing and serializing library written in C. It is fast and has only one external dependency – libicu. The library is licensed under the AGPL license.
#include <stdio.h>
#include <stdlib.h>
#include <minicrawler/minicrawler-url.h>

/**
 * First argument: input URL, second (optional): base URL
 */
int main(int argc, char *argv[]) {
    if (argc < 2) return 2;

    char *input = argv[1];
    char *base = NULL;
    if (argc > 2) {
        base = argv[2];
    }

    mcrawler_url_url url, *base_url = NULL;
    if (base) {
        base_url = (mcrawler_url_url *)malloc(sizeof(mcrawler_url_url));
        if (mcrawler_url_parse(base_url, base, NULL) == MCRAWLER_URL_FAILURE) {
            printf("Invalid base URL\n");
            return 1;
        }
    }

    if (mcrawler_url_parse(&url, input, base_url) == MCRAWLER_URL_FAILURE) {
        printf("Invalid URL\n");
        return 1;
    }

    printf("Result: %s\n", mcrawler_url_serialize_url(&url, 0));
    return 0;
}

More in test/url.c.
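As a further illustration, here is a minimal sketch (using only the functions shown above; the URLs are placeholders) of resolving a relative reference against a base URL:

#include <stdio.h>
#include <minicrawler/minicrawler-url.h>

/* Minimal sketch: resolve a relative reference against a base URL. */
int main(void) {
    mcrawler_url_url base, resolved;

    /* Parse the absolute base URL first. */
    if (mcrawler_url_parse(&base, "http://example.com/a/b", NULL) == MCRAWLER_URL_FAILURE) {
        printf("Invalid base URL\n");
        return 1;
    }

    /* Parse the relative reference with the base URL as the third argument. */
    if (mcrawler_url_parse(&resolved, "../c?x=1", &base) == MCRAWLER_URL_FAILURE) {
        printf("Invalid URL\n");
        return 1;
    }

    /* Per the WHATWG resolution algorithm this should print: http://example.com/c?x=1 */
    printf("%s\n", mcrawler_url_serialize_url(&resolved, 0));
    return 0;
}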
#include <stdio.h>
#include <string.h>
#include <minicrawler/minicrawler.h>

static void onfinish(mcrawler_url *url, void *arg) {
    printf("%d: Status: %d\n", url->index, url->status);
}

int main(void) {
    mcrawler_url url[2];
    mcrawler_url *urls[] = {&url[0], &url[1], NULL};
    mcrawler_settings settings;

    memset(&url[0], 0, sizeof(mcrawler_url));
    memset(&url[1], 0, sizeof(mcrawler_url));
    mcrawler_init_url(&url[0], "http://example.com");
    url[0].index = 0;
    mcrawler_init_url(&url[1], "http://example.com");
    url[1].index = 1;
    mcrawler_init_settings(&settings);

    mcrawler_go(urls, &settings, &onfinish, NULL);
    return 0;
}

minicrawler [options] [urloptions] url [[url2options] url2]...
options:
-2 disable HTTP/2
-6 resolve host to IPv6 address only
-8 convert from page encoding to UTF-8
-A STRING custom user agent (max 255 bytes)
-b STRING cookies in the netscape/mozilla file format (max 20 cookies)
-c convert content to text format (with UTF-8 encoding)
-DMILIS set delay time in milliseconds when downloading more pages from the same IP (default is 100 ms)
-g accept gzip encoding
-h enable output of HTTP headers
-i enable impatient mode (minicrawler exits a few seconds earlier if it doesn't make enough progress)
-k disable SSL certificate verification (allow insecure connections)
-l do not follow redirects
-mINT maximum page size in MiB (default 2 MiB)
-pSTRING password for HTTP authentication (basic or digest, max 31 bytes)
-S disable SSL/TLS support
-tSECONDS set timeout (default is 5 seconds)
-u STRING username for HTTP authentication (basic or digest, max 31 bytes)
-v verbose output (to stderr)
-w STRING write this custom header to all requests (max 4095 bytes)
urloptions:
-C STRING parameter which replaces '%' in the custom header
-P STRING HTTP POST parameters
-X STRING custom request HTTP method, no validation performed (max 15 bytes)
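For example, a hypothetical invocation (the URL is a placeholder) that outputs HTTP headers, converts the content to UTF-8 text and uses a 10 second timeout:

minicrawler -h -c -t10 http://example.com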
Minicrawler prepends its own header to the output with the following meanings:

- URL: Requested URL
- Redirected-To: Final absolute URL
- Redirect-info: Info about each redirect
- Status: HTTP Status of final response (negative in case of error)
  - -10: Invalid input
  - -9, -8: DNS error
  - -7, -6: Connection error
  - -5: SSL/TLS error
  - -4, -3: Error during sending an HTTP request
  - -2: Error during receiving an HTTP response
  - -1: Decoding or converting error
- Content-length: Length of the downloaded content in bytes
- Timeout: Reason of timeout in case of timeout
- Error-msg: Error message in case of error (negative Status)
- Content-type: Correct content type of the outputted content
- WWW-Authenticate: WWW-Authenticate header
- Cookies: Number of cookies followed by that number of lines of parsed cookies in Netscape/Mozilla file format
- Downtime: Length of the interval between the time of the first connection and the time of the last received byte, followed by the start time of the first connection
- Timing: Timing of request (DNS lookup, Initial connection, SSL, Request, Waiting, Content download, Total)
- Index: Index of URL from command line
Minicrawler builds on the following libraries:

- Asynchronous hostname resolving – c-ares
- Gzip decoding – zlib
- TLS/SSL – OpenSSL
- HTTP/2 – nghttp2
- Unicode processing – ICU
Tested platforms: Debian Linux, Red Hat Linux, OS X.
Install the following dependencies (including header files, i.e. dev packages):

On Linux with apt, run:

apt install libc-ares-dev zlib1g-dev libicu-dev libssl-dev libnghttp2-dev

The GNU Autotools and the GNU Compiler Collection are also needed; they can be installed with:
apt install make autoconf automake autotools-dev libtool gcc

On macOS with Homebrew, CFLAGS and LDFLAGS need to contain the proper paths.
You can assign them directly as options to the configure script:
./configure CFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/opt -L/usr/local/lib"

After installation, you can link against libminicrawler by adding this to your Makefile:
CFLAGS += $(shell pkg-config --cflags libminicrawler-4)
LDFLAGS += $(shell pkg-config --libs libminicrawler-4)
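Alternatively, for a quick one-off build without a Makefile, the same pkg-config flags can be passed directly to the compiler (a sketch; crawler.c is a hypothetical source file using the API shown above):

cc crawler.c $(pkg-config --cflags --libs libminicrawler-4) -o crawler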
First create a .env file with COMPOSE_PROJECT_NAME=minicrawler, then build the Docker image:

docker compose build minicrawler
docker compose run --rm minicrawler

Then run:
./autogen.sh
./configure --prefix=$PREFIX --with-ca-bundle=/var/lib/certs/ca-bundle.crt --with-ca-path=/etc/ssl/certs
make
make install
make check # for tests

Unit tests are run simply with make check. They require php-cli to be installed.
Integration tests require a running instance of httpbin. You can use a public one, such as the one at nghttp2.org, or install it locally, for example as a library from PyPI, and run it using Gunicorn:
apt install -y python3-pip
pip install httpbin
gunicorn httpbin:app

Then run the following command:
make -C integration-tests check

When using Docker, start httpbin with Docker Compose and run the tests the same way:

docker compose up -d httpbin
make -C integration-tests check

The build from the Docker image can be copied into another image, e.g. with the following Dockerfile instruction:

COPY --from=minicrawler:latest /var/lib/minicrawler/usr /usr

Projects using Minicrawler:

- Testomato – A simple website monitoring tool
- add me here