wxpath - declarative web crawling with XPath wxpath is a declarative web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, wxpath lets you describe what to follow and what to extract in a single expression. wxpath executes that expression concurrently, breadth-first-ish, and streams results as they are discovered. By introducing the url(...) operator and the /// syntax, wxpath's engine is able to perform deep (or paginated) web crawling and extraction. NOTE: This project is in early development. Core concepts are stable, but the API and features may change. Please report issues - in particular, deadlocked crawls or unexpected behavior - and any features you'd like to see (no guarantee they'll be implemented). Contents Example import wxpath from wxpath . settings import CRAWLER_SETTINGS # Custom headers for politeness; necessary for some sites (e.g., Wikipedia) CRAWLER_SETTINGS . headers = { 'User-Agent' : 'my-app/0.4.0 (contact: you@example.com)' } # Crawl, extract fields, build a knowledge graph path_expr = """ url('https://en.wikipedia.org/wiki/Expression_language') ///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))]) /map{ 'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.), 'url': string(base-uri(.)), 'short_description': //div[contains(@class, 'shortdescription')]/text() ! string(.), 'forward_links': //div[@id="mw-content-text"]//a/@href ! string(.) } """ for item in wxpath . wxpath_async_blocking_iter ( path_expr , max_depth = 1 ): print ( item ) Output: map { 'title' : 'Computer language' , 'url' : 'https://en.wikipedia.org/wiki/Computer_language' , 'short_description' : 'Formal language for communicating with a computer' , 'forward_links' : [ '/wiki/Formal_language' , '/wiki/Communication' , ...]} map { 'title' : 'Advanced Boolean Expression Language' , 'url' : 'https://en.wikipedia.org/wiki/Advanced_Boolean_Expression_Language' , 'short_description' : 'Hardwa...
First seen: 2026-01-20 18:35
Last seen: 2026-01-20 20:35