Furl

Fast Url Parser Library

http://github.com/stricaud/furl

Problem

I have a URL:

http://www.hack.lu/index.php3?ref=http://www.wallinfire.net


How do I extract the 'lu' TLD?

Regular Expression?

^(?=[^&])(?:(?<scheme>[^:/?#]+):)?(?://(?<authority>[^/?#]*))?(?<path>[^?#]*)(?:\?(?<query>[^#]*))?(?:#(?<fragment>.*))?
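
To see how far a regular expression gets, here is a minimal Python sketch built on the generic URI pattern from RFC 3986 (Appendix B), with named groups added for readability. The group names and the helper split_uri are illustrative assumptions, not part of furl. The regex yields the authority, but the TLD still has to be cut out of it by hand.

```python
import re

# Generic URI pattern from RFC 3986, Appendix B, with Python named groups.
URI_RE = re.compile(
    r"^(?:(?P<scheme>[^:/?#]+):)?"
    r"(?://(?P<authority>[^/?#]*))?"
    r"(?P<path>[^?#]*)"
    r"(?:\?(?P<query>[^#]*))?"
    r"(?:#(?P<fragment>.*))?"
)

def split_uri(url):
    """Return the five generic URI components as a dict (None when absent)."""
    # Every part of the pattern is optional, so match() always succeeds.
    return URI_RE.match(url).groupdict()

parts = split_uri("http://www.hack.lu/index.php3?ref=http://www.wallinfire.net")

# The regex stops at the authority; the TLD needs an extra, naive split.
tld = parts["authority"].rsplit(".", 1)[-1]

# Without a scheme and "//", the host ends up in the path instead.
no_scheme = split_uri("192.168.0.1/index.php3?ref=http://slashdot.org#blah")
```

The naive split on the last dot also returns a bogus TLD for IP addresses and finds nothing useful for a bare 'localhost', so the regex alone does not answer the question.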

Broken Solutions

urllib.parse

QUrl

...

Broken Solutions

urllib.parse #1

>>> from urllib.parse import urlparse
>>> urlparse('http://192.168.0.1/index.php3?ref=http://slashdot.org#blah')
ParseResult(scheme='http', netloc='192.168.0.1', path='/index.php3', params='', 
  query='ref=http://slashdot.org', fragment='blah')

Broken Solutions

urllib.parse #2

>>> from urllib.parse import urlparse
>>> urlparse('192.168.0.1/index.php3?ref=http://slashdot.org#blah')
ParseResult(scheme='', netloc='', path='192.168.0.1/index.php3', params='', 
  query='ref=http://slashdot.org', fragment='blah')
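
A common workaround, sketched here as an assumption about how one might patch around urllib.parse rather than anything furl does, is to prepend '//' when the input has no scheme, so that the first component is treated as the network location. The helper name parse_lenient is hypothetical.

```python
import re
from urllib.parse import urlparse

# A scheme followed by "://" at the very start of the string.
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")

def parse_lenient(url):
    """Parse url, treating scheme-less input as starting with a host."""
    if not _SCHEME_RE.match(url):
        # "//host/path" is a network-path reference, so urlparse fills netloc.
        url = "//" + url
    return urlparse(url)

result = parse_lenient("192.168.0.1/index.php3?ref=http://slashdot.org#blah")
bare = parse_lenient("localhost")
```

This recovers the host, but every caller has to remember the wrapper, and it still says nothing about domain, subdomain or TLD.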

Broken Solutions

QUrl

>>> from PyQt4 import QtCore
>>> url = QtCore.QUrl("192.168.0.1/index.php3?ref=http://slashdot.org#blah")
>>> print(url.host())

>>>

Stuff ain't working!

  • Because URLs can simply be 'localhost'
  • Many scripts need to process URL fields, and they need to do it fast

What does fast mean?

  • No allocation to decode a URL
  • Read characters only once
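
The two points above can be illustrated with a small Python sketch. It is not furl's implementation: scan_offsets and field are hypothetical names, and the scanner deliberately ignores credentials, ports and multi-label suffixes such as 'co.uk'. It walks the string once from left to right (with a fixed two-character lookahead for '://') and records only (position, size) pairs, so no substring is copied until a caller asks for one.

```python
def scan_offsets(url):
    """Return {field: (pos, size)} found in one forward pass over url."""
    fields = {}
    n = len(url)
    state = "host"        # the first token is a host unless "://" follows it
    start = 0             # where the current field began
    last_dot = -1         # position of the last '.' seen in the host
    numeric_label = True  # True while the last host label is digits only
    i = 0
    while i <= n:
        c = url[i] if i < n else ""  # "" is the end-of-input sentinel
        if state == "host":
            if c == ":" and "scheme" not in fields and url.startswith("//", i + 1):
                fields["scheme"] = (start, i - start)
                i += 3                       # skip "://"
                start, last_dot, numeric_label = i, -1, True
                continue
            if c == ".":
                last_dot, numeric_label = i, True
            elif c in ("", "/", "?", "#"):
                fields["host"] = (start, i - start)
                if last_dot >= 0 and not numeric_label:
                    fields["tld"] = (last_dot + 1, i - last_dot - 1)
                start = i
                state = {"/": "path", "?": "query", "#": "fragment", "": "end"}[c]
            elif not c.isdigit():
                numeric_label = False
        elif state == "path" and c in ("", "?", "#"):
            fields["path"] = (start, i - start)
            start = i
            state = {"?": "query", "#": "fragment", "": "end"}[c]
        elif state == "query" and c in ("", "#"):
            fields["query"] = (start, i - start)
            start = i
            state = "fragment" if c == "#" else "end"
        elif state == "fragment" and c == "":
            fields["fragment"] = (start, i - start)
        i += 1
    return fields

def field(url, offsets, name):
    """Materialise one field on demand; this is the only place that copies."""
    pos, size = offsets[name]
    return url[pos:pos + size]

url = "http://www.hack.lu/index.php3?ref=http://www.wallinfire.net"
offsets = scan_offsets(url)
```

Keeping offsets instead of strings is what lets a C implementation avoid allocation: the caller's buffer is the only storage, and each field is just two integers into it.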

Pragmatic

  • A static C library you can embed
  • A dynamic C library
  • A command line tool

A command line tool: furl

$ furl -p '192.168.0.1/index.php3?ref=http://slashdot.org#blah'
scheme,credential,subdomain,domain,host,tld,port,resource_path,query_string,fragment
,,,192.168.0.1,192.168.0.1,,,/index.php3,?ref=http://slashdot.org,#blah
					  

Extract a TLD

$ furl http://www.hack.lu:42/foo.html |cut -d',' -f6
lu
					  

List your TLDs, sorted

$ cat urls |furl |cut -d',' -f6 |sort |uniq
com
lu
net
org
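
The same post-processing can be done from a script. The sketch below assumes furl's output uses the comma-separated column order shown in the header line earlier; unique_field is a hypothetical helper, and apart from the 192.168.0.1 line the sample rows are hand-written for illustration, not captured furl output.

```python
# Column order taken from the header printed by the command line tool.
FURL_COLUMNS = (
    "scheme,credential,subdomain,domain,host,tld,port,"
    "resource_path,query_string,fragment"
).split(",")

def unique_field(lines, name):
    """Sorted unique non-empty values of one column (like cut | sort | uniq)."""
    col = FURL_COLUMNS.index(name)
    values = set()
    for line in lines:
        cells = line.rstrip("\n").split(",")
        if len(cells) > col and cells[col]:
            values.add(cells[col])
    return sorted(values)

sample = [
    ",,,192.168.0.1,192.168.0.1,,,/index.php3,?ref=http://slashdot.org,#blah\n",
    "http,,www,hack.lu,www.hack.lu,lu,,/index.php3,,\n",    # illustrative row
    "https,,www,slashdot.org,www.slashdot.org,org,,,,\n",    # illustrative row
    "http,,www,hack.lu,www.hack.lu,lu,,/foo.html,,\n",       # illustrative row
]
tlds = unique_field(sample, "tld")
```

Like cut, this naive split assumes no field contains a comma; a query string with commas in it would shift the later columns, which does not affect the TLD column.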
					  

Fyodor admitted he had trouble with google-anatylics.com yesterday!

$ cat fyodor |furl |cut -d',' -f5 |sort |uniq
google-analitycs.com
google-analytics.com
google-anatylics.com
					  

Python bindings

   >>> from pyfurl.furl import Furl
   >>> f = Furl()
   >>> f.decode("https://www.slashdot.org")
   >>> f.get()
   {'credential': None, 'domain': 'slashdot.org', 'subdomain': 'www', 
    'fragment': None, 'host': 'www.slashdot.org', 'resource_path': None, 
    'tld': 'org', 'query_string': None, 'scheme': 'https', 'port': None}
   >>> 					  

How does it work?

A look at the C API:
   furl_handler_t *fh;

   fh = furl_init();
   furl_decode(fh, "https://wallinfire.net", strlen("https://wallinfire.net"));
   tld_pos = furl_get_tld_pos(fh); /* will return 19 */       
   tld_size = furl_get_tld_size(fh); /* will return 3 */       
   furl_show(fh, ',', stdout);

   furl_terminate(fh);
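
To make the two return values concrete, here is a short Python sketch that applies them to the same string. It assumes, consistently with the values in the comments above, that the position is a zero-based offset into the input and the size is a character count.

```python
url = "https://wallinfire.net"

# Values quoted in the C comments: position 19, size 3.
tld_pos, tld_size = 19, 3

# The pair identifies a slice of the caller's own buffer.
tld = url[tld_pos:tld_pos + tld_size]
```

The library hands back two integers rather than a new string, so the TLD can be read in place from the buffer passed to furl_decode.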
				  

Availability

http://github.com/stricaud/furl

License
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE 
Version 2, December 2004 

Copyright (C) 2012 Sebastien Tricaud (sebastien@honeynet.org)

Everyone is permitted to copy and distribute verbatim or modified 
copies of this license document, and changing it is allowed as long 
as the name is changed. 

DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE 
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 

 0. You just DO WHAT THE FUCK YOU WANT TO.

					  

Questions?