HTTP Utilities

In this module, you’ll find a variety of useful functions for working with web data. The pewtils.http.canonical_link() function is our best attempt at standardizing and cleaning a URL without losing any information, and the pewtils.http.strip_html() function is useful for attempting to extract text from raw HTML data with minimal fine-tuning.

Functions:

hash_url(url)

Clears out http/https prefix and returns an MD5 hash of the URL.

strip_html(html[, simple, break_tags])

Attempts to strip out HTML code from an arbitrary string while preserving meaningful text components.

trim_get_parameters(url[, session, timeout, ...])

Takes a URL (presumed to be the final endpoint) and iterates over GET parameters, attempting to find optional ones that can be removed without generating any redirects.

extract_domain_from_url(url[, ...])

Attempts to extract a standardized domain from a URL by following the link and extracting the TLD.

canonical_link(url[, timeout, session, ...])

Tries to resolve a link to the "most correct" version.

hash_url(url)[source]

Clears out http/https prefix and returns an MD5 hash of the URL. More effective when used in conjunction with pewtils.http.canonical_link().

Parameters

url (str) – The URL to hash

Returns

Hashed string representation of the URL using the md5 hashing algorithm.

Return type

str

Usage:

from pewtils.http import hash_url

>>> hash_url("http://www.example.com")
"7c1767b30512b6003fd3c2e618a86522"
>>> hash_url("www.example.com")
"7c1767b30512b6003fd3c2e618a86522"
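As the example above suggests, URLs with and without the scheme hash identically. A minimal standard-library sketch of this behavior (an illustrative approximation, not pewtils' actual implementation; `hash_url_sketch` is a hypothetical name):

```python
import hashlib
import re


def hash_url_sketch(url):
    # Drop any leading http:// or https:// before hashing, so the two
    # spellings in the example above collapse to the same digest.
    # Illustrative stand-in, not pewtils' exact implementation.
    normalized = re.sub(r"^https?://", "", url)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Same hash with or without the scheme
assert hash_url_sketch("http://www.example.com") == hash_url_sketch("www.example.com")
```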
strip_html(html, simple=False, break_tags=None)[source]

Attempts to strip out HTML code from an arbitrary string while preserving meaningful text components. By default, the function will use BeautifulSoup to parse the HTML. Setting simple=True will make the function use a much simpler regular expression approach to parsing.

Parameters
  • html (str) – The HTML to process

  • simple (bool) – Whether to use a simple regex approach instead of the more thorough BeautifulSoup parsing rules (default=False)

  • break_tags (list) – A custom list of tags on which to break (default is [“strong”, “em”, “i”, “b”, “p”])

Returns

The text with HTML components removed

Return type

str

Usage:

from pewtils.http import strip_html

>>> my_html = "<html><head>Header text</head><body>Body text</body></html>"
>>> strip_html(my_html)
'Header text Body text'
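The simple=True path can be approximated with regular expressions alone. A rough stdlib sketch of that approach (illustrative only; `strip_html_simple` is a hypothetical name, and pewtils' BeautifulSoup path handles many more edge cases):

```python
import re
from html import unescape


def strip_html_simple(html):
    # Remove script/style blocks wholesale, since their contents are code,
    # not meaningful text.
    no_scripts = re.sub(r"(?is)<(script|style).*?>.*?</\1>", " ", html)
    # Replace every remaining tag with a space, then decode entities and
    # collapse runs of whitespace.
    no_tags = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    return re.sub(r"\s+", " ", unescape(no_tags)).strip()


strip_html_simple("<html><head>Header text</head><body>Body text</body></html>")
# 'Header text Body text'
```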
trim_get_parameters(url, session=None, timeout=30, user_agent=None)[source]

Takes a URL (presumed to be the final endpoint) and iterates over GET parameters, attempting to find optional ones that can be removed without generating any redirects.

Parameters
  • url (str) – The URL to trim

  • session (requests.Session object) – (Optional) A persistent session to reuse (useful if you’re processing many links at once)

  • user_agent (str) – User agent for the auto-created requests Session to use, if a preconfigured requests Session is not provided

  • timeout (int or float) – Timeout for requests

Returns

The original URL with optional GET parameters removed

Return type

str

Usage:

from pewtils.http import trim_get_parameters

>>> trim_get_parameters("https://httpbin.org/status/200?param=1")
"https://httpbin.org/status/200"
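The iterative trimming loop described above can be sketched as a pure function, with the network check injected by the caller (pewtils performs the HTTP check internally; `trim_params_sketch` and `still_resolves` are hypothetical names for illustration):

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def trim_params_sketch(url, still_resolves):
    # Try dropping each GET parameter in turn; keep the shorter URL
    # whenever `still_resolves` (a caller-supplied check, e.g. an HTTP
    # request verifying no redirect occurs) accepts the candidate.
    parsed = urlparse(url)
    params = parse_qsl(parsed.query)
    for pair in list(params):
        candidate_params = [p for p in params if p != pair]
        candidate = urlunparse(parsed._replace(query=urlencode(candidate_params)))
        if still_resolves(candidate):
            params = candidate_params
    return urlunparse(parsed._replace(query=urlencode(params)))
```

For example, a check that treats any URL still containing `id=1` as resolving correctly would strip a tracking parameter but keep the required one:

```python
trim_params_sketch("https://example.com/page?id=1&utm_source=feed", lambda u: "id=1" in u)
# 'https://example.com/page?id=1'
```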
extract_domain_from_url(url, include_subdomain=True, resolve_url=False, timeout=1.0, session=None, user_agent=None, expand_shorteners=True)[source]

Attempts to extract a standardized domain from a URL by following the link and extracting the TLD.

Parameters
  • url (str) – The link from which to extract the domain

  • include_subdomain (bool) – Whether or not to include the subdomain (e.g. ‘news.google.com’); default is True

  • resolve_url (bool) – Whether to fully resolve the URL. If False (default), it will operate on the URL as-is; if True, the URL will be passed to pewtils.http.canonical_link() to be standardized prior to extracting the domain.

  • timeout (int or float) – (Optional, for use with resolve_url) Maximum number of seconds to wait on a request before timing out (default is 1)

  • session (requests.Session object) – (Optional, for use with resolve_url) A persistent session to reuse (useful if you’re processing many links at once)

  • user_agent (str) – (Optional, for use with resolve_url) User agent for the auto-created requests Session to use, if a preconfigured requests Session is not provided

  • expand_shorteners (bool) – If True, shortened URLs that don’t successfully expand will be checked against a list of known URL shorteners and expanded if recognized. (Default = True)

Returns

The domain for the link

Return type

str

Note

If resolve_url is set to True, the link will be standardized prior to domain extraction (in which case you can provide optional timeout, session, and user_agent parameters that will be passed to pewtils.http.canonical_link()). By default, however, the link will be operated on as-is. The final extracted domain is then checked against known URL shorteners (see Vanity Link Shorteners) and if it is recognized, the expanded domain will be returned instead. Shortened URLs that are not standardized and do not follow patterns included in this dictionary of known shorteners may be returned with an incorrect domain.

Usage:

from pewtils.http import extract_domain_from_url

>>> extract_domain_from_url("http://forums.bbc.co.uk", include_subdomain=False)
"bbc.co.uk"
>>> extract_domain_from_url("http://forums.bbc.co.uk", include_subdomain=True)
"forums.bbc.co.uk"
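Proper TLD extraction requires a public-suffix list, since a naive split on dots would mistake "co.uk" for a domain plus TLD. The non-resolving behavior can be sketched with a tiny hardcoded suffix set (an assumption for illustration only; `extract_domain_sketch` and `KNOWN_SUFFIXES` are hypothetical names, and a real implementation uses a full public-suffix list):

```python
from urllib.parse import urlparse

# Tiny stand-in for a real public-suffix list; only enough suffixes for
# this illustration.
KNOWN_SUFFIXES = {"com", "org", "co.uk"}


def extract_domain_sketch(url, include_subdomain=True):
    # Illustrative domain extraction without resolving the link.
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    # Match the longest known suffix at the end of the hostname, then keep
    # one additional label as the registered domain.
    for size in (2, 1):
        suffix = ".".join(parts[-size:])
        if suffix in KNOWN_SUFFIXES and len(parts) > size:
            domain = ".".join(parts[-(size + 1):])
            return host if include_subdomain else domain
    return host
```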

canonical_link(url[, timeout, session, user_agent])[source]

Tries to resolve a link to the “most correct” version.

Useful for expanding short URLs from bit.ly / Twitter and for checking HTTP status codes without retrieving the actual data. Follows redirects and tries to pick the most informative version of a URL while avoiding redirects to generic 404 pages. Also tries to iteratively remove optional GET parameters.

May not be particularly effective on dead links, but may still be able to follow redirects enough to return a URL with the correct domain associated with the original link.

Parameters
  • url (str) – The URL to test. Should be fully qualified.

  • timeout (int or float) – How long to wait for a response before giving up (default is one second)

  • session (requests.Session object) – (Optional) A persistent session to reuse (useful if you’re processing many links at once)

  • user_agent (str) – User agent for the auto-created requests Session to use, if a preconfigured requests Session is not provided

Returns

The “canonical” URL as supplied by the server, or the original URL if none supplied.

Return type

str

Note

See Link Shorteners for a complete list of shortened links recognized by this function.

This function may not resolve every URL variation, but it has been tested on a large and varied set of URLs. It typically resolves a URL to the correct final page while avoiding redirects to generic error pages.

Usage:

from pewtils.http import canonical_link

>>> canonical_link("https://pewrsr.ch/2lxB0EX")
"https://www.pewresearch.org/interactives/how-does-a-computer-see-gender/"
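The redirect-handling preference described above, picking the most informative resolved URL while avoiding generic error pages, can be sketched as a pure function over an already-fetched redirect chain. This is a simplified heuristic with hypothetical names (`pick_canonical_sketch`, `looks_like_error_page`), not pewtils' actual logic:

```python
def pick_canonical_sketch(chain):
    # `chain` is a list of (status_code, url) pairs, in redirect order.
    # Prefer the last successfully resolved URL that does not look like a
    # generic error page; fall back to the original URL otherwise.
    def looks_like_error_page(url):
        # Crude signal: many sites redirect dead links to a /404 or error path
        return "404" in url or "error" in url.lower()

    best = chain[0][1]  # fall back to the original URL
    for status, url in chain:
        if status == 200 and not looks_like_error_page(url):
            best = url
    return best
```

With this heuristic, a chain ending on a real page yields the final URL, while a chain that dead-ends on a generic 404 page falls back to the original link, preserving at least the correct domain.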