HTTP Utilities

In this module, you’ll find a variety of useful functions for working with web data. The pewtils.http.canonical_link() function is our best attempt at standardizing and cleaning a URL without losing any information, and the pewtils.http.strip_html() function is useful for attempting to extract text from raw HTML data with minimal fine-tuning.

Functions:

hash_url(url)

Clears out http/https prefix and returns an MD5 hash of the URL.

strip_html(html[, simple, break_tags])

Attempts to strip out HTML code from an arbitrary string while preserving meaningful text components.

trim_get_parameters(url[, session, timeout, ...])

Takes a URL (presumed to be the final endpoint) and iterates over GET parameters, attempting to find optional ones that can be removed without generating any redirects.

extract_domain_from_url(url[, ...])

Attempts to extract a standardized domain from a URL by following the link and extracting the TLD.

canonical_link(url[, timeout, session, ...])

Tries to resolve a link to the "most correct" version.

hash_url(url)[source]

Clears out http/https prefix and returns an MD5 hash of the URL. More effective when used in conjunction with pewtils.http.canonical_link().

Parameters

url (str) – The URL to hash

Returns

Hashed string representation of the URL using the md5 hashing algorithm.

Return type

str

Usage:

from pewtils.http import hash_url

>>> hash_url("http://www.example.com")
"7c1767b30512b6003fd3c2e618a86522"
>>> hash_url("www.example.com")
"7c1767b30512b6003fd3c2e618a86522"
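As the example above suggests, URLs with and without the scheme hash identically. A minimal standard-library sketch of this behavior (an illustrative approximation, not pewtils' actual implementation; `hash_url_sketch` is a hypothetical name):

```python
import hashlib
import re


def hash_url_sketch(url):
    # Drop any leading http:// or https:// before hashing, so the two
    # spellings in the example above collapse to the same digest.
    # Illustrative stand-in, not pewtils' exact implementation.
    normalized = re.sub(r"^https?://", "", url)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Same hash with or without the scheme
assert hash_url_sketch("http://www.example.com") == hash_url_sketch("www.example.com")
```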
strip_html(html, simple=False, break_tags=None)[source]

Attempts to strip out HTML code from an arbitrary string while preserving meaningful text components. By default, the function will use BeautifulSoup to parse the HTML. Setting simple=True will make the function use a much simpler regular expression approach to parsing.

Parameters
  • html (str) – The HTML to process

  • simple (bool) – Whether to use a simple regex approach instead of the more thorough BeautifulSoup parsing rules (default=False)

  • break_tags (list) – A custom list of tags on which to break (default is [“strong”, “em”, “i”, “b”, “p”])

Returns

The text with HTML components removed

Return type

str

Usage:

from pewtils.http import strip_html

>>> my_html = "<html><head>Header text</head><body>Body text</body></html>"
>>> strip_html(my_html)
'Header text Body text'
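The simple=True path can be approximated with regular expressions alone. A rough stdlib sketch of that approach (illustrative only; `strip_html_simple` is a hypothetical name, and pewtils' BeautifulSoup path handles many more edge cases):

```python
import re
from html import unescape


def strip_html_simple(html):
    # Remove script/style blocks wholesale, since their contents are code,
    # not meaningful text.
    no_scripts = re.sub(r"(?is)<(script|style).*?>.*?</\1>", " ", html)
    # Replace every remaining tag with a space, then decode entities and
    # collapse runs of whitespace.
    no_tags = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    return re.sub(r"\s+", " ", unescape(no_tags)).strip()


strip_html_simple("<html><head>Header text</head><body>Body text</body></html>")
# 'Header text Body text'
```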
trim_get_parameters(url, session=None, timeout=30, user_agent=None)[source]

Takes a URL (presumed to be the final endpoint) and iterates over GET parameters, attempting to find optional ones that can be removed without generating any redirects.

Parameters
  • url (str) – The URL to trim

  • session (requests.Session object) – (Optional) A persistent session to reuse (useful if you’re processing many links at once)

  • user_agent (str) – User agent for the auto-created requests Session to use, if a preconfigured requests Session is not provided

  • timeout (int or float) – Timeout for requests

Returns

The original URL with optional GET parameters removed

Return type

str

Usage:

from pewtils.http import trim_get_parameters

>>> trim_get_parameters("https://httpbin.org/status/200?param=1")
"https://httpbin.org/status/200"
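The iterative trimming loop described above can be sketched as a pure function, with the network check injected by the caller (pewtils performs the HTTP check internally; `trim_params_sketch` and `still_resolves` are hypothetical names for illustration):

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def trim_params_sketch(url, still_resolves):
    # Try dropping each GET parameter in turn; keep the shorter URL
    # whenever `still_resolves` (a caller-supplied check, e.g. an HTTP
    # request verifying no redirect occurs) accepts the candidate.
    parsed = urlparse(url)
    params = parse_qsl(parsed.query)
    for pair in list(params):
        candidate_params = [p for p in params if p != pair]
        candidate = urlunparse(parsed._replace(query=urlencode(candidate_params)))
        if still_resolves(candidate):
            params = candidate_params
    return urlunparse(parsed._replace(query=urlencode(params)))
```

For example, a check that treats any URL still containing `id=1` as resolving correctly would strip a tracking parameter but keep the required one:

```python
trim_params_sketch("https://example.com/page?id=1&utm_source=feed", lambda u: "id=1" in u)
# 'https://example.com/page?id=1'
```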
extract_domain_from_url(url, include_subdomain=True, resolve_url=False, timeout=1.0, session=None, user_agent=None, expand_shorteners=True)[source]

Attempts to extract a standardized domain from a URL by following the link and extracting the TLD.

Parameters
  • url (str) – The link from which to extract the domain

  • include_subdomain (bool) – Whether or not to include the subdomain (e.g. ‘news.google.com’); default is True

  • resolve_url (bool) – Whether to fully resolve the URL. If False (default), it will operate on the URL as-is; if True, the URL will be passed to pewtils.http.canonical_link() to be standardized prior to extracting the domain.

  • timeout (int or float) – (Optional, for use with resolve_url) Maximum number of seconds to wait on a request before timing out (default is 1)

  • session (requests.Session object) – (Optional, for use with resolve_url) A persistent session to reuse (useful if you’re processing many links at once)

  • user_agent (str) – (Optional, for use with resolve_url) User agent for the auto-created requests Session to use, if a preconfigured requests Session is not provided

  • expand_shorteners (bool) – If True, shortened URLs that don’t successfully expand will be checked against a list of known URL shorteners and expanded if recognized. (Default = True)

Returns

The domain for the link

Return type

str

Note

If resolve_url is set to True, the link will be standardized prior to domain extraction (in which case you can provide optional timeout, session, and user_agent parameters that will be passed to pewtils.http.canonical_link()). By default, however, the link will be operated on as-is. The final extracted domain is then checked against known URL shorteners (see Vanity Link Shorteners) and if it is recognized, the expanded domain will be returned instead. Shortened URLs that are not standardized and do not follow patterns included in this dictionary of known shorteners may be returned with an incorrect domain.

Usage:

from pewtils.http import extract_domain_from_url

>>> extract_domain_from_url("http://forums.bbc.co.uk", include_subdomain=False)
"bbc.co.uk"
>>> extract_domain_from_url("http://forums.bbc.co.uk", include_subdomain=True)
"forums.bbc.co.uk"
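Proper TLD extraction requires a public-suffix list, since a naive split on dots would mistake "co.uk" for a domain plus TLD. The non-resolving behavior can be sketched with a tiny hardcoded suffix set (an assumption for illustration only; `extract_domain_sketch` and `KNOWN_SUFFIXES` are hypothetical names, and a real implementation uses a full public-suffix list):

```python
from urllib.parse import urlparse

# Tiny stand-in for a real public-suffix list; only enough suffixes for
# this illustration.
KNOWN_SUFFIXES = {"com", "org", "co.uk"}


def extract_domain_sketch(url, include_subdomain=True):
    # Illustrative domain extraction without resolving the link.
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    # Match the longest known suffix at the end of the hostname, then keep
    # one additional label as the registered domain.
    for size in (2, 1):
        suffix = ".".join(parts[-size:])
        if suffix in KNOWN_SUFFIXES and len(parts) > size:
            domain = ".".join(parts[-(size + 1):])
            return host if include_subdomain else domain
    return host
```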

canonical_link(url[, timeout, session, user_agent])[source]

Tries to resolve a link to the “most correct” version.

Useful for expanding short URLs from bit.ly / Twitter and for checking HTTP status codes without retrieving the actual data. Follows redirects and tries to pick the most informative version of a URL while avoiding redirects to generic 404 pages. Also tries to iteratively remove optional GET parameters.

May not be particularly effective on dead links, but may still be able to follow redirects enough to return a URL with the correct domain associated with the original link.

Parameters
  • url (str) – The URL to test. Should be fully qualified.

  • timeout (int or float) – How long to wait for a response before giving up (default is one second)

  • session (requests.Session object) – (Optional) A persistent session to reuse (useful if you’re processing many links at once)

  • user_agent (str) – User agent for the auto-created requests Session to use, if a preconfigured requests Session is not provided

Returns

The “canonical” URL as supplied by the server, or the original URL if none supplied.

Return type

str

Note

See Link Shorteners for a complete list of shortened links recognized by this function.

This function may not resolve every URL variation, but it has been tested on a large and varied set of URLs. It typically resolves a URL to the correct final page while avoiding redirects to generic error pages.

Usage:

from pewtils.http import canonical_link

>>> canonical_link("https://pewrsr.ch/2lxB0EX")
"https://www.pewresearch.org/interactives/how-does-a-computer-see-gender/"
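The redirect-handling preference described above, picking the most informative resolved URL while avoiding generic error pages, can be sketched as a pure function over an already-fetched redirect chain. This is a simplified heuristic with hypothetical names (`pick_canonical_sketch`, `looks_like_error_page`), not pewtils' actual logic:

```python
def pick_canonical_sketch(chain):
    # `chain` is a list of (status_code, url) pairs, in redirect order.
    # Prefer the last successfully resolved URL that does not look like a
    # generic error page; fall back to the original URL otherwise.
    def looks_like_error_page(url):
        # Crude signal: many sites redirect dead links to a /404 or error path
        return "404" in url or "error" in url.lower()

    best = chain[0][1]  # fall back to the original URL
    for status, url in chain:
        if status == 200 and not looks_like_error_page(url):
            best = url
    return best
```

With this heuristic, a chain ending on a real page yields the final URL, while a chain that dead-ends on a generic 404 page falls back to the original link, preserving at least the correct domain.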