HTTP Utilities
In this module, you’ll find a variety of useful functions for working with web data. The pewtils.http.canonical_link()
function is our best attempt at standardizing and cleaning a URL without losing any information, and the pewtils.http.strip_html()
function is useful for attempting to extract text from raw HTML data with minimal fine-tuning.
Functions:
- hash_url – Clears out the http/https prefix and returns an MD5 hash of the URL.
- strip_html – Attempts to strip out HTML code from an arbitrary string while preserving meaningful text components.
- trim_get_parameters – Takes a URL (presumed to be the final end point) and iterates over GET parameters, attempting to find optional ones that can be removed without generating any redirects.
- extract_domain_from_url – Attempts to extract a standardized domain from a URL by following the link and extracting the TLD.
- canonical_link – Tries to resolve a link to the "most correct" version.
- hash_url(url)[source]
Clears out the http/https prefix and returns an MD5 hash of the URL. More effective when used in conjunction with pewtils.http.canonical_link().
- Parameters
url (str) – The URL to hash
- Returns
Hashed string representation of the URL using the MD5 hashing algorithm.
- Return type
str
Usage:
>>> from pewtils.http import hash_url
>>> hash_url("http://www.example.com")
"7c1767b30512b6003fd3c2e618a86522"
>>> hash_url("www.example.com")
"7c1767b30512b6003fd3c2e618a86522"
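The hashing step itself can be sketched with the standard library. This is an illustrative approximation, not pewtils' actual implementation: it assumes only the scheme prefix is stripped before MD5-hashing, and the exact normalization pewtils applies may differ.

```python
import hashlib
import re

def hash_url_sketch(url):
    # Strip the http/https prefix so equivalent URLs hash identically;
    # pewtils' real normalization may differ in its details
    stripped = re.sub(r"^https?://", "", url)
    return hashlib.md5(stripped.encode("utf-8")).hexdigest()

# The same page with and without a scheme yields the same 32-character digest
print(hash_url_sketch("http://www.example.com") == hash_url_sketch("www.example.com"))  # True
```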
- strip_html(html, simple=False, break_tags=None)[source]
Attempts to strip out HTML code from an arbitrary string while preserving meaningful text components. By default, the function uses BeautifulSoup to parse the HTML; setting simple=True switches to a much simpler regular expression approach to parsing.
- Parameters
html (str) – The HTML to process
simple (bool) – Whether or not to use a simple regex or more complex parsing rules (default=False)
break_tags (list) – A custom list of tags on which to break (default is ["strong", "em", "i", "b", "p"])
- Returns
The text with HTML components removed
- Return type
str
Usage:
>>> from pewtils.http import strip_html
>>> my_html = "<html><head>Header text</head><body>Body text</body></html>"
>>> strip_html(my_html)
'Header text Body text'
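The simple=True mode can be approximated with a single regular expression. The sketch below is a hypothetical stand-in (strip_html's real parsing rules, break-tag handling, and whitespace cleanup are more involved):

```python
import re

def strip_html_simple(html):
    # Approximation of the simple=True mode: replace each tag with a space,
    # then collapse runs of whitespace into single spaces
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

print(strip_html_simple("<html><head>Header text</head><body>Body text</body></html>"))
# Header text Body text
```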
- trim_get_parameters(url, session=None, timeout=30, user_agent=None)[source]
Takes a URL (presumed to be the final end point) and iterates over GET parameters, attempting to find optional ones that can be removed without generating any redirects.
- Parameters
url (str) – The URL to trim
session (requests.Session object) – (Optional) A persistent session that can optionally be passed (useful if you're processing many links at once)
user_agent (str) – User agent for the auto-created requests Session to use, if a preconfigured requests Session is not provided
timeout (int or float) – Timeout for requests (default is 30)
- Returns
The original URL with optional GET parameters removed
- Return type
str
Usage:
>>> from pewtils.http import trim_get_parameters
>>> trim_get_parameters("https://httpbin.org/status/200?param=1")
"https://httpbin.org/status/200"
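The trimming loop can be sketched without touching the network by swapping the redirect check for a caller-supplied callback. Everything below is a hypothetical illustration: still_resolves stands in for the HTTP request the real function makes to verify that dropping a parameter does not trigger a redirect.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def trim_get_parameters_sketch(url, still_resolves):
    # still_resolves(candidate_url) is a stand-in for the real HTTP check:
    # return True if the URL still resolves correctly without the parameter
    parsed = urlparse(url)
    params = parse_qsl(parsed.query)
    for pair in list(params):
        candidate_params = [p for p in params if p != pair]
        candidate = urlunparse(parsed._replace(query=urlencode(candidate_params)))
        if still_resolves(candidate):
            # The parameter was optional, so drop it permanently
            params = candidate_params
    return urlunparse(parsed._replace(query=urlencode(params)))

# Pretend only "id" is required; "utm_source" can be dropped
url = "https://example.com/article?id=42&utm_source=feed"
print(trim_get_parameters_sketch(url, lambda u: "id=42" in u))
# https://example.com/article?id=42
```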
- extract_domain_from_url(url, include_subdomain=True, resolve_url=False, timeout=1.0, session=None, user_agent=None, expand_shorteners=True)[source]
Attempts to extract a standardized domain from a URL by following the link and extracting the TLD.
- Parameters
url (str) – The link from which to extract the domain
include_subdomain (bool) – Whether or not to include the subdomain (e.g. ‘news.google.com’); default is True
resolve_url (bool) – Whether to fully resolve the URL. If False (default), it will operate on the URL as-is; if True, the URL will be passed to pewtils.http.canonical_link() to be standardized prior to extracting the domain.
timeout (int or float) – (Optional, for use with resolve_url) Maximum number of seconds to wait on a request before timing out (default is 1)
session (requests.Session object) – (Optional, for use with resolve_url) A persistent session that can optionally be passed (useful if you're processing many links at once)
user_agent (str) – (Optional, for use with resolve_url) User agent for the auto-created requests Session to use, if a preconfigured requests Session is not provided
expand_shorteners (bool) – If True, shortened URLs that don't successfully expand will be checked against a list of known URL shorteners and expanded if recognized. (Default = True)
- Returns
The domain for the link
- Return type
str
Note
If resolve_url is set to True, the link will be standardized prior to domain extraction (in which case you can provide optional timeout, session, and user_agent parameters that will be passed to pewtils.http.canonical_link()). By default, however, the link will be operated on as-is. The final extracted domain is then checked against known URL shorteners (see Vanity Link Shorteners) and, if it is recognized, the expanded domain will be returned instead. Shortened URLs that are not standardized and do not follow patterns included in this dictionary of known shorteners may be returned with an incorrect domain.
Usage:
>>> from pewtils.http import extract_domain_from_url
>>> extract_domain_from_url("http://forums.bbc.co.uk", include_subdomain=False)
"bbc.co.uk"
>>> extract_domain_from_url("http://forums.bbc.co.uk", include_subdomain=True)
"forums.bbc.co.uk"
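Multi-part suffixes like .co.uk are why a TLD list is needed: naively keeping the last two host labels would return "co.uk" instead of "bbc.co.uk". The sketch below uses a tiny hand-rolled suffix set as a stand-in for the full TLD database the real function consults; KNOWN_SUFFIXES and extract_domain_sketch are illustrative, not part of pewtils.

```python
from urllib.parse import urlparse

# Tiny stand-in for a real public-suffix list; the actual function relies on
# a full TLD database, so only a few suffixes are included here
KNOWN_SUFFIXES = {"co.uk", "com", "org"}

def extract_domain_sketch(url, include_subdomain=True):
    host = urlparse(url).netloc.lower()
    labels = host.split(".")
    # Find the longest known suffix, then keep one label before it as the
    # registered domain
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in KNOWN_SUFFIXES:
            registered = ".".join(labels[max(i - 1, 0):])
            return host if include_subdomain else registered
    return host

print(extract_domain_sketch("http://forums.bbc.co.uk", include_subdomain=False))
# bbc.co.uk
print(extract_domain_sketch("http://forums.bbc.co.uk", include_subdomain=True))
# forums.bbc.co.uk
```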
- canonical_link(url, timeout=5.0, session=None, user_agent=None)[source]
Tries to resolve a link to the “most correct” version.
Useful for expanding short URLs from bit.ly / Twitter and for checking HTTP status codes without retrieving the actual data. Follows redirects and tries to pick the most informative version of a URL while avoiding redirects to generic 404 pages. Also tries to iteratively remove optional GET parameters.
May not be particularly effective on dead links, but may still be able to follow redirects enough to return a URL with the correct domain associated with the original link.
- Parameters
url (str) – The URL to test. Should be fully qualified.
timeout (int or float) – How long to wait for a response before giving up (default is 5 seconds)
session (requests.Session object) – (Optional) A persistent session that can optionally be passed (useful if you're processing many links at once)
user_agent (str) – User agent for the auto-created requests Session to use, if a preconfigured requests Session is not provided
- Returns
The “canonical” URL as supplied by the server, or the original URL if none is supplied.
- Return type
str
Note
See Link Shorteners for a complete list of shortened links recognized by this function.
This function might not resolve all existing URL modifications, but it has been tested on a large and varied collection of URLs. It typically resolves URLs to the correct final page while avoiding redirects to generic error pages.
Usage:
>>> from pewtils.http import canonical_link
>>> canonical_link("https://pewrsr.ch/2lxB0EX")
"https://www.pewresearch.org/interactives/how-does-a-computer-see-gender/"
Link Shorteners
List of link shorteners recognized by methods in this section.
General Link Shorteners
A list of known Generic Link Shorteners.
Vanity Link Shorteners
A list of known vanity URL shorteners specific to particular websites (primarily news websites).