************** Examples ************** Check for null values ----------------------------------------------------- You can use the :py:func:`pewtils.is_null` and :py:func:`pewtils.is_not_null` to quickly check for a \ variety of common null values. .. code-block:: python from pewtils import is_null from pewtils import is_not_null import numpy as np >>> is_null(None) True >>> is_null("None") True >>> is_null("nan") True >>> is_null("") True >>> is_null(" ") True >>> is_null("NaN") True >>> is_null("none") True >>> is_null("NONE") True >>> is_null("n/a") True >>> is_null("N/A") True >>> is_null(np.nan) True >>> is_null("-9", custom_nulls=["-9"]) True >>> is_null("Hello World") False >>> is_null(0.0) False Collapse documents into context-sensitive hashes ----------------------------------------------------- When working with large documents, you can use the :py:func:`pewtils.get_hash` function to convert \ them into a variety of different hashed representations. By default, this function uses SSDEEP, which \ produced context-sensitive hashes that can be useful for searching for similar documents. .. code-block:: python from pewtils import get_hash >>> doc1 = "This is a document." >>> doc2 = "This is a document. But this one is longer." >>> get_hash(doc1) '3:hMCE+RL:hu+t' >>> get_hash(doc2) '3:hMCE+RGreCQHCAb:hu+0rLkb' # Notice that both hashes start the same way, corresponding to their overlapping text. Flatten nested lists ----------------------------------------------------- Easily flatten lists of lists: .. code-block:: python from pewtils import flatten_list >>> nested_lists = [[1, 2, 3], [4, 5, 6]] >>> flatten_list(nested_lists) [1, 2, 3, 4, 5, 6] Recursively update dictionaries and object attributes ----------------------------------------------------- Map a dictionary or object onto another version of itself to update overlapping attributes: .. code-block:: python from pewtils import recursive_update class TestObject(object): def __init__(self, value): self.value = value self.dict = {"obj_key": "original"} def __repr__(self): return("TestObject(value='{}', dict={})".format(self.value, self.dict)) original = { "object": TestObject("original"), "key1": {"key2": "original"} } update = { "object": {"value": "updated", "dict": {"obj_key": "updated"}}, "key1": {"key3": "new"} } >>> recursive_update(original, update) {'object': TestObject(value='updated', dict={'obj_key': 'updated'}), 'key1': {'key2': 'original', 'key3': 'new'}} Efficiently map a function onto a Pandas Series ----------------------------------------------------- Avoid repeating database lookups or expensive computations when applying a function to a Pandas \ Series by using the :py:func:`pewtils.cached_series_mapper` function, which caches the results \ for each value in the series as it iterates. .. code-block:: python import pandas as pd from pewtils import cached_series_mapper values = ["value"]*10 def my_function(x): print(x) return x df = pd.DataFrame(values, columns=['column']) >>> mapped = df['column'].map(my_function) value value value value value value value value value value >>> mapped = cached_series_mapper(df['column'], my_function) value Read and write data in a variety of formats ----------------------------------------------------- The :py:class:`pewtils.io.FileHandler` class lets you easily read and write files in a variety of \ formats with minimal code, and it has support for Amazon S3 too: .. code-block:: python from pewtils.io import FileHandler >>> h = FileHandler("./", use_s3=False) # current local folder >>> df = h.read("my_csv", format="csv") # Do something and save to Excel >>> h.write("my_new_csv", df, format="xlsx") >>> my_data = [{"key": "value"}] >>> h.write("my_data", my_data, format="json") >>> my_data = ["a", "python", "list"] >>> h.write("my_data", my_data, format="pkl") # To read/write to an S3 bucket, simply pass your credentials >>> h = FileHandler("/my_folder", use_s3=True, aws_access="12345", aws_secret="67890", bucket="my-bucket") # The FileHandler can also detect your tokens directly from your environment # Just set the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and S3_BUCKET Quickly extract text from raw HTML ----------------------------------------------------- It's not always perfect, but the :py:func:`pewtils.http.strip_html` function can often be used to \ extract most of the valuable text data from a raw HTML documents - useful for quick exploratory \ analysis after scraping a bunch of webpages. .. code-block:: python from pewtils.http import strip_html >>> my_html = "Header textBody text" >>> strip_html(my_html) 'Header text\n\nBody text' Standardize URLs and extract domains ----------------------------------------------------- The :py:func:`pewtils.http.canonical_link` function is our best attempt at resolving URLs to their \ true form: it follows shortened URLs, removes unnecessary GET parameters, and tries to avoid returning \ incorrect 404 pages in favor of the most informative last-known version of a URL. Once links have been \ standardized, you can also use the :py:func:`pewtils.http.extract_domain_from_url` function to pull \ out domains and subdomains. .. code-block:: python from pewtils.http import canonical_link >>> canonical_link("https://pewrsr.ch/2lxB0EX?unnecessary_param=1") "https://www.pewresearch.org/interactives/how-does-a-computer-see-gender/" from pewtils.http import extract_domain_from_url >>> extract_domain_from_url("http://forums.bbc.co.uk", include_subdomain=False) "bbc.co.uk" >>> extract_domain_from_url("http://forums.bbc.co.uk", include_subdomain=True) "forums.bbc.co.uk"