Examples

Check for null values

You can use the pewtils.is_null() and pewtils.is_not_null() functions to quickly check for a variety of common null values.

from pewtils import is_null
from pewtils import is_not_null
import numpy as np

>>> is_null(None)
True
>>> is_null("None")
True
>>> is_null("nan")
True
>>> is_null("")
True
>>> is_null(" ")
True
>>> is_null("NaN")
True
>>> is_null("none")
True
>>> is_null("NONE")
True
>>> is_null("n/a")
True
>>> is_null("N/A")
True
>>> is_null(np.nan)
True
>>> is_null("-9", custom_nulls=["-9"])
True
>>> is_null("Hello World")
False
>>> is_null(0.0)
False
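
To make the behavior above concrete, here is a minimal standard-library sketch of the same kind of check. It is not the pewtils implementation: the names looks_null, looks_not_null, and NULL_STRINGS are hypothetical, and the set of null-like strings is inferred only from the examples shown above.

```python
import math

# Hypothetical set of null-like strings, inferred from the examples above.
NULL_STRINGS = {"", "none", "nan", "n/a"}


def looks_null(value, custom_nulls=None):
    """Illustrative stand-in for a null check; not the pewtils code."""
    if value is None:
        return True
    # Caller-supplied sentinel values such as "-9"
    if custom_nulls and value in custom_nulls:
        return True
    # Covers float("nan") and np.nan, which is a Python float
    if isinstance(value, float) and math.isnan(value):
        return True
    # Compare strings case-insensitively, ignoring surrounding whitespace
    if isinstance(value, str):
        return value.strip().lower() in NULL_STRINGS
    return False


def looks_not_null(value, custom_nulls=None):
    """Logical inverse of looks_null."""
    return not looks_null(value, custom_nulls=custom_nulls)


print(looks_null(" "))                        # True
print(looks_null("-9", custom_nulls=["-9"]))  # True
print(looks_not_null(0.0))                    # True
```

Stripping and lowercasing the string before the lookup is what lets a single entry such as "n/a" cover "N/A" as well.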

Collapse documents into context-sensitive hashes

When working with large documents, you can use the pewtils.get_hash() function to convert them into a variety of hashed representations. By default, this function uses SSDEEP, which produces context-sensitive hashes that can be useful for finding similar documents.

from pewtils import get_hash

>>> doc1 = "This is a document."
>>> doc2 = "This is a document. But this one is longer."
>>> get_hash(doc1)
'3:hMCE+RL:hu+t'
>>> get_hash(doc2)
'3:hMCE+RGreCQHCAb:hu+0rLkb'
# Notice that both hashes start the same way, corresponding to their overlapping text.
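
The overlap noted above can be inspected programmatically. The sketch below takes the two hash strings from the example, splits each on ":" (SSDEEP signatures have the form blocksize:hash:hash), and reports the prefix shared by each part. The helper shared_prefixes is hypothetical, and a prefix comparison is only an illustration; it is not how SSDEEP itself scores similarity.

```python
import os

# Hash strings copied from the example above
hash1 = "3:hMCE+RL:hu+t"
hash2 = "3:hMCE+RGreCQHCAb:hu+0rLkb"


def shared_prefixes(a, b):
    """Hypothetical helper: common prefix of each colon-separated part."""
    return [
        os.path.commonprefix([part_a, part_b])
        for part_a, part_b in zip(a.split(":"), b.split(":"))
    ]


print(shared_prefixes(hash1, hash2))  # ['3', 'hMCE+R', 'hu+']
```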

Flatten nested lists

Easily flatten lists of lists:

from pewtils import flatten_list

>>> nested_lists = [[1, 2, 3], [4, 5, 6]]
>>> flatten_list(nested_lists)
[1, 2, 3, 4, 5, 6]
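
For comparison, the same one-level flattening can be written with the standard library alone. This is a sketch of the general technique, not a description of how flatten_list is implemented.

```python
from itertools import chain

nested_lists = [[1, 2, 3], [4, 5, 6]]

# chain.from_iterable concatenates the inner lists lazily; list() materializes the result
flat = list(chain.from_iterable(nested_lists))
print(flat)  # [1, 2, 3, 4, 5, 6]
```

Note that this approach removes exactly one level of nesting, so deeper lists stay nested.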

Recursively update dictionaries and object attributes

Map a dictionary or object onto another version of itself to update overlapping attributes:

from pewtils import recursive_update

class TestObject(object):
    def __init__(self, value):
        self.value = value
        self.dict = {"obj_key": "original"}

    def __repr__(self):
        return "TestObject(value='{}', dict={})".format(self.value, self.dict)

original = {
    "object": TestObject("original"),
    "key1": {"key2": "original"}
}
update = {
    "object": {"value": "updated", "dict": {"obj_key": "updated"}},
    "key1": {"key3": "new"}
}

>>> recursive_update(original, update)
{'object': TestObject(value='updated', dict={'obj_key': 'updated'}),
 'key1': {'key2': 'original', 'key3': 'new'}}
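
The dictionary half of this behavior can be sketched in a few lines of plain Python. The function merge_dicts below is a hypothetical simplification: it handles nested dictionaries only and does not update object attributes, so it should not be read as the recursive_update implementation.

```python
def merge_dicts(existing, new):
    """Hypothetical helper: merge `new` into `existing` in place, recursing into nested dicts."""
    for key, value in new.items():
        if isinstance(value, dict) and isinstance(existing.get(key), dict):
            # Both sides are dicts: merge them instead of overwriting
            merge_dicts(existing[key], value)
        else:
            # Otherwise the new value replaces (or adds) the key
            existing[key] = value
    return existing


original = {"key1": {"key2": "original"}, "key4": "original"}
update = {"key1": {"key3": "new"}, "key4": "updated"}

print(merge_dicts(original, update))
# {'key1': {'key2': 'original', 'key3': 'new'}, 'key4': 'updated'}
```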

Efficiently map a function onto a Pandas Series

Avoid repeating database lookups or expensive computations when applying a function to a Pandas Series by using the pewtils.cached_series_mapper() function, which caches the results for each value in the series as it iterates.

import pandas as pd
from pewtils import cached_series_mapper

values = ["value"]*10
def my_function(x):
    print(x)
    return x

df = pd.DataFrame(values, columns=['column'])
>>> mapped = df['column'].map(my_function)
value
value
value
value
value
value
value
value
value
value
>>> mapped = cached_series_mapper(df['column'], my_function)
value
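
The caching idea can be illustrated without pandas. In the sketch below, map_with_cache is a hypothetical helper that works on a plain list and assumes the values are hashable; it shows the technique only and is not the cached_series_mapper implementation.

```python
def map_with_cache(values, function):
    """Hypothetical helper: apply `function` once per distinct value and reuse the result."""
    cache = {}
    results = []
    for value in values:
        if value not in cache:
            # First time this value is seen: do the expensive work
            cache[value] = function(value)
        results.append(cache[value])
    return results


calls = []


def my_function(x):
    calls.append(x)  # record each real invocation
    return x.upper()


mapped = map_with_cache(["value"] * 10, my_function)
print(len(mapped), len(calls))  # 10 1
```

Ten results come back, but the function body runs only once because every input is identical.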

Read and write data in a variety of formats

The pewtils.io.FileHandler class lets you easily read and write files in a variety of formats with minimal code, and it has support for Amazon S3 too:

from pewtils.io import FileHandler

>>> h = FileHandler("./", use_s3=False)  # current local folder
>>> df = h.read("my_csv", format="csv")
# Do something and save to Excel
>>> h.write("my_new_csv", df, format="xlsx")

>>> my_data = [{"key": "value"}]
>>> h.write("my_data", my_data, format="json")

>>> my_data = ["a", "python", "list"]
>>> h.write("my_data", my_data, format="pkl")

# To read/write to an S3 bucket, simply pass your credentials
>>> h = FileHandler("/my_folder", use_s3=True, aws_access="12345", aws_secret="67890", bucket="my-bucket")
# The FileHandler can also detect your tokens directly from your environment
# Just set the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and S3_BUCKET

Quickly extract text from raw HTML

It’s not always perfect, but the pewtils.http.strip_html() function can often extract most of the valuable text from raw HTML documents, which is useful for quick exploratory analysis after scraping a batch of webpages.

from pewtils.http import strip_html

>>> my_html = "<html><head>Header text</head><body>Body text</body></html>"
>>> strip_html(my_html)
'Header text\n\nBody text'
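
For a sense of what basic text extraction involves, here is a minimal sketch built on the standard library's html.parser module. The names TextCollector and extract_text are hypothetical, and this is not how strip_html works internally; in particular, this sketch would also keep the contents of script and style tags.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collect the non-empty text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    # Separate text nodes with blank lines
    return "\n\n".join(parser.chunks)


my_html = "<html><head>Header text</head><body>Body text</body></html>"
print(repr(extract_text(my_html)))  # 'Header text\n\nBody text'
```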

Standardize URLs and extract domains

The pewtils.http.canonical_link() function is our best attempt at resolving URLs to their true form: it follows shortened URLs, removes unnecessary GET parameters, and tries to avoid returning incorrect 404 pages in favor of the most informative last-known version of a URL. Once links have been standardized, you can also use the pewtils.http.extract_domain_from_url() function to pull out domains and subdomains.
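
Two of the simpler steps described above, dropping GET parameters and reading the host from a URL, can be approximated with urllib.parse from the standard library. This sketch is not the pewtils implementation: drop_query is a hypothetical helper that removes every query parameter rather than only the unnecessary ones, nothing here follows redirects, and splitting a host into domain and subdomain needs knowledge of public suffixes (such as co.uk) that urllib does not provide.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_query(url):
    """Hypothetical helper: return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(drop_query("https://pewrsr.ch/2lxB0EX?unnecessary_param=1"))
# https://pewrsr.ch/2lxB0EX

# The full host, including any subdomain
print(urlsplit("http://forums.bbc.co.uk").hostname)
# forums.bbc.co.uk
```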

from pewtils.http import canonical_link

>>> canonical_link("https://pewrsr.ch/2lxB0EX?unnecessary_param=1")
"https://www.pewresearch.org/interactives/how-does-a-computer-see-gender/"

from pewtils.http import extract_domain_from_url

>>> extract_domain_from_url("http://forums.bbc.co.uk", include_subdomain=False)
"bbc.co.uk"
>>> extract_domain_from_url("http://forums.bbc.co.uk", include_subdomain=True)
"forums.bbc.co.uk"