Examples
Check for null values
You can use the pewtils.is_null() and pewtils.is_not_null() functions to quickly check for a variety of common null values.
from pewtils import is_null
from pewtils import is_not_null
import numpy as np
>>> is_null(None)
True
>>> is_null("None")
True
>>> is_null("nan")
True
>>> is_null("")
True
>>> is_null(" ")
True
>>> is_null("NaN")
True
>>> is_null("none")
True
>>> is_null("NONE")
True
>>> is_null("n/a")
True
>>> is_null("N/A")
True
>>> is_null(np.nan)
True
>>> is_null("-9", custom_nulls=["-9"])
True
>>> is_null("Hello World")
False
>>> is_null(0.0)
False
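To illustrate the kind of check these helpers perform, here is a minimal stand-in that uses only the standard library. It is a sketch, not the pewtils implementation: the name `looks_null`, the set of recognized strings, and the whitespace handling are all assumptions chosen to mirror the examples above.

```python
import math

# Hypothetical set of null-like strings, chosen to match the examples above;
# the list that pewtils actually uses may differ.
NULL_STRINGS = {"", "none", "nan", "n/a"}


def looks_null(value, custom_nulls=None):
    """Return True if `value` resembles a common null value (illustrative only)."""
    if value is None:
        return True
    if custom_nulls and value in custom_nulls:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str):
        # Ignore case and surrounding whitespace before comparing.
        return value.strip().lower() in NULL_STRINGS
    return False


# Filter null-like entries out of a list of raw responses.
responses = ["yes", "N/A", None, " ", "no", float("nan"), "-9"]
clean = [r for r in responses if not looks_null(r, custom_nulls=["-9"])]
print(clean)  # ['yes', 'no']
```

The same filtering pattern should work with the real functions by swapping `not looks_null(...)` for `is_not_null(...)`, assuming it accepts the same arguments as `is_null`.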
Collapse documents into context-sensitive hashes
When working with large documents, you can use the pewtils.get_hash() function to convert them into a variety of hashed representations. By default, this function uses SSDEEP, which produces context-sensitive hashes that can be useful for finding similar documents.
from pewtils import get_hash
>>> doc1 = "This is a document."
>>> doc2 = "This is a document. But this one is longer."
>>> get_hash(doc1)
'3:hMCE+RL:hu+t'
>>> get_hash(doc2)
'3:hMCE+RGreCQHCAb:hu+0rLkb'
# Notice that both hashes start the same way, corresponding to their overlapping text.
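The overlap noted above can be checked programmatically. The sketch below uses only the standard library and the two hash strings shown in the example; `shared_prefixes` is a hypothetical helper, and comparing prefixes is just a rough visual aid, not the similarity score that SSDEEP itself computes.

```python
import os

hash1 = "3:hMCE+RL:hu+t"
hash2 = "3:hMCE+RGreCQHCAb:hu+0rLkb"


def shared_prefixes(a, b):
    """Return the common prefix of each chunk in two ssdeep-style hashes."""
    # An ssdeep hash has the form "blocksize:chunk:double_chunk".
    _, chunk_a, double_a = a.split(":")
    _, chunk_b, double_b = b.split(":")
    return (
        os.path.commonprefix([chunk_a, chunk_b]),
        os.path.commonprefix([double_a, double_b]),
    )


print(shared_prefixes(hash1, hash2))  # ('hMCE+R', 'hu+')
```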
Flatten nested lists
Easily flatten lists of lists:
from pewtils import flatten_list
>>> nested_lists = [[1, 2, 3], [4, 5, 6]]
>>> flatten_list(nested_lists)
[1, 2, 3, 4, 5, 6]
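For comparison, one level of nesting can also be flattened with the standard library alone. This is an equivalent for the example above, not a description of how flatten_list is implemented.

```python
from itertools import chain

nested_lists = [[1, 2, 3], [4, 5, 6]]

# chain.from_iterable walks each inner list in order and yields its items.
flat = list(chain.from_iterable(nested_lists))
print(flat)  # [1, 2, 3, 4, 5, 6]
```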
Recursively update dictionaries and object attributes
Map a dictionary or object onto another version of itself to update overlapping attributes:
from pewtils import recursive_update
class TestObject(object):
    def __init__(self, value):
        self.value = value
        self.dict = {"obj_key": "original"}

    def __repr__(self):
        return "TestObject(value='{}', dict={})".format(self.value, self.dict)

original = {
    "object": TestObject("original"),
    "key1": {"key2": "original"}
}
update = {
    "object": {"value": "updated", "dict": {"obj_key": "updated"}},
    "key1": {"key3": "new"}
}
>>> recursive_update(original, update)
{'object': TestObject(value='updated', dict={'obj_key': 'updated'}),
'key1': {'key2': 'original', 'key3': 'new'}}
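The dictionary half of this behavior can be sketched in a few lines. The helper below, `merge_dicts`, is hypothetical and handles plain dictionaries only (no object attributes). It returns a new dictionary, which is a design choice of this sketch; whether recursive_update copies or modifies its input is not stated here.

```python
import copy


def merge_dicts(base, updates):
    """Recursively merge `updates` into a copy of `base` (dictionaries only)."""
    result = copy.deepcopy(base)
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Both sides are dictionaries: descend instead of overwriting.
            result[key] = merge_dicts(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


original = {"key1": {"key2": "original"}, "key4": 1}
update = {"key1": {"key3": "new"}, "key4": 2}
merged = merge_dicts(original, update)
print(merged)  # {'key1': {'key2': 'original', 'key3': 'new'}, 'key4': 2}
```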
Efficiently map a function onto a Pandas Series
Avoid repeating database lookups or expensive computations when applying a function to a Pandas Series by using the pewtils.cached_series_mapper() function, which caches the result for each value in the series as it iterates.
import pandas as pd
from pewtils import cached_series_mapper
values = ["value"]*10
def my_function(x):
    print(x)
    return x
df = pd.DataFrame(values, columns=['column'])
>>> mapped = df['column'].map(my_function)
value
value
value
value
value
value
value
value
value
value
>>> mapped = cached_series_mapper(df['column'], my_function)
value
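The caching idea itself is simple enough to sketch without Pandas. The version below works on a plain list and assumes the values are hashable; `cached_map` and `slow_lookup` are hypothetical names used for illustration and do not reflect how cached_series_mapper is written.

```python
def cached_map(values, function):
    """Apply `function` to each value, computing each distinct value only once."""
    cache = {}
    results = []
    for value in values:
        if value not in cache:
            cache[value] = function(value)
        results.append(cache[value])
    return results


calls = []


def slow_lookup(x):
    calls.append(x)  # record every real invocation
    return x.upper()


mapped = cached_map(["a", "b", "a", "a", "b"], slow_lookup)
print(mapped)  # ['A', 'B', 'A', 'A', 'B']
print(calls)   # ['a', 'b']
```

Five inputs produce only two real calls, which is where the savings come from when the function is expensive and the data contains many repeated values.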
Read and write data in a variety of formats
The pewtils.io.FileHandler class lets you read and write files in a variety of formats with minimal code, and it supports Amazon S3 as well:
from pewtils.io import FileHandler
>>> h = FileHandler("./", use_s3=False) # current local folder
>>> df = h.read("my_csv", format="csv")
# Do something and save to Excel
>>> h.write("my_new_csv", df, format="xlsx")
>>> my_data = [{"key": "value"}]
>>> h.write("my_data", my_data, format="json")
>>> my_data = ["a", "python", "list"]
>>> h.write("my_data", my_data, format="pkl")
# To read/write to an S3 bucket, simply pass your credentials
>>> h = FileHandler("/my_folder", use_s3=True, aws_access="12345", aws_secret="67890", bucket="my-bucket")
# The FileHandler can also detect your tokens directly from your environment
# Just set the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and S3_BUCKET
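Before relying on environment-based configuration, it can help to confirm that all three variables are actually set. The helper below is a hypothetical convenience that is not part of pewtils; it only inspects a mapping such as `os.environ` and makes no AWS calls.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "S3_BUCKET")


def missing_s3_settings(environ=None):
    """Return the names of the required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Example with a made-up, partially configured environment.
example_env = {"AWS_ACCESS_KEY_ID": "12345", "S3_BUCKET": "my-bucket"}
print(missing_s3_settings(example_env))  # ['AWS_SECRET_ACCESS_KEY']
```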
Quickly extract text from raw HTML
It’s not always perfect, but the pewtils.http.strip_html() function can often extract most of the valuable text from a raw HTML document, which is useful for quick exploratory analysis after scraping a batch of webpages.
from pewtils.http import strip_html
>>> my_html = "<html><head>Header text</head><body>Body text</body></html>"
>>> strip_html(my_html)
'Header text\n\nBody text'
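For a sense of what basic text extraction involves, here is a minimal sketch built on the standard library's `html.parser`. It is not how strip_html works internally, and its output format (one chunk per line) differs from the result shown above; `TextExtractor` and `extract_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return the text chunks found in `html`, joined by newlines."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


my_html = "<html><head>Header text</head><body>Body text</body></html>"
print(extract_text(my_html))
```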
Standardize URLs and extract domains
The pewtils.http.canonical_link() function is our best attempt at resolving URLs to their true form: it follows shortened URLs, removes unnecessary GET parameters, and tries to avoid returning incorrect 404 pages in favor of the most informative last-known version of a URL. Once links have been standardized, you can also use the pewtils.http.extract_domain_from_url() function to pull out domains and subdomains.
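As a rough, offline illustration of two of the steps described above, the sketch below uses `urllib.parse` from the standard library. It is deliberately much simpler than the pewtils functions: it makes no network requests (so short links are not followed), it drops every query parameter rather than only the unnecessary ones, and it does not know about multi-part suffixes such as `co.uk`. The helper names `drop_query` and `hostname` are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_query(url):
    """Remove the query string and fragment from a URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def hostname(url):
    """Return the lowercased host name, including any subdomain."""
    return urlsplit(url).hostname


print(drop_query("https://pewrsr.ch/2lxB0EX?unnecessary_param=1"))
# https://pewrsr.ch/2lxB0EX
print(hostname("http://forums.bbc.co.uk"))
# forums.bbc.co.uk
```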
from pewtils.http import canonical_link
>>> canonical_link("https://pewrsr.ch/2lxB0EX?unnecessary_param=1")
"https://www.pewresearch.org/interactives/how-does-a-computer-see-gender/"
from pewtils.http import extract_domain_from_url
>>> extract_domain_from_url("http://forums.bbc.co.uk", include_subdomain=False)
"bbc.co.uk"
>>> extract_domain_from_url("http://forums.bbc.co.uk", include_subdomain=True)
"forums.bbc.co.uk"