Core Functions

The main Pewtils module contains a variety of generally useful functions that make our researchers' lives easier. For those still working in Python 2.x, the pewtils.decode_text() function can help alleviate headaches related to text encodings. The pewtils.is_null() and pewtils.is_not_null() functions provide an easy way to deal with the wide variety of possible null values that exist in Python (and the broader research universe) by using a best-guess approach. When working with dictionaries or JSON records that need to be updated, pewtils.recursive_update() makes it easy to map one version of an object onto another. While we strive to write efficient code that covers every possible use case, there are certainly edge cases that we haven't encountered, and other existing Python libraries may very well provide many of these same features. This collection simply consists of functions we find ourselves using again and again, and we hope that Pewtils may help expand your daily toolkit in some way as well.

Classes:

classproperty(fget)

This decorator allows you to define functions on a class that are accessible directly from the class itself (rather than an instance of the class).

PrintExecutionTime([label, stdout])

Simple context manager to print the time it takes for a block of code to execute.

Functions:

is_not_null(val[, empty_lists_are_null, ...])

Checks whether the value is not null, accounting for a variety of potential string representations of null, etc.

is_null(val[, empty_lists_are_null, ...])

Returns the opposite of the outcome of pewtils.is_not_null().

decode_text(text[, throw_loud_fail])

Attempts to decode and re-encode text as ASCII.

get_hash(text[, hash_function])

Generates hashed text using one of several available hashing functions.

zipcode_num_to_string(zipcode)

Attempts to standardize a string/integer/float that contains a U.S. zipcode.

concat_text(*args)

A helper function for concatenating text values.

vector_concat_text(*args)

Takes a list of equal-length lists and returns a single list with the rows concatenated by spaces.

scale_range(old_val, old_min, old_max, ...)

Scales a value from one range to another.

new_random_number([attempt, minimum, maximum])

Returns a random number from a range whose upper bound increases exponentially with the number of attempts.

chunk_list(seq, size)

Takes a sequence and groups values into smaller lists based on the specified size.

flatten_list(l)

Takes a list of lists and flattens it into a single list.

scan_dictionary(search_dict, field)

Takes a dictionary with nested lists and dictionaries, and searches recursively for a specific key.

recursive_update(existing, new)

Takes an object and a dictionary representation of attributes and values, and recursively traverses through the new values and updates the object.

cached_series_mapper(series, function)

Applies a function to all of the unique values in a pandas.Series to avoid repeating the operation on duplicate values.

multiprocess_group_apply(grp, func, *args, ...)

Apply arbitrary functions to groups or slices of a Pandas DataFrame using multiprocessing, to efficiently map or aggregate data.

extract_json_from_folder(folder_path[, ...])

Takes a folder path and traverses it, looking for JSON files.

extract_attributes_from_folder_modules(...)

Takes a folder path and traverses it, looking for Python files that contain an attribute (i.e., class, function, etc.) with a given name.

class classproperty(fget)[source]

This decorator allows you to define functions on a class that are accessible directly from the class itself (rather than an instance of the class). It allows you to access classproperty attributes directly, such as obj.property, rather than as a function on a class instance (like obj = Obj(); obj.property()).

Borrowed from a StackOverflow post.

Usage:

from pewtils import classproperty

class MyClass(object):
    x = 4

    @classproperty
    def number(cls):
        return cls.x

>>> MyClass().number
4
>>> MyClass.number
4
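The descriptor recipe behind this decorator (the StackOverflow idiom the docstring refers to) typically looks like the following sketch; the packaged implementation may differ in details:

```python
class classproperty(object):
    """Descriptor that exposes a method as a read-only property on the class itself."""

    def __init__(self, fget):
        self.fget = fget

    def __get__(self, owner_self, owner_cls):
        # Called for both MyClass.number and MyClass().number; either way,
        # the decorated function receives the class, not the instance
        return self.fget(owner_cls)
```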
is_not_null(val, empty_lists_are_null=False, custom_nulls=None)[source]

Checks whether the value is not null, accounting for a variety of potential string representations of null. The following values are always considered null: numpy.nan, None, "None", "nan", "", " ", "NaN", "none", "n/a", "NONE", "N/A"

Parameters
  • val – The value to check

  • empty_lists_are_null (bool) – Whether or not an empty list or pandas.DataFrame should be considered null (default=False)

  • custom_nulls (list) – an optional list of additional values to consider as null

Returns

True if the value is not null

Return type

bool

Usage:

from pewtils import is_not_null

>>> text = "Hello"
>>> is_not_null(text)
True
is_null(val, empty_lists_are_null=False, custom_nulls=None)[source]

Returns the opposite of the outcome of pewtils.is_not_null(). The following values are always considered null: numpy.nan, None, "None", "nan", "", " ", "NaN", "none", "n/a", "NONE", "N/A"

Parameters
  • val – The value to check

  • empty_lists_are_null (bool) – Whether or not an empty list or pandas.DataFrame should be considered null (default=False)

  • custom_nulls (list) – an optional list of additional values to consider as null

Returns

True if the value is null

Return type

bool

Usage:

from pewtils import is_null

>>> empty_list = []
>>> is_null(empty_list, empty_lists_are_null=True)
True
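The core of the null-checking logic can be sketched with the stdlib alone. This simplified stand-in (not the pewtils implementation) covers None, NaN, the string representations listed above, and custom nulls; the empty_lists_are_null option is omitted:

```python
def is_null_sketch(val, custom_nulls=None):
    # Simplified stand-in for pewtils.is_null(); checks a value against
    # common null representations
    null_strings = {"None", "nan", "", " ", "NaN", "none", "n/a", "NONE", "N/A"}
    if val is None:
        return True
    if isinstance(val, float) and val != val:
        # NaN is the only float that compares unequal to itself
        return True
    if custom_nulls and val in custom_nulls:
        return True
    return isinstance(val, str) and val in null_strings
```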
decode_text(text, throw_loud_fail=False)[source]

Attempts to decode and re-encode text as ASCII. In the case of failure, it will attempt to detect the string’s encoding, decode it, and convert it to ASCII. If both these attempts fail, it will attempt to use the unidecode package to transliterate into ASCII. And finally, if that doesn’t work, it will forcibly encode the text as ASCII and ignore non-ASCII characters.

Warning

This function is potentially destructive to source input and should be used with some care. Input text that cannot be decoded may be stripped out, or replaced with a similar ASCII character or other placeholder, potentially resulting in an empty string.

Parameters
  • text (str) – The text to process

  • throw_loud_fail (bool) – If True, exceptions will be raised, otherwise the function will fail silently and return an empty string (default False)

Returns

Decoded text, or empty string

Return type

str

Note

In Python 3, the decode/encode attempts will fail by default, and the unidecode package will be used to transliterate. In general, you shouldn’t need to use this function in Python 3, but it shouldn’t hurt anything if you do.
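The last-resort step of the cascade can be sketched with the stdlib alone. This is a simplified stand-in, not the actual pewtils implementation, which also attempts encoding detection and unidecode transliteration before falling back to forcible encoding:

```python
def force_ascii(text):
    """Simplified stand-in for the final fallback in pewtils.decode_text()."""
    try:
        # Happy path: the text is already representable as ASCII
        return text.encode("ascii").decode("ascii")
    except UnicodeError:
        # Destructive fallback: silently drop any non-ASCII characters
        return text.encode("ascii", errors="ignore").decode("ascii")
```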

get_hash(text, hash_function='ssdeep')[source]

Generates hashed text using one of several available hashing functions.

Parameters
  • text (str) – The string to hash

  • hash_function (str) – The specific algorithm to use; options are 'nilsimsa', 'md5', and 'ssdeep' (default)

Returns

A hashed representation of the provided string

Return type

str

Note

The string will be passed through pewtils.decode_text() and the returned value will be used instead of the original value if it runs successfully, in order to ensure consistent hashing in both Python 2 and 3. By default the function uses the ssdeep algorithm, which generates context-sensitive hashes that are useful for computing document similarities at scale.

Note

Using hash_function='ssdeep' requires the ssdeep library, which is not installed by default because it requires the installation of additional system libraries on certain operating systems. For help installing ssdeep, refer to the pewtils documentation installation section, which provides OS-specific instructions.

Usage:

from pewtils import get_hash

>>> text = 'test_string'
>>> get_hash(text)
'3:HI2:Hl'
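If you only need a deterministic (non-fuzzy) hash and want to avoid the ssdeep dependency, the stdlib's hashlib covers the same ground as the 'md5' option in spirit; this is an illustrative sketch, not the pewtils implementation:

```python
import hashlib


def md5_hash(text):
    # Deterministic 32-character hexadecimal digest of the UTF-8 bytes
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```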
zipcode_num_to_string(zipcode)[source]

Attempts to standardize a string/integer/float that contains a U.S. zipcode. Front-pads with zeroes and uses the zipcodes library to ensure that the zipcode is real. If the zipcode doesn’t validate successfully, None will be returned.

Parameters

zipcode (str or float or int) – An object that contains a sequence of digits (string, integer, or float)

Returns

A 5-digit string, or None

Return type

str or NoneType

Usage:

from pewtils import zipcode_num_to_string

>>> zipcode_number = 6463
>>> zipcode_num_to_string(zipcode_number)
'06463'
>>> not_zipcode_number = 345678
>>> zipcode_num_to_string(not_zipcode_number)
>>>
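The front-padding itself is a one-liner; this sketch intentionally omits the validation step that pewtils performs with the zipcodes library, so it will happily pad values that are not real zipcodes:

```python
def pad_zipcode(zipcode):
    # Coerce strings/floats/ints to an integer, then left-pad to five digits;
    # validation against a real zipcode database is omitted here
    return str(int(zipcode)).zfill(5)
```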
concat_text(*args)[source]

A helper function for concatenating text values. Text values are passed through pewtils.decode_text() before concatenation.

Parameters

args (list) – A list of text values that will be returned as a single space-separated string

Returns

A single string of the values concatenated by spaces

Return type

str

Usage:

from pewtils import concat_text

>>> text_list = ['Hello', 'World', '!']
>>> concat_text(*text_list)
'Hello World !'
vector_concat_text(*args)[source]

Takes a list of equal-length lists and returns a single list with the rows concatenated by spaces. Useful for merging multiple columns of text in Pandas.

Parameters

args – A list of lists or pandas.Series objects that contain text values

Returns

A single list or pandas.Series with all of the text values for each row concatenated

Usage with lists:

from pewtils import vector_concat_text

>>> text_lists = [["one", "two", "three"], ["a", "b", "c"]]
>>> vector_concat_text(*text_lists)
['one a', 'two b', 'three c']

Usage with Pandas:

import pandas as pd
from pewtils import vector_concat_text

df = pd.DataFrame([
    {"text1": "one", "text2": "a"},
    {"text1": "two", "text2": "b"},
    {"text1": "three", "text2": "c"}
])

>>> df['text'] = vector_concat_text(df['text1'], df['text2'])
>>> df['text']
0      one a
1      two b
2    three c
Name: text, dtype: object
scale_range(old_val, old_min, old_max, new_min, new_max)[source]

Scales a value from one range to another. Useful for comparing values from different scales, for example.

Parameters
  • old_val (int or float) – The value to convert

  • old_min (int or float) – The minimum of the old range

  • old_max (int or float) – The maximum of the old range

  • new_min (int or float) – The minimum of the new range

  • new_max (int or float) – The maximum of the new range

Returns

Value equivalent from the new scale

Return type

float

Usage:

from pewtils import scale_range

>>> old_value = 5
>>> scale_range(old_value, 0, 10, 0, 20)
10.0
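The underlying arithmetic is an ordinary linear rescaling, which can be sketched as:

```python
def scale_range(old_val, old_min, old_max, new_min, new_max):
    # Express the value as a fraction of the old range, then project
    # that fraction onto the new range
    fraction = (old_val - old_min) / (old_max - old_min)
    return new_min + fraction * (new_max - new_min)
```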
new_random_number(attempt=1, minimum=1.0, maximum=10)[source]

Returns a random number from a range whose upper bound increases exponentially with the number of attempts. The upper bound is capped by the maximum parameter (default 10) but is otherwise determined by minimum * 2 ** attempt.

In effect, this means that when attempt is 1, the number returned will fall between the minimum and twice the minimum's value. As you increase attempt, the possible range of returned values expands exponentially until it hits the maximum ceiling.
Parameters
  • attempt (int) – Increasing attempt expands the upper bound of the range from which the random number is drawn

  • minimum (int or float) – The minimum allowed value that can be returned; must be greater than zero.

  • maximum (int or float) – The maximum allowed value that can be returned; must be greater than minimum.

Returns

A random number drawn uniformly from across the range determined by the provided arguments.

Return type

float

Note

One useful application of this function is rate limiting: a script can pause in between requests at a reasonably fast pace, but then moderate itself and pause for longer periods if it begins encountering errors, simply by increasing the attempt variable (hence its name).

Usage:

from pewtils import new_random_number

>>> new_random_number(attempt=1)
1.9835581813820642
>>> new_random_number(attempt=2)
3.1022350739064
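The rate-limiting pattern described in the note above might look like the following sketch. Both fetch_with_backoff and the inner new_random_number stand-in are illustrative names mirroring the documented behavior, not code from pewtils:

```python
import random
import time


def new_random_number(attempt=1, minimum=1.0, maximum=10):
    # Stand-in mirroring the documented behavior: a uniform draw whose
    # upper bound doubles with each attempt, capped at `maximum`
    return random.uniform(minimum, min(minimum * 2 ** attempt, maximum))


def fetch_with_backoff(make_request, max_attempts=5, minimum=1.0, maximum=10):
    # Retry a failing request, pausing for progressively longer
    # (randomized) intervals between attempts
    for attempt in range(1, max_attempts + 1):
        try:
            return make_request()
        except IOError:
            time.sleep(new_random_number(attempt=attempt, minimum=minimum, maximum=maximum))
    raise RuntimeError("all attempts failed")
```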
chunk_list(seq, size)[source]

Takes a sequence and groups values into smaller lists based on the specified size.

Parameters
  • seq (list or iterable) – List or a list-like iterable

  • size (int) – Desired size of each sublist

Returns

A list of lists

Return type

list

Usage:

from pewtils import chunk_list

>>> number_sequence = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> chunk_list(number_sequence, 3)
[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
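A minimal stdlib equivalent of the chunking behavior shown above:

```python
def chunk_list(seq, size):
    # Slice the sequence into consecutive sublists of at most `size` elements;
    # the final chunk may be shorter than `size`
    return [seq[i:i + size] for i in range(0, len(seq), size)]
```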
flatten_list(l)[source]

Takes a list of lists and flattens it into a single list. Nice shortcut to avoid having to deal with list comprehension.

Parameters

l (list) – A list of lists

Returns

A flattened list of all of the elements contained in the original list of lists

Return type

list

Usage:

from pewtils import flatten_list

>>> nested_lists = [[1, 2, 3], [4, 5, 6]]
>>> flatten_list(nested_lists)
[1, 2, 3, 4, 5, 6]
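The list comprehension this function saves you from writing is itself a one-liner:

```python
def flatten_list(nested):
    # Unpack each sublist, in order, into a single flat list
    return [item for sublist in nested for item in sublist]
```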
scan_dictionary(search_dict, field)[source]

Takes a dictionary with nested lists and dictionaries, and searches recursively for a specific key. Since keys can occur more than once, the function returns a list of all of the found values along with a list of equal length that specifies the nested key path to each value.

Parameters
  • search_dict (dict) – The dictionary to search

  • field (str) – The field to find

Returns

A tuple of the found values and file path-style strings representing their locations

Return type

tuple

Usage:

from pewtils import scan_dictionary

>>> test_dict = {"one": {"two": {"three": "four"}}}
>>> scan_dictionary(test_dict, "three")
(['four'], ['one/two/three/'])
>>> scan_dictionary(test_dict, "five")
([], [])
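The recursive traversal can be sketched as follows. This simplified version handles nested dictionaries and lists and reproduces the path-string convention shown above; the packaged function may cover additional cases:

```python
def scan_dictionary(search_dict, field, path=""):
    # Recursively walk nested dicts/lists, collecting every value stored
    # under `field` along with a file path-style string locating it
    values, paths = [], []
    if isinstance(search_dict, dict):
        for key, val in search_dict.items():
            if key == field:
                values.append(val)
                paths.append(path + key + "/")
            sub_values, sub_paths = scan_dictionary(val, field, path + key + "/")
            values.extend(sub_values)
            paths.extend(sub_paths)
    elif isinstance(search_dict, list):
        for item in search_dict:
            sub_values, sub_paths = scan_dictionary(item, field, path)
            values.extend(sub_values)
            paths.extend(sub_paths)
    return values, paths
```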
recursive_update(existing, new)[source]

Takes an object and a dictionary representation of attributes and values, and recursively traverses through the new values and updates the object.

It works regardless of whether the keys in the dictionary correspond to attribute names or dictionary keys, so you can use it to iterate through a nested hierarchy of objects and dictionaries and update whatever you like.
Parameters
  • existing (dict or object) – An object or dictionary

  • new (dict or object) – A dictionary where keys correspond to the names of keys in the existing dictionary or attributes on the existing object

Returns

A copy of the original object or dictionary, with the values updated based on the provided map

Return type

dict or object

Usage:

from pewtils import recursive_update

class TestObject(object):
    def __init__(self, value):
        self.value = value
        self.dict = {"obj_key": "original"}
    def __repr__(self):
        return("TestObject(value='{}', dict={})".format(self.value, self.dict))

original = {
    "object": TestObject("original"),
    "key1": {"key2": "original"}
}
update = {
    "object": {"value": "updated", "dict": {"obj_key": "updated"}},
    "key1": {"key3": "new"}
}

>>> recursive_update(original, update)
{'object': TestObject(value='updated', dict={'obj_key': 'updated'}),
 'key1': {'key2': 'original', 'key3': 'new'}}
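A simplified sketch of the traversal logic follows. Unlike the pewtils function, which returns a copy, this stand-in updates the existing object in place:

```python
def recursive_update(existing, new):
    # For each key in `new`, descend when both sides are containers
    # (dicts or objects with attributes); otherwise set the value directly,
    # either as a dictionary key or as an object attribute
    for key, value in new.items():
        if isinstance(existing, dict):
            current = existing.get(key)
        else:
            current = getattr(existing, key, None)
        nested = isinstance(value, dict) and (
            isinstance(current, dict) or hasattr(current, "__dict__")
        )
        if nested:
            recursive_update(current, value)
        elif isinstance(existing, dict):
            existing[key] = value
        else:
            setattr(existing, key, value)
    return existing
```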
cached_series_mapper(series, function)[source]

Applies a function to all of the unique values in a pandas.Series to avoid repeating the operation on duplicate values.

Great if you’re doing database lookups or something computationally intensive on a column that may contain repeating values, etc.
Parameters
  • series (pandas.Series) – A pandas.Series

  • function – A function to apply to values in the pandas.Series

Returns

The resulting pandas.Series

Return type

pandas.Series

Usage:

import pandas as pd
from pewtils import cached_series_mapper

values = ["value"]*10
def my_function(x):
    print(x)
    return x

df = pd.DataFrame(values, columns=['column'])
>>> mapped = df['column'].map(my_function)
value
value
value
value
value
value
value
value
value
value
>>> mapped = cached_series_mapper(df['column'], my_function)
value
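The caching idea generalizes beyond pandas; a stdlib-only sketch over a plain list (cached_mapper is an illustrative name, not a pewtils function) looks like this:

```python
def cached_mapper(values, function):
    # Call the function once per unique value, then look the results up
    # for every (possibly duplicated) element
    cache = {value: function(value) for value in set(values)}
    return [cache[value] for value in values]
```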
multiprocess_group_apply(grp, func, *args, **kwargs)[source]

Apply arbitrary functions to groups or slices of a Pandas DataFrame using multiprocessing, to efficiently map or aggregate data. Each group is processed in parallel, and the results are concatenated together after all processing has finished. If you pass a function that aggregates each group into a single value, you'll get back a DataFrame with one row for each group, as though you had performed an .agg operation. If you pass a function that returns a value for each row in the group, you'll get back a DataFrame in your original shape; in that case, you are simply using grouping to apply a row-level operation efficiently.

Parameters
  • grp (pandas.core.groupby.generic.DataFrameGroupBy) – A Pandas DataFrameGroupBy object

  • func (function) – A function that accepts a Pandas DataFrame representing a group from the original DataFrame

  • args – Arguments to be passed to the function

  • kwargs – Keyword arguments to be passed to the function

Returns

The resulting DataFrame

Return type

pandas.DataFrame

Usage:

import multiprocessing

import pandas as pd

from pewtils import multiprocess_group_apply

df = pd.DataFrame([
    {"group": 1, "value": "one two three"},
    {"group": 1, "value": "one two three four"},
    {"group": 2, "value": "one two"}
])

# For efficient aggregation

def get_length(grp):
    # Simple function that returns the number of rows in each group
    return len(grp)

>>> df.groupby("group").apply(lambda x: len(x))
1    2
2    1
dtype: int64
>>> multiprocess_group_apply(df.groupby("group"), get_length)
1    2
2    1
dtype: int64

# For efficient mapping

def get_value_length(grp):
    # Simple function that returns the word count of each row in the group
    return grp['value'].map(lambda x: len(x.split()))

>>> df['value'].map(lambda x: len(x.split()))
0    3
1    4
2    2
Name: value, dtype: int64
>>> multiprocess_group_apply(df.groupby("group"), get_value_length)
0    3
1    4
2    2
Name: value, dtype: int64

# If you just want to efficiently map a function to your DataFrame and you want to evenly split your
# DataFrame into groups, you could do the following:

df["group_col"] = (df.reset_index().index.values / (len(df) / multiprocessing.cpu_count())).astype(int)
df["mapped_value"] = multiprocess_group_apply(df.groupby("group_col"), get_value_length)
del df["group_col"]
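The underlying split-apply-combine pattern can be sketched without pandas at all. This stand-in uses a thread pool (multiprocessing.dummy) for portability, whereas the pewtils function uses actual worker processes and concatenates pandas objects; parallel_group_apply is an illustrative name:

```python
from multiprocessing.dummy import Pool  # thread-based stand-in for a process pool


def parallel_group_apply(groups, func):
    # `groups` is a list of (key, subsequence) pairs; apply `func` to each
    # subsequence in parallel, then reassemble the results keyed by group
    keys = [key for key, _ in groups]
    with Pool() as pool:
        results = pool.map(func, [grp for _, grp in groups])
    return dict(zip(keys, results))
```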
extract_json_from_folder(folder_path, include_subdirs=False, concat_subdir_names=False)[source]

Takes a folder path and traverses it, looking for JSON files. When it finds one, it adds it to a dictionary, with the key being the name of the file and the value being the JSON itself. This is useful if you store configurations or various metadata in a nested folder structure, which we do for things like content analysis codebooks.

Has options for recursively traversing a folder and for concatenating the subfolder names into the dictionary keys as prefixes.
Parameters
  • folder_path (str) – The path of the folder to scan

  • include_subdirs (bool) – Whether or not to recursively scan subfolders

  • concat_subdir_names (bool) – Whether or not to prefix the dictionary keys with the names of subfolders

Returns

A dictionary containing all of the extracted JSON files as values

Return type

dict

Usage:

# For example, let's say we have the following folder structure
# with various JSON codebooks scattered about:
#
# /codebooks
#     /logos
#         /antipathy.json
#     /atp_open_ends
#         /w29
#             /sources_of_meaning.json
#
# Here's what we'd get depending on the different parameters we use:

from pewtils import extract_json_from_folder
>>> extract_json_from_folder("codebooks", include_subdirs=False, concat_subdir_names=False)
{}
>>> extract_json_from_folder("codebooks", include_subdirs=True, concat_subdir_names=False)
{
    "logos": {"antipathy": "json would be here"},
    "atp_open_ends": {"w29": {"sources_of_meaning": "json would be here"}}
}
>>> extract_json_from_folder("codebooks", include_subdirs=True, concat_subdir_names=True)
{
    "logos_antipathy": "json would be here",
    "atp_open_ends_w29_sources_of_meaning": "json would be here"
}
extract_attributes_from_folder_modules(folder_path, attribute_name, include_subdirs=False, concat_subdir_names=False, current_subdirs=None)[source]

Takes a folder path and traverses it, looking for Python files that contain an attribute (i.e., class, function, etc.) with a given name. It extracts those attributes and returns a dictionary where the keys are the names of the files that contained the attributes, and the values are the attributes themselves.

This operates exactly the same as pewtils.extract_json_from_folder() except instead of reading JSON files and adding them as values in the dictionary that gets returned, this function will instead look for Python files that contain a function, class, method, or attribute with the name you provide in attribute_name and will load that attribute in as the values.

Parameters
  • folder_path (str) – The path of a folder/module to scan

  • attribute_name (str) – The name of the attribute (class, function, variable, etc.) to extract from files

  • include_subdirs (bool) – Whether or not to recursively scan subfolders

  • concat_subdir_names (bool) – Whether or not to prefix the dictionary keys with the names of subfolders

  • current_subdirs – Used to track location when recursively iterating a module (do not use)

Returns

A dictionary with all of the extracted attributes as values

Return type

dict

Note

If you use Python 2.7, you will need to add from __future__ import absolute_import to the top of any files that you want to scan and import using this function.

class PrintExecutionTime(label=None, stdout=None)[source]

Simple context manager to print the time it takes for a block of code to execute.

Parameters
  • label – A label to print alongside the execution time

  • stdout – a StringIO-like output stream (sys.stdout by default)

Usage:

import time

from pewtils import PrintExecutionTime

>>> with PrintExecutionTime(label="my function"): time.sleep(5)
my function: 5.004292011260986 seconds