Core Functions
The main Pewtils module contains a variety of generally useful functions that make our researchers' lives easier. For those still working in Python 2.x, the pewtils.decode_text() function can help alleviate headaches related to text encodings. The pewtils.is_null() and pewtils.is_not_null() functions provide an easy way to deal with the wide variety of possible null values that exist in Python (and the broader research universe) by using a best-guess approach. When working with dictionaries or JSON records that need to be updated, pewtils.recursive_update() makes it easy to map one version of an object onto another. While we strive to write efficient code that covers every possible use case, there are certainly edge cases we haven't encountered, and other existing Python libraries may well provide many of these same features. This collection simply consists of functions we find ourselves using again and again, and we hope that Pewtils may help expand your daily toolkit as well.
Classes:
- classproperty: This decorator allows you to define functions on a class that are accessible directly from the class itself (rather than an instance of the class).
- PrintExecutionTime: Simple context manager to print the time it takes for a block of code to execute.
Functions:
- is_not_null: Checks whether the value is not null, using a variety of potential string values, etc.
- is_null: Returns the opposite of the outcome of is_not_null.
- decode_text: Attempts to decode and re-encode text as ASCII.
- get_hash: Generates hashed text using one of several available hashing functions.
- zipcode_num_to_string: Attempts to standardize a string/integer/float that contains a U.S. zipcode.
- concat_text: A helper function for concatenating text values.
- vector_concat_text: Takes a list of equal-length lists and returns a single list with the rows concatenated by spaces.
- scale_range: Scales a value from one range to another.
- new_random_number: Returns a random number from a range whose upper bound increases exponentially with the number of attempts.
- chunk_list: Takes a sequence and groups values into smaller lists based on the specified size.
- flatten_list: Takes a list of lists and flattens it into a single list.
- scan_dictionary: Takes a dictionary with nested lists and dictionaries, and searches recursively for a specific key.
- recursive_update: Takes an object and a dictionary representation of attributes and values, and recursively traverses through the new values and updates the object.
- cached_series_mapper: Applies a function to all of the unique values in a pandas.Series to avoid repeating the operation on duplicate values.
- multiprocess_group_apply: Applies arbitrary functions to groups or slices of a Pandas DataFrame using multiprocessing, to efficiently map or aggregate data.
- extract_json_from_folder: Takes a folder path and traverses it, looking for JSON files.
- extract_attributes_from_folder_modules: Takes a folder path and traverses it, looking for Python files that contain an attribute (i.e., class, function, etc.) with a given name.
- class classproperty(fget)[source]
This decorator allows you to define functions on a class that are accessible directly from the class itself (rather than an instance of the class). It allows you to access classproperty attributes directly, such as obj.property, rather than as a function on a class instance (like obj = Obj(); obj.property()). Borrowed from a StackOverflow post.
Usage:
    from pewtils import classproperty

    class MyClass(object):
        x = 4

        @classproperty
        def number(cls):
            return cls.x

    >>> MyClass().number
    4
    >>> MyClass.number
    4
- is_not_null(val, empty_lists_are_null=False, custom_nulls=None)[source]
Checks whether the value is not null, using a variety of potential string values, etc. The following values are always considered null:
    numpy.nan, None, "None", "nan", "", " ", "NaN", "none", "n/a", "NONE", "N/A"
- Parameters
    val – The value to check
    empty_lists_are_null (bool) – Whether or not an empty list or pandas.DataFrame should be considered null (default=False)
    custom_nulls (list) – An optional list of additional values to consider as null
- Returns
    True if the value is not null
- Return type
    bool
Usage:
    from pewtils import is_not_null

    >>> text = "Hello"
    >>> is_not_null(text)
    True
- is_null(val, empty_lists_are_null=False, custom_nulls=None)[source]
Returns the opposite of the outcome of pewtils.is_not_null(). The following values are always considered null:
    numpy.nan, None, "None", "nan", "", " ", "NaN", "none", "n/a", "NONE", "N/A"
- Parameters
    val – The value to check
    empty_lists_are_null (bool) – Whether or not an empty list or pandas.DataFrame should be considered null (default=False)
    custom_nulls (list) – An optional list of additional values to consider as null
- Returns
    True if the value is null
- Return type
    bool
Usage:
    from pewtils import is_null

    >>> empty_list = []
    >>> is_null(empty_list, empty_lists_are_null=True)
    True
- decode_text(text, throw_loud_fail=False)[source]
Attempts to decode and re-encode text as ASCII. In the case of failure, it will attempt to detect the string's encoding, decode it, and convert it to ASCII. If both of these attempts fail, it will attempt to use the unidecode package to transliterate into ASCII. Finally, if that doesn't work, it will forcibly encode the text as ASCII and ignore non-ASCII characters.
Warning
This function is potentially destructive to source input and should be used with some care. Input text that cannot be decoded may be stripped out, or replaced with a similar ASCII character or other placeholder, potentially resulting in an empty string.
- Parameters
    text (str) – The text to process
    throw_loud_fail (bool) – If True, exceptions will be raised; otherwise the function will fail silently and return an empty string (default=False)
- Returns
    Decoded text, or an empty string
- Return type
    str
Note
In Python 3, the decode/encode attempts will fail by default, and the unidecode package will be used to transliterate. In general, you shouldn't need to use this function in Python 3, but it shouldn't hurt anything if you do.
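To make the first and last steps of that fallback chain concrete, here is a minimal standalone sketch using only the standard library. The helper name is ours, and the real function also tries encoding detection and unidecode transliteration in between these two steps:

```python
# A minimal sketch of the first and last steps of the fallback chain: try a
# clean ASCII round-trip, and as a last resort force ASCII and drop anything
# that won't fit. The real pewtils.decode_text() also attempts encoding
# detection and unidecode transliteration in between.
def decode_text_sketch(text, throw_loud_fail=False):
    try:
        # First attempt: a clean round-trip through ASCII
        return text.encode("ascii").decode("ascii")
    except (UnicodeEncodeError, UnicodeDecodeError):
        try:
            # Last resort: encode as ASCII and silently drop non-ASCII characters
            return text.encode("ascii", errors="ignore").decode("ascii")
        except Exception:
            if throw_loud_fail:
                raise
            return ""

print(decode_text_sketch("café"))  # non-ASCII characters are dropped -> "caf"
```

Note how the last-resort path illustrates the destructive behavior the warning above describes: the accented character is simply removed.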
- get_hash(text, hash_function='ssdeep')[source]
Generates hashed text using one of several available hashing functions.
- Parameters
    text (str) – The string to hash
    hash_function (str) – The specific algorithm to use; options are 'nilsimsa', 'md5', and 'ssdeep' (default)
- Returns
    A hashed representation of the provided string
- Return type
    str
Note
The string will be passed through pewtils.decode_text() and the returned value will be used instead of the original value if it runs successfully, in order to ensure consistent hashing in both Python 2 and 3. By default the function uses the ssdeep algorithm, which generates context-sensitive hashes that are useful for computing document similarities at scale.
Note
Using hash_function='ssdeep' requires the ssdeep library, which is not installed by default because it requires the installation of additional system libraries on certain operating systems. For help installing ssdeep, refer to the Pewtils documentation installation section, which provides OS-specific instructions.
Usage:
    from pewtils import get_hash

    >>> text = 'test_string'
    >>> get_hash(text)
    '3:HI2:Hl'
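If ssdeep isn't installed, the 'md5' option offers a dependency-free alternative. A minimal sketch of md5 hashing with the standard library (the helper name is ours; the real function additionally runs the input through decode_text() first):

```python
# A minimal sketch of md5 hashing with the standard library; pewtils'
# 'md5' option is backed by a digest like this one.
import hashlib

def md5_hash_sketch(text):
    # Encode to bytes first; hashlib operates on bytes, not str
    return hashlib.md5(text.encode("utf-8")).hexdigest()

print(md5_hash_sketch("test_string"))  # a 32-character hex digest
```

Unlike ssdeep, md5 is not context-sensitive: two near-identical documents produce completely unrelated digests, so it suits deduplication rather than similarity comparison.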
- zipcode_num_to_string(zipcode)[source]
Attempts to standardize a string/integer/float that contains a U.S. zipcode. Front-pads with zeroes and uses the zipcodes library to ensure that the zipcode is real. If the zipcode doesn't validate successfully, None will be returned.
- Parameters
    zipcode (str or float or int) – Object that contains a sequence of digits (string, integer, float)
- Returns
    A 5-digit string, or None
- Return type
    str or NoneType
Usage:
    from pewtils import zipcode_num_to_string

    >>> zipcode_number = 6463
    >>> zipcode_num_to_string(zipcode_number)
    '06463'
    >>> not_zipcode_number = 345678
    >>> zipcode_num_to_string(not_zipcode_number)
    >>>
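The front-padding step can be sketched in a few lines. This is our own simplified helper; it skips the real-zipcode validation that the actual function performs via the zipcodes library, and only rejects inputs with more than five digits:

```python
# A minimal sketch of the zero-padding step, without the validation against
# real U.S. zipcodes that pewtils performs via the zipcodes library.
def pad_zipcode_sketch(value):
    digits = str(int(float(value)))  # normalize ints, floats, and numeric strings
    if len(digits) > 5:
        return None  # too many digits to be a 5-digit zipcode
    return digits.zfill(5)  # front-pad with zeroes to 5 characters

print(pad_zipcode_sketch(6463))    # -> '06463'
print(pad_zipcode_sketch(345678))  # -> None
```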
- concat_text(*args)[source]
A helper function for concatenating text values. Text values are passed through pewtils.decode_text() before concatenation.
- Parameters
    args (list) – A list of text values that will be returned as a single space-separated string
- Returns
    A single string of the values concatenated by spaces
- Return type
    str
Usage:
    from pewtils import concat_text

    >>> text_list = ['Hello', 'World', '!']
    >>> concat_text(text_list)
    'Hello World !'
- vector_concat_text(*args)[source]
Takes a list of equal-length lists and returns a single list with the rows concatenated by spaces. Useful for merging multiple columns of text in Pandas.
- Parameters
    args – A list of lists or pandas.Series that contain text values
- Returns
    A single list or pandas.Series with all of the text values for each row concatenated
Usage with lists:
    from pewtils import vector_concat_text

    >>> text_lists = ["one", "two", "three"], ["a", "b", "c"]
    >>> vector_concat_text(text_lists)
    ['one a', 'two b', 'three c']
Usage with Pandas:
    import pandas as pd
    from pewtils import vector_concat_text

    df = pd.DataFrame([
        {"text1": "one", "text2": "a"},
        {"text1": "two", "text2": "b"},
        {"text1": "three", "text2": "c"}
    ])

    >>> df['text'] = vector_concat_text(df['text1'], df['text2'])
    >>> df['text']
    0      one a
    1      two b
    2    three c
    Name: text, dtype: object
- scale_range(old_val, old_min, old_max, new_min, new_max)[source]
Scales a value from one range to another. Useful for comparing values from different scales, for example.
- Parameters
    old_val (int or float) – The value to convert
    old_min (int or float) – The minimum of the old range
    old_max (int or float) – The maximum of the old range
    new_min (int or float) – The minimum of the new range
    new_max (int or float) – The maximum of the new range
- Returns
    The equivalent value on the new scale
- Return type
    float
Usage:
    from pewtils import scale_range

    >>> old_value = 5
    >>> scale_range(old_value, 0, 10, 0, 20)
    10.0
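The rescaling presumably follows the standard linear interpolation formula; a minimal sketch (the helper name is ours):

```python
# A minimal sketch of linear rescaling: find old_val's relative position in
# [old_min, old_max], then map that proportion onto [new_min, new_max].
def scale_range_sketch(old_val, old_min, old_max, new_min, new_max):
    proportion = (old_val - old_min) / float(old_max - old_min)
    return new_min + proportion * (new_max - new_min)

print(scale_range_sketch(5, 0, 10, 0, 20))  # -> 10.0
```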
- new_random_number(attempt=1, minimum=1.0, maximum=10)[source]
Returns a random number from a range whose upper bound increases exponentially with attempt. The upper bound is capped by the maximum parameter (default 10) but is otherwise determined by minimum * 2 ** attempt. In effect, this means that when attempt is 1, the number returned will be between the minimum and twice the minimum's value. As you increase attempt, the possible range of returned values expands exponentially until it hits the maximum ceiling.
- Parameters
    attempt (int) – Increasing attempt will expand the upper bound of the range from which the random number is drawn
    minimum (int or float) – The minimum allowed value that can be returned; must be greater than zero
    maximum (int or float) – The maximum allowed value that can be returned; must be greater than minimum
- Returns
    A random number drawn uniformly from the range determined by the provided arguments
- Return type
    float
Note
One useful application of this function is rate limiting: a script can pause in between requests at a reasonably fast pace, but then moderate itself and pause for longer periods if it begins encountering errors, simply by increasing the attempt variable (hence its name).
Usage:
    from pewtils import new_random_number

    >>> new_random_number(attempt=1)
    1.9835581813820642
    >>> new_random_number(attempt=2)
    3.1022350739064
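The exponential-backoff behavior described above can be sketched directly from the stated formula. This is our own illustrative helper, assuming a uniform draw between the minimum and the capped upper bound:

```python
# A minimal sketch of the backoff logic: the upper bound doubles with each
# attempt (minimum * 2 ** attempt) until it hits the maximum ceiling, and the
# result is drawn uniformly from within those bounds.
import random

def new_random_number_sketch(attempt=1, minimum=1.0, maximum=10):
    upper = min(maximum, minimum * 2 ** attempt)
    return random.uniform(minimum, upper)

# The draw always stays within [minimum, maximum], widening as attempt grows
for attempt in range(1, 6):
    value = new_random_number_sketch(attempt=attempt)
    assert 1.0 <= value <= 10
```

In a rate-limiting loop, you would increment attempt after each failed request and sleep for the returned number of seconds before retrying.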
- chunk_list(seq, size)[source]
Takes a sequence and groups values into smaller lists based on the specified size.
- Parameters
    seq (list or iterable) – List or a list-like iterable
    size (int) – Desired size of each sublist
- Returns
    A list of lists
- Return type
    list
Usage:
    from pewtils import chunk_list

    >>> number_sequence = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    >>> chunk_list(number_sequence, 3)
    [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
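The chunking behavior is equivalent to stepping through the sequence in slices; a minimal sketch (the helper name is ours):

```python
# A minimal sketch of chunking via list slicing: step through the sequence in
# strides of `size`; the final chunk may be shorter than the rest.
def chunk_list_sketch(seq, size):
    return [seq[i:i + size] for i in range(0, len(seq), size)]

print(chunk_list_sketch([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3))
# -> [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
```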
- flatten_list(l)[source]
Takes a list of lists and flattens it into a single list. Nice shortcut to avoid having to deal with list comprehension.
- Parameters
    l (list) – A list of lists
- Returns
    A flattened list of all of the elements contained in the original list of lists
- Return type
    list
Usage:
    from pewtils import flatten_list

    >>> nested_lists = [[1, 2, 3], [4, 5, 6]]
    >>> flatten_list(nested_lists)
    [1, 2, 3, 4, 5, 6]
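The list comprehension this helper saves you from writing looks like this; a minimal sketch (the helper name is ours):

```python
# A minimal sketch of the flattening shortcut: a nested comprehension that
# walks each sublist in order and collects its items into one flat list.
def flatten_list_sketch(l):
    return [item for sublist in l for item in sublist]

print(flatten_list_sketch([[1, 2, 3], [4, 5, 6]]))  # -> [1, 2, 3, 4, 5, 6]
```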
- scan_dictionary(search_dict, field)[source]
Takes a dictionary with nested lists and dictionaries, and searches recursively for a specific key. Since keys can occur more than once, the function returns a list of all of the found values along with a list of equal length that specifies the nested key path to each value.
- Parameters
    search_dict (dict) – The dictionary to search
    field (str) – The field to find
- Returns
    A tuple of the found values and file path-style strings representing their locations
- Return type
    tuple
Usage:
    from pewtils import scan_dictionary

    >>> test_dict = {"one": {"two": {"three": "four"}}}
    >>> scan_dictionary(test_dict, "three")
    (['four'], ['one/two/three/'])
    >>> scan_dictionary(test_dict, "five")
    ([], [])
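The recursive search can be sketched as a depth-first walk that tracks the key path as it descends. This is our own simplified version, mirroring the path format shown in the usage above:

```python
# A minimal sketch of the recursive key search: walk nested dicts (and dicts
# inside lists) depth-first, collecting matching values along with
# slash-delimited paths to each one.
def scan_dictionary_sketch(search_dict, field, path=""):
    values, paths = [], []
    for key, val in search_dict.items():
        current = "{}{}/".format(path, key)
        if key == field:
            values.append(val)
            paths.append(current)
        if isinstance(val, dict):
            # Recurse into nested dictionaries
            sub_values, sub_paths = scan_dictionary_sketch(val, field, current)
            values += sub_values
            paths += sub_paths
        elif isinstance(val, list):
            # Recurse into dictionaries nested inside lists
            for item in val:
                if isinstance(item, dict):
                    sub_values, sub_paths = scan_dictionary_sketch(item, field, current)
                    values += sub_values
                    paths += sub_paths
    return values, paths

print(scan_dictionary_sketch({"one": {"two": {"three": "four"}}}, "three"))
# -> (['four'], ['one/two/three/'])
```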
- recursive_update(existing, new)[source]
Takes an object and a dictionary representation of attributes and values, and recursively traverses through the new values and updates the object. The keys in the dictionary can correspond either to attribute names or to dictionary keys, so you can use this to iterate through a nested hierarchy of objects and dictionaries and update whatever you like.
- Parameters
    existing (dict or object) – An object or dictionary
    new (dict or object) – A dictionary where keys correspond to the names of keys in the existing dictionary or attributes on the existing object
- Returns
    A copy of the original object or dictionary, with the values updated based on the provided map
- Return type
    dict or object
Usage:
    from pewtils import recursive_update

    class TestObject(object):
        def __init__(self, value):
            self.value = value
            self.dict = {"obj_key": "original"}

        def __repr__(self):
            return "TestObject(value='{}', dict={})".format(self.value, self.dict)

    original = {
        "object": TestObject("original"),
        "key1": {"key2": "original"}
    }
    update = {
        "object": {"value": "updated", "dict": {"obj_key": "updated"}},
        "key1": {"key3": "new"}
    }

    >>> recursive_update(original, update)
    {'object': TestObject(value='updated', dict={'obj_key': 'updated'}), 'key1': {'key2': 'original', 'key3': 'new'}}
- cached_series_mapper(series, function)[source]
Applies a function to all of the unique values in a pandas.Series to avoid repeating the operation on duplicate values. Great if you're doing database lookups or something computationally intensive on a column that may contain repeating values, etc.
- Parameters
    series (pandas.Series) – A pandas.Series
    function – A function to apply to values in the pandas.Series
- Returns
    The resulting pandas.Series
- Return type
    pandas.Series
Usage:
    import pandas as pd
    from pewtils import cached_series_mapper

    values = ["value"] * 10

    def my_function(x):
        print(x)
        return x

    df = pd.DataFrame(values, columns=['column'])

    >>> mapped = df['column'].map(my_function)
    value
    value
    value
    value
    value
    value
    value
    value
    value
    value
    >>> mapped = cached_series_mapper(df['column'], my_function)
    value
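The caching idea is independent of pandas and can be sketched with a plain dict. This is our own simplified helper showing why the function above runs only once per unique value:

```python
# A minimal sketch of the caching idea: compute the function once per unique
# value, then reuse the cached result for every duplicate.
def cached_mapper_sketch(values, function):
    cache = {}
    results = []
    for value in values:
        if value not in cache:
            cache[value] = function(value)
        results.append(cache[value])
    return results

calls = []
def expensive(x):
    calls.append(x)  # track how many times the function actually runs
    return x.upper()

print(cached_mapper_sketch(["value"] * 10, expensive))  # ten 'VALUE' entries
print(len(calls))  # the function ran only once
```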
- multiprocess_group_apply(grp, func, *args, **kwargs)[source]
Applies arbitrary functions to groups or slices of a Pandas DataFrame using multiprocessing, to efficiently map or aggregate data. Each group gets processed in parallel, and the results are concatenated together after all processing has finished. If you pass a function that aggregates each group into a single value, you'll get back a DataFrame with one row for each group, as though you had performed a .agg function. If you pass a function that returns a value for each _row_ in the group, then you'll get back a DataFrame in your original shape. In this case, you would simply be using grouping to efficiently apply a row-level operation.
- Parameters
    grp (pandas.core.groupby.generic.DataFrameGroupBy) – A Pandas DataFrameGroupBy object
    func (function) – A function that accepts a Pandas DataFrame representing a group from the original DataFrame
    args – Arguments to be passed to the function
    kwargs – Keyword arguments to be passed to the function
- Returns
    The resulting DataFrame
- Return type
    pandas.DataFrame
Usage:
    import multiprocessing
    import pandas as pd
    from pewtils import multiprocess_group_apply

    df = pd.DataFrame([
        {"group_col": 1, "value": "one two three"},
        {"group_col": 1, "value": "one two three four"},
        {"group_col": 2, "value": "one two"}
    ])

    # For efficient aggregation
    def get_length(grp):
        # Simple function that returns the number of rows in each group
        return len(grp)

    >>> df.groupby("group_col").apply(lambda x: len(x))
    1    2
    2    1
    dtype: int64
    >>> multiprocess_group_apply(df.groupby("group_col"), get_length)
    1    2
    2    1
    dtype: int64

    # For efficient mapping
    def get_value_length(grp):
        # Simple function that returns the word count of each row in the group
        return grp['value'].map(lambda x: len(x.split()))

    >>> df['value'].map(lambda x: len(x.split()))
    0    3
    1    4
    2    2
    Name: value, dtype: int64
    >>> multiprocess_group_apply(df.groupby("group_col"), get_value_length)
    0    3
    1    4
    2    2
    Name: value, dtype: int64

    # If you just want to efficiently map a function to your DataFrame and you
    # want to evenly split your DataFrame into groups, you could do the following:
    df["group_col"] = (df.reset_index().index.values / (len(df) / multiprocessing.cpu_count())).astype(int)
    df["mapped_value"] = multiprocess_group_apply(df.groupby("group_col"), get_value_length)
    del df["group_col"]
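The split-apply-combine pattern underneath can be sketched without pandas at all. For portability this sketch uses a thread pool rather than process-based multiprocessing, and plain dicts in place of DataFrames; the helper name is ours:

```python
# A minimal sketch of the split-apply-combine pattern: run a function over
# each group in parallel, then stitch the per-group results back together.
# This sketch uses a thread pool for illustration; the real function uses
# process-based multiprocessing.
from multiprocessing.pool import ThreadPool

def group_apply_sketch(groups, func):
    # `groups` maps group keys to lists of rows
    with ThreadPool() as pool:
        results = pool.map(func, groups.values())
    # Combine per-group results, keyed by group
    return dict(zip(groups.keys(), results))

groups = {1: ["one two three", "one two three four"], 2: ["one two"]}
print(group_apply_sketch(groups, len))  # -> {1: 2, 2: 1}
```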
- extract_json_from_folder(folder_path, include_subdirs=False, concat_subdir_names=False)[source]
Takes a folder path and traverses it, looking for JSON files. When it finds one, it adds it to a dictionary, with the key being the name of the file and the value being the JSON itself. This is useful if you store configurations or various metadata in a nested folder structure, which we do for things like content analysis codebooks. Has options for recursively traversing a folder, and for optionally concatenating the subfolder names into the dictionary keys as prefixes.
- Parameters
    folder_path (str) – The path of the folder to scan
    include_subdirs (bool) – Whether or not to recursively scan subfolders
    concat_subdir_names (bool) – Whether or not to prefix the dictionary keys with the names of subfolders
- Returns
    A dictionary containing all of the extracted JSON files as values
- Return type
    dict
Usage:
    # For example, let's say we have the following folder structure
    # with various JSON codebooks scattered about:
    #
    # /codebooks
    #     /logos
    #         /antipathy.json
    #     /atp_open_ends
    #         /w29
    #             /sources_of_meaning.json
    #
    # Here's what we'd get depending on the different parameters we use:

    from pewtils import extract_json_from_folder

    >>> extract_json_from_folder("codebooks", include_subdirs=False, concat_subdir_names=False)
    {}

    >>> extract_json_from_folder("codebooks", include_subdirs=True, concat_subdir_names=False)
    {
        "logos": {"antipathy": "json would be here"},
        "atp_open_ends": {"w29": {"sources_of_meaning": "json would be here"}}
    }

    >>> extract_json_from_folder("codebooks", include_subdirs=True, concat_subdir_names=True)
    {
        "logos_antipathy": "json would be here",
        "atp_open_ends_w29_sources_of_meaning": "json would be here"
    }
- extract_attributes_from_folder_modules(folder_path, attribute_name, include_subdirs=False, concat_subdir_names=False, current_subdirs=None)[source]
Takes a folder path and traverses it, looking for Python files that contain an attribute (i.e., class, function, etc.) with a given name. It extracts those attributes and returns a dictionary where the keys are the names of the files that contained the attributes, and the values are the attributes themselves. This operates exactly the same as pewtils.extract_json_from_folder(), except instead of reading JSON files and adding them as values in the dictionary that gets returned, this function will look for Python files that contain a function, class, method, or attribute with the name you provide in attribute_name, and will load that attribute in as the values.
- Parameters
    folder_path (str) – The path of a folder/module to scan
    attribute_name (str) – The name of the attribute (class, function, variable, etc.) to extract from files
    include_subdirs (bool) – Whether or not to recursively scan subfolders
    concat_subdir_names (bool) – Whether or not to prefix the dictionary keys with the names of subfolders
    current_subdirs – Used to track location when recursively iterating a module (do not use)
- Returns
    A dictionary with all of the extracted attributes as values
- Return type
    dict
Note
If you use Python 2.7, you will need to add from __future__ import absolute_import to the top of files that you want to scan and import using this function.
- class PrintExecutionTime(label=None, stdout=None)[source]
Simple context manager to print the time it takes for a block of code to execute.
- Parameters
    label – A label to print alongside the execution time
    stdout – A StringIO-like output stream (sys.stdout by default)
Usage:
    import time
    from pewtils import PrintExecutionTime

    >>> with PrintExecutionTime(label="my function"):
    ...     time.sleep(5)
    my function: 5.004292011260986 seconds
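A timing context manager like this can be sketched in a few lines with the standard library. This is our own simplified version built on time.perf_counter; the real class's internals may differ:

```python
# A minimal sketch of a timing context manager: record a start time on entry,
# and print the elapsed time (to a configurable stream) on exit.
import time

class PrintExecutionTimeSketch:
    def __init__(self, label=None, stdout=None):
        self.label = label
        self.stdout = stdout  # None falls back to sys.stdout in print()

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        elapsed = time.perf_counter() - self.start
        print("{}: {} seconds".format(self.label or "execution time", elapsed),
              file=self.stdout)

with PrintExecutionTimeSketch(label="my function"):
    time.sleep(0.1)  # prints something like "my function: 0.10... seconds"
```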