Core Functions
The main Pewtils module contains a variety of generally useful functions that make our researchers' lives easier. For those still working in Python 2.x, the pewtils.decode_text() function can help alleviate headaches related to text encodings. The pewtils.is_null() and pewtils.is_not_null() functions provide an easy way to deal with the wide variety of possible null values that exist in Python (and the broader research universe) by using a best-guess approach. When working with dictionaries or JSON records that need to be updated, pewtils.recursive_update() makes it easy to map one version of an object onto another. While we strive to write efficient code that covers every possible use case, there are certainly edge cases we haven't encountered, and other existing Python libraries may well provide many of these same features. This collection simply consists of functions we find ourselves using again and again, and we hope that Pewtils may help expand your daily toolkit as well.
Classes:
- classproperty: This decorator allows you to define functions on a class that are accessible directly from the class itself (rather than an instance of the class).
- PrintExecutionTime: Simple context manager to print the time it takes for a block of code to execute.
Functions:
- is_not_null: Checks whether the value is not null, using a variety of potential string values, etc.
- is_null: Returns the opposite of the outcome of is_not_null.
- decode_text: Attempts to decode and re-encode text as ASCII.
- get_hash: Generates hashed text using one of several available hashing functions.
- zipcode_num_to_string: Attempts to standardize a string/integer/float that contains a U.S. zipcode.
- concat_text: A helper function for concatenating text values.
- vector_concat_text: Takes a list of equal-length lists and returns a single list with the rows concatenated by spaces.
- scale_range: Scales a value from one range to another.
- new_random_number: Returns a random number from a range whose upper bound increases exponentially with the number of attempts.
- chunk_list: Takes a sequence and groups values into smaller lists based on the specified size.
- flatten_list: Takes a list of lists and flattens it into a single list.
- scan_dictionary: Takes a dictionary with nested lists and dictionaries, and searches recursively for a specific key.
- recursive_update: Takes an object and a dictionary representation of attributes and values, and recursively traverses through the new values and updates the object.
- cached_series_mapper: Applies a function to all of the unique values in a pandas.Series to avoid repeating the operation on duplicate values.
- multiprocess_group_apply: Applies arbitrary functions to groups or slices of a Pandas DataFrame using multiprocessing, to efficiently map or aggregate data.
- extract_json_from_folder: Takes a folder path and traverses it, looking for JSON files.
- extract_attributes_from_folder_modules: Takes a folder path and traverses it, looking for Python files that contain an attribute (i.e., class, function, etc.) with a given name.
- class classproperty(fget)[source]
This decorator allows you to define functions on a class that are accessible directly from the class itself (rather than an instance of the class). It allows you to access classproperty attributes directly, such as obj.property, rather than as a function on a class instance (like obj = Obj(); obj.property()). Borrowed from a StackOverflow post.
Usage:
    from pewtils import classproperty

    class MyClass(object):
        x = 4

        @classproperty
        def number(cls):
            return cls.x

    >>> MyClass().number
    4
    >>> MyClass.number
    4
- is_not_null(val, empty_lists_are_null=False, custom_nulls=None)[source]
Checks whether the value is not null, using a variety of potential string values, etc. The following values are always considered null:
    numpy.nan, None, "None", "nan", "", " ", "NaN", "none", "n/a", "NONE", "N/A"
- Parameters
    val – The value to check
    empty_lists_are_null (bool) – Whether or not an empty list or pandas.DataFrame should be considered null (default=False)
    custom_nulls (list) – An optional list of additional values to consider as null
- Returns
    True if the value is not null
- Return type
    bool
Usage:
    from pewtils import is_not_null

    >>> text = "Hello"
    >>> is_not_null(text)
    True
- is_null(val, empty_lists_are_null=False, custom_nulls=None)[source]
Returns the opposite of the outcome of pewtils.is_not_null(). The following values are always considered null:
    numpy.nan, None, "None", "nan", "", " ", "NaN", "none", "n/a", "NONE", "N/A"
- Parameters
    val – The value to check
    empty_lists_are_null (bool) – Whether or not an empty list or pandas.DataFrame should be considered null (default=False)
    custom_nulls (list) – An optional list of additional values to consider as null
- Returns
    True if the value is null
- Return type
    bool
Usage:
    from pewtils import is_null

    >>> empty_list = []
    >>> is_null(empty_list, empty_lists_are_null=True)
    True
- decode_text(text, throw_loud_fail=False)[source]
Attempts to decode and re-encode text as ASCII. In the case of failure, it will attempt to detect the string's encoding, decode it, and convert it to ASCII. If both of these attempts fail, it will attempt to use the unidecode package to transliterate into ASCII. Finally, if that doesn't work, it will forcibly encode the text as ASCII and ignore non-ASCII characters.
Warning
This function is potentially destructive to source input and should be used with some care. Input text that cannot be decoded may be stripped out, or replaced with a similar ASCII character or other placeholder, potentially resulting in an empty string.
- Parameters
    text (str) – The text to process
    throw_loud_fail (bool) – If True, exceptions will be raised; otherwise the function will fail silently and return an empty string (default=False)
- Returns
    Decoded text, or an empty string
- Return type
    str
Note
In Python 3, the decode/encode attempts will fail by default, and the unidecode package will be used to transliterate. In general, you shouldn't need to use this function in Python 3, but it shouldn't hurt anything if you do.
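To make the first and last steps of that fallback chain concrete, here is a minimal standalone sketch using only the standard library. The helper name is ours, and the real function also tries encoding detection and unidecode transliteration in between these two steps:

```python
# A minimal sketch of the first and last steps of the fallback chain: try a
# clean ASCII round-trip, and as a last resort force ASCII and drop anything
# that won't fit. The real pewtils.decode_text() also attempts encoding
# detection and unidecode transliteration in between.
def decode_text_sketch(text, throw_loud_fail=False):
    try:
        # First attempt: a clean round-trip through ASCII
        return text.encode("ascii").decode("ascii")
    except (UnicodeEncodeError, UnicodeDecodeError):
        try:
            # Last resort: encode as ASCII and silently drop non-ASCII characters
            return text.encode("ascii", errors="ignore").decode("ascii")
        except Exception:
            if throw_loud_fail:
                raise
            return ""

print(decode_text_sketch("café"))  # non-ASCII characters are dropped -> "caf"
```

Note how the last-resort path illustrates the destructive behavior the warning above describes: the accented character is simply removed.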
- get_hash(text, hash_function='ssdeep')[source]
Generates hashed text using one of several available hashing functions.
- Parameters
    text (str) – The string to hash
    hash_function (str) – The specific algorithm to use; options are 'nilsimsa', 'md5', and 'ssdeep' (default)
- Returns
    A hashed representation of the provided string
- Return type
    str
Note
The string will be passed through pewtils.decode_text() and the returned value will be used instead of the original value if it runs successfully, in order to ensure consistent hashing in both Python 2 and 3. By default the function uses the ssdeep algorithm, which generates context-sensitive hashes that are useful for computing document similarities at scale.
Note
Using hash_function='ssdeep' requires the ssdeep library, which is not installed by default because it requires the installation of additional system libraries on certain operating systems. For help installing ssdeep, refer to the Pewtils documentation installation section, which provides OS-specific instructions.
Usage:
    from pewtils import get_hash

    >>> text = 'test_string'
    >>> get_hash(text)
    '3:HI2:Hl'
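If ssdeep isn't installed, the 'md5' option offers a dependency-free alternative. A minimal sketch of md5 hashing with the standard library (the helper name is ours; the real function additionally runs the input through decode_text() first):

```python
# A minimal sketch of md5 hashing with the standard library; pewtils'
# 'md5' option is backed by a digest like this one.
import hashlib

def md5_hash_sketch(text):
    # Encode to bytes first; hashlib operates on bytes, not str
    return hashlib.md5(text.encode("utf-8")).hexdigest()

print(md5_hash_sketch("test_string"))  # a 32-character hex digest
```

Unlike ssdeep, md5 is not context-sensitive: two near-identical documents produce completely unrelated digests, so it suits deduplication rather than similarity comparison.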
- zipcode_num_to_string(zipcode)[source]
Attempts to standardize a string/integer/float that contains a U.S. zipcode. Front-pads with zeroes and uses the zipcodes library to ensure that the zipcode is real. If the zipcode doesn't validate successfully, None will be returned.
- Parameters
    zipcode (str or float or int) – Object that contains a sequence of digits (string, integer, float)
- Returns
    A 5-digit string, or None
- Return type
    str or NoneType
Usage:
    from pewtils import zipcode_num_to_string

    >>> zipcode_number = 6463
    >>> zipcode_num_to_string(zipcode_number)
    '06463'
    >>> not_zipcode_number = 345678
    >>> zipcode_num_to_string(not_zipcode_number)
    >>>
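The front-padding step can be sketched in a few lines. This is our own simplified helper; it skips the real-zipcode validation that the actual function performs via the zipcodes library, and only rejects inputs with more than five digits:

```python
# A minimal sketch of the zero-padding step, without the validation against
# real U.S. zipcodes that pewtils performs via the zipcodes library.
def pad_zipcode_sketch(value):
    digits = str(int(float(value)))  # normalize ints, floats, and numeric strings
    if len(digits) > 5:
        return None  # too many digits to be a 5-digit zipcode
    return digits.zfill(5)  # front-pad with zeroes to 5 characters

print(pad_zipcode_sketch(6463))    # -> '06463'
print(pad_zipcode_sketch(345678))  # -> None
```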
- concat_text(*args)[source]
A helper function for concatenating text values. Text values are passed through pewtils.decode_text() before concatenation.
- Parameters
    args (list) – A list of text values that will be returned as a single space-separated string
- Returns
    A single string of the values concatenated by spaces
- Return type
    str
Usage:
    from pewtils import concat_text

    >>> text_list = ['Hello', 'World', '!']
    >>> concat_text(text_list)
    'Hello World !'
- vector_concat_text(*args)[source]
Takes a list of equal-length lists and returns a single list with the rows concatenated by spaces. Useful for merging multiple columns of text in Pandas.
- Parameters
    args – A list of lists or pandas.Series that contain text values
- Returns
    A single list or pandas.Series with all of the text values for each row concatenated
Usage with lists:
    from pewtils import vector_concat_text

    >>> text_lists = ["one", "two", "three"], ["a", "b", "c"]
    >>> vector_concat_text(text_lists)
    ['one a', 'two b', 'three c']
Usage with Pandas:
    import pandas as pd
    from pewtils import vector_concat_text

    df = pd.DataFrame([
        {"text1": "one", "text2": "a"},
        {"text1": "two", "text2": "b"},
        {"text1": "three", "text2": "c"}
    ])

    >>> df['text'] = vector_concat_text(df['text1'], df['text2'])
    >>> df['text']
    0      one a
    1      two b
    2    three c
    Name: text, dtype: object
- scale_range(old_val, old_min, old_max, new_min, new_max)[source]
Scales a value from one range to another. Useful for comparing values from different scales, for example.
- Parameters
    old_val (int or float) – The value to convert
    old_min (int or float) – The minimum of the old range
    old_max (int or float) – The maximum of the old range
    new_min (int or float) – The minimum of the new range
    new_max (int or float) – The maximum of the new range
- Returns
    The equivalent value on the new scale
- Return type
    float
Usage:
    from pewtils import scale_range

    >>> old_value = 5
    >>> scale_range(old_value, 0, 10, 0, 20)
    10.0
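The rescaling presumably follows the standard linear interpolation formula; a minimal sketch (the helper name is ours):

```python
# A minimal sketch of linear rescaling: find old_val's relative position in
# [old_min, old_max], then map that proportion onto [new_min, new_max].
def scale_range_sketch(old_val, old_min, old_max, new_min, new_max):
    proportion = (old_val - old_min) / float(old_max - old_min)
    return new_min + proportion * (new_max - new_min)

print(scale_range_sketch(5, 0, 10, 0, 20))  # -> 10.0
```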
- new_random_number(attempt=1, minimum=1.0, maximum=10)[source]
Returns a random number from a range whose upper bound increases exponentially with attempt. The upper bound is capped by the maximum parameter (default 10) but is otherwise determined by minimum * 2 ** attempt. In effect, this means that when attempt is 1, the number returned will be between the minimum and twice the minimum's value. As you increase attempt, the possible range of returned values expands exponentially until it hits the maximum ceiling.
- Parameters
    attempt (int) – Increasing attempt will expand the upper bound of the range from which the random number is drawn
    minimum (int or float) – The minimum allowed value that can be returned; must be greater than zero
    maximum (int or float) – The maximum allowed value that can be returned; must be greater than minimum
- Returns
    A random number drawn uniformly from the range determined by the provided arguments
- Return type
    float
Note
One useful application of this function is rate limiting: a script can pause in between requests at a reasonably fast pace, but then moderate itself and pause for longer periods if it begins encountering errors, simply by increasing the attempt variable (hence its name).
Usage:
    from pewtils import new_random_number

    >>> new_random_number(attempt=1)
    1.9835581813820642
    >>> new_random_number(attempt=2)
    3.1022350739064
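The exponential-backoff behavior described above can be sketched directly from the stated formula. This is our own illustrative helper, assuming a uniform draw between the minimum and the capped upper bound:

```python
# A minimal sketch of the backoff logic: the upper bound doubles with each
# attempt (minimum * 2 ** attempt) until it hits the maximum ceiling, and the
# result is drawn uniformly from within those bounds.
import random

def new_random_number_sketch(attempt=1, minimum=1.0, maximum=10):
    upper = min(maximum, minimum * 2 ** attempt)
    return random.uniform(minimum, upper)

# The draw always stays within [minimum, maximum], widening as attempt grows
for attempt in range(1, 6):
    value = new_random_number_sketch(attempt=attempt)
    assert 1.0 <= value <= 10
```

In a rate-limiting loop, you would increment attempt after each failed request and sleep for the returned number of seconds before retrying.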
- chunk_list(seq, size)[source]
Takes a sequence and groups values into smaller lists based on the specified size.
- Parameters
    seq (list or iterable) – List or a list-like iterable
    size (int) – Desired size of each sublist
- Returns
    A list of lists
- Return type
    list
Usage:
    from pewtils import chunk_list

    >>> number_sequence = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    >>> chunk_list(number_sequence, 3)
    [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
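The chunking behavior is equivalent to stepping through the sequence in slices; a minimal sketch (the helper name is ours):

```python
# A minimal sketch of chunking via list slicing: step through the sequence in
# strides of `size`; the final chunk may be shorter than the rest.
def chunk_list_sketch(seq, size):
    return [seq[i:i + size] for i in range(0, len(seq), size)]

print(chunk_list_sketch([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3))
# -> [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
```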
- flatten_list(l)[source]
Takes a list of lists and flattens it into a single list. Nice shortcut to avoid having to deal with list comprehension.
- Parameters
    l (list) – A list of lists
- Returns
    A flattened list of all of the elements contained in the original list of lists
- Return type
    list
Usage:
    from pewtils import flatten_list

    >>> nested_lists = [[1, 2, 3], [4, 5, 6]]
    >>> flatten_list(nested_lists)
    [1, 2, 3, 4, 5, 6]
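The list comprehension this helper saves you from writing looks like this; a minimal sketch (the helper name is ours):

```python
# A minimal sketch of the flattening shortcut: a nested comprehension that
# walks each sublist in order and collects its items into one flat list.
def flatten_list_sketch(l):
    return [item for sublist in l for item in sublist]

print(flatten_list_sketch([[1, 2, 3], [4, 5, 6]]))  # -> [1, 2, 3, 4, 5, 6]
```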
- scan_dictionary(search_dict, field)[source]
Takes a dictionary with nested lists and dictionaries, and searches recursively for a specific key. Since keys can occur more than once, the function returns a list of all of the found values along with a list of equal length that specifies the nested key path to each value.
- Parameters
    search_dict (dict) – The dictionary to search
    field (str) – The field to find
- Returns
    A tuple of the found values and file path-style strings representing their locations
- Return type
    tuple
Usage:
    from pewtils import scan_dictionary

    >>> test_dict = {"one": {"two": {"three": "four"}}}
    >>> scan_dictionary(test_dict, "three")
    (['four'], ['one/two/three/'])
    >>> scan_dictionary(test_dict, "five")
    ([], [])
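The recursive search can be sketched as a depth-first walk that tracks the key path as it descends. This is our own simplified version, mirroring the path format shown in the usage above:

```python
# A minimal sketch of the recursive key search: walk nested dicts (and dicts
# inside lists) depth-first, collecting matching values along with
# slash-delimited paths to each one.
def scan_dictionary_sketch(search_dict, field, path=""):
    values, paths = [], []
    for key, val in search_dict.items():
        current = "{}{}/".format(path, key)
        if key == field:
            values.append(val)
            paths.append(current)
        if isinstance(val, dict):
            # Recurse into nested dictionaries
            sub_values, sub_paths = scan_dictionary_sketch(val, field, current)
            values += sub_values
            paths += sub_paths
        elif isinstance(val, list):
            # Recurse into dictionaries nested inside lists
            for item in val:
                if isinstance(item, dict):
                    sub_values, sub_paths = scan_dictionary_sketch(item, field, current)
                    values += sub_values
                    paths += sub_paths
    return values, paths

print(scan_dictionary_sketch({"one": {"two": {"three": "four"}}}, "three"))
# -> (['four'], ['one/two/three/'])
```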
- recursive_update(existing, new)[source]
Takes an object and a dictionary representation of attributes and values, and recursively traverses through the new values and updates the object. The keys in the dictionary can correspond either to attribute names or to dictionary keys, so you can use this to iterate through a nested hierarchy of objects and dictionaries and update whatever you like.
- Parameters
    existing (dict or object) – An object or dictionary
    new (dict or object) – A dictionary where keys correspond to the names of keys in the existing dictionary or attributes on the existing object
- Returns
    A copy of the original object or dictionary, with the values updated based on the provided map
- Return type
    dict or object
Usage:
    from pewtils import recursive_update

    class TestObject(object):
        def __init__(self, value):
            self.value = value
            self.dict = {"obj_key": "original"}

        def __repr__(self):
            return "TestObject(value='{}', dict={})".format(self.value, self.dict)

    original = {
        "object": TestObject("original"),
        "key1": {"key2": "original"}
    }
    update = {
        "object": {"value": "updated", "dict": {"obj_key": "updated"}},
        "key1": {"key3": "new"}
    }

    >>> recursive_update(original, update)
    {'object': TestObject(value='updated', dict={'obj_key': 'updated'}), 'key1': {'key2': 'original', 'key3': 'new'}}
- cached_series_mapper(series, function)[source]
Applies a function to all of the unique values in a pandas.Series to avoid repeating the operation on duplicate values. Great if you're doing database lookups or something computationally intensive on a column that may contain repeating values, etc.
- Parameters
    series (pandas.Series) – A pandas.Series
    function – A function to apply to values in the pandas.Series
- Returns
    The resulting pandas.Series
- Return type
    pandas.Series
Usage:
    import pandas as pd
    from pewtils import cached_series_mapper

    values = ["value"] * 10

    def my_function(x):
        print(x)
        return x

    df = pd.DataFrame(values, columns=['column'])

    >>> mapped = df['column'].map(my_function)
    value
    value
    value
    value
    value
    value
    value
    value
    value
    value
    >>> mapped = cached_series_mapper(df['column'], my_function)
    value
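The caching idea is independent of pandas and can be sketched with a plain dict. This is our own simplified helper showing why the function above runs only once per unique value:

```python
# A minimal sketch of the caching idea: compute the function once per unique
# value, then reuse the cached result for every duplicate.
def cached_mapper_sketch(values, function):
    cache = {}
    results = []
    for value in values:
        if value not in cache:
            cache[value] = function(value)
        results.append(cache[value])
    return results

calls = []
def expensive(x):
    calls.append(x)  # track how many times the function actually runs
    return x.upper()

print(cached_mapper_sketch(["value"] * 10, expensive))  # ten 'VALUE' entries
print(len(calls))  # the function ran only once
```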
- multiprocess_group_apply(grp, func, *args, **kwargs)[source]
Applies arbitrary functions to groups or slices of a Pandas DataFrame using multiprocessing, to efficiently map or aggregate data. Each group gets processed in parallel, and the results are concatenated together after all processing has finished. If you pass a function that aggregates each group into a single value, you'll get back a DataFrame with one row for each group, as though you had performed a .agg function. If you pass a function that returns a value for each _row_ in the group, then you'll get back a DataFrame in your original shape. In this case, you would simply be using grouping to efficiently apply a row-level operation.
- Parameters
    grp (pandas.core.groupby.generic.DataFrameGroupBy) – A Pandas DataFrameGroupBy object
    func (function) – A function that accepts a Pandas DataFrame representing a group from the original DataFrame
    args – Arguments to be passed to the function
    kwargs – Keyword arguments to be passed to the function
- Returns
    The resulting DataFrame
- Return type
    pandas.DataFrame
Usage:
    import multiprocessing
    import pandas as pd
    from pewtils import multiprocess_group_apply

    df = pd.DataFrame([
        {"group_col": 1, "value": "one two three"},
        {"group_col": 1, "value": "one two three four"},
        {"group_col": 2, "value": "one two"}
    ])

    # For efficient aggregation
    def get_length(grp):
        # Simple function that returns the number of rows in each group
        return len(grp)

    >>> df.groupby("group_col").apply(lambda x: len(x))
    1    2
    2    1
    dtype: int64
    >>> multiprocess_group_apply(df.groupby("group_col"), get_length)
    1    2
    2    1
    dtype: int64

    # For efficient mapping
    def get_value_length(grp):
        # Simple function that returns the word count of each row in the group
        return grp['value'].map(lambda x: len(x.split()))

    >>> df['value'].map(lambda x: len(x.split()))
    0    3
    1    4
    2    2
    Name: value, dtype: int64
    >>> multiprocess_group_apply(df.groupby("group_col"), get_value_length)
    0    3
    1    4
    2    2
    Name: value, dtype: int64

    # If you just want to efficiently map a function to your DataFrame and you
    # want to evenly split your DataFrame into groups, you could do the following:
    df["group_col"] = (df.reset_index().index.values / (len(df) / multiprocessing.cpu_count())).astype(int)
    df["mapped_value"] = multiprocess_group_apply(df.groupby("group_col"), get_value_length)
    del df["group_col"]
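The split-apply-combine pattern underneath can be sketched without pandas at all. For portability this sketch uses a thread pool rather than process-based multiprocessing, and plain dicts in place of DataFrames; the helper name is ours:

```python
# A minimal sketch of the split-apply-combine pattern: run a function over
# each group in parallel, then stitch the per-group results back together.
# This sketch uses a thread pool for illustration; the real function uses
# process-based multiprocessing.
from multiprocessing.pool import ThreadPool

def group_apply_sketch(groups, func):
    # `groups` maps group keys to lists of rows
    with ThreadPool() as pool:
        results = pool.map(func, groups.values())
    # Combine per-group results, keyed by group
    return dict(zip(groups.keys(), results))

groups = {1: ["one two three", "one two three four"], 2: ["one two"]}
print(group_apply_sketch(groups, len))  # -> {1: 2, 2: 1}
```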
- extract_json_from_folder(folder_path, include_subdirs=False, concat_subdir_names=False)[source]
Takes a folder path and traverses it, looking for JSON files. When it finds one, it adds it to a dictionary, with the key being the name of the file and the value being the JSON itself. This is useful if you store configurations or various metadata in a nested folder structure, which we do for things like content analysis codebooks. Has options for recursively traversing a folder, and for optionally concatenating the subfolder names into the dictionary keys as prefixes.
- Parameters
    folder_path (str) – The path of the folder to scan
    include_subdirs (bool) – Whether or not to recursively scan subfolders
    concat_subdir_names (bool) – Whether or not to prefix the dictionary keys with the names of subfolders
- Returns
    A dictionary containing all of the extracted JSON files as values
- Return type
    dict
Usage:
    # For example, let's say we have the following folder structure
    # with various JSON codebooks scattered about:
    #
    # /codebooks
    #     /logos
    #         /antipathy.json
    #     /atp_open_ends
    #         /w29
    #             /sources_of_meaning.json
    #
    # Here's what we'd get depending on the different parameters we use:

    from pewtils import extract_json_from_folder

    >>> extract_json_from_folder("codebooks", include_subdirs=False, concat_subdir_names=False)
    {}

    >>> extract_json_from_folder("codebooks", include_subdirs=True, concat_subdir_names=False)
    {
        "logos": {"antipathy": "json would be here"},
        "atp_open_ends": {"w29": {"sources_of_meaning": "json would be here"}}
    }

    >>> extract_json_from_folder("codebooks", include_subdirs=True, concat_subdir_names=True)
    {
        "logos_antipathy": "json would be here",
        "atp_open_ends_w29_sources_of_meaning": "json would be here"
    }
- extract_attributes_from_folder_modules(folder_path, attribute_name, include_subdirs=False, concat_subdir_names=False, current_subdirs=None)[source]
Takes a folder path and traverses it, looking for Python files that contain an attribute (i.e., class, function, etc.) with a given name. It extracts those attributes and returns a dictionary where the keys are the names of the files that contained the attributes, and the values are the attributes themselves. This operates exactly the same as pewtils.extract_json_from_folder(), except instead of reading JSON files and adding them as values in the dictionary that gets returned, this function will look for Python files that contain a function, class, method, or attribute with the name you provide in attribute_name, and will load that attribute in as the values.
- Parameters
    folder_path (str) – The path of a folder/module to scan
    attribute_name (str) – The name of the attribute (class, function, variable, etc.) to extract from files
    include_subdirs (bool) – Whether or not to recursively scan subfolders
    concat_subdir_names (bool) – Whether or not to prefix the dictionary keys with the names of subfolders
    current_subdirs – Used to track location when recursively iterating a module (do not use)
- Returns
    A dictionary with all of the extracted attributes as values
- Return type
    dict
Note
If you use Python 2.7, you will need to add from __future__ import absolute_import to the top of files that you want to scan and import using this function.
- class PrintExecutionTime(label=None, stdout=None)[source]
Simple context manager to print the time it takes for a block of code to execute.
- Parameters
    label – A label to print alongside the execution time
    stdout – A StringIO-like output stream (sys.stdout by default)
Usage:
    import time
    from pewtils import PrintExecutionTime

    >>> with PrintExecutionTime(label="my function"):
    ...     time.sleep(5)
    my function: 5.004292011260986 seconds
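A timing context manager like this can be sketched in a few lines with the standard library. This is our own simplified version built on time.perf_counter; the real class's internals may differ:

```python
# A minimal sketch of a timing context manager: record a start time on entry,
# and print the elapsed time (to a configurable stream) on exit.
import time

class PrintExecutionTimeSketch:
    def __init__(self, label=None, stdout=None):
        self.label = label
        self.stdout = stdout  # None falls back to sys.stdout in print()

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        elapsed = time.perf_counter() - self.start
        print("{}: {} seconds".format(self.label or "execution time", elapsed),
              file=self.stdout)

with PrintExecutionTimeSketch(label="my function"):
    time.sleep(0.1)  # prints something like "my function: 0.10... seconds"
```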