I/O Tools

This module contains utilities for reading and writing files in a variety of formats. At the moment it consists exclusively of the pewtils.io.FileHandler class, which provides a standardized interface for loading and saving data both locally and on Amazon S3. It doesn’t cover every edge case, but the vast majority of the time it lets us read and write files with just one or two lines of code, and accordingly we use it everywhere. We hope you do too!

Classes:

FileHandler(path[, use_s3, bucket])

Read/write data files in a variety of formats, locally and in Amazon S3 buckets.

class FileHandler(path, use_s3=None, bucket=None)[source]

Read/write data files in a variety of formats, locally and in Amazon S3 buckets.

Parameters
  • path (str) – A valid path to the local folder or S3 directory that files will be written to or read from

  • use_s3 (bool) – Whether the path refers to an S3 location or a local one

  • bucket (str) – The name of the S3 bucket, required if use_s3=True; will also try to fetch from the environment as S3_BUCKET

Note

Typical rectangular data files (i.e. csv, tab, xlsx, xls, and dta extensions) are read into and written from pandas.DataFrame objects. The exceptions are the pkl and json formats, which accept any serializable Python object and any correctly-formatted JSON object, respectively.

Tip

You can configure your environment to connect to S3 automatically by defining the S3_BUCKET environment variable.
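
For example, with the environment variable defined you can omit the bucket argument entirely. A minimal sketch (the bucket name is hypothetical, and your AWS credentials must already be configured):

from pewtils.io import FileHandler

>>> import os
>>> os.environ["S3_BUCKET"] = "my-bucket"  # hypothetical bucket name
>>> h = FileHandler("/my_folder", use_s3=True)  # bucket is read from S3_BUCKET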

Usage:

from pewtils.io import FileHandler

>>> h = FileHandler("./", use_s3=False)  # current local folder
>>> df = h.read("my_csv", format="csv")
# Do something and save to Excel
>>> h.write("my_new_csv", df, format="xlsx")

>>> my_data = [{"key": "value"}]
>>> h.write("my_data", my_data, format="json")

>>> my_data = ["a", "python", "list"]
>>> h.write("my_data", my_data, format="pkl")

# To read/write to an S3 bucket
# The FileHandler detects your AWS credentials using boto3's standard methods
# (e.g. the ~/.aws config files or environment variables).
>>> h = FileHandler("/my_folder", use_s3=True, bucket="my-bucket")
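
# Once connected, reads and writes use the same interface as a local handler
# (illustrative; assumes my_csv.csv already exists in the bucket folder)
>>> df = h.read("my_csv", format="csv")
>>> h.write("my_new_csv", df, format="csv")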

Methods:

iterate_path()

Iterates over the directory and returns a list of filenames or S3 object keys

clear_folder()

Deletes the path (if local) or unlinks all keys in the bucket folder (if S3)

clear_file(key[, format, hash_key])

Deletes a specific file.

get_key_hash(key)

Converts a key to a hashed representation.

write(key, data[, format, hash_key, ...])

Writes arbitrary data objects to a variety of file formats.

read(key[, format, hash_key])

Reads a file from the directory or S3 path, returning its contents.

iterate_path()[source]

Iterates over the directory and returns a list of filenames or S3 object keys

Returns

Filenames (if local) or S3 object keys (if S3)

Return type

iterable

Usage:

from pewtils.io import FileHandler

>>> h = FileHandler("./", use_s3=False)
>>> for file in h.iterate_path(): print(file)
file1.csv
file2.pkl
file3.json
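
Because iterate_path() produces full filenames while read() expects a key without a suffix, you can split the extension off each name before reading. A minimal sketch (assumes the folder only contains formats the handler supports):

from pewtils.io import FileHandler

>>> import os
>>> h = FileHandler("./", use_s3=False)
>>> for file in h.iterate_path():
...     key, ext = os.path.splitext(file)
...     if ext == ".csv":
...         df = h.read(key, format="csv")  # e.g. reads file1.csv into a DataFrame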

clear_folder()[source]

Deletes the path (if local) or unlinks all keys in the bucket folder (if S3)

Warning

This is a destructive function, use with caution!

Usage:

from pewtils.io import FileHandler

>>> h = FileHandler("./", use_s3=False)
>>> len(list(h.iterate_path()))
3
>>> h.clear_folder()
>>> len(list(h.iterate_path()))
0

clear_file(key, format='pkl', hash_key=False)[source]

Deletes a specific file.

Warning

This is a destructive function, use with caution!

Parameters
  • key (str) – The name of the file to delete

  • format (str) – The file extension

  • hash_key (bool) – If True, will hash the filename before looking it up; default is False.

Usage:

from pewtils.io import FileHandler

>>> h = FileHandler("./", use_s3=False)
>>> for file in h.iterate_path(): print(file)
file1.csv
file2.pkl
file3.json
>>> h.clear_file("file1", format="csv")
>>> for file in h.iterate_path(): print(file)
file2.pkl
file3.json

get_key_hash(key)[source]

Converts a key to a hashed representation. This allows you to pass arbitrary objects and convert their string representations into a shorter hashed key, which can be useful for caching. You can call this method directly to see the hash that a key will be converted into, but it is mainly used in conjunction with the pewtils.io.FileHandler.write() and pewtils.io.FileHandler.read() methods by passing hash_key=True.

Parameters

key (str or object) – A raw string or Python object that can be meaningfully converted into a string representation

Returns

A SHA224 hash representation of that key

Return type

str

Usage:

from pewtils.io import FileHandler

>>> h = FileHandler("tests/files", use_s3=False)
>>> h.get_key_hash("temp")
"c51bf90ccb22befa316b7a561fe9d5fd9650180b14421fc6d71bcd57"
>>> h.get_key_hash({"key": "value"})
"37e13e1116c86a6e9f3f8926375c7cb977ca74d2d598572ced03cd09"

write(key, data, format='pkl', hash_key=False, add_timestamp=False, **io_kwargs)[source]

Writes arbitrary data objects to a variety of file formats.

Parameters
  • key (str) – The name of the file or key (without a file suffix!)

  • data (object) – The actual data to write to the file

  • format (str) – The format the data should be saved in (pkl/csv/tab/xlsx/xls/dta/json). Defaults to pkl. This will be used as the file’s suffix.

  • hash_key (bool) – Whether or not to hash the provided key before saving the file. (Default=False)

  • add_timestamp (bool) – Optionally add a timestamp to the filename

  • io_kwargs – Additional parameters to pass along to the Pandas save function, if applicable

Returns

None

Note

When saving a csv, tab, xlsx, xls, or dta file, this function expects to receive a pandas.DataFrame. When you use these formats, you can also pass optional io_kwargs, which will be forwarded to the corresponding pandas method below:

  • dta: pandas.DataFrame.to_stata()

  • csv: pandas.DataFrame.to_csv()

  • tab: pandas.DataFrame.to_csv()

  • xlsx: pandas.DataFrame.to_excel()

  • xls: pandas.DataFrame.to_excel()

If you’re trying to save an object to JSON, it assumes that you’re passing it valid JSON. By default, the handler attempts to use pickling, allowing you to save anything you want, as long as it’s serializable.
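
For example, io_kwargs can be used to drop the DataFrame index when writing a CSV. A minimal sketch (assumes the keyword is forwarded unchanged to pandas.DataFrame.to_csv()):

from pewtils.io import FileHandler

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
>>> h = FileHandler("./", use_s3=False)
>>> h.write("my_table", df, format="csv", index=False)  # index=False is passed to to_csv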

read(key, format='pkl', hash_key=False, **io_kwargs)[source]

Reads a file from the directory or S3 path, returning its contents.

Parameters
  • key (str) – The name of the file to read (without a suffix!)

  • format (str) – The format of the file (pkl/json/csv/dta/xls/xlsx/tab); expects the file extension to match

  • hash_key (bool) – Whether the key should be hashed prior to looking for and retrieving the file.

  • io_kwargs – Optional arguments to be passed to the specific load function (dependent on file format)

Returns

The file contents, in the requested format

Note

You can pass optional io_kwargs, which will be forwarded to the function below that corresponds to the format of the file you’re trying to read in:

  • dta: pandas.read_stata()

  • csv: pandas.read_csv()

  • tab: pandas.read_csv()

  • xlsx: pandas.read_excel()

  • xls: pandas.read_excel()
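
Similarly, io_kwargs can narrow what gets loaded. A minimal sketch (assumes the keyword is forwarded unchanged to pandas.read_csv() and that my_table.csv contains a column named "a"):

from pewtils.io import FileHandler

>>> h = FileHandler("./", use_s3=False)
>>> df = h.read("my_table", format="csv", usecols=["a"])  # usecols is passed to read_csv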