I/O Tools
This module contains utilities for reading and writing files in a variety of formats. Right now, it consists exclusively of the pewtils.io.FileHandler class, which provides a standardized interface for loading and saving data both locally and on Amazon S3. It doesn’t always work exactly as intended, but 99% of the time it gives us a way to read and write files with just one or two lines of code, so we use it everywhere. We hope you do too!
Classes:
FileHandler(path[, use_s3, bucket]) – Read/write data files in a variety of formats, locally and in Amazon S3 buckets.
- class FileHandler(path, use_s3=None, bucket=None)[source]
Read/write data files in a variety of formats, locally and in Amazon S3 buckets.
- Parameters
path (str) – A valid path to the local folder or S3 directory where files will be read from or written to
use_s3 (bool) – Whether the path is an S3 location or local location
bucket (str) – The name of the S3 bucket; required if use_s3=True. If omitted, the handler will also try to fetch it from the environment variable S3_BUCKET
Note
Typical rectangular data files (i.e. those with csv, tab, xlsx, xls, or dta file extensions) will be read into/written from a pandas.DataFrame object. The exceptions are pkl and json files, which accept any serializable Python object and any correctly formatted JSON object, respectively.
Tip
You can configure your environment to make it easier to automatically connect to S3 by defining the variable S3_BUCKET.
Usage:
>>> from pewtils.io import FileHandler
>>> h = FileHandler("./", use_s3=False)  # current local folder
>>> df = h.read("my_csv", format="csv")
>>> # Do something and save to Excel
>>> h.write("my_new_csv", df, format="xlsx")
>>> my_data = [{"key": "value"}]
>>> h.write("my_data", my_data, format="json")
>>> my_data = ["a", "python", "list"]
>>> h.write("my_data", my_data, format="pkl")
>>> # To read/write to an S3 bucket
>>> # The FileHandler detects your AWS tokens using boto3's standard methods to find them in ~/.aws or defined as environment variables.
>>> h = FileHandler("/my_folder", use_s3=True, bucket="my-bucket")
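The bucket fallback described in the parameters above can be sketched roughly as follows. This is only an illustration, not the library's code: resolve_bucket is a hypothetical helper, and raising ValueError when no bucket can be found is an assumption about what a handler might reasonably do.

```python
import os


def resolve_bucket(use_s3, bucket=None):
    """Hypothetical helper: choose the S3 bucket name for a handler."""
    if not use_s3:
        return None  # local paths do not need a bucket
    # Prefer the explicit argument, then fall back to the environment.
    bucket = bucket or os.environ.get("S3_BUCKET")
    if bucket is None:
        # Assumed behavior; the real class may handle this case differently.
        raise ValueError("use_s3=True needs a bucket or the S3_BUCKET variable")
    return bucket
```

With S3_BUCKET defined in the environment, a call such as resolve_bucket(True) would pick it up without an explicit bucket argument.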
Methods:
iterate_path() – Iterates over the directory and returns a list of filenames or S3 object keys
clear_folder() – Deletes the path (if local) or unlinks all keys in the bucket folder (if S3)
clear_file(key[, format, hash_key]) – Deletes a specific file.
get_key_hash(key) – Converts a key to a hashed representation.
write(key, data[, format, hash_key, ...]) – Writes arbitrary data objects to a variety of file formats.
read(key[, format, hash_key]) – Reads a file from the directory or S3 path, returning its contents.
- iterate_path()[source]
Iterates over the directory and returns a list of filenames or S3 object keys
- Returns
Yields a list of filenames or S3 keys
- Return type
iterable
Usage:
>>> from pewtils.io import FileHandler
>>> h = FileHandler("./", use_s3=False)
>>> for file in h.iterate_path(): print(file)
file1.csv
file2.pkl
file3.json
- clear_folder()[source]
Deletes the path (if local) or unlinks all keys in the bucket folder (if S3)
Warning
This is a destructive function; use with caution!
Usage:
>>> from pewtils.io import FileHandler
>>> h = FileHandler("./", use_s3=False)
>>> len(list(h.iterate_path()))
3
>>> h.clear_folder()
>>> len(list(h.iterate_path()))
0
- clear_file(key, format='pkl', hash_key=False)[source]
Deletes a specific file.
Warning
This is a destructive function; use with caution!
- Parameters
key (str) – The name of the file to delete
format (str) – The file extension
hash_key (bool) – If True, will hash the filename before looking it up; default is False.
Usage:
>>> from pewtils.io import FileHandler
>>> h = FileHandler("./", use_s3=False)
>>> for file in h.iterate_path(): print(file)
file1.csv
file2.pkl
file3.json
>>> h.clear_file("file1", format="csv")
>>> for file in h.iterate_path(): print(file)
file2.pkl
file3.json
- get_key_hash(key)[source]
Converts a key to a hashed representation. Allows you to pass arbitrary objects and convert their string representation into a shorter hashed key, so it can be useful for caching. You can call this method directly to see the hash that a key will be converted into, but this method is mainly used in conjunction with the pewtils.FileHandler.write() and pewtils.FileHandler.read() methods by passing in hash_key=True.
- Parameters
key (str or object) – A raw string or Python object that can be meaningfully converted into a string representation
- Returns
A SHA224 hash representation of that key
- Return type
str
Usage:
>>> from pewtils.io import FileHandler
>>> h = FileHandler("tests/files", use_s3=False)
>>> h.get_key_hash("temp")
"c51bf90ccb22befa316b7a561fe9d5fd9650180b14421fc6d71bcd57"
>>> h.get_key_hash({"key": "value"})
"37e13e1116c86a6e9f3f8926375c7cb977ca74d2d598572ced03cd09"
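To make the hashing idea concrete, here is a minimal sketch of how a key could be reduced to a SHA224 digest using only the standard library. key_hash_sketch is a hypothetical stand-in, and the way it builds the string form of non-string keys (plain str()) is an assumption, so it is not guaranteed to reproduce the digests that get_key_hash returns above.

```python
import hashlib


def key_hash_sketch(key):
    """Hypothetical stand-in for get_key_hash: SHA224 of the key's string form."""
    # Assumption: non-string keys are converted with str() before hashing.
    text = str(key)
    return hashlib.sha224(text.encode("utf-8")).hexdigest()


# A long or awkward cache key collapses into a fixed-length filename stem.
params = {"survey": "wave_12", "weighted": True, "columns": ["age", "region"]}
digest = key_hash_sketch(params)
print(len(digest))  # a SHA224 hex digest is always 56 characters long
```

Because the digest depends only on the key's string form, the same parameters always map to the same filename, which is what makes hashed keys convenient for caching.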
- write(key, data, format='pkl', hash_key=False, add_timestamp=False, **io_kwargs)[source]
Writes arbitrary data objects to a variety of file formats.
- Parameters
key (str) – The name of the file or key (without a file suffix!)
data (object) – The actual data to write to the file
format (str) – The format the data should be saved in (pkl/csv/tab/xlsx/xls/dta/json). Defaults to pkl. This will be used as the file’s suffix.
hash_key (bool) – Whether or not to hash the provided key before saving the file. (Default=False)
add_timestamp (bool) – Optionally add a timestamp to the filename
io_kwargs – Additional parameters to pass along to the Pandas save function, if applicable
- Returns
None
Note
When saving a csv, tab, xlsx, xls, or dta file, this function expects to receive a pandas.DataFrame. When you use these formats, you can also pass optional io_kwargs, which will be forwarded to the corresponding pandas method below:
dta: pandas.DataFrame.to_stata()
csv: pandas.DataFrame.to_csv()
tab: pandas.DataFrame.to_csv()
xlsx: pandas.DataFrame.to_excel()
xls: pandas.DataFrame.to_excel()
If you’re trying to save an object to JSON, it assumes that you’re passing it valid JSON. By default, the handler attempts to use pickling, allowing you to save anything you want, as long as it’s serializable.
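The relationship between key, format, and the resulting filename can be illustrated with a small local-only sketch. It covers just the pkl and json formats with the standard library; write_sketch is a hypothetical function, the timestamp layout is an assumption, and it returns the path purely for inspection (the documented method returns None).

```python
import datetime
import json
import os
import pickle


def write_sketch(folder, key, data, format="pkl", add_timestamp=False):
    """Hypothetical local-only writer covering just the pkl and json formats."""
    if add_timestamp:
        # Assumed timestamp layout; the real filename pattern may differ.
        stamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
        key = f"{key}_{stamp}"
    # The format doubles as the file suffix, so the key carries no extension.
    path = os.path.join(folder, f"{key}.{format}")
    if format == "json":
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f)  # raises TypeError if data is not JSON-serializable
    elif format == "pkl":
        with open(path, "wb") as f:
            pickle.dump(data, f)  # default: any picklable Python object
    else:
        raise ValueError(f"unsupported format in this sketch: {format}")
    return path
```

Calling write_sketch(folder, "my_data", my_data, format="json") would produce my_data.json inside folder, mirroring the usage shown for the class.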
- read(key, format='pkl', hash_key=False, **io_kwargs)[source]
Reads a file from the directory or S3 path, returning its contents.
- Parameters
key (str) – The name of the file to read (without a suffix!)
format (str) – The format of the file (pkl/json/csv/dta/xls/xlsx/tab); expects the file extension to match
hash_key (bool) – Whether the key should be hashed prior to looking for and retrieving the file.
io_kwargs – Optional arguments to be passed to the specific load function (dependent on file format)
- Returns
The file contents, in the requested format
Note
You can pass optional io_kwargs that will be forwarded to the function below that corresponds to the format of the file you’re trying to read in:
dta: pandas.read_stata()
csv: pandas.read_csv()
tab: pandas.read_csv()
xlsx: pandas.read_excel()
xls: pandas.read_excel()
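The forwarding of io_kwargs to a format-specific loader can be sketched with a small dispatch table. To stay self-contained, this example uses the standard library's json.load and pickle.load as stand-ins for the pandas readers; read_sketch and LOADERS are hypothetical names, and whether the real method forwards keyword arguments for these two formats is an assumption.

```python
import json
import os
import pickle

# Map each format to (file mode, loader). Stand-ins for the pandas readers.
LOADERS = {
    "json": ("r", json.load),
    "pkl": ("rb", pickle.load),
}


def read_sketch(folder, key, format="pkl", **io_kwargs):
    """Hypothetical local-only reader that forwards io_kwargs to the loader."""
    mode, loader = LOADERS[format]
    # The file extension is expected to match the requested format.
    path = os.path.join(folder, f"{key}.{format}")
    with open(path, mode) as f:
        return loader(f, **io_kwargs)
```

For instance, read_sketch(folder, "nums", format="json", parse_float=str) would hand parse_float straight to json.load, in the same way that extra arguments to read are described as reaching the pandas reader for the requested format.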