Using Pandas to Resample Time Series
One of the most common requests we receive is how to resample intraday data into different time frames (for example converting 1-minute bars into 1-hour bars).
This operation is possible in Excel but is extremely inefficient as Excel will struggle to handle large time-series files (anything over 500,000 rows is problematic on most systems) and the conversion process is very clunky requiring multiple calculation columns.
For resampling data, we always recommend customers use Pandas. It can take a little work to set up and install if the customer is new to Pandas but it is usually under an hour and it is very easy to work with Pandas in combination with Jupyter notebooks.
Although Python, Pandas and Jupyter Notebooks can all be installed separately the most efficient way to install all three is to install Anaconda (https://docs.anaconda.com/anaconda/install/windows/ )
We will work through a resampling example using Jupyter Notebooks. If you are performing multiple resamplings, executing a Python script is the most efficient method, however, to perform a single resample or for demonstrating the process, Jupyter Notebook is very quick to get started with.
In a new Jupyter notebook we will first import Pandas:
Next, we can load the time-series data using Panda’s read_csv method. In addition to reading .csv files the read_csv method with csv formatted files of any extension and will also unzipped zipped csv files.
In this example we will use the free 1-minute AMZN datafile provided by FirstRate Data and load the csv file into a Pandas dataframe from the read_csv method:
amzn_df = pd.read_csv("https://frd001.s3-us-east-2.amazonaws.com/AMZN_FirstRateDatacom1.zip",
Note in the above sample, the datafile does not contain a header row so we need to pass in a column_names array of the columns. Also, we need to parse the TimeStamp column into the date format (by default it will be a string) and then assign this as index using the index_col argument.
We can check the dataframe is correctly loaded by running
Which will outputs the first 5 rows of the dataframe.
Next we can proceed with the resampling. In this example we will resample the 1-minute bars into 1-hour bars.
This is done by combining the resample() and aggs() methods. The resample() method groups rows into a different timeframe based on a parameter that is passed in, for example resample(“B”) groups rows into business days (one row per business day).
Unfortunately, the resample() method does not aggregate the all the columns using different rules (such as sum the volume column but only use the high value from the high column). To do this we need to use the aggs() method which allows us to specify how each column is aggregated.
Thus combining the resample() and aggs() method :
Note that some older code samples use the ‘how’ argument in the resample() method which appears much simpler, for example:
However, the ‘how’ parameter is no longer available in Pandas and the agg() method needs to be used in its place.
The process is nearly complete. The only remaining issue is that Pandas will create empty bars for weekends and holidays which need to be removed. This can be done by passing the dataframe a filtering argument which will be true only for trading days. In the below example we only take bars where the close is above zero (which should only be trading days).
The process is now complete, and we can save the resampled dataframe as an Excel file by calling the to_excel() method:
That’s it. The entire resampling procedure will only takes five lines of code and will execute in seconds.