Easy datetime manipulation with the pandas to_datetime() function.

No matter what programming language you use, there are nuances to working with date and datetime values. Even so, dates are super important; a lot hinges on date and time values. For instance, it is paramount that my employer and I both are conscious of the days and hours I work. Be it business-related or centered around life, we always tend to keep the time, so to speak. Pandas has a to_datetime() function that provides quite a bit of useful functionality for working with datetime values. Inspired by a great video where I learned about this function, this post covers several basic examples of its use. As always, read and learn with me…

OS, Database, and software used:
  • OpenSuse Leap 15.0
  • Python 3.7.2


Self-Promotion:

If you enjoy the content written here, by all means, share this blog and your favorite post(s) with others who may benefit from or like it as well. Since coffee is my favorite drink, you can even buy me one if you would like!


To start, I must give credit where credit is due. I learned so much from this fantastic pandas video. Several of the examples that follow are inspired by or derived from it, applying the concepts to my own data sources. Be sure to check it out!

To use pandas, we first need to import it. I’ll also be using CSV data, so the powerful read_csv() function takes care of that. Do check out any one of the several links provided in the closing section for other blog posts I’ve written on the read_csv() function.

>>> import pandas as pd
>>> from datetime import date
>>> data = pd.read_csv(r'/home/joshua/Practice Data/Fitness_DB_Data/aug_2019_hiking_stats.csv', delimiter=',')
>>> data.head()
   day_walked  cal_burned  miles_walked  duration  mph  shoe_id
0  2019-08-01       358.1          3.53  01:05:51  3.2        6
1  2019-08-02       354.7          3.46  01:05:12  3.2        6
2  2019-08-04       354.2          3.55  01:05:07  3.3        6
3  2019-08-05       351.3          3.52  01:04:36  3.3        6
4  2019-08-06       292.2          2.83  00:53:44  3.2        4

Let’s look at the dtypes attribute, paying particular attention to the ‘day_walked’ column’s type.

>>> data.dtypes
day_walked object
cal_burned float64
miles_walked float64
duration object
mph float64
shoe_id int64
dtype: object

Using to_datetime(), I’ll convert the ‘day_walked’ column to a datetime type matching the format the column’s strings currently represent: 4-digit year, 2-digit month, and 2-digit day:

>>> data['day_walked'] = pd.to_datetime(data['day_walked'], format='%Y-%m-%d')

Now have a look at the ‘day_walked’ column type:

>>> data.dtypes
day_walked datetime64[ns]
cal_burned float64
miles_walked float64
duration object
mph float64
shoe_id int64
dtype: object

Since it is now a datetime type, you can access several properties relating to date and time functionality. It is important to note that my dataset does not contain any time-of-day values in the ‘day_walked’ column, so the time-specific properties (hour, minute, second) are not meaningful for it.
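To illustrate the .dt accessor pattern used throughout the rest of this section, here is a minimal, self-contained sketch (the dates are made up for illustration, not taken from my dataset):

```python
import pandas as pd

# A small stand-in series for the 'day_walked' column (hypothetical dates)
days = pd.Series(pd.to_datetime(["2019-08-01", "2019-08-02", "2019-08-04"]))

# The .dt accessor exposes the datetime properties explored below
print(days.dt.year.tolist())   # [2019, 2019, 2019]
print(days.dt.month.tolist())  # [8, 8, 8]
print(days.dt.day.tolist())    # [1, 2, 4]
```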

You can retrieve the actual weekday name via that property:

>>> data['day_walked'].dt.weekday_name
0 Thursday
1 Friday
2 Sunday
3 Monday
4 Tuesday
5 Wednesday
6 Thursday
7 Friday
8 Sunday
9 Monday
10 Tuesday
11 Wednesday
12 Thursday
13 Friday
14 Sunday
15 Monday
16 Tuesday
17 Thursday
18 Friday
19 Sunday
20 Tuesday
21 Wednesday
22 Thursday
23 Friday
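A quick heads-up: the weekday_name attribute was deprecated and later removed in pandas 1.0. In newer releases, the day_name() method provides the same information:

```python
import pandas as pd

days = pd.Series(pd.to_datetime(["2019-08-01", "2019-08-02"]))

# day_name() is the modern replacement for the weekday_name attribute
print(days.dt.day_name().tolist())  # ['Thursday', 'Friday']
```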

Although we can obviously see it in the dataset, you can also extract the year value through that particular property:

>>> data['day_walked'].dt.year
0 2019
1 2019
2 2019
3 2019
4 2019
5 2019
6 2019
7 2019
8 2019
9 2019
10 2019
11 2019
12 2019
13 2019
14 2019
15 2019
16 2019
17 2019
18 2019
19 2019
20 2019
21 2019
22 2019
23 2019

Same with the month number:

>>> data['day_walked'].dt.month
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
8 8
9 8
10 8
11 8
12 8
13 8
14 8
15 8
16 8
17 8
18 8
19 8
20 8
21 8
22 8
23 8

Need just the day of the month? There is an attribute for it:

>>> data['day_walked'].dt.day
0 1
1 2
2 4
3 5
4 6
5 7
6 8
7 9
8 11
9 12
10 13
11 14
12 15
13 16
14 18
15 19
16 20
17 22
18 23
19 25
20 27
21 28
22 29
23 30

The weekday attribute returns a number indicating the day of the week, where Monday is 0 (zero) and Sunday is 6:

>>> data['day_walked'].dt.weekday
0 3
1 4
2 6
3 0
4 1
5 2
6 3
7 4
8 6
9 0
10 1
11 2
12 3
13 4
14 6
15 0
16 1
17 3
18 4
19 6
20 1
21 2
22 3
23 4

Need the ordinal day of the year? That attribute is available as well:

>>> data['day_walked'].dt.dayofyear
0 213
1 214
2 216
3 217
4 218
5 219
6 220
7 221
8 223
9 224
10 225
11 226
12 227
13 228
14 230
15 231
16 232
17 234
18 235
19 237
20 239
21 240
22 241
23 242

You can easily determine the first day of the month with is_month_start. For those dates that are not, False is returned:

>>> data['day_walked'].dt.is_month_start
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False

According to the results above, I did walk on the first day of the month. However, is_month_end returns False for every row, indicating I did not walk on the last day of the month:

>>> data['day_walked'].dt.is_month_end
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False

Now that ‘day_walked’ is a datetime type, it can be used in comparisons with other datetime values. Let’s assign the date ‘2019-08-19’ to a variable:

>>> curious_day = pd.to_datetime('2019-08-19')

To retrieve those rows where the date is greater than or equal to ‘2019-08-19’, the comparison could look like this:

>>> data.loc[data['day_walked'] >= curious_day]
   day_walked  cal_burned  miles_walked  duration  mph  shoe_id
15 2019-08-19       370.1          3.68  01:08:03  3.2        5
16 2019-08-20       360.1          3.60  01:06:20  3.3        5
17 2019-08-22       376.1          3.75  01:09:09  3.3        5
18 2019-08-23       351.7          3.46  01:04:39  3.2        5
19 2019-08-25       355.1          3.51  01:05:17  3.2        5
20 2019-08-27       378.1          3.71  01:09:31  3.2        5
21 2019-08-28       345.1          3.41  01:03:27  3.2        5
22 2019-08-29       365.8          3.58  01:07:15  3.2        5
23 2019-08-30       205.5          2.02  00:37:47  3.2        5

Likewise, the less-than comparison operator retrieves those rows with dates prior to ‘2019-08-19’:

>>> data.loc[data['day_walked'] < curious_day]
   day_walked  cal_burned  miles_walked  duration  mph  shoe_id
0  2019-08-01       358.1          3.53  01:05:51  3.2        6
1  2019-08-02       354.7          3.46  01:05:12  3.2        6
2  2019-08-04       354.2          3.55  01:05:07  3.3        6
3  2019-08-05       351.3          3.52  01:04:36  3.3        6
4  2019-08-06       292.2          2.83  00:53:44  3.2        4
5  2019-08-07       339.0          3.31  01:02:20  3.2        4
6  2019-08-08       344.1          3.36  01:03:16  3.2        4
7  2019-08-09       316.7          3.12  00:58:13  3.2        5
8  2019-08-11       363.2          3.64  01:06:47  3.3        5
9  2019-08-12       346.1          3.42  01:03:38  3.2        5
10 2019-08-13       370.0          3.66  01:08:01  3.2        5
11 2019-08-14       347.0          3.45  01:03:55  3.2        5
12 2019-08-15       327.0          3.25  01:00:09  3.2        5
13 2019-08-16       351.0          3.45  01:04:32  3.2        5
14 2019-08-18       347.4          3.47  01:03:53  3.3        5
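Combining both comparison operators retrieves rows within a date range; the Series.between() method does the same in one call. A small sketch with made-up dates and bounds:

```python
import pandas as pd

days = pd.Series(pd.to_datetime(["2019-08-01", "2019-08-19", "2019-08-30"]))

start = pd.to_datetime("2019-08-05")
end = pd.to_datetime("2019-08-25")

# Series.between() is inclusive of both endpoints by default
in_range = days[days.between(start, end)]
print(in_range.dt.strftime("%Y-%m-%d").tolist())  # ['2019-08-19']
```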

Explore the to_datetime() function, trying out its handy functionality on your own datasets’ datetime values. Thanks for reading!
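One more parameter worth knowing while you explore (not used in the examples above): errors='coerce' turns unparseable strings into NaT instead of raising an exception, which is handy for messy real-world data:

```python
import pandas as pd

mixed = pd.Series(["2019-08-01", "not a date", "2019-08-19"])

# Unparseable strings become NaT (not-a-time) rather than raising a ValueError
parsed = pd.to_datetime(mixed, errors="coerce")
print(parsed.isna().tolist())  # [False, True, False]
```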


Other posts you may be interested in: Bulk CSV Uploads with Pandas and PostgreSQL

Like what you have read? See anything incorrect? Please comment below and thanks for reading!!!

A Call To Action!

Thank you for taking the time to read this post. I truly hope you discovered something interesting and enlightening. Please share your findings here, with someone else you know who would get the same value out of it as well.

Visit the Portfolio-Projects page to see blog post/technical writing I have completed for clients.

Have I mentioned how much I love a cup of coffee?!?!

To receive email notifications (Never Spam) from this blog (“Digital Owl’s Prose”) for the latest blog posts as they are published, please subscribe (of your own volition) by clicking the ‘Click To Subscribe!’ button in the sidebar on the homepage! (Feel free at any time to review the Digital Owl’s Prose Privacy Policy Page for any questions you may have about: email updates, opt-in, opt-out, contact forms, etc…)

Be sure and visit the “Best Of” page for a collection of my best blog posts.


Josh Otwell has a passion to study and grow as a SQL Developer and blogger. Other favorite activities find him with his nose buried in a good book, article, or the Linux command line. Among those, he shares a love of tabletop RPG games, reading fantasy novels, and spending time with his wife and two daughters.

Disclaimer: The examples presented in this post are hypothetical ideas of how to achieve similar types of results. They are not the utmost best solution(s). The majority, if not all, of the examples provided, is performed on a personal development/learning workstation-environment and should not be considered production quality or ready. Your particular goals and needs may vary. Use those practices that best benefit your needs and goals. Opinions are my own.



Window Functions in PostgreSQL – example with 3-day rolling average.

After reading this fantastic post, Window Functions in Python and SQL, I decided to apply a similar function to a data set that interests me: the walking/hiking stats I keep up with for all of my (daily) walks. While this blog post will cover more of the SQL aspect, I plan to write one covering the Python and Pandas portion in the near future…

OS, Database, and software used:
  • Xubuntu Linux 18.04.2 LTS (Bionic Beaver)
  • PostgreSQL 11.4




I use this table and structure to store and track walking stats data. I have written several blog posts detailing different methods for bulk loading CSV data into it using PostgreSQL and pandas. Be sure to visit those linked posts in the closing section below if you are interested.

walking_stats=> \d stats;
                          Table "public.stats"
    Column    |          Type          | Collation | Nullable | Default
--------------+------------------------+-----------+----------+---------
 day_walked   | date                   |           |          |
 cal_burned   | numeric(4,1)           |           |          |
 miles_walked | numeric(4,2)           |           |          |
 duration     | time without time zone |           |          |
 mph          | numeric(2,1)           |           |          |
 shoe_id      | integer                |           |          |

Using a window function, we can retrieve query results for a 3-day rolling average of calories burned:

SELECT day_walked,
       cal_burned,
       AVG(cal_burned) OVER(ORDER BY day_walked ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS three_day_avg
FROM stats
WHERE EXTRACT(MONTH FROM day_walked) = 1;
 day_walked | cal_burned |    three_day_avg    
------------+------------+----------------------
 2019-01-01 |      132.8 | 132.8000000000000000
 2019-01-02 |      181.1 | 156.9500000000000000
 2019-01-07 |      207.3 | 173.7333333333333333
 2019-01-08 |      218.2 | 202.2000000000000000
 2019-01-09 |      193.0 | 206.1666666666666667
 2019-01-10 |      160.2 | 190.4666666666666667
 2019-01-11 |      206.3 | 186.5000000000000000
 2019-01-13 |      253.2 | 206.5666666666666667
 2019-01-14 |      177.6 | 212.3666666666666667
 2019-01-15 |      207.0 | 212.6000000000000000
 2019-01-16 |      248.7 | 211.1000000000000000
 2019-01-17 |      176.3 | 210.6666666666666667
 2019-01-19 |      200.2 | 208.4000000000000000
 2019-01-20 |      244.4 | 206.9666666666666667
 2019-01-21 |      205.9 | 216.8333333333333333
 2019-01-22 |      244.8 | 231.7000000000000000
 2019-01-23 |      231.8 | 227.5000000000000000
 2019-01-25 |      244.9 | 240.5000000000000000
 2019-01-27 |      302.7 | 259.8000000000000000
 2019-01-28 |      170.2 | 239.2666666666666667
 2019-01-29 |      235.5 | 236.1333333333333333
 2019-01-30 |      254.2 | 219.9666666666666667
 2019-01-31 |      229.5 | 239.7333333333333333
(23 rows)

To clean up all the extra digits in the ‘three_day_avg’ column, we can wrap the Window Function in the ROUND() function, keeping only 2 digits after the decimal:

walking_stats=> SELECT day_walked, cal_burned, ROUND(AVG(cal_burned) OVER(ORDER BY day_walked ROWS BETWEEN 2 PRECEDING AND CURRENT ROW),2) AS three_day_avg
FROM stats
WHERE EXTRACT(MONTH FROM day_walked) = 1;
 day_walked | cal_burned | three_day_avg
------------+------------+---------------
 2019-01-01 |      132.8 |        132.80
 2019-01-02 |      181.1 |        156.95
 2019-01-07 |      207.3 |        173.73
 2019-01-08 |      218.2 |        202.20
 2019-01-09 |      193.0 |        206.17
 2019-01-10 |      160.2 |        190.47
 2019-01-11 |      206.3 |        186.50
 2019-01-13 |      253.2 |        206.57
 2019-01-14 |      177.6 |        212.37
 2019-01-15 |      207.0 |        212.60
 2019-01-16 |      248.7 |        211.10
 2019-01-17 |      176.3 |        210.67
 2019-01-19 |      200.2 |        208.40
 2019-01-20 |      244.4 |        206.97
 2019-01-21 |      205.9 |        216.83
 2019-01-22 |      244.8 |        231.70
 2019-01-23 |      231.8 |        227.50
 2019-01-25 |      244.9 |        240.50
 2019-01-27 |      302.7 |        259.80
 2019-01-28 |      170.2 |        239.27
 2019-01-29 |      235.5 |        236.13
 2019-01-30 |      254.2 |        219.97
 2019-01-31 |      229.5 |        239.73
(23 rows)

And there it is: a 3-day rolling average of calories burned for the month of January.

We can use SQL to easily check the math for a particular row. I’ll focus on row 3, dated ‘2019-01-07’. The windowing portion of the OVER() clause, ROWS BETWEEN 2 PRECEDING AND CURRENT ROW, essentially means: average the ‘cal_burned’ values for the current row and the 2 preceding rows. For that row, the math looks like this:

walking_stats=> SELECT ROUND((207.3 + 181.1 + 132.8) / 3,2) AS three_day_avg;
 three_day_avg
---------------
        173.73
(1 row)
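For comparison (and as a teaser for the pandas follow-up mentioned above), a similar 3-day rolling average can be sketched with pandas’ rolling() method; min_periods=1 mirrors the SQL frame, which averages however many rows are available at the start:

```python
import pandas as pd

# First four 'cal_burned' values from the January result set above
cal = pd.Series([132.8, 181.1, 207.3, 218.2])

# window=3 corresponds to ROWS BETWEEN 2 PRECEDING AND CURRENT ROW;
# min_periods=1 lets the first rows average over fewer than 3 values
three_day_avg = cal.rolling(window=3, min_periods=1).mean().round(2)
print(three_day_avg.tolist())  # [132.8, 156.95, 173.73, 202.2]
```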

I have written several blog posts about window functions in both the PostgreSQL and MySQL ecosystems; the two below are most similar to this post and provide more information about the windowing portion of the OVER() clause:

Other posts you may be interested in: Bulk CSV Uploads with Pandas and PostgreSQL

Try out Window Functions yourself to calculate rolling averages, sums, and the like on data sets that interest you. Hit me up in the comments with some examples. I’d love to know of more interesting use cases. Thanks for reading!

