Pandas to SQL – importing CSV data files into PostgreSQL

My goal with this post is to cover what I have learned while inserting pandas DataFrame values into a PostgreSQL table using SQLAlchemy. Interested in learning about this yourself? Want to see a simple example? You are in the right place so keep reading and learn with me…

OS, database, and software used:
  • Xubuntu Linux 18.04.2 LTS (Bionic Beaver)
  • PostgreSQL 11.4
  • Python 3.7.4
  • pandas-0.25.0


Self-Promotion:

If you enjoy the content written here, by all means, share this blog and your favorite post(s) with others who may benefit from or like it as well. Since coffee is my favorite drink, you can even buy me one if you would like!


The final destination for all of my walking stats is this PostgreSQL table I have on my local learning/development environment:

walking_stats=> \d stats;
                          Table "public.stats"
    Column    |          Type          | Collation | Nullable | Default
--------------+------------------------+-----------+----------+---------
 day_walked   | date                   |           |          |
 cal_burned   | numeric(4,1)           |           |          |
 miles_walked | numeric(4,2)           |           |          |
 duration     | time without time zone |           |          |
 mph          | numeric(2,1)           |           |          |
 shoe_id      | integer                |           |          |

Also as part of the schema, I have a ‘staging’ table (description provided below) where I import all records from a CSV file. As I mentioned in the opening paragraph, we’ll populate it using SQLAlchemy and pandas.

Getting started, we create a connection to the database with SQLAlchemy’s create_engine() function:

>>> from sqlalchemy import create_engine
>>> engine = create_engine('postgresql://my_user:user_password@localhost:5432/walking_stats')

From the SQLAlchemy engine configuration page, we can see the basic structure and syntax is relatively straightforward:

dialect://user_name:password@host:port/database
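As a quick sketch, the connection URL can be assembled from its parts; the credentials and database name below are the same placeholders used elsewhere in this post:

```python
# Assembling a SQLAlchemy connection URL from its components.
# All values here are placeholders, not real credentials.
dialect = "postgresql"
user = "my_user"
password = "user_password"
host = "localhost"
port = 5432
database = "walking_stats"

url = f"{dialect}://{user}:{password}@{host}:{port}/{database}"
print(url)
```

Passing `url` to create_engine() yields the same engine as the hard-coded string above.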

Using a common convention, I’ll get pandas imported:

>>> import pandas as pd

This is the structure of the staging table:

walking_stats=> \d stat_staging;
             Table "public.stat_staging"
    Column    | Type | Collation | Nullable | Default
--------------+------+-----------+----------+---------
 day_walked   | text |           |          |
 cal_burned   | text |           |          |
 miles_walked | text |           |          |
 duration     | text |           |          |
 mph          | text |           |          |
 shoe_id      | text |           |          |

The staging table is simply a mirror of the ‘stats’ table, with the exception that all columns are implemented as a TEXT data type.

CSV file with April’s walking stats in hand, let’s create a pandas DataFrame object from it with the read_csv() method (check out this post I wrote on this method and other handy pandas functionality):

>>> apr_csv_data = pd.read_csv(r'/home/my_linux_user/pg_py_database/apr_2019_hiking_stats.csv')

Then view the first 5 rows of data using the head() function:

>>> apr_csv_data.head()
   day_walked   cal_burned   miles_walked   duration   mph   shoe_id
0  2019-04-01        217.7           2.18   00:40:01   3.3         4
1  2019-04-02        240.1           2.39   00:44:09   3.2         4
2  2019-04-03        152.7           1.51   00:28:04   3.2         4
3  2019-04-04        207.6           2.04   00:38:10   3.2         4
4  2019-04-05        247.8           2.43   00:45:34   3.2         4
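Since every column in ‘stat_staging’ is TEXT anyway, one option worth knowing about is read_csv()’s dtype parameter, which can force every column to a string up front. A minimal sketch, using a couple of the April rows shown above as inline sample data:

```python
import pandas as pd
from io import StringIO

# Two sample rows copied from the April data above, fed through
# StringIO so the sketch does not depend on a file on disk.
csv_text = StringIO(
    "day_walked,cal_burned,miles_walked,duration,mph,shoe_id\n"
    "2019-04-01,217.7,2.18,00:40:01,3.3,4\n"
    "2019-04-02,240.1,2.39,00:44:09,3.2,4\n"
)

# dtype=str makes pandas keep every column as strings instead of
# inferring numeric types.
df = pd.read_csv(csv_text, dtype=str)
print(df.dtypes)
```

Every column comes back as object (string) dtype, which lines up nicely with an all-TEXT staging table.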

Table ‘stat_staging’ is empty at this time:

walking_stats=> TABLE stat_staging;
 day_walked | cal_burned | miles_walked | duration | mph | shoe_id
------------+------------+--------------+----------+-----+---------
(0 rows)

We are about to change all that and INSERT some data with pandas. First, I need to import the String type from SQLAlchemy:

>>> from sqlalchemy.types import String

Next, let’s compose – and execute – the INSERT operation in pandas:

>>> apr_csv_data.to_sql('stat_staging', engine, if_exists='append',
...                     index=False,
...                     dtype={"day_walked": String(), "cal_burned": String(),
...                            "miles_walked": String(), "duration": String(),
...                            "mph": String(), "shoe_id": String()})

You likely know why I needed the String() data type available: I am casting the DataFrame column values to a data type compatible with the TEXT columns of table ‘stat_staging’. The dtype dictionary parameter is what enables you to set the column data types.
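If typing out String() for every column feels repetitive, a dict comprehension over the DataFrame’s columns builds the same mapping. Here is a runnable sketch of that pattern, using an in-memory SQLite engine as a stand-in for the PostgreSQL one and a small two-column DataFrame as sample data:

```python
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.types import String

# Sample data standing in for the CSV-derived DataFrame.
df = pd.DataFrame({
    "day_walked": ["2019-04-01", "2019-04-02"],
    "cal_burned": ["217.7", "240.1"],
})

# In-memory SQLite as a stand-in for the PostgreSQL engine.
engine = create_engine("sqlite://")

# One String() entry per column, without spelling each one out.
dtype_map = {col: String() for col in df.columns}

df.to_sql("stat_staging", engine, if_exists="append", index=False,
          dtype=dtype_map)

# Read the rows back to confirm the insert landed.
round_trip = pd.read_sql("SELECT * FROM stat_staging", engine)
print(len(round_trip))
```

The same dtype_map works unchanged against the PostgreSQL engine from earlier in the post.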

Table ‘stat_staging’ has the DataFrame object records now:

walking_stats=> TABLE stat_staging;
 day_walked | cal_burned | miles_walked | duration  | mph | shoe_id
------------+------------+--------------+-----------+-----+---------
 2019-04-01 | 217.7      | 2.18         |  00:40:01 | 3.3 | 4
 2019-04-02 | 240.1      | 2.39         |  00:44:09 | 3.2 | 4
 2019-04-03 | 152.7      | 1.51         |  00:28:04 | 3.2 | 4
 2019-04-04 | 207.6      | 2.04         |  00:38:10 | 3.2 | 4
 2019-04-05 | 247.8      | 2.43         |  00:45:34 | 3.2 | 4
 2019-04-07 | 294.5      | 2.89         |  00:54:09 | 3.2 | 4
 2019-04-08 | 208.6      | 2.06         |  00:38:20 | 3.2 | 4
 2019-04-08 | 199.9      | 1.96         |  00:36:45 | 3.2 | 4
 2019-04-11 | 225.1      | 2.24         |  00:41:23 | 3.2 | 4
 2019-04-14 | 251.6      | 2.47         |  00:46:15 | 3.2 | 4
 2019-04-15 | 223.8      | 2.15         |  00:41:09 | 3.1 | 4
 2019-04-16 | 229.6      | 2.25         |  00:42:13 | 3.2 | 4
 2019-04-17 | 195.6      | 1.89         |  00:35:58 | 3.2 | 4
 2019-04-18 | 160.2      | 1.58         |  00:29:27 | 3.2 | 4
 2019-04-21 | 277.2      | 2.63         |  00:58:41 | 2.7 | 4
 2019-04-23 | 111.4      | 1.06         |  00:20:29 | 3.1 | 4
 2019-04-24 | 226.8      | 2.23         |  00:41:42 | 3.2 | 4
 2019-04-25 | 180.5      | 1.77         |  00:33:10 | 3.2 | 4
 2019-04-28 | 223.1      | 2.23         |  00:41:01 | 3.3 | 4
 2019-04-29 | 217.6      | 2.11         |  00:40:00 | 3.2 | 4
 2019-04-30 | 228.8      | 2.24         |  00:42:04 | 3.2 | 4
(21 rows)

How simple was that? Now all I need to do is type cast the column values – from table ‘stat_staging’ – to the appropriate data types during the INSERT when I move the rows of data over to table ‘stats’. I wrote all about it right here so you should definitely read this post if you are interested in how I accomplished it in PostgreSQL.
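The actual casting in my workflow happens in PostgreSQL, as covered in the linked post. That said, if you ever want a quick preview of those casts on the pandas side, equivalent conversions could look something like this sketch (one sample row taken from the April data; the duration column is left as text here since a time-of-day type needs extra handling):

```python
import pandas as pd

# One staged row, all text, mirroring table 'stat_staging'.
staged = pd.DataFrame({
    "day_walked": ["2019-04-01"],
    "cal_burned": ["217.7"],
    "miles_walked": ["2.18"],
    "mph": ["3.3"],
    "shoe_id": ["4"],
})

# Cast each text column toward the types used by table 'stats'.
typed = staged.assign(
    day_walked=pd.to_datetime(staged["day_walked"]).dt.date,
    cal_burned=staged["cal_burned"].astype(float),
    miles_walked=staged["miles_walked"].astype(float),
    mph=staged["mph"].astype(float),
    shoe_id=staged["shoe_id"].astype(int),
)
print(typed.dtypes)
```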

Prior to closing out this post, I want to call attention to perhaps the most important parameter in the above call to to_sql(): if_exists.

The snippet below from the to_sql() documentation page, shows what values are acceptable, with their individual meanings:

if_exists : {‘fail’, ‘replace’, ‘append’}, default ‘fail’
How to behave if the table already exists.

fail: Raise a ValueError.
replace: Drop the table before inserting new values.
append: Insert new values to the existing table.

If you don’t specify anything and the table already exists, the call fails with a ValueError. Which is good, in my opinion; that way you don’t overwrite records unintentionally. Both replace and append are self-explanatory.
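To see that default fail behavior in action without touching a real database, here is a sketch against a throwaway in-memory SQLite engine:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite as a disposable stand-in for PostgreSQL.
engine = create_engine("sqlite://")
df = pd.DataFrame({"a": [1]})

df.to_sql("demo", engine, index=False)  # first call creates the table

raised = False
try:
    # Table now exists; the default if_exists='fail' raises ValueError.
    df.to_sql("demo", engine, index=False)
except ValueError:
    raised = True
print("second insert refused:", raised)
```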

If your thinking is like mine, this thought has crossed your mind: “Since I can type cast the column values with SQLAlchemy, can’t I just skip the whole ‘staging’ table gambit and load them directly into table ‘stats’?”

And that my friends, is another post for another day. Check back in when it drops and see what it is all about!

Like what you have read? See anything incorrect? Please comment below and thanks for reading!!!

A Call To Action!

Thank you for taking the time to read this post. I truly hope you discovered something interesting and enlightening. Please share your findings here, with someone else you know who would get the same value out of it as well.

Visit the Portfolio-Projects page to see blog post/technical writing I have completed for clients.

Have I mentioned how much I love a cup of coffee?!?!

To receive email notifications (Never Spam) from this blog (“Digital Owl’s Prose”) for the latest blog posts as they are published, please subscribe (of your own volition) by clicking the ‘Click To Subscribe!’ button in the sidebar on the homepage! (Feel free at any time to review the Digital Owl’s Prose Privacy Policy Page for any questions you may have about: email updates, opt-in, opt-out, contact forms, etc…)

Be sure and visit the “Best Of” page for a collection of my best blog posts.


Josh Otwell has a passion to study and grow as a SQL Developer and blogger. Other favorite activities find him with his nose buried in a good book, article, or the Linux command line. Among those, he shares a love of tabletop RPG games, reading fantasy novels, and spending time with his wife and two daughters.

Disclaimer: The examples presented in this post are hypothetical ideas of how to achieve similar types of results. They are not the utmost best solution(s). The majority, if not all, of the examples provided are performed on a personal development/learning workstation environment and should not be considered production quality or ready. Your particular goals and needs may vary. Use those practices that best benefit your needs and goals. Opinions are my own.

