Pandas concat() then to_sql() – CSV upload to PostgreSQL

Recently, I wrote a post about inserting pandas DataFrame values into a PostgreSQL table, using SQLAlchemy data types to type cast them to the desired data type. In this post, I’ll explore – and leverage – pandas.concat() and get a two for one by merging 2 pandas DataFrames into 1, prior to uploading.

Photo by Carlos Muza on Unsplash
OS, Database, and software used:
  • Xubuntu Linux 18.04.2 LTS (Bionic Beaver)
  • PostgreSQL 11.4
  • Python 3.7.4
  • pandas-0.25.0


Self-Promotion:

If you enjoy the content written here, by all means, share this blog and your favorite post(s) with others who may benefit from or like it as well. Since coffee is my favorite drink, you can even buy me one if you would like!


First, we need to import pandas and create a connection to the database:

1
2
3
>>> import pandas as pd
>>> from sqlalchemy import create_engine
>>> engine = create_engine('postgresql://my_user:user_password@localhost:5432/walking_stats')

Using read_csv(), I create 2 DataFrame objects: one with May’s walking stats CSV data, the other with June’s. (Visit, Basic CSV file import and exploration with Pandas – first steps, for more information on this pandas construct):

1
2
>>> may_csv_data = pd.read_csv(r'/home/linux_user/pg_py_database/may_2019_hiking_stats.csv')
>>> june_csv_data = pd.read_csv(r'/home/linux_user/pg_py_database/june_2019_hiking_stats.csv')

We can view the first 5 rows of data – along with the overall size of each DataFrame – using head() and shape

1
2
3
4
5
6
7
8
9
10
:
>>> may_csv_data.head()
   day_walked  cal_burned  miles_walked  duration  mph  shoe_id
0  2019-05-01       226.1          2.23  00:41:34  3.2        4
1  2019-05-02       226.7          2.25  00:41:41  3.2        4
2  2019-05-05       230.1          2.28  00:42:18  3.2        4
3  2019-05-06       222.7          2.23  00:40:56  3.3        4
4  2019-05-08       231.2          2.27  00:42:30  3.2        4
>>> may_csv_data.shape
(21, 6)

1
2
3
4
5
6
7
8
9
>>> june_csv_data.head()
   day_walked  cal_burned  miles_walked  duration  mph  shoe_id
0  2019-06-02       275.7          2.79  00:50:41  3.3        4
1  2019-06-03       229.1          2.26  00:42:07  3.2        4
2  2019-06-04       229.5          2.28  00:42:11  3.2        4
3  2019-06-05       187.2          1.84  00:34:25  3.2        4
4  2019-06-06       262.3          2.60  00:48:13  3.2        4
>>> june_csv_data.shape
(24, 6)

Using pandas.concat() (visit the online documentation here for more information), I will merge both DataFrames into a single DataFrame named ‘combined_data’:

1
2
3
>>> combined_data = pd.concat([may_csv_data, june_csv_data], ignore_index=True)
>>> combined_data.shape
(45, 6)

From the shape output, we can see that we now have 45 total rows of data with 6 columns.

Calling both the head() and tail() functions, returns May and June data records:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
>>> combined_data.head()
   day_walked  cal_burned  miles_walked  duration  mph  shoe_id
0  2019-05-01       226.1          2.23  00:41:34  3.2        4
1  2019-05-02       226.7          2.25  00:41:41  3.2        4
2  2019-05-05       230.1          2.28  00:42:18  3.2        4
3  2019-05-06       222.7          2.23  00:40:56  3.3        4
4  2019-05-08       231.2          2.27  00:42:30  3.2        4
>>> combined_data.tail()
    day_walked  cal_burned  miles_walked  duration  mph  shoe_id
40  2019-06-25       302.2          3.00  00:55:37  3.2        4
41  2019-06-26       307.0          3.10  00:56:26  3.3        4
42  2019-06-27       303.4          3.04  00:55:47  3.3        4
43  2019-06-28       321.4          3.04  00:59:01  3.3        4
44  2019-06-30       332.0          3.35  01:01:02  3.3        4

In, Pandas to SQL – importing CSV data files into PostgreSQL, I wrapped up that post with a mention of skipping the staging table loading process and importing records directly into table ‘stats’. Using SQLAlchemy types, we proved we can perform the necessary type casting directly between pandas and PostgreSQL with them, having done so using the String() type.

Visiting the ‘stats’ table structure, we can gain a sense of the compatible SQLAlchemy data types for the mapping:

1
2
3
4
5
6
7
8
9
10
walking_stats=> \d stats;
                          Table "public.stats"
    Column    |          Type          | Collation | Nullable | Default
--------------+------------------------+-----------+----------+---------
 day_walked   | date                   |           |          |
 cal_burned   | numeric(4,1)           |           |          |
 miles_walked | numeric(4,2)           |           |          |
 duration     | time without time zone |           |          |
 mph          | numeric(2,1)           |           |          |
 shoe_id      | integer                |           |          |

The below SQLAlchemy types will cover the requirements:

1
>>> from sqlalchemy.types import Date, Numeric, Time, INT

The dtype python dictionary parameter enables the type casting. Each column is type casted to a compatible data type using an applicable SQLAlchemy type:

1
>>> combined_data.to_sql('stats', engine, if_exists='append', index=False, dtype={"day_walked": Date(), "cal_burned": Numeric(precision=4, scale=1), "miles_walked": Numeric(precision=4, scale=2), "duration": Time(timezone=False), "mph": Numeric(precision=2, scale=1), "shoe_id": INT()})

After calling to_sql(), I’ll verify the records were loaded into table ‘stats’:

1
2
3
4
5
walking_stats=> SELECT COUNT(*) FROM stats;
 count
-------
   112
(1 row)

Recall from above, combined_data.shape returned 45 rows (or records) of data, composed of both May and June’s respective combined DataFrame objects.

Using a simple query, I’ll retrieve a more specific count of records for those two months instead of just an overall count:

1
2
3
4
5
6
7
walking_stats=> SELECT COUNT(*)
walking_stats-> FROM stats
walking_stats-> WHERE EXTRACT(month FROM day_walked) IN (5, 6);
 count
-------
    45
(1 row)

(Read The EXTRACT() function with examples in PostgreSQL, I wrote on this handy function if you are interested. It is pretty awesome in my opinion. The function also!! LOL)

A count of 45 rows returned in the query results matches combined_data.shape output, proving the upload was successful.

Over the course of many posts, I have covered different means to load data from a CSV file into a PostgreSQL table. Certain posts focused more on a SQL-esque approach, while others used pandas and SQLAlchemy.

Now that the data is in the destination table, it is time to explore it with various Analytic Window functions along, with other pandas options. Look for several forthcoming blog posts on these topics and other data-centric posts here on Digital Owl’s Prose.

Like what you have read? See anything incorrect? Please comment below and thanks for reading!!!

A Call To Action!

Thank you for taking the time to read this post. I truly hope you discovered something interesting and enlightening. Please share your findings here, with someone else you know who would get the same value out of it as well.

Visit the Portfolio-Projects page to see blog post/technical writing I have completed for clients.

Have I mentioned how much I love a cup of coffee?!?!

To receive email notifications (Never Spam) from this blog (“Digital Owl’s Prose”) for the latest blog posts as they are published, please subscribe (of your own volition) by clicking the ‘Click To Subscribe!’ button in the sidebar on the homepage! (Feel free at any time to review the Digital Owl’s Prose Privacy Policy Page for any questions you may have about: email updates, opt-in, opt-out, contact forms, etc…)

Be sure and visit the “Best Of” page for a collection of my best blog posts.


Josh Otwell has a passion to study and grow as a SQL Developer and blogger. Other favorite activities find him with his nose buried in a good book, article, or the Linux command line. Among those, he shares a love of tabletop RPG games, reading fantasy novels, and spending time with his wife and two daughters.

Disclaimer: The examples presented in this post are hypothetical ideas of how to achieve similar types of results. They are not the utmost best solution(s). The majority, if not all, of the examples provided, is performed on a personal development/learning workstation-environment and should not be considered production quality or ready. Your particular goals and needs may vary. Use those practices that best benefit your needs and goals. Opinions are my own.

Advertisements

2 thoughts on “Pandas concat() then to_sql() – CSV upload to PostgreSQL

Hey thanks for commenting! Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.