Pandas merge() and read_sql() – joining DataFrames.

I have recently written several articles about pandas and PostgreSQL database interaction, specifically loading CSV data. In this post, I’ll cover what I have recently learned about pandas merge(), read_sql(), and read_sql_query(), retrieving query results with INNER JOINs and similar queries.

OS, Database, and software used:
  • Xubuntu Linux 18.04.2 LTS (Bionic Beaver)
  • PostgreSQL 11.4
  • Python 3.7.4
  • pandas-0.25.0


Feel free to visit these similar posts about pandas and SQL…

  • Pandas to SQL – importing CSV data files into PostgreSQL
  • Basic CSV file import and exploration with Pandas – first steps.
  • Pandas concat() then to_sql() – CSV upload to PostgreSQL

For starters, I import pandas and SQLAlchemy, then create a connection to the PostgreSQL database:

    >>> import pandas as pd
    >>> import sqlalchemy
    >>> from sqlalchemy import create_engine
    >>> engine = create_engine('postgresql://my_user:user_password@localhost:5432/walking_stats')

    One of the tables I track for my exercise/walking – and UPDATE frequently – is a ‘shoe_brand’ table. I am finicky about footwear and always looking for a reason to buy more hiking/walking shoes (LOL). In PostgreSQL, I can retrieve the present data with a simple TABLE command:

    walking_stats=> TABLE shoe_brand;
     shoe_id |              name_brand              
    ---------+---------------------------------------
           1 | New Balance 510v2
           2 | New Balance Trail Runners-All Terrain
           3 | Keen Koven WP(keen-dry)
           4 | Oboz Sawtooth Low
           5 | Merrell MOAB Edge 2
           6 | Oboz Cirque Low
    (6 rows)

    Pandas has a read_sql_table() function that allows you to read a SQL table into a DataFrame. I’ll use it here and retrieve the same information from the ‘shoe_brand’ table:

    >>> shoes = pd.read_sql_table('shoe_brand', engine)
    >>> shoes
       shoe_id                             name_brand
    0        1                      New Balance 510v2
    1        2  New Balance Trail Runners-All Terrain
    2        3                Keen Koven WP(keen-dry)
    3        4                      Oboz Sawtooth Low
    4        5                    Merrell MOAB Edge 2
    5        6                        Oboz Cirque Low

    >>> type(shoes)
    <class 'pandas.core.frame.DataFrame'>

    Pandas also provides a read_sql() function that will read a SQL query or table. According to the on-line documentation, it is just a convenience wrapper for read_sql_table() and read_sql_query(). Below, I pass in a SQL query along with the engine connection, filtering records for the month of ‘January’ using EXTRACT() in the WHERE clause (Have a look at this post I wrote about EXTRACT()):

    >>> stats = pd.read_sql('SELECT * FROM stats WHERE EXTRACT(MONTH FROM day_walked) = 1;', engine)
    >>> stats
        day_walked  cal_burned  miles_walked  duration  mph  shoe_id
    0   2019-01-01       132.8          1.27  00:24:24  3.1        4
    1   2019-01-02       181.1          1.76  00:33:18  3.2        3
    2   2019-01-07       207.3          2.03  00:38:07  3.2        4
    3   2019-01-08       218.2          2.13  00:40:07  3.2        4
    4   2019-01-09       193.0          1.94  00:35:29  3.3        4
    5   2019-01-10       160.2          1.58  00:29:27  3.2        4
    6   2019-01-11       206.3          2.03  00:37:55  3.2        4
    7   2019-01-13       253.2          2.49  00:46:33  3.2        4
    8   2019-01-14       177.6          1.78  00:32:39  3.3        4
    9   2019-01-15       207.0          2.03  00:38:03  3.2        4
    10  2019-01-16       248.7          2.42  00:45:43  3.2        4
    11  2019-01-17       176.3          1.76  00:32:25  3.3        4
    12  2019-01-19       200.2          2.01  00:36:48  3.3        4
    13  2019-01-20       244.4          2.42  00:44:57  3.2        4
    14  2019-01-21       205.9          2.03  00:37:52  3.2        4
    15  2019-01-22       244.8          2.43  00:45:01  3.2        4
    16  2019-01-23       231.8          2.35  00:42:37  3.3        4
    17  2019-01-25       244.9          2.44  00:45:02  3.3        4
    18  2019-01-27       302.7          3.04  00:55:39  3.3        4
    19  2019-01-28       170.2          1.66  00:31:17  3.2        4
    20  2019-01-29       235.5          2.31  00:43:18  3.2        4
    21  2019-01-30       254.2          2.52  00:46:44  3.2        4
    22  2019-01-31       229.5          2.27  00:42:11  3.2        4

    >>> type(stats)
    <class 'pandas.core.frame.DataFrame'>
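
The month in the query above is hard-coded into the SQL string. As a rough sketch of an alternative, read_sql() also accepts a params argument so the value can be passed separately. To keep the example runnable without a PostgreSQL server, it assumes an in-memory SQLite database as a stand-in, so strftime() replaces EXTRACT() and the placeholder is sqlite3's '?' style; the February row is made up purely so the filter has something to exclude.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stand-in for the PostgreSQL 'stats' table (assumption).
conn = sqlite3.connect(":memory:")
sample = pd.DataFrame(
    {
        # First two rows mirror the post's data; the February row is hypothetical.
        "day_walked": ["2019-01-01", "2019-01-02", "2019-02-01"],
        "miles_walked": [1.27, 1.76, 2.00],
        "shoe_id": [4, 3, 4],
    }
)
sample.to_sql("stats", conn, index=False)

# SQLite has no EXTRACT(); strftime('%m', ...) plays the same role here.
# '?' is sqlite3's placeholder style; psycopg2 uses %s or %(name)s instead.
query = "SELECT * FROM stats WHERE CAST(strftime('%m', day_walked) AS INTEGER) = ?;"
january = pd.read_sql(query, conn, params=(1,))
print(january)
```

Against the PostgreSQL engine used in this post, the same idea should apply with the EXTRACT() query and the driver's own placeholder syntax, but I have not shown that here.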

    I’ll use read_sql() throughout the remainder of this post, but I want to point out a difference between it and read_sql_query(). You can pass a table name and connection to read_sql() with no issues:

    >>> pd.read_sql('shoe_brand', engine)
       shoe_id                             name_brand
    0        1                      New Balance 510v2
    1        2  New Balance Trail Runners-All Terrain
    2        3                Keen Koven WP(keen-dry)
    3        4                      Oboz Sawtooth Low
    4        5                    Merrell MOAB Edge 2
    5        6                        Oboz Cirque Low

    However, that is not the case with read_sql_query():

    >>> pd.read_sql_query('shoe_brand', engine)
    Traceback (most recent call last):
      File "/home/linux_user_name/pg_py_database/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
        cursor, statement, parameters, context
      File "/home/linux_user_name/pg_py_database/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
        cursor.execute(statement, parameters)
    psycopg2.errors.SyntaxError: syntax error at or near "shoe_brand"
    LINE 1: shoe_brand
            ^
    The above exception was the direct cause of the following exception:
    #errors below not shown for brevity's sake....
    .............................
    .............................
    .............................
    .............................
    sqlalchemy.exc.ProgrammingError: (psycopg2.errors.SyntaxError) syntax error at or near "shoe_brand"
    LINE 1: shoe_brand
            ^

    [SQL: shoe_brand]
    (Background on this error at: http://sqlalche.me/e/f405)
    >>>

    Apparently, you must provide an actual SQL query to read_sql_query() instead of just a table name:

    >>> pd.read_sql_query('SELECT * FROM shoe_brand;', engine)
       shoe_id                             name_brand
    0        1                      New Balance 510v2
    1        2  New Balance Trail Runners-All Terrain
    2        3                Keen Koven WP(keen-dry)
    3        4                      Oboz Sawtooth Low
    4        5                    Merrell MOAB Edge 2
    5        6                        Oboz Cirque Low

    I’ll retrieve results for the shoes worn during walks in January using an INNER JOIN with read_sql():

    >>> shoes_worn = pd.read_sql('SELECT s.day_walked, s.miles_walked, sh.name_brand FROM stats AS s INNER JOIN shoe_brand AS sh ON s.shoe_id = sh.shoe_id WHERE EXTRACT(MONTH FROM s.day_walked) = 1;', engine)
    >>> shoes_worn.head(7)
       day_walked  miles_walked               name_brand
    0  2019-01-02          1.76  Keen Koven WP(keen-dry)
    1  2019-01-31          2.27        Oboz Sawtooth Low
    2  2019-01-30          2.52        Oboz Sawtooth Low
    3  2019-01-29          2.31        Oboz Sawtooth Low
    4  2019-01-28          1.66        Oboz Sawtooth Low
    5  2019-01-27          3.04        Oboz Sawtooth Low
    6  2019-01-25          2.44        Oboz Sawtooth Low
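
Notice the rows above are not in date order. Without an ORDER BY clause, SQL makes no promise about row order, so the database is free to return the joined rows however it likes. Here is a small sketch of the same JOIN with an explicit ORDER BY. It assumes an in-memory SQLite database and a handful of sample rows as a stand-in for the PostgreSQL tables, so it can run anywhere.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stand-in for the two PostgreSQL tables (assumption).
conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {
        "shoe_id": [3, 4],
        "name_brand": ["Keen Koven WP(keen-dry)", "Oboz Sawtooth Low"],
    }
).to_sql("shoe_brand", conn, index=False)
pd.DataFrame(
    {
        "day_walked": ["2019-01-01", "2019-01-02", "2019-01-07"],
        "miles_walked": [1.27, 1.76, 2.03],
        "shoe_id": [4, 3, 4],
    }
).to_sql("stats", conn, index=False)

# Same INNER JOIN as in the post, plus ORDER BY for a predictable row order.
join_query = """
    SELECT s.day_walked, s.miles_walked, sh.name_brand
    FROM stats AS s
    INNER JOIN shoe_brand AS sh ON s.shoe_id = sh.shoe_id
    ORDER BY s.day_walked
"""
ordered = pd.read_sql(join_query, conn)
print(ordered)
```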



    Although the read_sql() example works just fine, there are other pandas options for a query like this. Using the pandas DataFrame.merge() function, I can retrieve those same results in a slightly different manner versus the actual SQL JOIN query.

    Recall that the ‘stats’ and ‘shoes’ DataFrames together hold roughly the same data as the read_sql() INNER JOIN query returns. With merge(), you can combine them, much like a JOIN.

    Pandas merge() accepts several parameters, most of them optional. Visit the documentation (link in closing section) for the full range of them. For this example, I’ll look at just 3 of them: right (the ‘shoes’ DataFrame here), how, and on.

    Let’s see the syntax and results below:

    >>> merged_shoes_worn = stats.merge(shoes, how='inner', on='shoe_id')
    >>> merged_shoes_worn.head(7)
       day_walked  cal_burned  miles_walked  duration  mph  shoe_id         name_brand
    0  2019-01-01       132.8          1.27  00:24:24  3.1        4  Oboz Sawtooth Low
    1  2019-01-07       207.3          2.03  00:38:07  3.2        4  Oboz Sawtooth Low
    2  2019-01-08       218.2          2.13  00:40:07  3.2        4  Oboz Sawtooth Low
    3  2019-01-09       193.0          1.94  00:35:29  3.3        4  Oboz Sawtooth Low
    4  2019-01-10       160.2          1.58  00:29:27  3.2        4  Oboz Sawtooth Low
    5  2019-01-11       206.3          2.03  00:37:55  3.2        4  Oboz Sawtooth Low
    6  2019-01-13       253.2          2.49  00:46:33  3.2        4  Oboz Sawtooth Low

    Works perfectly.
    Almost…

    But these results contain all columns from both tables. Let’s take it a step further and filter the DataFrame, returning only the ‘day_walked’, ‘miles_walked’, and ‘name_brand’ columns by passing them in as a list:

    >>> merged_shoes_worn = merged_shoes_worn[['day_walked', 'miles_walked', 'name_brand']]

    And display the first 7 rows from the reassigned DataFrame using the head() method:

    >>> merged_shoes_worn.head(7)
       day_walked  miles_walked         name_brand
    0  2019-01-01          1.27  Oboz Sawtooth Low
    1  2019-01-07          2.03  Oboz Sawtooth Low
    2  2019-01-08          2.13  Oboz Sawtooth Low
    3  2019-01-09          1.94  Oboz Sawtooth Low
    4  2019-01-10          1.58  Oboz Sawtooth Low
    5  2019-01-11          2.03  Oboz Sawtooth Low
    6  2019-01-13          2.49  Oboz Sawtooth Low
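
The merge and the column selection can also be written as a single chained expression. The sketch below is one way to do it, using small stand-in DataFrames built by hand from a few of the rows shown in this post rather than the database tables. The sort_values() and reset_index() calls are my own additions to give the result a predictable date order.

```python
import pandas as pd

# Hand-built stand-ins for the 'stats' and 'shoes' DataFrames (assumption).
stats = pd.DataFrame(
    {
        "day_walked": ["2019-01-01", "2019-01-02", "2019-01-07"],
        "cal_burned": [132.8, 181.1, 207.3],
        "miles_walked": [1.27, 1.76, 2.03],
        "shoe_id": [4, 3, 4],
    }
)
shoes = pd.DataFrame(
    {
        "shoe_id": [3, 4],
        "name_brand": ["Keen Koven WP(keen-dry)", "Oboz Sawtooth Low"],
    }
)

# Merge, keep only the three columns of interest, then order by date.
merged_shoes_worn = (
    stats.merge(shoes, how="inner", on="shoe_id")
    .loc[:, ["day_walked", "miles_walked", "name_brand"]]
    .sort_values("day_walked")
    .reset_index(drop=True)
)
print(merged_shoes_worn)
```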

    Here’s a rundown on the 3 parameters used in this example. First, ‘shoes’ is the DataFrame I am merging with; it is passed as the right argument of merge().
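
Since every row in ‘stats’ should match exactly one row in the DataFrame being merged in, merge() can be asked to check that. The following is a minimal sketch, not something from my original session, using the validate parameter on hand-built stand-in DataFrames. With validate="many_to_one", pandas raises a MergeError if a ‘shoe_id’ appears more than once in the right DataFrame.

```python
import pandas as pd

# Hand-built stand-ins for the 'stats' and 'shoes' DataFrames (assumption).
stats = pd.DataFrame(
    {
        "day_walked": ["2019-01-01", "2019-01-02", "2019-01-07"],
        "miles_walked": [1.27, 1.76, 2.03],
        "shoe_id": [4, 3, 4],
    }
)
shoes = pd.DataFrame(
    {
        "shoe_id": [3, 4],
        "name_brand": ["Keen Koven WP(keen-dry)", "Oboz Sawtooth Low"],
    }
)

# many_to_one: many 'stats' rows may share a shoe_id, but each shoe_id
# must be unique in 'shoes'; otherwise pandas raises MergeError.
merged = stats.merge(shoes, how="inner", on="shoe_id", validate="many_to_one")
print(merged)
```

The check is handy because a duplicated key in the lookup table would otherwise quietly multiply rows in the result.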

    Next, the how='inner' parameter specifies a SQL-like JOIN. Below is a portion of the verbiage quoted from the on-line documentation:

    “inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.”

    (Note: inner is the default.)
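
To see what the how parameter changes, here is a small sketch comparing 'inner' with 'left' on hand-built stand-in DataFrames. The walk with shoe_id 9 is hypothetical; I added it only so there is a row with no match in ‘shoes’.

```python
import pandas as pd

# Hand-built stand-ins (assumption); shoe_id 9 is a made-up, unmatched key.
stats = pd.DataFrame(
    {
        "day_walked": ["2019-01-01", "2019-01-02", "2019-01-03"],
        "miles_walked": [1.27, 1.76, 1.50],
        "shoe_id": [4, 3, 9],
    }
)
shoes = pd.DataFrame(
    {
        "shoe_id": [3, 4],
        "name_brand": ["Keen Koven WP(keen-dry)", "Oboz Sawtooth Low"],
    }
)

# inner: only rows whose shoe_id exists in both DataFrames survive.
inner = stats.merge(shoes, how="inner", on="shoe_id")

# left: every 'stats' row is kept; unmatched rows get NaN for name_brand.
left = stats.merge(shoes, how="left", on="shoe_id")

print(len(inner), len(left))
print(left[left["name_brand"].isna()])
```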

    The on parameter specifies the column (or index level, according to the docs) that the JOIN should be performed on; it must be present in both DataFrames. I think of this as pandas’ version of the SQL clause ON table_1.column = table_2.column, so to speak.
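
When the key column has a different name in each DataFrame, on will not work, and merge() offers left_on and right_on instead. Below is a minimal sketch. The ‘shoes_alt’ DataFrame and its ‘id’ column are hypothetical, invented here only to illustrate mismatched key names.

```python
import pandas as pd

stats = pd.DataFrame(
    {
        "day_walked": ["2019-01-01", "2019-01-02", "2019-01-07"],
        "miles_walked": [1.27, 1.76, 2.03],
        "shoe_id": [4, 3, 4],
    }
)
# Hypothetical variant of 'shoes' whose key column is named 'id'.
shoes_alt = pd.DataFrame(
    {
        "id": [3, 4],
        "name_brand": ["Keen Koven WP(keen-dry)", "Oboz Sawtooth Low"],
    }
)

# Roughly: ... ON stats.shoe_id = shoes_alt.id
merged_alt = stats.merge(
    shoes_alt, how="inner", left_on="shoe_id", right_on="id"
).drop(columns="id")  # both key columns are kept, so drop the duplicate
print(merged_alt)
```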

    Visit the below on-line resources on many of the topics covered in this post for an in-depth look into them:

    The more I use and study pandas, the more it impresses me. Its interaction with SQL databases, and with SQLAlchemy in general, opens up several possibilities to work with data in different ways. As I continue to learn about it, expect to see more blog posts…

    Like what you have read? See anything incorrect? Please comment below and thanks for reading!!!


    To receive email notifications (Never Spam) from this blog (“Digital Owl’s Prose”) for the latest blog posts as they are published, please subscribe (of your own volition) by clicking the ‘Click To Subscribe!’ button in the sidebar on the homepage! (Feel free at any time to review the Digital Owl’s Prose Privacy Policy Page for any questions you may have about: email updates, opt-in, opt-out, contact forms, etc…)



    Josh Otwell has a passion to study and grow as a SQL Developer and blogger. Other favorite activities find him with his nose buried in a good book, article, or the Linux command line. Among those, he shares a love of tabletop RPG games, reading fantasy novels, and spending time with his wife and two daughters.

    Disclaimer: The examples presented in this post are hypothetical ideas of how to achieve similar types of results. They are not the utmost best solution(s). The majority, if not all, of the examples provided, is performed on a personal development/learning workstation-environment and should not be considered production quality or ready. Your particular goals and needs may vary. Use those practices that best benefit your needs and goals. Opinions are my own.
