Python and psycopg2 for CSV bulk upload in PostgreSQL – with examples…

In a previous post, I explored using both the COPY command and the CAST() function together to upload an entire CSV file’s data into a PostgreSQL database table. This post is a continuation, of sorts. However, I will use Python and the psycopg2 library, in a simple script, to handle these uploads instead of SQL. While I am still a novice with operations such as these, I feel that writing about them is a fantastic way to learn and obtain valuable feedback…


Note: All data, names, or naming found within the database presented in this post are strictly used for practice, learning, instruction, and testing purposes. They by no means depict actual data belonging to or in use by any party or organization.

OS and DB used:
  • Xubuntu Linux 18.04.2 LTS (Bionic Beaver)
  • PostgreSQL 11.4
  • Python 3.7


Self-Promotion:

If you enjoy the content written here, by all means, share this blog and your favorite post(s) with others who may benefit from or like it as well. Since coffee is my favorite drink, you can even buy me one if you would like!


Studied Resources

I found valuable information and learning from the below resources – which I borrowed heavily from for this post – so be sure to visit them as well:


To recap, I have this staging table where I initially ingest walking stats from a CSV.

walking_stats=> \d stat_staging;
             Table "public.stat_staging"
    Column    | Type | Collation | Nullable | Default
--------------+------+-----------+----------+---------
 day_walked   | text |           |          |
 cal_burned   | text |           |          |
 miles_walked | text |           |          |
 duration     | text |           |          |
 mph          | text |           |          |
 shoe_id      | text |           |          |

Presently, table ‘stat_staging’ is empty:

walking_stats=> SELECT COUNT(*) FROM stat_staging;
 count
-------
     0
(1 row)

I also have a Python file named db.py with this code:

import psycopg2 as pg
import csv

file = r'/path/to/feb_2019_hiking_stats.csv'
sql_insert = """INSERT INTO stat_staging(day_walked, cal_burned, miles_walked,
                duration, mph, shoe_id)
                VALUES(%s, %s, %s, %s, %s, %s)"""

conn = None
cursor = None
try:
    conn = pg.connect(user="my_user",
        password="my_password",
        host="127.0.0.1",
        port="5432",
        database="walking_stats")
    cursor = conn.cursor()
    with open(file, 'r') as f:
        reader = csv.reader(f)
        next(reader)  # This skips the 1st row, which is the header.
        for record in reader:
            cursor.execute(sql_insert, record)
    conn.commit()  # Commit once, after all rows have been inserted.
except (Exception, pg.Error) as e:
    print(e)
finally:
    # conn and cursor start as None so this block cannot raise a
    # NameError if pg.connect() itself failed above.
    if cursor:
        cursor.close()
    if conn:
        conn.close()
        print("Connection closed.")

Running the db.py script in a terminal from my virtual environment produces the output below:

(pg_py_database) my_linux_user:~/pg_py_database$ python3.7 db.py
Connection closed.

‘Connection closed.’ is output because execution reached the finally portion of the try/except/finally block in the script. This is likely not the most graceful or informative way to provide information, so I am definitely seeking better practices as I move forward with my learning.
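One better practice I have seen recommended is using the connection as a context manager, which handles the commit (on success) or rollback (on error) automatically. psycopg2 connections support this same with-statement pattern; since a live PostgreSQL server is not guaranteed here, the sketch below illustrates it with the standard library’s sqlite3 module instead, and the table and rows are made up for illustration:

```python
import sqlite3

# Illustrative stand-in: sqlite3 ships with Python, but psycopg2
# connections support the same "with conn:" transaction pattern:
# commit if the block succeeds, rollback if it raises an exception.
# Note that "with" does NOT close the connection; that stays manual.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stat_staging (day_walked TEXT, cal_burned TEXT)")

rows = [("2019-02-01", "243.6"), ("2019-02-03", "285.5")]
with conn:  # transaction scope
    for row in rows:
        # sqlite3 uses ? placeholders; psycopg2 would use %s here.
        conn.execute("INSERT INTO stat_staging VALUES (?, ?)", row)

row_count = conn.execute("SELECT COUNT(*) FROM stat_staging").fetchone()[0]
conn.close()
```

Either both INSERTs land or neither does, without writing the try/except/finally bookkeeping by hand.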

Basically, the execute() method accepts a query and some variables that are bound to the query via either the %s or %(name)s placeholders. As stated in the documentation – linked above – these parameters can be “a sequence or mapping”.
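Each row from csv.reader arrives as a list, which covers the “sequence” case and matches the positional %s placeholders above. The “mapping” case pairs csv.DictReader with %(name)s placeholders instead. A small sketch, with a made-up in-memory CSV snippet standing in for the real file:

```python
import csv
import io

# A made-up, two-column CSV snippet standing in for the real file.
csv_text = "day_walked,cal_burned\n2019-02-01,243.6\n"
reader = csv.DictReader(io.StringIO(csv_text))
record = next(reader)  # a mapping of column name -> value

# With a mapping, the query names its parameters instead of relying
# on column order, via the %(name)s placeholder style:
sql_insert = """INSERT INTO stat_staging(day_walked, cal_burned)
                VALUES(%(day_walked)s, %(cal_burned)s)"""
# cursor.execute(sql_insert, record)  # psycopg2 would bind by key here
```

Binding by name makes the script resilient to column reordering in the CSV, at the cost of requiring a header row whose names match the placeholders.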

Next, in a psql session, I’ll query table ‘stat_staging’, verifying the records were inserted:

walking_stats=> SELECT * FROM stat_staging;
 day_walked | cal_burned | miles_walked | duration  | mph  | shoe_id
------------+------------+--------------+-----------+------+---------
 2019-02-01 |  243.6     |  2.46        |  00:44:48 |  3.3 |  4
 2019-02-03 |  285.5     |  2.91        |  00:52:29 |  3.3 |  4
 2019-02-04 |  237.6     |  2.44        |  00:43:41 |  3.4 |  4
 2019-02-05 |  242.3     |  2.41        |  00:44:33 |  3.2 |  4
 2019-02-06 |  204.9     |  2.03        |  00:37:40 |  3.2 |  4
 2019-02-07 |  183.5     |  1.80        |  00:33:44 |  3.2 |  4
 2019-02-08 |  177.5     |  1.71        |  00:32:38 |  3.1 |  4
 2019-02-10 |  244.8     |  2.47        |  00:45:00 |  3.3 |  4
 2019-02-11 |  232.8     |  2.33        |  00:42:48 |  3.3 |  4
 2019-02-12 |  241.8     |  2.39        |  00:44:27 |  3.2 |  4
 2019-02-13 |  235.2     |  2.34        |  00:43:15 |  3.2 |  4
 2019-02-14 |  245.5     |  2.45        |  00:45:08 |  3.3 |  4
 2019-02-15 |  204.8     |  2.03        |  00:37:38 |  3.2 |  4
 2019-02-17 |  244.9     |  2.46        |  00:45:01 |  3.3 |  4
 2019-02-18 |  246.0     |  2.50        |  00:45:14 |  3.3 |  4
 2019-02-18 |  201.9     |  2.07        |  00:37:07 |  3.3 |  4
 2019-02-20 |  201.8     |  2.05        |  00:37:06 |  3.3 |  4
 2019-02-21 |  179.5     |  1.80        |  00:33:00 |  3.3 |  4
 2019-02-22 |  164.0     |  1.64        |  00:30:09 |  3.3 |  4
 2019-02-24 |  241.3     |  2.40        |  00:44:22 |  3.2 |  4
 2019-02-25 |  247.2     |  2.47        |  00:45:27 |  3.3 |  4
 2019-02-26 |  238.9     |  2.35        |  00:43:55 |  3.2 |  4
 2019-02-27 |  244.1     |  2.46        |  00:44:52 |  3.3 |  4
 2019-02-28 |  246.2     |  2.46        |  00:45:16 |  3.3 |  4
(24 rows)

Where table ‘stat_staging’ was previously empty, it now has 24 rows of data from the successful INSERT operations in the db.py script. Although this CSV file is relatively small, is using the execute() method like this the most efficient way? Could looping and performing the INSERT for each record have a performance impact with that many individual writes?
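One option worth testing is cutting down the number of statements and round trips by grouping rows into batches before sending them to the database. The chunks() helper below is hypothetical (not part of the original script); each list it yields could then be handed to a single bulk call, with one commit per batch:

```python
def chunks(rows, size):
    """Yield successive lists of at most `size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # leftover rows that did not fill a full batch

# With a batch size of 4, ten records become three batches, not ten:
batches = list(chunks(list(range(10)), 4))
```

Because chunks() is a generator, it never holds more than one batch in memory, which matters more as the CSV files grow beyond a month of walking stats.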

All that being said, psycopg2 does have an executemany() method. Is it better suited for tasks like this? That will be the focus of an upcoming blog post, where I will at least explore its use. Be sure to check back in and read it when it is published as well!
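As a small preview, executemany() is part of the Python DB-API that psycopg2 implements, so its call shape carries over between drivers. The sketch below uses the standard library’s sqlite3 module (also a DB-API driver) so it runs without a PostgreSQL server; with psycopg2 the placeholders would be %s rather than ?, and the CSV rows here are made up for illustration:

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stat_staging (day_walked TEXT, cal_burned TEXT)")

# Made-up rows standing in for the real CSV file.
csv_text = "day_walked,cal_burned\n2019-02-01,243.6\n2019-02-03,285.5\n"
reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row
rows = list(reader)

cursor = conn.cursor()
# One call binds and executes the INSERT once per row in `rows`.
cursor.executemany("INSERT INTO stat_staging VALUES (?, ?)", rows)
conn.commit()

inserted = cursor.execute("SELECT COUNT(*) FROM stat_staging").fetchone()[0]
conn.close()
```

Whether executemany() actually beats a Python-level loop of execute() calls is exactly the kind of question the follow-up post can measure.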

Like what you have read? See anything incorrect? Please comment below and thanks for reading!!!

Explore the official PostgreSQL 11 On-line Documentation for more information.

A Call To Action!

Thank you for taking the time to read this post. I truly hope you discovered something interesting and enlightening. Please share your findings here, with someone else you know who would get the same value out of it as well.

Visit the Portfolio-Projects page to see blog post/technical writing I have completed for clients.

Have I mentioned how much I love a cup of coffee?!?!

To receive email notifications (Never Spam) from this blog (“Digital Owl’s Prose”) for the latest blog posts as they are published, please subscribe (of your own volition) by clicking the ‘Click To Subscribe!’ button in the sidebar on the homepage! (Feel free at any time to review the Digital Owl’s Prose Privacy Policy Page for any questions you may have about: email updates, opt-in, opt-out, contact forms, etc…)

Be sure and visit the “Best Of” page for a collection of my best blog posts.


Josh Otwell has a passion to study and grow as a SQL Developer and blogger. Other favorite activities find him with his nose buried in a good book, article, or the Linux command line. Among those, he shares a love of tabletop RPG games, reading fantasy novels, and spending time with his wife and two daughters.

Disclaimer: The examples presented in this post are hypothetical ideas of how to achieve similar types of results. They are not the utmost best solution(s). The majority, if not all, of the examples provided are performed on a personal development/learning workstation environment and should not be considered production quality or ready. Your particular goals and needs may vary. Use those practices that best benefit your needs and goals. Opinions are my own.

