PostgreSQL LEFT() and RIGHT() functions revisited – String comparison use case.

In my day job as a Pipeline Survey Data Analyst, I sometimes have the opportunity to write custom SQL queries in an MS Access database, which serves as the back end of one of the proprietary GIS solutions my colleagues use. Although Access's SQL implementation is not quite as robust as other SQL dialects, it gets the job done in certain situations (and the visual interface continues to grow on me). As a learning experiment, I decided to reproduce – and solve – the same requirements in PostgreSQL that I had handled in the MS Access environment. However, I discovered an altogether different challenge.

Both MS Access and MySQL provide several string functions, including a particularly useful string comparison function: MySQL has STRCMP(), while Access has the similar StrComp(). Postgres does not have its own version that I am aware of. I leaned heavily on this type of function in MS Access, so hopefully readers will clue me in on a Postgres-specific string function I could use in its stead. Meanwhile, read on to see the workaround I used…

[Head this way for great PostgreSQL blogging >>>]

Pandas concat() then to_sql() – CSV upload to PostgreSQL

Recently, I wrote a post about inserting pandas DataFrame values into a PostgreSQL table, using SQLAlchemy data types to cast them to the desired data type. In this post, I'll explore – and leverage – pandas.concat() and get a two-for-one by merging two pandas DataFrames into one prior to uploading.

[Head this way for great PostgreSQL blogging >>>]

Pandas to SQL – importing CSV data files into PostgreSQL

My goal with this post is to cover what I have learned while inserting pandas DataFrame values into a PostgreSQL table using SQLAlchemy. Interested in learning about this yourself? Want to see a simple example? You are in the right place so keep reading and learn with me…

[Head this way for great PostgreSQL blogging >>>]

Basic CSV file import and exploration with Pandas – first steps.

Having recently begun to use the Python pandas library in my day job to automate mostly redundant tasks with CSV files, I thought I would step outside of my normal – and more comfortable – wheelhouse, SQL, to share some of the beginner pandas things I am learning and using. The simple fact remains: the more I use pandas, the more I like it. Although the data set used here is not the same as the one I interact with at work, the same principles carry over.

Photo by Mika Baumeister on Unsplash
OS and software used:
  • Xubuntu Linux 18.04.2 LTS (Bionic Beaver)
  • Python 3.7.4
  • pandas-0.25.0


Self-Promotion:

If you enjoy the content written here, by all means, share this blog and your favorite post(s) with others who may benefit from or like it as well. Since coffee is my favorite drink, you can even buy me one if you would like!


It’s no secret that CSV files are used everywhere. Python itself has fantastic support for working with CSV files and data, out of the box, by way of the CSV module. However, I have really taken a liking to the pandas library for its CSV functionality.

I’ve had great success with pandas in my day job, scripting away redundant data-cleaning workflows prior to importing CSV data into an AutoCAD-based points application.

To use pandas, it must be installed, which can typically be accomplished with a simple:

pip install pandas

Once installed, the pandas module must be imported before use. Below is the common convention for importing pandas:

>>> import pandas as pd

Now with pandas available, we can begin to work with it.

I have a CSV file, for the month of April, storing stats from the daily walks/hikes I track. Pandas provides a handy read_csv() function that accepts a multitude of parameters. Consult the official documentation (linked below) for full descriptions of each, as I won't explore them all here.

Using the read_csv() function, I’ll create a pandas DataFrame by passing in the path to the CSV file as the lone argument in the function call:

>>> apr_data = pd.read_csv(r'/home/linux_user/Practice Data/Fitness_DB_Data/apr_2019_hiking_stats.csv')

We can see, by printing out the ‘apr_data’ variable's type in the console, that it is indeed a DataFrame object:

>>> print(type(apr_data))
<class 'pandas.core.frame.DataFrame'>

Below are just a few common functions and attributes you can use to garner some quick insights about the data in a DataFrame.

Accessing the shape attribute on the DataFrame object returns a tuple with the number of data rows (the header is not counted) and columns:

>>> apr_data.shape
(21, 6)

Passing in the skiprows parameter, you can skip any number of rows when reading in the CSV. Here I skip the first row, which is the header; the first data row then gets treated as the header instead.

The shape attribute confirms there is one less row in this DataFrame, since we skipped the first row:

>>> no_header_data = pd.read_csv(r'/home/linux_user/Practice Data/Fitness_DB_Data/apr_2019_hiking_stats.csv', skiprows=1)
>>> no_header_data.shape
(20, 6)
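As the shape above shows, skiprows=1 costs us a data row because the first data row becomes the header. Not covered in the original post, but if the goal is to replace the header rather than lose a row, read_csv() also accepts header and names parameters. Below is a minimal sketch using an inline sample that mimics the file's layout (the inline data here is a stand-in, not the actual file):

```python
import io

import pandas as pd

# Inline stand-in for apr_2019_hiking_stats.csv (same column layout)
csv_data = """day_walked,cal_burned,miles_walked,duration,mph,shoe_id
2019-04-01,217.7,2.18,00:40:01,3.3,4
2019-04-02,240.1,2.39,00:44:09,3.2,4
"""

# Skip the file's header row, but supply column names ourselves so
# no data row gets consumed as the header.
named_data = pd.read_csv(
    io.StringIO(csv_data),
    skiprows=1,
    header=None,
    names=["day_walked", "cal_burned", "miles_walked",
           "duration", "mph", "shoe_id"],
)
print(named_data.shape)  # all data rows survive: (2, 6)
```

This way you keep every data row while still controlling the column names.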

To see more information about a DataFrame – indexes, null values (if any), and column data types – pandas has the info() method:

>>> apr_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 6 columns):
day_walked      21 non-null object
cal_burned      21 non-null float64
miles_walked    21 non-null float64
duration        21 non-null object
mph             21 non-null float64
shoe_id         21 non-null int64
dtypes: float64(3), int64(1), object(2)
memory usage: 1.1+ KB
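Notice that info() reports day_walked as an object (string) column, even though it holds dates. If you want real dates, read_csv() offers the parse_dates parameter. A minimal sketch, again using an inline stand-in for the CSV file:

```python
import io

import pandas as pd

# Inline stand-in mirroring the CSV's layout
csv_data = """day_walked,cal_burned,miles_walked,duration,mph,shoe_id
2019-04-01,217.7,2.18,00:40:01,3.3,4
2019-04-02,240.1,2.39,00:44:09,3.2,4
"""

# parse_dates tells read_csv to convert the named column(s) to
# datetime64 instead of leaving them as plain object (string) columns.
dated = pd.read_csv(io.StringIO(csv_data), parse_dates=["day_walked"])
print(dated["day_walked"].dtype)  # datetime64[ns]
```

With a true datetime column, date arithmetic and filtering become much easier down the line.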

These tidbits of functionality are handy without a doubt; however, perhaps you just want to see some data. Instead of loading all the rows present in a DataFrame, use the head() method, which by default returns the first 5 rows:

>>> apr_data.head()
   day_walked   cal_burned   miles_walked   duration   mph   shoe_id
0  2019-04-01        217.7           2.18   00:40:01   3.3         4
1  2019-04-02        240.1           2.39   00:44:09   3.2         4
2  2019-04-03        152.7           1.51   00:28:04   3.2         4
3  2019-04-04        207.6           2.04   00:38:10   3.2         4
4  2019-04-05        247.8           2.43   00:45:34   3.2         4

Notice the result of calling head() on the DataFrame we created that skipped the header row – the first data row is now serving as the column headers:

>>> no_header_data.head()
   2019-04-01   217.7   2.18   00:40:01   3.3   4
0  2019-04-02   240.1   2.39   00:44:09   3.2   4
1  2019-04-03   152.7   1.51   00:28:04   3.2   4
2  2019-04-04   207.6   2.04   00:38:10   3.2   4
3  2019-04-05   247.8   2.43   00:45:34   3.2   4
4  2019-04-07   294.5   2.89   00:54:09   3.2   4

The head() function is not restricted to just 5 rows. Pass a number as the argument when calling it to see only that many rows:

>>> apr_data.head(2)
   day_walked   cal_burned   miles_walked   duration   mph   shoe_id
0  2019-04-01        217.7           2.18   00:40:01   3.3         4
1  2019-04-02        240.1           2.39   00:44:09   3.2         4

This function is handy when you want to get a sense of the dataset (DataFrame) you are working with.
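Two related methods, not shown above, round out these quick first looks: tail() is head()'s counterpart for the last rows, and describe() summarizes the numeric columns. A short sketch, using an inline sample in place of the actual CSV file:

```python
import io

import pandas as pd

# Inline stand-in for the walking-stats CSV
csv_data = """day_walked,cal_burned,miles_walked,duration,mph,shoe_id
2019-04-01,217.7,2.18,00:40:01,3.3,4
2019-04-02,240.1,2.39,00:44:09,3.2,4
2019-04-03,152.7,1.51,00:28:04,3.2,4
"""
df = pd.read_csv(io.StringIO(csv_data))

# tail() mirrors head(): last rows instead of first
last_two = df.tail(2)

# describe() reports count, mean, std, min, max, and quartiles
# for each numeric column
summary = df.describe()
print(last_two)
print(summary)
```

Between head(), tail(), and describe(), you can size up an unfamiliar dataset in just a few lines.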

Below you will find documentation links to many of the topics I have covered in the post for your convenience:

The power of pandas goes way beyond what I have covered here. If you have never used it, or are just curious about it, do check it out. As I use and experiment with pandas, expect to see more blog posts in the future, especially ones where I can integrate and leverage it in SQL development learning and workflows.

Like what you have read? See anything incorrect? Please comment below and thanks for reading!!!

A Call To Action!

Thank you for taking the time to read this post. I truly hope you discovered something interesting and enlightening. Please share this post with someone else you know who would get the same value out of it.

Visit the Portfolio-Projects page to see blog post/technical writing I have completed for clients.

Have I mentioned how much I love a cup of coffee?!?!

To receive email notifications (Never Spam) from this blog (“Digital Owl’s Prose”) for the latest blog posts as they are published, please subscribe (of your own volition) by clicking the ‘Click To Subscribe!’ button in the sidebar on the homepage! (Feel free at any time to review the Digital Owl’s Prose Privacy Policy Page for any questions you may have about: email updates, opt-in, opt-out, contact forms, etc…)

Be sure to visit the “Best Of” page for a collection of my best blog posts.


Josh Otwell has a passion to study and grow as a SQL Developer and blogger. Other favorite activities find him with his nose buried in a good book, article, or the Linux command line. Among those, he shares a love of tabletop RPG games, reading fantasy novels, and spending time with his wife and two daughters.

Disclaimer: The examples presented in this post are hypothetical ideas of how to achieve similar types of results. They are not necessarily the best solution(s). The majority, if not all, of the examples provided are performed in a personal development/learning workstation environment and should not be considered production quality or ready. Your particular goals and needs may vary. Use those practices that best benefit your needs and goals. Opinions are my own.

Postgres, Python, and Psycopg2 – executemany() method CSV upload example.

Having previously covered a couple of different ways to upload a CSV’s worth of data into a PostgreSQL table, this post will be quite similar to the second one, with a slight change in the psycopg2 method used. Visit COPY and CAST() – Bulk uploads in PostgreSQL, and Python and psycopg2 for CSV bulk uploads in PostgreSQL – with examples to get up to speed. Aside from that, read on to see the differences between the methods used…

[Head this way for great PostgreSQL blogging >>>]