pandas Archives - Page 2 of 2 - Digital Owl's Prose

Having recently begun to use the Python pandas library in my day job to mitigate mostly redundant tasks with CSV files, I thought to step outside of my normal – and more comfortable wheelhouse – SQL, to share some of the beginner pandas things I am learning and using. The simple fact remains: the more I use pandas, the more I like it. Although the data set used here is not the same as the one I interact with at work, the same principles carryover.

Photo by Mika Baumeister on Unsplash

OS and software used:

Xubuntu Linux 18.04.2 LTS (Bionic Beaver)
Python 3.7.4
pandas-0.25.0

Self-Promotion:

If you enjoy the content written here, by all means, share this blog and your favorite post(s) with others who may benefit from or like it as well. Since coffee is my favorite drink, you can even buy me one if you would like!

It’s no secret that CSV files are used everywhere. Python itself has fantastic support for working with CSV files and data, out of the box, by way of the CSV module. However, I have really taken a liking to the pandas library for its CSV functionality.

I’ve had great success with pandas in scripting away redundant data cleaning work flows prior to importing CSV data into an AutoCADD points base application in my day job.

To use pandas, it must be installed, which can typically be accomplished with a simple:

1	pip install pandas

Once installed, pandas has to be loaded via importing the module. Below is a common convention when importing pandas:

1	>>> import pandas as pd

Now with pandas available, we can begin to work with it.

I have a CSV file, for the month of April, storing walking stats from my daily walks/hikes I track. Pandas provides a handy read_csv() function that accepts a multitude of different parameters. Consult the official documentation (linked below) for full descriptions on each of them as I won’t necessarily explore them all here.

Using the read_csv() function, I’ll create a pandas DataFrame by passing in the path to the CSV file as the lone argument in the function call:

1	>>> apr_data = pd.read_csv(r'/home/linux_user/Practice Data/Fitness_DB_Data/apr_2019_hiking_stats.csv')

We can see via printing out the ‘apr_data’ variables’ type in the console that it is indeed a DataFrame object:

1 2	>>> print(type(apr_data)) <class 'pandas.core.frame.DataFrame'>

Below are just a few common functions you can use to garner some quick insights and information about the data in a DataFrame.

Attaching shape to the DataFrame object, returns a tuple with the number of rows – including the header row – and columns:

1 2	>>> apr_data.shape (21, 6)

Passing in the skiprows parameter, you can skip any number of rows when reading in the CSV. Here I skip the first row, which is the header.

The shape construct confirms there is one less row in this DataFrame since we skipped the first row:

1
2
3

>>> no_header_data = pd.read_csv(r'/home/linux_user/Practice Data/Fitness_DB_Data/apr_2019_hiking_stats.csv', skiprows=1)
>>> no_header_data.shape
(20, 6)

To see more information about a DataFrame like: indexes, null values (if any), along with column data types, pandas has the info() method:

1
2
3
4
5
6
7
8
9
10
11
12

>>> apr_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 6 columns):
day_walked 21 non-null object
cal_burned 21 non-null float64
miles_walked 21 non-null float64
duration 21 non-null object
mph 21 non-null float64
shoe_id 21 non-null int64
dtypes: float64(3), int64(1), object(2)
memory usage: 1.1+ KB

These tidbits of functionality are quite handy without a doubt, however, perhaps you just want to see some data. Instead of loading all the rows present in a DataFrame, use the head() function, which by default returns the first 5 rows:

1
2
3
4
5
6
7

>>> apr_data.head()
day_walked cal_burned miles_walked duration mph shoe_id
0 2019-04-01 217.7 2.18 00:40:01 3.3 4
1 2019-04-02 240.1 2.39 00:44:09 3.2 4
2 2019-04-03 152.7 1.51 00:28:04 3.2 4
3 2019-04-04 207.6 2.04 00:38:10 3.2 4
4 2019-04-05 247.8 2.43 00:45:34 3.2 4

Notice the call to the DataFrame we created that skipped the header row:

1
2
3
4
5
6
7

>>> no_header_data.head()
2019-04-01 217.7 2.18 00:40:01 3.3 4
0 2019-04-02 240.1 2.39 00:44:09 3.2 4
1 2019-04-03 152.7 1.51 00:28:04 3.2 4
2 2019-04-04 207.6 2.04 00:38:10 3.2 4
3 2019-04-05 247.8 2.43 00:45:34 3.2 4
4 2019-04-07 294.5 2.89 00:54:09 3.2 4

The head() function is not restricted to just 5 rows. You can pass a number value as the argument when calling it to see only that many rows:

1
2
3
4

>>> apr_data.head(2)
day_walked cal_burned miles_walked duration mph shoe_id
0 2019-04-01 217.7 2.18 00:40:01 3.3 4
1 2019-04-02 240.1 2.39 00:44:09 3.2 4

This function is handy when you want to get a sense of the dataset (DataFrame) you are working with.

Below you will find documentation links to many of the topics I have covered in the post for your convenience:

The power of pandas goes way beyond that of which I have covered here. If you have never used it, or just curious about it, do check it out. As I use and experiment with pandas, expect to see more blog posts in the future, especially those where I can integrate and leverage it in SQL development learning and work flows.

Like what you have read? See anything incorrect? Please comment below and thanks for reading!!!

A Call To Action!

Thank you for taking the time to read this post. I truly hope you discovered something interesting and enlightening. Please share your findings here, with someone else you know who would get the same value out of it as well.

Visit the Portfolio-Projects page to see blog post/technical writing I have completed for clients.

Have I mentioned how much I love a cup of coffee?!?!

To receive email notifications (Never Spam) from this blog (“Digital Owl’s Prose”) for the latest blog posts as they are published, please subscribe (of your own volition) by clicking the ‘Click To Subscribe!’ button in the sidebar on the homepage! (Feel free at any time to review the Digital Owl’s Prose Privacy Policy Page for any questions you may have about: email updates, opt-in, opt-out, contact forms, etc…)

Be sure and visit the “Best Of” page for a collection of my best blog posts.

Josh Otwell has a passion to study and grow as a SQL Developer and blogger. Other favorite activities find him with his nose buried in a good book, article, or the Linux command line. Among those, he shares a love of tabletop RPG games, reading fantasy novels, and spending time with his wife and two daughters.

Disclaimer: The examples presented in this post are hypothetical ideas of how to achieve similar types of results. They are not the utmost best solution(s). The majority, if not all, of the examples provided, is performed on a personal development/learning workstation-environment and should not be considered production quality or ready. Your particular goals and needs may vary. Use those practices that best benefit your needs and goals. Opinions are my own.

Digital Owl's Prose

SQL and PHP Developer Diaries.

pandas

Pandas dataframe ordering with examples using sort_values().

Like this:

Pandas merge() and read_sql() – joining DataFrames.

Like this:

Pandas concat() then to_sql() – CSV upload to PostgreSQL

Like this:

Basic CSV file import and exploration with Pandas – first steps.

Self-Promotion:

A Call To Action!

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Self-Promotion:

A Call To Action!

Share this:

Like this: