Pandas and the python os module use case – appending source file information from CSV’s.

In my day job, I use Python to automate several redundant tasks. Between the Pandas library and the CSV module, there is always something available for me to reach for. I typically process several different CSV’s each day with a planned final destination in a SQL database. While contemplating the schema design, I determined it would be best to store the actual source file information from which the data is derived, using the source files’ name and appending it to the end of each row. How did Python help me accomplish this? Continue reading to find out…

various-color-folders-stacked-atop-each-other
Photo by Omid Kashmari on Unsplash

OS and Python used:

  • OpenSuse Leap 15.1
  • Python 3.7.2


Self-Promotion:

If you enjoy the content written here, by all means, share this blog and your favorite post(s) with others who may benefit from or like it as well. Since coffee is my favorite drink, you can even buy me one if you would like!


While I won’t necessarily share any of the GIS survey data I process (for proprietary reasons), the concept remains the same and should work across any CSV file.

To get started, I’ll import pandas along with the os module, to take advantage of its operating system and file/folder functionalities:

1
2
>>> import pandas as pd
>>> import os

I have several CSV files in which I store my daily walking stats (distance, time spent walking, calories burned, etc). In this example, I’ll target October’s file, assigning it to a variable named ‘target_csv’:

1
>>> target_csv = r'/home/joshua/env_37/oct_2019_hiking_stats.csv'

Since the CSV file doesn’t have a header row, I’ll create one – as a python list – and pass it to the names argument when calling the read_csv() function:

1
2
>>> col_names = ['day_walked', 'calories_burned', 'distance_walked', 'duration', 'mph', 'shoe_id']
>>> data = pd.read_csv(target_csv, names=col_names)

Calling the pandas head() function, we can see the first 5 rows of the DataFrame:

1
2
3
4
5
6
7
>>> data.head()
day_walked  calories_burned  distance_walked  duration  mph  shoe_id
0  2019-10-01            370.3             3.63  01:08:05  3.2        5
1  2019-10-03            375.5             3.67  01:09:03  3.2        5
2  2019-10-04            294.2             2.88  00:54:05  3.2        5
3  2019-10-06            366.7             3.64  01:07:26  3.2        5
4  2019-10-07            372.0             3.66  01:08:24  3.2        5

In order to get the source file name, we can drill down into the actual files’ storage location with the os module. Among the many available functions, there is a os.path.splitext() (split extension) function that can be used to extract the file (type) extension and name:

1
>>> f_name, f_ext = os.path.splitext(target_csv)

After assigning both the file extension and file name values to variables, we can further use them in the processing. Here are their current values:

1
2
3
4
>>> print(f_name)
/home/joshua/env_37/oct_2019_hiking_stats
>>> print(f_ext)
.csv

We can see that the file name value happens to include the absolute file path. While that is useful information, it is not exactly what I was going for. I prefer to use just the actual file name, oct_2019_hiking_stats.

However, Python comes through in the clutch once again. The os module provides a basename() method that is exactly what I need to leverage. Passing in the ‘f_name’ variable as its argument returns the actual file name value without the attached absolute path:

1
2
3
>>> source_file = os.path.basename(f_name)
>>> print(source_file)
oct_2019_hiking_stats

Here I’ll assign ‘source_file’ as another column to the ‘data’ DataFrame and include the source file name value in each row:

1
2
3
4
5
6
7
8
>>> data['source_file'] = source_file
>>> data.head()
   day_walked  calories_burned  distance_walked  duration  mph  shoe_id            source_file
0  2019-10-01            370.3             3.63  01:08:05  3.2        5  oct_2019_hiking_stats
1  2019-10-03            375.5             3.67  01:09:03  3.2        5  oct_2019_hiking_stats
2  2019-10-04            294.2             2.88  00:54:05  3.2        5  oct_2019_hiking_stats
3  2019-10-06            366.7             3.64  01:07:26  3.2        5  oct_2019_hiking_stats
4  2019-10-07            372.0             3.66  01:08:24  3.2        5  oct_2019_hiking_stats

I’ll then overwrite the original ‘oct_2019_hiking_stats.csv’ file using the current ‘data’ DataFrame values:

1
>>> data.to_csv(target_csv, index=False)

And here are its contents now:

image-of-csv-file
New csv file with header row and source_file column row.

Other posts you may be interested in…

Feel free to visit several these other blog posts I have written on pandas if you are interested:

Try combining pandas and the os module for your next task in Python and let me know all about it in the comments below.

Thanks for reading!!!

Like what you have read? See anything incorrect? Please comment below and thanks for reading!!!

A Call To Action!

Thank you for taking the time to read this post. I truly hope you discovered something interesting and enlightening. Please share your findings here, with someone else you know who would get the same value out of it as well.

Visit the Portfolio-Projects page to see blog post/technical writing I have completed for clients.

Have I mentioned how much I love a cup of coffee?!?!

To receive email notifications (Never Spam) from this blog (“Digital Owl’s Prose”) for the latest blog posts as they are published, please subscribe (of your own volition) by clicking the ‘Click To Subscribe!’ button in the sidebar on the homepage! (Feel free at any time to review the Digital Owl’s Prose Privacy Policy Page for any questions you may have about: email updates, opt-in, opt-out, contact forms, etc…)

Be sure and visit the “Best Of” page for a collection of my best blog posts.


Josh Otwell has a passion to study and grow as a SQL Developer and blogger. Other favorite activities find him with his nose buried in a good book, article, or the Linux command line. Among those, he shares a love of tabletop RPG games, reading fantasy novels, and spending time with his wife and two daughters.

Disclaimer: The examples presented in this post are hypothetical ideas of how to achieve similar types of results. They are not the utmost best solution(s). The majority, if not all, of the examples provided, is performed on a personal development/learning workstation-environment and should not be considered production quality or ready. Your particular goals and needs may vary. Use those practices that best benefit your needs and goals. Opinions are my own.

Advertisements

Hey thanks for commenting! Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.