In my day job, I use Python to automate several redundant tasks. Between the Pandas library and the CSV module, there is always something available for me to reach for. I typically process several different CSV’s each day with a planned final destination in a SQL database. While contemplating the schema design, I determined it would be best to store the actual source file information from which the data is derived, using the source files’ name and appending it to the end of each row. How did Python help me accomplish this? Continue reading to find out…
OS and Python used:
- OpenSuse Leap 15.1
- Python 3.7.2
If you enjoy the content written here, by all means, share this blog and your favorite post(s) with others who may benefit from or like it as well. Since coffee is my favorite drink, you can even buy me one if you would like!
Interested in Learning Python? Check out this great resource: What is Python Used for? | Major Industries.
While I won’t necessarily share any of the GIS survey data I process (for proprietary reasons), the concept remains the same and should work across any CSV file.
To get started, I’ll import pandas along with the
os module, to take advantage of its operating system and file/folder functionalities:
>>> import pandas as pd
>>> import os
I have several CSV files in which I store my daily walking stats (distance, time spent walking, calories burned, etc). In this example, I’ll target October’s file, assigning it to a variable named ‘target_csv’:
>>> target_csv = r'/home/joshua/env_37/oct_2019_hiking_stats.csv'
Since the CSV file doesn’t have a header row, I’ll create one – as a python list – and pass it to the
names argument when calling the
>>> col_names = ['day_walked', 'calories_burned', 'distance_walked', 'duration', 'mph', 'shoe_id']
>>> data = pd.read_csv(target_csv, names=col_names)
Calling the pandas
head() function, we can see the first 5 rows of the DataFrame:
day_walked calories_burned distance_walked duration mph shoe_id
0 2019-10-01 370.3 3.63 01:08:05 3.2 5
1 2019-10-03 375.5 3.67 01:09:03 3.2 5
2 2019-10-04 294.2 2.88 00:54:05 3.2 5
3 2019-10-06 366.7 3.64 01:07:26 3.2 5
4 2019-10-07 372.0 3.66 01:08:24 3.2 5
In order to get the source file name, we can drill down into the actual files’ storage location with the
os module. Among the many available functions, there is a
os.path.splitext() (split extension) function that can be used to extract the file (type) extension and name:
>>> f_name, f_ext = os.path.splitext(target_csv)
After assigning both the file extension and file name values to variables, we can further use them in the processing. Here are their current values:
We can see that the file name value happens to include the absolute file path. While that is useful information, it is not exactly what I was going for. I prefer to use just the actual file name, oct_2019_hiking_stats.
However, Python comes through in the clutch once again. The
os module provides a
basename() method that is exactly what I need to leverage. Passing in the ‘f_name’ variable as its argument returns the actual file name value without the attached absolute path:
>>> source_file = os.path.basename(f_name)
Here I’ll assign ‘source_file’ as another column to the ‘data’ DataFrame and include the source file name value in each row:
>>> data['source_file'] = source_file
day_walked calories_burned distance_walked duration mph shoe_id source_file
0 2019-10-01 370.3 3.63 01:08:05 3.2 5 oct_2019_hiking_stats
1 2019-10-03 375.5 3.67 01:09:03 3.2 5 oct_2019_hiking_stats
2 2019-10-04 294.2 2.88 00:54:05 3.2 5 oct_2019_hiking_stats
3 2019-10-06 366.7 3.64 01:07:26 3.2 5 oct_2019_hiking_stats
4 2019-10-07 372.0 3.66 01:08:24 3.2 5 oct_2019_hiking_stats
I’ll then overwrite the original ‘oct_2019_hiking_stats.csv’ file using the current ‘data’ DataFrame values:
>>> data.to_csv(target_csv, index=False)
And here are its contents now:
Other posts you may be interested in…
Feel free to visit several these other blog posts I have written on pandas if you are interested:
- Basic CSV file import and exploration with Pandas – first steps.
- Pandas dataframe ordering with examples using sort_values().
- How to drop columns from a Pandas DataFrame – with examples.
- Renaming pandas DataFrame columns – with examples.
- Easy datetime manipulation with the pandas to_datetime() function.
Try combining pandas and the
os module for your next task in Python and let me know all about it in the comments below.
Thanks for reading!!!
Like what you have read? See anything incorrect? Please comment below and thanks for reading!!!
A Call To Action!
Thank you for taking the time to read this post. I truly hope you discovered something interesting and enlightening. Please share your findings here, with someone else you know who would get the same value out of it as well.
Visit the Portfolio-Projects page to see blog post/technical writing I have completed for clients.
Be sure and visit the “Best Of” page for a collection of my best blog posts.
Josh Otwell has a passion to study and grow as a SQL Developer and blogger. Other favorite activities find him with his nose buried in a good book, article, or the Linux command line. Among those, he shares a love of tabletop RPG games, reading fantasy novels, and spending time with his wife and two daughters.
Disclaimer: The examples presented in this post are hypothetical ideas of how to achieve similar types of results. They are not the utmost best solution(s). The majority, if not all, of the examples provided, is performed on a personal development/learning workstation-environment and should not be considered production quality or ready. Your particular goals and needs may vary. Use those practices that best benefit your needs and goals. Opinions are my own.