Data prepping with Sed for bulk uploads using the PostgreSQL COPY command.

This post will highlight usage of the PostgreSQL COPY command. Along the way, I will use Sed, a Unix command-line utility, to ‘massage’ sample data into a more uniformed structure. Once the data has been formatted to our requirement, we will then load it into a PostgreSQL table.


Note: All data, names or naming found within the database presented in this post, are strictly used for practice, learning, instruction, and testing purposes. It by no means depicts actual data belonging to or being used by any party or organization.

I will be using Xubuntu Linux 16.04.2 LTS (Xenial Xerus) and PostgreSQL 9.6.4 for these exercises.


Objectives

  • Create a ‘staging’ table in the fab_tracking database for these bulk uploads and inserts.
  • Convert the data files to a consistent format.
  • Use the COPY command on the server-side to upload the sample data.

Staging…

First, I will create a table in the fab_tracking database to house this newly uploaded data.

fab_tracking=> CREATE TABLE staging_assets (
pipe_id VARCHAR(25) PRIMARY KEY,
kind VARCHAR(25),
pipe_jt_num VARCHAR(25),
pipe_heat VARCHAR(25),
pipe_length NUMERIC(4,2),
pipe_wall_thickness NUMERIC(4,3) DEFAULT 0.000,
degree NUMERIC(4,2) DEFAULT 0);
CREATE TABLE

Listing out the present database tables below confirms the staging_assets table exists:

fab_tracking=> \dt
List of relations
Schema | Name | Type | Owner
--------+----------------+-------+--------
public | bend_assets | table | j2112o
public | pup_assets | table | j2112o
public | staging_assets | table | j2112o
public | weld_log | table | j2112o
(4 rows)

The staging_assets table contains 7 columns of various data types.
This table will briefly house or stage the new uploads, to then be moved into another appropriate table, dependent upon what kind of asset the record is. The kind column will hold that determining (and be utilized in a forthcoming example blog post) value.


You Sed what?

Sed (stream editor) is a Unix utility that filters and transforms text. I will use it to ‘clean-up’ the data files being used for the inserts, however, its capabilities and power go far beyond this post.

Looking at the below output, a plain text file titled pipe_tally.txt contains the sample data to be uploaded. Perhaps this data was quickly collected in the field or punched in on a laptop. The end goal is to have this data in a comma-separated value format (aka CSV).

bigdaddy@LE2:~/Practice_Data$ cat pipe_tally.txt
Joint-2228 BEND 34493U 9100 11.8 0.893 22.50
Joint-8819 PIPE_JOINT 18393z 9159 13.0 0.893 0
Joint-9844 PIPE_JOINT 19293Q 8174 9.13 0.893 0
Joint-7834 BEND 17QTT 179 13.13 0.893 10.0
Joint-6844 PIPE_JOINT 17297Q 6114 11.34 0.893 0
Joint-4841 PIPE_JOINT 19395Q 5144 7.43 0.893 0
Joint-1827 BEND 12T29 6179 11.53 0.893 10.25
Joint-1941C PIPE_JOINT 22395A 995144 17.23 0.893 0
Joint-9941D PIPE_JOINT 92395U 48514 7.23 0.893 0
Joint-727 BEND 42G2 5180 14.13 0.893 17.75
Joint-41D PIPE_JOINT 2395P 68514 8.53 0.893 0
Joint-45666M PIPE_JOINT 1395P 68514 10.50 0.893 0
Joint-757V BEND 22R2 8189 24.25 0.893 11.00
Joint-45226I PIPE_JOINT 9395S 88314 7.21 0.893 0
Joint-45296 PIPE_JOINT 395O-1 96319 9.91 0.893 0

The pipe_tally2.txt (below) file is similar to the pipe_tally.txt file with the exception of that first line of text. Doesn’t it look familiar?

bigdaddy@LE2:~/Practice_Data$ cat pipe_tally2.txt
pipe_id kind pipe_jt_num pipe_heat pipe_length pipe_wall_thickness degree
Joint-28 BEND 39393A 9111 10.6 0.393 12.50
Joint-88 PIPE_JOINT 19393Y 9011 16.6 0.393 0
Joint-98 PIPE_JOINT 39393V 8171 6.6 0.393 0

Those are the names of the columns of the staging_assets table and will serve as the ‘headers’ for the final CSV file.
The spaces between the sections of text would be a perfect spot for a comma.
Let’s use Sed to make that happen with the below command:

bigdaddy@LE2:~/Practice_Data$ sed 's/ /,/g' pipe_tally2.txt
pipe_id,kind,pipe_jt_num,pipe_heat,pipe_length,pipe_wall_thickness,degree
Joint-28,BEND,39393A,9111,10.6,0.393,12.50
Joint-88,PIPE_JOINT,19393Y,9011,16.6,0.393,0
Joint-98,PIPE_JOINT,39393V,8171,6.6,0.393,0

Let me explain what is going on here with sed.

The above syntax is basically saying this:
's(ubstitute) what_is_here for_what_is_here g[lobal(accross the entire file)]-optional'
So the command is placing a comma in any area that contains a single space, on each line, across the entire file.

Technically this file has not been modified yet and remains in its original form:

bigdaddy@LE2:~/Practice_Data$ cat pipe_tally2.txt
pipe_id kind pipe_jt_num pipe_heat pipe_length pipe_wall_thickness degree
Joint-28 BEND 39393A 9111 10.6 0.393 12.50
Joint-88 PIPE_JOINT 19393Y 9011 16.6 0.393 0
Joint-98 PIPE_JOINT 39393V 8171 6.6 0.393 0

To permanently change this file, pass in the -i flag like so:
bigdaddy@LE2:~/Practice_Data$ sed -i 's/ /,/g' pipe_tally2.txt


WARNING: Be certain you want to permanently change the file. This is the point of no return (in a sense).


Now concatenating out the pipe_tally2.txt, shows the changes have been made.

bigdaddy@LE2:~/Practice_Data$ cat pipe_tally2.txt
pipe_id,kind,pipe_jt_num,pipe_heat,pipe_length,pipe_wall_thickness,degree
Joint-28,BEND,39393A,9111,10.6,0.393,12.50
Joint-88,PIPE_JOINT,19393Y,9011,16.6,0.393,0
Joint-98,PIPE_JOINT,39393,8171,6.6,0.393,0

We have used Sed to replace all spaces, in the file, with a comma.

I will also perform the same type of alteration to the pipe_tally.txt file, but with a little twist.
bigdaddy@LE2:~/Practice_Data$ sed 's/ /,/g' pipe_tally.txt >> pipe_tally2.txt
With the above command, pipe_tally.txt was not actually changed. Instead, I used the redirection operator (>) to write its changes to the pipe_tally2.txt file. By using the double >>, I actually appended the changes sed made to this file.

A single > would have overwritten pipe_tally2.txt which I did not want since I previously just made changes to it with Sed and the fact that it contains the lines I plan to use for the headers in the CSV file.
Now let’s look at pipe_tally2.txt again and see its contents:

bigdaddy@LE2:~/Practice_Data$ cat pipe_tally2.txt
pipe_id,kind,pipe_jt_num,pipe_heat,pipe_length,pipe_wall_thickness,degree
Joint-28,BEND,39393A,9111,10.6,0.393,12.50
Joint-88,PIPE_JOINT,19393Y,9011,16.6,0.393,0
Joint-98,PIPE_JOINT,39393V,8171,6.6,0.393,0
Joint-2228,BEND,34493U,9100,11.8,0.893,22.50
Joint-8819,PIPE_JOINT,18393z,9159,13.0,0.893,0
Joint-9844,PIPE_JOINT,19293Q,8174,9.13,0.893,0
Joint-7834,BEND,17QTT,179,13.13,0.893,10.0
Joint-6844,PIPE_JOINT,17297Q,6114,11.34,0.893,0
Joint-4841,PIPE_JOINT,19395Q,5144,7.43,0.893,0
Joint-1827,BEND,12T29,6179,11.53,0.893,10.25
Joint-1941C,PIPE_JOINT,22395A,995144,17.23,0.893,0
Joint-9941D,PIPE_JOINT,92395U,48514,7.23,0.893,0
Joint-727,BEND,42G2,5180,14.13,0.893,17.75
Joint-41D,PIPE_JOINT,2395P,68514,8.53,0.893,0
Joint-45666M,PIPE_JOINT,1395P,68514,10.50,0.893,0
Joint-757V,BEND,22R2,8189,24.25,0.893,11.00
Joint-45226I,PIPE_JOINT,9395S,88314,7.21,0.893,0
Joint-45296,PIPE_JOINT,395O-1,96319,9.91,0.893,0

We have a comma-separated values file to work with the PostgreSQL COPY command.
To fully complete the prepping stage, I will create and name a new CSV file titled tally_staging.csv containing the data from pipe_tally2.txt using the Linux terminal:
bigdaddy@LE2:~/Practice_Data$ cat pipe_tally2.txt > tally_staging.csv
Opening the tally_staging.csv file with LibreOffice (or your choice of other software), we can see a good old CSV file containing column headers as well:

master_tallycsv

Note: CSV files are not the only format that works with the COPY command. It is just the file extension I choose to use for this blog post.


COPY that…

With the staging table in place and usable data ready, I will upload and insert it with the COPY command as shown below:

fab_tracking=> COPY staging_assets(pipe_id, kind, pipe_jt_num, pipe_heat, pipe_length, pipe_wall_thickness, degree)
FROM '~/Practice_Data/tally_staging.csv' DELIMITER ',' CSV HEADER;
ERROR: must be superuser to COPY to or from a file
HINT: Anyone can COPY to stdout or from stdin. psql's \copy command also works for anyone.

Uh-oh. Turns out you must be the user postgres (or have superuser privileges) to copy to and from files. The j2112o user I am currently logged in with does not have sufficient privileges and could only use the psql ‘\copy‘ meta command, which is an all-together different command. Visit this blog post for an example of that command.

I will quickly login (not shown) to the fab_tracking database as the postgres user to perform the needed command.


Note: Pay attention to the database prompt below that now shows =# instead of =>.
This # is indicative of SUPERUSER status.


fab_tracking=# COPY staging_assets(pipe_id, kind, pipe_jt_num, pipe_heat, pipe_length, pipe_wall_thickness, degree)
FROM '~/Practice_Data/tally_staging.csv' DELIMITER ',' CSV HEADER;
COPY 18

The returned COPY 18 command tag indicates the statement is successful, however, I will also verify the number of records present in the staging_assets table with the below SELECT command:

fab_tracking=> SELECT COUNT(*) FROM staging_assets;
count
-------
18
(1 row)

The 18 records are present from the successful COPY command.

Digging into the above COPY command.
COPY table_name[optional_column_name(s)] FROM 'path_to_source_file' DELIMITER 'delimiter_here' CSV HEADER;

  • table_name – our table staging_assets.
  • column_names – Optional, but I decided to include them in this example. If no column list is specified, all columns of the table will be copied.
  • path_to_source_file – Path of the input (or output file if using COPY TO) file. The output filename path must be an absolute path (shown here as a relative path).
  • DELIMITER – This is the character that specifies what separates the columns in each row or line. A comma (,) the default for CSV files, is used for this example. For text files, the TAB character is the default.
  • CSV HEADER – Quoting the documentation: “Specifies that the file contains a header line with the names of each column in the file. On output, the first line contains the column names from the table, and on input, the first line is ignored. This option is allowed only when using CSV format.”

COPY Takeaways.

  • Moves data between standard file-system files and PostgreSQL tables.
  • Once again, the documentation explains user ID permissions exceptionally well: “Files named in a COPY command are read or written directly by the server, not by the client application. Therefore, they must reside on or be accessible to the database server machine, not the client. They must be accessible to and readable or writable by the PostgreSQL user (the user ID the server runs as), not the client. Similarly, the command specified with PROGRAM is executed directly by the server, not by the client application, must be executable by the PostgreSQL user. COPY naming a file or command is only allowed to database superusers since it allows reading or writing any file that the server has privileges to access.”
  • COPY TO (not demonstrated) copies contents of a table to a file.
  • COPY FROM copies the contents of a file to a table.
  • The DELIMITER specifies the character that separates columns in each row or line of the file. Must be a single one-byte character.
  • COPY cannot be used with views. Only plain tables.
  • The SQL Standard contains no COPY statement. It is a PostgreSQL extension.

My hope is that this blog post has sparked your interest in using the COPY command to automate some of your larger insert tasks. I encourage you to visit the excellent official PostgreSQL documentation at for an in-depth exploration of all available options and rules.


Thank you for taking the time to read this post. I truly hope you discovered something interesting and enlightening. Please share your findings here, with someone else you know who would get the same value out of it as well.


To receive notifications for the latest post from “Digital Owl’s Prose” via email, please subscribe by clicking the ‘Click To Subscribe!’ button in the sidebar!


Disclaimer: The examples presented in this post are hypothetical ideas of how to achieve similar types of results. They are not the utmost best solution(s). Your particular goals and needs may vary. Use those practices that best benefit your needs and goals. Opinions are my own.

One thought on “Data prepping with Sed for bulk uploads using the PostgreSQL COPY command.

Leave a Reply