Dynamic MySQL CREATE TABLE statement with pandas and pyodbc

Have you ever had to type out a massive CREATE TABLE statement by hand? One with dozens of columns? Maybe several dozens of columns? There are likely some GUI tools to help with large CREATE TABLE commands. Or, other drag-n-drop types of software that I am not familiar with. What if you could write a few lines of Python code and take care of a huge CREATE TABLE statement with way less effort than typed manually? Interested? Continue reading and see how using pandas, pyodbc, and MySQL…

OS, Software, and DB used:

  • OpenSuse Leap 15.1
  • MySQL 8.0.20
  • Python 3.7.2
  • pandas 1.0.4


Self-Promotion:

If you enjoy the content written here, by all means, share this blog and your favorite post(s) with others who may benefit from or like it as well. Since coffee is my favorite drink, you can even buy me one if you would like!


You can download the Stack Overflow Developer Survey Results data set used in this post for your own exploration if you would like.

To start, I’ll import pandas into an ipython session and specify the path to the target CSV from the Stack Overflow survey results download:

1
2
In [1]: import pandas as pd
In [2]: csv_file = r'/home/joshua/Practice_Data/developer_survey_2019/survey_results_public.csv'

Using the pandas read_csv() function, I create a DataFrame named ‘data_set’. The DataFrame itself provides several attributes we can utilize for the CREATE TABLE statement:

1
In [3]: data_set = pd.read_csv(csv_file, delimiter=',')

Looking at the columns attribute, we can see there are quite a few columns in the imported CSV:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
In [4]: data_set.columns
Index(['Respondent', 'MainBranch', 'Hobbyist', 'OpenSourcer', 'OpenSource',
       'Employment', 'Country', 'Student', 'EdLevel', 'UndergradMajor',
       'EduOther', 'OrgSize', 'DevType', 'YearsCode', 'Age1stCode',
       'YearsCodePro', 'CareerSat', 'JobSat', 'MgrIdiot', 'MgrMoney',
       'MgrWant', 'JobSeek', 'LastHireDate', 'LastInt', 'FizzBuzz',
       'JobFactors', 'ResumeUpdate', 'CurrencySymbol', 'CurrencyDesc',
       'CompTotal', 'CompFreq', 'ConvertedComp', 'WorkWeekHrs', 'WorkPlan',
       'WorkChallenge', 'WorkRemote', 'WorkLoc', 'ImpSyn', 'CodeRev',
       'CodeRevHrs', 'UnitTests', 'PurchaseHow', 'PurchaseWhat',
       'LanguageWorkedWith', 'LanguageDesireNextYear', 'DatabaseWorkedWith',
       'DatabaseDesireNextYear', 'PlatformWorkedWith',
       'PlatformDesireNextYear', 'WebFrameWorkedWith',
       'WebFrameDesireNextYear', 'MiscTechWorkedWith',
       'MiscTechDesireNextYear', 'DevEnviron', 'OpSys', 'Containers',
       'BlockchainOrg', 'BlockchainIs', 'BetterLife', 'ITperson', 'OffOn',
       'SocialMedia', 'Extraversion', 'ScreenName', 'SOVisit1st',
       'SOVisitFreq', 'SOVisitTo', 'SOFindAnswer', 'SOTimeSaved',
       'SOHowMuchTime', 'SOAccount', 'SOPartFreq', 'SOJobs', 'EntTeams',
       'SOComm', 'WelcomeChange', 'SONewContent', 'Age', 'Gender', 'Trans',
       'Sexuality', 'Ethnicity', 'Dependents', 'SurveyLength', 'SurveyEase'],
      dtype='object')

1
2
3
In [9]: num_cols = len(data_set.axes[1])
In [10]: print(num_cols)
85

85 columns to be exact. Wow! That CREATE TABLE statement is going to be a whopper!

As I mentioned in the opening paragraph, typing out such a large CREATE TABLE statement manually by hand is cumbersome. And, unnecessary. With a little bit of craftiness, the pandas and pyodbc Python libraries are more than capable of handling the CREATE TABLE statement.

To provide the column count – based on the number of columns in the CSV file – I’ll create an arbitrary python list for its iterable capabilities (you’ll see why later in the post).

1
2
3
4
5
In [23]: col_count = [len(l.split(",")) for l in data_set.columns]
In [24]: print(type(col_count))                                                                                                                              
<class 'list'>
In [25]: print(len(col_count))
85

Be forewarned, I am no Python guru. There are likely better, Pythonic ways that accomplish the same. Savvy readers feel free to point them out to me in the comments below once you see what I use the list for.

There are a couple of hangups in the approach I’m using and want to point out the first one here. In order to have a dynamic – automated to a certain degree – CREATE TABLE statement, all columns must have the same datatype (initially at least). For this particular example, I chose the TEXT datatype but, a VARCHAR could also work. Depending on the data set, you could use all INTEGER‘s or DOUBLE‘s if you are working with strictly numeric data.

Just keep in mind that whichever datatype you decide on, you will either have to run ALTER TABLE statement(s) in MySQL and implement the correct datatypes for the needed columns once the upload completes. Or, move the rows of data from an initial staging table to another permanent table, typecasting them in that process. The ALTER TABLE could be expensive as there are over 80k rows of data. Your mileage may vary

As we will see, the power of string concatenation is crucial for the dynamic CREATE TABLE command we need to construct.

First, I establish the CREATE TABLE command and table name – ‘so_data’ – for this example, storing it in a ‘SQL_CREATE_TBL’ variable:

1
In [41]: SQL_CREATE_TBL = "CREATE TABLE so_data("

Next, using a for loop, append each column name – via the pandas columns attribute – along with the TEXT datatype to the ‘SQL_CREATE_TBL’ variable. The string appending occurs once for each of the number of columns present in the CSV file (85), by using the range() and len() methods. Simultaneously, the format() string method inserts the column name value – derived from the DataFrame columns attribute:

1
2
In [43]: for name in range(0, len(col_count)):
    ...:     SQL_CREATE_TBL += "{} TEXT, ".format(data_set.columns[name])

Let’s view the ‘SQL_CREATE_TBL’ string variable contents once the for loop completes:

1
2
In [47]: SQL_CREATE_TBL                                                                                                                                      
Out[47]: 'CREATE TABLE so_data(Respondent TEXT, MainBranch TEXT, Hobbyist TEXT, OpenSourcer TEXT, OpenSource TEXT, Employment TEXT, Country TEXT, Student TEXT, EdLevel TEXT, UndergradMajor TEXT, EduOther TEXT, OrgSize TEXT, DevType TEXT, YearsCode TEXT, Age1stCode TEXT, YearsCodePro TEXT, CareerSat TEXT, JobSat TEXT, MgrIdiot TEXT, MgrMoney TEXT, MgrWant TEXT, JobSeek TEXT, LastHireDate TEXT, LastInt TEXT, FizzBuzz TEXT, JobFactors TEXT, ResumeUpdate TEXT, CurrencySymbol TEXT, CurrencyDesc TEXT, CompTotal TEXT, CompFreq TEXT, ConvertedComp TEXT, WorkWeekHrs TEXT, WorkPlan TEXT, WorkChallenge TEXT, WorkRemote TEXT, WorkLoc TEXT, ImpSyn TEXT, CodeRev TEXT, CodeRevHrs TEXT, UnitTests TEXT, PurchaseHow TEXT, PurchaseWhat TEXT, LanguageWorkedWith TEXT, LanguageDesireNextYear TEXT, DatabaseWorkedWith TEXT, DatabaseDesireNextYear TEXT, PlatformWorkedWith TEXT, PlatformDesireNextYear TEXT, WebFrameWorkedWith TEXT, WebFrameDesireNextYear TEXT, MiscTechWorkedWith TEXT, MiscTechDesireNextYear TEXT, DevEnviron TEXT, OpSys TEXT, Containers TEXT, BlockchainOrg TEXT, BlockchainIs TEXT, BetterLife TEXT, ITperson TEXT, OffOn TEXT, SocialMedia TEXT, Extraversion TEXT, ScreenName TEXT, SOVisit1st TEXT, SOVisitFreq TEXT, SOVisitTo TEXT, SOFindAnswer TEXT, SOTimeSaved TEXT, SOHowMuchTime TEXT, SOAccount TEXT, SOPartFreq TEXT, SOJobs TEXT, EntTeams TEXT, SOComm TEXT, WelcomeChange TEXT, SONewContent TEXT, Age TEXT, Gender TEXT, Trans TEXT, Sexuality TEXT, Ethnicity TEXT, Dependents TEXT, SurveyLength TEXT, SurveyEase TEXT, '

All of the column names having the TEXT datatype are now a part of the string. However, there is some clean-up to do. The ‘SQL_CREATE_TBL’ string has a trailing comma after the TEXT keyword for the last column, ‘SurveyEase’. If left as is, that trailing comma will raise an error in MySQL if executed.

We can use the rstring() method and remove the trailing comma:

1
In [48]: SQL_CREATE_TBL = SQL_CREATE_TBL.rstrip(" ,")

All that’s left to complete the CREATE TABLE statement, is simply append the closing parenthesis and semicolon to the ‘SQL_CREATE_TBL’ variable:

1
In [49]: SQL_CREATE_TBL += ");"

The ‘SQL_CREATE_TBL’ variable is now complete, storing the full CREATE TABLE statement:

1
2
In [50]: SQL_CREATE_TBL                                                                                                                                      
Out[50]: 'CREATE TABLE so_data(Respondent TEXT, MainBranch TEXT, Hobbyist TEXT, OpenSourcer TEXT, OpenSource TEXT, Employment TEXT, Country TEXT, Student TEXT, EdLevel TEXT, UndergradMajor TEXT, EduOther TEXT, OrgSize TEXT, DevType TEXT, YearsCode TEXT, Age1stCode TEXT, YearsCodePro TEXT, CareerSat TEXT, JobSat TEXT, MgrIdiot TEXT, MgrMoney TEXT, MgrWant TEXT, JobSeek TEXT, LastHireDate TEXT, LastInt TEXT, FizzBuzz TEXT, JobFactors TEXT, ResumeUpdate TEXT, CurrencySymbol TEXT, CurrencyDesc TEXT, CompTotal TEXT, CompFreq TEXT, ConvertedComp TEXT, WorkWeekHrs TEXT, WorkPlan TEXT, WorkChallenge TEXT, WorkRemote TEXT, WorkLoc TEXT, ImpSyn TEXT, CodeRev TEXT, CodeRevHrs TEXT, UnitTests TEXT, PurchaseHow TEXT, PurchaseWhat TEXT, LanguageWorkedWith TEXT, LanguageDesireNextYear TEXT, DatabaseWorkedWith TEXT, DatabaseDesireNextYear TEXT, PlatformWorkedWith TEXT, PlatformDesireNextYear TEXT, WebFrameWorkedWith TEXT, WebFrameDesireNextYear TEXT, MiscTechWorkedWith TEXT, MiscTechDesireNextYear TEXT, DevEnviron TEXT, OpSys TEXT, Containers TEXT, BlockchainOrg TEXT, BlockchainIs TEXT, BetterLife TEXT, ITperson TEXT, OffOn TEXT, SocialMedia TEXT, Extraversion TEXT, ScreenName TEXT, SOVisit1st TEXT, SOVisitFreq TEXT, SOVisitTo TEXT, SOFindAnswer TEXT, SOTimeSaved TEXT, SOHowMuchTime TEXT, SOAccount TEXT, SOPartFreq TEXT, SOJobs TEXT, EntTeams TEXT, SOComm TEXT, WelcomeChange TEXT, SONewContent TEXT, Age TEXT, Gender TEXT, Trans TEXT, Sexuality TEXT, Ethnicity TEXT, Dependents TEXT, SurveyLength TEXT, SurveyEase TEXT);'

The entire code structure for the dynamic CREATE TABLE statement is shown below:

1
2
3
4
5
SQL_CREATE_TBL = "CREATE TABLE so_data("
for name in range(0, len(col_count)):
    SQL_CREATE_TBL += "{} TEXT, ".format(data_set.columns[name])
SQL_CREATE_TBL = SQL_CREATE_TBL.rstrip(" ,")
SQL_CREATE_TBL += ");"

Sure beats typing out that whopper of a CREATE TABLE statement by hand!

Now we have to actually execute the CREATE TABLE command in the MySQL instance using the ‘SQL_CREATE_TBL’ variable. There are numerous Python drivers/connectors you can access a MySQL database with. I am most familiar with the pyodbc driver so that is the one I will use for the upload.

In the same ipython session, I import pyodbc, establish a connection and cursor object and move to the ‘learning’ database with USE learning statement:

1
2
3
4
5
In [52]: import pyodbc
In [53]: conn = pyodbc.connect("""DRIVER={MySQL ODBC 8.0 Unicode Driver};SERVER=localhost;DATBASE=learning;USER=my_user;PASSWORD=my_password;""")
In [54]: cur = conn.cursor()
In [56]: cur.execute("USE learning")
Out[56]: <pyodbc.Cursor at 0x7f89531615d0>

First things first, let’s execute the CREATE TABLE command using the ‘SQL_CREATE_TBL’ variable:

1
2
3
4
5
6
7
8
In [57]: try:
    ...:     cur.execute(SQL_CREATE_TBL)
    ...: except pyodbc.DatabaseError as e:
    ...:     raise e
    ...:     cur.rollback()
    ...: else:
    ...:     cur.commit()
    ...:

Once complete, I’ll verify the ‘so_data’ table does exist in the ‘learning’ database:

1
2
3
4
5
6
7
 MySQL  localhost:33060+ ssl  learning  SQL > SHOW TABLES IN learning LIKE '%so_data%';
+--------------------------------+
| Tables_in_learning (%so_data%) |
+--------------------------------+
| so_data                        |
+--------------------------------+
1 row in set (0.0027 sec)

Querying the COLUMNS table in the information_schema database, we can see all 85 columns were created:

1
2
3
4
5
6
7
8
9
10
 MySQL  localhost:33060+ ssl  information_schema  SQL > SELECT COUNT(*) as column_count
                                                     -> FROM COLUMNS
                                                     -> WHERE table_schema = 'learning'
                                                     -> AND table_name = 'so_data';
+--------------+
| column_count |
+--------------+
|           85 |
+--------------+
1 row in set (0.0034 sec)

For all of the CSV rows’ column values, we need a parameterized query to use in the INSERT statement. Again, we can leverage the pandas columns attribute for the correct number of parameter placeholder ‘?’ values. Then, include those placeholders into an INSERT INTO query:

1
2
In [59]: placeholders = ",".join("?" * len(data_set.columns))
In [61]: SQL_INSERT = "INSERT INTO so_data VALUES({});".format(placeholders)

To continue, we need some CSV functionality so I’ll import the built-in csv module as well:

1
In [62]: import csv

Finally, we have come to the moment of truth; the upload…

WARNING!!!

This is by no means a fast operation and will take some time… There are over 88k rows in the CSV data file, requiring many INSERT‘s.

We know from trials and tribulations that everything is not in our hands or control. To be safe, we should work within some form of a transaction. To do that, I’ll set autocommit to false in the pyodbc connection (autocommit is set to True by default) and encapsulate the entire upload in a try/except/else/finally block. Should something go awry during the upload, any INSERT‘s that did complete will be rolled back, and no data is stored – or skewed.

Here goes…

1
2
3
4
5
6
7
8
9
10
11
12
13
14
In [63]: try:
    ...:     conn.autocommit = False
    ...:     with open(csv_file, 'r') as f_csv:
    ...:         csv_reader = csv.reader(f_csv, delimiter=',')
    ...:         next(csv_reader) #advancing past the first row of header column names
    ...:         cur.executemany(SQL_INSERT, csv_reader)
    ...:         conn.commit()
    ...: except pyodbc.DatabaseError as e:
    ...:     raise e
    ...:     conn.rollback()
    ...: else:
    ...:     conn.commit()
    ...: finally:
    ...:     conn.autocommit = True

Once execution completes, over in the ‘so_data’ MySQL table, we can see the total count of rows is 88883:

1
2
3
4
5
6
7
 MySQL  localhost:33060+ ssl  learning  SQL > SELECT COUNT(*) FROM so_data;
+----------+
| COUNT(*) |
+----------+
|    88883 |
+----------+
1 row in set (0.2939 sec)

The upload is successful.

Things to keep in mind…
  • If the target CSV file does not have a header row of column names, the code I have presented in the post will not work.
  • The upload demonstrated here is not fast or optimized by any stretch of the imagination. That is many many inserts firing off. (My laptop is probably not the best test environment either.)
  • Remember for this script to work, all the columns datatypes must be set to a common type and you may have to ALTER some columns as your needs see fit once the upload is complete.
  • Weigh all of your options. The dynamic CREATE TABLE statement may be of little use if you have to spend 3X the amount of time executing ALTER TABLE statement(s) to implement the appropriate column datatypes.

Hopefully, you find a use for this type of dynamic CREATE TABLE command. I used it prototyping out a couple of MySQL tables for a particular data set I was interested in. Let me know what you think in the comments below and thanks for reading!

Pyodbc CRUD

If you are interested in using the pyodbc driver with MySQL, I wrote a 4 part series on CRUD operations you will find valuable.

  1. Pyodbc SQL CRUD – Create: Examples with MySQL
  2. Pyodbc SQL CRUD – Read: Examples with MySQL
  3. Pyodbc SQL CRUD – Update: Examples with MySQL
  4. Pyodbc SQL CRUD – Delete: Examples with MySQL

Like what you have read? See anything incorrect? Please comment below and thanks for reading!!!

A Call To Action!

Thank you for taking the time to read this post. I truly hope you discovered something interesting and enlightening. Please share your findings here, with someone else you know who would get the same value out of it as well.

Visit the Portfolio-Projects page to see blog post/technical writing I have completed for clients.

Have I mentioned how much I love a cup of coffee?!?!

To receive email notifications (Never Spam) from this blog (“Digital Owl’s Prose”) for the latest blog posts as they are published, please subscribe (of your own volition) by clicking the ‘Click To Subscribe!’ button in the sidebar on the homepage! (Feel free at any time to review the Digital Owl’s Prose Privacy Policy Page for any questions you may have about: email updates, opt-in, opt-out, contact forms, etc…)

Be sure and visit the “Best Of” page for a collection of my best blog posts.


Josh Otwell has a passion to study and grow as a SQL Developer and blogger. Other favorite activities find him with his nose buried in a good book, article, or the Linux command line. Among those, he shares a love of tabletop RPG games, reading fantasy novels, and spending time with his wife and two daughters.

Disclaimer: The examples presented in this post are hypothetical ideas of how to achieve similar types of results. They are not the utmost best solution(s). The majority, if not all, of the examples provided, is performed on a personal development/learning workstation-environment and should not be considered production quality or ready. Your particular goals and needs may vary. Use those practices that best benefit your needs and goals. Opinions are my own.

2 thoughts on “Dynamic MySQL CREATE TABLE statement with pandas and pyodbc

Hey thanks for commenting! Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.