Have you ever had to type out a massive `CREATE TABLE` statement by hand? One with dozens of columns? Maybe several dozen columns? There are likely some GUI tools to help with large `CREATE TABLE` commands, or other drag-and-drop types of software that I am not familiar with. What if you could write a few lines of Python code and take care of a huge `CREATE TABLE` statement with far less effort than typing it manually? Interested? Continue reading and see how, using pandas, pyodbc, and MySQL…

OS, Software, and DB used:
- OpenSuse Leap 15.1
- MySQL 8.0.20
- Python 3.7.2
- pandas 1.0.4
Self-Promotion:
If you enjoy the content written here, by all means, share this blog and your favorite post(s) with others who may benefit from or like it as well. Since coffee is my favorite drink, you can even buy me one if you would like!
You can download the Stack Overflow Developer Survey Results data set used in this post for your own exploration if you would like.
To start, I’ll import pandas into an `ipython` session and specify the path to the target CSV from the Stack Overflow survey results download:

```python
In [1]: import pandas as pd
In [2]: csv_file = r'/home/joshua/Practice_Data/developer_survey_2019/survey_results_public.csv'
```
Using the pandas `read_csv()` function, I create a DataFrame named ‘data_set’. The DataFrame itself provides several attributes we can utilize for the `CREATE TABLE` statement:

```python
In [3]: data_set = pd.read_csv(csv_file, delimiter=',')
```
Looking at the `columns` attribute, we can see there are quite a few columns in the imported CSV:

```python
In [4]: data_set.columns
Index(['Respondent', 'MainBranch', 'Hobbyist', 'OpenSourcer', 'OpenSource',
       'Employment', 'Country', 'Student', 'EdLevel', 'UndergradMajor',
       'EduOther', 'OrgSize', 'DevType', 'YearsCode', 'Age1stCode',
       'YearsCodePro', 'CareerSat', 'JobSat', 'MgrIdiot', 'MgrMoney',
       'MgrWant', 'JobSeek', 'LastHireDate', 'LastInt', 'FizzBuzz',
       'JobFactors', 'ResumeUpdate', 'CurrencySymbol', 'CurrencyDesc',
       'CompTotal', 'CompFreq', 'ConvertedComp', 'WorkWeekHrs', 'WorkPlan',
       'WorkChallenge', 'WorkRemote', 'WorkLoc', 'ImpSyn', 'CodeRev',
       'CodeRevHrs', 'UnitTests', 'PurchaseHow', 'PurchaseWhat',
       'LanguageWorkedWith', 'LanguageDesireNextYear', 'DatabaseWorkedWith',
       'DatabaseDesireNextYear', 'PlatformWorkedWith',
       'PlatformDesireNextYear', 'WebFrameWorkedWith',
       'WebFrameDesireNextYear', 'MiscTechWorkedWith',
       'MiscTechDesireNextYear', 'DevEnviron', 'OpSys', 'Containers',
       'BlockchainOrg', 'BlockchainIs', 'BetterLife', 'ITperson', 'OffOn',
       'SocialMedia', 'Extraversion', 'ScreenName', 'SOVisit1st',
       'SOVisitFreq', 'SOVisitTo', 'SOFindAnswer', 'SOTimeSaved',
       'SOHowMuchTime', 'SOAccount', 'SOPartFreq', 'SOJobs', 'EntTeams',
       'SOComm', 'WelcomeChange', 'SONewContent', 'Age', 'Gender', 'Trans',
       'Sexuality', 'Ethnicity', 'Dependents', 'SurveyLength', 'SurveyEase'],
      dtype='object')
```

```python
In [9]: num_cols = len(data_set.axes[1])
In [10]: print(num_cols)
85
```
85 columns, to be exact. Wow! That `CREATE TABLE` statement is going to be a whopper!
As I mentioned in the opening paragraph, typing out such a large `CREATE TABLE` statement by hand is cumbersome. And unnecessary. With a little bit of craftiness, the pandas and pyodbc Python libraries are more than capable of handling the `CREATE TABLE` statement.
To provide the column count – based on the number of columns in the CSV file – I’ll create an arbitrary Python `list` for its iterable capabilities (you’ll see why later in the post).

```python
In [23]: col_count = [len(l.split(",")) for l in data_set.columns]
In [24]: print(type(col_count))
<class 'list'>
In [25]: print(len(col_count))
85
```
Be forewarned: I am no Python guru. There are likely better, more Pythonic ways to accomplish the same thing. Savvy readers, feel free to point them out in the comments below once you see what I use the `list` for.
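In that spirit, here is a minimal sketch of one more Pythonic route, assuming the same ‘data_set’ DataFrame: since the `columns` attribute is itself iterable, `str.join()` can build all of the column definitions directly, with no helper list or index-based loop. It is shown only for comparison; the rest of the post follows the `list`-based approach:

```python
# A minimal sketch of a more Pythonic alternative: build every column
# definition straight from the DataFrame's columns attribute.
col_defs = ", ".join("{} TEXT".format(name) for name in data_set.columns)
alt_create_tbl = "CREATE TABLE so_data({});".format(col_defs)
# join() places the commas between items, so there is no trailing
# comma to strip off afterwards.
```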
There are a couple of hangups in the approach I’m using, and I want to point out the first one here. In order to have a dynamic – automated, to a certain degree – `CREATE TABLE` statement, all columns must have the same datatype (initially, at least). For this particular example, I chose the `TEXT` datatype, but a `VARCHAR` could also work. Depending on the data set, you could use all `INTEGER`‘s or `DOUBLE`‘s if you are working with strictly numeric data.
Just keep in mind that whichever datatype you decide on, once the upload completes you will either have to run `ALTER TABLE` statement(s) in MySQL to implement the correct datatypes for the needed columns, or move the rows of data from an initial staging table to another, permanent table, typecasting them in that process. The `ALTER TABLE` could be expensive, as there are over 80k rows of data. Your mileage may vary.
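To make that clean-up step concrete, here is a hedged sketch of the first option, using the pyodbc cursor (‘cur’) that gets created later in this post. The target datatypes are assumptions chosen purely for illustration:

```python
# Hypothetical post-upload clean-up (a sketch, not run in this post):
# tighten the datatypes on a couple of columns that hold numeric data.
# The INT and DOUBLE targets are illustrative assumptions only.
fix_types = [
    "ALTER TABLE so_data MODIFY COLUMN Respondent INT",
    "ALTER TABLE so_data MODIFY COLUMN WorkWeekHrs DOUBLE",
]
for stmt in fix_types:
    cur.execute(stmt)  # 'cur' is the pyodbc cursor created further below
```

Note that MySQL commits DDL statements such as `ALTER TABLE` implicitly, so no explicit commit is needed after them.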
As we will see, the power of string concatenation is crucial for the dynamic `CREATE TABLE` command we need to construct.
First, I establish the `CREATE TABLE` command and table name – ‘so_data’ for this example – storing it in a ‘SQL_CREATE_TBL’ variable:

```python
In [41]: SQL_CREATE_TBL = "CREATE TABLE so_data("
```
Next, using a `for` loop, I append each column name – via the pandas `columns` attribute – along with the `TEXT` datatype to the ‘SQL_CREATE_TBL’ variable. The string appending occurs once for each of the 85 columns present in the CSV file, by using the `range()` and `len()` functions. Simultaneously, the `format()` string method inserts the column name value, derived from the DataFrame `columns` attribute:

```python
In [43]: for name in range(0, len(col_count)):
    ...:     SQL_CREATE_TBL += "{} TEXT, ".format(data_set.columns[name])
```
Let’s view the ‘SQL_CREATE_TBL’ string variable contents once the `for` loop completes:

```python
In [47]: SQL_CREATE_TBL
Out[47]: 'CREATE TABLE so_data(Respondent TEXT, MainBranch TEXT, Hobbyist TEXT, OpenSourcer TEXT, OpenSource TEXT, Employment TEXT, Country TEXT, Student TEXT, EdLevel TEXT, UndergradMajor TEXT, EduOther TEXT, OrgSize TEXT, DevType TEXT, YearsCode TEXT, Age1stCode TEXT, YearsCodePro TEXT, CareerSat TEXT, JobSat TEXT, MgrIdiot TEXT, MgrMoney TEXT, MgrWant TEXT, JobSeek TEXT, LastHireDate TEXT, LastInt TEXT, FizzBuzz TEXT, JobFactors TEXT, ResumeUpdate TEXT, CurrencySymbol TEXT, CurrencyDesc TEXT, CompTotal TEXT, CompFreq TEXT, ConvertedComp TEXT, WorkWeekHrs TEXT, WorkPlan TEXT, WorkChallenge TEXT, WorkRemote TEXT, WorkLoc TEXT, ImpSyn TEXT, CodeRev TEXT, CodeRevHrs TEXT, UnitTests TEXT, PurchaseHow TEXT, PurchaseWhat TEXT, LanguageWorkedWith TEXT, LanguageDesireNextYear TEXT, DatabaseWorkedWith TEXT, DatabaseDesireNextYear TEXT, PlatformWorkedWith TEXT, PlatformDesireNextYear TEXT, WebFrameWorkedWith TEXT, WebFrameDesireNextYear TEXT, MiscTechWorkedWith TEXT, MiscTechDesireNextYear TEXT, DevEnviron TEXT, OpSys TEXT, Containers TEXT, BlockchainOrg TEXT, BlockchainIs TEXT, BetterLife TEXT, ITperson TEXT, OffOn TEXT, SocialMedia TEXT, Extraversion TEXT, ScreenName TEXT, SOVisit1st TEXT, SOVisitFreq TEXT, SOVisitTo TEXT, SOFindAnswer TEXT, SOTimeSaved TEXT, SOHowMuchTime TEXT, SOAccount TEXT, SOPartFreq TEXT, SOJobs TEXT, EntTeams TEXT, SOComm TEXT, WelcomeChange TEXT, SONewContent TEXT, Age TEXT, Gender TEXT, Trans TEXT, Sexuality TEXT, Ethnicity TEXT, Dependents TEXT, SurveyLength TEXT, SurveyEase TEXT, '
```
All of the column names, each with the `TEXT` datatype, are now part of the string. However, there is some clean-up to do. The ‘SQL_CREATE_TBL’ string has a trailing comma after the `TEXT` keyword for the last column, ‘SurveyEase’. If left as is, that trailing comma will raise an error in MySQL when executed.
We can use the `rstrip()` method to remove the trailing comma (along with the trailing space):

```python
In [48]: SQL_CREATE_TBL = SQL_CREATE_TBL.rstrip(" ,")
```
All that’s left to complete the `CREATE TABLE` statement is to simply append the closing parenthesis and semicolon to the ‘SQL_CREATE_TBL’ variable:

```python
In [49]: SQL_CREATE_TBL += ");"
```
The ‘SQL_CREATE_TBL’ variable is now complete, storing the full `CREATE TABLE` statement:

```python
In [50]: SQL_CREATE_TBL
Out[50]: 'CREATE TABLE so_data(Respondent TEXT, MainBranch TEXT, Hobbyist TEXT, OpenSourcer TEXT, OpenSource TEXT, Employment TEXT, Country TEXT, Student TEXT, EdLevel TEXT, UndergradMajor TEXT, EduOther TEXT, OrgSize TEXT, DevType TEXT, YearsCode TEXT, Age1stCode TEXT, YearsCodePro TEXT, CareerSat TEXT, JobSat TEXT, MgrIdiot TEXT, MgrMoney TEXT, MgrWant TEXT, JobSeek TEXT, LastHireDate TEXT, LastInt TEXT, FizzBuzz TEXT, JobFactors TEXT, ResumeUpdate TEXT, CurrencySymbol TEXT, CurrencyDesc TEXT, CompTotal TEXT, CompFreq TEXT, ConvertedComp TEXT, WorkWeekHrs TEXT, WorkPlan TEXT, WorkChallenge TEXT, WorkRemote TEXT, WorkLoc TEXT, ImpSyn TEXT, CodeRev TEXT, CodeRevHrs TEXT, UnitTests TEXT, PurchaseHow TEXT, PurchaseWhat TEXT, LanguageWorkedWith TEXT, LanguageDesireNextYear TEXT, DatabaseWorkedWith TEXT, DatabaseDesireNextYear TEXT, PlatformWorkedWith TEXT, PlatformDesireNextYear TEXT, WebFrameWorkedWith TEXT, WebFrameDesireNextYear TEXT, MiscTechWorkedWith TEXT, MiscTechDesireNextYear TEXT, DevEnviron TEXT, OpSys TEXT, Containers TEXT, BlockchainOrg TEXT, BlockchainIs TEXT, BetterLife TEXT, ITperson TEXT, OffOn TEXT, SocialMedia TEXT, Extraversion TEXT, ScreenName TEXT, SOVisit1st TEXT, SOVisitFreq TEXT, SOVisitTo TEXT, SOFindAnswer TEXT, SOTimeSaved TEXT, SOHowMuchTime TEXT, SOAccount TEXT, SOPartFreq TEXT, SOJobs TEXT, EntTeams TEXT, SOComm TEXT, WelcomeChange TEXT, SONewContent TEXT, Age TEXT, Gender TEXT, Trans TEXT, Sexuality TEXT, Ethnicity TEXT, Dependents TEXT, SurveyLength TEXT, SurveyEase TEXT);'
```
The entire code structure for the dynamic `CREATE TABLE` statement is shown below:

```python
SQL_CREATE_TBL = "CREATE TABLE so_data("
for name in range(0, len(col_count)):
    SQL_CREATE_TBL += "{} TEXT, ".format(data_set.columns[name])
SQL_CREATE_TBL = SQL_CREATE_TBL.rstrip(" ,")
SQL_CREATE_TBL += ");"
```
Sure beats typing out that whopper of a `CREATE TABLE` statement by hand!
Now we have to actually execute the `CREATE TABLE` command in the MySQL instance using the ‘SQL_CREATE_TBL’ variable. There are numerous Python drivers/connectors with which you can access a MySQL database. I am most familiar with the `pyodbc` driver, so that is the one I will use for the upload.
In the same `ipython` session, I import `pyodbc`, establish a connection and cursor object, and move to the ‘learning’ database with a `USE learning` statement:

```python
In [52]: import pyodbc
In [53]: conn = pyodbc.connect("""DRIVER={MySQL ODBC 8.0 Unicode Driver};SERVER=localhost;DATABASE=learning;USER=my_user;PASSWORD=my_password;""")
In [54]: cur = conn.cursor()
In [56]: cur.execute("USE learning")
Out[56]: <pyodbc.Cursor at 0x7f89531615d0>
```
First things first, let’s execute the `CREATE TABLE` command using the ‘SQL_CREATE_TBL’ variable. Notice the rollback happens before the `raise`, so the transaction is undone before the exception propagates:

```python
In [57]: try:
    ...:     cur.execute(SQL_CREATE_TBL)
    ...: except pyodbc.DatabaseError as e:
    ...:     cur.rollback()
    ...:     raise e
    ...: else:
    ...:     cur.commit()
    ...:
```
Once complete, I’ll verify the ‘so_data’ table does exist in the ‘learning’ database:

```sql
MySQL localhost:33060+ ssl learning SQL > SHOW TABLES IN learning LIKE '%so_data%';
+--------------------------------+
| Tables_in_learning (%so_data%) |
+--------------------------------+
| so_data                        |
+--------------------------------+
1 row in set (0.0027 sec)
```
Querying the `COLUMNS` table in the `information_schema` database, we can see all 85 columns were created:

```sql
MySQL localhost:33060+ ssl information_schema SQL > SELECT COUNT(*) as column_count
                                                 -> FROM COLUMNS
                                                 -> WHERE table_schema = 'learning'
                                                 -> AND table_name = 'so_data';
+--------------+
| column_count |
+--------------+
|           85 |
+--------------+
1 row in set (0.0034 sec)
```
For all of the CSV rows’ column values, we need a parameterized query to use in the `INSERT` statement. Again, we can leverage the pandas `columns` attribute for the correct number of ‘?’ parameter placeholders, then include those placeholders in an `INSERT INTO` query:

```python
In [59]: placeholders = ",".join("?" * len(data_set.columns))
In [61]: SQL_INSERT = "INSERT INTO so_data VALUES({});".format(placeholders)
```
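As a quick sanity check – just a sketch, not part of the original flow – we can confirm the generated statement carries one ‘?’ placeholder per column:

```python
# One '?' placeholder should exist for each of the 85 columns.
assert placeholders.count("?") == len(data_set.columns)
# Peek at the start of the generated statement:
print(SQL_INSERT[:40])  # begins with: INSERT INTO so_data VALUES(?,?,?,...
```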
To continue, we need some CSV functionality, so I’ll import the built-in `csv` module as well:

```python
In [62]: import csv
```
Finally, we have come to the moment of truth: the upload…
WARNING!!!
This is by no means a fast operation and will take some time…
There are over 88k rows in the CSV data file, requiring many `INSERT`‘s.
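As an aside – purely a hedged pointer, not something benchmarked in this post – MySQL’s `LOAD DATA LOCAL INFILE` is usually a much faster route for bulk CSV loads. A sketch is below; it assumes `local_infile` support is enabled on both the server and the ODBC connection:

```python
# Hypothetical bulk-load alternative (untested here): hand the whole CSV
# to MySQL in a single statement instead of ~88k parameterized INSERTs.
SQL_LOAD = """LOAD DATA LOCAL INFILE
    '/home/joshua/Practice_Data/developer_survey_2019/survey_results_public.csv'
    INTO TABLE so_data
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES;"""
cur.execute(SQL_LOAD)
cur.commit()
```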
We know from trials and tribulations that everything is not in our hands or control. To be safe, we should work within some form of a transaction. To do that, I’ll set `autocommit` to `False` on the `pyodbc` connection (pyodbc connections actually default to `autocommit` being `False`, but setting it explicitly makes the transactional intent clear) and encapsulate the entire upload in a `try/except/else/finally` block. Should something go awry during the upload, any `INSERT`‘s that did complete will be rolled back, and no data is stored – or skewed.
Here goes…
```python
In [63]: try:
    ...:     conn.autocommit = False
    ...:     with open(csv_file, 'r') as f_csv:
    ...:         csv_reader = csv.reader(f_csv, delimiter=',')
    ...:         next(csv_reader)  # advance past the first row of header column names
    ...:         cur.executemany(SQL_INSERT, csv_reader)
    ...: except pyodbc.DatabaseError as e:
    ...:     conn.rollback()
    ...:     raise e
    ...: else:
    ...:     conn.commit()
    ...: finally:
    ...:     conn.autocommit = True
```
Once execution completes, over in the ‘so_data’ MySQL table, we can see the total count of rows is 88883:

```sql
MySQL localhost:33060+ ssl learning SQL > SELECT COUNT(*) FROM so_data;
+----------+
| COUNT(*) |
+----------+
|    88883 |
+----------+
1 row in set (0.2939 sec)
```
The upload is successful.
Things to keep in mind…
- If the target CSV file does not have a header row of column names, the code I have presented in this post will not work (see the sketch after this list for one possible workaround).
- The upload demonstrated here is not fast or optimized by any stretch of the imagination. That is many, many inserts firing off. (My laptop is probably not the best test environment either.)
- Remember, for this script to work, all of the columns’ datatypes must be set to a common type, and you may have to `ALTER` some columns as your needs see fit once the upload is complete.
- Weigh all of your options. The dynamic `CREATE TABLE` statement may be of little use if you have to spend 3x the amount of time executing `ALTER TABLE` statement(s) to implement the appropriate column datatypes.
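On the first caveat, here is a minimal sketch of one possible workaround for a headerless CSV, assuming generic, made-up column names are acceptable (the ‘col_0’, ‘col_1’, … names are hypothetical placeholders):

```python
# Handle a headerless CSV: header=None tells pandas not to treat row 1
# as column names, then we assign generic names ourselves so the same
# CREATE TABLE logic still works.
data_set = pd.read_csv(csv_file, delimiter=',', header=None)
data_set.columns = ['col_{}'.format(i) for i in range(len(data_set.columns))]
# The next(csv_reader) header-skip in the upload loop would also need
# removing, since there is no header row to advance past.
```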
Hopefully, you find a use for this type of dynamic `CREATE TABLE` command. I used it prototyping a couple of MySQL tables for a particular data set I was interested in. Let me know what you think in the comments below and thanks for reading!
Pyodbc CRUD
If you are interested in using the pyodbc driver with MySQL, I wrote a 4-part series on CRUD operations you will find valuable.
- Pyodbc SQL CRUD – Create: Examples with MySQL
- Pyodbc SQL CRUD – Read: Examples with MySQL
- Pyodbc SQL CRUD – Update: Examples with MySQL
- Pyodbc SQL CRUD – Delete: Examples with MySQL
A Call To Action!
Thank you for taking the time to read this post. I truly hope you discovered something interesting and enlightening. Please share your findings here, with someone else you know who would get the same value out of it as well.
Visit the Portfolio-Projects page to see blog post/technical writing I have completed for clients.
Have I mentioned how much I love a cup of coffee?!?!
To receive email notifications (Never Spam) from this blog (“Digital Owl’s Prose”) for the latest blog posts as they are published, please subscribe (of your own volition) by clicking the ‘Click To Subscribe!’ button in the sidebar on the homepage! (Feel free at any time to review the Digital Owl’s Prose Privacy Policy Page for any questions you may have about: email updates, opt-in, opt-out, contact forms, etc…)
Be sure and visit the “Best Of” page for a collection of my best blog posts.
Josh Otwell has a passion to study and grow as a SQL Developer and blogger. Other favorite activities find him with his nose buried in a good book, article, or the Linux command line. Among those, he shares a love of tabletop RPG games, reading fantasy novels, and spending time with his wife and two daughters.
Disclaimer: The examples presented in this post are hypothetical ideas of how to achieve similar types of results. They are not the utmost best solution(s). The majority, if not all, of the examples provided are performed on a personal development/learning workstation environment and should not be considered production quality or ready. Your particular goals and needs may vary. Use those practices that best benefit your needs and goals. Opinions are my own.