59,000,000,000,000,000,000,000 bytes. That’s 59 zettabytes, or 59 sextillion bytes of information. Don’t feel bad; I had to look up those words too. According to the International Data Corporation, that is the estimated total amount of data generated from the start of the digital age through 11:59 pm on December 31, 2020.
|Quantity|Order of Magnitude|SI Prefix|
|---|---|---|
|Data on Planet Earth|10^21|Zetta-|
|Rough Estimate of the Number of Sand Grains on Earth|10^18|Exa-|
|Current Estimate of the Number of Stars in the Universe|10^21|Zetta-|
|Estimate of the Drops of Water on Earth|10^24|Yotta-|
Assume for a moment that you were able to magically download all that data onto a single computer. It would take 3,277,777,778 of the largest commercially available hard drives (18 TB), each costing approximately $1,000 as of this posting, for a grand total of well over 3 trillion dollars. Again, let’s say that, through science or magic, each HDD was the thickness of a US dollar bill; they would form a stack 223 miles thick. One more figure because I am having WAY too much fun [OpenClinica asked a data nerd to write about data, so they should have known what they wrought]. In roughly a century (give or take a decade), humanity is expected to have generated more data than there are atoms on Earth.
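For anyone who wants to check my math, here is a minimal sketch of the arithmetic in Python. It assumes decimal units (1 TB = 10^12 bytes) and a bill thickness of 0.0043 inches; neither assumption is stated above, and the variable names are just illustrative.

```python
# Back-of-the-envelope check of the storage figures.
# Assumptions (labeled, not from the article): decimal units and a
# US bill thickness of 0.0043 inches.

TOTAL_BYTES = 59 * 10**21      # 59 zettabytes
DRIVE_BYTES = 18 * 10**12      # one 18 TB hard drive
DRIVE_COST_USD = 1_000         # approximate price per drive
BILL_THICKNESS_IN = 0.0043     # assumed thickness of one bill, in inches
INCHES_PER_MILE = 63_360

# Ceiling division: a partially filled drive still counts as a whole drive.
drives = -(-TOTAL_BYTES // DRIVE_BYTES)
total_cost = drives * DRIVE_COST_USD
stack_miles = drives * BILL_THICKNESS_IN / INCHES_PER_MILE

print(f"drives needed: {drives:,}")
print(f"total cost:    ${total_cost:,}")
print(f"stack height:  {stack_miles:,.1f} miles")
```

With that assumed thickness, the stack comes out at a little over 222 miles, in line with the figure quoted above; a slightly different thickness shifts it by a mile or so.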
In a recently leaked document, NASA has expressed interest in converting Mars into the largest data center in the solar system.
With all this data, it’s hardly a surprise that data science is one of the “it” careers and that more and more career paths need to be data fluent.
For several years now, I’ve held a top-down view of data science that might make me a lot less popular: learn data science, and I mean really learn it, and clinical data science will be a snap. You will mostly be applying principles you already know; after all, to a computer, an int data type is pretty much an int whether it’s a count of the number of TVs sold at a store or a heart rate. The inverse is rarely true. I once had a colleague, a statistician (as in a real, Ph.D.-holding, more-Greek-symbols-than-English-on-his-whiteboard type of statistician), who convinced me of a similar rationale with regard to statistics vs. biostatistics.
What makes a good data scientist?
If you are reading this blog, especially at 10:30 at night or while you munch away on lunch, you probably already have some of the qualities of a data scientist! Do you have an innate, some might argue pathological, desire to understand how things work? Are you infamous for your attention to detail? Obviously, success in data science is a bit more complicated and nuanced. Still, at the heart of it, all data science is driven by a desire to use information to improve decision-making. No knee-jerk decisions or gut feelings are allowed here. Those devils and angels on your shoulders can stay home too. So if you fit into these criteria, read on because you may have just found your calling. If you don’t, read on anyway.
How do I break into the data science field?
This is a path on which I have taken perhaps the second or third step, so please, please don’t treat this as an exhaustive, stay-inside-the-lines type of article. Your path is yours alone. I can only offer some guidance and helpful hints that I’ve found along the way.
The first question you really need to grapple with is how much you want to get into data science. That isn’t meant to be derisive or anything of the sort. You’ll learn, if you haven’t already, that virtually everything in life has a cost. Is that super-specialized Ph.D. program worth the 5-7 years of work and time away from the workforce, not to mention the late nights staring at the computer screen? Maybe it is, and if so, go for it, but you still need to ask yourself if it’s worth it to you.
In the broadest sense, entering or advancing in the data science field comes down to a choice between formal and informal training. Traditionally, formal training means an advanced degree of some type, while informal training is far more self-driven and created by you. Neither is inherently better than the other; both have some positives and some drawbacks, as you’ll soon see.
Back to school: is formal higher education suitable for you?
Theoretically, you can still enter the data science field with a bachelor’s degree and a strong math, science, and programming background; however, those days seem to be numbered. As more and more institutions have started offering advanced degrees in data science, the expectation that a serious candidate should have post-bachelor’s training has increased. Therefore, if you decide to invest the time, money, and effort into a graduate degree, you should know a few things first.
First, before you set foot in a data science class at a major university, you are usually expected to have completed:
- three semesters of calculus,
- one semester of linear algebra,
- freshman computer programming,
- and possibly differential equations and/or upper-level statistics.
Not only that but you’re expected to remember them. I know, I was shocked too. So if you don’t have those classes on your transcript or you don’t recall how to calculate the multiplicative inverse of a matrix, you should probably brush up on those topics. More on that later.
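If the matrix inverse has gone fuzzy, here is a small refresher sketch in Python for the 2x2 case only. The function names and the example matrix are my own illustrations, and real coursework would normally lean on a linear algebra library rather than hand-rolled code like this.

```python
from fractions import Fraction


def inverse_2x2(m):
    """Return the inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = Fraction(a * d - b * c)
    if det == 0:
        raise ValueError("matrix is singular, so it has no inverse")
    # Swap the diagonal, negate the off-diagonal, divide by the determinant.
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul_2x2(x, y):
    """Multiply two 2x2 matrices (used here only to check the result)."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


example = [[4, 7], [2, 6]]            # hypothetical example matrix
inv = inverse_2x2(example)
identity = matmul_2x2(example, inv)   # a matrix times its inverse is the identity

print(inv)
print(identity)
```

Using `Fraction` keeps the arithmetic exact, so the product comes back as the identity matrix with no floating-point rounding to squint at.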
This is my second attempt at a graduate program, so here is some advice for those considering a research-based degree. First, you should plan on spending roughly 4 hours every week studying and preparing for every 1 hour of coursework on your schedule. Sometimes this can be more, rarely less, depending on the specific course and your background. Then if you have teaching duties, plan on another 15 or so hours per week teaching and grading. Oh, then you have research and writing you’re expected to do, so budget another 15 hours for that. Oh, I almost forgot lab group meetings and any administrative responsibilities you might have. To train graduate students in how to run a lab, faculty often delegate some of those responsibilities to their students. Finally, as you continue on in any program, you will usually have mentoring and leadership roles to take up your few remaining hours of freedom. Can you somehow sleep a negative number of hours??
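To see how quickly those rules of thumb add up, here is a rough sketch in Python. The 9 hours of coursework is a hypothetical load I picked for illustration, and lab meetings, administrative work, and mentoring are left out because I gave no numbers for them.

```python
# Rough weekly time budget built from the rules of thumb above.
# COURSE_HOURS is a hypothetical load; substitute your own schedule.
COURSE_HOURS = 9             # hypothetical hours of coursework per week
STUDY_PER_COURSE_HOUR = 4    # study hours per hour of coursework
TEACHING_HOURS = 15          # teaching and grading
RESEARCH_HOURS = 15          # research and writing
HOURS_IN_WEEK = 7 * 24


def weekly_commitment(course_hours):
    """Total scheduled hours per week for a given coursework load."""
    study = course_hours * STUDY_PER_COURSE_HOUR
    return course_hours + study + TEACHING_HOURS + RESEARCH_HOURS


committed = weekly_commitment(COURSE_HOURS)
remaining = HOURS_IN_WEEK - committed   # left for sleep, meals, and everything else

print(f"committed: {committed} h/week, remaining: {remaining} h/week")
```

Divide whatever remains by seven before you decide how much sleep you were planning on.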
A word to all the “Gentlemen C’s” out there: know that you are generally expected to maintain a 3.0 GPA in any graduate program.
Now that I’ve gotten most of the bad stuff out of the way, on to the good. If you like those kinds of environments and have a deep desire to learn, you may well have the time of your life. You’ll often be working on the cutting edge of science and technology and working with people who literally wrote the book on their subject. There is also an increasing number of online degree programs and ones designed specifically for working professionals.
Notice that a couple of paragraphs earlier I mentioned research-based degrees. Usually, you can tell a degree is research-based by its letters: M.S. and Ph.D. On the other hand, professional degrees are more likely to be MPH, MDS, or any number of other acronyms. Professional degrees will usually be more practically based and, perhaps more importantly to you, may require less in the way of prerequisites. There is always a tradeoff, though: professional degrees typically have requirements for work experience, usually 3-5 years. For instance, the degree I’m currently pursuing (a professional degree) doesn’t require any specific courses on your transcript, BUT you are still held accountable for those skills and that knowledge as if you had just taken the prerequisites I listed earlier.
A few words about online graduate degrees
Yes, there are a few high-quality online degrees out there, but it is often buyer-beware. Some things you want to look for are:
- Are they attached to a traditional, physical university? And if so, do they offer an approximate equivalent “on-campus” program? This is always a plus because such universities must maintain a certain standard or risk losing their accreditation. There is also a greater likelihood that you will be dealing with quality faculty.
- Will your degree or transcript say “Online” on it? Honestly, this is a very subjective issue. For me, the first concern is far more critical than this one. Still, if you’re going to put in 2-5 years of work on a master’s degree, you don’t want to risk coming out the other side with it being perceived as worth less than a “traditional” program.
Eh…got anything else???
One step below in intensity from an official degree are boot camps, webinars, certificates, and professional development programs. These can be very practical options, allowing you to demonstrate your talent with a type of real-life legitimacy that formal degrees don’t inherently impart.
Lastly, just taking time to read and learn about a given subject can be hugely beneficial. Perhaps you don’t need to know every nook and cranny of data science all at once. Maybe you can start by asking a single question or picking a specific topic. For example, “How do I use SQL to program a database?” or “What is this ‘GitHub’ thing I keep seeing in my Google results?” Then you can expand your knowledge from there.
If you consider that the true pioneers of any field had to discover it and make it up as they went, there’s no reason why we couldn’t learn the same way.
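As a taste of what chasing a single question like the SQL one can look like, here is a minimal sketch using Python’s built-in sqlite3 module. The table name, column names, and values are all made up for illustration, and SQLite is just a convenient stand-in for whatever database you end up working with.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table of heart-rate readings (all names and values invented).
conn.execute("CREATE TABLE vitals (subject_id TEXT, heart_rate INTEGER)")
conn.executemany(
    "INSERT INTO vitals VALUES (?, ?)",
    [("S01", 72), ("S02", 88), ("S03", 64)],
)

# Ask one question of the data: who has a heart rate above 70?
rows = conn.execute(
    "SELECT subject_id, heart_rate FROM vitals "
    "WHERE heart_rate > ? ORDER BY subject_id",
    (70,),
).fetchall()
conn.close()

print(rows)
```

From there, every "how would I..." that pops into your head (averages, joins, updates) becomes the next topic to look up.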
Some (but far from all) US-Specific Programs
Traditional Degrees (MS, PhD, or both)
- University of Arkansas for Medical Sciences [on-campus certificate, master’s & Ph.D. in biomedical informatics, including a specialization specifically in clinical research informatics]
- University of Colorado Boulder [on campus MS-DS]
- Indiana University Bloomington [several on-campus master’s & Ph.D. programs in bioinformatics and data science]
Online Degree Programs
- University of Colorado Boulder [online MS-DS]
- Indiana University Bloomington [online MS in Data Science]
- University of New Hampshire [online MS in Health Data Science]
Other Educational Opportunities
- SpringBoard [online data science bootcamp]
- University of Colorado Clinical Data Science Specialization [Certificate on Coursera]
Some Resources that I like
- www.leetcode.com [focuses almost exclusively on coding skills]
- GitHub Student Developer Pack
- The Data Book: Collection and Management of Research Data by Meredith Zozus [available from Amazon]
- Cracking the Coding Interview: 189 Programming Questions and Solutions [suitable for any technical interview type situations; available from Amazon]
- Statistics for People Who (Think They) Hate Statistics [I like the premise but have yet to read it cover-to-cover. I think the author overstates how dominant SPSS is, though my impression could be a result of the prevalence of SAS and R in our industry; available from Amazon]
Addendum: My Data Science Pathway (a personal story)
So, now that we’ve covered all that, I’d like to give you a peek into my journey so far to becoming a data scientist. Keep in mind, we could be polar opposites for all I know, so you may favor a different path, and I certainly have no patent on this method, so feel free to rip it off 100%.
First, it’s been a while since I had any type of formal schooling. I left my original graduate program in 2012 and haven’t had much exposure since then. So I knew I needed to brush up on my background knowledge. Running the occasional ANOVA test since then hardly demonstrates everything you need to know. To refresh my math & statistics chops, I chose Brilliant.org. Though there are plenty of similar websites, I picked the one that matches my learning style and budget. I generally try to do a lesson (they call them quizzes) per day.
I knew I also needed to improve my programming skills within data science. I feel pretty comfortable with C++, C#, and Java, for instance. Python, not so much. It’s just a question of exposure to the language for me, so I picked DataCamp.com. This site is a dependable resource because it deals specifically with data science applications. It admittedly costs a little more than I wanted to spend, but I found a nice 3-month free subscription through the GitHub Student Developer Pack, for which all I needed was a valid school email address.
One last preparatory tip I’ll offer is to make a cheat sheet. No, not to actually cheat. Virtually every language and package (you can commonly have 10 or more active at any time in real-life applications) differs slightly from the others, and I’ve found it really is impossible to keep them all straight at once. Case in point: perhaps you’re working with 2 different SQL databases at once, one programmed in PostgreSQL and the other in T-SQL. There is just enough difference between those two to cause you headaches, so maintaining a 3-4 page quick reference can be a life-saver. I’ve found the ones from www.quickstudy.com to be a solid starting point, but mine usually end up covered in post-it notes all the same.
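A cheat sheet doesn’t have to live on paper, either. Here is a minimal sketch of one kept as a Python dictionary, covering three well-known differences between the two SQL dialects mentioned above. The task names, the `lookup` helper, and the table and column names are all hypothetical; the point is the side-by-side layout, not a complete reference.

```python
# A tiny dialect cheat sheet: one entry per task, one template per dialect.
# Table and column names inside the templates are placeholders.
CHEAT_SHEET = {
    "first_n_rows": {
        "postgresql": "SELECT * FROM {table} LIMIT {n};",
        "tsql": "SELECT TOP ({n}) * FROM {table};",
    },
    "current_timestamp": {
        "postgresql": "SELECT NOW();",
        "tsql": "SELECT GETDATE();",
    },
    "concat_strings": {
        "postgresql": "SELECT first_name || ' ' || last_name FROM {table};",
        "tsql": "SELECT first_name + ' ' + last_name FROM {table};",
    },
}


def lookup(task, dialect, **params):
    """Return the snippet for a task in the given dialect, with placeholders filled in."""
    return CHEAT_SHEET[task][dialect].format(**params)


print(lookup("first_n_rows", "postgresql", table="visits", n=5))
print(lookup("first_n_rows", "tsql", table="visits", n=5))
```

Adding a new headache to the sheet is then just one more dictionary entry, post-it note optional.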
I made good use of Kindle and YouTube for virtually every other topic to make sure I was solid. This leads me to another pearl of wisdom about graduate school: you can’t be too prepared. Transitioning from an undergraduate program (or what you remember of it) to a graduate program is like suddenly being drafted into a professional sport. The level at which you are expected to perform is high, and the learning curve is steep.
For my actual master’s program, I chose one that launched only recently. The University of Colorado Boulder has an online version of its on-campus master’s degree that is a solid fit for what I want. I also have found I do better with shorter courses, and this program has academic terms of 2 months. To a good approximation, they take each 3-hour class in the on-campus degree and chop it into 3 pieces. It is also 100% asynchronous, so as long as the coursework is completed by the deadline, I can watch the lectures and do the work whenever I can fit them in.