First and foremost: what does a data scientist do? That is the million-dollar question. In some companies, this person deals with business issues and builds dashboards. Sometimes you will find them in product teams, working on applications that use computer vision.
In other data science teams, this person develops algorithms to predict consumer behavior, and elsewhere they mostly need to know their way around lookups in Excel. The role and function have become so general that you can actually reduce it to this: a data scientist is someone who creates (business) value from data.
A common misconception is that data scientists make theoretical inventions, such as developing faster and better algorithms. That is not correct. Think of the difference between a physicist or chemist and an engineer.
The latter uses the insights of the former to meet a particular need or solve a problem. A data scientist works with insights from computer science, statistics, and possibly from the field in which she/he works: retail, biophysics, mobility, etc.
I have split a data scientist’s skill set into three broad categories, a generally accepted division. Under each category, I list some specific skills and discuss specific tools.
In terms of math & statistics, I was an absolute late bloomer myself. In high school, I put in as little effort as possible, and at university, I did my very best to avoid it all. But once I had a goal, I threw myself entirely into it.
How to learn?
When it comes to these pure areas of knowledge, there is only one place: Khan Academy. Better yet, it’s free! You will find everything about differentials & integrals and linear algebra. As a reference work, I think this book is really excellent. Everything is explained as briefly as possible, all calculation rules are listed, and it contains hundreds of examples.
I regularly come back to it when I do not immediately understand specific calculations performed by a computer. It’s dirt cheap too.
Descriptive statistics usually make up the first two chapters of a statistics book. If you studied at the University of Ghent, the chances are that this book is gathering dust somewhere. In it, you will also find all common statistical tests and learn enough to reason in an uncertain world (i.e., inferential statistics). Typically, any introductory statistics book should be enough to get started.
Nevertheless, I recommend following a MOOC that deals with this within the context of data science. The Data Science Specialization via Coursera (see below) devotes some lessons to this. It shows you, for example, how to draw samples in R, how to generate distributions, and how to perform statistical tests.
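To give a feel for those three steps (drawing samples, generating distributions, and running a statistical test), here is a minimal sketch in Python rather than R, using only the standard library. The two groups, their means, and the helper name `permutation_test` are assumptions made up for this example; the course itself works in R with its own data.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Generate a distribution: 10,000 draws from a normal distribution (mean 100, sd 15).
population = [random.gauss(100, 15) for _ in range(10_000)]

# Draw a sample: 50 values taken without replacement.
sample = random.sample(population, 50)
print(f"sample mean: {statistics.mean(sample):.1f}")

# Two hypothetical groups whose true means differ by 10.
group_a = [random.gauss(100, 15) for _ in range(40)]
group_b = [random.gauss(110, 15) for _ in range(40)]


def permutation_test(a, b, n_permutations=2_000, rng=random):
    """Two-sided permutation test for a difference in means.

    Returns the share of random relabelings whose absolute mean
    difference is at least as large as the observed one (the p-value).
    """
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


p_value = permutation_test(group_a, group_b)
print(f"p-value: {p_value:.3f}")
```

A permutation test is used here because it needs no extra libraries; a classic t-test would be the more common textbook choice.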
You can program in hundreds of languages, each with its advantages and disadvantages and with a specific focus. In the data economy, you will encounter different programming languages. The two most popular are, without a doubt, R and Python.
The latter is rapidly gaining market share: Python is truly becoming a universal programming language that is being embraced within various ecosystems. Nevertheless, R is more stable, more robust, and really made for data analysis, although it is also a lot slower than Python. With some detours, you can achieve about the same with both languages.
Personally, R is my favorite programming language, because you really feel that it has been developed for research data. You also don’t have to mess around with virtual environments, you don’t have to work in the terminal to update packages, and you have far fewer compatibility problems.
In comparison, Python alone offers many ways to manage your libraries (Anaconda, for example), making it difficult for the novice data scientist to see the forest for the trees. However, when you are asked to develop a web service or write an application, Python is definitely the first choice. Finally, for unstructured data, you are also better off with Python, for speed alone.
How to learn?
Simple: you do programming. You don’t read books for that.
DataCamp or Codecademy? There is no correct answer. What I do recommend is to choose one: pick the one you feel most comfortable with. Let’s be honest: in the coming months, you will be programming during your lunch break, after working hours, and on the weekend.
You read that right. You are supposed to practice. Once you complete a class or specialization, you must maintain your skillset by working on personal projects. Find something that interests you and work with available data sets — or create your own.
SQL can be pronounced like the English word “sequel”. It is a language for communicating with databases. You can write it in a user interface or embed it in your R or Python code. There are different dialects (Microsoft, Oracle, BigQuery, etc.), and NoSQL databases such as MongoDB use their own query languages, but you can switch fairly quickly if you know the basics.
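To make the Python route concrete, here is a minimal sketch using Python’s built-in `sqlite3` module with an in-memory database. The `orders` table and its rows are invented for illustration; in a real project you would connect to an existing database instead.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, only here to have something to query.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 12.5), ("carol", 50.0)],
)

# A read-only query: total spend per customer, largest first.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()

for customer, total in totals:
    print(customer, total)

conn.close()
```

The `SELECT ... GROUP BY ... ORDER BY` pattern carries over to most SQL dialects; it is mainly the connection code that changes from one database to the next.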
How to learn?
In many cases, you will only read data in your career. That is handy, because it means it is hard to break anything. By focusing on that side, you will get far. Several websites offer interactive SQL lessons, similar to Codecademy and DataCamp. I do not recommend books here: you learn SQL by writing queries.
Applied Machine Learning
Machine Learning is the study of algorithms (often, but not always, borrowed from statistics) that can automate a particular task. In the past, the rules for such a task were written out manually, line by line. With ML, you show thousands of examples to an algorithm that learns to discover patterns in those data. This results in a model that can be used on other data points to make predictions. ML is currently the method that drives AI applications.
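The idea of learning from examples can be shown at toy scale. The sketch below is a nearest-centroid classifier in plain Python: it “learns” the average position of each class from labeled examples and then predicts the class whose average is closest. The data points, labels, and function names (`fit_centroids`, `predict`) are invented for illustration; a real project would use a library and far more examples.

```python
import math
from collections import defaultdict


def fit_centroids(examples):
    """Learn one centroid (mean point) per label from (features, label) pairs."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(points) for column in zip(*points))
        for label, points in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given point."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Hypothetical training examples: (height in cm, weight in kg) -> animal.
training_data = [
    ((25, 4), "cat"), ((23, 5), "cat"), ((27, 4.5), "cat"),
    ((60, 30), "dog"), ((55, 25), "dog"), ((65, 35), "dog"),
]

model = fit_centroids(training_data)

# The model is now applied to data points it has not seen before.
print(predict(model, (24, 4.2)))  # a small animal
print(predict(model, (58, 28)))   # a large animal
```

Nobody wrote a rule like “heavier than 15 kg means dog”; the boundary follows from the examples, which is the shift described above.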
I like to draw a line between ML for structured data and ML for unstructured data such as text, photos, and video. In the first case, you are often busy creating features that your algorithm can learn from. In the second case, you will mainly apply algorithms that have stood the test of time. These are mainly different configurations of neural networks.
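For structured data, creating features often means deriving new columns from raw ones. A minimal sketch, assuming a hypothetical list of transactions with a timestamp and an amount (the field names and the `make_features` helper are made up for this example):

```python
import math
from datetime import datetime


def make_features(transaction):
    """Turn one raw transaction into features an algorithm can learn from."""
    ts = datetime.fromisoformat(transaction["timestamp"])
    return {
        "hour": ts.hour,                                   # time of day
        "is_weekend": ts.weekday() >= 5,                   # Saturday=5, Sunday=6
        "log_amount": math.log1p(transaction["amount"]),   # dampens large amounts
    }


# Hypothetical raw records.
transactions = [
    {"timestamp": "2021-03-06T14:30:00", "amount": 120.0},  # a Saturday
    {"timestamp": "2021-03-08T09:15:00", "amount": 8.5},    # a Monday
]

features = [make_features(t) for t in transactions]
for row in features:
    print(row)
```

Which features are worth deriving depends on the problem; the three here are only common examples.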
How to learn?
For structured data, the book where it all starts is An Introduction to Statistical Learning, or its XL version, The Elements of Statistical Learning. With these books, you will gain insight into the theory behind the algorithms and apply them in R.
Although perhaps a bit outdated, the Data Science Specialization from Johns Hopkins via Coursera is an excellent MOOC. DataCamp offers excellent interactive training tracks such as Data Scientist with R or Data Scientist with Python.
For unstructured data, I can recommend this e-book from Michael Nielsen that goes through deep learning fundamentals. Do you prefer paper in your hands? Then I can recommend Deep Learning with Python, a book that also helped me a lot.
The chances are that you will come into contact with TensorFlow (and its high-level interface, Keras). The wildly popular MOOC TensorFlow in Practice, with Andrew Ng, walks you through many examples.
This section covers everything that has to do with computer infrastructure, networks, and tooling in general. Because more and more things are happening in the cloud, it doesn’t hurt to start there.
If you want to use a dataset with tens of thousands of photos, for example, to distinguish cats from dogs, then that is practically impossible to do on your own computer. You really need computing power. You can rent it from the major cloud platforms such as Microsoft Azure, Amazon Web Services (AWS), or Google Cloud Platform (GCP).
They offer a complete set of tools for building and hosting websites, applications, and entire business infrastructures. What is characteristic of cloud platforms is that you pay for what you use. For example, an hour of training an algorithm can cost several euros.
In the data economy, you will mainly use tools to move, clean, store, and make data available for specific purposes.
How to learn?
I suggest choosing one platform and getting to know the different tools in it. Overall, AWS is the largest. But when it comes to data and AI, Google surprises friend and foe with GCP, where things move really fast. Google is fully committed to AI and is behind many frameworks and standards that are also integrated into other cloud platforms.
Azure is also on the rise, thanks to the fact that most large companies work with Microsoft technology. For those companies, it is convenient that they can switch to cloud technology with small investments and that everything remains compatible.
All platforms offer, in one way or another, a trial period or free credits with which you can play around at will. It’s not a bad idea to have your registration for the MOOCs coincide with your free cloud subscription.
It is important to know that you can really make a difference with this skillset. There are already a lot of data scientists around who can train a model on their computer.
Not many graduates, or even people with several years of experience, can deploy their work in a cloud environment. You can also feel that those expectations are present in the labor market. It is no longer enough to be a script kiddie who can work with Python or R.
Anyone who wants to work in the data economy should be curious, for two different reasons:
- You are often looking for insights, and you need to understand the subject matter to arrive at valuable insights and products. This holds regardless of the field (e.g., retail, biology, politics, or space).
- Technological progress is currently occurring at breakneck speed. Yesterday you would use a Random Forest algorithm to tackle a problem. Today you have enough computing power to get better results with neural networks. And you have to keep up with that constantly, often in your spare time.
The chance that you will be asked to come up with a new training algorithm is almost non-existent. But you will have to link existing solutions to specific problems. In other words, you need analytical insight. You don’t have to hide the fact that you don’t know every application. But you must have at least a rough understanding of what exists, so that you can pick the right solution when a problem crosses your path.
Depending on your background, the transition to data scientist can be rough or relatively smooth. If you list all of the above skills, you may end up with a solid curriculum. Remember: through persistence, the snail reaches the ark. It was only after three years that I “dared” to call myself a data scientist.
You don’t have to be able to do everything. Try to specialize in specific skills and aspects of the data economy. In the next blog post, I will elaborate on the different possible profiles you can work towards: data scientist, data engineer, ML engineer, BI developer, BI analyst, …
Please take advantage of any opportunity that presents itself to do something with data. Test a hypothesis, set up an experiment, perform a regression, train a small model, clean some data in R instead of Excel, and so on.
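As one possible starting point for “perform a regression”, here is a minimal sketch of simple linear regression by ordinary least squares in plain Python. The numbers (advertising spend versus sales) are invented for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: advertising spend (x) and sales (y).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(spend, sales)
print(f"sales = {intercept:.2f} + {slope:.2f} * spend")
```

Swapping in a small dataset of your own is enough to turn this into a first personal project.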