Discovering the ‘Science’ in Data Science — Part 2

Scientific Method

Typically, the following steps are undertaken when a scientist wants to explore the natural world.

  1. Define Objective
    This is where you get acquainted with the purpose of the scientific study. In most cases the objective translates into a problem statement or a question. Let us consider this one:
    Objective: Determine the shape of the Earth.
  2. Acquire Information
    At this stage, you come up with various ways to make observations. These observations must provide detailed information about the problem or objective. For example, you look at the horizon from a seashore and observe its shape.
  3. Formulate Hypothesis
    This is the most crucial step. Here you make an educated guess about the solution, or the answer to the question. How can you do this? You employ logical reasoning to make a statement that can be tested and either proved or disproved by an experiment.
    Hypothesis: The Earth is spherical in shape.
    Remember, however, that a hypothesis is an educated guess based on observations and the inferences drawn from them. In our example, you observed that the horizon is curved and inferred that the Earth might be a sphere. Until it is tested, this remains a hypothesis.
  4. Conduct Experiment
    Here you may need engineering skills to design an experiment that can measure the independent and dependent variables accurately. Not every experiment needs equipment: Einstein was famous for his ‘thought experiments’, which are worth reading about if you are interested. Experimentation is the fuel that drives the machine.
    In our example, let’s keep things simple by observing and recording a lunar eclipse with the naked eye on a clear night. It will clearly show the curved shadow of the Earth on the Moon. To be more confident, let’s observe this phenomenon from both hemispheres and from widely separated locations on the Earth.

Data Science (Scientific) Method

Now that we know how to apply science to study the natural world, let’s attempt to apply the steps above to a Data Science problem and see how far we can go.

  1. Define Objective
    The key objective here is to target the right leads, which should improve conversion and hence boost sales. Typically, the objective of a business problem is outlined in a few sentences and may sometimes be ambiguous. It is critical to study and understand the objective, and then to ask the right questions so that the problem definition can be broken down into multiple smaller parts. Having the required domain knowledge is a big advantage, since it helps in asking the relevant questions.
  2. Acquire Information
    This involves data capture, collection and preparation for downstream use. As mentioned in this article, the company information was extracted from a website through an API. However, some preparatory work went into coming up with the input URL. More often than not, data integration is hard because of a lack of standards, and it is extremely difficult to templatize. In our example, the data size is not that large. For large-scale data extraction and preparation, however, proper data engineering practices must be followed. When dealing with big data in particular, an advanced technology stack is needed to build a data pipeline that processes the data.
    In the article, the author also talks about data cleaning. This is a critical step that helps improve data quality. In this process, we get rid of extraneous text and other superfluous data. Tokenization, stemming and the removal of stop words are the NLP steps used to clean the data and improve its quality. Next, the data is transformed from text to vectors, which gives it a numeric representation and makes it more suitable for use in prediction algorithms. Vectorization of words, phrases or sentences is a technique for creating numeric data that is easy to use in mathematical operations.
    Gathering and processing data so that it is suitable for machine learning algorithms also saves a lot of time later on. This is important to understand, because people often think (erroneously) that Data Science is just about using algorithms for classification and prediction, and not about data preparation.
  3. Formulate Hypothesis
    In our example, we want to pick the companies that are better leads than others and thus have a higher chance of converting into customers. Since we have the company descriptions, how would we go about figuring out which companies are the better leads? This question leads us to the hypothesis:
    Hypothesis: Given a company description, we can predict the likelihood of that company being a potential customer.
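
The cleaning and vectorization steps described under Acquire Information can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the pipeline used in the referenced article: the stop-word list is deliberately tiny, `crude_stem` is a toy suffix stripper rather than a real stemmer, the vectors are simple bag-of-words counts, and the two company descriptions are made up for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "is", "we", "that", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Toy suffix stripper for illustration only, not a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [crude_stem(tok) for tok in tokenize(text) if tok not in STOP_WORDS]


def build_vocabulary(docs):
    """Sorted list of every distinct term across the cleaned documents."""
    return sorted({tok for doc in docs for tok in doc})


def vectorize(doc, vocabulary):
    """Bag-of-words vector: one term count per vocabulary entry."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]


# Hypothetical company descriptions, invented for this sketch.
descriptions = [
    "We build cloud software for banks.",
    "Organic farming and food delivery.",
]

cleaned = [clean(d) for d in descriptions]
vocabulary = build_vocabulary(cleaned)
vectors = [vectorize(doc, vocabulary) for doc in cleaned]

for description, vector in zip(descriptions, vectors):
    print(description, vector)
```

Each description ends up as a list of numbers with the same length as the vocabulary. Vectors of this kind, paired with labels saying which companies actually became customers, are what a classifier would need in order to test the hypothesis above.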
