How to Read and Analyze GDAT Files Using Python
https://towardsdatascience.com/how-to-read-and-analyze-gdat-files-using-python-5c8dece157d4/

A quick tutorial on how to work with these computer-modelled binary files.

Data comes in all shapes and sizes.

While many of us spend most of our data education and careers working with data in relatively "friendly" formats, such as spreadsheets and CSV files, there may come a time when you’re confronted with data that isn’t so friendly. You might not even be able to visualize it straight out of the box.

This happened to me recently, when a computer model I was running was outputting data in a gridded binary format. The tricky thing about binary files is figuring out how to read them to access and analyze their contained data. After scouring the edges of the internet for a solution, I cobbled together a simple Python function that allows you to read gridded binary data so that it can later be analyzed using your favorite Python libraries, such as matplotlib or NumPy.

This niche solution will allow you to read gridded binary data files with GDAT file endings produced by computer models, particularly those modeling natural processes, such as environmental or meteorological phenomena. As such, the code below makes these assumptions:

  • Your GDAT file follows GrADS conventions (though this will likely work for various other binary files).
  • Your GDAT file represents a gridded study area over a specified study period.
  • In the GDAT file, there is a grid of data values for each day in the specified study period.
  • Each cell in the grid of data values contains a tuple of values.
  • The grid of data values has a set number of rows and columns that can be used to index cells.
Visual representation of the gridded binary data, with each grid of values representing a study area organized into days (for each day of the study period). The cells in each grid can be indexed using row and column indices. Figure by the author using Canva.

Reading the binary GDAT file

Python">import struct

def read_gdat_file(file_path, format_string, number_rows, 
number_columns, number_days):
  data = []
  with open(file_path, 'rb') as f:
    for _ in range(number_days):
      day_data = []
        for _ in range(number_rows):
          row_data = []
          for _ in range(number_columns):
            value = struct.unpack(format_string, f.read(4))[0]
            row_data.append(value)
          day_data.append(row_data)
        data.append(day_data)
  return data

The above code reads a binary GDAT file and structures its data to resemble the grid of your study area for easier interpretation and analysis.

  1. import struct: struct is a Python module that allows you to work with binary data. This module contains functions that allow you to convert binary data into Python objects and vice versa.
  2. def read_gdat_file(file_path, format_string, number_rows, number_columns, number_days): This line begins the function that will allow us to read a binary file. For it to work, we will need to pass in arguments that detail the path to the correct GDAT file, the format type of the GDAT file, the number of rows and columns representing the study area, and finally the number of days the GDAT data covers. Knowing the number of days represented in the GDAT file allows the function to correctly partition the binary data into the rows and columns necessary to represent the study area for each day. This facilitates accurate analysis of the data later on. You should be able to find the number of days, as well as the number of rows and columns needed to represent the study area within whatever computer model parameters you’re using to generate the GDAT data.
  3. data = []: This line initializes an empty Python list that will be used to contain the GDAT data in its final grid format.
  4. with open(file_path, ‘rb’) as f: This line opens the file in binary read mode (designated by the ‘rb’ argument), allowing the function to access its data. Opening the binary file using the ‘with’ statement ensures that the file is closed after you have accessed the data.
  5. for _ in range(number_days): This for loop iterates through the binary data and reads the data for each specified day. I’ve opted to use an underscore in this for loop (as well as the following for loops) because the variable doesn’t need to have a name as I will not be using it later. You can use typical iteration counter variables, such as i, or j if it better suits your programming style.
  6. day_data = []: This line initializes an empty Python list that will be used to contain the binary data for each day. It will contain all of the rows of binary data relating to that specific day.
  7. for _ in range(number_rows): This for loop iterates through the specified number of rows within the specified day.
  8. row_data = []: This line initializes an empty Python list that will be used to contain the binary data for the current row within the specified day.
  9. for _ in range(number_columns): This for loop iterates through the data found in the specified number of columns within the specified row.
  10. value = struct.unpack(format_string, f.read(4))[0]: This line initializes a variable called value and, using the unpack function from the struct module, reads four bytes of binary data from the GDAT file at a time and interprets them according to the format_string specified (read the struct module’s "Format Characters" section to determine which format string you need). The unpack function always returns a tuple, so [0] is placed at the end to extract the first (and in this case, the only) value. If each cell in your modeled study area contains multiple values, you would instead read more bytes per cell with a multi-value format string and keep the whole tuple rather than indexing [0], unless you’re only interested in one of the cell values. A scenario where cells contain multiple values arises when the quantity measured in the cell has x and y components (i.e., wind); the short sketch after this list shows the difference.
  11. row_data.append(value): This line appends the unpacked float value to row_data, which represents the current row.
  12. day_data.append(row_data): This line appends the current row to day_data, which represents the current day.
  13. data.append(day_data): This line appends the data for the current day to data, which represents the overall dataset.
  14. return data: This function will continue iterating through the binary data file until it has read the grid data for each day into the overall dataset, designated as data. This line returns the overall dataset, converted from the binary file into a Python list. data returns gridded data separated into each day of the study period. This dataset can now be analyzed.
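To make the role of the format string concrete, here is a minimal, self-contained sketch. The byte values are made up purely for illustration; it simply shows how a single-value format string differs from a multi-value one.

import struct

# Pack four made-up floats so we have some binary data to work with.
raw = struct.pack('ffff', 1.5, 2.5, 3.5, 4.5)

# A single 'f' reads 4 bytes and returns a one-element tuple, hence the [0].
single_value = struct.unpack('f', raw[:4])[0]        # 1.5

# A multi-value format string (e.g., 'ff' for x and y wind components)
# reads 8 bytes and returns both values, so you keep the whole tuple.
x_component, y_component = struct.unpack('ff', raw[:8])  # (1.5, 2.5)

print(single_value, x_component, y_component)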

Returning the data for a particular cell in the study area grid for the entire study period

While your computer model likely produces data for a large study area, you may only be interested in analyzing the data for a particular cell within the grid across the entire study period.

Say, for example, you want to see how closely the wind speed values produced by the computer model match observed wind speed values, and a meteorological station with wind speed observations sits in a particular cell. We will extract the data for the cell containing the meteorological station for the entire study period, after which you will be able to plot the observed versus the modeled data to determine how accurate the model is.

The Python function below uses the Python list data returned from the previous function.

def read_cell_data_for_study_period(data, row_index, column_index):
  cell_data = []
  for day_data in data:
    reversed_day_data = day_data[::-1] #Optional
    cell_value = reversed_day_data[row_index][column_index]
    cell_data.append(cell_value)
  return cell_data

The above code extracts the specified cell data for the entire study period.

  1. def read_cell_data_for_study_period(data, row_index, column_index): This line begins the function that will extract the cell data for a specified cell using a row index and a column index to specify the cell’s location. The data argument takes the variable containing the list holding the GDAT data in its final grid format (this was created using the previous function). The row_index and column_index arguments take the integers specifying the row and column where the cell of interest is located.
  2. cell_data = []: This line initializes an empty Python list that will contain the cell data for the entire study period.
  3. for day_data in data: This for loop iterates through the gridded data for each day of the study period.
  4. reversed_day_data = day_data[::-1]: This optional line of code is used if, upon printing out the cell data for the specified study period, you find that the gridded data is not being read from the correct starting point. In most scenarios, gridded data will be read from the upper left corner and will therefore be "0 indexed". However, in some scenarios, the gridded data is read from the lower left corner. This phenomenon causes the grid indexing to be wrong, resulting in the wrong cell being read using your specified row_index and column_index. Therefore, this optional line of code flips the gridded data vertically so it is read beginning from the upper left corner. Note: This line should only be used if it is determined that the grid of data is being read from the lower left corner. Omit this line if your data grid is being read correctly to avoid erroneous data readings.
  5. cell_value = reversed_day_data[row_index][column_index]: This line initializes a variable called cell_value which will contain the cell data at the specified row and column index for each day of the study period. As you can see, your specified row_index and column_index arguments are used to access the correct cell in the gridded data.
  6. cell_data.append(cell_value): This line appends the cell data for the current day to cell_data, which represents the overall list containing all of the cell values for the entire study period.
  7. return cell_data: This function will continue iterating through each day of data and appending the value at a specific cell to the list designated as cell_data. This line returns the list, after which you will be able to print out and analyze the cell values for each day of the study period.
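As an aside, if you prefer array indexing over nested loops, the same extraction can be done with NumPy. This is only a sketch: it assumes data is the nested list returned by read_gdat_file above, and the index values are placeholders for your cell of interest.

import numpy as np

# Placeholders -- use the row and column of your cell of interest.
row_index, column_index = 10, 20

data_array = np.array(data)          # shape: (days, rows, columns)
data_array = data_array[:, ::-1, :]  # optional vertical flip, as in step 4 above
cell_series = data_array[:, row_index, column_index]  # one cell, every day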

Example of how you can analyze cell data

import struct
import matplotlib.pyplot as plt

#Function that reads the binary file (see above)
def read_gdat_file(file_path, format_string, number_rows,
                   number_columns, number_days):
  data = []
  with open(file_path, 'rb') as f:
    for _ in range(number_days):
      day_data = []
      for _ in range(number_rows):
        row_data = []
        for _ in range(number_columns):
          value = struct.unpack(format_string, f.read(4))[0]
          row_data.append(value)
        day_data.append(row_data)
      data.append(day_data)
  return data

#Function that returns the data for a specific cell for the entire study
# period (see above)
def read_cell_data_for_study_period(data, row_index, column_index):
  cell_data = []
  for day_data in data:
    reversed_day_data = day_data[::-1] #Optional
    cell_value = reversed_day_data[row_index][column_index]
    cell_data.append(cell_value)
  return cell_data

#Specifying the file path to the binary file, wherever it's located
# on your system; also, specifying the format_string for the file.
file_path_binary_data = "file-path-binary-data.gdat"
format_string = 'f'

#Specifying the number of rows, columns, and days represented in the 
# binary file
number_rows_in_gridded_study_area = 45
number_columns_in_gridded_study_area = 108
number_days_in_study_period = 365

#Reading the binary file
data = read_gdat_file(
  file_path=file_path_binary_data, 
  format_string=format_string, 
  number_rows=number_rows_in_gridded_study_area,
  number_columns=number_columns_in_gridded_study_area,
  number_days=number_days_in_study_period)

#Specifying the row and column index used to read the values from
# a specific cell. These index values must abide by the specified number
# of rows and columns in the study area (above).
row_index = 30
column_index = 90

#Reading the cell data for each day in the study period
data_for_specific_cell_for_study_period = read_cell_data_for_study_period(
  data=data,
  row_index=row_index,
  column_index=column_index)

#Plotting the cell data for each day in the study period
plt.figure(figsize=(10,6))
plt.plot(range(1, len(data_for_specific_cell_for_study_period) + 1), 
  data_for_specific_cell_for_study_period, 
  label='Simulated Data',
  color='blue')
plt.xlabel('Day')
plt.ylabel('Unit of simulated data')
plt.title('Simulated data at specified cell for study period')
plt.legend()
plt.show()

Troubleshooting

  • Read your computer model documentation to understand how its output is formatted. This will help you determine which values you want to extract from the tuple of data representing each cell, as well as what format the cell values are in (e.g., floating point).
  • If possible, create TIF files from your GDAT files and open them in a GIS program. This will allow you to visualize your gridded data, as well as to check that your gridded data is being read from the upper left corner by the function used to read cell data for each day of your study period (one way to do this is sketched below).
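For reference, here is one minimal sketch of that TIF-creation step using the rasterio library. The grid origin, cell size, and coordinate reference system below are placeholders you would replace with your own model's grid definition, and data is the nested list returned by read_gdat_file.

import numpy as np
import rasterio
from rasterio.transform import from_origin

# Write the first day of the study period as a single-band GeoTIFF.
day_grid = np.array(data[0], dtype=np.float32)

# Placeholder georeferencing: west, north, x cell size, y cell size.
transform = from_origin(-120.0, 50.0, 0.25, 0.25)

with rasterio.open(
    "day_0.tif", "w",
    driver="GTiff",
    height=day_grid.shape[0],
    width=day_grid.shape[1],
    count=1,
    dtype="float32",
    crs="EPSG:4326",   # placeholder CRS
    transform=transform,
) as dst:
    dst.write(day_grid, 1)

Opening the resulting file in a GIS program lets you confirm visually that the grid's upper left corner sits where you expect it to be.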

Subscribe to get my stories sent directly to your inbox: Story Subscription

Please become a member to get unlimited access to Medium using my referral link (I will receive a small commission at no extra cost to you): Medium Membership

Subscribe to my newsletter to get more exclusive data-driven content with an environmentalist spin: DataDrivenEnvironmentalist

9 Simple Tips to Take You From “Busy” Data Scientist to Productive Data Scientist in 2024
https://towardsdatascience.com/9-simple-tips-to-take-you-from-busy-data-scientist-to-productive-data-scientist-in-2024-e6cadefaa388/

These tips can help you become the most productive data scientist version of yourself this new year.

Photo by Ümit Yıldırım on Unsplash

Are you actually busy, or are you just being unproductive?

Whenever it feels like one more task will sink my ship irreparably, I ask myself this question. It brings me down to earth and makes me consider whether I’ve made the best use of my time in the past week or if I’m just "busy".

We wear the badge of "busyness" with honor, like something of a status symbol in our deranged work-obsessed society. If you’re not busy, you’re not working hard enough to make it as a data scientist – there will always be someone who will work harder and take your place in a heartbeat. But as John Spencer so succinctly puts it: "You don’t get a trophy for packing your schedule with more projects and more accomplishments and more meetings". Of course, we subconsciously know this, but somehow we keep working like we will get some award or some bonus or some raise – even though we all know that our hard work will go unnoticed 9 times out of 10 because we’re just in the data department.

Therefore, I can’t think of a better goal for 2024 than to become a productive data scientist. Here are my 9 best tips for going from a "busy" data scientist to a productive data scientist.


1. Produce a task-prioritization matrix

An example of a task matrix to help you prioritize your tasks. I couldn’t improve on this example by Asana. While we may not all be in a position where we can delegate tasks (I jokingly delegate my urgent but not important tasks to my dog), it’s still a relevant category in which you can place your tasks (perhaps just change the headlines depending on your needs).

You’ve heard it before and you’ll hear it here again: using a task-prioritization matrix will increase your productivity as a data scientist.

Whenever I’m feeling overwhelmed with the number of tasks piling up, I write them all out on pieces of paper and then begin to categorize them by their urgency and importance, using the same formula as the Eisenhower Matrix you see above. This immediately provides a clear picture of what needs to be done, what can wait, and what is irrelevant. These tasks might be unique to a specific data analysis you’re conducting for a client (so you would build a matrix on a project-by-project basis), or they might just be the tasks you have for an entire workweek.

Productivity stems from focusing on what is immediately urgent and getting rid of all the distractions and unnecessary tasks that crowd our desks. In data science, where most of your work is wrapped up in producing deliverables for deadlines, this can be a handy way of ensuring that you’re prioritizing the right stuff.

However, it’s also important to not get too wrapped up in ensuring that everything falls into these categories as we think it should. Your boss will inevitably give you a task that you think is neither urgent nor important (like changing the chart bars from red to orange), but to them, it will be. This is not a war to fight. However, for the rest of your standard tasks, you can generally get away with organizing them this way with very effective results.

2. Set goals for each day, week, month, and project

Productivity stems from having a clear roadmap of what we want to accomplish and how we plan on getting there. Yearly resolutions are not going to get you through the days, weeks, months, and multitude of projects therein, which is why you need to set these types of goals accordingly.

You’ll rarely see someone accomplish their big yearly goal if they don’t have a framework of goals for each day, week, month, and project leading up to the end of the year. Similar to the phenomenon of people losing motivation to complete their New Year’s Resolutions by February, you won’t have a very productive year as a data scientist if you don’t have some daily, weekly, monthly, and project-oriented goals to push you forward.

In the data science work environment, most of these goals will be set for you by your project team, all of which will revolve around project deadlines. However, there will be some leeway in how you go about achieving the goals unique to your part of the project. To ensure you’re productive in achieving these overarching goals, you’ll need to weave in your own set of goals to ensure that each day, week, and month leading up to a project deadline is full of productive work. These could be small, such as "On Friday, I will have read the client-provided documentation and will have my notes and questions ready for the meeting next Monday", or large, such as "By the end of the month, I will have refined my model to give statistically significant results to p<0.05".

Whatever your goals, you should have a monthly calendar filled with goals color-coded by whether they’re goals for the day, week, month, or are affiliated with a specific project deadline. This gives you a clear roadmap of what you need to be accomplishing each day to remain productive.

One last thing to note is that you should only pick 2–3 goals for each day. Be realistic with yourself about how long it will reasonably take to complete these goals. By keeping the number of daily goals achievable, you’re sure to end each day with some level of success – and who doesn’t need that?

3. Track your time

How much time do you think you work every day? No seriously, like actually work?

The "screen time" feature implemented on many popular devices has been revolutionary for me in thinking differently about how much time I’m actually spending on work. You may think you’re busy at work, but the amount of time your phone says you’ve been on Instagram today begs to differ.

Tracking your time can be a game-changing tactic to help you determine how much time you’re spending on work each day, how much time you’re spending on each type of task, and whether or not that time is truly productive work. It can also help you see whether certain tasks are taking you longer than they should. For example, perhaps you’re spending too much time making data visualizations. This could be fixed by creating a stylesheet and sticking to it to ensure that you’re not wasting time looking at fonts or making sure your colors are accessible to all viewers.

There are time-tracking apps available (one of my favorites being Forest) but I find that using the timer on your phone and the Notes app is the simplest way. But heck, you’re a data scientist, why don’t you just write the code to automate this task? All you need to do to get started is to start the clock when you begin working, stop the clock when you’re finished, and then make a note on your phone of how long you worked and what task you were working on.
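If you do decide to automate it, here is a minimal sketch of what such a logger could look like. The file name and fields are just placeholders; adapt them to whatever you actually want to track.

import csv
import time
from datetime import datetime
from pathlib import Path

LOG_FILE = Path("work_log.csv")  # placeholder file name

def log_session(task_name):
    """Time one work session interactively and append it to a CSV log."""
    start = time.time()
    input(f"Working on '{task_name}' -- press Enter when you stop... ")
    minutes_worked = round((time.time() - start) / 60, 1)

    is_new_file = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new_file:
            writer.writerow(["date", "task", "minutes"])
        writer.writerow([datetime.now().date().isoformat(), task_name, minutes_worked])
    print(f"Logged {minutes_worked} minutes on '{task_name}'.")

# Example: log_session("debugging the data pipeline")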

It’s truly eye-opening to have your working hours right in front of your eyes. You may be contracted to work 9–5, but how many hours are you truly working? Further, how many of those hours would you deem to have been productive? The jury is still out on the golden number of hours we’re productive in a work day, with it currently residing somewhere between 2 and 6. From personal experience conducting data analyses and computer modeling, my limit on productive work seems to hinge somewhere between 5 and 6 hours – anywhere beyond that and my brain feels like it’s fried. However, it’s important to note that this varies depending on the work I’m doing. Technical work with deep dives into code or data (especially when it’s not cooperating as it should) drains me much quicker than if I’m writing a report or creating pretty visualizations. This will be unique for everyone, so it’s not a bad idea to also log how you feel at the end of each workday to see if you can begin to sense some patterns in the tasks you work on, your productivity, and your energy levels.

4. Time block

I don’t know about you, but my code comments make no sense if I’m also having a meeting, answering emails, and petting my dog while I’m trying to write them.

While everyone has a different working style that works for them, I think we can all agree in 2024 that multi-tasking is probably the most ineffective productivity hack ever known. Instead, time blocking has begun to take its place as a true method for increasing productivity.

Time blocking is when you allocate certain chunks of time within your day to a specific task. For example, the first two hours of your workday could be allocated to administrative tasks, such as answering emails and attending your daily standup meeting. The next hour of your workday could be focused on dealing with your bug backlog. After lunch, you dedicate three hours to cleaning, preparing, and analyzing data. The last hour of your day is spent in a couple of meetings.

Why does time blocking make you more productive? It allows you to settle into a task or specific type of work by not allowing for distractions that might break your stream of focus. For example, you’re not going to get anywhere in debugging your code if you’re constantly having team members poke their heads into your office to ask you questions. Our best work gets completed when we can singularly focus on what is required, and our "busy" work happens when we have six different distractions demanding our attention away from what needs to be done.

Time blocking can be challenging to do in an office environment due to your accessibility to those around you. However, blocking off sections of time within your day in a shared calendar space and ensuring your office door stays closed during those deep work periods can be a great way to start this productivity habit.

I discovered time blocking around the time I began working with data and code again, and it has become the best tool in my productivity toolbox. Giving yourself three uninterrupted hours to work with code or data allows you to live within it, get a deeper understanding of how the data fits together, and get a clearer picture of how the code is supposed to work. In other words, working for three uninterrupted hours will produce better, more efficient results than working for six interrupted hours.

5. Set up a workflow of productivity tools, e.g., Git for version control, Trello or Notion for organizing projects, the Scribe Chrome extension to create visual step-by-step guides, website blockers, etc.

Your potential for productivity as a data scientist is only as strong as the tools you surround yourself with.

To start, pick three areas of your work that need some organizing and streamlining. For example, this could be your code version control, your daily schedule and project planning, and your tendency to get distracted and start doomscrolling when your code stops working. From there, decide whether you want to handle these areas in a digital or analog manner – in other words, do you need a tool on your computer or do you need a physical solution? For code version control, you’ll need a digital version control repository, such as that available through GitHub. To handle your daily schedule and project planning, you could go analog or digital. Going analog could mean buying yourself a daily planner for your scheduling and a wall calendar for your project management. Digitally, some of the go-to tools include Notion, which could be used for both daily schedules and project planning, and Trello, a handy little tool for project planning. Finally, dealing with your doomscrolling can be as easy as installing website blockers on your work devices and setting your phone to "Do Not Disturb". There, you just set up a productivity workflow!

Productivity workflows are unique systems, which means that trial and error will be involved. For example, I found that I loved using GitHub Desktop to make my code commits, but couldn’t be bothered to use Git from the command line. In the same way, I use Notion to plan my projects but just use the calendar on my phone to schedule my day. Further, I need my phone to be in a completely different room if I need to get some deep work done. It’s a great idea to look up other people’s productivity workflows on sites like Medium and YouTube to get some inspiration and ideas for which tools might work best for you, but it’s never a good idea to base your workflow on someone else’s. It may take a few months to perfect your system, but you’ll know that it’s the right one for you when you don’t even have to think about it.

6. Improve your communication skills and re-consider what actually requires a meeting

While many job-hunting-related articles here on Towards Data Science stress the importance of communication skills for data scientists, many of us are less accomplished at these skills than we like to admit. We may be masters at communicating data stories to a client, but we may fall short at communicating where it can really make a difference: with our teams.

Nothing stalls productivity more for a data scientist than team members poking their heads in or sending Teams messages asking them to re-explain a task that needs to be done. Even worse is working within a team that doesn’t have a good concept of which topics or crises require a meeting. While there are exceptions to both of these scenarios, such as working with interns or actually having something blow up, more often than not, these small productivity inconveniences can be eliminated. It all starts with strong, clear communication.

At a personal level, begin practicing your communication skills when it comes to explaining tasks, concepts, and results. This can be as easy as talking to your desk duck or your mom, writing blog posts and articles online, or creating YouTube tutorials. Whatever your method, practice communicating until you can get your point across once without follow-up questions. If you do get a follow-up question, see if you can answer it without getting a second. Clear, concise communication only once per question = more flow state productivity for you.

At a team level, set some team standards (and more importantly, enforce and stick to them) for what constitutes the need for a meeting. Virtual or in-person, meetings are more often than not a drain on productivity and energy when they’re not spent tackling an honest problem. This simple task is a quick way to produce constructive teamwork and communication that can easily be modified to fit the workflows and projects that may come up.

In the same vein, do endeavor to also become a better listener this year. Asking someone to repeat themselves because of a lack of understanding of what they’re asking or lack of knowledge of how to do it is perfectly acceptable; asking someone to repeat themselves because you weren’t listening is not.

7. Do the hard tasks first thing in your workday and do the easy tasks later in your workday

The beginning of your workday, whether that’s 6 am or 6 pm, is the time when you will feel the most focused, persistent, and energized. That is the time when you need to tackle the hardest to-do on your list for the day.

Whether that’s getting through the task of cleaning data, developing presentations, or fixing data pipelines, you need to begin your workday with your hard tasks and save the easy stuff for later.

We’re all familiar with the mid-workday slump, where your eyes can barely stay open and it feels like you’ve been staring at the same bug for 30 minutes without having really tried to fix it. This is not the time when you should be attempting to fix your computer model because, more often than not, you’ll probably make things worse, or more likely yet, won’t get anything moving in the remote direction of "fixed".

The best way to make sure your hard tasks get done first thing is to start your day off by making a master to-do list of everything you need to get done that day. Referencing your task-prioritization matrix (see tip #1 above) comes in handy here. Pick 1–2 tasks that you know would just drain the life force out of you and work on those first thing – the rest of your tasks can be worked on later. Save mindless tasks or tasks you find easy, for the mid-to-late point in your workday where you begin feeling more sluggish and uninspired. You’ll find after a couple of days of doing this that your days start more productive and remain more productive until the time when you log off.

8. Become a better team player to optimize collaboration and task delegation

I once worked on a team where one of the members was new to team leadership on this scale and had a lot at stake regarding how well the team performed. This resulted in this team leader not trusting their team members to do their work, so while task delegation occurred, it always resulted in the team leader being deeply involved, often right up until the time the tasks were due. Naturally, this devolved into everything being completed only at the last minute, with the team leader always asking for last-minute changes to be made down to the last second before task submission. This often looked like a lot of "busy" work followed by one or two 12-hour days trying to make changes before the submission deadline. Not a very healthy work environment where every single project comes down to the last second, right?

From this story, I hope you gathered that optimized teamwork is one of the best ways to ensure your productivity as a data scientist. Whether you’re a single data scientist at a small startup or part of a team of data professionals at a large corporation, setting up effective inter- and intra-team systems is paramount to your productivity.

It all boils down to team norms for collaboration and task delegation – both of which require that each party trust that the other can complete their work to a set standard by a set deadline. While many of us can feel as though we’re on an island while working, it’s important to remember that our results and deliverables are often depended upon by other members of an organization to complete their tasks. As such, teams should get together early on to establish standards for deciding who needs to be working together, which tasks need to be delegated where, and how long before project deadlines these tasks need to be completed to ensure sufficient time for edits.

Now these will seldom perfectly stick because, let’s face it, this is the real world – if we’re not constantly moving towards a state of disorder, then we’re disobeying the laws of physics in a concerning way. Data can be harder to wrangle than expected, your teammate may get egregiously ill while they’re supposed to be visualizing data only they understand, your predictive model may be returning results that are suspect at best, and your supervisor may give you last-minute corrections seconds before the deliverable is supposed to be sent out. However, even just having the above-suggested standards in the back of each team member’s mind can be enough to make everyone a little more cognisant of how to work together more productively. The way to make this stick a little better every time is to keep reading to see how monthly reflections can be used to evaluate the effectiveness of your productivity strategies.

9. Take some time at the end of every month to reflect on what’s working, what isn’t, and how things could be done more effectively

It’s not a bad thing at the end of a month to sit there, be honest with yourself, and go: "What the heck happened in there?". However, you shouldn’t make it a habit of ending every month like this. How you respond and move forward is what matters.

Stress is a quick eroder of productivity, which is why it’s a good idea to take a step back at the end of every month (a period in which many stressful things have likely occurred and inevitably placed you back in the "busy" camp) and see what worked, what didn’t, and how you could be more productive next month. Maybe this means you need to develop a better folder system for your data storage, or maybe you need to delegate more explanatory tasks to your analysts to allow the scientists to focus on more predictive tasks. Maybe you just need to put your foot down and put your phone in a time-locked security box to keep from "doomscrolling" instead of dealing with the bug-filled pile of spaghetti your code is currently resembling.

Whatever your setbacks or gremlins, take some time at the end of each month to honestly write down what increased your productivity, what took away from it, and what you’ll try to implement or keep doing in the next month to keep yourself from getting "busy".


Subscribe to get my stories sent directly to your inbox: Story Subscription

Please become a member to get unlimited access to Medium using my referral link (I will receive a small commission at no extra cost to you): Medium Membership

Subscribe to my newsletter to get more exclusive data-driven content with an environmentalist spin: DataDrivenEnvironmentalist

Do These 5 Simple Things to Make Your Data Scientist Resume Stand Out From the Crowd
https://towardsdatascience.com/do-these-5-simple-things-to-make-your-data-scientist-resume-stand-out-from-the-crowd-eaea92cdab13/


The data science field is currently oversaturated, to say the least.

However, the field isn’t oversaturated with qualified candidates, if that’s what you thought I meant. You may see hundreds of candidates applying for a single job posting, but very few of those candidates are actually qualified individuals who could analyze data if their lives depended on it.

A few years ago I was involved in screening resumes for a position my company at the time was hiring for. I posted the job on LinkedIn, along with a few criteria questions. In the first couple of days, we received about 20–30 applications, but after reviewing the applications, as well as sifting through the ones that LinkedIn filtered out due to their answers to the criteria questions, only two were viable candidates. Extrapolate this experience to those data science postings where 300 candidates have applied, and you can imagine how many of those are actual contenders.

In tech, it’s very easy to call yourself a designer, a software engineer, or a data scientist. In many instances, the meaning of these job titles has changed over time. For example, "data scientist" used to refer to a very senior person at a company who had a master’s- or PhD-level education in the field. Now, in contrast, anyone who completes a data science bootcamp may feel qualified to call themselves a data scientist because they understand statistics and can use some data analysis libraries.

All of this means that with all of the noise recruiters have to wade through to get to the legitimate candidates, you need to step up your resume game to stand out from the rest. None of the suggestions listed below are groundbreaking or revolutionary, and if anything, you’ve probably heard of them before. However, once you start paying attention to these details, recruiters will begin paying more attention to you. Here are five things you can do to your resume to make a recruiter look at your resume twice.


What are recruiters looking for in a data scientist’s resume?

  • Learning and growth beyond the classroom: Learning data science in university or through a coding bootcamp is great and all, but these are very sterile environments that only take your skills to a certain level. While many employers know what they’re getting when they hire a fresh graduate, it can be beneficial to impress them with your ability to have gone beyond what you learned in the classroom. Working in data science is unique in that you will be constantly forced throughout your entire career to learn new technologies, apply new skills, and generally roll with the punches (i.e., learning how to work with ChatGPT and other advanced AI models instead of fearing them). Therefore, you might as well get in the groove of lifelong learning early on and impress the recruiters by showing them how you’ve advanced your knowledge already beyond the core fundamentals.
  • Ability to both lead and collaborate as part of a team: There currently appears to be a glut of entry-level data scientists thanks to the popularity the field has received, especially in the last five or so years. Couple that with a mass exodus of boomer data scientists who decided to retire in recent years, and you’re left with companies scrambling to fill positions at all levels, but especially those in more senior "team lead" or management positions. Not only that, but they’re also having to employ entry-level data scientists who may have never worked as part of a team due to completing their education in isolation during the height of the pandemic. Teamwork and leadership are two things that can’t really be taught (no matter how much they claim they can in those organizational behavior courses you’re forced to take), but you can improve your skills in them by practicing. It can be as simple as entering a hackathon and working as part of a team with people you’ve never met, or taking on a leadership role in a club or volunteer organization you’ve joined. Either way, you can demonstrate to employers that you can work as part of a team, and even lead a team if required.
  • Domain expertise: While anyone can analyze data, only a few can draw meaningful conclusions that can help companies make vital decisions moving forward. This is why it can be useful when transitioning into a data science career from another career (i.e., as an engineer, teacher, nurse, scientist, etc.) to remain in the field and use the domain knowledge you’ve acquired to help companies looking to solve problems in the area. However, domain knowledge can also be acquired, whether by auditing university courses for free, reading books, or attending networking events. Whatever way you choose to acquire domain expertise, make sure you can speak to it under a variety of different scenarios and that you’ve created a portfolio where you’ve put your knowledge to the test in creating personal projects that solve problems in that field (see below).
  • Quantifiable impact: As Ken Jee states in his article Data Science Resume Mistakes to Avoid, your worth as a data scientist is tied to your ability to impact a company positively. In other words, recruiters want to know what problems you’ve solved and what the outcomes of the project were. Did a company optimize its process and increase its earnings after you found a way for it to market to its customers more efficiently? Did you discover a relationship between a nearby coal mine and its effect on wildlife that could initiate discussions on pollution management in your area? Whatever you’ve accomplished, recruiters want to know the details of how you’ve produced impact through the projects you’ve completed.
  • Education and experience: Whatever your background or number of years as a data scientist, recruiters still want to know what your education and experience are in the field. A healthy mix of education and experience is sought after. But don’t worry if you’re a new graduate – you may have developed experience in unexpected places, like a capstone project, an internship, or through a club you were a part of. Additionally, you may have even made your own experience by developing a portfolio, starting a blog or newsletter, or doing pro bono work. Whatever your education and experience, both should have prominent places on your resume.
  • The right stuff: You could be a knockout R programmer, someone who can conduct time-series analysis on penguin population numbers with the best of them, but if you can’t write code in Python, understand business problems in the business sphere, or give recommendations to clients based on well-rounded domain knowledge, you probably won’t be the right fit for certain companies. Recruiters are looking for individuals who have the "right stuff" – people with abilities in the right technologies and the right domain knowledge for that company. This means that you need to tailor the technologies, the skills, and the domain knowledge you acquire to fit a certain type of company. For example, if you’re going to a science-based company, you’ll likely need to be proficient in R; however, if you’re going to pretty much any other type of company, Python will be the standard. Luckily, nearly every industry out there needs data scientists, which means you get to pick a focus that you’re genuinely interested in!
  • A professional portfolio: "Show, don’t tell". As I’ve mentioned before, you can tell an employer all day that you’ve got the skills for the job and that you’ve applied them to solve real-world problems. That’s great and all, but how are they supposed to believe you? I mean, they’ll find out either way if you make it to the technical interview, but you may never even get there if they interview another candidate who can show them definitively that they have the skills and the impact to back it up. A personal portfolio containing projects that solve problems in your target industry will help you convince recruiters that you have the skills they’re looking for.

5 things to make your data scientist resume stand out

1. Tailor, tailor, tailor

For some reason, it still amazes me how few people tailor their resume, given how many resources are available online that stress the importance of using keywords and doing your research for each resume you submit.

From experience, nothing quite causes disinterest in a resume like one that isn’t tailored to the company it was sent to. Not only do generic resumes not tell recruiters if you could be a good fit for the team, but they also show a lack of effort and preparation. By all means, I understand the grind of applying to 200 jobs. However, have you ever considered that you may not need to apply to 200 jobs if you took the time to sit down and craft a resume that was tailored to each job you were applying for? It could very well be that you start getting calls for interviews after applying to only 30 jobs because you’ve taken the time to give recruiters exactly what they’re looking for. To tailor your resume effectively, you’ll want to:

  • Keep it as short as possible (one page if you’re a fresh graduate, slightly longer if you have many years of experience)
  • Only include what is absolutely relevant (yes, you may have worked 3 years as a barista, but you may have more relevant and impactful experience to include from when you were part of a club that taught coding to underprivileged kids – however, I should mention that there are some benefits to including non-relevant work experience early on, which could show that you are dependable, trainable, etc., but these merits could be debated all day)
  • Match the language of your resume to that of the job ad (this is mostly important for getting through the all-important screening phase of job application software, but also shows the human who may eventually read your resume that you’ve taken the time to read the job description carefully)
  • Include keywords or phrases that you see emphasized in the job description (see below)

It should also be noted that the importance of keywords doesn’t necessarily have to do with the recruiters reviewing resumes (though they sure do like to see important keywords that have to do with project impact or the technologies you know), but more so to do with the application software companies use to filter out resumes. With few, if any, companies even accepting resumes personally via email anymore, it’s becoming harder and harder to stand out when a piece of code is deciding whether or not to show your resume to the recruiter. As much as this seems like a stupid game to have to play, you’ll have to become an expert in including keywords in your resume to ensure that it even gets the remotest chance of being looked at by a human. The best way to do this is to include keywords throughout your entire resume, such as in your education section, descriptions of past experience, and in the skills section. It’s also not a bad idea to include the most important keywords at the beginning of the resume, which will stand out most to a recruiter.

2. Quantify achievements, experience, and impact

"I wrote a program that analyzed sales trends" is a lot less demonstrative than saying "I developed a program to analyze and predict sales trends using X technology that resulted in improved efficiency in future sales predictions by 20%". The first statement makes a recruiter say "Okay great, so what?", whereas the second statement explains what you did, how you did it, and most importantly, why it was important.

Quantifying your achievements, experience, and impact is an important small step towards solidifying your resume which offers a few benefits. First, it provides results to back up the claims you make. Second, it suggests to recruiters that results are what power and guide your priorities and future performance. Finally, quantification is a great way to stick in the minds of recruiters and stand out from the rest of the competition.

Quantifying your achievements, experience, and impact does take some work to get started, so it’s important to begin setting up your workflow as soon as you begin something new, whether it’s school, work, or a project. Here is the general workflow I’ve used for the past four years to gather the numbers I need to quantify my achievements, experience, and impact:

  1. Track your work: I track absolutely everything that revolves around my work, including my hours, the number of projects I’ve completed, details on what those projects entailed, and the results of those projects. I’ve even logged my hours spent debugging! I find that the Notes app on my phone is sufficient since I typically bring everything over to a more permanent document at the end of each month, but I know plenty of people who have developed systems using Notion or spreadsheets.
  2. Develop some ranges for when exact numbers are lacking: If you’ve been tracking your working data for long enough, you’ll notice times when you don’t have exact numbers. For these instances, develop some ranges for data to indicate the relative amount of work completed while acknowledging that it can vary sometimes. For example, I could say I completed on average 5 data science articles per month during the year 2021.
  3. Focus on the key performance indicators that recruiters love to see most: Money, people, time, and rankings are the four most important metrics to recruiters. Examples for each include stating how much money your project made for a company, how many people you managed within a team, how long you worked on a specific project, or how much you improved a certain ranking for or within a company. It can be helpful to write these things down during a project to help keep them fresh in your memory.

Once you’ve collected and checked your data, you can begin turning them into 1–2 sentence summaries to include in your education, project, or experience descriptions on your resume.

3. Include projects that impart impact

How many people have you talked to looking to get into data science who have created a project that will "predict the stock market"? Just think of how many poor recruiters have had to sift through portfolios with those types of projects in there. Yeah, enough said.

If you’re serious about getting a job in data science and standing out from the 500 other people who want the same thing, you need to include projects on your resume that impart impact. This will be industry-dependent, so it’s not a bad idea to do some preliminary research to see what problems are facing organizations and companies in that industry. It’s important to note that these problems don’t have to be huge, nor do they have to have never been solved before. You could even find an alternative solution to a problem that has already been solved. For example, you could run an analysis that finds that the undefinable anomalies in pollution data are coming from a road construction project that’s kicking up a lot of dust. Alternatively, you could run an analysis that tells a small business in your community that they should be marketing to their customers on Fridays because that incites the most weekend shoppers.

The main point of including impactful projects is to show recruiters that you can find a problem and work your way through to a solution. Project management through an entire project lifecycle is not something that comes easily, so showing you can successfully carry it out will mean a lot to recruiters, especially those who are looking for candidates to hit the ground running in an established data science department, or who are looking for candidates to be the data science department themselves. Further, these types of industry-specific projects are a great way to demonstrate domain knowledge and provide some impact, regardless of whether they get acted upon.

4. Education first if you’re a recent graduate; experience first if you’re not

The order of resume sections can be debatable, but the standard is to put your education first if you’re a recent graduate and your experience first if you’re not. For those who fall into a bit of a grey area, having experience in an unrelated field before going back to school to get some form of education in data science, putting education first, followed by an experience section with only the most relevant jobs, could be the way to go.

I’ve seen plenty of resume first drafts where people have put their sections in questionable orders that don’t necessarily impart confidence to the recruiter. However, by using the method above, you’re most likely to present that which you are most confident about first. Further, in an interview, recruiters will often begin asking about your resume in the order it’s presented, so it’s always nice to be able to speak most confidently about the first couple of sections of your resume.

5. Make it eye-catchingly skimmable

If you can’t get the gist of your resume in 15 seconds, it’s time to rewrite.

When I was reviewing resumes, my brain seemed to instantly shut down when faced with a wall of words. You may even find the same thing happens to you when you’re reading something online and it’s just a wall of text with no headlines to break up content and make it skimmable.

In my experience, the most eye-catching resumes are the ones that don’t overwhelm you with words. There is always plenty of whitespace, large headers to break up sections, and bullet points containing only one to two sentences that provide thorough but short summaries. When the recruiter doesn’t feel overwhelmed or drowned in words, it makes it a lot more pleasant to consider the resume.

Given that recruiters are capable of looking over a resume and making a decision in 7.4 seconds, a good benchmark for you to use is about 15 seconds – if you can’t get the big picture of your resume in that time, you need to make it more skimmable and make the important information easier to glean. Like with this article, I make all of the important points stand out, so you can skim this article in less than a minute, get the important information, and then read the body of the paragraphs if something is particularly relevant to you. You need to do the same with your resume.

Making your resume skimmable can be as easy as using big, bold headings to title your sections, dividing lines between sections, and bullet points to break up the text into readable, easy-to-digest segments. If you’re finding it difficult to include all of the relevant details, you can always pick the most insightful ones, and then fill in the remaining details in your portfolio or the interview. As much as everything should be said in the resume, much more can be said in a portfolio or interview, so the main focus should just be on imparting the details that will make or break a recruiter phoning you for an interview.



The post Do These 5 Simple Things to Make Your Data Scientist Resume Stand Out From the Crowd appeared first on Towards Data Science.

]]>
Set These Boundaries for a Better-Quality Work-Life Balance as a Data Scientist In 2024 https://towardsdatascience.com/set-these-boundaries-for-a-better-quality-work-life-balance-as-a-data-scientist-in-2024-e3af4a256a23/ Sat, 04 Nov 2023 05:05:45 +0000 https://towardsdatascience.com/set-these-boundaries-for-a-better-quality-work-life-balance-as-a-data-scientist-in-2024-e3af4a256a23/ Work-life balance is something everyone yearns for but only some have the guts to achieve. With 2.9 billion search results for "work-life balance" on Google, it’s pretty obvious that it’s something we’re all after. Not only has it become a focus of our search efforts, but in the last three years it now seems to […]

The post Set These Boundaries for a Better-Quality Work-Life Balance as a Data Scientist In 2024 appeared first on Towards Data Science.

]]>

Work-life balance is something everyone yearns for but only some have the guts to achieve.

With 2.9 billion search results for "work-life balance" on Google, it’s pretty obvious that it’s something we’re all after. Not only has it become a focus of our search efforts, but in the last three years it now seems to make its way into our everyday conversations.

In 2020, data science became seen as a career that could impart some of this mystical work-life balance that everyone is talking about. However, many seem to be realizing that working in some aspect of data science can be just as life-consuming as any other job, if not more so, thanks to our now-prolific "flexible" work arrangements, where bosses seem to feel the need to tighten their grip on us even further to soothe their micromanaging insecurities.

Unfortunately, a work-life balance is not always given. Sometimes it has to be taken, using boundaries and non-negotiables. With 2024 just two months away, now is the perfect time to begin preparing for how you plan on getting your work-life balance back to make this coming year your most balanced year yet. Here are the five boundaries you need to set for a better-quality work-life balance in 2024.


1. Prepare a documentation system

Inevitably, project milestones will be overshot, budgets will be strained, and timelines will be confused. When this happens, you may end up being the punching bag that takes the full wrath of your team lead, the client, or worse, your boss.

However, chances are also high that you were the only one who dotted their i’s and crossed their t’s. Therefore, for the sake of your work-life balance, it’s important to create a documentation system that lets you prove, when things go wrong, that you were the one doing your job.

One of my favorite ways to document these types of interactions is using a spreadsheet. There, you can create a simple document where each row is a unique incident/email/conversation (delete as required), with columns for information such as event ID (because we’re data scientists, aren’t we, and not having one would just be chaos), date, name of the person you interacted with, the issue, their response, your response, whether this was followed up on, what the resolution was, whether escalation measures were needed, etc. I keep this kind of document open throughout the day so I’m assured that everything I encounter is written down.
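If you’d rather maintain this log in code than in a spreadsheet, here is a minimal sketch using Python’s built-in csv module. The file name, column names, and the log_incident helper are all hypothetical placeholders; adapt them to whatever your own documentation system needs.

import csv
from datetime import date
from pathlib import Path

# Hypothetical columns for the log; match them to whatever you track.
LOG_COLUMNS = ["event_id", "date", "person", "issue",
               "their_response", "your_response",
               "followed_up", "resolution", "escalated"]

def log_incident(row, path="incident_log.csv"):
    """Append one incident (a dict keyed by LOG_COLUMNS) to the log file."""
    new_file = not Path(path).exists()
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_COLUMNS)
        if new_file:                 # write the header once, for a brand-new log
            writer.writeheader()
        writer.writerow(row)

log_incident({
    "event_id": 42,
    "date": date.today().isoformat(),
    "person": "Project lead",
    "issue": "Deadline moved up two weeks",
    "their_response": "Requested delivery by the 15th",
    "your_response": "Flagged the data-cleaning risk in writing",
    "followed_up": "Yes",
    "resolution": "Deadline restored after scope review",
    "escalated": "No",
})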

As much as this may seem tedious or overkill, the one thing I’ve learned from working in tech that will always keep you from having to do more work than necessary is to document, document, document.

2. Make project timelines twice as long as needed

Anyone who has worked in tech for more than two minutes knows that projects always end up taking way longer than expected. That’s why, in the year when you will be getting your work-life balance back, you need to give realistic estimates of how long projects will take – in other words, always say that projects will take at least twice as long because they always will.

Data may be unexpectedly unusable, your team gets sick, your code may kill the software developers trying to make it production-worthy, your client may completely change their requirements one week before you’re set to launch, etc.

It’s imperative for the sake of your contracted hours (which, by the way, you should never work over, see below) that you set appropriate timelines for projects that allow you to produce quality work, even with all of the derailments, without forsaking your work-life balance. Nothing sends project quality down the drain faster than stress, which is why the careful work of proper data cleaning, analysis, and visualization is best done with a cool head and a deadline that feels weeks away. Better yet, if you manage to complete the project ahead of time, you’ve underpromised and over-delivered, which should really be your mantra as a data scientist.

3. Poor planning on someone else’s part does not constitute an emergency on yours

Sometimes, instead of the situation presented above where you get to set the appropriate project deadline, someone else sets it for you. And it’s unrealistic. And it’s going to affect your work-life balance to meet it.

Solution?

Tell people you will not meet their unrealistic deadlines and they should consult with you first to prevent issues like this from occurring again.

Oof. Yeah, I can see how that sounds scary to a data scientist who’s just getting started. However, I also know from experience that if you don’t stand up for yourself right from the beginning, it’ll be a heck of a lot harder to do later on, to the point where it may end up being easier to just leave the job than to try to rein in your ever-expanding set of responsibilities, projects, and unattainable deadlines.

Data Analysis is best left unrushed, even the tasks that seem minor, like changing the colors on a matrix scatter plot. While you will get faster with certain tasks over time, and automation may take on some of the heavy lifting, there is no point in presenting half-baked results to a client who is depending heavily on your analysis. The conclusions and resulting strategies you present could be life-altering for the client (not for the person, but for the company), which means that you need to be darn certain about what your findings are telling you. As such, you are not to be rushed. Everyone will be better off for it in the end; you might just need to remind them of that.

4. Never work overtime to artificial deadlines

By beginning to stand your ground on poorly planned deadlines as described above, you’ll already be making good ground on achieving this next non-negotiable. The next step is for you to never work overtime to artificial deadlines.

Artificial deadlines are a quick and easy way for you to suddenly begin checking your emails at dinner time, pushing code on the weekend, and having a team chat dissolve into another full-blown "quick" pair-programming session.

2024 is the time for you to decline working overtime to artificial deadlines and instead save this sometimes necessary evil for only the most extreme and exceptional cases. But to be completely honest, when does a data scientist ever really need to work overtime? Very rarely, though those times seem to appear when data scientists are not just data scientists, but are also systems analysts, the local IT help desk for their cubicle cluster, and potentially the software developers who make everything production-ready.

The point is that to maintain your work-life balance, you need to be honest with yourself about the criticality of your overtime. Because as long as you’re in a healthy work situation within your company, there never really needs to be a "data emergency" complete with artificial deadlines.

5. Stipulate that quality is your only modus operandi

Clients want all three corners of the proverbial project triangle: speed, cost, and quality. It’s a fact of life.

As likely one of the few data scientists in your company, it somehow automatically becomes your job to enlighten the client about how data analysis projects work, and particularly why speed is never the answer. Sure, the software department can slap together a product management system in a few hours, but a thorough data analysis that forecasts business trends a year into the future should be done with some tact.

In other words, when your results may dictate where your client positions their business going forward, quality should never be abandoned for speed; that trade always ends poorly. The client will benefit from you taking a few extra hours to refine your prediction models, especially if complex variables are at play, and more insight will be gained from an analysis that isn’t rushed for the sake of getting something a couple of days faster.



The post Set These Boundaries for a Better-Quality Work-Life Balance as a Data Scientist In 2024 appeared first on Towards Data Science.

]]>
How to Create a 1-Year Data Science Self-Study Plan Using the Seasonality of Your Brain https://towardsdatascience.com/how-to-create-a-1-year-data-science-self-study-plan-using-the-seasonality-of-your-brain-a2876d2f8b58/ Thu, 06 Jul 2023 18:48:52 +0000 https://towardsdatascience.com/how-to-create-a-1-year-data-science-self-study-plan-using-the-seasonality-of-your-brain-a2876d2f8b58/ Teaching yourself data science can sure seem out of reach when all you’re inundated with on social media these days is stories of how people taught themselves data science in three months and were hired by a FAANG company quicker than you could say "database". When you can’t even get your simple Python program to […]

The post How to Create a 1-Year Data Science Self-Study Plan Using the Seasonality of Your Brain appeared first on Towards Data Science.

]]>

Teaching yourself Data Science can sure seem out of reach when all you’re inundated with on social media these days is stories of how people taught themselves data science in three months and were hired by a FAANG company quicker than you could say "database".

When you can’t even get your simple Python program to run without errors, these types of stories can be the most disheartening.

At times like these, it can seem like an impossible dream to teach yourself data science and get started with a new career. It can also seem like a pointless exercise when you’ve tried everything in the past to teach yourself the concepts and tools you’ll need to succeed, only for you to give up in a couple of weeks due to a lack of commitment, advancement, or enjoyment.

However, if there’s one thing I’ve learned from successfully self-studying for the past four years straight, it’s that success comes when you finally learn how to work with your brain – instead of against it. And that comes with learning how to work with the seasonality of your brain.

Here’s how to set up your 1-year data science self-study plan using the seasonality of your brain to maximize your learning potential and effectiveness.


Your 1-year seasonal study plan

Brain functionality is affected by season, as much as it is by time of day. Annual rhythms of brain activity were studied in 2016 and were found to fluctuate depending on the season. The study found that the brain performed at maximum capacity in sustained attention tasks during the summer, but at minimum capacity in the same tasks during the winter. Additionally, the brain performed at maximum capacity in working memory (working memory refers to the memory needed for "planning, comprehension, reasoning, and problem solving") tasks during the autumn, but at minimum capacity in the same tasks during the spring. While more studies are needed to solidify these findings, we can still use them to produce a one-year data science study plan that will use your brain to its maximum potential.

Winter: programming and data structures

According to the study discussed above, winter is a time when your brain isn’t exactly kicking when it comes to sustained attention tasks. However, that doesn’t mean that you can’t begin working your way through Programming tutorials and becoming familiar with databases and data structures.

From experience, I can say that you shouldn’t be spending more than three hard hours a day learning to code or work with databases. There’s just something about learning to code that lends itself best to giving your all in two to three hours of lectures and then spending the rest of your time working on practice problems – which is typically where you do most of your learning anyways.

freeCodeCamp.org

Now is the time to begin working your way through the lectures on freeCodeCamp to learn the basics of Python (and/or R), SQL, and maybe even some JavaScript.

Then, the remainder of your day should be spent adding to your own personal projects or completing Leetcode practice questions. The application process of coding is where you will learn the most. Writing code, running into errors, learning to navigate StackOverflow, and making corrections is what will solidify the concepts you learned earlier in the day.
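To give you a feel for the kind of repetition this involves, here is one classic Leetcode-style exercise (the "two sum" problem) with one straightforward way to solve it. It’s only an illustration of the practice loop, not a prescribed problem set.

# Two sum: return the indices of the two numbers in nums that add up to target.
def two_sum(nums, target):
    seen = {}                         # value -> index where we saw it
    for i, value in enumerate(nums):
        complement = target - value
        if complement in seen:
            return [seen[complement], i]
        seen[value] = i
    return []                         # no pair found

print(two_sum([2, 7, 11, 15], 9))     # prints [0, 1]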

Spring: data visualization

As the study mentioned above suggests, spring is a low point for your brain’s working memory – this means that it’s time for you to begin ripping through some data visualization concepts and trying to commit them as best you can to memory.

Data visualization can be considered the "coasting" part of learning data science, and for good reason – you’re learning about accurate data representations, visualization types, and aesthetics. However, don’t be fooled into thinking that this stuff isn’t important. Quite the opposite. Data visualization is where you tell the data’s story, as well as give your predictions for the future.

You’ll want to work on establishing a workflow that makes sure you’re answering all the right questions before preparing your visualization: What is the goal of your visualization? Who is your audience? How much information do you need to give in one visualization? How can you use colors and charts more effectively?

While you don’t yet know much in the way of data cleaning (that will come in the fall, when you put everything together into your first full data analysis), you can begin visualizing some pre-prepared data thanks to the programming skills you developed in the winter. Check this list out for data sets you can use to begin building visualizations.
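As a sketch of what a first visualization might look like, here is a minimal matplotlib bar chart. The monthly rainfall numbers are invented purely for illustration; swap in any pre-prepared data set from the list above.

import matplotlib.pyplot as plt

# Made-up monthly rainfall values, purely for illustration
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
rainfall_mm = [78, 64, 55, 41, 30, 22]

plt.figure(figsize=(8, 4))
plt.bar(months, rainfall_mm, color="steelblue")
plt.title("Monthly rainfall (illustrative data)")
plt.xlabel("Month")
plt.ylabel("Rainfall (mm)")
plt.tight_layout()
plt.show()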

Summer: algebra, statistics, calculus

According to the study discussed above, summer is your brain’s optimum season for sustained attention tasks. This means that you want to tackle the hardest data science concepts during the summer. For most people, this means math.

The next three months are the time to break open the textbooks and tutorial videos on Youtube and begin mastering the topics of algebra, statistics, and calculus. These three areas of math are the ones that you’ll need for most general data science jobs (industry-specific requirements may require higher levels of mathematics, such as multivariable calculus, differential equations, and discrete math).

Professor Leonard

Professor Leonard is my favorite Youtube instructor for algebra, statistics, and calculus. He provides high-quality, full-length university lectures going from precalculus to differential equations. My only regret is not starting to watch his lectures earlier.

Fall: putting it all together – data analysis

Fall is when your brain is working at its maximum working memory capacity, which means that it’s time to put together everything you’ve learned in the past year and complete your first full data analysis.

Data analysis follows the steps of determining an objective for the analysis, collecting, cleaning, and analyzing the data, and finally interpreting the results and producing a conclusion. This will bring together everything you’ve previously learned, with the end goal of you being able to conduct the work of a real data scientist.
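Below is a rough skeleton, using pandas, of how those steps could look in code. The tiny made-up DataFrame stands in for whatever data set you end up collecting, and the column names are placeholders.

import pandas as pd

# 1. Objective: which product category drives the most revenue?

# 2. Collect: in practice you would load a file, e.g. pd.read_csv("sales.csv");
#    here a tiny made-up frame stands in for it.
df = pd.DataFrame({
    "category": ["books", "games", "books", None, "games"],
    "revenue":  [120.0, 80.0, None, 45.0, 230.0],
})

# 3. Clean: drop rows missing the fields the question depends on
df = df.dropna(subset=["category", "revenue"])

# 4. Analyze: total revenue per category, largest first
summary = (df.groupby("category")["revenue"]
             .sum()
             .sort_values(ascending=False))

# 5. Interpret / conclude
print(summary)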

The goal here is not for you to be perfect. Heck, you’ve spent the last nine months learning the fundamentals behind data analysis – that’s not a lot of time. Instead, the goal is for you to methodically think through the steps involved in data analysis while applying what you have been able to learn in the previous year. You may not have all the answers, and there may still be some techniques that elude you from being able to produce the best analysis possible. However, you should have the basic skills necessary to draw some insightful conclusions from the data you’re working with.


Final thoughts

It’s critical to reiterate my message from the beginning of this article: the goal of this plan isn’t for you to teach yourself data science in one year – it’s for you to develop a consistent routine that gets you regularly progressing through your data science learning program.

While the plan seems neatly laid out for you to become a data scientist in one year, that isn’t always the case – there will be roadblocks.

Instead, this plan is nothing more than a guide to help you study the subjects you need to become a data scientist at the best times of year to make the most of your brain’s natural ebbs and flows. With each subsequent year, you can be sure that the skills are becoming even more ingrained in your brain, thanks to the seasonality of our brains.



The post How to Create a 1-Year Data Science Self-Study Plan Using the Seasonality of Your Brain appeared first on Towards Data Science.

]]>
How to Upgrade Your Junior-Level Data Science Code to Senior-Level Data Science Code https://towardsdatascience.com/how-to-upgrade-your-junior-level-data-science-code-to-senior-level-data-science-code-d3eb88a5555d/ Mon, 12 Jun 2023 14:49:27 +0000 https://towardsdatascience.com/how-to-upgrade-your-junior-level-data-science-code-to-senior-level-data-science-code-d3eb88a5555d/ You did it. After years of hard work, you got hired as a junior data scientist. Your first few weeks flew by with company onboarding and before you realized it, a few years had passed. You worked on countless projects, both individually and as part of a team, and your solutions are making a positive […]

The post How to Upgrade Your Junior-Level Data Science Code to Senior-Level Data Science Code appeared first on Towards Data Science.

]]>

You did it. After years of hard work, you got hired as a junior data scientist. Your first few weeks flew by with company onboarding and before you realized it, a few years had passed. You worked on countless projects, both individually and as part of a team, and your solutions are making a positive impact on the company.

But now, you’re ready for your next challenge: becoming a senior data scientist. But how do you bridge the gap? What are some of the things that a senior data scientist needs to know? And most importantly: how do you transform your junior-level data science code into senior-level data science code?

Luckily, this last question is the easiest to answer and is the easiest skill to improve on your path toward becoming a senior data scientist. I’ve singled out the top four areas where your junior-level code can be transformed into something that would encourage any company to promote you to a senior data scientist position. The key is to master the fundamentals, ditch the spaghetti code, begin implementing testing and QA skills, and learn to optimize your code.


Master the fundamentals of data science code

You can’t run before you walk, so it follows that before you can write senior data scientist-level code you will need to master the fundamentals of code.

At the beginning of your data science journey, it’s an accomplishment to simply write code that runs properly. Now, however, is the time to begin mastering those fundamentals so that it’s no longer a surprise when your code runs properly.

This is the one tip that you can’t speed up, and that will just be achieved by spending time doing the work. Over your first few years as a junior data scientist, you’ll be given opportunities every day to work on mastering the fundamentals of data science code, from programming fundamentals to algorithms, to data structures, and to design patterns.

Furthermore, now is the time to deepen your knowledge base by learning other programming languages (likely the ones that your company uses or those that you have time to learn on your own for fun) and other technologies that can improve your quality of work (e.g., Notion for organizing your projects, Git for version control, code syntax-checking extensions in your code editor, etc.). Some of these languages and tools will stick, while others will simply provide insightful lessons that will make you a better data scientist even if you never use them again.

Now is also the time to stretch your capabilities and begin exploring even more intense concepts in data science. For example, you may be in more of a data analyst position where you’re explaining the causes of past events. However, your boss is now wanting you to move into the predictive side of things which requires you to begin learning about machine learning and Artificial Intelligence. Pushing yourself to learn these topics will allow you to move into more senior and supervisory roles, where you can begin passing on your knowledge to new junior data scientists who are starting out just like you did.

Focus on writing clean, maintainable, and readable code

I’ve often joked in previous articles that data scientists write terrible code. The spaghetti code is real, especially when you’re starting out. This may be permissible for the first couple of years that you’re working as a junior data scientist, but as your experience increases, it becomes less and less acceptable to write messy code.

One thing that will set you apart as the perfect candidate for a senior data scientist position is your ability to write clean, maintainable, and readable code. Not only does this make you easy to work with and immensely professional, but it also shows that you can pass on these techniques to future junior data scientists under your tutelage.

Therefore, to upgrade your junior-level code to senior-level code, you need to focus on making your code clean, maintainable, and readable at all times.

Both Python and R have great guides on best practices and styles which can help you begin formatting your code more professionally. Code cleanliness, maintainability, and readability are the cornerstones of a data scientist who is a pleasure to work with, which is why these standards should be emblazoned on your brain (or at the very least, have a prominent place on your desk within easy reach). Best practices and style are two things that should always be considered and reviewed heavily before pushing your final commit or sending your code to the software engineering department for translation into production-ready code.

This also means that you should be adhering to DRY coding principles (at the very least) and SOLID coding principles (at the more advanced level), to ensure that you’re writing the best code possible. While these principles may not be relevant if you’re primarily writing code that will never be touched by anyone else or that will only be run on a small set of internal machines, it’s not a bad idea to become proficient in them in case you ever change jobs or begin producing production-level code.
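As a small illustration of the DRY idea, here is a before-and-after sketch: the repeated cleaning logic in the commented-out "before" lines gets factored into one reusable function. The column names and data are made up for the example.

import pandas as pd

df = pd.DataFrame({"price": [10.0, None, -2.0],
                   "cost": [4.0, 3.5, None],
                   "margin": [6.0, None, None]})

# Before (junior-level): the same logic copy-pasted for every column
# df["price"] = df["price"].fillna(0).clip(lower=0)
# df["cost"] = df["cost"].fillna(0).clip(lower=0)
# df["margin"] = df["margin"].fillna(0).clip(lower=0)

# After (DRY): one function, applied wherever it is needed
def clean_numeric(frame, columns):
    """Fill missing values with 0 and floor each column at 0."""
    for col in columns:
        frame[col] = frame[col].fillna(0).clip(lower=0)
    return frame

df = clean_numeric(df, ["price", "cost", "margin"])
print(df)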

Additionally, at this point in your career, you should be a beacon for pristine industry/company code standards. Each code commit you push to the repository should be a gleaming example of what your industry or company is looking for, and should be something that could be printed off and used in a training manual. Yes, it will take extra time for you, but the extra bit of thoughtfulness will pay dividends when it comes time for your company to promote internally. What’s one thing they’ll look for? An employee that consistently writes clean, maintainable, and readable code – and that should be you!

This Quick and Easy 7-Step Checklist Will Help You Write Better Python Code for Data Science

Develop testing and QA skills

Becoming proficient in unit tests, integration tests, and automated testing frameworks is a great way to immediately take your code to the next level. While these are all skills you should be aware of as a junior data scientist, they’re skills you should be proficient in as a senior data scientist.

Testing and QA skills are where you can begin to write excellent code that works as it was designed and that can work in tandem with other pieces of code. Where before you may have just sent your code off to the software engineering department where they would get everything ready for integration, you are now going to be writing code like a senior data scientist and must ensure that your code functions properly and can be integrated into larger code bases.

While your company may have specific unit and integration tests they want you to run, it’s not a bad idea to begin building your own to ensure that your code is running and integrating the way it should. Your own forms of quality assurance are a great way to take responsibility for your own code and to ensure that if your code can pass your own tests, it can pass your company’s tests with no issues. Not only does this make you a better data scientist in the long run, but it allows you to become more efficient when writing code in the first place.
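As a sketch of what building your own tests might look like, here is a tiny pytest-style example for a made-up data-cleaning helper. The function, its rules, and the test cases are illustrative only, not your company’s standards; saved as something like test_cleaning.py, it could be run with pytest.

def clip_ages(ages, low=0, high=120):
    """Replace impossible ages with None so they can be reviewed later."""
    return [a if (a is not None and low <= a <= high) else None for a in ages]

def test_valid_ages_pass_through():
    assert clip_ages([25, 40, 99]) == [25, 40, 99]

def test_impossible_ages_are_flagged():
    assert clip_ages([-3, 250, 30]) == [None, None, 30]

def test_missing_values_stay_missing():
    assert clip_ages([None, 18]) == [None, 18]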

Developing testing and QA skills is a great way to show your company that you’re committed to improving your craft and that you care about the quality of your work and the code you push to the production environment. These are all attributes that make you a great candidate for a senior data scientist position.

Unit Testing for Data Science with Python

Make performance optimization a priority

Nothing is a better motivator to learn how to optimize code than walking past the software engineering department after you’ve pushed your code to them and hearing the grumbles synonymous with having received a data scientist’s code. It’s a humbling experience that every data scientist should go through.

Learning code optimization isn’t just about maintaining a healthy working relationship with the software department – it’s also about making yourself a more surefooted data scientist who can write excellent code without the support of another department. Being able to write stable, optimized code the first time is a great move toward becoming a senior data scientist.

Becoming educated in topics such as caching (storing a copy of the data in front of the main data store – not necessarily relevant in all applications, but useful when producing dashboards for clients), time complexity (the amount of time it takes your algorithm to run), database indexing (a structure that can speed up data retrieval operations in a database table), and query optimization (figuring out the best way to improve query performance) is a great place to get started in optimizing your Data Science code.
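Caching is the easiest of these to show in a few lines. Here is a minimal sketch using Python’s functools.lru_cache, where the slow_lookup function is just a stand-in for any expensive call, such as a database query or an API request.

from functools import lru_cache
import time

@lru_cache(maxsize=128)
def slow_lookup(key):
    time.sleep(1)          # simulate an expensive query
    return key.upper()

start = time.perf_counter()
slow_lookup("region_a")    # first call pays the full cost
slow_lookup("region_a")    # second call is served from the cache
print(f"Two calls took {time.perf_counter() - start:.2f}s")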

While not all of the topics mentioned above are relevant for all types of data scientist work, they’re all great tools to keep in your back pocket, whether for future jobs or for that one time the need arises and you can immediately hit the ground running to solve the problem – an essential attribute of a senior data scientist.



The post How to Upgrade Your Junior-Level Data Science Code to Senior-Level Data Science Code appeared first on Towards Data Science.

]]>
These 5 Tips Will Help You Learn Data Science When You Have No Motivation to Study https://towardsdatascience.com/these-5-tips-will-help-you-learn-data-science-when-you-have-no-motivation-to-study-8e25b4d55788/ Sun, 21 May 2023 15:18:38 +0000 https://towardsdatascience.com/these-5-tips-will-help-you-learn-data-science-when-you-have-no-motivation-to-study-8e25b4d55788/ Use these 5 tips to teach yourself data science when the motivation just isn't there

The post These 5 Tips Will Help You Learn Data Science When You Have No Motivation to Study appeared first on Towards Data Science.

]]>
Yeah, we’ve all been there.

One day you’re all gung-ho, creating a study schedule to teach yourself Data Science, and the next you’re finding excuses as to why you don’t have the time to study today. We all know that there are only so many times you can wash the walls of your apartment, clean out your refrigerator, or walk the dog before you’re stuck sitting in front of your computer again, trying to summon some form of excitement for studying.

As someone who has been studying data science on the side for four years now, I understand the drain that self-studying can have, especially in the motivation department. However, I’ve also become somewhat of an expert in developing methods to have the motivation to study every single day. While some people need the structure of a classroom to get any studying done, it is possible to develop your own practices that can keep you excited to study every day. Well, maybe not excited, but at least able to sit down and focus for a study period every day.

Here are five tried and true tips that will help even the most unmotivated individual teach themselves data science.


Use learning resources that keep you moving forward

One of the most important things I’ve learned since I started studying data science is that your learning resources can make or break you.

For example, I knew that I would need to have a grasp of calculus to carry out many calculations found in data science. Coincidentally, I had to take a calculus course as part of the requirements for my university degree. Since I was having to pay for the calculus course, I decided to use it to teach myself the calculus I would need for data science. However, the learning materials from my university were so atrocious that it took me five months to learn functions, limits, and differentiation. It was soul-sucking. That is until I found the best math teacher on Youtube. Professor Leonard’s calculus lectures were life-changing, and I found myself able to teach myself calculus through these videos in record time compared to when I was using the materials my university had provided.

To keep your motivation to self-study strong, you need to use resources that are helping you learn at pace, instead of keeping your wheels spinning for weeks on end trying to understand a concept. Nothing will kill your motivation quicker than being stuck trying to understand a concept for longer than one month.

There’s no reason to stick with a learning resource if it’s not doing its job. Luckily, the internet is so incredibly full of data science learning resources that you have many options.

For example, many people have had great experiences learning data analytics through the Google Data Analytics Professional Certificate that became popular in 2021. This self-paced course is designed to keep students moving forward by using extremely well-designed learning materials that allow you to complete the program in less than 6 months with 10 hours of study every week. Codecademy is another learning resource that has had great success in helping people learn to code, with its easy-to-follow, easy-to-digest modules that keep you moving forward in your studies without getting stuck.

In sum, there’s no good reason why you should stick with a learning resource if it’s draining your will to live by not being conducive to moving your studies forward. Self-studying data science should always be a form of forward progression. Yes, the forward movement may be slow at times, but there should never be a complete stop or a reverse of direction – there are too many different learning resources out there for that to happen. All you need to do is be able to admit when something isn’t working and change tactics to something that will.

Find study support in the form of online study spaces, "study with me" videos, and Discord chats

It’s weird how something as simple as studying along with someone, even if they’re halfway across the world, can be so motivating.

Online study spaces, "study with me" videos, and Discord chats have seemed to take off in popularity over the last three years, with many of these resources hitting thousands of viewers and members every day.

One of my favorite study channels on Youtube is run by Merve, who also coincidentally studies data science. The channel has 822k subscribers, and posts "study with me" videos seen by millions of viewers each week. There’s just something so inspiring about "studying" with someone that also helps keep you motivated.

Study Together, StudyStream, and Studyverse are all virtual study rooms where you can study with people from all across the world. These study rooms can help bust procrastination and keep you focused for hours at a time. Additionally, many study accounts on Instagram are using the broadcast and live features to host study sessions that all of their followers can tune in to.

The other top tool to keep yourself motivated is to join a Discord server, especially one dedicated to the different aspects of learning data science. These communities are great opportunities to keep yourself motivated to study, but also to get your questions answered immediately when you get stuck on a topic. Communicating with like-minded people is also a great way to learn more about the data science industry, network, and become a more well-rounded data scientist in the future.

Study for no more than 6 hours a day

When you’re self-studying data science, it can be difficult to determine exactly how much you should be studying every day. This can be compounded when you don’t have other commitments, which can leave your entire day open to studying.

This is also further affected by how much you see others around you studying. Social media has made the toxic study culture even more prevalent, with many people posting about how many hours a day they study. This can put unnecessary pressure on you to also be studying 12 hours a day.

While the amount of time that everyone can study effectively is different, I can attest to the fact that you should not be studying for any longer than 6–7 hours per day. Studying is an intensive form of brain use that is completely different than how you would use your brain working an 8-hour-a-day job. For example, working 8 hours a day does not mean that you’re using your brain intensively for all of those 8 hours. Some of those hours will be spent on energy-intensive tasks, but for the most part, your day will be spent using your brain less intensively, such as going to meetings, answering emails, and taking breaks.

Comparatively, your brain is being continuously worked hard while studying. Studying requires 100% of your concentration to do it effectively (especially when you’re exploring topics such as calculus and neural networks), which is why studying for 6–7 hours should be your maximum goal every day. This also takes into account the fact that you need to take care of yourself in other ways during your study day, including rest, socialization, exercise, and nutrition.

When your brain learns that it only has to work hard for up to 6 hours a day, you’ll likely find that it becomes easier to focus for those 6 hours. You’ll no longer feel like being distracted by your phone because you know that you’ll only have 6 hours to get through your learning tasks for the day. You’ll also find that you feel more refreshed going into the next day of studying because your brain has had ample time to rest. You may also find that your retention of material learned is greater, as your brain has more time to build strong connections to the material that you’ve learned without being constantly bombarded by new information.

Find practical applications that inspire you

Let’s face it – not all topics in data science are created equal. Unfortunately, the good stuff, such as Machine Learning, data visualization, and real-world applications can only come after you’ve learned code, mathematics, and communication skills. With topics like these to grind through, it can be difficult to remain motivated for the good stuff yet to come.

One of my favorite techniques to get past this slump is to find the practical applications of the material that inspire me. For example, learning limits and differentiation can be pretty draining, but only if you forget that they can be used to determine the rate of change of a function, which can tell you all sorts of cool things, like how quickly climate change is accelerating, how the cost of goods is increasing, or how access to healthcare is declining.
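To make that concrete, here is a tiny numerical example of a rate of change, using NumPy’s gradient function on some invented yearly temperature-anomaly values. The data are made up purely to illustrate the idea of a derivative as "how fast something is changing."

import numpy as np

years = np.array([2018, 2019, 2020, 2021, 2022, 2023])
anomaly_c = np.array([0.82, 0.95, 1.01, 0.84, 0.89, 1.17])   # invented values

# Numerical estimate of d(anomaly)/d(year): the year-over-year rate of change
rate = np.gradient(anomaly_c, years)
for y, r in zip(years, rate):
    print(f"{y}: {r:+.2f} °C per year")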

When you’re passionate about how you want to apply your data science knowledge (such as in healthcare, science, engineering, business, education, etc.), then it becomes easy to find the different ways that you can apply the knowledge you’re developing. For example, once you’ve mastered data analysis, you could do some pro bono work for a small business in your community to help them increase their sales. Or, you could create a predictive model of how many people would be affected by a particular natural disaster as a portfolio project.

Whatever your interests, there are always ways to apply what you’ve learned in a way that can inspire you to keep moving forward. Find what inspires you and apply data science to it.

Set time-sensitive learning objectives

I don’t care how much of a procrastinator you are, setting time-sensitive learning objectives works every time. It doesn’t matter if you leave it to the 11th hour as long as you complete it by the deadline.

One of the things I’ve seen people struggle with while self-teaching data science is a lack of motivation due to no structured time-sensitive goals. Many have said to me that they could never teach themselves data science because they don’t have the motivation to sit down and get their work done. However, this is simply due to a lack of structure.

One of the great benefits of going to school is that you’re in a structured environment with deadlines. Deadlines for assignments, deadlines for exams, and deadlines for graduation. You name it and there’s a deadline for it. This time-sensitive structure helps people focus and get down to work without even trying too hard. As I mentioned above, even if you leave it to the very last minute, you’re going to complete your work because you know there’s a hard deadline to abide by.

Therefore, the trick to attaining this motivation in the self-study world is to set time-sensitive learning objectives that you see as hard deadlines.

For many, this can be easy because you’re in the middle of a career change and only want to be out of work for so many months. For others, this can be more difficult because there may not be a specific time crunch that you’re working against.

Vacations, birthdays, events, and even weekends are great hard deadlines that can be used to motivate you to get your work done. Nothing feels better than getting all of your work done on Friday and knowing that your weekend is now open to do anything. The same goes for not having to worry about work during your vacation, your friend’s birthday party, or your child’s school play. Whatever the occasion, date, or end-of-week ritual, finding a hard deadline to structure your learning around can help motivate even the biggest procrastinator to teach themselves data science.



The post These 5 Tips Will Help You Learn Data Science When You Have No Motivation to Study appeared first on Towards Data Science.

]]>
How to Write Better Study Notes for Data Science https://towardsdatascience.com/how-to-write-better-study-notes-for-data-science-aeae79e96c00/ Mon, 03 Apr 2023 14:11:58 +0000 https://towardsdatascience.com/how-to-write-better-study-notes-for-data-science-aeae79e96c00/ A proven 6-step process for writing better study notes for data science

The post How to Write Better Study Notes for Data Science appeared first on Towards Data Science.

]]>
I’ve been a student for a long time. Like six years-in-post-secondary-so-far kind of long.

In all of those six years and various areas of study – including Data Science – the one thing that I’ve become an expert in is note-taking. Not only that, but I’ve built and refined a system for note-taking in data science that allows you to self-teach data science concepts more efficiently and effectively. No matter the topic, from programming to statistics to machine learning, this note-taking system helps you to build a deeper understanding of data science topics while also helping you better retain the information in the long run.


1. Distill key concepts into summary and cheat sheets

One of the best tips I received from a friend in law school is to create single-page summary sheets for each unit you complete. The goal of these sheets is to condense all of your many pages of notes from one unit into one document that highlights only the absolutely most important stuff. I began playing around with this concept for data science and it began to make a real difference in my ability to retain and recall concepts I had learned, especially those to do with coding, mathematics, and the intricacies of building Machine Learning projects.

This is a great exercise in pulling out the most important pieces of information that you know you will continue to use daily as you progress as a data scientist. Furthermore, it helps you focus on what is truly important while discarding any fluff you may have taken note of. Not only that, but these sheets are perfect to keep on hand for quick reference when you’re studying or working on a project. I like to do this by keeping my sheets handy on my desk or taped to a nearby wall. That way, when I’m working on projects, I can quickly reference my notes without having to dig around too much on Google for the answer.

My favorite technique for creating these sheets is to build a mind map with the unit name in the center. The topics that branch off from the center are taken from the learning objectives for that unit. For example, to create a mind map for a unit of calculus concerning derivatives, I would create branches for interpreting derivatives as rates of change, interpreting derivatives as slopes of tangent lines, differentiating algebraic and trigonometric functions, using differentials to estimate numbers and errors, applying derivatives to solve problems, and using implicit differentiation to solve related rate problems. Then, I fill in all of the relevant tidbits of information for each branch, such as formulas, important reminders, key tables of information, and other such pieces that are continuously used or relevant.

2. Use in-line examples to relate concepts

From personal experience, your data science notes are nothing without in-line examples that help you better relate, identify, and understand concepts.

How many times have you looked at your notes and, for example, noted that "classes are a blueprint that specifies the unique attributes and properties that an object may have" (see below) without actually being able to visualize what they are or what they look like? Don’t worry, this is more common than you think.

Our notes are only as good as the examples we apply to them, and when it comes to studying data science, our examples become even more critical when looking at concepts in Programming, mathematics, and the production of visualizations (to name a few). These are examples of topics where in-line examples next to your written notes can make a concept click for you, allowing you to identify visually what you’re talking about, and helping you relate that concept to other knowledge you have.

My favorite way to include in-line examples is to use note-taking apps such as OneNote, GoodNotes, or Notability, which allow you lots of freedom to create customized notes using typed text, handwritten notes, screenshots, drawn diagrams, recorded verbal notes, and more. These solutions are perfect for when you need to include screenshots of code, diagrams of database systems, mathematical equations, and examples of data visualizations, to name a few.

It’s also important to note that your in-line examples are also perfect places to add context to your notes. For example, it may not click for you why differentials in calculus are important to know until you understand that they’re vital for estimating numbers and errors or developing equations to describe how the rate of an event can change over time. Alternatively, you may not appreciate the importance of using different types of data visualizations until you learn that each one is better suited to representing certain forms of data over others. By providing context in your notes as to how certain data science concepts fit into the bigger picture of data analysis, you’ll be better able to apply these concepts and fit them together to solve a data science problem.

3. Insert diagrams, flowcharts, and mind maps

Humans seem to be becoming increasingly vision-driven creatures, which is why so many of us are succeeding in our studies when we include diagrams, flowcharts, and mind maps in our notes.

This simple trick allows you to create more in-depth notes that provide you with a deeper understanding of concepts. While I disregarded the importance of flowcharting when I was studying software development, I came to appreciate the simple task of drawing out logic and inserting it into my notes before cementing it in code. Having these types of diagrams in your notes can complement our human tendencies to focus immediately on photos and diagrams before reading text.

As much as data science is steeped in code, I find that visual representations of the logic, processes, or sequences that you’re carrying out can be beneficial to building your understanding of how the different components of data science fit together – how our problem can be turned into logic that can then be coded, extended into machine learning systems, modified into production code, and then used to produce results that can be translated for non-tech individuals.

Diagrams are ideal for learning how different pieces of code work together, how machine learning works, or how to tell a better data story. Flowcharts are necessary for writing out coding and machine learning logic. Finally, mind maps are great tools for relating the different concepts of industry questions, code, mathematics, data, and design that make up a data science project.

4. Rewrite concepts in your own words

Copying notes directly from your study material has its place, like when a concept is so simply put that you couldn’t possibly write it any clearer. On the other hand, using your own words to explain concepts in plain English (or whatever your language of choice) benefits your studying by forcing you to understand the concept before you write it down.

For example, when studying object-oriented programming (OOP) the definition of a class that you’re provided with may read like this:

Textbook definition: Classes are a template definition of methods and variables for a particular type of object.

That’s great and all, but does it really make sense? Instead, let’s look at how I would describe classes using my own words:

Author’s definition: Classes are a blueprint that specifies the unique attributes and properties that an object may have.

See? That makes more sense already. Then, you’ll have to create your own definition of objects so your understanding of these OOP concepts is more concrete.
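To make the "blueprint" definition concrete, here is a tiny, purely illustrative example: the class defines which attributes and behaviors every object of that type will have, and each object is one concrete thing built from that blueprint.

class Dataset:
    def __init__(self, name, n_rows):
        self.name = name          # attribute defined by the blueprint
        self.n_rows = n_rows

    def describe(self):           # behavior shared by every Dataset object
        return f"{self.name} has {self.n_rows} rows"

# An object is a concrete thing built from the blueprint
survey = Dataset("customer_survey", 1200)
print(survey.describe())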

The key here is to use your own words when writing your study notes (in specific circumstances where concepts are not properly explained in the first place) to help you cement your understanding. Additionally, the extra brain power used to create your own definition will make the concept easier to remember when reviewing your notes. This tip is also a part of the Feynman Technique, which you may find helpful in your data science studies.

5. Add your own questions or comments

The best tip I ever received while teaching myself various areas of mathematics is to write down your thoughts while studying. This means writing down everything from questions to comments that arise, directly where they arise.

For example, while working out a calculus problem, I’ll highlight areas of the problem and write my questions or comments there as I go along. Not only does this make it really obvious where my understanding has faltered, but it also helps my instructor give me better advice on how to improve my understanding.

This part of note-taking also helps keep you accountable for what you understand and don’t understand. We all get into the rhythm sometimes of just copying information down without actually checking to see if we understand it. By annotating your notes with comments and questions, you’re regularly checking back with yourself to see if you understand everything you’re reading.

This tip also applies to programming, where you can type comments and questions directly into your code, as well as any other topics where you may be taking notes, such as those concerning machine learning or data visualization.
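Here is a small illustration of what that can look like in practice; the snippet itself is a toy, but the NOTE and QUESTION comments show the kind of annotations worth leaving for your future self.

values = [3, 1, 4, 1, 5, 9]

# NOTE: sorted() returns a new list; .sort() changes the list in place.
ordered = sorted(values)

# QUESTION: why does this give the middle of the *sorted* list and not
# the original one? Revisit when reviewing lists vs. copies.
median_ish = ordered[len(ordered) // 2]
print(median_ish)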

6. Review, revise, and test yourself using your notes

This can be one of the hardest tasks to accomplish when you’re teaching yourself data science. How do you review, revise, and test yourself on your notes regularly when you don’t have exams to complete or interviews to prepare for? However, this is one of the most important steps you can take to ensure that your data science notes are actually working for you.

It’s critical that you review, revise, and test yourself using your data science notes to not only retain the material better (the obvious benefit of frequently reviewing, revising, and testing) but also to identify areas where your notes could better serve you and where they leave a little to be desired in the way of thoroughness or the clarity of your descriptions.

As you advance in learning data science concepts, it’s not a bad idea to return to old notes and see if you can find better ways to explain concepts you may not have fully understood when you first went through them. This not only ensures that you’re grasping everything properly but also takes advantage of all of the tips mentioned above to better improve your notes as well as your retention and understanding of topics.

The best way to do this is to sit down at regular intervals (this may be once a month, once a quarter, once every six months, or once a year, depending on how quickly you’re studying data science) and go through your notes, asking yourself seriously where they could be better (the idea being that you’re constantly gaining experience in data science, which can help you critically evaluate how your notes could be better written or explained). After making notes of these instances, take some time to test yourself, whether via flashcards, coding challenges, or example university tests available online. After marking the test, ask yourself again where your notes failed you in understanding concepts or where they worked really well. From here, you can modify your notes to suit your needs.



The post How to Write Better Study Notes for Data Science appeared first on Towards Data Science.

]]>
The Data Analyst Learning Roadmap for People Who Hate Math https://towardsdatascience.com/the-data-analyst-learning-roadmap-for-people-who-hate-math-c2d7b96cc488/ Tue, 07 Mar 2023 18:54:58 +0000 https://towardsdatascience.com/the-data-analyst-learning-roadmap-for-people-who-hate-math-c2d7b96cc488/ Math, when used in the very practical world of data analysis, is actually pretty fun

The post The Data Analyst Learning Roadmap for People Who Hate Math appeared first on Towards Data Science.

]]>
Math and Data Analysis are practically synonymous terms when you work in the tech industry. It’s why many people get scared away from working in analytics because there’s this preconceived notion that math is all the job entails.

In reality, while I can’t say that there isn’t some math involved, I can tell you that the math you do is actually quite fun. There’s none of the theoretical or imaginary-number scariness that comes with being a data scientist – instead, you look at practical, easily applicable mathematics to explain what happened and even to make simple predictions about what could happen.

You might even already know all of the math you’ll ever need to become a data analyst. Heck, I put together a 15-minute read on everything you would learn in an undergraduate-level statistics course to get you started. It’s that simple.

As someone who once failed a math course in high school, I can tell you with absolute certainty that this data analysis learning roadmap for people who hate math is exactly all you need – nothing more and nothing less – to help you take your first steps in the data analysis field. All you need is to be open-minded about learning just a little bit of math and the rest will be easy. You will probably spend more time learning to code and learning how to conduct data analyses than you will learning all of the math you need for the job.

This roadmap looks at all of the learning aspects you will need to cover to become a data analyst, with just a bare-bones plan for the bare minimum level of mathematics you need to succeed in the job.


Step 1: Learn Excel, SQL, and Python

Author’s note: In previous roadmaps that I’ve created, I’ve always suggested that people looking to become data analysts should learn Python or R. I’ve changed my approach to this simply due to the difficulties that come with learning R (and because Python is widely the industry standard). If you’re learning Python as your first and one-and-only Programming language, you have enough functionality at your fingertips to have a long and happy career as a data analyst.

Python is the easiest programming language you can pick up when first jumping into data analysis. It’s pretty powerful, has tons of functionality, and also has many applications outside of data analysis if you decide the field isn’t for you. It’s incredibly forgiving, gives you exactly what you ask for (almost literally, so it makes you good at asking the right questions), and allows you to produce effective analyses without having to become an expert in all of the minutiae that comes with programming.

My top tip for learning Python is to work on practice problems from Kaggle or Leetcode. As much as this kind of repetition may seem boring, it’s the quickest way to test yourself and force your brain to become accustomed to the patterns of programming problems.
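As a taste of the pattern-drilling this involves, here is one bite-sized exercise of the sort you might practice; the problem and data are made up for illustration.

from collections import Counter

# Practice problem: count how many orders each customer placed
orders = ["Ada", "Grace", "Ada", "Ada", "Grace", "Katherine"]

counts = Counter(orders)                 # Counter({'Ada': 3, 'Grace': 2, 'Katherine': 1})
for customer, n in counts.most_common():
    print(f"{customer}: {n} orders")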

But first, before you even begin to think about learning Python, you need to learn Excel. Why? Because Excel might be the only tool you need to become a data analyst.

Excel (while old-fashioned to some) is still an incredibly powerful tool for most simple data analyses: summarizing company data, identifying trends, and making simple predictions. We all say we’ll "actually" learn Excel one of these days – this is your sign to learn its ins and outs, because this skill alone may land you your first entry-level data analyst position at a small business.

One of my favorite ways to learn the tricks built into Excel is to look for TikTok videos of people sharing Excel hacks. These short, bite-sized videos can quickly teach you the program’s many shortcuts, speeding up your workflow and making you more comfortable with Excel through this form of microlearning.

Finally, you will learn SQL, a language that allows you to work with data in a database. SQL can be a complex language to master, which is why one of my best tips is to create a cheat sheet for yourself as you go along. This cheat sheet should show you everything from the proper syntax for SQL commands to the different types of data joins you can create.

You don’t have to memorize every function – that’s a waste of your time – but you should become familiar with common ones like COUNT, CONCAT, TRIM, MAX/MIN, GETDATE, and CONVERT. As I mentioned above, keep a cheat sheet of your most-used functions and learn how to Google the right questions to find the rest.

One of the quickest ways I became familiar with SQL was to download a free database and work through it, trying out functions and syntax along the way. Having a database you can muck around in without worrying about making mistakes or wrecking anything is a great way to get your hands dirty with the code (and, in my opinion, the best way to learn SQL).
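
If you want to try this without downloading anything, Python ships with SQLite, so you can spin up a throwaway database and practice your SQL from a single script. The sketch below is just one possible setup – the table and data are made up for illustration, and note that functions like GETDATE and CONVERT are SQL Server-specific and won’t work in SQLite:

```python
import sqlite3

# A throwaway in-memory database: nothing to install, nothing to wreck.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Ana", 19.99), (2, "Ben", 5.50), (3, "Ana", 42.00)],
)

# Practice aggregate functions and grouping, just like on a real database.
query = """
    SELECT customer, COUNT(*) AS num_orders, MAX(amount) AS biggest_order
    FROM orders
    GROUP BY customer
"""
for row in conn.execute(query):
    print(row)  # e.g. ('Ana', 2, 42.0) then ('Ben', 1, 5.5)

conn.close()
```

Because the database lives in memory, you can experiment as recklessly as you like and simply rerun the script when things go sideways.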

Step 2: Refresh your memory of algebra and statistics

If you can do algebra and statistics, you’ve basically got everything you need to be a data analyst.

In fact, if you go through my article that outlines an undergraduate statistics course in fifteen minutes, you’re pretty much already through the worst of it. See? I said this was a data analyst learning roadmap for people who hate math.

An Undergraduate-Level Introductory Statistics Course in 15 Minutes

The truth is that you can always learn more math when you’re a data analyst. You can never know enough math. However, if you truly despise it, you can get away with knowing algebra and statistics.
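
To show just how practical that math is, here’s a minimal sketch of the kind of statistics a data analyst leans on day to day, using nothing but Python’s built-in statistics module (the sales figures are made up for illustration):

```python
import statistics

# Monthly revenue for a small shop (made-up numbers, purely for illustration).
sales = [1200, 1350, 1280, 1500, 1420, 1610]

print(statistics.mean(sales))    # the average month: ~1393.3
print(statistics.median(sales))  # the "typical" month: 1385.0
print(statistics.stdev(sales))   # how much months vary: ~149
```

That’s the flavor of it: averages, typical values, and spread – numbers that directly answer questions about a business.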

To get a grasp of the basics, I would recommend going through Khan Academy’s lecture series for Algebra 1, Algebra 2, and AP Statistics.

Algebra 1 | Math | Khan Academy

Algebra 2 | Math | Khan Academy

AP®︎ Statistics | College Statistics | Khan Academy

If you would like a challenge, Khan Academy’s Linear Algebra series is a fantastic way to expand your skills, as is Professor Leonard’s lecture series on Calculus 1 on YouTube.

Linear Algebra | Khan Academy

All of these resources share mathematical knowledge in pretty painless ways, which lets you zip through the math portion of becoming a data analyst and get to the good stuff: data analysis and visualization.

Step 3: Study data analysis and visualization

It’s time to tie it all together and analyze some data. This involves learning how to ask the right questions, collect the data you will need, clean the data, analyze the data for insights and answers to your questions, and interpret the data through a visualization that can be easily understood by anyone.
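
As a rough sketch of what that workflow might look like in code, here’s an example using pandas and matplotlib. The file name and column names (a hypothetical sales.csv with date, region, and revenue columns) are assumptions for illustration, but the load-clean-summarize-visualize rhythm is the core of most analyses:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset with 'date', 'region', and 'revenue' columns.
df = pd.read_csv("sales.csv")

# Clean: drop rows missing revenue and make sure dates are real dates.
df = df.dropna(subset=["revenue"])
df["date"] = pd.to_datetime(df["date"])

# Analyze: total revenue per region, biggest first.
summary = df.groupby("region")["revenue"].sum().sort_values(ascending=False)

# Visualize: a chart anyone can read at a glance.
summary.plot(kind="bar", title="Revenue by region")
plt.tight_layout()
plt.show()
```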

To study data analysis and visualization, I believe in spending less time working through online courses and more time doing the actual thing. For example, the video I linked above is four hours long – that’s about all the time you should spend learning how to do a data analysis before creating one of your own. You will learn more about data analysis by working on your own project than by sitting in front of your computer following along with a video.

From experience, this is the only video you need to watch to learn how to analyze data – there isn’t much more to learn beyond this. You can expand your knowledge by learning how to use a particular visualization tool or you can learn about how to make your statistics more accurate, but these are areas of professional development – not core learning.


Final thoughts

See? That wasn’t so painful (read: full of math).

In fact, math probably plays the smallest role in the learning process, especially if you already know some of it. The best part of becoming a data analyst is that you don’t really need to know how the math works (the way you would as a data scientist) – you just need to know that it works and when to use it.

When I was teaching myself data analysis, I found that I spent the least amount of time learning math. Most of my time went to learning how to code and working on data analysis projects to fine-tune my skills. So, don’t let your hatred of math steer you away from what is essentially one tiny aspect of the entire job. Once you realize that the math you’re using is extremely practical, you learn to enjoy using it to uncover details and answers to your questions. As long as you’re comfortable working with numbers in a practical sense, enjoy a challenge, and can tell a good story with data, you can become a data analyst with no issues.


Subscribe to get my stories sent directly to your inbox: Story Subscription

Please become a member to get unlimited access to Medium using my referral link (I will receive a small commission at no extra cost to you): Medium Membership

Support my writing by donating to fund the creation of more stories like this one: Donate

The post The Data Analyst Learning Roadmap for People Who Hate Math appeared first on Towards Data Science.

]]>