The world’s leading publication for data science, AI, and ML professionals.

4 SQL Filtering Concepts to improve your Data Wrangling Game

Let's learn about some common SQL filtering tricks data scientists can use to speed up their queries

Photo by Jaffer Nizami @unsplash.com
Photo by Jaffer Nizami @unsplash.com

As Data scientists, most of us are used to filter tables using pandas (on Python) __ or dplyr (on R). Indeed, there are certain wrangling operations that are more convenient to do in both libraries and in the context of the languages we use to code our algorithms.

But, if your data source is sitting on a database somewhere (and you are probably accessing it using some SQL interface), some filters and table operations would be more efficient using SQL code and performing those operations on the server side.

Filtering tables is one of the most common operations one does when working with SQL. On the surface, they seem pretty straightforward, you apply a WHERE clause and that’s it! But.. filters have more than meets the eye and can get complex quite fast.

Learning how to apply the different filtering methods and leverage theWHEREclause with IN, AND, OR, precedence and wildcards will improve your SQL game enormously. In this post, we’ll explore those extra spices that can be applied when filtering rows from a table. By the end of the post, you should be able to understand most complex WHERE statements in the majority of SQL queries out there.

I’ll use a sample table that contains data for students in a school – this table is completely made up and all names and information are fictional. I’m creating this table inside a sandbox database:

create table sandbox.students (
 student_id integer auto_increment primary key,
    student_full_name varchar(30),
    favorite_classes varchar(150),
    student_age integer,
    birth_date date,
    degree varchar(30)
    );

insert into sandbox.students (
 student_full_name, favorite_classes, student_age, birth_date, degree
) values ('John Smith', 'Mathematics, Political Science, Biology', 23, '1999-03-01', 'Engineering'),
 ('Amy Smith', 'History, Political Science, Biology', 23, '1999-03-01', 'History'),
    ('Joe Adams', 'History, Mathematics', 23, '1999-06-01', 'History'),
    ('Joe Williams', 'Biology, Chemistry', 22, '2000-03-01', 'Chemistry'),
    ('Anne Williams', 'Mathematics, Arts', 22, '2000-03-16', 'Mathematics'),
    ('May Taylor', 'History, Geography', 22, '2000-05-19', 'History'),
    ('Zoe Brown', 'Physical Education, History', 21, '2001-02-18', 'Physical Education'),
    ('Jennifer Davies', 'Biology, Chemistry', 21, '2001-03-19', 'Chemistry'),
    ('Robert Jones', 'Chemistry, Biology, Mathematics', 21, '2001-06-02', 'Chemistry');

Our students table contains data of 9 college students – although a low sample, this should be enough to understand the main difference between the filtering concepts we will explore.

Let’s improve our Sql game!


AND / OR

The difference between these two clauses – AND , OR – is ultra relevant in filters and takes into account logical operations commonly used in other programming languages.

Imagine that, from our students table, we wanted to subset all the students from the "Chemistry" major. That would be easy with a standard WHEREclause, like:

select * from sandbox.students
where degree = 'Chemistry'

This will output all the Chemistry students, 3 in this case:

Return Data— Chemistry Students - Image by Author
Return Data— Chemistry Students – Image by Author

But, what if we just wanted to subset the students that are 21 years Chemistry students? We can do that with an AND clause:

select * from sandbox.students
where degree = 'Chemistry' AND student_age = 21;
Return Data -21 Years Old Chemistry Students - Image by Author
Return Data -21 Years Old Chemistry Students – Image by Author

The AND clause enables us to select both conditions and only the rows that match exactly both conditions. In this case, we only select degree = 'Chemistry' ANDstudent_age = 21 – both conditions must be matched by the row to be returned in the output set.

Naturally, a question arises. What if we wanted to match only one of the conditions? Meaning we want to select rows that match the first condition OR the second condition – this is also super easy in SQL, using the OR clause:

select * from sandbox.students
where degree = 'Chemistry' OR student_age = 21;
Return Data - Chemistry or 21 Years Old Students - Image by Author
Return Data – Chemistry or 21 Years Old Students – Image by Author

Notice that we have 4 students in this result – this happened because we are choosing students that have degree = 'Chemistry' OR student_age = 21 . In this case, a student can match only one of the conditions in the WHERE clause.

That’s why we have a Physical Education student in there, that student has 21 years old and matches that condition only.

This is one of the most simple and effective concepts in SQL filters. Nevertheless, you’ll notice that when you start to play around with multiple AND and OR clauses, things can become quite messy. And that’s where filtering precedences enter the picture.


Operator Precedence

SQL enables the combination of multiple ANDand OR conditions in the same WHERE. This is really cool but also upgrades the complexity of our queries.

Luckily, by learning about filtering precedence, we are able to navigate the logic behind more complex filters. Operator precedence logic mimic what we know from mathematical equations – for instance, imagine the following SQL query with our students table:

select * from sandbox.students
where degree = 'Chemistry' OR student_age = 23 AND favorite_classes = 'History, Mathematics';

What will be the output of this query?

If you read the query directly, do you think SQL will select all the students that have the Chemistry Degree or are 23 years old but at the same time they have their favorite classes being History and Mathematics?

Wow, this sentence was confusing. Let’s break it apart.

Extremely important – AND has priority over OR ! This means that the clauses connected with AND will be tied together as a single clause – so we are searching for:

  • Students that are 23 years old AND their favorite classes are "History, Mathematics";

Because of the AND these two clauses are treated simultaneously, let’s call this clause 1.

Ok, so where will the degree = 'Chemistry' enter the picture? Well, this clause is the other clause that we are including in the search:

  • Every student that has degree = 'Chemistry'

Let’s call this Clause 2.

Resuming, we are searching for all students that match Clause 1 OR Clause 2. Meaning that a student that is not 23 years old and doesn’t have its favorite classes as History and Mathematics can still be selected in our query if its favorite degree is Chemistry.

Return Data - Chemistry Students + 23 Years Old that like History and Mathematics - Image by Author
Return Data – Chemistry Students + 23 Years Old that like History and Mathematics – Image by Author

Bottom line, first commandment of SQL filtering is: AND has a priority over OR . Second commandment is that we can override these rules using parenthesis, just like we would do in mathematical equations:

select * from sandbox.students
where (degree = 'Chemistry' OR student_age = 23) AND favorite_classes = 'History, Mathematics';

In this case, the operations are a bit different – clause 1 will be:

  • Students that have a degree in Chemistry or are 23 years old.

And clause 2 is:

  • Students with favorite Classes History and Mathematics.

In this case we want: "Every student with class History and Mathematics that is either in Chemistry or has 23 years old. In this case, only one student matches the condition:

Return Data - Students that like History and Mathematics and that are either 23 years old or are in Chemistry - Image by Author
Return Data – Students that like History and Mathematics and that are either 23 years old or are in Chemistry – Image by Author

Notice how a parenthesis changed the whole result of the query! Operator precedence are really important and completely change the result of your return data. In summary:

  • AND has priority over OR
  • Parenthesis change the precedence and artificially create a priority for the bit of code inside the parenthesis.

Nice! Let’s continue with another concept that reduces a significant portion of our code with multiple OR conditions.


IN / NOT IN

IN and NOT IN operators are really cool ways to write conditions where you need to match multiple elements at the same time. For instance, imagine that I would ask you to filter the following students from the table: "Joe Adams" , "John Smith" and "Amy Adams".

Knowing the OR condition, you could immediately write the following query:

select * from sandbox.students
where student_full_name = 'Joe Adams' or student_full_name = 'John Smith' or student_full_name = 'Amy Smith';
Return Data -John Smith, Amy Smith or Joe Adams - Image by Author
Return Data -John Smith, Amy Smith or Joe Adams – Image by Author

This seems a reasonable result but.. imagine you would like to subset twenty names to select? Do you really want to write the OR condition 20 times? I think not!

Luckily, we have the IN operator to will help us with that!

select * from sandbox.students
where student_full_name IN ('Joe Adams','John Smith','Amy Smith');

The IN operator acts as a multiple OR condition but written in a more elegant way. Inside the IN , we can add multiple elements to match without repeating the "or student_full_name =" multiple times.

The NOT IN works exactly in the same way but it excludes the elements inside the parenthesis:

select * from sandbox.students
where student_full_name NOT IN ('Joe Adams','John Smith','Amy Smith');

This will select all the students except the ones inside the IN clause:

Return Data - John Smith, Amy Smith or Joe Adams - Image by Author
Return Data – John Smith, Amy Smith or Joe Adams – Image by Author

The IN clause is super cool and handy – we can use it to avoid writing really long SQL statements.


Wildcards

Our last SQL filtering concept is the wildcard. Wildcard are super useful to filter based on a "subset" condition.

For instance, imagine that we would like to select all students with "Smith" as last name – we could, of course, split the column to do that but, that would be extra code that is not needed if we know how to apply wildcards.

Wildcards are centered around three different elements:

  • The LIKE clause;
  • The % character;
  • The _ character;

Starting with our example – selecting all the students with last name "Smith":

select * from sandbox.students
where student_full_name LIKE '%Smith';

In this wildcard selection, the % works as "does not matter what’s before Smith". Translating this query into a sentence we are selecting "every row where the full name ends with "Smith", regardless of what’s before that".

Return Data -All Students with Surname "Smith" - Image by Author
Return Data -All Students with Surname "Smith" – Image by Author

What if we want to select all the "Joes" from our table? That’s easy! We just put the wildcard % character to the end of the element we are searching:

select * from sandbox.students
where student_full_name LIKE 'Joe%';
Return Data - All Students with First Name "Joe" - Image by Author
Return Data – All Students with First Name "Joe" – Image by Author

Cool! So the % always works in this way – it makes our query agnostic to what’s before or after the string we are searching. We can also use a double sided % :

select * from sandbox.students
where student_full_name LIKE '%am%';
Return Data - All Students with "am" somewhere in their name - Image by Author
Return Data – All Students with "am" somewhere in their name – Image by Author

This last query will search for all students that contain an "am" in their name somewhere. Amy Smith, Joe Adams, Joe Williams and Anne Williams. The double sided % doesn’t care of what comes before or after the string we are searching, simultaneously.

Finally, let’s check the _ character:

select * from sandbox.students
where student_full_name LIKE '_o%';

The _ specifically points to a position where we are searching for a specific letter or string. For instance, here we are searching for all the names that contain an "o" in the second position, this will yield:

Return Data - All Students with an "o" as second letter of their name— Image by Author
Return Data – All Students with an "o" as second letter of their name— Image by Author

All these students have an "o" as the second letter of their letter. What if we only want the students with an "e" on the third letter of their name? That’s also easy with the _ character!

select * from sandbox.students
where student_full_name LIKE '__e%';

As we now wanted the third position of the "name" – we added two underscores: __ . This query yields:

Return Data - All Students with an "e" as third letter of their name - Image by Author
Return Data – All Students with an "e" as third letter of their name – Image by Author

Wildcards are really cool filtering elements! Mastering wildcards will simplify your queries and make them really efficient.


And we’re done! Thank you for taking the time to read this post. SQL is a really cool language to pump up our data wrangling game.

As a data scientist, there’s a really high chance that you will end up working with data stored in databases – grasping some of these concepts will improve your productivity and code efficiency.

I’ve set up a course on learning SQL from Scratch on Udemy where I’ve included a deep dive into these concepts and more – the course is suitable for absolute beginners and I would love to have you around!

SQL For Absolute Beginners - Image by Author
SQL For Absolute Beginners – Image by Author

Join Medium with my referral link – Ivo Bernardo


Related Articles