import pandas as pd
from sqlalchemy import create_engine, text
#maximum number of rows to display
= 10
pd.options.display.max_rows
=create_engine('sqlite://')
engine= pd.read_csv('customer.csv').to_sql('customer', engine, if_exists='replace', index=False)
df = pd.read_csv('agent.csv').to_sql('agent', engine, if_exists='replace', index=False)
df = pd.read_csv('call.csv').to_sql('call', engine, if_exists='replace', index=False)
df
# function that allows us to run queries
def runQuery(sql):
= engine.connect().execute((text(sql)))
result # print(type(result))
return pd.DataFrame(result.fetchall(), columns=result.keys())
13 SQL
13.1 Why databases?
While we have been dealing with data sitting in CSV files so far, no serious data organization runs their operations off of CSV files on a single person’s computer. This practice presents all sorts of hazards, including but not limited to:
- Destruction of that single device
- Destruction of the files on that device
- Inability to connect to that person’s device from another device that requires the data
- Inability to store more than a limited amount of data (since a single device doesn’t have that much memory)
Therefore, our data should be stored elsewhere if we want to reliably access it in the future and, more importantly, share it and work on it with others. The database is the classic location where modern organizations have chosen to store their data for professional use. Databases have been a topic of research since the late 1960s. Many technology vendors picked up on this and developed databases software for companies to consume. Some of these vendors and products are:
- Microsoft, initially with Microsoft Access and more recently with Microsoft SQL Server
- Oracle, with their Oracle database and MySQL (a popular open source database)
- The “PostgreSQL Global Development Group”, with the open-source PostgreSQL
These databases all implement the standard SQL language and are thus fairly similar to each other in terms of features. However, there are some key differences. Comparing MySQL
vs. PostgreSQL
, the two most popular database systems, MySQL does not implement FULL JOINS
(you will learn about JOIN
s later on). PostgreSQL also supports some more advanced aggregation functions for statistics (you will learn about these soon). For example, in PostgreSQL you can perform regressions directly on the data before retrieving it, whereas MySQL only supports basic stats operations. However, this overhead leads to a slight performance hit, making MySQL faster for simple retrieval tasks.
13.2 Types of databases
At this point, you might believe that databases can be thought of as a collection of data. This is true, but unfortunately it is not that simple. Data cannot simply be thrown in a database the same way you throw your socks in your sock drawer. Depending on your needs for the data, you will choose between one of two main types of databases.
13.2.1 Relational databases
The most common database type is called a relational database, and the systems that manage these kinds of databases are called Relational Database Management Systems (RDBMS). Relational databases date back to the early 1970s and can be considered the first type of database ever conceived. Continuing with our sock example, this drawer would have many identical slots, each for one pair of socks. The socks may be of different materials, colors, brands, etc., but they need to fit into a slot.
Relational databases deal with “relational data”, which is a fancy way of saying “tabular” data. This kind of dataset consists of rows and columns (i.e. tables) where each row corresponds to an observation and each column corresponds to an attribute of that observation. For example if we were keeping track of our friends and their phone numbers, each row on the file (or table) represents one friend and each column represents the information we want to track about that friend (name and phone number). The cell on the intersection of the row and column contains the actual data. Relational data is manipulated using a specific language called SQL (Structured Query Language), which we will learn about soon.
A simple way to conceptualize a table inside a relational database is as a CSV file “copied” to the database. In fact, many databases offer that possibility (assuming your file is correctly formatted, of course).
13.2.2 NoSQL databases
Around 20 years ago, with the advent of the Internet and the necessity to store and process unstructured data (i.e. data that does not fit well in the row-by-column paradigm), developers started to discuss another type of database, which eventually ended up being referred to as a NoSQL database. These databases are not relational and are also built with more “relaxed” rules compared to their predecessors. NoSQL databases are more like a big drawer without slots and not exclusively for socks. You may choose to use this drawer primarily for socks of all sizes - small ones, big ones, maybe even a loose sock by itself - but it could also contain other items like sweaters or pants.
As the name implies, NoSQL databases do not rely on SQL, although many of them do allow you to use SQL to interface with them. At its core, a NoSQL database is simply a key-value store. That is, everything you store (sometimes called a document) in this database has a key associated with it. The database’s job is simply to help you retrieve your desired document as quickly as possible. Nothing pre-determines what a document contains (i.e. it does not have a concept of “tables”); however, this flexibility comes at a price. When you retrieve a document, you have to perform extra checks on it to ensure its validity as the database will not automatically do this for you as it would with a relational database. This may or may not be desireable depending on your particular application.
13.2.3 When to pick one over the other
Picking between RDBMS vs. NoSQL really comes down to the requirements of your project. We touched on this above, but both systems prioritize different parts of the CAP Theorem. Simply put, the CAP Theorem says that a database system can’t have all three of the following:
- Consistency: Every read of the database will return the most up-to-date write version or an error
- Availability: Every request receives a (non-error) response, without the guarantee that it contains the most up-to-date write version
- Partition Tolerance: The system continues to operate no matter the network quality between nodes
NoSQL prefers to be partition tolerant over consistent, whereas, RDBMS is the opposite. In certain applications, consistency is imperative, which often forces you into using RDBMS. For example, if you are a bank and you query a customer’s balance, you want to guarantee that the number you get is the most recent one and not the one from yesterday.
For the remainder of the case, we shall only consider RDBMS systems.
13.3 What is this “SQL” thing?
So we’ve been dropping in references to SQL throughout, yet we haven’t explained what it is. Now we will! Just like data can’t really survive without a database, a database can’t be utilized without SQL. SQL is used for a wide variety of tasks, including but not limited to extracting data, creating the internal structure of a database (in the form of tables), and reading and writing data to these tables. SQL is an international standard published by the ISO and so it is the de facto language that all database systems adhere to.
In this course, we will be writing SQL queries using the SQLAlchemy
package in Python. This allows you to directly interface with relational databases without exiting the Python environment, while using syntax that is identical to what you would write outside of Python.
Run the code below to set up this framework:
13.4 SELECT statements
The most important thing you will ever do in SQL is extract a subset of the data from a SQL table based on a set of rules. This is accomplished using the SELECT
statement and the following syntax:
SELECT list of columns
FROM which dataframe?
WHERE clause
To translate the above diagram into words:
- Start with the keyword
SELECT
- Follow with the names of the columns you want to select, separated by commas (alternatively, you can use the
*
symbol to indicate you wish to select all columns) - Follow with the keyword
FROM
- Finish with the name of the table you wish to select data from
- Optionally, you can use the
WHERE
clause to only return results which satisfy certain conditions (similar to how code within Pythonif...then
blocks only execute if the associated conditions are true)
Example:
SELECT CustomerID, Name
FROM Customer
WHERE Occupation != 'Unemployed'
Running in python:
= """SELECT CustomerID, Name
query1 FROM Customer
WHERE Occupation != 'Unemployed'"""
runQuery(query1)
CustomerID | Name | |
---|---|---|
0 | 1 | Michael Gonzalez |
1 | 2 | Amanda Wilson |
2 | 3 | Robert Thomas |
3 | 4 | Eddie Hall |
4 | 6 | Maria Johnson |
... | ... | ... |
755 | 994 | Ruben Steele |
756 | 995 | Ashley Young |
757 | 996 | Mr. Steven Smith |
758 | 997 | Mark Smith |
759 | 999 | Karen Barber |
760 rows × 2 columns
We can define a multiline string in Python using triple quotes: """ my_multiline string """
. This is used to write out the queries in this course. The triple quotes are not needed in regular SQL.
13.5 ORDER BY statements
SQL allows us to have the returned data ordered, using the ORDER BY
statement. The statement ORDER BY
should be followed by a comma-separated list of columns on which you want to order your results (columns that come first take priority in the subsequent ordering). Optionally, you can then append the keyword ASC
or DESC
(short for ascending and descending, respectively) after each column to determine the ordering type (e.g. alphabetical or reverse-alphabetical for a string column).
Example:
SELECT CustomerID
FROM Customer
WHERE Occupation != 'Unemployed'
ORDER BY Name
SELECT CustomerID
FROM Customer
WHERE Occupation != 'Unemployed'
ORDER BY Name ASC
13.6 AS statements
The AS
statment changes the name of a column returned by your query. However, this change is only temporary and is only valid for that particular query. For example, we can rename the Name
column to CustomerName
and order it alphabetically. This operation is known as aliasing.
SELECT CustomerID, Name AS CustomerName
FROM Customer
WHERE Occupation != 'Unemployed'
ORDER BY CustomerName
In Python:
= """SELECT CustomerID, Name AS CustomerName
query2 FROM Customer
WHERE Occupation != 'Unemployed'
ORDER BY CustomerName"""
runQuery(query2)
CustomerID | CustomerName | |
---|---|---|
0 | 900 | Aaron Gutierrez |
1 | 622 | Aaron Rose |
2 | 226 | Adam Ward |
3 | 786 | Alan Chambers |
4 | 985 | Alan Mitchell |
... | ... | ... |
755 | 699 | Willie Greene |
756 | 715 | Yesenia Wright |
757 | 952 | Yolanda White |
758 | 421 | Zachary Ruiz |
759 | 392 | Zachary Wilson |
760 rows × 2 columns
13.7 DISTINCT statements
If you only want unique elements from a data table, you’ll need to write the keyword DISTINCT
immediately after SELECT
in your query.
Example:
SELECT DISTINCT Occupation
FROM Customer
ORDER BY Occupation
13.8 LIKE
The LIKE
operator in SQL is used to search for a specified pattern in a column. It is often used with the WHERE
clause to filter records based on partial matching patterns.
13.8.1 Basic Syntax
The basic syntax for the LIKE
operator is:
SELECT column1, column2, ...
FROM table_name
WHERE column LIKE pattern;
13.8.2 Wildcards
The LIKE
operator is often used with two wildcards:
- %: Represents zero or more characters.
- **_**: Represents a single character.
13.8.3 Examples
13.8.3.1 Using the % Wildcard
To find all records where a column (e.g., name
) starts with a specific letter (e.g., ‘J’):
SELECT *
FROM students
WHERE name LIKE 'J%';
This query returns all students whose names start with ‘J’, such as ‘John’, ‘Jane’, ‘Jack’, etc.
To find all records where the column ends with a specific letter (e.g., ‘n’):
SELECT *
FROM students
WHERE name LIKE '%n';
This query returns all students whose names end with ‘n’, such as ‘John’, ‘Megan’, etc.
To find all records where the column contains a specific substring (e.g., ‘an’):
SELECT *
FROM students
WHERE name LIKE '%an%';
This query returns all students whose names contain ‘an’, such as ‘Megan’, ‘Andrew’, etc.
13.8.3.2 Using the _ Wildcard
To find all records where a column (e.g., name
) has a specific character at a specific position (e.g., second character is ‘a’):
SELECT *
FROM students
WHERE name LIKE '_a%';
This query returns all students whose names have ‘a’ as the second character, such as ‘James’, ‘Sarah’, etc.
13.8.3.3 Combining Wildcards
You can combine %
and _
to form more complex patterns. For example, to find all names that start with ‘J’, have any two characters, and end with ‘n’:
SELECT *
FROM students
WHERE name LIKE 'J__n';
This query returns names like ‘John’ and ‘Jian’.
13.8.4 Case Sensitivity
- In most databases,
LIKE
is case-insensitive by default (e.g., MySQL). - In some databases (e.g., PostgreSQL),
LIKE
is case-sensitive. UseILIKE
for case-insensitive matching.
13.8.5 Using ESCAPE Clause
If you need to search for characters like %
or _
, which are normally wildcards, you can use the ESCAPE
clause:
SELECT *
FROM students
WHERE name LIKE '%\%%' ESCAPE '\';
This query finds all names that contain the %
character.
13.9 CASE statements
A CASE statement in SQL is used to create conditional logic within your queries. It’s similar to an if-then-else statement in programming languages. You can use CASE to perform different actions based on various conditions.
13.9.1 Basic Syntax
The basic syntax for a CASE statement in SQL is:
CASE
WHEN condition1 THEN result1
WHEN condition2 THEN result2
...
ELSE resultN
END
13.9.2 Example Usage
Here’s an example to illustrate its use:
Suppose you have a students
table with columns student_id
, name
, and score
. You want to categorize the students based on their scores into “Pass” or “Fail”.
SELECT
student_id,
name,
score,CASE
WHEN score >= 50 THEN 'Pass'
ELSE 'Fail'
END AS result
FROM
students;
In this example: - The CASE
statement checks each student’s score. - If the score is 50 or more, it returns ‘Pass’. - If the score is less than 50, it returns ‘Fail’.
13.9.3 Nested CASE Statements
You can also nest CASE statements for more complex conditions:
SELECT
student_id,
name,
score,CASE
WHEN score >= 90 THEN 'A'
WHEN score >= 80 THEN 'B'
WHEN score >= 70 THEN 'C'
WHEN score >= 60 THEN 'D'
ELSE 'F'
END AS grade
FROM
students;
In this example: - The CASE
statement categorizes scores into letter grades.
13.9.4 Using CASE in WHERE Clause
You can use CASE statements in WHERE clauses, but it’s less common. Here’s an example:
SELECT
student_id,
name,
scoreFROM
studentsWHERE
CASE
WHEN score >= 50 THEN 'Pass'
ELSE 'Fail'
END = 'Pass';
13.10 JOIN
In SQL, the JOIN
clause is used to combine rows from two or more tables based on a related column between them. There are several types of joins, each serving a different purpose.
13.10.1 Types of JOINs
- INNER JOIN: Returns only the rows that have matching values in both tables.
- LEFT JOIN (or LEFT OUTER JOIN): Returns all rows from the left table, and the matched rows from the right table. The result is NULL from the right side if there is no match.
- RIGHT JOIN (or RIGHT OUTER JOIN): Returns all rows from the right table, and the matched rows from the left table. The result is NULL from the left side if there is no match.
- FULL JOIN (or FULL OUTER JOIN): Returns all rows when there is a match in either left or right table. Rows without a match in the other table will have NULLs for the columns from that table.
- CROSS JOIN: Returns the Cartesian product of the two tables, meaning it returns all possible combinations of rows from the tables.
- SELF JOIN: A regular join, but the table is joined with itself.
13.10.2 Examples
13.10.2.1 INNER JOIN
This join returns only the rows where there is a match in both tables.
SELECT students.student_id, students.name, scores.score
FROM students
INNER JOIN scores ON students.student_id = scores.student_id;
13.10.2.2 LEFT JOIN
This join returns all rows from the left table, and the matched rows from the right table. If no match is found, NULLs are returned for columns from the right table.
SELECT students.student_id, students.name, scores.score
FROM students
LEFT JOIN scores ON students.student_id = scores.student_id;
13.10.2.3 RIGHT JOIN
This join returns all rows from the right table, and the matched rows from the left table. If no match is found, NULLs are returned for columns from the left table.
SELECT students.student_id, students.name, scores.score
FROM students
RIGHT JOIN scores ON students.student_id = scores.student_id;
13.10.2.4 FULL JOIN
This join returns all rows when there is a match in either left or right table. Rows without a match in the other table will have NULLs for the columns from that table.
SELECT students.student_id, students.name, scores.score
FROM students
FULL JOIN scores ON students.student_id = scores.student_id;
13.10.2.5 CROSS JOIN
This join returns the Cartesian product of the two tables, meaning it returns all possible combinations of rows from the tables.
SELECT students.student_id, students.name, scores.score
FROM students
CROSS JOIN scores;
13.10.2.6 SELF JOIN
This join is used to join a table with itself, often to compare rows within the same table.
SELECT A.student_id, A.name AS student_name, B.name AS buddy_name
FROM students A
INNER JOIN students B ON A.buddy_id = B.student_id;
13.11 Aggregate functions
Aggregate functions in SQL perform calculations on a set of values and return a single value. The following are the most commonly used SQL aggregate functions:
AVG()
– calculates the average of a set of valuesCOUNT()
– counts rows in a specified table or viewMIN()
– gets the minimum value in a set of valuesMAX()
– gets the maximum value in a set of valuesSUM()
– calculates the sum of values
PostgreSQL as some more advanced aggregate functions. Specifically, they have some nice ones for statistics. For example,
regr_intercept(Y, X)
- Returns the intercept for the line of best fitregr_slope(Y, X)
- Returns the slope of the line of best fitcorr(Y, X)
- Returns the correlation between two columns
Here are brief explanations of the previously specified aggregate functions:
13.11.1 1. AVG()
- Description: Calculates the average of a set of values.
- Example: To find the average score of students:
SELECT AVG(score) AS average_score
FROM students;
13.11.2 2. COUNT()
- Description: Counts the number of rows in a specified table or view.
- Example: To count the number of students:
SELECT COUNT(*) AS number_of_students
FROM students;
13.11.3 3. MIN()
- Description: Gets the minimum value in a set of values.
- Example: To find the minimum score:
SELECT MIN(score) AS minimum_score
FROM students;
13.11.4 4. MAX()
- Description: Gets the maximum value in a set of values.
- Example: To find the maximum score:
SELECT MAX(score) AS maximum_score
FROM students;
13.11.5 5. SUM()
- Description: Calculates the sum of values.
- Example: To find the total score of all students:
SELECT SUM(score) AS total_score
FROM students;
13.12 SQL statement types
We have introduced SQL’s Data Manipulation Language (DML) statements; that is, statements that are used to read or write (manipulate) data from the database. However, SQL also has the ability to create, modify, and remove database objects themselves as well as the data within them. It does this by using Data Definition Language (DDL) statements which are commands that define the different structures in a database. You will learn more about these statements in future cases.
There are two other types of SQL statements that are important, but less likely to be used by someone who is merely focused on analyzing data. We’ll not not dig into these as it’s very unlikely that you’ll have to deal with these anytime soon, but you are free to read up about them elsewhere if you are interested. They are:
Data Control Language (DCL): These determine who has permission to do what in the database. Everytime you log in to a database, you do that using (your) database user account. By default, a user after being created does not have permission to do anything, so someone (normally a database administrator (DBA)) needs to grant permission to that user to perform certain operations on the database.
Transactional Control Language (TCL): These commands are used to guarantee that full units of work are either completed as a whole or not at all. An example is a bank transfer: you need to ensure that if money has been withdrawn from account A, then it has also been deposited in account B, which requires wrapping these two commands into a transaction.