1  Python basics

1.1 Python and Jupyter

Python is a general purpose programming language that allows for both simple and complex data analysis. Python is incredibly versatile, allowing analysts, consultants, engineers, and managers to obtain and analyze information for insightful decision-making.

The Jupyter Notebook is an open-source web application that allows for Python code development. Jupyter further allows for inline plotting and provides useful mechanisms for viewing data that make it an excellent resource for a variety of projects and disciplines.

The following section will outline how to install and begin working with Python and Juypter.

1.2 Setting up the Python Environment locally (optional)

Instruction guides for Windows and MacOS are included below. Follow the one that corresponds with your operating system.

1.2.1 Windows Install

  1. Open your browser and go here

  2. Click on your OS and then “Download”

  3. Run the downloaded file found in the downloads section from Step 2

  4. Click through the install prompts

  5. Go to menu (or Windows Explorer), find the Anaconda3 folder, and double-click to run OR Use a Spotlight Search and type “Navigator”, select and run the Anaconda Navigator program. Note that MacOS also comes with Python pre-installed, but you should not use this version, which is used by the system. Anaconda will run a second installation of Python and you should ensure that you only use this second installation for all tasks.

1.2.2 Compare and contrast Jupyter, Python and Anaconda

  • Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.
  • Python is a programming language that is often used in scientific computing, data science, and general-purpose programming.
  • Anaconda is a distribution of Python and R for scientific computing and data science. It includes the conda package manager, which makes it easy to install packages for scientific computing and data science, as well as the Jupyter Notebook and other tools.
  • In simple terms Anaconda is a distribution and python is a language, Jupyter notebook is an application to create and share document that contains live code, equations, visualizations and narrative text.

1.3 An alternative to Jupyter is VSCode

VSCode is a very popular tool for programming in a variety of languages. In can be used in place of Jupyter. It can be downloaded from here. After installation, follow this guide to set up Jupyter in VSCode. You will also need the Python extension and the Jupyter extension.

1.4 File Management with Python and Jupyter

It is common practice to have a main folder where all projects will be located (e.g. “jupyter_research”). The following are guidelines you can use for Python projects to help keep your code organized and accessible:

  1. Create subfolders for each Jupyter-related project
  2. Group related .ipynb (the file format for Jupyter Notebook) files together in the same folder
  3. Create a “Data” folder within individual project folders if using a large number of related data files

You should now be set up and ready to begin coding in Python!

1.5 Fundamentals of Python

In this case, we will introduce you to more basic Python concepts. If you prefer, you can first go through the more extensive official Python tutorial and/or the W3School Python tutorial and practice more fundamental concepts before proceeding with this case. It is highly recommended that you go through either or both of these tutorials either before or after going through this notebook to solidify and augment your understanding of Python. For a textbook introduction to Python, see this text, from which some of the following material is adapted/taken.

1.5.1 What is a program?

  • A program is a sequence of instructions that specifies how to perform a computation.
  • The computation might be something mathematical, such as solving a system of equations or finding the roots of a polynomial, but it can also be a symbolic computation, such as searching and replacing text in a document or something graphical, like processing an image or playing a video.
  • The details look different in different languages, but a few basic instructions appear in just about every language:
    • input: Get data from the keyboard, a file, the network, or some other device.
    • output: Display data on the screen, save it in a file, send it over the network, etc.
    • math: Perform basic mathematical operations like addition and multiplication.
    • conditional execution: Check for certain conditions and run the appropriate code.
    • repetition: Perform some action repeatedly, usually with some variation.

Every program you’ve ever used, no matter how complicated, is made up of instructions that look pretty much like these. So you can think of programming as the process of breaking a large, complex task into smaller and smaller subtasks until the subtasks are simple enough to be performed with one of these basic instructions.

Note that writing a program involves typing out the correct code, and then running, or executing that code. For example, the print() function allows you to print an object. Placing a python object in the round brackets and running the code makes the computer print the object.

print("Hello World")
Hello World

After printing, the simplest use of Python is as a calculator. We can use the operators +, -, /, and * perform addition, subtraction, division and multiplication. We can use ** for exponentiation. See the following examples:

print(40 + 2)
print(43 - 1)
print(6 * 7)
print(2**2)
42
42
42
4

Note that a Python program contains one or more lines of code, which are executed in a top-down fashion.

1.5.2 Values and types

  • A value is one of the basic things a program works with, like a letter or a number.
  • Some values we have seen so far are 2, 5.0, and ‘Hello, World!’.
  • These values belong to different types: 2 is an integer, 5.0 is a floating-point number, and ‘Hello, World!’ is a string, so-called because the letters it contains are strung together.

Use type() to determine the type of a value. Example:

print(type(2))
print(type(42.0))
print(type('Hello, World!'))
<class 'int'>
<class 'float'>
<class 'str'>

If you are used to using Microsoft Excel, this is similar to how Excel distinguishes between data types such as Text, Number, Currency, Date, or Scientific. As noted above, some common data types that you will come across in Python are:

  1. Integer type int: 1
  2. Float type float: 25.5
  3. String type str: 'Hello'
  • Here we see (1) integers and (2) floats store numeric data. The difference between the two is that floats store decimal variables (fractions), whereas the integer type can only store integer variables (whole numbers).
  • Finally, (3) is the string type. Strings are used to store textual data in Python (a string of one or more characters). Later in this case we will use string variables to store country names. They are often used to store identifiers such as names of people, city names, and more.

There are other data types available in Python; however, these are the three fundamental types that you will see across almost every Python program. Always keep in mind that every value, or object, in Python has a type and some of these “types” can be custom-defined by the user, which is one of the benefits of Python.

1.5.3 Variables

We can assign names to objects in python so that they are easy to manipulate, a named object is called a variable. Use the = sign to assign a name to a variable.

  • For example, if a user aims to store the integer 5 in an object named my_int, this can be accomplished by writing the Python statement, my_int = 5.
  • In this case, my_int is a variable, much like you might find in algebra, but the = sign is for assignment not equality .
  • So my_int = 5 should be taken to mean something more like Let my_int be equal to 5 rather than my_int is equal to 5.
my_int=5
print(my_int)
5

Here, my_int is an example of a variable because it can change value (we could later assign a different value to it) and it is of type Integer, known in Python simply as int. Unlike some other programming languages, Python guesses the type of the variable from the value that we assign to it, so we do not need to specify the type of the variable explicitly. For example,

  1. Integers, type int: my_int = 1
  2. Float type float: my_float = 25.5
  3. Strings, type str: my_string = 'Hello'

Note that the names my_int, my_float and my_string are arbitrary here. While it is useful to name your variables so that the names contain some information about what they are used for, this code would be functionally identical to if we had used x, xrtqp2 and my_very_long_variable_name, respectively.

my_int=5
print(my_int)
country="Canada"
print(country)
5
Canada

We mentioned before that variables can change value. Let’s take a look at how this works, and also introduce a few more Python operations:

x = 4
print(x)
y = 2
x = y + x
print(x)
4
6

Again, if you’re used to syntax from mathematics “x = y + x” might look very wrong at first glance. However, it simply means “throw out the existing value of x, and assign a new value which is calculated as the sum of y and x”. Here, we can see that the value of x changes, demonstrating why we call them “variables”.

We can also use more operators than just + and -. We use * for multiplication, / for division, and ** for power. The standard order of operations rules from mathematics also applies and parentheses () can be used to override these.

1.6 Data structures

A data structure is a data/value type which organizes and stores multiple values. These are more complicated data types that comprise many single pieces of data, organized in a specific way. Examples include dictionaries, arrays, lists, stacks, queues, trees, and graphs. Each type of data structure is designed to handle specific scenarios.

  • As before, we use =, the assignment operator, to assign a value to the variable.

1.6.1 Dictionaries

  • Python’s dictionary type stores a mapping of key-value pairs that allow users to quickly access information for a particular key.
  • By specifying a key, the user can return the value corresponding to the given key. Python’s syntax for dictionaries uses curly braces {}:

Syntax for creation:

user_dictionary = {'Key1': Value1, 'Key2': Value2, 'Key3': Value3}

Notes:

In Python, the dictionary type has built-in methods to access the dictionary keys and values.

  • These methods are called by typing .keys() or .values() after the dictionary object.
  • We will change the return type of calling .keys() and .values() to a list by using the list() method. Below when we print the unconverted keys, the first thing you see is dict_keys, indicating the type of the data. Convert it to a list which is a simpler and more common data type. We can do this by passing the data into the list(...) function.

Example:

# Creating a dictionary
person = {
    "name": "Alice",
    "age": 30,
    "city": "New York"
}

# Accessing values
print(person["name"])  # Output: Alice
print(person["age"])   # Output: 30
print(person["city"])  # Output: New York

# Adding a new key-value pair
person["job"] = "Engineer"

# Updating an existing value
person["age"] = 31

# Deleting a key-value pair
del person["city"]

# Printing the updated dictionary
print(person)

# Getting all keys
keys = person.keys()

# Converting to list
keys_list=list(keys)
print(keys)  # Output: dict_keys(['name', 'age', 'job'])
print(keys_list) # Output: type list

# Getting all values
values = person.values()
print(values)  # Output: dict_values(['Alice', 31, 'Engineer'])

# Converting to list
values_list=list(values)
print(values_list) # Output: type list


print(type(values_list))
print(type(list(values)))


# Creating a nested dictionary
person = {
    "name": "Alice",
    "age": 30,
    "address": {
        "city": "New York",
        "zipcode": "10001"
    }
}

# Accessing elements in a nested dictionary
print(person["address"]["city"])    # Output: New York
print(person["address"]["zipcode"]) # Output: 10001
Alice
30
New York
{'name': 'Alice', 'age': 31, 'job': 'Engineer'}
dict_keys(['name', 'age', 'job'])
['name', 'age', 'job']
dict_values(['Alice', 31, 'Engineer'])
['Alice', 31, 'Engineer']
<class 'list'>
<class 'list'>
New York
10001

1.6.2 Lists

A list is an incredibly useful data structure in Python that can store any number of Python objects. Lists are denoted by the use of square brackets []:

Syntax:

user_list = [Value1, Value2, Value3]

Example:

# Creating a list
fruits = ["apple", "banana", "cherry", "date"]


# Adding a new element
fruits.append("orange")

# Updating an existing element
fruits[1] = "blueberry"

# Deleting an element
del fruits[2]

# Printing the updated list
print(fruits)

# Getting the length
print(len(fruits))


fruits = ["apple", "banana", "cherry", "date"]




# Accessing elements
print(fruits[0])  # Output: apple
print(fruits[1])  # Output: banana
print(fruits[2])  # Output: cherry

# Accessing elements by negative index
print(fruits[-1])  # Output: date
print(fruits[-2])  # Output: cherry
print(fruits[-3])  # Output: banana
print(fruits[-4])  # Output: apple



# Accessing a range of elements
print(fruits[1:3])  # Output: ['banana', 'cherry']
print(fruits[:2])   # Output: ['apple', 'banana']
print(fruits[2:])   # Output: ['cherry', 'date']
print(fruits[:])    # Output: ['apple', 'banana', 'cherry', 'date']

# Creating a nested list
nested_list = [["apple", "banana"], ["cherry", "date"]]

# Accessing elements in a nested list
print(nested_list[0][0])  # Output: apple
print(nested_list[0][1])  # Output: banana
print(nested_list[1][0])  # Output: cherry
print(nested_list[1][1])  # Output: date
['apple', 'blueberry', 'date', 'orange']
4
apple
banana
cherry
date
cherry
banana
apple
['banana', 'cherry']
['apple', 'banana']
['cherry', 'date']
['apple', 'banana', 'cherry', 'date']
apple
banana
cherry
date

List computations example:

# Creating a list of numbers
numbers = [3.5, 1.2, 6.8, 2.4, 5.1]

# Finding the minimum value
min_value = min(numbers)
print("Minimum value:", min_value)  # Output: Minimum value: 1.2

# Finding the maximum value
max_value = max(numbers)
print("Maximum value:", max_value)  # Output: Maximum value: 6.8

# Summing all elements in the list
total_sum = sum(numbers)
print("Sum of all elements:", total_sum)  # Output: Sum of all elements: 19.0

# A more complicated example - this is a list comprehension - more on this later. 
# Rounding each element to the nearest integer
rounded_numbers = [round(num) for num in numbers]
print("Rounded numbers:", rounded_numbers)  # Output: Rounded numbers: [4, 1, 7, 2, 5]
Minimum value: 1.2
Maximum value: 6.8
Sum of all elements: 19.0
Rounded numbers: [4, 1, 7, 2, 5]

1.7 The in operator

The in operator in Python is used to check for the presence of an element within a collection, such as a list, tuple, set, or dictionary. (Tuples and sets are other data structures we have not learnt about yet.) When used with lists or other sequences, it checks if a specific value is contained in the sequence and returns True if it is, and False otherwise. When used with dictionaries, the in operator checks for the presence of a specified key. If the key exists in the dictionary, it returns True; otherwise, it returns False. This operator provides a simple and readable way to perform membership tests in various data structures.

Example:

# Creating a dictionary
person = {
    "name": "Alice",
    "age": 30,
    "city": "New York"
}

# Using the 'in' operator to check for a key
print("name" in person)  # Output: True
print("job" in person)   # Output: False

# Using the 'in' operator to check for a value
print("Alice" in person.values())  # Output: True
print("Engineer" in person.values())  # Output: False


# Creating a list
fruits = ["apple", "banana", "cherry"]

# Using the 'in' operator to check for an element
print("banana" in fruits)  # Output: True
print("orange" in fruits)  # Output: False
True
False
True
False
True
False

1.8 Comments and debugging

  • Comments are lines of code which begin with the # symbol. Nothing happens when you run these lines. Their purpose is to describe the code you have written, especially if it would be unclear to someone else reading it. You should be commenting your code as you go along. For example “these lines compute the average” or “These lines remove missing data” etc.

  • We have seen that programs may have errors or bugs. The process of resolving bugs is known as debugging. Debugging can be a very frustrating activity. Please be prepared for this. Following, this Python textbook, there are three kinds of errors you may encouter:

    • Syntax error: “Syntax” refers to the rules of the language. If there is a syntax error anywhere in your program, Python displays an error message and quits.
    • Runtime error: The second type of error is a runtime error, so called because the error does not appear until after the program has started running. These errors are also called exceptions because they usually indicate that something exceptional (and bad) has happened.
    • Semantic error: If there is a semantic error in your program, it will run without generating error messages, but it will not do the right thing. It will do something else. Specifically, it will do what you told it to do. Identifying semantic errors can be tricky because it requires you to work backward by looking at the output of the program and trying to figure out what it is doing. Try runnning your code one line at a time and ensuring each line is doing what you expect, using the print feature.

Debugging means to fix errors in your program or code. Debugging is a constant and necessary part of writing code, do not be surprised if you are spending most of your time debugging. In order to debug syntax and runtime errors, the first step is to read the error message.

For example if you run the following:

fruits=['apple','banana','orange']
fruits[3]

You will get the following error message:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
File 
      1 fruits=['apple','banana','orange']
----> 2 fruits[3]

IndexError: list index out of range"

You can see that there is an error pointing to the second line fruits[3]. That is the line where the program encountered an error. The error it encountered is displayed at the bottom. “IndexError: list index out of range.”

We can use the error for clues as to what could be going wrong. In this case, out of range means the index is too large. Inspecting the code further, you can see that Python indexing starts at 0, and so to access the third element of the list, we use the index 2.

If you do not understand the error message, you should Google search the error message to find out what it means. You can also use tools such as StackOverflow threads and ChatGPT to help you debug your code. When you do not know what is causing the error, it can be helpful to go through the code line by line, printing the output of each variable as you go, to see where the program is going wrong.

1.9 Control flow elements

1.9.1 For loops

One control flow element in Python is the for loop.

  • The for loop allows one to execute the same statements over and over again (i.e. looping).
  • This saves a significant amount of time coding repetitive tasks and aids in code readability.

Syntax:

for iterator_variable in some_sequence:
    statements(s)

The for loop iterates over some_sequence and performs statements(s) at each iteration.

  • That is, at each iteration the iterator_variable is updated to the next value in some_sequence.
  • As a concrete example, consider the loop:

Example:

for i in [1,2,3,4]:
    print(i*i)
1
4
9
16
  • Here, the for loop will print to the screen four times; that is it will print 1 on the first iteration of the loop, 4 on the second iteration, 9 on the third, and 16 on the fourth.
  • Hence, the for loop statement will iterate over all the elements of the list [1,2,3,4], and at each iteration it updates the iterator variable i to the next value in the list [1,2,3,4].
  • In for loops, it is an extremely good habit to choose an iterator variable that provides context rather than a random letter.
  • In this case, we will use both to get you accustomed to both.
  • This is because you will see both throughout the course of your data science career, but we encourage you to not use a generic name like i whenever possible for ease of communication.

1.9.2 List comprehensions

A list comprehension is a concise way to create a new list by applying an expression to each element of an existing list while optionally filtering elements based on a condition. It combines loops and conditional statements into a single line of code, making it efficient and readable for creating lists with specific transformations or filters. It can be used to replace a for loop with shorter code.

Example:

# Using a for loop
numbers = [1, 2, 3, 4, 5]
squared_numbers = []
for num in numbers:
    squared_numbers.append(num ** 2)
print("Squared numbers (using for loop):", squared_numbers)

# Using list comprehension
squared_numbers = [num ** 2 for num in numbers]
print("Squared numbers (using list comprehension):", squared_numbers)


# Using a for loop
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
even_numbers = []
for num in numbers:
    if num % 2 == 0:
        even_numbers.append(num)
print("Even numbers (using for loop):", even_numbers)

# Using list comprehension
even_numbers = [num for num in numbers if num % 2 == 0]
print("Even numbers (using list comprehension):", even_numbers)
Squared numbers (using for loop): [1, 4, 9, 16, 25]
Squared numbers (using list comprehension): [1, 4, 9, 16, 25]
Even numbers (using for loop): [2, 4, 6, 8, 10]
Even numbers (using list comprehension): [2, 4, 6, 8, 10]

1.9.3 If statements and booleans

A boolean is a data type that represents one of two possible states: True or False. Booleans are crucial for controlling the flow of programs, making decisions, and executing code based on specific conditions being met or not. Booleans are used extensively for making decisions and comparisons. They are often the result of logical operations, such as comparisons (e.g., greater than, less than) or boolean operations (e.g., and, or, not).

As stated above, a boolean object can take on two values:

  • True: Represents a condition that is considered true or valid.
  • False: Represents a condition that is considered false or invalid.

If statements are conditional statements that allow you to execute certain blocks of code based on specified conditions. They form the foundation of decision-making in code, enabling programs to make choices and take different actions depending on whether certain conditions are True or False.

Syntax:

if test_expression_1:
    block1_statement(s)
elif test_expression_2:
    block2_statement2(s)
else:
    block3_statement(s)

Example:

# Example 0: A boolean
x=2
y=7
print(True)
print(x==y)
print(x > 5)


# Example 1: Simple if statement
x = 10

if x > 5:
    print("x is greater than 5")

# Example 2: if-else statement
y = 3

if y % 2 == 0:
    print("y is even")
else:
    print("y is odd")

# Example 3: if-elif-else statement
grade = 75

if grade >= 90:
    print("Grade is A")
elif grade >= 80:
    print("Grade is B")
elif grade >= 70:
    print("Grade is C")
elif grade >= 60:
    print("Grade is D")
else:
    print("Grade is F")
True
False
False
x is greater than 5
y is odd
Grade is C

Explanation:

Example 0: Creates different types of boolean variables.

Example 1: Checks if x is greater than 5. If true, it prints “x is greater than 5”.

Example 2: Checks if y is even (remainder of division by 2 is zero). If true, it prints “y is even”; otherwise, it prints “y is odd”.

Example 3: Evaluates the value of grade and prints a corresponding grade based on the ranges specified using if, elif (else if), and else statements. This demonstrates chaining conditions to determine a grade based on numerical thresholds.

1.10 String formatting

String formatting refers to the various techniques used to insert dynamic values into strings in a controlled and formatted manner. We cover .format and f-strings.

1.10.1 f-strings (Formatted String Literals):

Syntax:

f"some text {expression1} more text {expression2} ..."
  • f prefix before the string indicates it’s an f-string.
  • {expression} inside curly braces {} evaluates expression and inserts its value into the string.
  • You can directly embed Python expressions, variables, or function calls inside {}.

1.10.2 .format() Method:

Syntax:

"some text {} more text {}".format(value1, value2)
  • {} acts as placeholders in the string.
  • .format() method is called on a string object, and values passed to it replace corresponding {} in the string.
  • You can specify the order of substitution using {0}, {1}, etc., or use named placeholders {name}, {age}.

Differences: - f-strings are more concise and readable. - .format() method offers more flexibility, such as specifying formatting options or reusing values.

Examples:

# Example using .format() method
name = "Bob"
age = 25
formatted_string = "Hello, {}! You are {} years old.".format(name, age)
print(formatted_string)


# Example using f-strings
name = "Charlie"
age = 35
formatted_string = f"Hello, {name}! You are {age} years old."
print(formatted_string)
Hello, Bob! You are 25 years old.
Hello, Charlie! You are 35 years old.