Introduction to Relational Databases

# Introduction to Relational Databases
## Computational Statistics<br>& Operations Research
### <a href='htps://janusvm.github.io'>Janus Valberg-Madsen</a>
### 2019-03-18 @ Aalborg University

---

# Course structure

1. Introduction to databases and SQL
1. Aggregation, JOINs, and subqueries
1. Using a PostgreSQL database from R
1. Database management and performance
1. Self study session

???
## Goal of the course

- **is not**: to cover _all_ the details of the SQL standard or the Postgres implementation
- **is:** to teach you enough of the basics that you can use it as part of data analysis tasks

## Structure & exam

- 5 sessions in total: 4 lectures and 1 self study session

+ lecture 1 & 2: general intro to SQL
    + lecture 3 & 4: practical aspects
    + self study: hand-in exercise, which must be completed to pass;

this exercise will contain questions in topics from all four lectures, so make sure to show up!

---

# Today's topics

- What are databases and why/when do they make sense to use?
- Basic SQL syntax
- Using `psql` and `pgAdmin`

???
For today,

- we start with a brief introduction to databases in general, talking about pros and cons
- then we dive into the basic standard SQL syntax, with some examples and class exercises
- finally we take a look at using Postgres specifically, and set up a test database that we'll be using throughout the lectures

---
layout: false

# Why databases?

???
So _when_ and _why_ might you use databases?

---
class: center
background-image: url(https://upload.wikimedia.org/wikipedia/commons/0/0c/ComputerMemoryHierarchy.svg)
background-size: 70%

???
It might be helpful to first establish an overview of different options for data storage:

- Volatile; does not retain content when powered off

+ Processor registers: small amount of "instant" access storage in the CPU; up to a few thousand bytes in size
    + Cache: different levels of fast storage, from ~KBs in size and ~700GB/s access to ~MBs in size and ~40GB/s access
    + RAM: GBs in size, access around ~10GB/s

- Non-volative; retains content when powered off

+ Flash memory: GBs in size, ~640MB/s; short term storage, used for transfer
    + Disk storage: TBs in size, up to ~2GB/s (with SSDs); mid term storage, rule of thumb is that hard drives last about 5 years
    + Magnetic tape: EBs (exabytes) of size, best speed ~160MB/s; long term storage, can last decades

SQL databases fall under the "disk storage/hard drives" category.

---
layout: true
name: rectangular

# Rectangular data

???
You are likely used to data being _rectangular_, i.e. suitable for storing in a 2-dimensional table, each column corresponding to a different _variable_, and each row corresponding to an _observation_ of those variables together.

## Data frames

- This is exactly what a data frame in e.g. R is like
- Data frames only "live" within the R/Python/Julia/etc. session

## CSV files

- Data persistent between sessions
- Plain text files are not very efficient
- Entire file needs to be parsed (e.g. via `read_csv`)

## Databases

- In a database, you have a much higher level of control
- Data is not plainly available, but "lives" on a server
- The server can filter and aggregate by sending instructions via a client program

---
template: rectangular

---
template: rectangular

### CSV files
]

.right-column[
- Tables stored "permanently" in files on the hard drive
- Unlimited size<sup>*</sup>, slower access
- You still have to read it into memory to do stuff
]

---
template: rectangular

### CSV files

### Databases
]

.right-column[
- A more organised approach to data collection
- Data is "hidden away" on server (binary files somewhere on disk)
- Client program needed to interact with it
]

---
layout: false

# Pros

- Huge amounts of data
- No data redundancy
- Statically typed
- Users can have different privileges
- Preprocess data before loading it into R

]

# Cons

- May seem complicated at first
- It's a bit harder to "just view" data (compared to plain text files)
- A bit more effort to get data into R

]

???
## Pros

- Since data is stored on disk, we can have TBs of data, spread across many tables
- Databases are optimised to store data efficiently (contrast with CSV files, where all values are ASCII sequences of 1 byte per character)
- The type of a value is determined by its column; each column has a type, and it can only contain values of that type (contrast with CSV again)
- As the database administrator, you can create mutliple "roles" (users) with different privileges, e.g. read/write/create database/etc.
- Execute queries on the data on the server and pull the results into memory locally

## Cons

- More of a learning curve than simple CSV files
- With CSV files, you can open the file on disk with any text editor and look at the data; with SQL you have to query the database with a client
- As we'll see later, it takes more than just a simple one-liner to load data into R (but not _much_ more)

---

# Terminology

![](https://upload.wikimedia.org/wikipedia/commons/7/7c/Relational_database_terms.svg)

???
SQL is based on the _relational data model_, coined by computer scientist Edgar F. Codd in 1969.

This model is described using so-called _relational algebra_ with certain set operations, but we will not be diving into that area.

The terms used in the theory and their practical/SQL counterparts are:

RM | SQL
--- | ---
Relation | Table
Attribute | Column
Tuple | Row

A **relation** (table) is a set of **tuples** (rows), where each element belongs to an **attribute** domain (column/type).

---
layout: false

# SQL

???
So, in order to read or write to a database, we need to tell the server what we want.
This is what SQL is for.

---

# SQL

- Short for **Structured Query Language**
- A declarative, domain-specific language
- Used to communicate with the server
- Several implementations exist (we'll be using PostgreSQL)

???
- SQL was developed by IBM in the 1970's, initially called SEQUEL (Structured English Query Language), but the name later changed because "SEQUEL" was a trademark of another company
- Declarative, domain-specific:

+ Declarative: you write _what_ you want, rather than _how_ the program should do it (imperative)
    + Domain-specific: it's not a general programming language, but rather it deals with the domain of relational databases

- You type queries in this language into a client program, that executes them on the server
- Other examples include: MySQL, MS SQL Server, SQLite, and Oracle; the choice of PostgreSQL is based on it being cross platform, open source, and standard compliant

---

# Basic syntax: selection

???
Most commonly, you'll be querying the database to _read_ data from a table;
this is done with a `SELECT` statement.

```sql
SELECT col1, col2, ... FROM table1;
```

"**SELECT** all rows of columns `col1`, `col2`, etc. **FROM** the table named `table1`"

???
The syntax reads much like plain English;
you say which columns you want to `SELECT` (retrieve) and `FROM` which table.

```sql
SELECT * FROM table1;
```

"**SELECT** all rows of all columns **FROM** the table `table1`"

???
The `*` in the column list is shorthand for all columns.

---
layout: true
# Basic syntax: filtering

---

???
Oftentimes, you don't want _all_ the rows in a table, but only a subset.
This is called _filtering_.

```sql
SELECT * FROM table1
 WHERE <condition>;
```

"**SELECT** the rows **FROM** `table1` **WHERE** `condition` is true"

???
The `WHERE` keyword is used to filter rows conditionally.
Here, `condition` is any statement that evaluates to a boolean value (`true/false`)

```sql
SELECT DISTINCT * FROM table1;
```

"**SELECT** **DISTINCT** (unique) rows **FROM** `table1`"

???
The `DISTINCT` keyword signifies that you don't want duplicate rows; only unique combinations of the values in the selected columns are retrieved.

```sql
SELECT * FROM table1 LIMIT <number>;
```

"**SELECT** the rows **FROM** `table1` and **LIMIT** the output to only `number` of them"

???
The `LIMIT` keyword tells the database to return only up to a fixed number of rows - good for previewing tables

---

### Boolean expressions

Common functions in such `condition`s:

.pull-left[
- Equality (NB: not assignment): `col = val`
- Inequality: `>`, `<`, `>=`, `<=`, `<>`/`!=`
- Double inequality: `col BETWEEN a AND b`
- Set membership: `col IN (a, b, ...)`
]

→ Wildcards: `_` for single char, `%` for many

- Logical operators: `AND`, `OR`, `XOR`, `NOT`
]

???
These functions can be used to describe boolean `condition`s

---
layout: false

# Exercises

Do the following tutorials on [sqlzoo.net](https://sqlzoo.net/):

- [0 SELECT basics](https://sqlzoo.net/wiki/SELECT_basics) ([+quiz](https://sqlzoo.net/wiki/SELECT_Quiz))
- [1 SELECT names](https://sqlzoo.net/wiki/SELECT_names)
- [2 SELECT FROM world](https://sqlzoo.net/wiki/SELECT_from_WORLD_Tutorial) ([+quiz](https://sqlzoo.net/wiki/BBC_QUIZ))
- [3 SELECT FROM nobel](https://sqlzoo.net/wiki/SELECT_from_Nobel_Tutorial) ([+quiz](https://sqlzoo.net/wiki/Nobel_Quiz))

(these slides can be accessed on: <https://janusvm.github.io/aau-csor-db>)

---
class: polarnight

# Cheat sheet

```sql
SELECT col1, col2, ... FROM table1  -- filter columns
 WHERE col2 >= 0                    -- filter rows
 ORDER BY col2, col1 DESC           -- order according to specific column(s)
 LIMIT 10;                          -- only return a limited number of rows
```

- `SELECT DISTINCT`: only return unique results
- `col BETWEEN a AND b`: limit values of `col` to the range `[a,b]`
- `col LIKE pattern`: pattern match in text
  + `_` any single character
  + `%` any number of characters
- `col IN (a, b, c)`: limit values of `col` to array of values

]

- `CONCAT(str1, str2)`: concatenate strings
- `REPLACE(str, sub, new)`: replace `sub` with `new` in `str`
- `ROUND(num, n)`: round `num` to `n` places
- `LENGTH(str)`: length of string
- `LEFT(str, n)`: first `n` characters of string

]

---
layout: false

# PostgreSQL clients

---

# psql

```console
$ psql -h <hostname> -p <port> -U <user> -d <dbname>
```

By default, `hostname` is `localhost`, and `port` is 5432.

Command | Effect
--- | ---
`\d` | List tables in database
`\d table` | Describe table
`\l` | List databases on server
`\c dbname` | Connect to another database
`\e` | Open editor (e.g. for long, multi-line queries)
`\q` | Quit `psql`
`\?` | List all commands

Other than that, you can input SQL statements ― remember to end with `;`

???

`psql` is the command line tool for interacting with a PostgreSQL database.

We won't be using it in this course (other than for setting up the `postgres` user's password on Linux), but it you ever need to do some advanced shell scripting with PostgreSQL, it's good to know about.

---

# pgAdmin

![](../resources/img/pgadmin-dashboard.png)

???

`pgAdmin` is a graphical tool for managing your database, in which you get a tree-based view of the database structure.

This is what we'll be primarily using to interact with our local database.

**[switch to pgAdmin here to quickly showcase the most important functions]**

- Browser tree
- Query tool

[ensure everyone has it up and running with a `localhost` server]

---

## Set up example database

- Select `localhost` server in Browser
- Object → Create → Login/Group role
  - Name: `csor`
  - Password: `csor`
  - Privileges: default + can login, can create databases
- Object → Create → Database
  - Name: `nycflights13`
  - Owner: `csor`

Run this code in an R session:

```r
dbplyr::nycflights13_postgres(dbname   = "nycflights13",
                              host     = "localhost",
                              user     = "csor",
                              password = "csor")
```

???

To get some example data into your `postgres` server, do these steps to add a database with a few tables.

---

## Run example queries

1. Select the `nycflights13` database in the Browser
2. Go to Tools → Query Tool
3. Run these queries individually (execute with F5)

```sql
    SELECT * FROM airlines LIMIT 10;
    SELECT * FROM airports LIMIT 10;
    SELECT * FROM flights  LIMIT 10;
    SELECT * FROM planes   LIMIT 10;
    SELECT * FROM weather  LIMIT 10;
    ```

Try using `SELECT` and `WHERE` to retrieve different subsets, e.g.:

- the rows from `flights`, where `dep_delay` is 0
- the airlines whose `name` includes "Airlines"
- the `tailnum`s of all planes manufactured by "EMBRAER"

---
layout: false

# Links 🔗

- Course home page: <https://janusvm.github.com/aau-csor-db>
- These slides: <https://janusvm.github.com/aau-csor-db/slides/01-introduction>
- PostgreSQL documentation: <https://www.postgresql.org/docs/current/index.html>
- SELECT reference: <https://sqlzoo.net/wiki/SELECT_Reference>