class: center, middle, inverse, title-slide # Introduction to Relational Databases ## Computational Statistics
& Operations Research ###
Janus Valberg-Madsen
### 2019-03-18 @ Aalborg University --- # Course structure 1. Introduction to databases and SQL 1. Aggregation, JOINs, and subqueries 1. Using a PostgreSQL database from R 1. Database management and performance 1. Self study session ??? ## Goal of the course - **is not**: to cover _all_ the details of the SQL standard or the Postgres implementation - **is:** to teach you enough of the basics that you can use it as part of data analysis tasks ## Structure & exam - 5 sessions in total: 4 lectures and 1 self study session + lecture 1 & 2: general intro to SQL + lecture 3 & 4: practical aspects + self study: hand-in exercise, which must be completed to pass; this exercise will contain questions in topics from all four lectures, so make sure to show up! --- # Today's topics - What are databases and why/when do they make sense to use? - Basic SQL syntax - Using `psql` and `pgAdmin` ??? For today, - we start with a brief introduction to databases in general, talking about pros and cons - then we dive into the basic standard SQL syntax, with some examples and class exercises - finally we take a look at using Postgres specifically, and set up a test database that we'll be using throughout the lectures --- layout: false name: why-dbs class: center, middle, snowstorm # Why databases? ??? So _when_ and _why_ might you use databases? --- class: center background-image: url(https://upload.wikimedia.org/wikipedia/commons/0/0c/ComputerMemoryHierarchy.svg) background-size: 70% ??? It might be helpful to first establish an overview of different options for data storage: - Volatile; does not retain content when powered off + Processor registers: small amount of "instant" access storage in the CPU; up to a few thousand bytes in size + Cache: different levels of fast storage, from ~KBs in size and ~700GB/s access to ~MBs in size and ~40GB/s access + RAM: GBs in size, access around ~10GB/s - Non-volative; retains content when powered off + Flash memory: GBs in size, ~640MB/s; short term storage, used for transfer + Disk storage: TBs in size, up to ~2GB/s (with SSDs); mid term storage, rule of thumb is that hard drives last about 5 years + Magnetic tape: EBs (exabytes) of size, best speed ~160MB/s; long term storage, can last decades SQL databases fall under the "disk storage/hard drives" category. --- layout: true name: rectangular # Rectangular data ??? You are likely used to data being _rectangular_, i.e. suitable for storing in a 2-dimensional table, each column corresponding to a different _variable_, and each row corresponding to an _observation_ of those variables together. ## Data frames - This is exactly what a data frame in e.g. R is like - Data frames only "live" within the R/Python/Julia/etc. session ## CSV files - Data persistent between sessions - Plain text files are not very efficient - Entire file needs to be parsed (e.g. via `read_csv`) ## Databases - In a database, you have a much higher level of control - Data is not plainly available, but "lives" on a server - The server can filter and aggregate by sending instructions via a client program --- template: rectangular .left-column[ ### Data frames ] .right-column[ - Tables stored temporarily in memory - Fast access, limited size ] --- template: rectangular .left-column[ ### Data frames ### CSV files ] .right-column[ - Tables stored "permanently" in files on the hard drive - Unlimited size<sup>*</sup>, slower access - You still have to read it into memory to do stuff ] .footnote[ *) Depending on your hard drive volume and type 😏 ] --- template: rectangular .left-column[ ### Data frames ### CSV files ### Databases ] .right-column[ - A more organised approach to data collection - Data is "hidden away" on server (binary files somewhere on disk) - Client program needed to interact with it ] --- layout: false .pull-left[ # Pros - Huge amounts of data - No data redundancy - Statically typed - Users can have different privileges - Preprocess data before loading it into R ] .pull-right[ # Cons - May seem complicated at first - It's a bit harder to "just view" data (compared to plain text files) - A bit more effort to get data into R ] ??? ## Pros - Since data is stored on disk, we can have TBs of data, spread across many tables - Databases are optimised to store data efficiently (contrast with CSV files, where all values are ASCII sequences of 1 byte per character) - The type of a value is determined by its column; each column has a type, and it can only contain values of that type (contrast with CSV again) - As the database administrator, you can create mutliple "roles" (users) with different privileges, e.g. read/write/create database/etc. - Execute queries on the data on the server and pull the results into memory locally ## Cons - More of a learning curve than simple CSV files - With CSV files, you can open the file on disk with any text editor and look at the data; with SQL you have to query the database with a client - As we'll see later, it takes more than just a simple one-liner to load data into R (but not _much_ more) --- # Terminology  ??? SQL is based on the _relational data model_, coined by computer scientist Edgar F. Codd in 1969. This model is described using so-called _relational algebra_ with certain set operations, but we will not be diving into that area. The terms used in the theory and their practical/SQL counterparts are: RM | SQL --- | --- Relation | Table Attribute | Column Tuple | Row A **relation** (table) is a set of **tuples** (rows), where each element belongs to an **attribute** domain (column/type). --- layout: false name: sql class: center, middle, snowstorm # SQL ??? So, in order to read or write to a database, we need to tell the server what we want. This is what SQL is for. --- # SQL - Short for **Structured Query Language** - A declarative, domain-specific language - Used to communicate with the server - Several implementations exist (we'll be using PostgreSQL) ??? - SQL was developed by IBM in the 1970's, initially called SEQUEL (Structured English Query Language), but the name later changed because "SEQUEL" was a trademark of another company - Declarative, domain-specific: + Declarative: you write _what_ you want, rather than _how_ the program should do it (imperative) + Domain-specific: it's not a general programming language, but rather it deals with the domain of relational databases - You type queries in this language into a client program, that executes them on the server - Other examples include: MySQL, MS SQL Server, SQLite, and Oracle; the choice of PostgreSQL is based on it being cross platform, open source, and standard compliant --- # Basic syntax: selection ??? Most commonly, you'll be querying the database to _read_ data from a table; this is done with a `SELECT` statement. -- ```sql SELECT col1, col2, ... FROM table1; ``` "**SELECT** all rows of columns `col1`, `col2`, etc. **FROM** the table named `table1`" ??? The syntax reads much like plain English; you say which columns you want to `SELECT` (retrieve) and `FROM` which table. -- ```sql SELECT * FROM table1; ``` "**SELECT** all rows of all columns **FROM** the table `table1`" ??? The `*` in the column list is shorthand for all columns. --- layout: true # Basic syntax: filtering --- ??? Oftentimes, you don't want _all_ the rows in a table, but only a subset. This is called _filtering_. -- ```sql SELECT * FROM table1 WHERE <condition>; ``` "**SELECT** the rows **FROM** `table1` **WHERE** `condition` is true" ??? The `WHERE` keyword is used to filter rows conditionally. Here, `condition` is any statement that evaluates to a boolean value (`true/false`) -- ```sql SELECT DISTINCT * FROM table1; ``` "**SELECT** **DISTINCT** (unique) rows **FROM** `table1`" ??? The `DISTINCT` keyword signifies that you don't want duplicate rows; only unique combinations of the values in the selected columns are retrieved. -- ```sql SELECT * FROM table1 LIMIT <number>; ``` "**SELECT** the rows **FROM** `table1` and **LIMIT** the output to only `number` of them" ??? The `LIMIT` keyword tells the database to return only up to a fixed number of rows - good for previewing tables --- ### Boolean expressions Common functions in such `condition`s: .pull-left[ - Equality (NB: not assignment): `col = val` - Inequality: `>`, `<`, `>=`, `<=`, `<>`/`!=` - Double inequality: `col BETWEEN a AND b` - Set membership: `col IN (a, b, ...)` ] .pull-right[ - String mathching: `col LIKE pattern` → Wildcards: `_` for single char, `%` for many - Logical operators: `AND`, `OR`, `XOR`, `NOT` ] ??? These functions can be used to describe boolean `condition`s --- layout: false name: exercises # Exercises Do the following tutorials on [sqlzoo.net](https://sqlzoo.net/): - [0 SELECT basics](https://sqlzoo.net/wiki/SELECT_basics) ([+quiz](https://sqlzoo.net/wiki/SELECT_Quiz)) - [1 SELECT names](https://sqlzoo.net/wiki/SELECT_names) - [2 SELECT FROM world](https://sqlzoo.net/wiki/SELECT_from_WORLD_Tutorial) ([+quiz](https://sqlzoo.net/wiki/BBC_QUIZ)) - [3 SELECT FROM nobel](https://sqlzoo.net/wiki/SELECT_from_Nobel_Tutorial) ([+quiz](https://sqlzoo.net/wiki/Nobel_Quiz)) (these slides can be accessed on: <https://janusvm.github.io/aau-csor-db>) --- class: polarnight # Cheat sheet ```sql SELECT col1, col2, ... FROM table1 -- filter columns WHERE col2 >= 0 -- filter rows ORDER BY col2, col1 DESC -- order according to specific column(s) LIMIT 10; -- only return a limited number of rows ``` .pull-left[ - `SELECT DISTINCT`: only return unique results - `col BETWEEN a AND b`: limit values of `col` to the range `[a,b]` - `col LIKE pattern`: pattern match in text + `_` any single character + `%` any number of characters - `col IN (a, b, c)`: limit values of `col` to array of values ] .pull-right[ - `CONCAT(str1, str2)`: concatenate strings - `REPLACE(str, sub, new)`: replace `sub` with `new` in `str` - `ROUND(num, n)`: round `num` to `n` places - `LENGTH(str)`: length of string - `LEFT(str, n)`: first `n` characters of string ] --- layout: false name: postgres class: center, middle, snowstorm # PostgreSQL clients --- # psql ```console $ psql -h <hostname> -p <port> -U <user> -d <dbname> ``` By default, `hostname` is `localhost`, and `port` is 5432. Command | Effect --- | --- `\d` | List tables in database `\d table` | Describe table `\l` | List databases on server `\c dbname` | Connect to another database `\e` | Open editor (e.g. for long, multi-line queries) `\q` | Quit `psql` `\?` | List all commands Other than that, you can input SQL statements ― remember to end with `;` ??? `psql` is the command line tool for interacting with a PostgreSQL database. We won't be using it in this course (other than for setting up the `postgres` user's password on Linux), but it you ever need to do some advanced shell scripting with PostgreSQL, it's good to know about. --- # pgAdmin  ??? `pgAdmin` is a graphical tool for managing your database, in which you get a tree-based view of the database structure. This is what we'll be primarily using to interact with our local database. **[switch to pgAdmin here to quickly showcase the most important functions]** - Browser tree - Query tool [ensure everyone has it up and running with a `localhost` server] --- ## Set up example database - Select `localhost` server in Browser - Object → Create → Login/Group role - Name: `csor` - Password: `csor` - Privileges: default + can login, can create databases - Object → Create → Database - Name: `nycflights13` - Owner: `csor` Run this code in an R session: ```r dbplyr::nycflights13_postgres(dbname = "nycflights13", host = "localhost", user = "csor", password = "csor") ``` ??? To get some example data into your `postgres` server, do these steps to add a database with a few tables. --- ## Run example queries 1. Select the `nycflights13` database in the Browser 2. Go to Tools → Query Tool 3. Run these queries individually (execute with F5) ```sql SELECT * FROM airlines LIMIT 10; SELECT * FROM airports LIMIT 10; SELECT * FROM flights LIMIT 10; SELECT * FROM planes LIMIT 10; SELECT * FROM weather LIMIT 10; ``` Try using `SELECT` and `WHERE` to retrieve different subsets, e.g.: - the rows from `flights`, where `dep_delay` is 0 - the airlines whose `name` includes "Airlines" - the `tailnum`s of all planes manufactured by "EMBRAER" --- layout: false class: polarnight # Links 🔗 - Course home page: <https://janusvm.github.com/aau-csor-db> - These slides: <https://janusvm.github.com/aau-csor-db/slides/01-introduction> - PostgreSQL documentation: <https://www.postgresql.org/docs/current/index.html> - SELECT reference: <https://sqlzoo.net/wiki/SELECT_Reference>