# Mathematical Background

“When you have mastered numbers, you will in fact no longer be reading numbers, any more than you read words when reading books. You will be reading meanings.”, W. E. B. Du Bois

In this chapter, we review some of the mathematical concepts that we will use in this course. Most of these are not very complicated, but do require some practice and exercise to get comfortable with. If you have not previously encountered some of these concepts, there are several excellent freely available resources online for them. In particular, the CS 121 webpage contains a program for self study of all the needed notions using the lecture notes, videos, and assignments of MIT course 6.042J Mathematics for Computer Science. (The MIT lecture notes are also used by Harvard CS 20.)

## A mathematician’s apology

Before explaining the math background, perhaps I should explain why
this course is so “mathematically heavy”. After all, this is supposed to
be a course about *computation*; one might think we should be talking
mostly about *programs*, rather than more “mathematical” objects such as
*sets*, *functions*, and *graphs*, and doing more *coding* on an actual
computer than writing mathematical proofs with pen and paper. So, why
are we doing so much math in this course? Is it just some form of
hazing? Perhaps a revenge of the “math nerds” against the
“hackers”?

At the end of the day, mathematics is simply a language for modelling
concepts in a precise and unambiguous way. In this course, we will be
mostly interested in the concept of *computation*. For example, we will
look at questions such as *“is there an efficient algorithm to find the
prime factors of a given integer?”*. (Actually, scientists currently do not know the answer to this
question, but we will see that settling it in either direction has
very interesting applications touching on areas as far apart as
Internet security and quantum mechanics.) To even *phrase* such a
question, we need to give a precise *definition* of the notion of an
*algorithm*, and of what it means for an algorithm to be *efficient*.
Also, if the answer to this or similar questions turns out to be
*negative*, then this cannot be shown by simply writing and executing
some code. After all, there is no empirical experiment that will prove
the *non-existence* of an algorithm. Thus, our only way to show this
type of *negative result* is to use *mathematical proofs*. So you can
see why our main tools in this course will be mathematical proofs and
definitions.

## A quick overview of mathematical prerequisites

The main notions we will use in this course are the following:

- **Proofs:** First and foremost, this course will involve a heavy dose of formal mathematical reasoning, which includes mathematical *definitions*, *statements*, and *proofs*.
- **Sets:** Including notation such as membership (\(\in\)), containment (\(\subseteq\)), and set operations such as union, intersection, set difference and Cartesian product (\(\cup,\cap,\setminus\) and \(\times\)).
- **Functions:** Including the notions of the *domain* and *range* of a function, properties such as being *one-to-one* (also known as *injective*) or *onto* (also known as *surjective*) functions, as well as *partial functions* (that, unlike standard or “total” functions, are not necessarily defined on all elements of their domain).
- **Logical operations:** The operations AND, OR, and NOT (\(\wedge,\vee,\neg\)) and the quantifiers “exists” and “forall” (\(\exists\), \(\forall\)).
- **Tuples and strings:** The notation \(\Sigma^k\) and \(\Sigma^*\) where \(\Sigma\) is some finite set which is called the *alphabet* (quite often \(\Sigma = \{0,1\}\)).
- **Basic combinatorics:** Notions such as \(\binom{n}{k}\) (the number of \(k\)-sized subsets of a set of size \(n\)).
- **Graphs:** Undirected and directed graphs, connectivity, paths, and cycles.
- **Big Oh notation:** \(O,o,\Omega,\omega,\Theta\) notation for analyzing asymptotics of functions.
- **Discrete probability:** Later on in this course we will use *probability theory*, and specifically probability over *finite* sample spaces such as tossing \(n\) coins. We will only use probability theory in the second half of this course, and will review it beforehand. However, probabilistic reasoning is a subtle (and extremely useful!) skill, and it’s always good to start early in acquiring it.

While I highly recommend the resources linked above, in the rest of this section we briefly review these notions. This is partially to remind the reader and reinforce material that might not be fresh in your mind, and partially to introduce our notation and conventions which might occasionally differ from those you’ve encountered before.

## Basic discrete math objects

We now quickly review some of the mathematical objects and definitions we use in this course.

### Sets

A *set* is an unordered collection of objects. For example, when we
write \(S = \{ 2,4, 7 \}\), we mean that \(S\) denotes the set that contains
the numbers \(2\), \(4\), and \(7\). (We use the notation “\(2 \in S\)” to
denote that \(2\) is an element of \(S\).) Note that the sets \(\{ 2, 4, 7 \}\)
and \(\{ 7 , 4, 2 \}\) are identical, since they contain the same
elements. Also, a set either contains an element or does not contain it
(there is no notion of containing it “twice”), and so we could even write
the same set \(S\) as \(\{ 2, 2, 4, 7\}\) (though that would be a little
weird). The *cardinality* of a finite set \(S\), denoted by \(|S|\), is the
number of distinct elements it contains. (Later in this course we will discuss how to extend the notion of
cardinality to *infinite* sets.) So, in the example above,
\(|S|=3\). A set \(S\) is a *subset* of a set \(T\), denoted by
\(S \subseteq T\), if every element of \(S\) is also an element of \(T\). (We
can also describe this by saying that \(T\) is a *superset* of \(S\).) For
example, \(\{2,7\} \subseteq \{ 2,4,7\}\). The set that contains no
elements is known as the *empty set* and it is denoted by \(\emptyset\).

We can define sets by either listing all their elements or by writing down a rule that they satisfy such as \[ EVEN = \{ x \;:\; \text{ $x=2y$ for some non-negative integer $y$} \} \;. \]

Of course there is more than one way to write the same set, and often we will use intuitive notation listing a few examples that illustrate the rule. For example, we can also define \(EVEN\) as

\[ EVEN = \{ 0,2,4, \ldots \} \;. \]

Note that a set can be either finite (such as the set \(\{2,4,7\}\) ) or
infinite (such as the set \(EVEN\)). Also, the elements of a set don’t
have to be numbers. We can talk about sets such as the set
\(\{a,e,i,o,u \}\) of all the vowels in the English language, or the set
\(\{\) `New York`, `Los Angeles`, `Chicago`, `Houston`, `Philadelphia`, `Phoenix`, `San Antonio`, `San Diego`, `Dallas` \(\}\) of all cities in
the U.S. with population more than one million per the 2010 census. A
set can even have other sets as elements, such as the set
\(\{ \emptyset, \{1,2\},\{2,3\},\{1,3\} \}\) of all even-sized subsets of
\(\{1,2,3\}\).

**Operations on sets:** The *union* of two sets \(S,T\), denoted by
\(S \cup T\), is the set that contains all elements that are either in \(S\)
*or* in \(T\). The *intersection* of \(S\) and \(T\), denoted by \(S \cap T\),
is the set of elements that are both in \(S\) *and* in \(T\). The *set
difference* of \(S\) and \(T\), denoted by \(S \setminus T\) (and in some
texts also by \(S-T\)), is the set of elements that are in \(S\) but *not*
in \(T\).
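The set notions above map closely onto Python's built-in `set` type. The following is a small sketch for experimentation only; the variable names (`A`, `B`, `even_below_10`, and so on) are our own and do not come from the text.

```python
S = {2, 4, 7}

# Order and repetition do not matter: all three literals denote the same set.
same_as_reordered = (S == {7, 4, 2})
same_with_repeat = (S == {2, 2, 4, 7})

cardinality = len(S)       # |S|
is_member = 2 in S         # 2 ∈ S
is_subset = {2, 7} <= S    # {2,7} ⊆ S

# A set defined by a rule: a finite piece of EVEN.
even_below_10 = {x for x in range(10) if x % 2 == 0}

# Union, intersection and set difference.
A = {1, 2, 3}
B = {3, 4}
union = A | B
intersection = A & B
difference = A - B
```

Python sets are always finite, so an infinite set such as \(EVEN\) can only be represented by a rule (for instance a membership test), not by listing its elements.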

**Tuples, lists, strings, sequences:** A *tuple* is an *ordered*
collection of items. For example \((1,5,2,1)\) is a tuple with four
elements (also known as a \(4\)-tuple or quadruple). Since order matters,
this is not the same tuple as the \(4\)-tuple \((1,1,5,2)\) or the \(3\)-tuple
\((1,5,2)\). A \(2\)-tuple is also known as a *pair*. We use the terms
*tuples* and *lists* interchangeably. A tuple where every element comes
from some finite set \(\Sigma\) (such as \(\{0,1\}\)) is also known as a
*string*. Analogously to sets, we denote the *length* of a tuple \(T\) by
\(|T|\). Just like sets, we can also think of infinite analogs of
tuples, such as the ordered collection \((1,4,9,16,\ldots )\) of all
positive perfect squares. Infinite ordered collections are known as *sequences*;
we might sometimes use the term “infinite sequence” to emphasize this,
and use “finite sequence” as a synonym for a tuple. (We can identify a sequence \((a_0,a_1,a_2,\ldots)\) of elements in
some set \(S\) with a *function* \(A:\N \rightarrow S\) (where
\(a_n = A(n)\) for every \(n\in \N\)). Similarly, we can identify a
\(k\)-tuple \((a_0,\ldots,a_{k-1})\) of elements in \(S\) with a function
\(A:[k] \rightarrow S\).)

**Cartesian product:** If \(S\) and \(T\) are sets, then their *Cartesian
product*, denoted by \(S \times T\), is the set of all ordered pairs
\((s,t)\) where \(s\in S\) and \(t\in T\). For example, if \(S = \{1,2,3 \}\)
and \(T = \{10,12 \}\), then \(S\times T\) contains the \(6\) elements
\((1,10),(2,10),(3,10),(1,12),(2,12),(3,12)\). Similarly if \(S,T,U\) are
sets then \(S\times T \times U\) is the set of all ordered triples
\((s,t,u)\) where \(s\in S\), \(t\in T\), and \(u\in U\). More generally, for
every positive integer \(n\) and sets \(S_0,\ldots,S_{n-1}\), we denote by
\(S_0 \times S_1 \times \cdots \times S_{n-1}\) the set of ordered
\(n\)-tuples \((s_0,\ldots,s_{n-1})\) where \(s_i\in S_i\) for every
\(i \in \{0,\ldots, n-1\}\).

For every set \(S\), we denote the set \(S\times S\) by \(S^2\),
\(S\times S\times S\) by \(S^3\), \(S\times S\times S \times S\) by \(S^4\), and
so on and so forth.
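As a quick illustration, tuples and Cartesian products can be explored with Python tuples and `itertools.product`. This is only a sketch; the names `S_times_T` and `S_cubed` are ours.

```python
from itertools import product

# Tuples are ordered, so rearranging the entries gives a different tuple.
tuples_differ = (1, 5, 2, 1) != (1, 1, 5, 2)

S = {1, 2, 3}
T = {10, 12}

# S x T: all ordered pairs (s, t) with s in S and t in T.
S_times_T = set(product(S, T))

# S^3 = S x S x S: all ordered triples of elements of S.
S_cubed = set(product(S, repeat=3))
```

For finite sets the sizes multiply: here \(|S \times T| = 3 \cdot 2 = 6\) and \(|S^3| = 3^3 = 27\).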

### Special sets

There are several sets that we will use in this course time and again, and so we find it useful to introduce explicit notation for them. For starters we define

\[ \N = \{ 0, 1,2, \ldots \} \]

to be the set of all *natural numbers*, i.e., non-negative integers. For
any natural number \(n\), we define the set \([n]\) as
\(\{0,\ldots, n-1\} = \{ k\in \N : k < n \}\). (We start our indexing of both \(\N\) and \([n]\) from \(0\), while many
other texts index those sets from \(1\). Starting from zero or one is
simply a convention that doesn’t make much difference, as long as
one is consistent about it.)

We will also occasionally use the set
\(\Z=\{\ldots,-2,-1,0,+1,+2,\ldots \}\) of (negative and non-negative)
*whole numbers* (the letter Z stands for the German word “Zahlen”, which means
*numbers*), as well as the set \(\R\) of *real* numbers. (This is
the set that includes not just the whole numbers, but also fractional
and even irrational numbers; e.g., \(\R\) contains numbers such as \(+0.5\),
\(-\pi\), etc.) We denote by \(\R_+\) the set \(\{ x\in \R : x > 0 \}\) of
*positive* real numbers. This set is sometimes also denoted as
\((0,\infty)\).

**Strings:** Another set we will use time and again is

\[ \{0,1\}^n = \{ (x_0,\ldots,x_{n-1}) \;:\; x_0,\ldots,x_{n-1} \in \{0,1\} \} \]

which is the set of all \(n\)-length binary strings for some natural number \(n\). That is, \(\{0,1\}^n\) is the set of all \(n\)-tuples of zeroes and ones. This is consistent with our notation above: \(\{0,1\}^2\) is the Cartesian product \(\{0,1\} \times \{0,1\}\), \(\{0,1\}^3\) is the product \(\{0,1\} \times \{0,1\} \times \{0,1\}\) and so on.

We will write the string \((x_0,x_1,\ldots,x_{n-1})\) as simply \(x_0x_1\cdots x_{n-1}\) and so for example

\[ \{0,1\}^3 = \{ 000 , 001, 010 , 011, 100, 101, 110, 111 \} \;. \]

For every string \(x\in \{0,1\}^n\) and \(i\in [n]\), we write \(x_i\) for the
\(i^{th}\) coordinate of \(x\). If \(x\) and \(y\) are strings, then \(xy\)
denotes their *concatenation*. That is, if \(x \in \{0,1\}^n\) and
\(y\in \{0,1\}^m\), then \(xy\) is equal to the string \(z\in \{0,1\}^{n+m}\)
such that for \(i\in [n]\), \(z_i=x_i\) and for \(i\in \{n,\ldots,n+m-1\}\),
\(z_i = y_{i-n}\).
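The following sketch enumerates \(\{0,1\}^n\) and implements concatenation directly from the index rule above, representing binary strings as Python `str` values. The function names `binary_strings` and `concat` are hypothetical helpers introduced here for illustration.

```python
from itertools import product


def binary_strings(n):
    """Return all strings in {0,1}^n as a list, in lexicographic order."""
    return ["".join(bits) for bits in product("01", repeat=n)]


def concat(x, y):
    """Build z = xy coordinate by coordinate:
    z_i = x_i for i < n, and z_i = y_{i-n} for n <= i < n+m."""
    n, m = len(x), len(y)
    return "".join(x[i] if i < n else y[i - n] for i in range(n + m))


all_length_three = binary_strings(3)
z = concat("01", "110")
```

In practice one would simply write `x + y` in Python; `concat` spells out the indices only to mirror the definition.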

We will also often talk about the set of binary strings of *all*
lengths, which is

\[ \{0,1\}^* = \{ (x_0,\ldots,x_{n-1}) \;:\; n\in\N \;,\; x_0,\ldots,x_{n-1} \in \{0,1\} \} \;. \]

Another way to write this set is as \[ \{0,1\}^* = \{0,1\}^0 \cup \{0,1\}^1 \cup \{0,1\}^2 \cup \cdots \]

or more concisely as

\[ \{0,1\}^* = \cup_{n\in\N} \{0,1\}^n \;. \]

The set \(\{0,1\}^*\) also contains the “string of length \(0\)” or “the empty string”, which we will denote by \(\mathtt{""}\). (We follow programming languages in this notation; other texts sometimes use \(\epsilon\) or \(\lambda\) to denote the empty string. However, this doesn’t matter much since we will rarely encounter this “edge case”.)

**Generalizing the star operation:** For every set \(\Sigma\), we define

\[\Sigma^* = \cup_{n\in \N} \Sigma^n \;.\]

For example, if \(\Sigma = \{a,b,c,d,\ldots,z \}\) then \(\Sigma^*\) denotes the set of all finite length strings over the alphabet a-z.
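Since \(\Sigma^*\) is infinite, a program cannot list it in full, but it can enumerate it lazily, one length at a time, following the union \(\bigcup_{n\in\N}\Sigma^n\). Below is a minimal sketch using a generator; the name `star` is ours, and we assume the alphabet is a non-empty finite collection of single characters.

```python
from itertools import count, islice, product


def star(alphabet):
    """Yield every string over `alphabet`, shortest first
    (Sigma^0, then Sigma^1, then Sigma^2, ...)."""
    symbols = sorted(alphabet)
    for n in count(0):
        for letters in product(symbols, repeat=n):
            yield "".join(letters)


# The first seven elements of {0,1}^*, starting with the empty string.
first_seven = list(islice(star({"0", "1"}), 7))
```

Every string of length \(n\) appears after finitely many steps, which is the sense in which the generator "covers" all of \(\Sigma^*\).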

**Concatenation:** The *concatenation* of two strings \(x\in \Sigma^n\)
and \(y\in \Sigma^m\) is the \((n+m)\)-length string \(xy\) obtained by
writing \(y\) after \(x\). That is, \((xy)_i\) equals \(x_i\) if \(i<n\) and
equals \(y_{i-n}\) if \(n \leq i < n+m\).

### Functions

If \(S\) and \(T\) are sets, a *function* \(F\) mapping \(S\) to \(T\), denoted by
\(F:S \rightarrow T\), associates with every element \(x\in S\) an element
\(F(x)\in T\). The set \(S\) is known as the *domain* of \(F\) and the set \(T\)
is known as the *range* or *co-domain* of \(F\). Just as with sets, we can
write a function either by listing the table of all the values it gives
for elements in \(S\) or using a rule. For example, if
\(S = \{0,1,2,3,4,5,6,7,8,9 \}\) and \(T = \{0,1 \}\), then the function \(F\)
defined by the input-output behavior in the table below is the same
as the one defined by the rule \(F(x)= (x \mod 2)\).

| Input | Output |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 0 |
| 3 | 1 |
| 4 | 0 |
| 5 | 1 |
| 6 | 0 |
| 7 | 1 |
| 8 | 0 |
| 9 | 1 |

If \(F:S \rightarrow T\) satisfies that \(F(x)\neq F(y)\) for all \(x \neq y\) then we say that \(F\) is *one-to-one* (also known as an *injective* function or simply an *injection*). If \(F\) satisfies that for every \(y\in T\) there is some \(x\) such that \(F(x)=y\) then we say that \(F\) is *onto* (also known as a *surjective* function or simply a *surjection*). A function that is both one-to-one and onto is known as a *bijective* function or simply a *bijection*. If \(S=T\) then a bijection from \(S\) to \(T\) is also known as a *permutation*.

If \(F:S \rightarrow T\) is a bijection then for every \(y\in T\) there is a unique \(x\in S\) s.t. \(F(x)=y\). We denote this value \(x\) by \(F^{-1}(y)\). Note that \(F^{-1}\) is itself a bijection from \(T\) to \(S\) (can you see why?).
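For functions on small finite sets, these properties can be checked mechanically. The sketch below assumes a function \(F:S \rightarrow T\) is stored as a Python dictionary whose keys are the elements of \(S\) and whose values lie in \(T\); the helper names are ours.

```python
def is_one_to_one(F):
    """F(x) != F(y) whenever x != y: no value is hit twice."""
    values = list(F.values())
    return len(values) == len(set(values))


def is_onto(F, T):
    """Every y in T equals F(x) for some x (assumes all values of F lie in T)."""
    return set(F.values()) == set(T)


def inverse(F, T):
    """Return F^{-1} as a dictionary, assuming F is a bijection onto T."""
    if not (is_one_to_one(F) and is_onto(F, T)):
        raise ValueError("F is not a bijection, so it has no inverse")
    return {y: x for x, y in F.items()}


S = range(10)
parity = {x: x % 2 for x in S}        # the table above: onto {0,1}, not one-to-one
shift = {x: (x + 3) % 10 for x in S}  # a permutation of {0,...,9}
shift_inverse = inverse(shift, S)
```

Here `parity` is onto \(\{0,1\}\) but not one-to-one, while `shift` is a bijection from \(S\) to itself, so its inverse exists and undoes it.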

Giving a bijection between two sets is often a good way to show they
have the same size. In fact, the standard mathematical definition of the
notion that “\(S\) and \(T\) have the same cardinality” is that there exists
a bijection \(f:S \rightarrow T\). In particular, the cardinality of a set
\(S\) is defined to be \(n\) if there is a bijection from \(S\) to the set
\(\{0,\ldots,n-1\}\). As we will see later in this course, this is a
definition that generalizes to defining the cardinality of
*infinite* sets.

**Partial functions:** We will sometimes be interested in *partial*
functions from \(S\) to \(T\). This is a generalization of the notion of a
function to consider also \(F\) that is not necessarily defined on every
element of \(S\). For example, the partial function \(F(x)= \sqrt{x}\) is
only defined on non-negative real numbers. When we want to distinguish
between partial functions and standard (i.e., non-partial) functions, we
will call the latter *total* functions. When we say “function” without
any qualifier then we mean a *total* function. That is, the notion of
partial functions is a strict generalization of functions, and so a
partial function is *not* necessarily a function. The set of partial
functions is a proper superset of the set of total functions: every total
function is also a partial function (since a partial function is allowed
to be defined on all its input elements), but not vice versa.
When we want to emphasize that a function \(f\) from \(A\) to \(B\) might not
be total, we will write \(f: A \rightarrow_p B\). We can think of a
partial function \(F\) from \(S\) to \(T\) also as a total function from \(S\)
to \(T \cup \{ \bot \}\) where \(\bot\) is some special “failure symbol”,
and so instead of saying that \(F\) is undefined at \(x\), we can say that
\(F(x)=\bot\).
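The "failure symbol" view has a direct programming analog. In the sketch below we use Python's `None` to play the role of \(\bot\); this choice, and the names `BOTTOM` and `partial_sqrt`, are assumptions made for illustration.

```python
import math

BOTTOM = None  # stands in for the special failure symbol ⊥


def partial_sqrt(x):
    """The partial function sqrt on the reals, written as a total
    function into (reals ∪ {⊥}): undefined inputs map to BOTTOM."""
    if x < 0:
        return BOTTOM
    return math.sqrt(x)


defined_value = partial_sqrt(9.0)
undefined_value = partial_sqrt(-1.0)
```

A caller can then test `partial_sqrt(x) is BOTTOM` instead of asking whether the function is defined at `x`.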

**Basic facts about functions:** Verifying that you can prove the
following results is an excellent way to brush up on functions:

- If \(F:S \rightarrow T\) and \(G:T \rightarrow U\) are one-to-one functions, then their *composition* \(H:S \rightarrow U\) defined as \(H(s)=G(F(s))\) is also one to one.
- If \(S\) is non-empty and \(F:S \rightarrow T\) is one to one, then there exists an onto function \(G:T \rightarrow S\) such that \(G(F(s))=s\) for every \(s\in S\).
- If \(G:T \rightarrow S\) is onto then there exists a one-to-one function \(F:S \rightarrow T\) such that \(G(F(s))=s\) for every \(s\in S\).
- If \(S\) and \(T\) are non-empty finite sets then the following conditions are equivalent to one another: **(a)** \(|S| \leq |T|\), **(b)** there is a one-to-one function \(F:S \rightarrow T\), and **(c)** there is an onto function \(G:T \rightarrow S\).

You can find the proofs of these results in many discrete math texts, including for example, section 4.5 in the Lehman-Leighton-Meyer notes. However, I strongly suggest you try to prove them on your own, or at least convince yourself that they are true by proving special cases of those for small sizes (e.g., \(|S|=3,|T|=4,|U|=5\)).

Let us prove one of these facts as an example:

**Lemma:** If \(S,T\) are non-empty sets and \(F:S \rightarrow T\) is one to one, then there exists an onto function \(G:T \rightarrow S\) such that \(G(F(s))=s\) for every \(s\in S\).

**Proof:** Let \(S\), \(T\) and \(F:S \rightarrow T\) be as in the Lemma’s statement, and
choose some \(s_0 \in S\). We will define the function \(G:T \rightarrow S\)
as follows: for every \(t\in T\), if there is some \(s\in S\) such that
\(F(s)=t\) then set \(G(t)=s\) (the choice of \(s\) is well defined since by
the one-to-one property of \(F\), there cannot be two distinct \(s,s'\) that
both map to \(t\)). Otherwise, set \(G(t)=s_0\). Now for every \(s\in S\), by
the definition of \(G\), if \(t=F(s)\) then \(G(t)=G(F(s))=s\). Moreover, this
also shows that \(G\) is *onto*, since it means that for every \(s\in S\)
there is some \(t\) (namely \(t=F(s)\)) such that \(G(t)=s\).
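The construction in this proof is explicit enough to run on finite examples. The following sketch mirrors it step by step, with \(F\) stored as a dictionary; the name `left_inverse` and the sample sets are ours, and the code assumes (as the proof does) that \(S\) is non-empty and \(F\) is one-to-one.

```python
def left_inverse(F, S, T):
    """Build G: T -> S with G(F(s)) = s for every s in S."""
    s0 = next(iter(S))           # choose some s_0 in S
    G = {t: s0 for t in T}       # default: G(t) = s_0
    for s in S:
        G[F[s]] = s              # if F(s) = t, set G(t) = s
    return G


S = ["a", "b", "c"]
T = [0, 1, 2, 3]
F = {"a": 2, "b": 0, "c": 3}     # one-to-one, but not onto (1 is never hit)
G = left_inverse(F, S, T)
```

Because \(F\) is one-to-one, no key `F[s]` is assigned twice in the loop, which is exactly the "well defined" remark in the proof.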

### Graphs

*Graphs* are ubiquitous in Computer Science, and many other fields as
well. They are used to model a variety of data types including social
networks, road networks, deep neural nets, gene interactions,
correlations between observations, and a great many more. The formal
definitions of graphs are below, but if you have not encountered them
before then I urge you to read up on them in one of the sources linked
above. Graphs come in two basic flavors: *undirected* and
*directed*. (It is possible, and sometimes useful, to think of an undirected
graph as simply a directed graph with the special property that for
every pair \(u,v\) either both the edges \(\overrightarrow{u v}\) and
\(\overleftarrow{u v}\) are present or neither of them is. However, in
many settings there is a significant difference between undirected
and directed graphs, and so it’s typically best to think of them as
separate categories.)

An *undirected graph* \(G = (V,E)\) consists of a set \(V\) of *vertices*
and a set \(E\) of edges. Every edge is a size two subset of \(V\). We say
that two vertices \(u,v \in V\) are *neighbors*, denoted by \(u \sim v\), if
the edge \(\{u,v\}\) is in \(E\).

Given this definition, we can define several other properties of graphs
and their vertices. We define the *degree* of \(u\) to be the number of
neighbors \(u\) has. A *path* in the graph is a tuple
\((u_0,\ldots,u_k) \in V^{k+1}\), for some \(k>0\), such that \(u_{i+1}\) is a
neighbor of \(u_i\) for every \(i\in [k]\). A *simple path* is a path
\((u_0,\ldots,u_{k-1})\) where all the \(u_i\)’s are distinct. A *cycle* is
a path \((u_0,\ldots,u_k)\) where \(u_0=u_{k}\). We say that two vertices
\(u,v\in V\) are *connected* if either \(u=v\) or there is a path
\((u_0,\ldots,u_k)\) where \(u_0=u\) and \(u_k=v\). We say that the graph \(G\)
is *connected* if every pair of vertices in it is connected.
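To make these definitions concrete, here is a sketch that stores each edge as a two-element `frozenset` (matching "every edge is a size two subset of \(V\)") and tests connectivity by breadth-first search. BFS is our choice of method, not something the text prescribes, and the function names and the example graph are hypothetical.

```python
from collections import deque


def connected(u, v, E):
    """Return True if u = v or some path leads from u to v."""
    seen = {u}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        for edge in E:
            if x in edge:
                for y in edge:
                    if y not in seen:
                        seen.add(y)
                        queue.append(y)
    return False


def graph_is_connected(V, E):
    """G is connected if every pair of vertices is connected."""
    return all(connected(u, v, E) for u in V for v in V)


V = {0, 1, 2, 3, 4}
E = {frozenset(e) for e in [(0, 1), (0, 2), (1, 2), (2, 3)]}
```

In this example vertex `4` has no edges, so it is connected only to itself and the graph as a whole is not connected.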

Here are some basic facts about undirected graphs. We give some informal arguments below, but leave the full proofs as exercises. (The proofs can also be found in most basic texts on graph theory.)

In any undirected graph \(G=(V,E)\), the sum of the degrees of all vertices is equal to twice the number of edges.

This lemma can be shown by seeing that every edge \(\{ u,v\}\) contributes twice to the sum of the degrees (once for \(u\) and the second time for \(v\)).
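The degree-sum fact is easy to sanity-check on a small example. The sketch below again stores edges as two-element `frozenset`s; the example graph and the helper names are ours, and a check on one graph is of course not a proof.

```python
V = {0, 1, 2, 3, 4}
E = {frozenset(e) for e in [(0, 1), (0, 2), (1, 2), (2, 3)]}


def neighbors(u, E):
    """All v such that the edge {u, v} is in E."""
    return {v for edge in E if u in edge for v in edge if v != u}


def degree(u, E):
    return len(neighbors(u, E))


degree_sum = sum(degree(u, E) for u in V)
twice_the_edges = 2 * len(E)
```

Here the degrees are 2, 2, 3, 1 and 0, which sum to 8, twice the number of edges.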

The connectivity relation is *transitive*, in the sense that if \(u\) is
connected to \(v\), and \(v\) is connected to \(w\), then \(u\) is connected to
\(w\).

This lemma can be shown by simply attaching a path of the form \((u,u_1,u_2,\ldots,u_{k-1},v)\) to a path of the form \((v,u'_1,\ldots,u'_{k'-1},w)\) to obtain the path \((u,u_1,\ldots,u_{k-1},v,u'_1,\ldots,u'_{k'-1},w)\) that connects \(u\) to \(w\).

For every undirected graph \(G=(V,E)\) and connected pair \(u,v\), the shortest path from \(u\) to \(v\) is simple. In particular, for every connected pair there exists a simple path that connects them.

This lemma can be shown by “shortcutting” any non-simple path of the form \((u,u_1,\ldots,u_{i-1},w,u_{i+1},\ldots,u_{j-1},w,u_{j+1},\ldots,u_{k-1},v)\) where the same vertex \(w\) appears in both the \(i\)-th and \(j\)-th positions, to obtain the shorter path \((u,u_1,\ldots,u_{i-1},w,u_{j+1},\ldots,u_{k-1},v)\).
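The shortcutting step can be applied repeatedly until no vertex repeats. The following is one possible sketch of that idea, representing a path as a Python list of vertices; the function name `shortcut` is ours.

```python
def shortcut(path):
    """Remove detours from a path: whenever a vertex w reappears,
    cut out everything between its first and its later occurrence."""
    simple = []      # the path built so far, with no repeated vertex
    position = {}    # position[w] = index of w in `simple`
    for w in path:
        if w in position:
            cut = position[w]
            for dropped in simple[cut + 1:]:
                del position[dropped]
            simple = simple[:cut + 1]
        else:
            position[w] = len(simple)
            simple.append(w)
    return simple


non_simple = ["u", "a", "w", "b", "c", "w", "d", "v"]
simple = shortcut(non_simple)
```

Each consecutive pair in the output is also a consecutive pair somewhere in the input, so the result is still a path in the same graph, now with all vertices distinct.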

If you haven’t seen these proofs before, it is indeed a great exercise to transform the above informal arguments into fully rigorous proofs.

A *directed graph* \(G=(V,E)\) consists of a set \(V\) and a set
\(E \subseteq V\times V\) of *ordered pairs* of \(V\). We denote the edge
\((u,v)\) also as \(\overrightarrow{u v}\). If the edge
\(\overrightarrow{u v}\) is present in the graph then we say that \(v\) is
an *out-neighbor* of \(u\) and \(u\) is an *in-neighbor* of \(v\).

A directed graph might contain both \(\overrightarrow{u v}\) and
\(\overrightarrow{v u}\) in which case \(u\) will be both an in-neighbor and
an out-neighbor of \(v\) and vice versa. The *in-degree* of \(u\) is the
number of in-neighbors it has, and the *out-degree* of \(v\) is the number
of out-neighbors it has. A *path* in the graph is a tuple
\((u_0,\ldots,u_k) \in V^{k+1}\), for some \(k>0\), such that \(u_{i+1}\) is an
out-neighbor of \(u_i\) for every \(i\in [k]\). As in the undirected case, a
*simple path* is a path \((u_0,\ldots,u_{k-1})\) where all the \(u_i\)’s are
distinct and a *cycle* is a path \((u_0,\ldots,u_k)\) where \(u_0=u_{k}\).
One type of directed graphs we often care about is *directed acyclic
graphs* or *DAGs*, which, as their name implies, are directed graphs
without any cycles.

The lemmas we mentioned above have analogs for directed graphs. We again leave the proofs (which are essentially identical to their undirected analogs) as exercises for the reader:

In any directed graph \(G=(V,E)\), the sum of the in-degrees is equal to the sum of the out-degrees, which is equal to the number of edges.

In any directed graph \(G\), if there is a path from \(u\) to \(v\) and a path from \(v\) to \(w\), then there is a path from \(u\) to \(w\).

For every directed graph \(G=(V,E)\) and a pair \(u,v\) such that there is a
path from \(u\) to \(v\), the *shortest path* from \(u\) to \(v\) is simple.
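For directed graphs, a set of ordered pairs is a natural representation. The sketch below checks the in-degree/out-degree fact on one small example and tests whether a graph is a DAG by repeatedly deleting vertices with no incoming edges; that particular test is our choice of method, and all names and the example graph are hypothetical.

```python
V = {"a", "b", "c", "d"}
E = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "a")}

out_degree = {u: sum(1 for (x, _) in E if x == u) for u in V}
in_degree = {v: sum(1 for (_, y) in E if y == v) for v in V}


def is_dag(V, E):
    """Return True if the directed graph (V, E) has no cycle."""
    remaining = set(V)
    edges = set(E)
    while remaining:
        # Vertices with no incoming edge cannot lie on a cycle.
        sources = {v for v in remaining if not any(y == v for (_, y) in edges)}
        if not sources:
            return False   # every remaining vertex has an in-neighbor
        remaining -= sources
        edges = {(x, y) for (x, y) in edges if x not in sources}
    return True
```

In the example, the edge from `c` back to `a` closes a cycle; removing it leaves a DAG.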

The word *graph* in the sense above was coined by the mathematician
Sylvester in 1878 in analogy with the chemical graphs used to visualize
molecules. There is an unfortunate confusion with the more common usage
of the term as a way to plot data, and in particular a plot of some
function \(f(x)\) as a function of \(x\). We can merge these two meanings by
thinking of a function \(f:A \rightarrow B\) as a special case of a
directed graph over the vertex set \(V= A \cup B\) where we put the edge
\(\overrightarrow{x f(x)}\) for every \(x\in A\). In a graph constructed in
this way every vertex in \(A\) has out-degree one.

The following lecture of Berkeley CS70 provides an excellent overview of graph theory.

### Logic operators and quantifiers

If \(P\) and \(Q\) are some statements that can be true or false, then \(P\)
AND \(Q\) (denoted as \(P \wedge Q\)) is the statement that is true if and
only if both \(P\) *and* \(Q\) are true, and \(P\) OR \(Q\) (denoted as
\(P \vee Q\)) is the statement that is true if and only if either \(P\) *or*
\(Q\) is true. The *negation* of \(P\), denoted as \(\neg P\) or
\(\overline{P}\), is the statement that is true if and only if \(P\) is
false.

Suppose that \(P(x)\) is a statement that depends on some *parameter* \(x\)
(also sometimes known as an *unbound* variable) in the sense that for
every instantiation of \(x\) with a value from some set \(S\), \(P(x)\) is
either true or false. For example, \(x>7\) is a statement that is not a
priori true or false, but does become true or false whenever we
instantiate \(x\) with some real number. In such case we denote by
\(\forall_{x\in S} P(x)\) the statement that is true if and only if \(P(x)\)
is true *for every* \(x\in S\). (In these notes we will place the variable that is bound by a
quantifier in a subscript and so write \(\forall_{x\in S}P(x)\)
whereas other texts might use \(\forall x\in S. P(x)\).) We denote by \(\exists_{x\in S} P(x)\)
the statement that is true if and only if *there exists* some \(x\in S\)
such that \(P(x)\) is true.

For example, the following is a formalization of the true statement that there exists a natural number \(n\) larger than \(100\) that is not divisible by \(3\):

\[ \exists_{n\in \N} (n>100) \wedge \left(\forall_{k\in \N} k+k+k \neq n\right) \;. \]

**“For sufficiently large \(n\)”:** One expression which comes up time and
again is the claim that some statement \(P(n)\) is true “for sufficiently
large \(n\)”. What this means is that there exists an integer \(N_0\) such
that \(P(n)\) is true for every \(n>N_0\). We can formalize this as
\(\exists_{N_0\in \N} \forall_{n>N_0} P(n)\).
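Over *finite* ranges, the quantifiers \(\forall\) and \(\exists\) correspond to Python's `all` and `any`. The sketch below uses them to explore the two examples above. The cut-offs `200` and `10_000`, and the sample statement \(P(n)\) saying \(n^2 > 10n\), are our own assumptions; a finite search can exhibit a witness for \(\exists\) but can never establish a \(\forall\) over all of \(\N\).

```python
# ∃ n ∈ N: (n > 100) ∧ (∀ k ∈ N: k+k+k ≠ n), searching only n < 200.
# Restricting k to k <= n loses nothing, since k+k+k = n forces k <= n.
witnesses = [n for n in range(200)
             if n > 100 and all(k + k + k != n for k in range(n + 1))]
statement_holds = len(witnesses) > 0


# "P(n) holds for sufficiently large n", with P(n): n*n > 10*n.
def P(n):
    return n * n > 10 * n


N_0 = 10
holds_past_threshold = all(P(n) for n in range(N_0 + 1, 10_000))
fails_at_threshold = not P(N_0)
```

The smallest witness found is `101`, and \(N_0 = 10\) is a valid threshold for this \(P\) because \(n^2 > 10n\) exactly when \(n > 10\).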

### Quantifiers for summations and products

The following shorthands for summing up or taking products of several numbers are often convenient. If \(S = \{s_0,\ldots,s_{n-1} \}\) is a finite set and \(f:S \rightarrow \R\) is a function, then we write \(\sum_{x\in S} f(x)\) as shorthand for

\[ f(s_0) + f(s_1) + f(s_2) + \ldots + f(s_{n-1}) \;, \]

and \(\prod_{x\in S} f(x)\) as shorthand for

\[ f(s_0) \cdot f(s_1) \cdot f(s_2) \cdot \ldots \cdot f(s_{n-1}) \;. \]

For example, the sum of the squares of all numbers from \(1\) to \(100\) can be written as

\[ \sum_{i\in \{1,\ldots,100\}} i^2 \;. \label{eqsumsquarehundred} \]

Since summing up over intervals of integers is so common, there is a special notation for it, and for every two integers \(a \leq b\), \(\sum_{i=a}^b f(i)\) denotes \(\sum_{i\in S} f(i)\) where \(S =\{ x\in \Z : a \leq x \leq b \}\). Hence we can write the sum \eqref{eqsumsquarehundred} as

\[ \sum_{i=1}^{100} i^2 \;. \]
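The \(\sum\) and \(\prod\) shorthands correspond to Python's built-in `sum` and to `math.prod`. A brief sketch follows (variable names ours); note that `range(a, b + 1)` is what matches the inclusive bounds of \(\sum_{i=a}^{b}\).

```python
import math

# Sum over a set: Σ_{i ∈ {1,...,100}} i^2.
sum_over_set = sum(i ** 2 for i in set(range(1, 101)))

# The interval form Σ_{i=1}^{100} i^2 is the same number.
sum_over_interval = sum(i ** 2 for i in range(1, 101))

# A product: Π_{i=1}^{5} i.
product_one_to_five = math.prod(range(1, 6))
```

The value agrees with the closed form \(\tfrac{n(n+1)(2n+1)}{6}\) for \(n=100\), namely \(338350\).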

### Parsing formulas: bound and free variables

In mathematics as in code, we often have symbolic “variables” or
“parameters”. It is important to be able to understand, given some
formula, whether a given variable is *bound* or *free* in this formula.
For example, in the following statement \(n\) is free but \(a,b\) are bound
by the \(\exists\) quantifier:

\[ \exists_{a,b \in \N} (a \neq 1) \wedge (a \neq n) \wedge (n = a \times b) \label{aboutnstmt} \]

Since \(n\) is free, it can be set to any value, and the truth of the statement \eqref{aboutnstmt} depends on the value of \(n\). For example, if \(n=8\) then \eqref{aboutnstmt} is true, but for \(n=11\) it is false. (Can you see why?)

The same issue appears when parsing code. For example, in the following snippet from the C++ programming language

```
for (int i=0 ; i<n ; i=i+1) {
    printf("*");
}
```

the variable `i` is bound to the `for` operator but the variable `n` is free.

The main property of bound variables is that we can change them to a different name (as long as it doesn’t conflict with another used variable) without changing the meaning of the statement. Thus for example the statement

\[ \exists_{x,y \in \N} (x \neq 1) \wedge (x \neq n) \wedge (n = x \times y) \label{aboutnstmttwo} \]

is equivalent to \eqref{aboutnstmt} in the sense that it is true for exactly the same set of \(n\)’s. Similarly, the code

```
for (int j=0 ; j<n ; j=j+1) {
    printf("*");
}
```

produces the same result.
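The two views can be tied together by writing the statement about \(n\) as a function: the free variable becomes the parameter, and the bound variables become loop variables that can be renamed freely. This is a sketch under our own assumptions; in particular, the search bound `max(n, 2) + 1` is ours, chosen because for \(n \geq 1\) any \(a,b\) with \(n = a \times b\) satisfy \(a,b \leq n\).

```python
def statement_ab(n):
    """∃ a,b ∈ N: (a ≠ 1) ∧ (a ≠ n) ∧ (n = a × b); n is free."""
    bound = max(n, 2) + 1
    return any(a != 1 and a != n and n == a * b
               for a in range(bound) for b in range(bound))


def statement_xy(n):
    """The same statement with the bound variables renamed to x, y."""
    bound = max(n, 2) + 1
    return any(x != 1 and x != n and n == x * y
               for x in range(bound) for y in range(bound))


true_at_8 = statement_ab(8)      # witnessed by a = 2, b = 4
false_at_11 = statement_ab(11)   # the only divisors of 11 are 1 and 11
```

Renaming `a, b` to `x, y` changes nothing, whereas changing the argument `n` can change the truth value.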

Mathematical notation has a lot of similarities with programming
languages, and for the same reasons. Both are formalisms meant to convey
complex concepts in a precise way. However, there are some cultural
differences. In programming languages, we often try to use meaningful
variable names such as `NumberOfVertices` while in math we often use
short identifiers such as \(n\). (Part of it might have to do with the
tradition of mathematical proofs as being handwritten and verbally
presented, as opposed to typed up and compiled.) One consequence of that
is that in mathematics we often end up reusing identifiers, and also “run
out” of letters and hence use Greek letters too, as well as distinguish
between small and capital letters. Similarly, mathematical notation
tends to use quite a lot of “overloading”, using operators such as \(+\)
for a great variety of objects (e.g., real numbers, matrices, finite
field elements, etc.), and assuming that the meaning can be inferred
from the context. Both fields have a notion of “types”, and in math we
often try to reserve certain letters for variables of a particular type.
For example, variables such as \(i,j,k,\ell,m,n\) will often denote
integers, and \(\epsilon\) will often denote a small positive real number.
When reading or writing mathematical texts, we usually don’t have the
advantage of a “compiler” that will check type safety for us. Hence it
is important to keep track of the type of each variable, and see that
the operations that are performed on it “make sense”.

### Asymptotics and big-Oh notation

“\(\log\log\log n\) has been proved to go to infinity, but has never been observed to do so.”, Anonymous, quoted by Carl Pomerance (2000)

It is often very cumbersome to describe precisely quantities such as
running time and is also not needed, since we are typically mostly
interested in the “higher order terms”. That is, we want to understand
the *scaling behavior* of the quantity as the input variable grows. For
example, as far as running time goes, the difference between an
\(n^5\)-time algorithm and an \(n^2\)-time one is much more significant than
the difference between a \(100n^2 + 10n\) time algorithm and a \(10n^2\)
time algorithm. For this purpose, Oh notation is extremely useful as a
way to “declutter” our text and focus our attention on what really
matters. For example, using Oh notation, we can say that both
\(100n^2 + 10n\) and \(10n^2\) are simply \(\Theta(n^2)\) (which informally
means “the same up to constant factors”), while \(n^2 = o(n^5)\) (which
informally means that \(n^2\) is “much smaller than” \(n^5\)). (While Big Oh notation is often used to analyze running time of
algorithms, this is by no means the only application. At the end of
the day, Big Oh notation is just a way to express asymptotic
inequalities between functions on integers. It can be used
regardless of whether these functions are a measure of running time,
memory usage, or any other quantity that may have nothing to do with
computation.)

Generally (though still informally), if \(F,G\) are two functions mapping natural numbers to non-negative reals, then “\(F=O(G)\)” means that \(F(n) \leq G(n)\) if we don’t care about constant factors, while “\(F=o(G)\)” means that \(F\) is much smaller than \(G\), in the sense that no matter by what constant factor we multiply \(F\), if we take \(n\) to be large enough then \(G\) will be bigger (for this reason, sometimes \(F=o(G)\) is written as \(F \ll G\)). We will write \(F= \Theta(G)\) if \(F=O(G)\) and \(G=O(F)\), which one can think of as saying that \(F\) is the same as \(G\) if we don’t care about constant factors. More formally, we define Big Oh notation as follows:

For \(F,G: \N \rightarrow \R_+\), we define \(F=O(G)\) if there exist numbers \(a,N_0 \in \N\) such that \(F(n) \leq a\cdot G(n)\) for every \(n>N_0\). We define \(F=\Omega(G)\) if \(G=O(F)\).

We write \(F =o(G)\) if for every \(\epsilon>0\) there is some \(N_0\) such that \(F(n) <\epsilon G(n)\) for every \(n>N_0\). We write \(F =\omega(G)\) if \(G=o(F)\). We write \(F= \Theta(G)\) if \(F=O(G)\) and \(G=O(F)\).
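To get a feel for the definition, here is a minimal Python sketch that tests a proposed pair of witnesses \((a, N_0)\) on a finite range of inputs. The helper name `holds_on_range` and the cutoff `upto` are our own illustrative choices. A finite check like this can expose a bad choice of witnesses, but it can never replace a proof that the inequality holds for *every* \(n > N_0\).

```python
def holds_on_range(F, G, a, N0, upto):
    """Check F(n) <= a * G(n) for every N0 < n <= upto.

    This is only a finite sanity check of the witnesses (a, N0),
    not a proof that F = O(G).
    """
    return all(F(n) <= a * G(n) for n in range(N0 + 1, upto + 1))


F = lambda n: 100 * n**2 + 10 * n
G = lambda n: n**2

# a = 110 works, since 10n <= 10n^2 for every n >= 1.
witness_ok = holds_on_range(F, G, a=110, N0=0, upto=10_000)

# a = 100 fails already at n = 1; this only shows that the
# witness is too small, not that F is not O(G).
too_small = holds_on_range(F, G, a=100, N0=0, upto=10_000)

print(witness_ok, too_small)
```

Note that a failed check says nothing about whether \(F=O(G)\): it only rules out one particular pair \((a,N_0)\).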

We can also use the notion of *limits* to define big and little oh
notation. You can verify that \(F=o(G)\) (or, equivalently, \(G=\omega(F)\))
if and only if
\(\lim\limits_{n\rightarrow\infty} \tfrac{F(n)}{G(n)} = 0\). Similarly, if
the limit \(\lim\limits_{n\rightarrow\infty} \tfrac{F(n)}{G(n)}\) exists
and is a finite number then \(F=O(G)\). If you are familiar with the
notion of *supremum*, then you can verify that \(F=O(G)\) if and only if
\(\limsup\limits_{n\rightarrow\infty} \tfrac{F(n)}{G(n)} < \infty\).
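The limit characterization suggests a quick numerical experiment: tabulate the ratio \(F(n)/G(n)\) for growing \(n\) and see whether it seems to settle near a constant or to shrink towards zero. The sketch below does this for the two examples from the start of this section; the function name `ratios` and the sample points are our own choices, and such a table is suggestive evidence rather than a proof.

```python
def ratios(F, G, ns):
    """Return the list of ratios F(n) / G(n) for the given sample points."""
    return [F(n) / G(n) for n in ns]


ns = [10**k for k in range(1, 7)]          # n = 10, 100, ..., 10^6

# (100n^2 + 10n) / (10n^2) = 10 + 1/n, which approaches the constant 10.
r_theta = ratios(lambda n: 100 * n**2 + 10 * n, lambda n: 10 * n**2, ns)

# n^2 / n^5 = n^(-3), which approaches 0.
r_little = ratios(lambda n: n**2, lambda n: n**5, ns)

for n, rt, rl in zip(ns, r_theta, r_little):
    print(n, rt, rl)
```

The first column of ratios stays bounded (consistent with \(\Theta\)), while the second keeps shrinking (consistent with little-oh).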

Using the equality sign for Oh notation is extremely common, but is
somewhat of a misnomer, since a statement such as \(F = O(G)\) really
means that \(F\) is in the set
\(\{ G' : \exists_{N,c} \text{ s.t. } \forall_{n>N} G'(n) \leq c G(n) \}\).
For this reason, some texts write \(F \in O(G)\) instead of \(F = O(G)\). If
anything, it would have made more sense to use *inequalities* and write
\(F \leq O(G)\) and \(F \geq \Omega(G)\), reserving equality for
\(F = \Theta(G)\), but by now the equality notation is quite firmly
entrenched. Nevertheless, you should remember that a statement such as
\(F = O(G)\) means that \(F\) is “at most” \(G\) in some rough sense when we
ignore constants, and a statement such as \(F = \Omega(G)\) means that \(F\)
is “at least” \(G\) in the same rough sense.

It’s often convenient to use “anonymous functions” in the context of Oh notation, and also to emphasize the input parameter to the function. For example, when we write a statement such as \(F(n) = O(n^3)\), we mean that \(F=O(G)\) where \(G\) is the function defined by \(G(n)=n^3\). Chapter 7 in Jim Aspnes’ notes on discrete math provides a good summary of Oh notation; see also this tutorial for a gentler and more programmer-oriented introduction.

### Some “rules of thumb” for big Oh notation

There are some simple heuristics that can help when trying to compare two functions \(F\) and \(G\):

- Multiplicative constants don’t matter in Oh notation, and so if \(F(n)=O(G(n))\) then \(100F(n)=O(G(n))\).
- When adding two functions, we only care about the larger one. For example, for the purpose of Oh notation, \(n^3+100n^2\) is the same as \(n^3\), and in general in any polynomial, we only care about the term with the largest exponent.
- For every two constants \(a,b>0\), \(n^a = O(n^b)\) if and only if \(a \leq b\), and \(n^a = o(n^b)\) if and only if \(a<b\). For example, combining the two observations above, \(100n^2 + 10n + 100 = o(n^3)\).
- Polynomial is always smaller than exponential: \(n^a = o(2^{n^\epsilon})\) for every two constants \(a>0\) and \(\epsilon>0\) even if \(\epsilon\) is much smaller than \(a\). For example, \(100n^{100} = o(2^{\sqrt{n}})\).
- Similarly, logarithmic is always smaller than polynomial: \((\log n)^a\) (which we write as \(\log^a n\)) is \(o(n^\epsilon)\) for every two constants \(a,\epsilon>0\). For example, combining the observations above, \(100n^2 \log^{100} n = o(n^3)\).
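The last two rules are statements about *large enough* \(n\), and the point where the faster-growing function pulls ahead for good can be surprisingly far out. The following sketch compares \(\log^a n\) with \(n^\epsilon\) on inputs of the form \(n = 2^m\), so that \(\log n = m\) and \(n^\epsilon = 2^{\epsilon m}\). The function name `last_exception`, the search bound `max_m`, and the sample constants \(a=3\), \(\epsilon=1/2\) are our own illustrative assumptions.

```python
def last_exception(a, eps, max_m=200):
    """Largest m <= max_m such that (log n)^a >= n^eps for n = 2^m.

    Returns None if the polynomial n^eps is already larger for
    every m in 1..max_m. Only powers of two are examined.
    """
    bad = [m for m in range(1, max_m + 1) if m**a >= 2 ** (eps * m)]
    return max(bad) if bad else None


# Compare log^3 n with sqrt(n) on n = 2, 4, 8, ..., 2^200.
m_star = last_exception(a=3, eps=0.5)
print(m_star)
```

In the searched range, the last power of two at which \(\log^3 n\) is still at least \(\sqrt{n}\) is \(n=2^{29}\), so here “large enough” means \(n\) of about a billion.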

In most (though not all!) cases where we use Oh notation, the constants hidden by it are not too huge, and so on an intuitive level you can think of \(F=O(G)\) as saying something like \(F(n) \leq 1000 G(n)\) and \(F=\Omega(G)\) as saying something like \(F(n) \geq 0.001 G(n)\).

## Proofs

Many people think of mathematical proofs as a sequence of logical deductions that starts from some axioms and ultimately arrives at a conclusion. In fact, some dictionaries define proofs that way. This is not entirely wrong, but in reality a mathematical proof of a statement X is simply an argument that convinces the reader that X is true beyond a shadow of a doubt. To produce such a proof you need to:

- Understand precisely what X means.
- Convince *yourself* that X is true.
- Write your reasoning down in plain, precise and concise English (using formulas or notation only when they help clarity).

In many cases, Step 1 is the most important one. Understanding what a statement means is often more than halfway towards understanding why it is true. In Step 3, to convince the reader beyond a shadow of a doubt, we will often want to break down the reasoning to “basic steps”, where each basic step is simple enough to be “self evident”. The combination of all steps yields the desired statement.

### Proofs and programs

There is a great deal of similarity between the process of writing
*proofs* and that of writing *programs*, and both require a similar set
of skills. Writing a *program* involves:

- Understanding what is the *task* we want the program to achieve.
- Convincing *yourself* that the task can be achieved by a computer, perhaps by planning on a whiteboard or notepad how you will break it up to simpler tasks.
- Converting this plan into code that a compiler or interpreter can understand, by breaking up each task into a sequence of the basic operations of some programming language.

In programs as in proofs, step 1 is often the most important one. A key
difference is that the reader for proofs is a human being and for
programs is a compiler. Thus our emphasis is on *readability* and
having a *clear logical flow* for the proof (which is not a bad idea for
programs as well). When writing a proof, you should think of your
audience as an intelligent but highly skeptical and somewhat petty
reader, who will “call foul” at every step that is not well justified.

(The difference between the two kinds of readers might be eroding with
time, as more proofs are being written in a *machine verifiable form* and
progress in artificial intelligence allows expressing programs in more
human friendly ways, such as “programming by example”. Interestingly, much
of the progress in automatic proof verification and proof assistants
relies on a much deeper correspondence between *proofs* and *programs*.
We *might* see this correspondence later in this course.)

## Extended example: graph connectivity

To illustrate these ideas, let us consider the following example of a true theorem:

Every connected undirected graph of \(n\) vertices has at least \(n-1\) edges.

We are going to take our time to understand how one would come up with a proof for Reference:graphconthm, and how to write such a proof down. This will not be the shortest way to prove this theorem, but hopefully following this process will give you some general insights on reading, writing, and discovering mathematical proofs.

Before trying to prove Reference:graphconthm, we need to understand what
it means. Let’s start with the terms in the theorem. We defined
undirected graphs and the notion of connectivity in Reference:graphsec
above. In particular, an undirected graph \(G=(V,E)\) is *connected* if
for every pair \(u,v \in V\), there is a path \((u_0,u_1,\ldots,u_k)\) such
that \(u_0=u\), \(u_k=v\), and \(\{ u_i,u_{i+1} \} \in E\) for every
\(i\in [k]\).

It is crucial that at this point you pause and verify that you completely understand the definition of connectivity. Indeed, you should make a habit of pausing after any statement of a theorem, even before looking at the proof, and verifying that you understand all the terms that the theorem refers to.

To prove Reference:graphconthm we need to show that there is no
\(2\)-vertex connected graph with fewer than \(1\) edge, no \(3\)-vertex
connected graph with fewer than \(2\) edges, and so on and so forth. One
of the best ways to prove a theorem is to first try to *disprove it*. By
trying and failing to come up with a counterexample, we often understand
why the theorem cannot be false. For example, if you try to draw a
\(4\)-vertex graph with only two edges, you can see that there are
basically only two choices for such a graph as depicted in
Reference:figurefourvertexgraph, and in both there will remain a vertex
that is not connected.
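For small graphs, the search for a counterexample can be delegated to a computer. The sketch below enumerates every graph on \(n\) vertices with fewer than \(n-1\) edges and tests each for connectivity. The function names and the choice to label vertices \(0,\ldots,n-1\) are our own; the connectivity test uses the fact that an undirected graph is connected if and only if every vertex can be reached from vertex \(0\). Finding no counterexample for small \(n\) is of course evidence, not a proof.

```python
from itertools import combinations


def is_connected(n, edges):
    """Return True if the undirected graph on vertices 0..n-1 (n >= 1) is connected."""
    adj = {v: [] for v in range(n)}
    for s, t in edges:
        adj[s].append(t)
        adj[t].append(s)
    seen, stack = {0}, [0]
    while stack:                      # depth-first search from vertex 0
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n


def find_counterexample(n):
    """Look for a connected graph on n vertices with at most n-2 edges."""
    all_edges = list(combinations(range(n), 2))
    for m in range(max(n - 1, 0)):    # m = 0, 1, ..., n-2 edges
        for edges in combinations(all_edges, m):
            if is_connected(n, edges):
                return edges
    return None


counterexamples = {n: find_counterexample(n) for n in range(2, 7)}
print(counterexamples)
```

Every entry comes back as `None`, which matches the experience of failing to draw a counterexample by hand.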

In fact, we can see that if we have a budget of \(2\) edges and we choose
some vertex \(u\), we will not be able to connect to \(u\) more than two
other vertices, and similarly with a budget of \(3\) edges we will not be
able to connect to \(u\) more than three other vertices. We can keep
trying to draw such examples until we convince ourselves that the
theorem is probably true, at which point we want to see how we can
*prove* it.

If you have not seen the proof of this theorem before (or don’t remember it), this would be an excellent point to pause and try to prove it yourself.

There are several ways to approach this proof, but one version is to
start by proving it for small graphs, such as graphs with 2, 3 or 4
edges, for which we can check all the cases, and then try to extend the
proof to larger graphs. The technical term for this proof approach is
*proof by induction*.

### Mathematical induction

*Induction* is simply an application of the self-evident Modus
Ponens rule that says that
if **(a)** \(P\) is true and **(b)** \(P\) implies \(Q\) then \(Q\) is true. In
the setting of proofs by induction we typically have a statement \(Q(k)\)
that is parameterized by some integer \(k\), and we prove that **(a)**
\(Q(0)\) is true and **(b)** for every \(k>0\), if \(Q(0),\ldots,Q(k-1)\) are
all true then \(Q(k)\) is true. (Usually proving **(b)** is the hard part, though there are
examples where the “base case” **(a)** is quite subtle.) By repeatedly applying Modus Ponens,
we can deduce from **(a)** and **(b)** that \(Q(1)\) is true, and then
from **(a)**,**(b)** and \(Q(1)\) that \(Q(2)\) is true, and so on and so
forth to obtain that \(Q(k)\) is true for every \(k\). The statement **(a)**
is called the “base case”, while **(b)** is called the “inductive step”.
The assumption in **(b)** that \(Q(i)\) holds for \(i<k\) is called the
“inductive hypothesis”.

Proofs by induction are closely related to algorithms by recursion. In both cases we reduce solving a larger problem to solving a smaller instance of itself. In a recursive algorithm to solve some problem P on an input of length \(k\) we ask ourselves “what if someone handed me a way to solve P on instances smaller than \(k\)?”. In an inductive proof to prove a statement Q parameterized by a number \(k\), we ask ourselves “what if I already knew that \(Q(k')\) is true for \(k'<k\)?”. Both induction and recursion are crucial concepts for this course and Computer Science at large (and even other areas of inquiry, including not just mathematics but other sciences as well). Both can be initially (and even post-initially) confusing, but with time and practice they become clearer. For more on proofs by induction and recursion, you might find the following Stanford CS 103 handout, this MIT 6.00 lecture or this excerpt of the Lehman-Leighton book useful.
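As a small illustration of this parallel, consider the Towers of Hanoi puzzle (our own choice of example, not one used elsewhere in this chapter). The recursive function below counts the moves made by the standard strategy for \(k\) disks, and its structure mirrors an inductive proof of the statement \(Q(k)\): “the strategy uses exactly \(2^k-1\) moves”. The base case of the recursion plays the role of \(Q(0)\), and the recursive call plays the role of the inductive hypothesis \(Q(k-1)\).

```python
def hanoi_moves(k):
    """Number of moves the standard recursive strategy uses for k disks."""
    if k == 0:
        # Base case, mirroring Q(0): no disks, no moves.
        return 0
    # Recursive step, mirroring the inductive step: move k-1 disks aside,
    # move the largest disk, then move the k-1 disks back on top of it.
    return 2 * hanoi_moves(k - 1) + 1


# Compare with the closed form 2^k - 1 that the induction establishes.
checks = [hanoi_moves(k) == 2**k - 1 for k in range(20)]
print(all(checks))
```

The inductive step of the corresponding proof is the one-line calculation \(2(2^{k-1}-1)+1 = 2^k-1\).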

### Proving the theorem by induction

There are several ways to use induction to prove Reference:graphconthm. We will do so by following our intuition above that with a budget of \(k\) edges, we cannot connect to a vertex more than \(k\) other vertices. That is, we will define the statement \(Q(k)\) as follows:

\(Q(k)\) is

“For every graph \(G=(V,E)\) with at most \(k\) edges and every \(u\in V\), the number of vertices that are connected to \(u\) (including \(u\) itself) is at most \(k+1\)”

Note that \(Q(n-2)\) implies our theorem, since it means that in an \(n\)
vertex graph of \(n-2\) edges, there would be at most \(n-1\) vertices that
are connected to \(u\), and hence in particular there would be *some*
vertex that is not connected to \(u\). More formally, if we define, given
any undirected graph \(G\) and vertex \(u\) of \(G\), the set \(C_G(u)\) to
contain all vertices connected to \(u\), then the statement \(Q(k)\) is that
for every undirected graph \(G=(V,E)\) with \(|E| \leq k\) and \(u\in V\),
\(|C_G(u)| \leq k+1\).

To prove that \(Q(k)\) is true for every \(k\) by induction, we will first
prove that **(a)** \(Q(0)\) is true, and then prove **(b)** if
\(Q(0),\ldots,Q(k-1)\) are true then \(Q(k)\) is true as well. In fact, we
will prove the stronger statement **(b’)** that if \(Q(k-1)\) is true then
\(Q(k)\) is true as well. (**(b’)** is a stronger statement than **(b)**
because it has the same conclusion with a weaker assumption.) Thus, if we
show both **(a)** and **(b’)** then we complete the proof of
Reference:graphconthm.

Proving **(a)** (i.e., the “base case”) is actually quite easy. The
statement \(Q(0)\) says that if \(G\) has zero edges, then \(|C_G(u)|=1\), but
this is clear because in a graph with zero edges, \(u\) is only connected
to itself. The heart of the proof, as is typical with induction proofs,
is in proving a statement such as **(b’)** (or even the weaker statement
**(b)**). Since we are trying to prove an *implication*, we can *assume*
the so-called “inductive hypothesis” that \(Q(k-1)\) is true and need to
prove from this assumption that \(Q(k)\) is true. So, suppose that
\(G=(V,E)\) is a graph of \(k\) edges, and \(u\in V\). Since we can use
induction, a natural approach would be to remove an edge \(e\in E\) from
the graph to create a new graph \(G'\) of \(k-1\) edges. We can use the
induction hypothesis to argue that \(|C_{G'}(u)| \leq k\). Now if we could
only argue that removing the edge \(e\) reduced the connected component of
\(u\) by at most a single vertex, then we would be done, as we could argue
that \(|C_G(u)| \leq |C_{G'}(u)|+1 \leq k+1\).

Please ensure that you understand why showing that \(|C_G(u)| \leq |C_{G'}(u)|+1\) completes the inductive proof.

Alas, this might not be the case. It could be that removing a single edge \(e\) will greatly reduce the size of \(C_{G}(u)\). For example, that edge might be a “bridge” between two large connected components; such a situation is illustrated in Reference:effectofoneedgefig. This might seem like a real stumbling block, and at this point we might go back to the drawing board to see if perhaps the theorem is false after all. However, if we look at various concrete examples, we see that in any concrete example, there is always a “good” choice of an edge, adding which will increase the component connected to \(u\) by at most one vertex.

The crucial observation is that this always holds if we choose an edge \(e = \{ s, w\}\) where \(w \in C_G(u)\) has degree one in the graph \(G\), see Reference:addingdegreeonefig. The reason is simple. Since every path from \(u\) to \(w\) must pass through \(s\) (which is \(w\)’s only neighbor), removing the edge \(\{ s,w \}\) merely has the effect of disconnecting \(w\) from \(u\), and hence \(C_{G'}(u) = C_G(u) \setminus \{ w \}\) and in particular \(|C_{G'}(u)|=|C_G(u)|-1\), which is exactly the condition we needed.

Now the question is whether there will always be a degree one vertex in \(C_G(u) \setminus \{u \}\). Of course generally we are not guaranteed that a graph would have a degree one vertex, but we are not dealing with a general graph here but rather a graph with a small number of edges. We can assume that \(|C_G(u)| > k+1\) (otherwise we’re done) and each vertex in \(C_G(u)\) must have degree at least one (as otherwise it would not be connected to \(u\)). Thus, the only case where there is no vertex \(w\in C_G(u) \setminus \{u\}\) of degree one is when the degrees of all vertices in \(C_G(u) \setminus \{u\}\) are at least \(2\). But since there are at least \(k+1\) such vertices, by Reference:degreesegeslem the number of edges in the graph is then at least \(\tfrac{1}{2}\cdot 2 \cdot (k+1)>k\), which contradicts our assumption that the graph \(G\) has at most \(k\) edges. Thus we can conclude that either \(|C_G(u)| \leq k+1\) (in which case we’re done) or there is a degree one vertex \(w\neq u\) that is connected to \(u\). By removing the single edge \(e\) that touches \(w\), we obtain a \(k-1\) edge graph \(G'\) which (by the inductive hypothesis) satisfies \(|C_{G'}(u)| \leq k\), and hence \(|C_G(u)|=|C_{G'}(u) \cup \{ w \}| \leq k+1\). This suffices to complete an inductive proof of statement \(Q(k)\).
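Before writing the argument down formally, it can be reassuring to test the statement \(Q(k)\) empirically. The sketch below computes \(C_G(u)\) by a graph search and checks the bound \(|C_G(u)| \leq |E|+1\) on randomly generated graphs; it also checks, on one tiny example, the observation that removing the edge touching a degree one vertex \(w\) removes exactly \(w\) from the component. The function names, the vertex labels \(0,\ldots,n-1\), the random seed, and the graph sizes are all our own illustrative choices.

```python
import random


def component(n, edges, u):
    """Return C_G(u): the set of vertices connected to u in the graph on 0..n-1."""
    adj = {v: set() for v in range(n)}
    for s, t in edges:
        adj[s].add(t)
        adj[t].add(s)
    seen, stack = {u}, [u]
    while stack:
        v = stack.pop()
        for w in adj[v] - seen:       # neighbors of v not yet visited
            seen.add(w)
            stack.append(w)
    return seen


def lemma_holds(n, edges, u):
    """Check |C_G(u)| <= |E| + 1 for one graph and one vertex."""
    return len(component(n, edges, u)) <= len(edges) + 1


rng = random.Random(0)                # fixed seed so runs are repeatable
trials = []
for _ in range(500):
    n = rng.randint(1, 8)
    possible = [(s, t) for s in range(n) for t in range(s + 1, n)]
    edges = rng.sample(possible, rng.randint(0, len(possible)))
    trials.append(lemma_holds(n, edges, rng.randrange(n)))

# Path 0 - 1 - 2: vertex 2 has degree one; removing its edge removes only vertex 2.
before = component(3, [(0, 1), (1, 2)], 0)
after = component(3, [(0, 1)], 0)
shrinks_by_one = (after == before - {2})

print(all(trials), shrinks_by_one)
```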

### Writing down the proof

All of the above was a discussion of how we *discover* the proof, and
convince *ourselves* that the statement is true. However, once we do
that, we still need to write it down. When writing the proof, we use the
benefit of hindsight, and try to streamline what was a messy journey
into a linear and easy-to-follow flow of logic that starts with the word
**“Proof:”** and ends with **“QED”** or the symbol \(\blacksquare\). (QED stands for “quod erat demonstrandum”, which is Latin for “what was to
be demonstrated” or “the very thing it was required to have shown”.)
All our discussions, examples and digressions can be very insightful,
but we keep them outside the space delimited between these two words,
where (as described by this excellent
handout)
“every sentence must be load bearing”. Just like we do in programming,
we can break the proof into little “subroutines” or “functions” (known
as *lemmas* or *claims* in math language), which will be smaller
statements that help us prove the main result. However, it should always
be crystal-clear to the reader at what stage of the proof we are. Just
like it should always be clear to which function a line of code
belongs, it should always be clear whether an individual sentence is part of
a proof of some intermediate result, or is part of the argument showing
that this intermediate result implies the theorem. Sometimes we
highlight this partition by noting after each occurrence of **“QED”** to
which lemma or claim it belongs.

Let us see how the proof of Reference:graphconthm looks in this streamlined fashion. We start by repeating the theorem statement

Every connected undirected graph of \(n\) vertices has at least \(n-1\) edges.

The proof will follow from the following lemma:

For every \(k\in \N\), undirected graph \(G=(V,E)\) of at most \(k\) edges, and \(u\in V\), the number of vertices connected to \(u\) in \(G\) is at most \(k+1\).

We start by showing that Reference:graphcontlem implies the theorem:

Proof of Reference:graphconthmpf from Reference:graphcontlem: We will show that for every undirected graph \(G=(V,E)\) of \(n\) vertices and at most \(n-2\) edges, there is a pair \(u,v\) of vertices that are disconnected in \(G\). Let \(G\) be such a graph and \(u\) be some vertex of \(G\). By Reference:graphcontlem, the number of vertices connected to \(u\) is at most \(n-1\), and hence (since \(|V|=n\)) there is a vertex \(v\in V\) that is not connected to \(u\), thus completing the proof. QED (Proof of Reference:graphconthmpf from Reference:graphcontlem)

We now turn to proving Reference:graphcontlem. Let \(G=(V,E)\) be an undirected graph of \(k\) edges and \(u\in V\). We define \(C_G(u)\) to be the set of vertices connected to \(u\). To complete the proof of Reference:graphcontlem, we need to prove that \(|C_G(u)| \leq k+1\). We will do so by induction on \(k\).

The *base* case \(k=0\) holds because in a graph with zero edges, \(u\)
is only connected to itself.

Now suppose that Reference:graphcontlem is true for \(k-1\) and we will
prove it for \(k\). Let \(G=(V,E)\) and \(u\in V\) be as above, where \(|E|=k\),
and suppose (towards a contradiction) that \(|C_G(u)| \geq k+2\). Let
\(S = C_G(u) \setminus \{u \}\). Denote by \(deg(v)\) the degree of any
vertex \(v\). By Reference:degreesegeslem,
\(\sum_{v\in S} deg(v) \leq \sum_{v\in V} deg(v) = 2|E|=2k\). Hence in
particular, under our assumption that \(|S|+1=|C_G(u)| \geq k+2\), we get
that \(\tfrac{1}{|S|}\sum_{v\in S} deg(v) \leq 2k/(k+1)< 2\). In other
words, the *average* degree of a vertex in \(S\) is smaller than \(2\), and
hence in particular there is *some* vertex \(w\in S\) with degree smaller
than \(2\). Since \(w\) is connected to \(u\), it must have degree at least
one, and hence (since \(w\)’s degree is smaller than two) degree *exactly*
one. In other words, \(w\) has a single neighbor which we denote by \(s\).

Let \(G'\) be the graph obtained by removing the edge \(\{ s, w\}\) from \(G\). Since \(G'\) has at most \(k-1\) edges, by the inductive hypothesis we can assume that \(|C_{G'}(u)| \leq k\). The proof of the lemma is concluded by showing the following claim:

Claim: Under the above assumptions, \(|C_G(u)| \leq |C_{G'}(u)|+1\).

Proof of claim: The claim says that \(C_{G'}(u)\) has at most one fewer element than \(C_G(u)\). Thus it follows from the following statement \((*)\): \(C_{G'}(u) \supseteq C_G(u) \setminus \{ w \}\). To prove \((*)\) we need to show that for every \(v \neq w\) that is connected to \(u\), \(v \in C_{G'}(u)\). Indeed for every such \(v\), Reference:simplepathlem implies that there must be some *simple path* \((t_0,t_1,\ldots,t_{i-1},t_i)\) in the graph \(G\) where \(t_0=u\) and \(t_i=v\). But \(w\) cannot belong to this path, since \(w\) is different from the endpoints \(u\) and \(v\) of the path and can’t equal one of the intermediate points either, since it has degree one and that would make the path not simple. More formally, if \(w=t_j\) for \(0 < j < i\), then since \(w\) has only a single neighbor \(s\), it would have to hold that \(w\)’s neighbor \(s\) satisfies \(s=t_{j-1}=t_{j+1}\), contradicting the simplicity of the path. Hence the path from \(u\) to \(v\) is also a path in the graph \(G'\), which means that \(v \in C_{G'}(u)\), which is what we wanted to prove. QED (claim)

The claim implies Reference:graphcontlem since by the inductive
assumption, \(|C_{G'}(u)| \leq k\), and hence by the claim
\(|C_G(u)| \leq k+1\), which is what we wanted to prove. This concludes
the proof of Reference:graphcontlem and hence also of
Reference:graphconthmpf. **QED (Reference:graphcontlem)**, **QED
(Reference:graphconthmpf)**

The proof above used the observation that if the *average* of some \(n\)
numbers \(x_0,\ldots,x_{n-1}\) is at most \(X\), then there must *exist* at
least a single number \(x_i \leq X\). (In this particular proof, the
numbers were the degrees of vertices in \(S\).) This is known as the
*averaging principle*, and despite its simplicity, it is often extremely
useful.
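The averaging principle is easy to turn into code: scanning a list for an element that is at most the average always succeeds. The sketch below is a minimal illustration under our own assumptions (integer inputs, so that the average can be computed exactly with `Fraction`; the function name is also ours).

```python
from fractions import Fraction


def witness_at_most_average(xs):
    """Return an index i with xs[i] <= average(xs).

    xs must be a non-empty list of integers. The search never fails,
    because the minimum of xs is always at most the average.
    """
    avg = Fraction(sum(xs), len(xs))   # exact average, no rounding
    return next(i for i, x in enumerate(xs) if x <= avg)


# Degrees of five vertices with total degree 8: the average is 8/5 < 2,
# so some vertex must have degree smaller than 2.
degrees = [3, 1, 2, 1, 1]
i = witness_at_most_average(degrees)
print(i, degrees[i])
```

In the proof above, this is the step that produced the vertex \(w\) of degree smaller than \(2\).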

Reading a proof is no less of an important skill than producing one. In
fact, just like understanding code, it is a highly non-trivial skill in
itself. Therefore I strongly suggest that you re-read the above proof,
asking yourself at every sentence whether the assumptions it makes are
justified, and whether this sentence truly demonstrates what it purports
to achieve. Another good habit is to ask yourself when reading a proof
for every variable you encounter (such as \(u\), \(t_i\), \(G'\), etc. in the
above proof) the following questions: **(1)** What *type* of variable is
it? Is it a number? A graph? A vertex? A function? **(2)** What do
we know about it? Is it an arbitrary member of the set? Have we shown
some facts about it? and **(3)** What are we *trying* to show about
it?

## Proof writing style

A mathematical proof is a piece of writing, but it is a specific genre of writing with certain conventions and preferred styles. As in any writing, practice makes perfect, and it is also important to revise your drafts for clarity.

In a proof for the statement \(X\), all the text between the words
**“Proof:”** and **“QED”** should be focused on establishing that \(X\) is
true. Digressions, examples, or ruminations should be kept outside these
two words, so they do not confuse the reader. The proof should have a
clear logical flow in the sense that every sentence or equation in it
should have some purpose and it should be crystal-clear to the reader
what this purpose is. When you write a proof, for every equation or
sentence you include, ask yourself:

- Is this sentence or equation stating that some statement is true?
- If so, does this statement follow from the previous steps, or are we going to establish it in the next step?
- What is the *role* of this sentence or equation? Is it one step towards proving the original statement, or is it a step towards proving some intermediate claim that you have stated before?
- Finally, would the answers to questions 1-3 be clear to the reader? If not, then you should reorder, rephrase or add explanations.

Some helpful resources on mathematical writing include this handout by Lee, this handout by Hutchings, as well as several of the excellent handouts in Stanford’s CS 103 class.

### Patterns in proofs

Just like in programming, there are several common patterns of proofs that occur time and again. Here are some examples:

**Proofs by contradiction:** One way to prove that \(X\) is true is to
show that if \(X\) was false then we would get a contradiction as a
result. Such proofs often start with a sentence such as “Suppose,
towards a contradiction, that \(X\) is false” and end with deriving some
contradiction (such as a violation of one of the assumptions in the
theorem statement). Here is an example:

There are no natural numbers \(a,b\) such that \(\sqrt{2} = \tfrac{a}{b}\).

Suppose, towards a contradiction, that this is false, and so
let \(a\in \N\) be the smallest number such that there exists some
\(b\in\N\) satisfying \(\sqrt{2}=\tfrac{a}{b}\). Squaring this equation we
get that \(2=a^2/b^2\) or \(a^2=2b^2\) \((*)\). But this means that \(a^2\) is
*even*, and since the product of two odd numbers is odd, it means that
\(a\) is even as well, or in other words, \(a = 2a'\) for some \(a' \in \N\).
Yet plugging this into \((*)\) shows that \(4a'^2 = 2b^2\) which means
\(b^2 = 2a'^2\) is an even number as well. By the same considerations as
above we get that \(b\) is even and hence \(a/2\) and \(b/2\) are two natural
numbers satisfying \(\tfrac{a/2}{b/2}=\sqrt{2}\), contradicting the
minimality of \(a\).
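In the spirit of first trying to disprove a statement, one can search for a counterexample by machine. The sketch below looks for positive integers \(a,b\) with \(a^2 = 2b^2\) (which is equation \((*)\) above) for all \(b\) up to a bound, using exact integer arithmetic. The function name and the bound are our own choices, and an empty search result is only consistent with the theorem; the proof above is what rules out *all* pairs.

```python
from math import isqrt


def rational_sqrt2_candidates(limit):
    """All pairs (a, b) with 1 <= b <= limit and a*a == 2*b*b."""
    hits = []
    for b in range(1, limit + 1):
        # isqrt gives the floor of the square root, so it is the only
        # integer a that could possibly satisfy a*a == 2*b*b.
        a = isqrt(2 * b * b)
        if a * a == 2 * b * b:
            hits.append((a, b))
    return hits


hits = rational_sqrt2_candidates(10_000)

# The step "a^2 even implies a even" rests on odd * odd being odd:
odd_squares_are_odd = all((a * a) % 2 == 1 for a in range(1, 100, 2))

print(hits, odd_squares_are_odd)
```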

**Proofs of a universal statement:** Often we want to prove a statement
\(X\) of the form “Every object of type \(O\) has property \(P\).” Such proofs
often start with a sentence such as “Let \(o\) be an object of type \(O\)”
and end by showing that \(o\) has the property \(P\). Here is a simple
example:

For every natural number \(n\in \N\), either \(n\) or \(n+1\) is even.

Let \(n\in \N\) be some number. If \(n/2\) is a whole number then we are done, since then \(n=2(n/2)\) and hence it is even. Otherwise, \(n/2+1/2\) is a whole number, and hence \(2(n/2+1/2)=n+1\) is even.

**Proofs of an implication:** Another common case is that the statement
\(X\) has the form “\(A\) implies \(B\)”. Such proofs often start with a
sentence such as “Assume that \(A\) is true” and end with a derivation of
\(B\) from \(A\). Here is a simple example:

If \(b^2 \geq 4ac\) then there is a solution to the quadratic equation \(ax^2 + bx + c =0\).

Suppose that \(b^2 \geq 4ac\) (and that \(a \neq 0\), as is implicit in calling the equation quadratic). Then \(d = b^2 - 4ac\) is a non-negative number and hence it has a square root \(s\). Thus \(x = (-b+s)/(2a)\) satisfies \[ ax^2 + bx + c = a(-b+s)^2/(4a^2) + b(-b+s)/(2a) + c = (b^2-2bs+s^2)/(4a)+(-b^2+bs)/(2a)+c \;. \label{eq:quadeq} \] Rearranging the terms of \eqref{eq:quadeq} we get \[ s^2/(4a)+c- b^2/(4a) = (b^2-4ac)/(4a) + c - b^2/(4a) = 0 \;, \] which shows that \(x\) is a solution of the equation.
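The algebra in this proof can be spot-checked by plugging in numbers. The sketch below computes \(x = (-b+s)/(2a)\) exactly and evaluates \(ax^2+bx+c\) at it. To keep the arithmetic exact, it makes the simplifying assumption (ours, not the proof’s) that \(a,b,c\) are integers and that \(d=b^2-4ac\) is a perfect square; the function names are also our own.

```python
from fractions import Fraction
from math import isqrt


def quadratic_root(a, b, c):
    """Return x = (-b + s) / (2a), where s*s == b*b - 4*a*c.

    Illustrative restriction: integer a, b, c with a != 0 and a
    perfect-square discriminant, so that all arithmetic is exact.
    """
    d = b * b - 4 * a * c
    if a == 0 or d < 0:
        raise ValueError("need a != 0 and b^2 >= 4ac")
    s = isqrt(d)
    if s * s != d:
        raise ValueError("this sketch only handles perfect-square discriminants")
    return Fraction(-b + s, 2 * a)


def evaluate(a, b, c, x):
    """Evaluate a*x^2 + b*x + c."""
    return a * x * x + b * x + c


for a, b, c in [(1, -3, 2), (2, 5, -3), (1, 2, 1)]:
    x = quadratic_root(a, b, c)
    print((a, b, c), x, evaluate(a, b, c, x))
```

Each line printed ends in `0`, as the proof predicts.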

**Proofs of equivalence:** If a statement has the form “\(A\) if and only
if \(B\)” (often shortened as “\(A\) iff \(B\)”) then we need to prove both
that \(A\) implies \(B\) and that \(B\) implies \(A\). We call the implication
that \(A\) implies \(B\) the “only if” direction, and the implication that
\(B\) implies \(A\) the “if” direction.

**Proofs by combining intermediate claims:** When a proof is more
complex, it is often helpful to break it apart into several steps. That
is, to prove the statement \(X\), we might first prove statements
\(X_1\),\(X_2\), and \(X_3\) and then prove that \(X_1 \wedge X_2 \wedge X_3\)
implies \(X\). (As mentioned below, \(\wedge\) denotes the logical AND operator.) Our proof of Reference:graphconthm had this form.

**Proofs by case distinction:** This is a special case of the above,
where to prove a statement \(X\) we split into several cases
\(C_1,\ldots,C_k\), and prove that **(a)** the cases are *exhaustive*, in
the sense that *one* of the cases \(C_i\) must happen and **(b)** go one
by one and prove that each one of the cases \(C_i\) implies the result \(X\)
that we are after.

**“Without loss of generality (w.l.o.g)”:** This term can be initially
quite confusing to students. It is essentially a way to shorten case
distinctions such as the above. The idea is that if Case 1 is equal to
Case 2 up to a change of variables or a similar transformation, then the
proof of Case 1 will also imply the proof of Case 2. It is always a
statement that should be viewed with suspicion. Whenever you see it in a
proof, ask yourself if you understand *why* the assumption made is truly
without loss of generality, and when you use it, try to see if the use
is indeed justified. Sometimes it might be easier to just repeat the
proof of the second case (adding a remark that the proof is very similar
to the first one).

**Proofs by induction:** We can think of such proofs as a variant of the
above, where we have an unbounded number of intermediate claims
\(X_0,X_1,\ldots,X_k\), and we prove that \(X_0\) is true, as well as that
\(X_0\) implies \(X_1\), and that \(X_0 \wedge X_1\) implies \(X_2\), and so on
and so forth. The website for CMU course 15-251 contains a useful
handout on
potential pitfalls when making proofs by induction.

## Non-standard notation

Most of the notation we discussed above is standard and is used in most mathematical texts. The main points where we diverge are:

- We index the natural numbers \(\N\) starting with \(0\) (though many other texts, especially in computer science, do the same).
- We also index the set \([n]\) starting with \(0\), and hence define it as \(\{0,\ldots,n-1\}\). In most texts it is defined as \(\{1,\ldots, n \}\). Similarly, we index coordinates of our strings starting with \(0\), and hence a string \(x\in \{0,1\}^n\) is written as \(x_0x_1\cdots x_{n-1}\).
- We use *partial* functions, which are functions that are not necessarily defined on all inputs. When we write \(f:A \rightarrow B\) this will refer to a *total* function unless we say otherwise. When we want to emphasize that \(f\) can be a partial function, we will sometimes write \(f: A \rightarrow_p B\).
- As we will see later on in the course, we will mostly describe our computational problems in terms of computing a *Boolean function* \(f: \{0,1\}^* \rightarrow \{0,1\}\). In contrast, most textbooks will refer to this as the task of *deciding a language* \(L \subseteq \{0,1\}^*\). These two viewpoints are equivalent, since for every set \(L\subseteq \{0,1\}^*\) there is a corresponding function \(f = 1_L\) such that \(f(x)=1\) if and only if \(x\in L\). Computing *partial functions* corresponds to the task known in the literature as solving a *promise problem*. (Because the language notation is so prevalent in textbooks, we will occasionally remind the reader of this correspondence.)
- Some other notation we use is \(\ceil{x}\) and \(\floor{x}\) for the “ceiling” and “floor” operators that correspond to “rounding up” or “rounding down” a number to the nearest integer. We use \((x \mod y)\) to denote the “remainder” of \(x\) when divided by \(y\). That is, \((x \mod y) = x - y\floor{x/y}\). In contexts where an integer is expected we’ll typically “silently round” the quantities to an integer. For example, if we say that \(x\) is a string of length \(\sqrt{n}\) then we’ll typically mean that \(x\) is of length \(\lceil \sqrt{n} \rceil\). (In most such cases, it will not make a difference whether we round up or down.)
- Like most Computer Science texts, we default to the logarithm in base two. Thus, \(\log n\) is the same as \(\log_2 n\).
- We will also use the notation \(f(n)=poly(n)\) as shorthand for \(f(n)=n^{O(1)}\) (i.e., as shorthand for saying that there are some constants \(a,b\) such that \(f(n) \leq a\cdot n^b\) for every sufficiently large \(n\)). Similarly, we will use \(f(n)=polylog(n)\) as shorthand for \(f(n)=poly(\log n)\) (i.e., as shorthand for saying that there are some constants \(a,b\) such that \(f(n) \leq a\cdot (\log n)^b\) for every sufficiently large \(n\)).
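Several of these conventions happen to line up with how Python behaves, which gives a convenient way to experiment with them. The sketch below is only an illustration under that assumption: the helper names `bracket`, `mod`, and `ceil_sqrt` are our own, and `mod` is restricted to integers with \(y>0\).

```python
from math import isqrt, log2


def bracket(n):
    """The set [n] = {0, ..., n-1}, following this text's convention."""
    return set(range(n))


def mod(x, y):
    """(x mod y) = x - y * floor(x / y), for integers x and y > 0."""
    return x - y * (x // y)     # Python's // is floor division on integers


def ceil_sqrt(n):
    """The ceiling of sqrt(n), computed exactly with integer arithmetic."""
    r = isqrt(n)                # floor of the square root
    return r if r * r == n else r + 1


x = "0110"                      # a string in {0,1}^4, indexed x_0 x_1 x_2 x_3
print(bracket(len(x)))          # the valid indices of x
print(mod(7, 3), mod(-7, 3))    # remainders are always in [y]
print(ceil_sqrt(10))            # a "string of length sqrt(10)" has length 4
print(log2(8))                  # logarithms are in base two
```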

## Exercises

- Let \(A,B\) be finite sets. Prove that
\(|A\cup B| = |A|+|B|-|A\cap B|\).
- Let \(A_0,\ldots,A_{k-1}\) be finite sets. Prove that
\(|A_0 \cup \cdots \cup A_{k-1}| \geq \sum_{i=0}^{k-1} |A_i| - \sum_{0 \leq i < j < k} |A_i \cap A_j|\).
- Let \(A_0,\ldots,A_{k-1}\) be finite subsets of \(\{1,\ldots, n\}\), such that \(|A_i|=m\) for every \(i\in [k]\). Prove that if \(k>100n\), then there exist two distinct sets \(A_i,A_j\) s.t. \(|A_i \cap A_j| \geq m^2/(10n)\).
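Before attempting the proofs, it can help to test the first two statements on small examples. The following is a rough sketch of such a sanity check; the function name `union_and_bound` and the choice of random subsets of \(\{0,\ldots,9\}\) are assumptions made for illustration. Passing such a check is evidence that a statement is plausible, not a proof of it.

```python
import random
from itertools import combinations

def union_and_bound(sets):
    """Return (|A_0 u ... u A_{k-1}|, sum_i |A_i| - sum_{i<j} |A_i n A_j|)."""
    union_size = len(set().union(*sets))
    singles = sum(len(a) for a in sets)
    pairs = sum(len(a & b) for a, b in combinations(sets, 2))
    return union_size, singles - pairs

random.seed(0)  # fixed seed so the experiment is repeatable
for _ in range(1000):
    k = random.randint(1, 5)
    sets = [set(random.sample(range(10), random.randint(0, 10)))
            for _ in range(k)]
    union_size, bound = union_and_bound(sets)
    assert union_size >= bound       # the inequality for k sets
    if k == 2:
        assert union_size == bound   # for two sets it is an equality
```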

Prove that if \(S,T\) are finite and \(F:S \rightarrow T\) is one to one then \(|S| \leq |T|\).

Prove that if \(S,T\) are finite and \(F:S \rightarrow T\) is onto then \(|S| \geq |T|\).

Prove that for every finite \(S,T\), there are \((|T|+1)^{|S|}\) partial functions from \(S\) to \(T\).

Suppose that \(\{ S_n \}_{n\in \N}\) is a sequence such that \(S_0 \leq 10\) and, for \(n>1\), \(S_n \leq 5 S_{\lfloor \tfrac{n}{5} \rfloor} + 2n\). Prove by induction that \(S_n \leq 100 n \log n\) for every \(n\).

Describe the following statement in English words: \(\forall_{n\in\N} \exists_{p>n} \forall_{a,b \in \N} (a\times b \neq p) \vee (a=1)\).

Prove that for every undirected graph \(G\) of \(100\) vertices, if every vertex has degree at most \(4\), then there exists a subset \(S\) of at least \(20\) vertices such that no two vertices in \(S\) are neighbors of one another.

Suppose that we toss three independent fair coins \(a,b,c \in \{0,1\}\). What is the probability that the XOR of \(a\), \(b\), and \(c\) is equal to \(1\)? What is the probability that the AND of these three values is equal to \(1\)? Are these two events independent?

For every pair of functions \(F,G\) below, determine which of the
following relations holds: \(F=O(G)\), \(F=\Omega(G)\), \(F=o(G)\) or
\(F=\omega(G)\).

a. \(F(n)=n\), \(G(n)=100n\).

b. \(F(n)=n\), \(G(n)=\sqrt{n}\).

c. \(F(n)=n\), \(G(n)=2^{(\log (n))^2}\).

d. \(F(n)=n\), \(G(n)=2^{\sqrt{\log n}}\).

Give an example of a pair of functions \(F,G:\N \rightarrow \N\) such that neither \(F=O(G)\) nor \(G=O(F)\) holds.

Prove that for every directed acyclic graph (DAG) \(G=(V,E)\), there
exists a map \(f:V \rightarrow \N\) such that \(f(u)<f(v)\) for every edge
\(\overrightarrow{u \; v}\) in the graph. Hint: Use induction on the number of vertices. You might want to
first prove the claim that every DAG contains a *sink*: a vertex
without an outgoing edge.

## Bibliographical notes

The section heading “A Mathematician’s Apology” refers, of course, to Hardy’s classic book. Even when Hardy is wrong, he is very much worth reading.