CMSC 28000 — Lecture 2

The most basic operation that we can perform on strings is concatenating them.

The concatenation of two strings $u = a_1 a_2 \cdots a_m$ and $v = b_1 b_2 \cdots b_n$, denoted $u \cdot v$ or simply $uv$, is $uv = a_1 \cdots a_m b_1 \cdots b_n$.

Observe that concatenation is associative, since $(uv)w = u(vw)$. However, it is not commutative (i.e. $uv$ is not necessarily equal to $vu$). Usually, we learn about concatenation when we first learn to program. Depending on how you learned to program, you may think of it as sticking two strings together. Or, if you learned to program functionally, then you will have seen a version of the following alternative definition before.

The concatenation of two strings $u$ and $v$ is defined inductively: if $u = \varepsilon$, then $u \cdot v = v$; otherwise, $u = ax$ for some symbol $a \in \Sigma$ and string $x$, and $u \cdot v = a(x \cdot v)$.

To see that this works, just follow the definition: \begin{align*} cat \cdot hat &= c(at \cdot hat) \\ &= c(a(t \cdot hat)) \\ &= c(a(t(\varepsilon \cdot hat))) \\ &= c(a(t(hat))) \\ &= c(a(that)) \\ &= c(athat) \\ &= cathat \\ \end{align*}
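The inductive definition translates directly into code. Here is a minimal Python sketch (the function name `cat` is my own choice, not standard library):

```python
def cat(u: str, v: str) -> str:
    """Concatenate u and v following the inductive definition."""
    if u == "":                  # base case: eps . v = v
        return v
    return u[0] + cat(u[1:], v)  # inductive case: (a x) . v = a (x . v)
```

For example, `cat("cat", "hat")` unwinds exactly like the derivation above and returns `"cathat"`.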

This definition has a few advantages over the less formal one. As with other recursive definitions, this one acts as an algorithm as well—it tells us how two strings are to be concatenated. And because it is a more formal definition, it allows us to actually prove that concatenation is associative (give it a shot), rather than just notice it.

It also becomes more clear in this view of strings that alphabet symbols are not technically considered strings. This is similar to the way that some programming languages distinguish characters from strings. Sometimes, this distinction will be very helpful in proofs. Other times, like in Python, we can be lazy and not bother making this distinction.

Notice that this also makes it clear that we can perform concatenation on the empty string. If we have a string $w$, then $w \varepsilon = \varepsilon w = w$.

An interesting consequence of these definitions is that they already define a mathematical structure.

A monoid is a structure $\langle M,\cdot,1 \rangle$, where $M$ is a set, $\cdot$ is an associative binary operation, and $1$ is an identity for $\cdot$.

From this, we can see that the set of strings, with the binary operation concatenation and identity element $\varepsilon$, is a monoid, and this is where $\Sigma^*$ gets its notation from—it is the free monoid on the set $\Sigma$. Of course, for us, this will be little more than a curiosity, but it does point to a few things. First of all, strings aren't just arbitrary objects; they have mathematical meaning and structure. Secondly, monoids are algebraic objects (just add inverses and you've got groups!), and this points to the possibility that studying formal language theory using algebraic tools may not be such a bad idea.

Let's go back to concatenation and define a few more notions.

The $k$th power of a string $w$ is $$w^k = \overbrace{ww \cdots w}^{k \text{ times}}$$ and we define $w^0 = \varepsilon$. Formally, we define this inductively by $$w^0 = \varepsilon, \qquad w^k = w \, w^{k-1} \text{ for } k \geq 1.$$

(Notice that this definition is inductive on the natural numbers rather than on strings.)
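The induction on $k$ can be written out as a short Python sketch (again, `power` is just a name I've chosen):

```python
def power(w: str, k: int) -> str:
    """Compute w^k by induction on the natural number k."""
    if k == 0:                      # w^0 = eps
        return ""
    return w + power(w, k - 1)      # w^k = w . w^(k-1)
```

So `power("ab", 3)` gives `"ababab"`.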

A related notion that won't come up very much in this course is the notion of primitivity. We say a word is primitive if it is non-empty and can't be written as $u^k$ for any word $u$ and any $k \geq 2$. This is sort of like the notion of a "prime unit" for strings.
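A brute-force primitivity check simply tries every proper prefix as a candidate period. This is a sketch of my own, not an efficient algorithm:

```python
def is_primitive(w: str) -> bool:
    """A non-empty word is primitive if it is not u^k for any shorter u and k >= 2."""
    n = len(w)
    if n == 0:
        return False                 # the empty word is not primitive by definition
    for d in range(1, n):            # candidate period lengths
        if n % d == 0 and w[:d] * (n // d) == w:
            return False             # w is a power of its length-d prefix
    return True
```

For instance, `abab` is not primitive (it is $(ab)^2$), but `aba` is.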

If $w = uxv$ is a string, we say that $u$ is a prefix of $w$, that $v$ is a suffix of $w$, and that $x$ is a factor (or substring) of $w$.

It's important to note that $u,v,w,x$ are strings, any of which may be $\varepsilon$. So $\varepsilon$ is a prefix of every string, and $w$ itself is a prefix of $w$.

Let $w = cat$. Then the prefixes of $w$ are $\varepsilon, c, ca, cat$; the suffixes of $w$ are $\varepsilon, t, at, cat$; and the factors of $w$ are $\varepsilon, c, a, t, ca, at, cat$.
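Enumerating these sets is a one-liner each in Python; a quick sketch (function names are mine):

```python
def prefixes(w: str) -> set[str]:
    """All prefixes of w, including eps and w itself."""
    return {w[:i] for i in range(len(w) + 1)}

def suffixes(w: str) -> set[str]:
    """All suffixes of w, including eps and w itself."""
    return {w[i:] for i in range(len(w) + 1)}

def factors(w: str) -> set[str]:
    """All factors (substrings) of w."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}
```

Running these on `"cat"` recovers the sets from the example.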

Much like the usage of string or word, there are a plethora of terms that can be used interchangeably to describe what you're probably most familiar with as a substring. The term subword is the same idea, if you were to use word instead of string. Similarly, the term infix comes naturally from the analogy with prefix and suffix. The one which may be less obvious is factor; this comes from viewing the string as an algebraic object: if $6 = 1 \times 2 \times 3$, then $2$ is a factor of $6$. Similarly, if $w = x \cdot y \cdot z$, then $y$ is a factor of $w$.

The choice of which term to use is largely a product of which subfield you spend your time in (computer science, combinatorics, or algebra) and where you are situated geographically (the Americas or Europe).

The reversal of a string $w = a_1 a_2 \cdots a_n$ is the string with the symbols written in reverse order, $w^R = a_n \cdots a_2 a_1$. Formally, $\varepsilon^R = \varepsilon$ and $(ax)^R = x^R a$ for a symbol $a \in \Sigma$ and a string $x \in \Sigma^*$.

If $w = w^R$, we say $w$ is a palindrome.
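The inductive definition of reversal, and the palindrome test it gives us, look like this as a Python sketch:

```python
def rev(w: str) -> str:
    """Reverse w following the inductive definition: eps^R = eps, (a x)^R = x^R a."""
    if w == "":
        return ""
    return rev(w[1:]) + w[0]

def is_palindrome(w: str) -> bool:
    """w is a palindrome when w = w^R."""
    return w == rev(w)
```

So `rev("cat")` is `"tac"`, and `is_palindrome("racecar")` is `True`.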

Reversal is only an interesting operation because concatenation is not commutative; if concatenation were commutative, a string and its reversal would always be equal.

Here is a property of reversal that we can prove: if we concatenate two strings and reverse them, it is the same as reversing them first then concatenating them in the opposite order. Seems obvious.

$(xy)^R = y^R x^R$.

We will show this by induction on $x$.

For our base case, let $x = \varepsilon$. We have \begin{align*} (xy)^R &= (\varepsilon y)^R &\text{Base case}\\ &= y^R &\text{Definition of $\varepsilon$} \\ &= y^R \varepsilon &\text{Definition of $\varepsilon$}\\ &= y^R \varepsilon^R &\text{Definition of reversal}\\ &= y^R x^R & \text{Base case} \end{align*}

Now, for our induction step, let $x = az$ for some symbol $a \in \Sigma$ and word $z \in \Sigma^*$, and assume that $(zy)^R = y^R z^R$. Then we have \begin{align*} (xy)^R &= ((az)y)^R &\text{Inductive case}\\ &= (a(zy))^R &\text{Definition of concatenation}\\ &= (zy)^R a &\text{Definition of reversal}\\ &= (y^R z^R) a &\text{Inductive hypothesis}\\ &= y^R (z^R a) &\text{Associativity of concatenation}\\ &= y^R (az)^R &\text{Definition of reversal}\\ &= y^R x^R &\text{Substitution} \end{align*}


Let's note a few things about this proof. This proof gets into more detail than you'd typically require in this course, but the steps are here to illustrate some of the subtleties of working with strings. For instance, notice that we make use of associativity of concatenation, which means we have implicitly started treating the symbol $a$ as a string of length 1.

This proof was accomplished via induction directly on strings. We can do this because strings are recursive objects. But something you often see is induction that's more indirect, on the length of the string. In principle, this is really the same thing: our base case would be 0, which corresponds to $\varepsilon$, and the inductive case, when $|w| > 0$, would correspond to strings of the form $ax$.
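Although the proof is what establishes the identity, it is easy to spot-check mechanically. Here is a small Python sketch that exhaustively verifies $(xy)^R = y^R x^R$ for all short words over $\{a, b\}$, using slicing for reversal:

```python
from itertools import product

def rev(w: str) -> str:
    """Reverse a string using Python slicing."""
    return w[::-1]

# all words over {a, b} of length at most 3, including the empty word
alphabet = "ab"
words = [""] + ["".join(p) for n in range(1, 4)
                for p in product(alphabet, repeat=n)]

# check the identity (xy)^R = y^R x^R on every pair
assert all(rev(x + y) == rev(y) + rev(x) for x in words for y in words)
```

Of course, no amount of testing replaces the induction, but checks like this are a good sanity measure when working with string identities.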

Languages

As we've seen earlier, languages in the formal language theoretic sense are just sets of strings and we describe a language in the same way as for any other set.

Here are some examples of languages over $\Sigma = \{a,b\}$: the finite language $\{a, ab, abb\}$, the language of all strings of even length, and the language of all strings that contain $ab$ as a factor.

We can define the notion of a language a bit more formally, with respect to an alphabet $\Sigma$.

We denote by $\Sigma^*$ the set of all finite words over the alphabet $\Sigma$. A language $L$ over $\Sigma$ is a subset $L \subseteq \Sigma^*$.

In other words, we fix some alphabet $\Sigma$ and take $\Sigma^*$ to be our universe.

You might notice that the more curious definition is for $\Sigma^*$. If you recall our brief diversion about monoids, you might recognize that $\Sigma^*$ is really just the free monoid on $\Sigma$.

Since languages are sets, we can perform any set theoretic operations we're familiar with, like union and intersection.

For a language $L \subseteq \Sigma^*$, the complement of $L$ is $\overline L = \Sigma^* \setminus L$.

This also means there are a few things we have to be careful with. For example, there is the notion of the empty language $\emptyset$, which is the language with no strings in it. This is different from the language $\{\varepsilon\}$, which is the language containing the empty string.

We can also extend many operations on strings to operations on languages.

For two languages $L_1$ and $L_2$, we define the concatenation of $L_1$ and $L_2$ by $$ L_1 L_2 = \{uv \mid u \in L_1, v \in L_2 \}.$$

If $L_1 = \{\varepsilon, ab, abb, abc\}$ and $L_2 = \{ca, bac\}$, then \[L_1 L_2 = \{ca, bac, abca, abbac, abbca, abbbac, abcca, abcbac\}.\]
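Representing languages as Python sets of strings, concatenation is a set comprehension; a sketch (the name `lang_cat` is my own):

```python
def lang_cat(L1: set[str], L2: set[str]) -> set[str]:
    """Concatenation of languages: all uv with u in L1 and v in L2."""
    return {u + v for u in L1 for v in L2}
```

Note that `lang_cat(L, set())` is always the empty set, while `lang_cat(L, {""})` is `L` itself, which previews the $\emptyset$ versus $\{\varepsilon\}$ distinction below.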

What exactly is the difference between $\emptyset$ and $\{\varepsilon\}$? Concatenation gives us one example of why this difference matters. Consider a language $L$: we have $L \emptyset = \emptyset L = \emptyset$, since there is no string in $\emptyset$ to pair with the strings of $L$, while $L \{\varepsilon\} = \{\varepsilon\} L = L$, since concatenating $\varepsilon$ to a string leaves it unchanged.

This allows us to define powers for languages.

For a language $L \subseteq \Sigma^*$, the $k$th power of $L$ is defined by $L^0 = \{\varepsilon\}$ and $L^k = L L^{k-1}$ for $k \geq 1$.
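As with string powers, the induction on $k$ is directly executable; a Python sketch:

```python
def lang_power(L: set[str], k: int) -> set[str]:
    """The k-th power of a language: L^0 = {eps}, L^k = L . L^(k-1)."""
    if k == 0:
        return {""}
    return {u + v for u in L for v in lang_power(L, k - 1)}
```

For example, `lang_power({"a", "b"}, 2)` is the set of all length-2 words over $\{a,b\}$.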

This definition leads to an interesting operation.

The Kleene star or Kleene closure of $L$ is defined by $$ L^* = \bigcup_{i \geq 0} L^i.$$ We define the positive Kleene closure of $L$ by $L^+ = LL^*$. One can see that this is just the nonzero powers of $L$, so we can also think of this as $$ L^+ = \bigcup_{i \geq 1} L^i.$$

Let $\Sigma = \{a,b,c\}$. Then $L = \{ab, bc\}$ is a language over $\Sigma$. If we take the Kleene star of $L$, we get the language made up of concatenations of $ab$ and $bc$, like $ababbcabbcbcab$.
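$L^*$ is infinite whenever $L$ contains a non-empty word, so in code we can only enumerate it up to a length bound. A sketch of that idea (my own helper, not a standard routine):

```python
def kleene_star_upto(L: set[str], max_len: int) -> set[str]:
    """All words of L* of length at most max_len."""
    result = {""}                  # L^0 contributes eps
    frontier = {""}
    while frontier:
        # extend each frontier word by one non-empty word of L
        frontier = {u + v for u in frontier for v in L
                    if v and len(u + v) <= max_len} - result
        result |= frontier
    return result
```

With $L = \{ab, bc\}$ and bound 4, this produces $\varepsilon$, $ab$, $bc$, $abab$, $abbc$, $bcab$, and $bcbc$.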

One use of this is for constructing codes. In formal language theory, a code over an alphabet $\Sigma$ is a language $K \subseteq \Sigma^+$ such that if $w \in K^+$ then $w$ has a unique factorization into words of $K$ (here is a use of "factor" as "subword"). It's not hard to see that not every language is a code.
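To see that not every language is a code, count factorizations: for example, $\{a, ab, ba\}$ is not a code, since $aba = a \cdot ba = ab \cdot a$. A brute-force Python check (the example language is mine, and this assumes $K \subseteq \Sigma^+$, i.e. $\varepsilon \notin K$, as in the definition):

```python
def count_factorizations(w: str, K: set[str]) -> int:
    """Number of ways to write w as a concatenation of words of K.

    Assumes eps is not in K, so the recursion terminates.
    """
    if w == "":
        return 1
    return sum(count_factorizations(w[len(u):], K)
               for u in K if w.startswith(u))
```

A language $K$ is a code exactly when every word of $K^+$ has factorization count 1; here `count_factorizations("aba", {"a", "ab", "ba"})` is 2, witnessing that this language is not a code.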