regular language and pumping lemma
i’ve always struggled with the concept of regular (or of context-free) language, not in the way that i don’t know how to construct one that i know satisfies the definition, but that of deciding, how do i know whether a language is regular or not? and how do i arbitrarily come up with a language that’s not regular? they will tell you about the pumping lemma, but there’s just not much good explanation about it when i tried to search around.
background: regular expression
let’s know what’s a regular language first. i would like to put it simply,
something represented by a regular expression. maybe the one “grep” comes with.
the one in many programming languages. let’s start by looking at some. (i’ll use
the common notation in programming language world here, that is, |
*
.)
suppose you want to describe a word, but not very precisely; let’s say, you want
to look for both regular
and normal
. then you want to type in
regular|normal
(or normal|regular
, it’s mostly symmetric in real world).
the |
represents an alternative, meaning to describe both the word on the left
and on the right.
now you want to describe both regularity
and regularly
. in the same logic
you want to write regularity|regularly
, which will work just as fine. but you
notice you typed regular
twice, and that’s not efficient. so in these cases,
you may use a regular expression like regular(ity|ly)
, to describe the common
prefixes and different suffixes. to take a step further, you may also take out
the last y
here for regular(it|l)y
. three describe the same set of words.
maybe you want to include something like familiarity
in your description too.
and for consistency you want familiarly
too. in this case you might want to
write (regul|famili)ar(it|l)y
. this expression describes the four words
mentioned. if you want to describe similar-
in here too, just say
(regul|famili|simil)ar(it|l)y
, or maybe ((regu|simi)l|famili)ar(it|l)y
.
these describe the six words. say for some reason you want to forgo with the
-ly
but add in arity
(i hope arly
is a word now), you can write
(()|regul|famili|simil)arity
for it. ()
is the regexp for empty input, which
is sometimes written as epsilon, $\varepsilon$.
and now you want to describe looooool
s. for this we can use the kleene star
*
. we might write something like lo(o*)l
, o*
means arbitrarily many o
in
series. we write an extra o
before that because, l(o*)l
describes ll
too,
the repetition can be zero times. if we want to describe lololololol
now, we
write lo((lo)*)l
. now if you like a freestyle llloololllooollol
, describe it
with l(l|o)*l
.
and regular expressions are just basically the above, but they usually appear
with some shorthand notations too. notice we wrote o(o*)
lo((lo)*)
, in which
we want to describe a repetition of one or more times instead of zero or more.
so let’s write A+
:= A(A*)
. if we only want to describe a short enough lulz,
we write something like lo{1-5}l
, meaning o
repeated for any of 1 to 5
times. this is expands to l(o|oo|ooo|oooo|ooooo)l
. write A?
:= ()|A
. and
there’s a lot more interesting extensions, but we can always decide to stick to
only *
|
and sticking two regexps together.
now there’s actually a less known non-shorthand regexp, $\empty$, that describes absolutely nothing, not even the empty string. it’s probably less useful in real-world situations, but we need it in order to describe all regular languages.
i hope this is enough to give us a grasp of what a regular expression is, and what it describes (we’ll call it a regular language). there are a ton of tips and tricks, deeper into the topic of what is a regular language and why the regular expressions can describe exactly all the regular languages for a given alphabet, automaton and computability theory, all very fascinating stuff out there. for the purpose of this article we’ll conclude here, with a somewhat-formal definition of a regular expression.
we call it a regular expression iff exactly one of the following holds:
- it is empty-set $\empty$;
- it is empty-string $\varepsilon$;
- it is one element of the alphabet,
a
b
x
y
etc; - it is an alternation, of the form
A | B
; - it is a concatenation, of the form
AB
; - it is a kleene star expression, of the form
A*
.
matching a string to a regexp
so now we know a regexp describes a set of strings. that means you can tell if (at least for some) strings belongs to this set or not. let’s see how we can tell by defining a relation “match” for a regexp and a string. should we talk about a string, i will use ‘++’ to represent string concatenation from now on.
matches is the smallest relation that satisfies:
- empty string matches empty-string.
- singleton string of only
x
matches the one element of the alphabetx
. - if s matches
A
, it matchesA|B
. - if s matches
B
, it matchesA|B
. - if s1 matches
A
and s2 matchesB
, s1 ++ s2 matchesAB
. - empty string matches
A*
. - if sh matches
A
and st matchesA*
, sh ++ st matchesA*
. - (and you can see because this is the smallest relation, nothing matches empty-set).
this relation is inductive, so we can prove things about this relation by induction directly on the relation, that is, case-analysis with different witnesses of the fact that a string matches a regexp, which is super handy. better yet we can show this relation is actually decidable, by developing an algorithm for the decision and proving its correctness. it’s kind of a labor, involving you resolving by some enumeration and thus guessing and backtracking, and taking care of ambiguity. so is why people generally just find whatever regexp library, sporadically selling their souls to grep and sed. may all rest peacefully in the regexp heaven.
anyways we don’t really need this to be decidable for the later reasoning, just wanted to let you know this really is the regexp we’re familiar with, albeit in a computational relation form.
now there’s still a relatively trivial way to treat this relation as a
meta-algorithm, so say you want to prove or disprove s matches R
, just see
which rule will give the correct s or R
, and try to prove premises of the
rule. if there are multiple choices, then it involves guessing, which you can use
your galaxy brain to just use one of them, or just like the real
algorithm, try and backtrack. eventually it’s either proven or you find a
contradiction entails (which means s does not match R
).
suppose we want to show familiarly
matches (regul|famili)ar(it|l)y
. top
level of this expression is of form AB
so we invoke its sole rule, (from
oracle) setting s1 = familiar
, s2 = ly
, A
= (regul|famili)ar
and B
=
(it|l)y
. now we prove s1 matches A
and s2 matches B
. let’s go with s2
matches B
here first. again by the same logic we split s2 into l
++ y
, and
show that (s2 = ) y
matches (B
=) y
, by applying the singleton rule. now
prove l
matches it|l
. we invoke the second rule for alternation. now we see
s2 matches B
. go back to s1 matches A
, it’s similar.
pumping lemma (for regular languages)
we should first know what it means if a language is regular or not. so if you have a language, that is, a (very possibly infinite) collection of strings (sentences) over the alphabet (maybe words), saying it is regular is equal to saying, there’s some regular expression out there, which describes exactly this language. and if a language is so, we can ask it for its corresponding regexp.
now let’s gradually work out what the pumping lemma really is. first of all, it is a property that every regular language has. so we can assert a language has the pumping lemma, or it doesn’t have the pumping lemma; and the conclusion is going to be that if a language is regular, it has the pumping lemma.
this is useful, because it allows us to show a language doesn’t have the pumping lemma, thus it isn’t a regular language.
detour
please note this is called a contraposition, which is (very) different from proof by contradiction (reductio ad absurdum), which many explanations end up claiming. the logic is $ (\neg A)\rightarrow (\neg B) \leftrightarrow B \rightarrow A $ which has an intuitionist derivation, unlike RAA which is only classical. this is important because, imagine if we are reasoning about some relation around regular languages, say for some parameter x, if A(x) is regular then B(x) is regular too. we might do a case analysis on x, and maybe for some x A(x) doesn’t have a pumping lemma, then A(x) isn’t regular, refuting the need to prove B(x) is regular in this case. consider if we have such a theorem, we can specify x and A (and regexp of A(x)), to get a regexp for B(x). if we have to use a classical derivation here, then the process of running A(x) through the theorem might just not terminate sometimes, then we don’t always get a B(x). (which I just happened to randomly feel that this would be super useful someday!)
also please note that this doesn’t exclude the possibility of some language having the pumping lemma but isn’t really regular. this only denies that some language is regular, and it’s usually hard to deny, relative to proving some language is regular, which instead only involves you coming up with a regular expression of it. just remember you can’t use this to prove a language is regular.
so, let’s ponder a little bit around the features of a regular language, what do they all have in common?
a finite language is necessarily regular because, for a given finite language,
you can just join every sentence together with |
, and it’s exactly this
language. and on the other way around, if a regular expression only involves |
and concatenation, it’s necessarily finite, such that you can kind of expand it
like it’s a polynomial, splitting any (A|B)C
into AC|BC
, and you will
eventually get a long list of sentences joined with |
, which is the exact list
of every sentence in this language.
as soon as we introduce *
into our regexp, we come around to dealing with an
infinite language. any regular infinite language is so because they have at
least one *
in them. you can factor and expand |
and concatenation and some
*
if you need, but the language is still infinite and you still will have a
*
somewhere in there. as you can imagine, a regular infinite language is and
in a way only is made so with the kleene star, the lemma of our interest will
definitely talk about a star a lot.
the idea basically is that, if there’s some *
somewhere in your language, then
some sentence is going to use that *
to match, which means a part of that
sentence r matches some A*
with a certain p in it matching A
. and if p
matches A*
, p ++ r also will. and p ++ r ++ r also will. and so you can just
shove more and more r into this hole. trivially, you can repeat the o
in
loooool
as many times as you want.
let’s take a look at our earlier lolol
language. so for example we know
lolololol
is in this language, that it matches the given regular expression
lo((lo)*)l
. and you can see this “used” the (lo)*
to match. so we can break
the sentence apart like this: lo(lo)lolol
, and shove as many lo
into
there as we want, and it still matches the regular expression. we can also break
as lolo(lo)lol
and this still works. we can even break this into lol(ol)olol
and you see we can shove arbitrarily many ol
in there too in this case. this
is what “pumping” means, that you can repeatedly insert something back in.
but sometimes you can’t always do this pumping even if it’s a string in a
regular infinite language, simply because if we take lmao
matching
lmao|lo((lo)*)l
, you can’t really find any part of lmao
that you can pump.
this suggests there’s some condition a match must satisfy in order for it to be
pumped. and the common way to put is that we can only pump a sentence in this
language long enough, and lmao
is too short lmao, deal with it.
pumping constant
so how long do you want then? as you can imagine this depends on the regular expression itself. we could also just say eh we don’t really know, just that not forall lengths, not some string longer but not pumpable, as is the usual semantic of “sufficiently (large|long|many)”. but it’s also just a kind of easy-to-calculate number, and it helps to know what this number actually means. we call this shortest pumpable length of a language its pumping constant.
the pumping constant of a regular expression is:
- if it’s empty-set, 1
- if it’s empty-string, 1
- if it’s one element of alphabet, 2
- if it’s
A|B
orAB
, result is summing pumping constant ofA
and pumping constant ofB
up - if it’s a kleene star
A*
, result is just pumping constant ofA
it might look a bit arbitrary and thus scary, but here’s the breakdown: we’re
basically denying to pump the finite stuff by setting the pumping constant to be
just 1 too large: empty-string can only have a length of 0, singleton can only
have a length of 1 etc. and for A|B
and AB
, they both produce finite
language if both A
and B
are finite. in this case if we just sum them up,
the result is larger than the maximum length A|B
or AB
can reach, we also
bar them from being pumped. finally, the number of A*
you see, if we have a
string as long as A
could represent, means we can pump it and it’ll still be
in A*
. this basically means the number of characters you need to look for
until you find a part that actually uses the *
, since each of the other cases
says you need to look past the entire string, which means it’s impossible to
arrive at anything else.
deduction
we got our condition figured out now, let’s see how to put them all together and get a complete proposition that is our lemma:
given any regexp R
, any sentence s that matches R
, if the given s is longer
than the pumping constant of R
, then we can give a partition of s, as in there
exists a b c, such that a ++ b ++ c = s and s2 isn’t empty, and satisfies
that a ++ (any time of repeated b) ++ c also matches R
.
Proof by induction on the given match:
-
if the match is because of empty-string, pumping constant is 1 but the length of s is 0, premise “s is longer than pumping constant of
R
” is unsatisfiable, we can refuse to prove this case. -
if the match is because of a singleton char, it’s similar to the above.
-
if the match is because
R
is someA|B
and s is matchingA
, we have the induction hypothesis, s has pumping lemma inA
: if s is longer than the pumping constant ofA
, there exists sa1 sa2 sa3, sa1 ++ sa2 ++ sa3 = s, sa2 isn’t empty, and we can pump sa2.we need to show s has pumping lemma in
A|B
. recall s is longer than pumping constant ofA|B
. since pumping constant ofA|B
is that ofA
+B
, we know s is also longer than the pumping constant ofA
. so we use this fact together with induction hypothesis to get our sa1 sa2 sa3, and it’s exactly the partition we’re looking for. basically, you figure out how to pump “s matchesA
”, the induction hypothesis, instead. -
if the match is because
R
is someA|B
and s is matchingB
, it’s similar to the case above. -
if the match is because
R
is someAB
and s is s1 ++ s2, s1 is matchingA
and s2 is matchingB
, we have the two induction hypothesis: if s1 is longer than the pumping constant of A, we can partition s1 and pump; similar for s2 andB
.we show
s1 ++ s2
has pumping lemma inAB
. we knows1 ++ s2
is longer than the pumping constant ofAB
, and it’s that ofA
+B
, and the length of s1 ++ s2 is that of s1 + s2. either s1 is longer than pumping constant ofA
or s2 is longer thanB
. in either case, we can use one of the induction hypotheses to get the partition and property desired. -
if the match is because
R
is someA*
and s is empty, we need to prove from the premise of s is longer than the pumping constant ofA*
. notice pumping constant is always >= 1, leading to an unsatisfiable premise. -
finally, if the match is because
R
is someA*
and s is some s1 ++ s2, induction hypotheses are if s1 is longer thanA
, we can partition A to pump; same with if s2 is longer thanA
(because pumping constantA*
=A
). we prove from the premise of that, s1 ++ s2 is longer than pumping constant ofA
, to show we can partition s1 ++ s2 and pump.s1 is either empty or not.
- if it’s empty, that means s1 ++ s2 is s2, thus from the premise s2 is longer
than pumping constant of
A
, and we invoke induction hypothesis. - if it’s not, we directly give out the partition letting a = empty string, b
= s1, c = s2. a ++ b ++ c = s1 ++ s2 = s. b = s1 is not empty. any times of
repeated s1 followed with s2 still matches
A*
in this case (this can be derived from a trivial induction on the times repeated).
- if it’s empty, that means s1 ++ s2 is s2, thus from the premise s2 is longer
than pumping constant of
■
conclusion
the above derivation actually still lacks a part: that the pumping constant
actually has even more power, the partition we need to give should also satisfy
that length of a ++ b is shorter or equal to the pumping constant of the given
R
. so the above is called the weak pumping lemma. the strong pumping lemma
also asserts you can find this pumpable part in the first pumping-constant
number of characters. but as you can see, it’s not that hard to keep track of
exactly which a and b we chose, and show they are indeed in the required range.
the basic idea is still that the pumping constant is indicating how long we must
look for until we find a *
.
let’s see some language that isn’t regular. the simplest example is a (a*)(b*)
but only if there is a same amount of a
and b
, usually noted like
$\{a^nb^n\,|\,n \in\N\}$. we show it’s always impossible to pump in this, for
that any partition you come up will either pump a segment of all a
or all b
,
which will cause imbalance of the number of a
s and b
s; or if you pump a
segment of some ...aaabb...
, then obviously this doesn’t match our
specification (a*)(b*)
. since we can’t pump this language, it’s impossible to
come up with a regular expression to describe this language.
and there’s actually a pumping lemma for context-free languages (the above
example is one) too. it’s just slightly more complicated but the idea is very
similar, you can always partition a sentence, into 5 parts this time,
a(b)c(d)e
, where it satisfies that all the a(bbbb)c(dddd)e
is still in the
language. b
and d
get pumped the same number of times, and they correspond
in a way that the first b
is corresponding to the last d
, such that if
b...d
matches two different rules, that b
and that d
can only choose the
same rule.
i wish i possess formal linguistic prowess, but i don’t even have much idea about the formalization of BNF forms. alas let’s see an example, just like the last one, but this time it’s $\{a^nb^nc^n\,|\,n \in\N\}$. by the exact same logic, if you choose only two segments, the other one mismatch in number; and if you choose the boundary it doesn’t match at all. thus this can’t be pumped, and isn’t a context-free language.
this article was written because i wish i had someone explained to me when i was still a child. i feel like the wikipedia pages still suck to this day, and even if they’re getting better, they always involve automatons like if automatons are the hotcakes. and was this chance here given to me by @HoshinoTented, for she asked me for help in doing an exercise in Software Foundations, Vol.1 lf, IndProp.v, which asked for exactly the derivation above. actually making it feels really relieving.