What are the possible sets of word lengths in a regular language?

Given a language $L$, define the length set of $L$ as the set of lengths of the words in $L$:

$\mathrm{LS}(L) = \{|u| \mid u \in L \}$

Which sets of integers can be the length set of a regular language?

— Gilles 'SO- stop being evil'

Answers:

First, an observation that is not crucial but is convenient: the set $\mathscr{S}$ of sets of integers that are $\mathrm{LS}(L)$ for some regular language $L$ over a non-empty alphabet $\mathscr{A}$ does not depend on the choice of the alphabet. To see this, consider a finite automaton that recognizes $L$. The lengths of the words in $L$ are the lengths of the paths on the automaton, seen as an unlabeled graph, from the start state to any accepting state. In particular, you can relabel every arrow with $a$ and obtain a regular language with the same length set over the alphabet $\{a\}$. Conversely, if $L$ is a regular language over a one-element alphabet, it can be trivially injected into a larger alphabet, and the result is still a regular language.
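The "paths on the unlabeled graph" view can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the answer: the function name `length_set`, the dictionary encoding of the automaton, and the cutoff `bound` (the real length set may be infinite, so only lengths up to `bound` are listed).

```python
def length_set(transitions, start, accepting, bound):
    """Lengths n <= bound of paths from start to some accepting state.

    transitions maps a state to a list of (label, next_state) pairs.
    Labels are ignored, which mirrors relabeling every arrow with 'a'.
    """
    reachable = {start}  # states reachable in exactly n steps
    lengths = set()
    for n in range(bound + 1):
        if reachable & accepting:
            lengths.add(n)
        # Advance one step along every outgoing arrow, whatever its label.
        reachable = {t for s in reachable for (_, t) in transitions.get(s, [])}
    return lengths


# Hypothetical example automaton for (ab)*: 0 -a-> 1, 1 -b-> 0, state 0 accepts.
example = {0: [("a", 1)], 1: [("b", 0)]}
print(sorted(length_set(example, 0, {0}, 7)))  # [0, 2, 4, 6]
```

Since the labels never enter the computation, any relabeling of the arrows gives the same result, which is the point of the observation.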

Therefore we are looking for the possible length sets of words over a singleton alphabet. Over a singleton alphabet, the language is the length set written in unary: $\mathrm{LS}(L) = \{n\in\mathbb{N} \mid a^n \in L\}$. Such languages are called unary languages.

Let $L$ be a regular language, and consider a deterministic finite automaton (DFA) that recognizes $L$. The set of word lengths of $L$ is the set of lengths of paths in the DFA, seen as a directed graph, that start in the start state and end in one of the accepting states. A DFA on a one-element alphabet is pretty tame (NFAs would be wilder): it is either a finite list or a circular list. If the list is finite, number the states from $0$ to $h$ following the list order. If it is circular, number the states from $0$ to $h$ following the head of the list and from $h$ to $h+r$ along the loop.

[Figure: list-shaped automata]

Let $F$ be the set of indices of the accepting states up to $h$, and $G$ the set of indices of the accepting states from $h$ to $h+r$. Then

$\mathrm{LS}(L) = F \cup \{ k \, r + x \mid x \in G, k\in\mathbb{N} \}$

Conversely, let $h$ and $r$ be two integers and $F$ and $G$ two finite sets of integers such that $\forall x \in F, x \le h$ and $\forall x \in G, h \le x \le h+r$. Then the set $L_{F,G,r} = \{ a^{x} \mid x\in F \} \cup \{ a^{k\,r+x} \mid x\in G, k\in\mathbb{N} \}$ is a regular language: it is the language recognized by the DFA described above. A regular expression that describes this language is $a^F \mid a^{G} (a^r)^*$.
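As a sanity check, the formula can be compared with a direct simulation of the list-shaped DFA. The sketch below is not from the answer: the function names are made up, it assumes $r \ge 1$, and it fixes one indexing convention (states $0,\dots,h+r-1$, with the last state looping back to $h$), so it is subject to the usual off-by-one caveats.

```python
def length_set_formula(F, G, r, bound):
    """F ∪ {k*r + x : x in G, k >= 0}, truncated to values <= bound (r >= 1)."""
    result = {x for x in F if x <= bound}
    for x in G:
        n = x
        while n <= bound:
            result.add(n)
            n += r
    return result


def unary_dfa_length_set(h, r, accepting, bound):
    """Simulate a lasso-shaped unary DFA.

    Assumed convention: states 0..h+r-1, state i goes to i+1,
    and the last state h+r-1 goes back to h (a loop of length r).
    """
    lengths, state = set(), 0
    for n in range(bound + 1):
        if state in accepting:
            lengths.add(n)
        state = state + 1 if state < h + r - 1 else h
    return lengths


# Hypothetical instance: h = 2, r = 3, accepting states 1 (head) and 3 (loop).
print(sorted(unary_dfa_length_set(2, 3, {1, 3}, 10)))  # [1, 3, 6, 9]
print(sorted(length_set_formula({1}, {3}, 3, 10)))     # [1, 3, 6, 9]
```

Both computations agree on this instance, with $F = \{1\}$ and $G = \{3\}$ read off from the accepting states.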

In plain English, the length sets of regular languages are the sets of integers that are periodic¹ above a certain value.

¹ To hang on to a well-established notion: periodic means that the characteristic function of the set (which is a function $\mathbb{N}\to\{\mathtt{false},\mathtt{true}\}$, which we lift to a function $\mathbb{Z}\to\{\mathtt{false},\mathtt{true}\}$) is periodic. Periodic above a certain value means that the restriction of the function to $[h,+\infty[$ can be extended to a periodic function.
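To make "periodic above a certain value" concrete, here is a small finite check. It is only a sketch: the name `periodic_from` and the cutoff `bound` are assumptions, and a check on a finite window can refute periodicity for given $h$ and $r$ but cannot prove it for the whole infinite set.

```python
def periodic_from(S, h, r, bound):
    """True if membership in S repeats with period r on the window [h, bound]."""
    return all((n in S) == (n + r in S) for n in range(h, bound - r + 1))


S = {1, 3, 6, 9}                    # the set {1} ∪ {3 + 3k}, cut off at 10
print(periodic_from(S, 2, 3, 10))   # True: periodic with period 3 from 2 on
print(periodic_from(S, 0, 3, 10))   # False: the element 1 breaks it below 2

# The perfect squares up to 40 fail for every small threshold and period.
squares = {k * k for k in range(7)}
print(any(periodic_from(squares, h, r, 40)
          for h in range(11) for r in range(1, 11)))  # False
```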

— Gilles 'SO- stop being evil'

Your observation about the irrelevance of the alphabet suggests that Parikh's theorem can be applied. In particular, you show that $\mathrm{LS}(L) = \mathrm{LS}(L')$, where $L'$ is $L$ with all letters collapsed into a single-letter alphabet. But $\mathrm{LS}(L')$ is the Parikh image of the language $L'$, which is known to be semilinear for every regular language.

— Suresh

Nice approach! 1) I think the first paragraph can be replaced by noting that regular languages are closed under string homomorphisms. 2) For clarity, consider stating the second part of $\mathrm{LS}(L)$ as $\{h + kr + (x - h) \mid \dots \}$, modulo off-by-one errors. 3) What is a "periodic" set of integers?

— Raphael

@Suresh, Raphael (1): I prefer to state the proof in an elementary way, neither homomorphisms nor Parikh mappings were mentioned in my CS 102 class.

— Gilles 'SO- stop being evil'

@Raphael (2) Where you start in indexing $G$ doesn't matter, I could remove the condition $h \le G$, as $F$ can absorb as many small elements as we want. (3) A set that is periodic above a certain value is one that can be put in the displayed form above.

— Gilles 'SO- stop being evil'

Any finite subset $\{\ell_1,\ldots,\ell_n\}\subset\mathbb{N}$ can be the length set of a regular language $L$, since you can take a unary alphabet $\{0\}$ and define $L$ as $\{0^{\ell_1},\ldots,0^{\ell_n}\}$ (this includes the empty language and $\{\varepsilon\}$).

Now for the infinite sets. I'll give a short analysis, though the final answer might not be explicit enough. I won't elaborate unless you ask me to, because I think it's intuitive and because I don't have much time now.

Let $r_1,r_2$ be regular expressions generating languages $L_1$ and $L_2$ , respectively. It is (sort of) easy to see that

$\mathsf{LS}(L(r_1+r_2))=\mathsf{LS}(L_1\cup L_2)=\mathsf{LS}(L_1)\cup\mathsf{LS}(L_2)$ .
$\mathsf{LS}(L(r_1r_2))=\mathsf{LS}(L_1L_2)=\{\ell_1+\ell_2:\ell_1\in\mathsf{LS}(L_1),\ell_2\in\mathsf{LS}(L_2)\}$ . This is denoted $\mathsf{LS}(L_1)+\mathsf{LS}(L_2)$ .
$\mathsf{LS}(L(r_1^*))=\{0\}\cup\bigcup_{n\geq 1}\Big\{\sum_{i=1}^n\ell_i:(\ell_1,\ldots,\ell_n)\in\big(\mathsf{LS}(L_1)\big)^n\Big\}.$

Thus, the possible sets of integers that can be the length-set of a regular language are the ones that are finite subsets of $\mathbb{N}$ or that can be built by taking finite subsets $S_1,S_2$ of $\mathbb{N}$ and using the previous formulas a finite number of times.

Here, we are using that regular languages are built, by definition, by applying the rules for constructing a regular expression a finite number of times. Note that we can start with any finite subset of $\mathbb{N}$ , even though in regular expressions we start with words of length 0 and 1 only as the base case. This is easily justified by the fact that all (finite) words are (finite) concatenations of the symbols of the alphabet.
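The three rules translate directly into set operations. The following is a rough sketch, not part of the original argument: the function names are invented, and every set is truncated at an assumed cutoff `bound` so that the star rule terminates.

```python
def ls_union(A, B):
    """Length set of L1 ∪ L2."""
    return A | B


def ls_concat(A, B, bound):
    """Length set of L1 L2: all sums x + y, truncated at bound."""
    return {x + y for x in A for y in B if x + y <= bound}


def ls_star(A, bound):
    """Length set of L1*: 0 plus all finite sums of elements of A, up to bound."""
    result = {0}
    frontier = {0}
    while frontier:
        # Extend every newly found length by one more factor from A.
        new = {x + y for x in frontier for y in A if x + y <= bound} - result
        result |= new
        frontier = new
    return result


# Hypothetical expression (aa + aaa)* b, with lengths cut off at 10.
inner = ls_union({2}, {3})
print(sorted(ls_concat(ls_star(inner, 10), {1}, 10)))
# [1, 3, 4, 5, 6, 7, 8, 9, 10]
```

Only the star rule needs a loop, which matches the remark that it is the one rule able to produce infinite length sets.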

— Janoma

I don't see any final answer. (Were you intending to finish your answer later?) I was hoping for a simple description of the possible sets, and a connection with automata.

— Gilles 'SO- stop being evil'

The final answer is there: "Thus, the possible sets of integers...". That is indeed a simple description, though connected with regular expressions, not automata.

— Janoma

There's a simpler description that doesn't involve taking a fixpoint. Maybe this question isn't as elementary as I thought!

— Gilles 'SO- stop being evil'

I don't think you can avoid the last rule, since the star operator is the one that can produce infinite length sets, just as it produces infinite languages.

— Janoma

@Gilles So you want a closed form of the smallest fixpoint of the inductive solution Janoma provides?

— Raphael

According to the pumping lemma for regular languages, there exists an $n$ such that any string $x \in L$ of length at least $n$ can be written in the form

$x = uvw$

where the following three conditions hold:

$|uv| < n$

$|v| > 0$

$uv^{k}w \in L$ for all $k \ge 0$

This gives us one test for sets: a set cannot be the length set of a regular language unless all its elements can be expressed as some arbitrary set of integers no greater than a fixed $n$ , plus some multiple of an undetermined value $m$ (the length of $v$ ), plus some arbitrary finite value.

In other words, it looks like the possible length sets of regular languages are the closure with respect to set union (as discussed under EDIT and EDIT2, thanks to commenters) of sets described as follows:

$\{a + bn \mid n \in \mathbb{N}\} \cup S$

for fixed $a, b \in \mathbb{N}$ and all finite sets $S$, by the pumping lemma for regular languages (thanks to Gilles for pointing out a silly mistake in my original version, whereby I was defining the set $\mathbb{N}$).
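The difference between fixing $a, b$ and quantifying over them is easy to see on a small example. The snippet below is illustrative only: `progression`, the cutoff `bound`, and the sample values $a = 2$, $b = 3$, $S = \{0, 1\}$ are all assumptions chosen for the demonstration.

```python
def progression(a, b, bound):
    """{a + b*n : n >= 0} truncated to values <= bound, for fixed a and b."""
    if b == 0:
        return {a} if a <= bound else set()
    return set(range(a, bound + 1, b))


bound = 12

# Fixed a = 2, b = 3, plus a finite set S = {0, 1}.
fixed = progression(2, 3, bound) | {0, 1}
print(sorted(fixed))  # [0, 1, 2, 5, 8, 11]

# Letting a and b vary as well yields every natural number up to bound.
free = set().union(*(progression(a, b, bound)
                     for a in range(bound + 1) for b in range(bound + 1)))
print(free == set(range(bound + 1)))  # True
```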

EDIT: A little more discussion. Certainly all finite sets of integers are length sets. Also, the union of two length sets must also be a length set, as must be the complement of any length set (hence intersection, hence difference). The reason for this is that the regular languages are closed under these operations. Therefore, the answer I give above is (possibly) incomplete; in reality, any union of such sets is also the length set of some regular language (note that I have abandoned requiring intersection, complement, difference, etc., since these are covered by the fact that regular languages are closed under these properties, as discussed in EDIT3; I think that only union is actually necessary, even if the others are right, which might not be the case).

EDIT2: Even more discussion. The answer I give is basically where you'd end up if you took Janoma's answer a little further; the $bn$ part comes from the Kleene star, the $a$ comes from concatenation, and the discussion of union, intersection, difference and complement comes from the + of regular expressions (as well as other closure properties of regular languages provable starting from automata).

EDIT3: In light of Janoma's comment, let's forget the closure properties of length sets that I discuss in the first EDIT. Since the regular languages have these closure properties, and since every regular language has a DFA, it follows that the pumping lemma for regular languages applies to all unions, intersections, complements, and differences of regular languages, and we'll leave it at that; no need to even consider any of these, except union, which I still think might be necessary to make my original answer (modified, thanks to input from Gilles) correct. So, my final answer is this: what I say in the original version, plus the closure of length sets with respect to set union.

— Patrick87

$\{a+bn \mid a,b,n\in\mathbb{N}\} \cup S$ is on the right track, but you got a quantifier wrong somewhere, you're generating $\mathbb{N}$.

— Gilles 'SO- stop being evil'

The analysis for the complement of a length set may be a bit delicate. If $L=L(a^*)$ over the alphabet $\Sigma=\{a,b\}$, then the length set of $L$ is $\mathbb{N}$ and the length set of $\overline{L}$ is $\mathbb{N}^+$, and these are not complements of each other.
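A brute-force enumeration over short words shows the same thing. This is just a sketch with an assumed cutoff `bound = 4`; the variable names are illustrative.

```python
from itertools import product

bound = 4
# All words over {a, b} of length at most bound.
all_words = ["".join(p) for n in range(bound + 1) for p in product("ab", repeat=n)]

L = [w for w in all_words if set(w) <= {"a"}]     # a*, restricted to short words
co_L = [w for w in all_words if "b" in w]         # its complement in {a, b}*

print(sorted({len(w) for w in L}))     # [0, 1, 2, 3, 4]
print(sorted({len(w) for w in co_L}))  # [1, 2, 3, 4]
```

The two length sets overlap on every positive length, so taking the complement of the language does not complement the length set.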

— Janoma

@Gilles But the set of all natural numbers is a valid length set, right? I'm not generating all subsets of natural numbers, right? I agree that would be problematic. Edit: oh wait, I see what you're saying. Yes, you're right. Will fix when back at computer.

— Patrick87

@Janoma Excellent point, will need to consider how that might change the set of things I'm defining...

— Patrick87