# derivation language

## Background

Let $G=(\Sigma,N,P,\sigma)$ be a formal grammar. A pair $(W_{1},W_{2})$ of words over $\Sigma$ is said to correspond to the production $p\to q$ if

 $W_{1}=u_{1}pv_{1}\qquad\mbox{and}\qquad W_{2}=u_{2}qv_{2}$

for some words $u_{i},v_{j}$ over $\Sigma$. We also say that $(W_{1},W_{2})$ is a derivation step, and write $W_{1}\to W_{2}$.

Recall that a derivation in $G$ is a finite sequence of words

 $W_{1},W_{2},\ldots,W_{n}$

over $\Sigma$ such that $W_{i}\Rightarrow W_{i+1}$, for $i=1,\ldots,n-1$. The derivation is also written

 $W_{1}\Rightarrow W_{2}\Rightarrow\cdots\Rightarrow W_{n}.$

The derivation above has $n-1$ steps. Zero-step derivations are also permitted. These are just words over $\Sigma$.

The reflexive transitive closure of $\Rightarrow$ is $\Rightarrow^{*}$. Thus, $V\Rightarrow^{*}W$ means that there is a derivation starting with $V$ and ending with $W$. There may be more than one derivation from $V$ to $W$.

If we consider each production as a “letter” in the alphabet $P$, then the above derivation can be represented by a “word” over $P$ in the following manner: the “word” is formed by taking concatenations of the “letters”, where concatenations correspond to successive applications of productions in $P$:

 $[p_{1}\to q_{1}][p_{2}\to q_{2}]\cdots[p_{n-1}\to q_{n-1}].$

Derivation language is thus a certain collection of derivation words, formally defined below.

## Definitions

Treating productions as “letters” is really nothing more than labeling each production with some symbol. Formally, call a labeling of an alphabet $P$ a surjection $f:F\to P$, where $F$ is some alphabet. For any $p\in P$, a label for $p$ is an element $x\in F$ such that $f(x)=p$. We will only be interested in injective labeling (hence a bijection) from now on.

Definition. Suppose we are given a labeling $f$ of the set $P$ of productions in the grammar $G$. Given a derivation $D:W_{1}\Rightarrow W_{2}\Rightarrow\ldots\Rightarrow W_{n}$, a derivation word $U$ for $D$ is defined inductively as follows:

1. 1.

if $n=1$, then the empty word $U:=\lambda$,

2. 2.

if $n=2$, then $U:=x\in F$ is a label for a production that $W_{1}\Rightarrow W_{2}$ corresponds to,

3. 3.

if $n>2$, then $U:=U_{1}U_{2}$, where

• $U_{1}$ is a derivation word for the derivation $W_{1}\Rightarrow\cdots\Rightarrow W_{i}$,

• $U_{2}$ is a derivation word for the derivation $W_{i}\Rightarrow\cdots\Rightarrow W_{n}$.

If $U$ is a derivation word for derivation $D$, let us abbreviate this by writing $f[U]=D$. Note that we are not applying the labeling $f$ to $U$, it is merely a notational convenience.

A derivation word is sometimes called a control word, for it defines or controls whether and how a word may be derived from another word. Note that any $W_{1}\Rightarrow^{*}W_{2}$ may correspond to several distinct derivations, and hence several distinct derivation words. Also, the same derivation word may correspond to distinct derivations.

For example, let $G$ be a grammar over two symbols $($ and $)$ with productions $\sigma\to\lambda$, $\sigma\to(\sigma)$, and $\sigma\to\sigma\sigma$ ($G$ generates the parenthesis language $\boldsymbol{\operatorname{Paren}_{1}}$) Label the productions as $a,b,c$ respectively. Then the derivation $\sigma\Rightarrow^{*}(()())$ correspond to derivation words $bcbbaa$ and $bcbaba$. Notice that $\sigma\sigma\Rightarrow(\sigma)\sigma$ and $\sigma\sigma\Rightarrow\sigma(\sigma)$ both correspond to the derivation word $b$.

Definition. The derivation language of a grammar $G=(\Sigma,N,P,\sigma)$ is the set of all derivation words for derivations on words generated by $G$. In other words, consider the labeling $f:F\to P$. The derivation language of $G$ is the set

 $\{U\in F^{*}\mid f[U]\mbox{ is a derivation of the form }\sigma\Rightarrow^{*}% u\mbox{ for some }u\in N^{*}\}.$

The derivation language of $G$ is also called the Szilard language of $G$, named after its inventor, and is denoted by $\operatorname{Sz}(G)$.

For example, let $G$ be the grammar over a the letter $a$, with productions given by $\sigma\to\sigma$, $\sigma\to a$. If the productions are labeled $b,c$, then $\operatorname{Sz}(G)=\{b^{n}c\mid n\geq 0\}$. Note that $L(G)=\{a\}$. Likewise, if $G^{\prime}$ is the grammar over $a$, with productions $\sigma\to A\sigma$, $A\to\lambda$, and $\sigma\to a$, labeled $p,q,r$ respectively, then $L(G^{\prime})=\{a\}$. However, $\operatorname{Sz}(G^{\prime})$ is quite different from $\operatorname{Sz}(G)$:

 $\displaystyle\operatorname{Sz}(G^{\prime})$ $\displaystyle=\{u\in F^{*}\mid$ $\displaystyle u=vrw\mbox{, where }v\in\{p,q\}^{*},w\in\{q\}^{*},$ $\displaystyle\#_{u}(p)=\#_{u}(q)\mbox{ and }\#_{x}(p)\geq\#_{x}(q)\mbox{ for % all }x\leq u\}$

where

• $F=\{p,q,r\}$,

• $\#_{u}(r)$ means the number of occurrences of $r$ in word $u$,

• $v\leq u$ means that $v$ is a prefix of $u$.

Remarks. Let $G$ be a formal grammar.

• Some properties of $\operatorname{Sz}(G)$:

1. (a)

$\operatorname{Sz}(G)$ is always context-sensitive.

2. (b)

If $G$ is regular, so is $\operatorname{Sz}(G)$.

3. (c)

if $G$ is context-free, $\operatorname{Sz}(G)$ may not be; in fact, for any context-free language $L$, there is a context-free grammar $G$ such that $L=L(G)$ but $\operatorname{Sz}(G)$ is not context-free.

4. (d)

There exists a context-free language $L$ such that $\operatorname{Sz}(G)$ is not context-free for any grammar $G$ generating $L$.

• However, if one considers the subset $\operatorname{Sz}_{L}(G)$ of $\operatorname{Sz}(G)$ consisting of all derivation words corresponding to leftmost derivations, then $\operatorname{Sz}_{L}(G)$ is context-free.

• It is an open question that, given any (context-sensitive) language $L$, whether there is a grammar $G$ such that $L=\operatorname{Sz}(G)$.

## References

• 1 A. Salomaa, Formal Languages, Academic Press, New York (1973).
• 2 G. E. Révész, Introduction to Formal Languages, Dover Publications (1991).
Title derivation language DerivationLanguage 2013-03-22 18:58:15 2013-03-22 18:58:15 CWoo (3771) CWoo (3771) 13 CWoo (3771) Definition msc 68Q42 msc 68Q45 Szilard language