Problem with DFSA State machines and Regex
1 DFSA vs NDFSA
- They are equal in power, any DFSA can be translated to NDFSA, vice versa
- Deterministic state machines cannot backtrack
- Non-deterministic state machines can backtrack
2 state machines are bad at the fine-grained level
In the above example, we dont keep track of what the total sum actually is.
The sum is the internal state(global context) and it is impossible to track.
- One huge issue with state machines is they do not visualize the INTERNAL state(global context) mutations/side-effect
- real world applications deal with side-effects typically in the form of modifying some “global context” AND make state-transitions based on the “global context”
- Since state machines do not visualize explicitly the global context, the state machine, more specifically state transitions must be carefully crafted with the “global context” in mind. Typically this causes a state explosion.
- real world applications deal with side-effects typically in the form of modifying some “global context” AND make state-transitions based on the “global context”
3 Regex
regex are just state machines
- Regex cannot match with recursive structures
- Parsing HTML cannot be done with standard regex
- Parsing source-code with indefinite number of nested brackets
{
}
cannot be done with standard regex
Regular languages – the languages that regular expressions can recognise – are a proper subset of context-free languages, and the written form of s-expressions is context-free but not regular. So no regular expression can recognise the written form of an s-expression.
To see this consider a very tiny subset of s-expressions:
n = () | ( n)
So n consists of the set {(), (()), ((())), …}, where the number of left parens and right parens in each string are equal. Such a language can’t be recognised by a regular expression because you need to count parens.
not really but we could make an infinitely long state machine
4 Chomsky Grammar and Regex
Chomsky hierarchy:
/-------------------------------------------\
| |
| Recursively enumerable languages | Type 0
| |
| /-----------------------------------\ |
| | | |
| | Context-sensitive languages | | Type 1
| | | |
| | /---------------------------\ | |
| | | | | |
| | | Context-free languages | | | Type 2
| | | | | |
| | | /-------------------\ | | |
| | | | Regular languages | | | | Type 3
| | | \-------------------/ | | |
| | \---------------------------/ | |
| \-----------------------------------/ | \-------------------------------------------/
Example of what a grammar looks like:
A -> x B C
where x is a terminal variable
Sentence -> Noun Verb
- Level 0 - Unrestricted : Turing Machines
- right side can be Nil
- ex.
A -> Nil
is allowed
- Level 1 - Context Sensitive Lang : Linear bounded Automata
- right side can be Nil (same restriction as level 0)
- Length of variables of right side must be greater than or equal Length of variables of left side
- ex.
A B C -> D
is NOT allowed since left sideA B C
is len 3 and right sideD
is len 1 - ex.
A B -> C D
is allowed - ex.
A B -> C D E
is allowed
- Level 2 - Context Free lang : Push down Automata
- Same restrictions as level 1
- left side will only contain 1 variable
- with the level 1 restriction, this implies right side must contain at least 1 variable and greater than or equal to number variable of left side
- ex.
A B -> C D E
is NOT allowed - ex.
A -> Nil
is allowed - ex.
A -> B C
is allowed - ex.
A -> B
is allowed
- Level 3 - Regular Grammar : Regex, NonDeterministic Finite Automata,Deterministic Finite Automata
- Same restriction as type 2
- Can only have rules of these two forms:
A -> b C
whereb
is a terminal variableA -> b
whereb
is a terminal variable
- ex.
A -> a b c
is not allowed (lower case are terminal variables) - ex.
A -> Nil
is allowed
5 How is regex type 3?
Level 3 grammar, express the natural numbers
N -> 0
N -> 1
N -> 2
N -> 3
N -> 4
N -> 5
N -> 6
N -> 7
N -> 8
N -> 9
N -> 0N
N -> 1N
N -> 2N
N -> 3N
N -> 4N
N -> 5N
N -> 6N
N -> 7N
N -> 8N
N -> 9N
Regex is a simplification and makes the above [0-9]+