NLP
1 Sites
2 Chomsky Hierarchy
\[ Type\ 3 \subset Type\ 2 \subset Type\ 1 \subset Type\ 0 \]
2.1 Type 0
- No rule restriction
- Equivalent in power to a Turing machine
2.2 Type 1
- Context sensitive rules
- Example: rules (SUPER BIG => VERY VERY BIG)
3 Finite State Automata
- Deterministic Finite State Automata (DFSA): rules are unambiguous; each state has at most one transition per input symbol
- Non-Deterministic Finite State Automata (NDFSA): more than one sequence of rules may have to be attempted to see whether a string matches the grammar (backtracking, look-ahead)
Any NDFSA can be converted into an equivalent DFSA, possibly with more states.
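The NDFSA-to-DFSA conversion can be sketched with the subset construction, where each DFSA state is a set of NDFSA states. This is only a minimal sketch: the automaton below is a hypothetical NDFSA for A(ab)*ABB?, and the state numbers, `EPS`, and the function names are assumptions made for illustration, not taken from the diagrams in these notes.

```python
from collections import deque

EPS = ""  # label used here for the empty-string (e) transition

# Hypothetical NDFSA for A(ab)*ABB?; state numbers are illustrative only.
NFA = {
    (0, "A"): {1},
    (1, "a"): {2},
    (2, "b"): {1},
    (1, "A"): {3},
    (3, "B"): {4},
    (4, "B"): {5},
    (4, EPS): {5},  # the optional final B
}
NFA_START, NFA_ACCEPT = 0, {5}


def eps_closure(states):
    """All states reachable from `states` using only empty-string moves."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)


def to_dfsa():
    """Subset construction: returns (start, transition table, accepting states)."""
    alphabet = {sym for (_, sym) in NFA if sym != EPS}
    start = eps_closure({NFA_START})
    table, seen, queue = {}, {start}, deque([start])
    while queue:
        current = queue.popleft()
        for sym in alphabet:
            moved = set()
            for s in current:
                moved |= NFA.get((s, sym), set())
            if not moved:
                continue  # no transition on this symbol
            target = eps_closure(moved)
            table[(current, sym)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    accepting = {s for s in seen if s & NFA_ACCEPT}
    return start, table, accepting


def dfsa_accepts(text):
    """Run the constructed DFSA: one state at a time, no backtracking."""
    start, table, accepting = to_dfsa()
    state = start
    for ch in text:
        state = table.get((state, ch))
        if state is None:
            return False
    return state in accepting


print(dfsa_accepts("AabABB"))  # True
print(dfsa_accepts("AabA"))    # False
```

For this small automaton the resulting DFSA has the same number of states as the NDFSA; in general the subset construction can produce up to 2^n states for an n-state NDFSA.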
3.1 Deterministic Example
DFSA for regexp: A(ab)*ABB?
Note: e stands for the empty string.
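A table-driven DFSA for the same regexp could be sketched as follows. The state names `q0` to `q5` and the table are assumptions and need not match the numbering in the original diagram; instead of an e transition for the optional final B, this sketch marks two states as accepting.

```python
# Hypothetical transition table for A(ab)*ABB?; state names are illustrative.
TRANSITIONS = {
    ("q0", "A"): "q1",
    ("q1", "a"): "q2",  # start of an "ab" repetition
    ("q2", "b"): "q1",  # end of an "ab" repetition
    ("q1", "A"): "q3",
    ("q3", "B"): "q4",
    ("q4", "B"): "q5",  # optional second B
}
ACCEPTING = {"q4", "q5"}


def accepts(text, start="q0"):
    """Follow exactly one transition per character; reject if none exists."""
    state = start
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING


print(accepts("AabABB"))  # True
print(accepts("AAB"))     # True
print(accepts("Aab"))     # False
```

Because every (state, character) pair maps to at most one next state, the loop never has to revisit an earlier choice.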
3.2 Non-Deterministic Example
NDFSA for regexp: A(ab)*ABB?
- How is this different from the deterministic example?
- If we give it the input AabABB: at Q3 it backtracks. If it terminates too early via Q3->Q5, it goes back to Q3 and then tries Q3->Q4->Q5.
- If we give it input
The deterministic example does not backtrack; even though the DFSA version has a split path at Q4, that only means two cases are acceptable: either the empty string e or the string B.
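The backtracking behaviour can be sketched with a small depth-first matcher. The automaton below is a hypothetical NDFSA for A(ab)*ABB? with its own state numbering (it does not reproduce Q3/Q4/Q5 from the diagram), and `match`, `EPS`, and `trace` are names assumed for illustration. The matcher deliberately tries the empty-string move first, so it finishes too early on the input AabABB and has to back up.

```python
EPS = ""  # label used here for the empty-string (e) transition

# Hypothetical NDFSA; state 4 has two ways to reach state 5.
NFA = {
    (0, "A"): [1],
    (1, "a"): [2],
    (2, "b"): [1],
    (1, "A"): [3],
    (3, "B"): [4],
    (4, EPS): [5],  # skip the optional B
    (4, "B"): [5],  # consume the optional B
}
ACCEPT = {5}


def match(text, state=0, pos=0, trace=None):
    """Depth-first search over the NDFSA; returning False is a backtrack."""
    if trace is not None:
        trace.append((state, pos))  # record every (state, position) visited
    if pos == len(text) and state in ACCEPT:
        return True
    # Try empty-string moves first (no input consumed).
    for nxt in NFA.get((state, EPS), []):
        if match(text, nxt, pos, trace):
            return True
    # Then try moves that consume the next character.
    if pos < len(text):
        for nxt in NFA.get((state, text[pos]), []):
            if match(text, nxt, pos + 1, trace):
                return True
    return False


visited = []
print(match("AabABB", trace=visited))  # True
print(visited[-2:])  # [(5, 5), (5, 6)]
```

The entry `(5, 5)` in the trace is the dead end: the final state was reached with one character still unread, so the search returned to state 4 and took the B transition instead.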
4 NLP Statistics
- Frequency of words: eat, eating, cat, big, bigger
- Frequency of base forms: eat, cat, big
- Frequency of characters: a, b, c
- Frequency of bi-grams: the big, big cat, ate the, the mouse
- TFIDF :: term -> [documents] -> number
- TFIDF = Term Frequency * Inverse Document Frequency
- TF = Frequency of the term in the corpus (a set of different documents)
- IDF = \(\frac{Number\ of\ documents}{Number\ of\ documents\ containing\ term}\)
- Example: TFIDF(“cat”) for a corpus of 100 documents
- Example 1: 100 instances of cat in 1 document and 99 documents containing no cat, then TFIDF(“cat”) = 100 * 100/1 = 10000
- Example 2: 100 instances of cat, 1 instance in each of the 100 documents, then TFIDF(“cat”) = 100 * 100/100 = 100
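The statistics above can be sketched with the standard library. The function `tfidf` follows the definition in these notes (corpus-wide term count times the raw document ratio); many libraries instead take the logarithm of the ratio and count TF per document, so their numbers will differ. The token "dog" is just assumed filler for the documents that contain no cat.

```python
from collections import Counter


def tfidf(term, documents):
    """TF * IDF as defined in these notes (raw ratio, no logarithm)."""
    tf = sum(doc.count(term) for doc in documents)           # count over the whole corpus
    containing = sum(1 for doc in documents if term in doc)  # documents with the term
    if containing == 0:
        return 0.0
    idf = len(documents) / containing
    return tf * idf


# Each document is a list of tokens.
example_1 = [["cat"] * 100] + [["dog"] for _ in range(99)]
example_2 = [["cat"] for _ in range(100)]

print(tfidf("cat", example_1))  # 10000.0
print(tfidf("cat", example_2))  # 100.0

# Word, bi-gram, and character frequencies for one sentence.
tokens = "the big cat ate the mouse".split()
word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
char_counts = Counter("".join(tokens))

print(word_counts["the"])             # 2
print(bigram_counts[("the", "big")])  # 1
```

Counting base forms (eat for eating, big for bigger) would additionally need a lemmatizer or stemmer, which is left out of this sketch.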
5 Penn Treebank
https://catalog.ldc.upenn.edu/LDC99T42
5.1 Contents:
- Bracket Labels
  - Clause Level
  - Phrase Level
  - Word Level
- Function Tags
  - Form/function discrepancies
  - Grammatical role
  - Adverbials
  - Miscellaneous
5.2 Bracket Labels
5.2.1 Clause Level
S - Simple declarative clause, i.e. one that is not introduced by a (possibly empty) subordinating conjunction or a wh-word and that does not exhibit subject-verb inversion.
SBAR - Clause introduced by a (possibly empty) subordinating conjunction.
SBARQ - Direct question introduced by a wh-word or a wh-phrase. Indirect questions and relative clauses should be bracketed as SBAR, not SBARQ.
SINV - Inverted declarative sentence, i.e. one in which the subject follows the tensed verb or modal.
SQ - Inverted yes/no question, or main clause of a wh-question, following the wh-phrase in SBARQ.
5.2.2 Phrase Level
ADJP - Adjective Phrase.
ADVP - Adverb Phrase.
CONJP - Conjunction Phrase.
FRAG - Fragment.
INTJ - Interjection. Corresponds approximately to the part-of-speech tag UH.
LST - List marker. Includes surrounding punctuation.
NAC - Not a Constituent; used to show the scope of certain prenominal modifiers within an NP.
NP - Noun Phrase.
NX - Used within certain complex NPs to mark the head of the NP. Corresponds very roughly to N-bar level but used quite differently.
PP - Prepositional Phrase.
PRN - Parenthetical.
PRT - Particle. Category for words that should be tagged RP.
QP - Quantifier Phrase (i.e. complex measure/amount phrase); used within NP.
RRC - Reduced Relative Clause.
UCP - Unlike Coordinated Phrase.
VP - Verb Phrase.
WHADJP - Wh-adjective Phrase. Adjectival phrase containing a wh-adverb, as in how hot.
WHADVP - Wh-adverb Phrase. Introduces a clause with an NP gap. May be null (containing the 0 complementizer) or lexical, containing a wh-adverb such as how or why.
WHNP - Wh-noun Phrase. Introduces a clause with an NP gap. May be null (containing the 0 complementizer) or lexical, containing some wh-word, e.g. who, which book, whose daughter, none of which, or how many leopards.
WHPP - Wh-prepositional Phrase. Prepositional phrase containing a wh-noun phrase (such as of which or by whose authority) that either introduces a PP gap or is contained by a WHNP.
X - Unknown, uncertain, or unbracketable. X is often used for bracketing typos and in bracketing the…the-constructions.
5.2.3 Word Level
CC - Coordinating conjunction
CD - Cardinal number
DT - Determiner
EX - Existential there
FW - Foreign word
IN - Preposition or subordinating conjunction
JJ - Adjective
JJR - Adjective, comparative
JJS - Adjective, superlative
LS - List item marker
MD - Modal
NN - Noun, singular or mass
NNS - Noun, plural
NNP - Proper noun, singular
NNPS - Proper noun, plural
PDT - Predeterminer
POS - Possessive ending
PRP - Personal pronoun
PRP$ - Possessive pronoun (Prolog version PRP-S)
RB - Adverb
RBR - Adverb, comparative
RBS - Adverb, superlative
RP - Particle
SYM - Symbol
TO - to
UH - Interjection
VB - Verb, base form
VBD - Verb, past tense
VBG - Verb, gerund or present participle
VBN - Verb, past participle
VBP - Verb, non-3rd person singular present
VBZ - Verb, 3rd person singular present
WDT - Wh-determiner
WP - Wh-pronoun
WP$ - Possessive wh-pronoun (Prolog version WP-S)
WRB - Wh-adverb
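To see how the clause, phrase, and word level labels fit together, here is a minimal sketch that reads one bracketed tree and lists its (word, tag) pairs. The sentence and its bracketing are a hand-made example, not taken from the treebank, and `parse` and `tagged_words` are hypothetical helpers written for these notes rather than part of any treebank tooling.

```python
import re


def parse(text):
    """Parse a bracketed tree such as "(NP (DT the) (NN cat))" into (label, children)."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())       # nested constituent
            else:
                children.append(tokens[pos])  # the word itself
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def tagged_words(tree):
    """Collect (word, word-level tag) pairs from left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


# Hand-made example: clause label S, phrase labels NP/VP, word tags DT/JJ/NN/VBD.
SENTENCE = "(S (NP (DT the) (JJ big) (NN cat)) (VP (VBD ate) (NP (DT the) (NN mouse))))"

tree = parse(SENTENCE)
print(tree[0])             # S
print(tagged_words(tree))
# [('the', 'DT'), ('big', 'JJ'), ('cat', 'NN'), ('ate', 'VBD'), ('the', 'DT'), ('mouse', 'NN')]
```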