Information Retrival - Natural Launguage Processing (NLP)
Please leave a remark at the bottom of each page with your useful suggestion.
Introduction
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Information retrieval
- Boolean retrieval
- The term vocabulary and postings lists
- Document delineation and character sequence decoding
- Determining the vocabulary of terms
- Tokenization
- Dropping common terms: stop words
- Normalization (equivalence classing of terms)
- Stemming and lemmatization
- Faster postings list intersection via skip pointers
- Positional postings and phrase queries
- Biword indexes
- Positional indexes
- Combination schemes
- Search structures for dictionaries
- Wildcard queries
- General wildcard queries
- k-gram indexes for wildcard queries
- Spelling correction
- Implementing spelling correction
- Forms of spelling correction
- Edit distance
- k-gram indexes for spelling correction
- Context sensitive spelling correction
- Phonetic correction
- Hardware basics
- Blocked sort-based indexing
- Single-pass in-memory indexing
- Distributed indexing
- Dynamic indexing
- Other types of indexes
- Statistical properties of terms in information retrieval
- Heaps’ law: Estimating the number of terms
- Zipf’s law: Modeling the distribution of terms
- Dictionary compression
- Dictionary as a string
- Blocked storage
- Postings file compression
- Variable byte codes
- γ codes
- Parametric and zone indexes
- Weighted zone scoring
- Learning weights
- The optimal weight g
- Term frequency and weighting
- nverse document frequency
- f-idf weighting
- The vector space model for scoring
- Dot products
- Queries as vectors
- Computing vector scores
- Variant tf-idf functions
- Sublinear tf scaling
- Maximum tf normalization
- Document and query weighting schemes
- Pivoted normalized document length
- Efficient scoring and ranking
- Inexact top K document retrieval
- Index elimination
- Champion lists
- Static quality scores and ordering
- Impact ordering
- Cluster pruning
- Components of an information retrieval system
- Tiered indexes
- Query-term proximity
- Designing parsing and scoring functions
- Putting it all together
- Vector space scoring and query operator interaction
- Information retrieval system evaluation
- Standard test collections
- Evaluation of unranked retrieval sets
- Evaluation of ranked retrieval results
- Assessing relevance
- Critiques and justifications of the concept of relevance
- A broader perspective: System quality and user utility
- System issues
- User utility
- Refining a deployed system
- Results snippets
- Relevance feedback and pseudo relevance feedback
- The Rocchio algorithm for relevance feedback
- Probabilistic relevance feedback
- When does relevance feedback work?
- Relevance feedback on the web
- Evaluation of relevance feedback strategies
- Pseudo relevance feedback
- Indirect relevance feedback
- Global methods for query reformulation
- Vocabulary tools for query reformulation
- Query expansion
- Automatic thesaurus generation
- Basic XML concepts
- Challenges in XML retrieval
- A vector space model for XML retrieval
- Evaluation of XML retrieval
- Text-centric vs. data-centric XML retrieval
- Review of basic probability theory
- The Probability Ranking Principle
- The 1/0 loss case
- The PRP with retrieval costs
- The Binary Independence Model
- Deriving a ranking function for query terms
- Probability estimates in theory
- Probability estimates in practice
- Probabilistic approaches to relevance feedback
- An appraisal and some extensions
- An appraisal of probabilistic models
- Tree-structured dependencies between terms
- Okapi BM25: a non-binary model
- Bayesian network approaches to IR
- Language models
- Finite automata and language models
- Types of language models
- Multinomial distributions over words
- The query likelihood model
- Using query likelihood language models in IR
- Estimating the query generation probability
- Ponte and Croft’s Experiments
- Language modeling versus other approaches in IR
- Extended language modeling approaches
- The text classification problem
- Naive Bayes text classification
- Relation to multinomial unigram language model
- The Bernoulli model
- Properties of Naive Bayes
- A variant of the multinomial model
- Feature selection
- Mutual information
- χ2 Feature selection
- Frequency-based feature selection
- Feature selection for multiple classifiers
- Comparison of feature selection methods
- Evaluation of text classification
- Document representations and measures of relatedness in vector spaces
- Rocchio classification
- k nearest neighbor
- Time complexity and optimality of kNN
- Linear versus nonlinear classifiers
- Classification with more than two classes
- The bias-variance tradeoff
- Support vector machines: The linearly separable case
- Extensions to the SVM model
- Soft margin classification
- Multiclass SVMs
- Nonlinear SVMs
- Experimental results
- Issues in the classification of text documents
- Choosing what kind of classifier to use
- Improving classifier performance
- Machine learning methods in ad hoc information retrieval
- A simple example of machine-learned scoring
- Result ranking by machine learning
- Clustering in information retrieval
- Problem statement
- Cardinality – the number of clusters
- Evaluation of clustering
- K-means
- Cluster cardinality in K-means
- Model-based clustering
- Hierarchical agglomerative clustering
- Single-link and complete-link clustering
- Time complexity of HAC
- Group-average agglomerative clustering
- Centroid clustering
- Optimality of HAC
- Divisive clustering
- Cluster labeling
- Implementation notes
- Linear algebra review
- Matrix decompositions
- Term-document matrices and singular value decompositions
- Low-rank approximations
- Latent semantic indexing
- Background and history
- Web characteristics
- The web graph
- Spam
- Advertising as the economic model
- The search user experience
- User query needs
- Index size and estimation
- Near-duplicates and shingling
- Crawling
- Crawler architecture
- DNS resolution
- The URL frontier
- Distributing indexes
- Connectivity servers
- The Web as a graph
- Anchor text and the web graph
- PageRank
- Markov chains
- The PageRank computation
- Topic-specific PageRank
- Hubs and Authorities
- Choosing the subset of the Web
Boolean retrieval
- The term vocabulary and postings lists
- Document delineation and character sequence decoding
- Determining the vocabulary of terms
- Tokenization
- Dropping common terms: stop words
- Normalization (equivalence classing of terms)
- Stemming and lemmatization
- Faster postings list intersection via skip pointers
- Positional postings and phrase queries
- Biword indexes
- Positional indexes
- Combination schemes
Dictionaries and tolerant retrieval
- Search structures for dictionaries
- Wildcard queries
- General wildcard queries
- k-gram indexes for wildcard queries
- Spelling correction
- Implementing spelling correction
- Forms of spelling correction
- Edit distance
- k-gram indexes for spelling correction
- Context sensitive spelling correction
- Phonetic correction
Index construction
- Hardware basics
- Blocked sort-based indexing
- Single-pass in-memory indexing
- Distributed indexing
- Dynamic indexing
- Other types of indexes
Index compression
- Statistical properties of terms in information retrieval
- Heaps’ law: Estimating the number of terms
- Zipf’s law: Modeling the distribution of terms
- Dictionary compression
- Dictionary as a string
- Blocked storage
- Postings file compression
- Variable byte codes
- γ codes
Scoring, term weighting and the vector space model
- Parametric and zone indexes
- Weighted zone scoring
- Learning weights
- The optimal weight g
- Term frequency and weighting
- nverse document frequency
- f-idf weighting
- The vector space model for scoring
- Dot products
- Queries as vectors
- Computing vector scores
- Variant tf-idf functions
- Sublinear tf scaling
- Maximum tf normalization
- Document and query weighting schemes
- Pivoted normalized document length
Computing scores in a complete search system
- Efficient scoring and ranking
- Inexact top K document retrieval
- Index elimination
- Champion lists
- Static quality scores and ordering
- Impact ordering
- Cluster pruning
- Components of an information retrieval system
- Tiered indexes
- Query-term proximity
- Designing parsing and scoring functions
- Putting it all together
- Vector space scoring and query operator interaction
Evaluation in information retrieval
- Information retrieval system evaluation
- Standard test collections
- Evaluation of unranked retrieval sets
- Evaluation of ranked retrieval results
- Assessing relevance
- Critiques and justifications of the concept of relevance
- A broader perspective: System quality and user utility
- System issues
- User utility
- Refining a deployed system
- Results snippets
Relevance feedback and query expansion
- Relevance feedback and pseudo relevance feedback
- The Rocchio algorithm for relevance feedback
- Probabilistic relevance feedback
- When does relevance feedback work?
- Relevance feedback on the web
- Evaluation of relevance feedback strategies
- Pseudo relevance feedback
- Indirect relevance feedback
- Global methods for query reformulation
- Vocabulary tools for query reformulation
- Query expansion
- Automatic thesaurus generation
XML retrieval
- Basic XML concepts
- Challenges in XML retrieval
- A vector space model for XML retrieval
- Evaluation of XML retrieval
- Text-centric vs. data-centric XML retrieval
Probabilistic information retrieval
- Review of basic probability theory
- The Probability Ranking Principle
- The 1/0 loss case
- The PRP with retrieval costs
- The Binary Independence Model
- Deriving a ranking function for query terms
- Probability estimates in theory
- Probability estimates in practice
- Probabilistic approaches to relevance feedback
- An appraisal and some extensions
- An appraisal of probabilistic models
- Tree-structured dependencies between terms
- Okapi BM25: a non-binary model
- Bayesian network approaches to IR
Language models for information retrieval
- Language models
- Finite automata and language models
- Types of language models
- Multinomial distributions over words
- The query likelihood model
- Using query likelihood language models in IR
- Estimating the query generation probability
- Ponte and Croft’s Experiments
- Language modeling versus other approaches in IR
- Extended language modeling approaches
Text classification and Naive Bayes
- The text classification problem
- Naive Bayes text classification
- Relation to multinomial unigram language model
- The Bernoulli model
- Properties of Naive Bayes
- A variant of the multinomial model
- Feature selection
- Mutual information
- χ2 Feature selection
- Frequency-based feature selection
- Feature selection for multiple classifiers
- Comparison of feature selection methods
- Evaluation of text classification
Vector space classification
- Document representations and measures of relatedness in vector spaces
- Rocchio classification
- k nearest neighbor
- Time complexity and optimality of kNN
- Linear versus nonlinear classifiers
- Classification with more than two classes
- The bias-variance tradeoff
Support vector machines and machine learning on documents
- Support vector machines: The linearly separable case
- Extensions to the SVM model
- Soft margin classification
- Multiclass SVMs
- Nonlinear SVMs
- Experimental results
- Issues in the classification of text documents
- Choosing what kind of classifier to use
- Improving classifier performance
- Machine learning methods in ad hoc information retrieval
- A simple example of machine-learned scoring
- Result ranking by machine learning
Flat clustering
- Clustering in information retrieval
- Problem statement
- Cardinality – the number of clusters
- Evaluation of clustering
- K-means
- Cluster cardinality in K-means
- Model-based clustering
Hierarchical clustering
- Hierarchical agglomerative clustering
- Single-link and complete-link clustering
- Time complexity of HAC
- Group-average agglomerative clustering
- Centroid clustering
- Optimality of HAC
- Divisive clustering
- Cluster labeling
- Implementation notes
Matrix decompositions and latent semantic indexing
- Linear algebra review
- Matrix decompositions
- Term-document matrices and singular value decompositions
- Low-rank approximations
- Latent semantic indexing
Web search basics
- Background and history
- Web characteristics
- The web graph
- Spam
- Advertising as the economic model
- The search user experience
- User query needs
- Index size and estimation
- Near-duplicates and shingling
Web crawling and indexes
- Crawling
- Crawler architecture
- DNS resolution
- The URL frontier
- Distributing indexes
- Connectivity servers
Link analysis
- The Web as a graph
- Anchor text and the web graph
- PageRank
- Markov chains
- The PageRank computation
- Topic-specific PageRank
- Hubs and Authorities
- Choosing the subset of the Web