Information Retrival - Natural Launguage Processing - Data Science

Information Retrival - Natural Launguage Processing (NLP)

Please leave a remark at the bottom of each page with your useful suggestion.

Introduction

Information retrieval (IR) is ﬁnding material (usually documents) of an unstructured nature (usually text) that satisﬁes an information need from within large collections (usually stored on computers).

Information retrieval

Boolean retrieval
The term vocabulary and postings lists
- Document delineation and character sequence decoding
- Determining the vocabulary of terms
  - Tokenization
  - Dropping common terms: stop words
  - Normalization (equivalence classing of terms)
  - Stemming and lemmatization
- Faster postings list intersection via skip pointers
- Positional postings and phrase queries
  - Biword indexes
  - Positional indexes
  - Combination schemes
Dictionaries and tolerant retrieval
- Search structures for dictionaries
- Wildcard queries
  - General wildcard queries
  - k-gram indexes for wildcard queries
- Spelling correction
  - Implementing spelling correction
  - Forms of spelling correction
  - Edit distance
  - k-gram indexes for spelling correction
  - Context sensitive spelling correction
- Phonetic correction
Index construction
- Hardware basics
- Blocked sort-based indexing
- Single-pass in-memory indexing
- Distributed indexing
- Dynamic indexing
- Other types of indexes
Index compression
- Statistical properties of terms in information retrieval
  - Heaps’ law: Estimating the number of terms
  - Zipf’s law: Modeling the distribution of terms
- Dictionary compression
  - Dictionary as a string
  - Blocked storage
- Postings ﬁle compression
  - Variable byte codes
  - γ codes
Scoring, term weighting and the vector space model
- Parametric and zone indexes
  - Weighted zone scoring
  - Learning weights
  - The optimal weight g
- Term frequency and weighting
  - nverse document frequency
  - f-idf weighting
- The vector space model for scoring
  - Dot products
  - Queries as vectors
  - Computing vector scores
- Variant tf-idf functions
  - Sublinear tf scaling
  - Maximum tf normalization
  - Document and query weighting schemes
  - Pivoted normalized document length
Computing scores in a complete search system
- Efﬁcient scoring and ranking
  - Inexact top K document retrieval
  - Index elimination
  - Champion lists
  - Static quality scores and ordering
  - Impact ordering
  - Cluster pruning
- Components of an information retrieval system
  - Tiered indexes
  - Query-term proximity
  - Designing parsing and scoring functions
  - Putting it all together
- Vector space scoring and query operator interaction
Evaluation in information retrieval
- Information retrieval system evaluation
- Standard test collections
- Evaluation of unranked retrieval sets
- Evaluation of ranked retrieval results
- Assessing relevance
  - Critiques and justiﬁcations of the concept of relevance
- A broader perspective: System quality and user utility
  - System issues
  - User utility
  - Reﬁning a deployed system
- Results snippets
Relevance feedback and query expansion
- Relevance feedback and pseudo relevance feedback
  - The Rocchio algorithm for relevance feedback
  - Probabilistic relevance feedback
  - When does relevance feedback work?
  - Relevance feedback on the web
  - Evaluation of relevance feedback strategies
  - Pseudo relevance feedback
  - Indirect relevance feedback
- Global methods for query reformulation
  - Vocabulary tools for query reformulation
  - Query expansion
  - Automatic thesaurus generation
XML retrieval
- Basic XML concepts
- Challenges in XML retrieval
- A vector space model for XML retrieval
- Evaluation of XML retrieval
- Text-centric vs. data-centric XML retrieval
Probabilistic information retrieval
- Review of basic probability theory
- The Probability Ranking Principle
  - The 1/0 loss case
  - The PRP with retrieval costs
- The Binary Independence Model
  - Deriving a ranking function for query terms
  - Probability estimates in theory
  - Probability estimates in practice
  - Probabilistic approaches to relevance feedback
- An appraisal and some extensions
  - An appraisal of probabilistic models
  - Tree-structured dependencies between terms
  - Okapi BM25: a non-binary model
  - Bayesian network approaches to IR
Language models for information retrieval
- Language models
  - Finite automata and language models
  - Types of language models
  - Multinomial distributions over words
- The query likelihood model
  - Using query likelihood language models in IR
  - Estimating the query generation probability
  - Ponte and Croft’s Experiments
- Language modeling versus other approaches in IR
- Extended language modeling approaches
Text classiﬁcation and Naive Bayes
- The text classiﬁcation problem
- Naive Bayes text classiﬁcation
  - Relation to multinomial unigram language model
- The Bernoulli model
- Properties of Naive Bayes
  - A variant of the multinomial model
- Feature selection
  - Mutual information
  - χ2 Feature selection
  - Frequency-based feature selection
  - Feature selection for multiple classiﬁers
  - Comparison of feature selection methods
- Evaluation of text classiﬁcation
Vector space classiﬁcation
- Document representations and measures of relatedness in vector spaces
- Rocchio classiﬁcation
- k nearest neighbor
  - Time complexity and optimality of kNN
- Linear versus nonlinear classiﬁers
- Classiﬁcation with more than two classes
- The bias-variance tradeoff
Support vector machines and machine learning on documents
- Support vector machines: The linearly separable case
- Extensions to the SVM model
  - Soft margin classiﬁcation
  - Multiclass SVMs
  - Nonlinear SVMs
  - Experimental results
- Issues in the classiﬁcation of text documents
  - Choosing what kind of classiﬁer to use
  - Improving classiﬁer performance
- Machine learning methods in ad hoc information retrieval
  - A simple example of machine-learned scoring
  - Result ranking by machine learning
Flat clustering
- Clustering in information retrieval
- Problem statement
  - Cardinality – the number of clusters
- Evaluation of clustering
- K-means
  - Cluster cardinality in K-means
- Model-based clustering
Hierarchical clustering
- Hierarchical agglomerative clustering
- Single-link and complete-link clustering
  - Time complexity of HAC
- Group-average agglomerative clustering
- Centroid clustering
- Optimality of HAC
- Divisive clustering
- Cluster labeling
- Implementation notes
Matrix decompositions and latent semantic indexing
- Linear algebra review
  - Matrix decompositions
- Term-document matrices and singular value decompositions
- Low-rank approximations
- Latent semantic indexing
Web search basics
- Background and history
- Web characteristics
  - The web graph
  - Spam
- Advertising as the economic model
- The search user experience
  - User query needs
- Index size and estimation
- Near-duplicates and shingling
Web crawling and indexes
- Crawling
  - Crawler architecture
  - DNS resolution
  - The URL frontier
- Distributing indexes
- Connectivity servers
Link analysis
- The Web as a graph
  - Anchor text and the web graph
- PageRank
  - Markov chains
  - The PageRank computation
  - Topic-speciﬁc PageRank
- Hubs and Authorities
  - Choosing the subset of the Web

Boolean retrieval

The term vocabulary and postings lists
- Document delineation and character sequence decoding
- Determining the vocabulary of terms
  - Tokenization
  - Dropping common terms: stop words
  - Normalization (equivalence classing of terms)
  - Stemming and lemmatization
- Faster postings list intersection via skip pointers
- Positional postings and phrase queries
  - Biword indexes
  - Positional indexes
  - Combination schemes

Dictionaries and tolerant retrieval

Search structures for dictionaries
Wildcard queries
- General wildcard queries
- k-gram indexes for wildcard queries
Spelling correction
- Implementing spelling correction
- Forms of spelling correction
- Edit distance
- k-gram indexes for spelling correction
- Context sensitive spelling correction
Phonetic correction

Index construction

Hardware basics
Blocked sort-based indexing
Single-pass in-memory indexing
Distributed indexing
Dynamic indexing
Other types of indexes

Index compression

Statistical properties of terms in information retrieval
- Heaps’ law: Estimating the number of terms
- Zipf’s law: Modeling the distribution of terms
Dictionary compression
- Dictionary as a string
- Blocked storage
Postings ﬁle compression
- Variable byte codes
- γ codes

Scoring, term weighting and the vector space model

Parametric and zone indexes
- Weighted zone scoring
- Learning weights
- The optimal weight g
Term frequency and weighting
- nverse document frequency
- f-idf weighting
The vector space model for scoring
- Dot products
- Queries as vectors
- Computing vector scores
Variant tf-idf functions
- Sublinear tf scaling
- Maximum tf normalization
- Document and query weighting schemes
- Pivoted normalized document length

Computing scores in a complete search system

Efﬁcient scoring and ranking
- Inexact top K document retrieval
- Index elimination
- Champion lists
- Static quality scores and ordering
- Impact ordering
- Cluster pruning
Components of an information retrieval system
- Tiered indexes
- Query-term proximity
- Designing parsing and scoring functions
- Putting it all together
Vector space scoring and query operator interaction

Evaluation in information retrieval

Information retrieval system evaluation
Standard test collections
Evaluation of unranked retrieval sets
Evaluation of ranked retrieval results
Assessing relevance
- Critiques and justiﬁcations of the concept of relevance
A broader perspective: System quality and user utility
- System issues
- User utility
- Reﬁning a deployed system
Results snippets

Relevance feedback and query expansion

Relevance feedback and pseudo relevance feedback
- The Rocchio algorithm for relevance feedback
- Probabilistic relevance feedback
- When does relevance feedback work?
- Relevance feedback on the web
- Evaluation of relevance feedback strategies
- Pseudo relevance feedback
- Indirect relevance feedback
Global methods for query reformulation
- Vocabulary tools for query reformulation
- Query expansion
- Automatic thesaurus generation

XML retrieval

Basic XML concepts
Challenges in XML retrieval
A vector space model for XML retrieval
Evaluation of XML retrieval
Text-centric vs. data-centric XML retrieval

Probabilistic information retrieval

Review of basic probability theory
The Probability Ranking Principle
- The 1/0 loss case
- The PRP with retrieval costs
The Binary Independence Model
- Deriving a ranking function for query terms
- Probability estimates in theory
- Probability estimates in practice
- Probabilistic approaches to relevance feedback
An appraisal and some extensions
- An appraisal of probabilistic models
- Tree-structured dependencies between terms
- Okapi BM25: a non-binary model
- Bayesian network approaches to IR

Language models for information retrieval

Language models
- Finite automata and language models
- Types of language models
- Multinomial distributions over words
The query likelihood model
- Using query likelihood language models in IR
- Estimating the query generation probability
- Ponte and Croft’s Experiments
Language modeling versus other approaches in IR
Extended language modeling approaches

Text classiﬁcation and Naive Bayes

The text classiﬁcation problem
Naive Bayes text classiﬁcation
- Relation to multinomial unigram language model
The Bernoulli model
Properties of Naive Bayes
- A variant of the multinomial model
Feature selection
- Mutual information
- χ2 Feature selection
- Frequency-based feature selection
- Feature selection for multiple classiﬁers
- Comparison of feature selection methods
Evaluation of text classiﬁcation

Vector space classiﬁcation

Document representations and measures of relatedness in vector spaces
Rocchio classiﬁcation
k nearest neighbor
- Time complexity and optimality of kNN
Linear versus nonlinear classiﬁers
Classiﬁcation with more than two classes
The bias-variance tradeoff

Support vector machines and machine learning on documents

Support vector machines: The linearly separable case
Extensions to the SVM model
- Soft margin classiﬁcation
- Multiclass SVMs
- Nonlinear SVMs
- Experimental results
Issues in the classiﬁcation of text documents
- Choosing what kind of classiﬁer to use
- Improving classiﬁer performance
Machine learning methods in ad hoc information retrieval
- A simple example of machine-learned scoring
- Result ranking by machine learning

Flat clustering

Clustering in information retrieval
Problem statement
- Cardinality – the number of clusters
Evaluation of clustering
K-means
- Cluster cardinality in K-means
Model-based clustering

Hierarchical clustering

Hierarchical agglomerative clustering
Single-link and complete-link clustering
- Time complexity of HAC
Group-average agglomerative clustering
Centroid clustering
Optimality of HAC
Divisive clustering
Cluster labeling
Implementation notes

Matrix decompositions and latent semantic indexing

Linear algebra review
- Matrix decompositions
Term-document matrices and singular value decompositions
Low-rank approximations
Latent semantic indexing

Web search basics

Background and history
Web characteristics
- The web graph
- Spam
Advertising as the economic model
The search user experience
- User query needs
Index size and estimation
Near-duplicates and shingling

Web crawling and indexes

Crawling
- Crawler architecture
- DNS resolution
- The URL frontier
Distributing indexes
Connectivity servers

Link analysis

The Web as a graph
- Anchor text and the web graph
PageRank
- Markov chains
- The PageRank computation
- Topic-speciﬁc PageRank
Hubs and Authorities
- Choosing the subset of the Web

Write Your Comments or Suggestion...