The field of information retrieval is concerned with finding relevant electronic documents based on a query. For example, given a group of keywords, a search engine retrieves Web pages (documents) and displays them in order, with the most relevant documents listed first. This technology requires a way to compare each document with the query to determine which documents are most relevant.
A simple way to make this comparison is to compute the binary cosine coefficient. The coefficient is a value between 0 and 1, where 1 indicates that the query and the document contain exactly the same words and 0 indicates that the query has no keywords in common with the document. This approach treats each document as a set of words. For example, consider the following sample document:
“Cows are big. Cows go moo. I love cows.”
This document would be parsed into keywords, ignoring case and discarding punctuation, which yields the set {cows, are, big, go, moo, i, love}. The query is processed in exactly the same way.
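The parsing step described above could be sketched roughly as follows. The function name toWordSet is hypothetical and not part of the exercise, and the sketch assumes that every non-alphanumeric character acts as a word separator, so a word such as "don't" would be split in two.

```cpp
#include <cctype>
#include <set>
#include <string>

// Hypothetical helper (not from the exercise text): splits text into
// lowercase words, treating every non-alphanumeric character as a separator.
std::set<std::string> toWordSet(const std::string& text) {
    std::set<std::string> words;
    std::string current;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        unsigned char ch = static_cast<unsigned char>(text[i]);
        if (std::isalnum(ch)) {
            // Ignore case by storing the lowercase form
            current += static_cast<char>(std::tolower(ch));
        } else if (!current.empty()) {
            words.insert(current);   // a separator ends the current word
            current.clear();
        }
    }
    if (!current.empty()) {
        words.insert(current);       // flush the final word, if any
    }
    return words;
}
```

Because a set stores each value only once, the three occurrences of "cows" in the sample document collapse into a single element.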
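Once we have a query Q represented as a set of words and a document D represented as a set of words, the similarity between the query and document is computed by

$$\mathrm{sim}(Q, D) = \frac{|Q \cap D|}{\sqrt{|Q|}\,\sqrt{|D|}}$$

where |Q ∩ D| is the number of words the two sets have in common, and |Q| and |D| are the numbers of words in the query and the document.
For example, if D = {cows, are, big, go, moo, i, love} and Q = {love, holstein, cows} then Q ∩ D = {love, cows}, |Q| = 3, |D| = 7, and

$$\mathrm{sim}(Q, D) = \frac{2}{\sqrt{3}\,\sqrt{7}} \approx 0.436$$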
Write a program that allows the user to input a set of strings that represents a document and a set of strings that represents a query. (If you are more ambitious, you could write a program that parses an actual text file and computes the set of unique strings.) Represent the document and query as an STL set of strings. Then compute and print out the similarity between the query and document using the binary cosine coefficient. The sqrt function is in cmath. Use the generic set_intersection function to compute the intersection of Q and D.
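For the input step, one possible approach is to read a whole line and split it on whitespace. This is only a sketch: the helper name readWordSet is hypothetical, and it assumes the user types all the words of one set on a single line.

```cpp
#include <iostream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: reads one line from `in` and returns its
// whitespace-separated words as a set (duplicates collapse automatically).
std::set<std::string> readWordSet(std::istream& in) {
    std::set<std::string> words;
    std::string line;
    if (std::getline(in, line)) {
        std::istringstream tokens(line);
        std::string word;
        while (tokens >> word) {
            words.insert(word);
        }
    }
    return words;   // empty if no line could be read
}
```

A program could call readWordSet(std::cin) once for the document and once for the query, printing a prompt before each call.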
Here is an example of set_intersection to intersect set A with B and store the result in C, where all sets are sets of strings:
#include <iterator>
#include <algorithm>
#include <set>
#include <string>
...
using std::insert_iterator;
using std::set;
using std::set_intersection;
using std::string;
set<string> A, B, C;
// Code below assumes strings have been inserted into A and B
// Note the space between > > in the line below (required before C++11)
insert_iterator<set<string> > cIterator(C, C.begin());
set_intersection(A.begin(), A.end(),
                 B.begin(), B.end(),
                 cIterator);
// set C now contains the intersection of A and B
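Building on the set_intersection example, the coefficient itself might be computed along these lines. The function name binaryCosine is hypothetical, and returning 0.0 when either set is empty is an assumption made here, since the exercise does not say what should happen in that case. The sketch uses std::inserter, the standard helper that builds an insert_iterator.

```cpp
#include <algorithm>
#include <cmath>
#include <iterator>
#include <set>
#include <string>

// Hypothetical helper: binary cosine coefficient of two word sets,
// |Q intersect D| / (sqrt(|Q|) * sqrt(|D|)).
double binaryCosine(const std::set<std::string>& q,
                    const std::set<std::string>& d) {
    // Assumption: treat an empty query or document as having no similarity,
    // which also avoids dividing by zero.
    if (q.empty() || d.empty()) {
        return 0.0;
    }
    std::set<std::string> common;
    std::set_intersection(q.begin(), q.end(),
                          d.begin(), d.end(),
                          std::inserter(common, common.begin()));
    return common.size() /
           (std::sqrt(static_cast<double>(q.size())) *
            std::sqrt(static_cast<double>(d.size())));
}
```

With Q = {love, holstein, cows} and D = {cows, are, big, go, moo, i, love}, the two sets share two words, so the function returns 2 / (sqrt(3) * sqrt(7)), which is approximately 0.436.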