tf–idf, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf–idf value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which adjusts for the fact that some words appear more frequently in general.
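Below is a minimal sketch of computing tf–idf weights, assuming scikit-learn is available; the example documents are invented for illustration.

```python
# Compute tf-idf weights for a toy corpus with scikit-learn's TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # sparse matrix: documents x terms

# Terms that appear in several documents ("the", "cat") get a lower idf than
# terms concentrated in a single document ("mat", "pets").
for term, idx in sorted(vectorizer.vocabulary_.items()):
    print(f"{term}: idf={vectorizer.idf_[idx]:.3f}")
```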
A recommender system is a subclass of information filtering systems meant to predict the preferences or ratings that a user would give to a product. Recommender systems are widely used for movies, news, research articles, products, social tags, music, etc.
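As a rough illustration of how such a prediction can be made, here is a minimal sketch of user-based collaborative filtering, one common approach; the rating matrix is made up for illustration.

```python
# Predict a missing user-item rating as a similarity-weighted average of
# other users' ratings for that item (user-based collaborative filtering).
import numpy as np

# Rows = users, columns = items; 0 means "not rated yet".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 3, 1],
    [1, 1, 0, 5],
], dtype=float)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target_user, target_item = 0, 2
sims, weighted = [], []
for u in range(ratings.shape[0]):
    if u != target_user and ratings[u, target_item] > 0:
        s = cosine_sim(ratings[target_user], ratings[u])
        sims.append(s)
        weighted.append(s * ratings[u, target_item])

print("Predicted rating:", sum(weighted) / sum(sims))
```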
Both regression and classification come under supervised machine learning. In a supervised machine learning algorithm, we have to train the model on a labeled dataset: during training we explicitly provide the correct labels, and the algorithm tries to learn the pattern from input to output. If the labels are discrete values (e.g. A, B, etc.), it is a classification problem; if the labels are continuous values (e.g. 1.23, 1.333, etc.), it is a regression problem.
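A minimal sketch of the distinction, using scikit-learn and synthetic data (both the data and the model choices are just for illustration):

```python
# Classification predicts discrete labels; regression predicts continuous values.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Classification: labels are discrete classes such as "A" and "B".
y_class = np.array(["A", "A", "B", "B"])
clf = LogisticRegression().fit(X, y_class)
print(clf.predict([[2.5]]))   # -> a class label

# Regression: labels are continuous values such as 1.23, 1.333, ...
y_reg = np.array([1.23, 1.333, 2.1, 2.9])
reg = LinearRegression().fit(X, y_reg)
print(reg.predict([[2.5]]))   # -> a continuous value
```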
First of all, ask which ML model you want to train; a short sketch of both approaches follows below.

For neural networks, batching with a NumPy array works:

1. Memory-map the dataset as a NumPy array (for example with np.load(..., mmap_mode="r") or np.memmap); the array maps the file on disk instead of loading the complete dataset into memory.
2. Index the array to pull only the required rows.
3. Pass that slice to the neural network.
4. Keep the batch size small.

For SVM, partial_fit works:

1. Split the one big dataset into small subsets.
2. Call the model's partial_fit method on a subset of the complete dataset.
3. Repeat step 2 for the remaining subsets.
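Here is a minimal sketch of both approaches. The file names, shapes, and label set are placeholders, and scikit-learn's SGDClassifier(loss="hinge") stands in for the SVM, since sklearn's SVC does not expose partial_fit while the hinge-loss SGD model is a linear SVM that does.

```python
# Train incrementally on a dataset too large to fit in memory.
import numpy as np
from sklearn.linear_model import SGDClassifier

# Memory-map the data: the files are mapped, not read fully into RAM.
X = np.load("features.npy", mmap_mode="r")   # hypothetical file
y = np.load("labels.npy", mmap_mode="r")     # hypothetical file

batch_size = 256
classes = np.array([0, 1])                   # assumed label set, known up front

clf = SGDClassifier(loss="hinge")            # linear SVM trained chunk by chunk
for start in range(0, X.shape[0], batch_size):
    # Indexing the memory-mapped array loads only this slice into memory.
    xb = np.asarray(X[start:start + batch_size])
    yb = np.asarray(y[start:start + batch_size])
    # A neural network would consume xb/yb in one training step here;
    # for the SVM-style model we call partial_fit on each chunk instead.
    clf.partial_fit(xb, yb, classes=classes)
```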
When you perform a hypothesis test in statistics, a p-value can help you determine the strength of your results. The p-value is a number between 0 and 1, and its value denotes the strength of the evidence. The claim that is on trial is called the null hypothesis.

A low p-value (≤ 0.05) indicates strong evidence against the null hypothesis, so we reject the null hypothesis. A high p-value (> 0.05) indicates weak evidence against the null hypothesis, so we fail to reject it. A p-value close to 0.05 is marginal and could go either way. To put it another way: with a high p-value, your data are likely under a true null; with a low p-value, your data are unlikely under a true null.
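As a concrete example, here is a minimal sketch of reading a p-value from a one-sample t-test with SciPy; the sample data and the hypothesized mean are invented for illustration.

```python
# One-sample t-test: is the sample mean consistent with a hypothesized mean?
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.9, 5.3, 5.6, 4.8, 5.2, 5.4, 5.0])

# Null hypothesis: the true mean is 5.0.
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# p <= 0.05 -> evidence against the null, reject it.
# p >  0.05 -> data are consistent with the null, fail to reject it.
if p_value <= 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```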