TF*IDF, or Term Frequency*Inverse Document Frequency, is a ranking factor that Google uses to analyze the content of websites. This metric helps identify the important keywords or phrases in a website’s content.
It indicates the frequency with which the word appears in the document and the added value that the key term brings to the content of the site. In SEO, it allows you to go beyond simple keywords and produce relevant content that will reach your audience.
The battle for higher website rankings in Google SERPs has been going on for years, and SEO experts are always looking for strategies that can help them dominate the competition.
One of the oldest SEO practices that helps SEOs rank their sites quickly is keyword research.
This practice is the basis for producing relevant content. Whether it is the homepage of a site, subpages, product pages or category pages, relevant content written on the basis of keyword research is the key to ranking in search results and standing out from the competition.
One of the best-known methods in keyword research is the calculation of the TF*IDF measure. In this article, you will find out what term frequency and inverse document frequency mean and how they benefit search engine optimization.
Chapter 1: What do term frequency and inverse document frequency mean?
TF*IDF stands for Term Frequency*Inverse Document Frequency. It is one of the measures representing the basis of web page ranking in Google SERPs. In digital marketing, SEO experts use this strategy to determine the topics they need to address to rank their site in search results.
To better understand the concept of Term Frequency and Inverse Document Frequency, I will break the acronym down and explain each part separately.
1.1. What is term frequency?
Term frequency refers to the number of occurrences of a word, phrase, or sentence in a piece of content. In the context of search engine optimization, this strategy consists of evaluating the number of times a keyword is repeated in an article on a web page.
Digital marketers use this measure to track keyword density in optimized content, that is, how often a keyword appears relative to the length of the content.
For example, if you write an article about website traffic and the main keyword “Qualified Traffic” is repeated four times in the article, then the term frequency for that article is four.
In reality, term frequency alone is not enough to optimize your site properly or to gauge your chances of success. However, SEOs divide the term frequency by the total number of words in the article to obtain the keyword density of the content.
This “keyword density” indicator is often found in SEO tools like Yoast. Let’s assume that the main keyword “Qualified Traffic” appears four times in a 300-word article. Calculating 4/300 and multiplying by 100 gives a value of 1.33.
So, the density of the keyword “Qualified Traffic” in a 300-word article is 1.33%. However, when this same keyword appears four times in a 3,000-word article, its density drops to 0.13%. We can therefore conclude that keyword density can differ greatly from one piece of content to another even when the term frequency remains the same.
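The density arithmetic above can be reproduced in a few lines of Python. This is only a minimal sketch: the function name `keyword_density` is an illustrative assumption, not something taken from Yoast or any other tool.

```python
def keyword_density(occurrences: int, total_words: int) -> float:
    """Return keyword density as a percentage of the total word count."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return occurrences / total_words * 100

# Same term frequency (4), two different article lengths.
short_article = keyword_density(4, 300)
long_article = keyword_density(4, 3000)
print(f"{short_article:.2f}% vs {long_article:.2f}%")  # prints "1.33% vs 0.13%"
```

The term frequency is identical in both calls; only the article length changes the density.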
1.2. What is inverse document frequency?
Inverse document frequency is a formula that reduces the value of keywords that appear in many documents and increases the value of rarer terms and phrases. Basically, IDF gives you a clear idea of which terms in your article carry more value and weight.
If we continue with the example of the keyword “Qualified Traffic”, a web writer will naturally use other words such as visit, ideal customer or conversion, which form a lexical field around the main keyword. According to the inverse document frequency theory, these words can carry more weight or value in the content than the main keyword “Qualified Traffic”.
1.3. Origin of term frequency and inverse document frequency (TF*IDF)
One of the basic rules by which the first search engines ranked sites in the search results was the frequency of keywords in a web page’s content. This was especially true of older search engines like AltaVista, WebCrawler and Infoseek, which placed very high importance on the recurrence of keywords on web pages.
Under this rule, the more often a key term appeared in a piece of content, the more relevant the algorithms considered it, and the better the web page could be positioned in the SERPs. The Term Frequency formula was therefore adopted to allow the algorithms of these older search engines to evaluate how often a keyword appeared on a web page or a set of pages.
However, term frequency, which is somewhat similar to keyword density, is not sufficient on its own to assess the relevance of a web page. Information retrieval research had already introduced a concept to fill this gap back in 1972.
This is the famous concept of Inverse Document Frequency (IDF), invented by the English researcher Karen Spärck Jones. This measure takes into account the number of documents that contain a given term or keyword within the entire corpus studied.
1.3.1. Invention of the first TF*IDF formula
The first formula for calculating the Term Frequency*Inverse Document Frequency (TF*IDF) was introduced in 1975 by the well-known researcher Gerard Salton. Salton pushed search technology forward by finding a formula that combines the TF and the IDF (TF*IDF).
On the one hand, this formula assigns a “weight” or value to a term found in a document. On the other hand, the value found for the term makes it possible to judge whether the document is relevant enough to be ranked in the search results for a keyword query.
1.3.2. The Okapi BM25 formula
The first TF*IDF formula worked well and allowed search engine algorithms to return reasonably accurate results for different queries. However, several variants were derived from this first formula and tested for analyzing the relevance of search results.
Among these variants is “Okapi BM25”, a ranking function from probabilistic information retrieval that adds document length normalization to the TF*IDF logic. This variant is widely considered one of the most precise and satisfactory for evaluating the relevance of a document.
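To give a rough idea of what a BM25 score looks like in practice, here is a minimal sketch in Python. The parameter values `k1=1.5` and `b=0.75` are commonly used defaults assumed here, the smoothed IDF is one common variant among several, and the three sample documents are invented for illustration. It is not a description of how any search engine implements it.

```python
import math

def bm25_score(query_terms, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with an Okapi BM25 variant."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        docs_with_term = sum(1 for d in corpus if term in d)
        # Smoothed IDF; the +1 inside the log keeps the value non-negative.
        idf = math.log((n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1)
        freq = doc_tokens.count(term)
        # Length normalization: longer-than-average documents are penalized.
        norm = freq + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * freq * (k1 + 1) / norm
    return score

# Hypothetical mini-corpus, already split into lowercase tokens.
corpus = [
    "qualified traffic brings visitors who convert".split(),
    "traffic jam on the highway this morning".split(),
    "how to attract qualified visitors to a website".split(),
]
scores = [bm25_score(["qualified", "traffic"], doc, corpus) for doc in corpus]
print(scores)
```

The first document contains both query terms, so it receives the highest score of the three.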
1.4. How to calculate the Term Frequency (TF) and the Inverse Document Frequency (IDF)?
At first sight, the formula may seem complicated or difficult to apply. Here is an explanation of the TF*IDF calculation formula and its application.
1.4.1. Calculation of the Term Frequency (TF)
The main purpose of the term frequency calculation is to determine how often a keyword recurs relative to the other words in a piece of content. The formula involves a logarithm, which dampens the effect of raw counts when weighting the word.
Calculating the frequency of a key term (x) in a piece of content (y) amounts to counting how often the word appears and dividing this value by the total number of words in the document. The base-2 logarithm (“Log 2”) is applied to both values of the fraction to give a result that expresses the relevance of the key term.
Whether it is a question of determining the density of the word or its frequency in a content, it should be noted that the logarithm is always applied to both values of the fraction.
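The log-scaled calculation described above could be sketched as follows. The function name is hypothetical, and the `+ 1` in the numerator is an assumption of this sketch (it keeps the result defined when the term does not appear at all). The exact formula used by a given tool may differ.

```python
import math

def log_term_frequency(occurrences: int, total_words: int) -> float:
    """Log-scaled term frequency: log2(occurrences + 1) / log2(total_words)."""
    if total_words < 2:
        raise ValueError("total_words must be at least 2")
    # Assumption: +1 in the numerator so that zero occurrences gives 0.
    return math.log2(occurrences + 1) / math.log2(total_words)

tf_short = log_term_frequency(4, 300)    # keyword used 4 times in 300 words
tf_long = log_term_frequency(4, 3000)    # same count in a 3,000-word article
print(f"{tf_short:.3f} {tf_long:.3f}")
```

Because of the logarithm, a tenfold increase in article length lowers the value far less sharply than it lowers the plain keyword density.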
1.4.2. How to calculate the Inverse Document Frequency (IDF)?
The Term Frequency (TF) is limited to the frequency of a keyword on a single web page. The IDF, or Inverse Document Frequency, goes beyond this limit: it looks at all the pages of a site and measures how many of them contain the term.
To determine the IDF of a key term (x) on a site, divide the total number of pages on the site by the number of pages containing the key term (x), then take the logarithm of the result. Some versions of the formula add 1 to the ratio before taking the logarithm; the simplified formula below omits it.
The formula for calculating the IDF is therefore as follows:
IDF = Log e (Total number of pages / Number of pages containing the key term)
Let’s consider the example of the keyword “Qualified Traffic” to apply the inverse document frequency formula. Out of a total of 1,000,000 pages, we assume that 405,000 contain the keyword “Qualified Traffic”. Then the inverse document frequency gives the following:
IDF (Qualified Traffic) = Log e (1,000,000/405,000) ≈ 0.90
With a base-10 logarithm, the same ratio gives about 0.39.
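As a quick check on the arithmetic, the sketch below computes the IDF for the “Qualified Traffic” example with Python’s standard `math` module, taking 1,000,000 pages in total and 405,000 containing the keyword. The function name and the `base` parameter are illustrative assumptions. Since the result depends on the logarithm base, both the natural and the base-10 values are computed.

```python
import math

def inverse_document_frequency(total_pages: int, pages_with_term: int,
                               base: float = math.e) -> float:
    """IDF = log(total_pages / pages_with_term) in the given base."""
    if not 0 < pages_with_term <= total_pages:
        raise ValueError("pages_with_term must be between 1 and total_pages")
    return math.log(total_pages / pages_with_term, base)

idf_natural = inverse_document_frequency(1_000_000, 405_000)      # about 0.90
idf_base10 = inverse_document_frequency(1_000_000, 405_000, 10)   # about 0.39
idf_rare = inverse_document_frequency(1_000_000, 1_000)           # about 6.91
print(f"{idf_natural:.2f} {idf_base10:.2f} {idf_rare:.2f}")
```

The last call shows the point of the measure: a term found on only 1,000 pages gets a much higher IDF than one found on 405,000.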
Chapter 2: Application of the Term Frequency*Inverse Document Frequency in SEO
Keywords are one of the pillars that SEO specialists take into account when developing an SEO strategy. They are very important for the positioning of a website in the SERPs of search engines, Google in particular.
The choice of keywords for a website is based on a thorough analysis that evaluates how often Internet users search for each keyword. In the past, SEO specialists used what is known today as keyword stuffing to rank their sites in the Google SERPs.
At that time, Google did not yet have the algorithms to analyze and judge this practice. Those days are gone and a lot has changed in the field of search engine optimization.
The competition between websites has become increasingly tough and Google now has many requirements strictly controlled by its various algorithms. Keyword stuffing in content has become a Black Hat strategy penalized by Google.
However, keywords still have their value, except that the quality of the keywords you use today matters more than the quantity you put in a piece of content. But how do you find the qualified, high-value keywords that will allow you to produce relevant articles? This is where term frequency and inverse document frequency (TF*IDF) come in.
2.1. What does TF*IDF mean for SEO?
In SEO, applying Term Frequency*Inverse Document Frequency consists of collecting the search results for a given keyword and evaluating the quality and relevance of this keyword. In simple terms, you need a tool or a measure that can help you discover the semantic value of the keywords in your SEO strategy.
This is precisely the main function of TF*IDF in SEO. This measure will help you discover the content that Google values on websites as well as the key terms that give this content its semantic value.
Through its numerous algorithm updates, Google has become able to understand the needs of Internet users and, above all, whether they are satisfied after visiting a website. The good news is that TF*IDF can give you some insight into the metrics Google uses to evaluate the relevance of sites.
The TF*IDF gives you the opportunity to discover what your competitors are doing as well as ideas for high quality content you can produce to satisfy your visitors. Let’s take the example of an SEO expert who has a site on which he addresses health and wellness topics.
He wants his site to rank in Google search results for the keyword “coconut oil”. A traditional search for relevant keywords associated with the main keyword yields results like:
- Coconut oil usage;
- Importance of coconut oil for hair;
- Benefits of coconut oil, etc.
Certainly, this research provides new ideas for writing content for the site. But it is not enough to write relevant content. It is also important to know the topics commonly covered by competing sites, especially the most authoritative sites in the health and wellness field.
Thus, the SEO can use a tool like STAT to retrieve a list of pages from competitor sites that rank well for the keyword “coconut oil”. Another site analysis tool (I recommend Ryte) can then be used to analyze the competitors’ pages for the keyword “coconut oil”.
Moreover, calculating the TF*IDF value will allow you to evaluate the quality of your competitors’ pages and compare them with your own site. The results of this analysis will be used to choose quality keywords with a higher search frequency and lower competition.
Traditional keyword research can certainly uncover what users are commonly searching for. However, the limitation of this research is that it doesn’t provide information about what your competitors are developing in their content.
This means that you can produce quality content with the keywords you get from standard research, but your pages may still rank poorly because the competition dominates. In-depth keyword research with TF*IDF analysis, on the other hand, will reveal the keywords associated with your main keyword along with their weight or semantic value.
This analysis also gives you a clearer picture of the strength of the competition, so you know what to expect and can take action. What makes Term Frequency*Inverse Document Frequency analysis especially valuable is that it does not simply reveal keywords that are morphologically similar to the main keyword, but rather related keywords with semantic value.
Basically, Term Frequency and Inverse Document Frequency provide insights into the topics that Google prioritizes. This analysis therefore helps SEOs discover content ideas that work for ranking a site in Google search results.
2.2. In what context should TF*IDF analysis be used in SEO?
TF*IDF is a measure that adds to the many working tools of SEO experts and web writers. They can use it to detect gaps in the content they currently have on their web pages by comparing it with the top 10 results on a search page.
Term frequency and inverse document frequency can also be very useful when creating new content for websites. Taking this measure into account when writing new articles can help your site get positioned in the SERPs quickly. This section covers the types of content to which you can apply the TF*IDF measure first.
2.2.1. Content ranked on the second page of search results
If you have content on your site that has been sitting on the second page of Google’s SERPs for some time, it would be appropriate to apply TF*IDF analysis. Even if this content already follows SEO best practices, it can still benefit from a touch-up based on TF*IDF.
Indeed, calculating the term frequency and the inverse document frequency allows you to analyze the content of your competitors who rank in the first 10 results in the SERP. When you compare the results of this analysis with the content of your site, you can discover what was wrong.
2.2.2. Content that loses positions and traffic during the year
A site that drops from the first position to the last in Google search results has surely fallen victim either to tough competition or to a Google algorithm change that reordered the search page around the most relevant content. Whatever the cause, it is important to check.
To do this, you can take a screenshot of the search page from when your site was in the top position and a screenshot of its current position. You can use a tool like SpyFu and compare the two SERPs.
Either way, a TF*IDF analysis will give you an idea of what content Google values and also what ideas your competitors are developing in their content. A revision of the content of your page taking into account the results of your analysis can correct this positioning problem.
2.3. How is the TF*IDF analysis done?
At first sight, the Term Frequency*Inverse Document Frequency formula may make the analysis seem very complicated. But in practice, the process of collecting data for TF*IDF analysis is not as tedious as it may seem.
The first step is to select the first ten (10) results that appear on the search page for your main key term. You then need to use a tool like Screaming Frog to obtain the keywords associated with your main keyword.
The keywords obtained from this analysis will give you an idea of what people are searching for, and you can determine whether you need to add large content sections to your page or whether the existing content already covers the topic well. TF*IDF analysis can also be done with a tool like Ryte or Linkassistant.
Ryte for example can help you compare the links of the first 10 results that are displayed in the SERP for your main keyword. The tool also provides a text editor that gives recommendations for optimizing new content.
Basically, the tool will give you a list of keywords that reflect what works for your competitors and what Google values. The tricky part is how you are going to use this list of keywords to produce useful content for your visitors.
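To make the output of such tools less of a black box, here is a minimal sketch of a TF*IDF calculation over a handful of pages. It assumes TF is the share of a term in the page and IDF is the natural logarithm of (number of pages / pages containing the term). Commercial tools may weight terms differently. The three sample texts are invented stand-ins for top-ranking pages on “coconut oil”.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf_table(pages):
    """Return one {term: weight} dict per page, with weight = TF * IDF."""
    tokenized = [tokenize(p) for p in pages]
    n_pages = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per page
    table = []
    for tokens in tokenized:
        counts = Counter(tokens)
        table.append({
            term: (count / len(tokens)) * math.log(n_pages / doc_freq[term])
            for term, count in counts.items()
        })
    return table

# Hypothetical snippets standing in for competitor pages.
pages = [
    "Coconut oil benefits for hair and skin",
    "Coconut oil for cooking and baking recipes",
    "Coconut oil hair mask recipes",
]
weights = tf_idf_table(pages)
top_terms = sorted(weights[0], key=weights[0].get, reverse=True)[:3]
print(top_terms)
```

Terms that appear on every page, such as “coconut” and “oil”, end up with a weight of zero, while terms specific to one page, such as “benefits” or “skin”, rise to the top.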
2.3.1. Modifying the list of keywords
Keep in mind that the goal is not to repeat the same thing as your competitors or to mention them, but to use the data to come up with more powerful ideas than what they are doing. This is why it is important to start by refining the keyword list using your common sense.
2.3.2. Detecting missing topics
The list of key terms obtained from the TF*IDF analysis should not be used to stuff multiple keywords into a piece of content. Although the TF*IDF measure gives you many relevant keywords for your content, that is no reason to go back to the old habit of stuffing keywords into articles.
Instead, TF*IDF analysis should allow you to detect the missing ideas that should be in your content to make it complete. These ideas can be as small as a dimension to add to a product sheet or as big as a 200-word paragraph to add to a blog article to make it more complete and more relevant. TF*IDF analysis helps you find the best way to optimize your content.
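One simple way to spot missing topics is to list the terms that many competitor pages use but your own page does not. The sketch below is a simplified stand-in for a full TF*IDF comparison: it only looks at how many competitor pages contain each term. The function name, the `min_share` threshold and the sample texts are all illustrative assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def missing_terms(own_page, competitor_pages, min_share=0.5):
    """Terms used by at least min_share of competitor pages but absent from own_page."""
    own = tokenize(own_page)
    doc_freq = Counter()
    for page in competitor_pages:
        doc_freq.update(tokenize(page))
    threshold = min_share * len(competitor_pages)
    return sorted(t for t, n in doc_freq.items() if n >= threshold and t not in own)

# Hypothetical competitor snippets for the keyword "coconut oil".
competitors = [
    "coconut oil benefits for hair growth",
    "coconut oil hair mask and skin care",
    "cooking with coconut oil",
]
gaps = missing_terms("coconut oil for cooking", competitors)
print(gaps)  # prints ['hair']
```

Here the result suggests a topic to cover (hair care), not a keyword to repeat as often as possible.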
2.3.3. Change the format of your pages if necessary
When analyzing competitors’ sites, you should also take into account the format they use and especially what works best. Of course, it is difficult to change the structure and layout of a website.
These actions require a lot of resources and time. Nevertheless, if your analysis shows that you need to modify the overall content of the site, update it and rework its design in order to guarantee a better user experience and improve your rankings, you should set aside the resources needed to do it.
Here are some conditions that may require you to update your site design:
- Unable to add new content sections due to site structure;
- Page does not reflect the best search intent;
- Current web page sections do not support large content;
- The web page lacks interactive components to be effective, etc.
Chapter 3: Advantages and limitations of Term frequency*Inverse Document Frequency in SEO
Why should SEO experts not neglect the TF*IDF measure when developing their SEO strategy? What more does this method bring to the optimization of websites in the SERPs? This section presents the advantages and disadvantages of TF*IDF analysis.
3.1. The advantages of Term Frequency*Inverse Document Frequency
The TF*IDF method (Term Frequency*Inverse Document Frequency) clearly brings many shortcuts that make the work of SEOs less painful. The purpose of a TF*IDF analysis is to obtain weighting values. These values essentially help to:
- Enrich the relevance of content;
- Produce well-optimized web content;
- Optimize the positioning of a website for relevant keyword searches.
3.1.1. Increasing the relevance of a site
The frequency of keywords in a piece of content is very important for the ranking of a web page. It is one of the main criteria on which Google relies to rank sites in the SERPs.
Indeed, when a user submits a query on Google, the search engine’s algorithms study the semantic match between the user’s query and the content of indexed sites. Your site is therefore more likely to appear in the search results if your article addresses the subject of the query with more relevance.
Term Frequency*Inverse Document Frequency analysis matters precisely because it improves the quality of the site’s content. Since Google ranks sites based on the semantic relationship between the user’s query and the site’s content, it is important for the SEO to enrich the information offered to visitors.
The TF*IDF calculation provides weighting values for performing a semantic analysis in order to find the best ideas for making the content of web pages relevant.
3.1.2. Production of original and optimized web content
The originality and quality of web content are also essential points that facilitate the positioning of a website on Google. It must be said that this is what makes you stand out from the rest and places you above your competitors.
The TF*IDF analysis is one of the most used techniques in SEO to find original content ideas. This technique allows SEOs to conduct an in-depth study of competing sites and make competitive comparisons.
The results of this analysis will be used to develop a content marketing strategy based on relevant keywords. The advantage of using this technique is that you won’t have to manually calculate the TF*IDF measure.
There are several SEO tools today that automate this function, such as Ryte or Linkassistant, mentioned earlier.
3.1.3. Website optimization for relevant keyword searches
The measurement of term frequency and inverse document frequency has become a very important indicator for web content optimization. Indeed, the role of the analysis is not limited to determining the weight or frequency of appearance of a keyword or phrase in an article.
It also represents a tool for generating relevant keywords and new content ideas. The TF*IDF actually allows you to discover which keywords associated with your main term are working and especially which content ideas Google prioritizes.
This data and information will allow you to easily optimize your web content and position yourself on relevant keywords. One advantage of term frequency*inverse document frequency that should not be overlooked is that it allows you to detect if you are practicing keyword stuffing in the creation of your content or if the keywords you are using are under-optimized.
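A rough way to check for over- or under-use of a keyword is to compare its density on your page with the range observed on competitor pages. The sketch below assumes a single-word keyword, and the labels and sample texts are hypothetical. It illustrates the idea and is not a threshold used by Google or by any SEO tool.

```python
import re

def density(text, keyword):
    """Keyword density in percent for a single-word keyword."""
    words = re.findall(r"[a-z]+", text.lower())
    return words.count(keyword.lower()) / len(words) * 100 if words else 0.0

def classify_usage(own_text, competitor_texts, keyword):
    """Label keyword usage relative to the competitors' density range."""
    competitor = [density(t, keyword) for t in competitor_texts]
    own = density(own_text, keyword)
    if own > max(competitor):
        return "possibly over-optimized"
    if own < min(competitor):
        return "possibly under-optimized"
    return "within competitor range"

# Hypothetical texts; the first own page repeats the keyword heavily.
competitor_texts = [
    "coconut oil benefits for hair",
    "cooking with coconut oil",
]
stuffed = classify_usage("coconut coconut coconut oil coconut", competitor_texts, "coconut")
thin = classify_usage("oil benefits for hair and skin", competitor_texts, "coconut")
print(stuffed, "/", thin)
```

A result outside the competitor range is only a hint to reread the page, not proof of keyword stuffing.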
3.2. The limits of the TF*IDF analysis
While the TF*IDF calculation method can help build a keyword-driven content marketing strategy, it is far from a perfect method without drawbacks. Although it is widely used, the TF*IDF metric has some limitations that should not be overlooked.
The TF*IDF metric allows for a general study of keywords, but this study does not take into account terms that are synonyms of the main keyword. In addition, this technique alone is insufficient to rank a site well on Google, especially with the constant updates of the algorithms.
Moreover, TF*IDF analysis does not differentiate between the components of an article on a web page. If we agree that an article is made up of components such as titles, headers, images and captions, the TF*IDF measure does not distinguish between these components when detecting the frequency of the main keyword.
In the case of keyword stuffing or under-optimization of keywords, the method does not pinpoint the affected sentences or paragraphs. Finally, it should be noted that this method of calculating term frequency only works well with long articles.
TF*IDF gives practically insignificant results for content that is generally short, such as press articles and product sheets. TF*IDF analysis therefore cannot give satisfactory results on sites such as:
- Online stores;
- Advertising sites;
- News portals, etc.
The ability to use TF*IDF analysis to generate multiple keywords associated with a main keyword, as well as new content ideas, makes this measure a powerful SEO tool. If you have understood the meaning of the term frequency and inverse document frequency calculation, then you have an idea of the basis of web page positioning in Google search results.
TF*IDF analysis can therefore be used in your keyword research to develop a profitable content marketing strategy for your online business. However, this method has its drawbacks and does not work in every situation.
I hope this article has been helpful to you. Feel free to share what you think about Term Frequency*Inverse Document Frequency in the comments.