Choosing a good title for an article is an important step in the writing process. The more interesting the title, the more likely a reader is to engage with the whole piece. Furthermore, showing users the content they prefer (the content they interact with) increases their satisfaction.
That's how my final project for the Machine Learning Engineer Nanodegree started. I just finished it, and I feel so proud and happy 😀 that I wanted to share some insights I had about the whole process. Also, I promised Quincy Larson this article when I finished the project.
If you want to see the final technical document, click here. If you want to try the code, check it out here or fork my project on GitHub. If you only want an overview in layman's terms, this is the right place, so keep reading.
Some of the most used platforms for spreading ideas nowadays are Twitter and Medium (you're here!). On Twitter, articles are normally posted with their title and an external URL, so users can open the article and show their appreciation with a like or a retweet of the original post.
Medium shows the complete text with tags (to classify the article) and claps (similar to Twitter likes) that show how much users appreciate the content. A correlation between these two platforms can provide us with valuable information.
The problem I defined was a classification task that used supervised learning: Predict the number of likes and retweets an article receives based on the title.
Correlating the number of Twitter likes and retweets with a Medium article is an attempt to isolate the effect of the number of readers reached on the number of Medium claps. The more an article is shared on different platforms, the more readers it will reach and the more Medium claps it will (probably) receive.
Using only the Twitter statistics, we would expect the articles to initially reach almost the same number of readers (the followers of the freeCodeCamp account on Twitter). Their performance and interactions would therefore depend only on the characteristics of the tweet, such as the title of the article. And this is exactly what we want to measure.
I chose the freeCodeCamp account for this project because the idea was to limit the scope of the articles and to better predict the response in a specific field. The same title can work well in one category (e.g. technology) but not necessarily in another (e.g. cooking). In addition, this account posts the title of the original article and its Medium URL as the content of the tweet.
What does the data look like?
The first step of this project was to get the information from Twitter and Medium and then correlate them. The data set can be found here and has 711 data points. Here is what the data set looks like:
Analyzing and learning from the data
After analyzing the data set and plotting some graphs, I found some interesting information in it. For these analyses, the outliers were removed, and I considered only the top 25% of performers for each feature (retweets, likes, and claps).
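As a rough illustration of the top-25% filter described above, the sketch below keeps only the rows at or above the 75th percentile of a metric. The field name `retweets` and the sample numbers are made up for the example, and the outlier-removal step is left out because the method used for it isn't described here.

```python
from statistics import quantiles


def top_quartile(rows, key):
    """Keep the rows whose `key` value is at or above the 75th percentile."""
    values = [row[key] for row in rows]
    # quantiles(..., n=4) returns the three quartile cut points;
    # the last one is the 75th percentile.
    threshold = quantiles(values, n=4)[2]
    return [row for row in rows if row[key] >= threshold]


# Made-up sample rows, for illustration only.
rows = [{"retweets": n} for n in (2, 5, 8, 12, 20, 35, 50, 90)]
top = top_quartile(rows, "retweets")
print([row["retweets"] for row in top])
```

The same function could be called with a likes or claps column to get the top performers for each metric separately.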
Let's have a look at what the numbers say for freeCodeCamp articles written on Medium and shared on Twitter.
What is a good title length?
Writing titles that are more than 50 and fewer than 110 characters long helps increase the chances of a successful article.
What is a good number of words in the title?
The most effective number of words in the title is 9 to 17. To optimize the number of retweets and likes, try something from 9 to 18 words, and for claps, from 7 to 17.
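If you want to check a draft title against these numbers, a small helper is enough. This is only a sketch: the function name `title_stats` is hypothetical, words are counted by splitting on whitespace (an assumption about how words were counted), and it uses the general 9-to-17-word range.

```python
def title_stats(title):
    """Return the length and word count of a title, plus range checks."""
    n_chars = len(title)
    n_words = len(title.split())  # assumption: words are whitespace-separated
    return {
        "chars": n_chars,
        "words": n_words,
        # "more than 50 and fewer than 110" characters
        "good_length": 50 < n_chars < 110,
        # "9 to 17" words
        "good_word_count": 9 <= n_words <= 17,
    }


# A made-up draft title, for illustration only.
draft = "Ten practical tips for writing article titles that readers actually want to click"
print(title_stats(draft))
```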
What are the best tags to use?
What are the best words to use?
Using Machine Learning
OK! After looking at the data and extracting some information from it, the goal was to create a Machine Learning model that predicts the number of retweets, likes, and claps based on the article title.
Predicting the number of retweets, likes, and claps of an article can be treated as a classification problem, which is a common machine learning (ML) task. But for this, we need to treat the output as discrete values (ranges of numbers). The input will be the title of the article with each word as a token (t1, t2, t3, … tn), the length of the title, and the number of words in the title.
The ranges for our features are:
- Retweets: 0-10, 10-30, 30+
- Likes: 0-25, 25-60, 60+
- Claps: 0-50, 50-400, 400+
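Turning a raw count into one of these ranges can be sketched in a few lines. The names `RANGES` and `to_range` are hypothetical, and since the list doesn't say which range a boundary value such as 10 belongs to, this sketch assumes it goes into the upper range.

```python
from bisect import bisect_right

# Range edges and labels taken from the list above.
RANGES = {
    "retweets": ([10, 30], ["0-10", "10-30", "30+"]),
    "likes": ([25, 60], ["0-25", "25-60", "60+"]),
    "claps": ([50, 400], ["0-50", "50-400", "400+"]),
}


def to_range(metric, count):
    """Map a raw count to its range label (the class the model predicts)."""
    edges, labels = RANGES[metric]
    # Assumption: a count equal to an edge falls into the upper range.
    return labels[bisect_right(edges, count)]


print(to_range("retweets", 5))
print(to_range("likes", 42))
print(to_range("claps", 1200))
```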
Finally, after preprocessing our data set and evaluating some models (fully described here), we came to the conclusion that the MultinomialNB model achieved the best results for retweets, reaching 60.6% accuracy. Logistic regression reached 55.3% for likes and 49% for claps.
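To give a feel for what such a model looks like, here is a minimal sketch using scikit-learn's `CountVectorizer` and `MultinomialNB`. It is not the project's actual pipeline: the titles and labels are invented toy data, and it uses only word counts, leaving out the title length and the number of words.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy titles and retweet ranges, for illustration only.
titles = [
    "How to learn Python in one month",
    "A beginner's guide to machine learning with Python",
    "Why I quit my job",
    "My thoughts on remote work",
]
labels = ["30+", "30+", "0-10", "0-10"]

# CountVectorizer turns each title into word counts (the tokens),
# and MultinomialNB learns which words go with which range.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(titles, labels)

prediction = model.predict(["Learn machine learning with Python"])[0]
print(prediction)
```

With real data you would train on most of the titles and measure accuracy on the rest, which is how figures like the ones above are usually obtained.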
As an experiment for this article, I ran the prediction on the title of this article, and the model predicted that:
It will have 10-30 retweets and 25-60 likes on Twitter and 400+ claps on Medium.
How good is this prediction? 😀