Monitoring and extracting trends from web content has become essential for market research, content creation, and staying ahead in your field. In this tutorial, we provide a practical guide to building your own trend-finding tool with Python. You'll learn how to scrape publicly accessible websites, apply NLP (Natural Language Processing) techniques such as sentiment analysis and topic modeling, and visualize emerging trends with word clouds, all without external APIs or complex setups.
import requests
from bs4 import BeautifulSoup

# List of URLs to scrape
urls = ["https://en.wikipedia.org/wiki/Natural_language_processing",
        "https://en.wikipedia.org/wiki/Machine_learning"]

collected_texts = []  # to store text from each page

for url in urls:
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract all paragraph text
        paragraphs = [p.get_text() for p in soup.find_all('p')]
        page_text = " ".join(paragraphs)
        collected_texts.append(page_text.strip())
    else:
        print(f"Failed to retrieve {url}")
With the code snippet above, we demonstrate a straightforward way to scrape textual data from publicly accessible websites using Python's requests and BeautifulSoup. The script fetches content from the specified URLs, extracts the paragraph text from each page's HTML, and joins it into a single string per page, ready for further NLP analysis.
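If you scale this up beyond a couple of pages, it is worth hardening the loop. The sketch below (fetch_text, delay, and timeout are our own illustrative names and defaults, not part of the original snippet) adds a request timeout, raises on HTTP errors, and pauses between requests to stay polite:

import time
import requests
from bs4 import BeautifulSoup

def fetch_text(url, delay=1.0, timeout=10):
    # Return the page's concatenated paragraph text, or None on failure
    try:
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"},
                                timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx instead of silently continuing
    except requests.RequestException as exc:
        print(f"Failed to retrieve {url}: {exc}")
        return None
    time.sleep(delay)  # small pause so we don't hammer the server
    soup = BeautifulSoup(response.text, 'html.parser')
    return " ".join(p.get_text() for p in soup.find_all('p')).strip()

collected_texts = [text for text in (fetch_text(u) for u in urls) if text]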
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

cleaned_texts = []
for text in collected_texts:
    # Remove non-alphabetical characters and lowercase the text
    text = re.sub(r'[^A-Za-z\s]', ' ', text).lower()
    # Remove stopwords
    words = [w for w in text.split() if w not in stop_words]
    cleaned_texts.append(" ".join(words))
Then, we clean the scraped text by converting it to lowercase, removing punctuation and special characters, and filtering out common English stopwords using NLTK. This preprocessing ensures the text data is clean, focused, and ready for meaningful NLP analysis.
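As an optional extension (not part of the original tutorial), you could also lemmatize the tokens so that inflected forms such as "models" and "model" are counted as one term; a minimal sketch with NLTK's WordNetLemmatizer:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # lookup data required by the lemmatizer

lemmatizer = WordNetLemmatizer()
# Map each word to its base form before counting frequencies
lemmatized_texts = [
    " ".join(lemmatizer.lemmatize(w) for w in text.split())
    for text in cleaned_texts
]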
from collections import Counter
# Combine all texts into one if analyzing overall trends:
all_text = " ".join(cleaned_texts)
word_counts = Counter(all_text.split())
common_words = word_counts.most_common(10) # top 10 frequent words
print("Top 10 keywords:", common_words)
Now, we calculate word frequencies from the cleaned textual data, identifying the top 10 most frequent keywords. This helps highlight dominant trends and recurring themes across the collected documents, providing immediate insights into popular or significant topics within the scraped content.
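Single-word counts can hide multi-word trends such as "machine learning". A quick bigram count, our own extension of the snippet above, surfaces those phrases:

from collections import Counter

tokens = all_text.split()
# Pair each word with its successor and count the pairs
bigram_counts = Counter(zip(tokens, tokens[1:]))
print("Top 10 bigrams:", bigram_counts.most_common(10))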
!pip install textblob
from textblob import TextBlob

for i, text in enumerate(cleaned_texts, 1):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.1:
        sentiment = "Positive 😀"
    elif polarity < -0.1:
        sentiment = "Negative 🙁"
    else:
        sentiment = "Neutral 😐"
    print(f"Document {i} Sentiment: {sentiment} (polarity={polarity:.2f})")
We perform sentiment analysis on each cleaned text document using TextBlob, a Python library built on top of NLTK. It evaluates the overall emotional tone of each document—positive, negative, or neutral—and prints the sentiment along with a numerical polarity score, providing a quick indication of the general mood or attitude within the text data.
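TextBlob's sentiment object also carries a subjectivity score (0.0 = objective, 1.0 = subjective), which can help separate factual reference pages from opinionated posts. A small sketch reusing the same cleaned_texts:

from textblob import TextBlob

for i, text in enumerate(cleaned_texts, 1):
    sentiment = TextBlob(text).sentiment  # namedtuple: (polarity, subjectivity)
    print(f"Document {i}: polarity={sentiment.polarity:.2f}, "
          f"subjectivity={sentiment.subjectivity:.2f}")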
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Adjust these parameters for larger corpora
vectorizer = CountVectorizer(max_df=1.0, min_df=1, stop_words="english")
doc_term_matrix = vectorizer.fit_transform(cleaned_texts)

# Fit LDA to find topics (for instance, 3 topics)
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(doc_term_matrix)

feature_names = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    # Top 10 highest-weighted words for each topic
    top_words = [feature_names[i] for i in topic.argsort()[:-11:-1]]
    print(f"Topic {idx + 1}: ", top_words)
Then, we apply Latent Dirichlet Allocation (LDA)—a popular topic modeling algorithm—to discover underlying topics in the text corpus. It first transforms cleaned texts into a numerical document-term matrix using scikit-learn’s CountVectorizer, then fits an LDA model to identify the primary themes. The output lists the top keywords for each discovered topic, concisely summarizing key concepts in the collected data.
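The fitted model can also tell you how strongly each document loads on each topic, which is useful for tracing a trend back to its source pages. A brief sketch using scikit-learn's transform on the matrix from above:

# Rows are documents, columns are topic weights that sum to 1
doc_topic = lda.transform(doc_term_matrix)
for i, weights in enumerate(doc_topic, 1):
    dominant = weights.argmax() + 1  # 1-based topic index to match the printout above
    print(f"Document {i}: dominant topic {dominant}, "
          f"weights={[round(float(w), 2) for w in weights]}")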
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Combine the cleaned texts from earlier into a single string
combined_text = " ".join(cleaned_texts)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color="white",
                      colormap='viridis').generate(combined_text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Word Cloud of Scraped Text", fontsize=16)
plt.show()
Finally, we generate a word cloud visualization displaying prominent keywords from the combined and cleaned text data. By visually emphasizing the most frequent and relevant terms, this approach allows for intuitive exploration of the main trends and themes in the collected web content.
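To keep the visualization for a report, or to build the cloud directly from the word_counts Counter computed earlier rather than from raw text, the wordcloud library provides to_file and generate_from_frequencies (the output filename below is our own choice):

# Save the rendered cloud as an image file
wordcloud.to_file("trend_wordcloud.png")

# Alternatively, drive the cloud from explicit frequencies
freq_cloud = WordCloud(width=800, height=400, background_color="white")
freq_cloud = freq_cloud.generate_from_frequencies(word_counts)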
Figure: Word cloud output from the scraped pages.
In conclusion, we've built a simple but effective trend-finding tool. This exercise gave you hands-on experience with web scraping, NLP preprocessing, sentiment analysis, topic modeling, and word-cloud visualization. With this straightforward approach, you can track industry trends, draw insights from blog and social content, and make informed decisions based on freshly scraped data.