A Step-by-Step Guide to Building a Trend Finder Tool with Python: Web Scraping, NLP (Sentiment Analysis & Topic Modeling), and Word Cloud Visualization

Monitoring and extracting trends from web content has become essential for market research, content creation, and staying current in your field. In this tutorial, we provide a practical guide to building your own trend-finding tool in Python. Without external APIs or complex setup, you’ll learn how to scrape publicly accessible websites, apply NLP (Natural Language Processing) techniques such as sentiment analysis and topic modeling, and visualize prominent terms with word clouds.

import requests
from bs4 import BeautifulSoup


# List of URLs to scrape
urls = ["https://en.wikipedia.org/wiki/Natural_language_processing",
        "https://en.wikipedia.org/wiki/Machine_learning"]  


collected_texts = []  # to store text from each page


for url in urls:
    # A timeout keeps the loop from hanging indefinitely on an unresponsive server
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract all paragraph text
        paragraphs = [p.get_text() for p in soup.find_all('p')]
        page_text = " ".join(paragraphs)
        collected_texts.append(page_text.strip())
    else:
        print(f"Failed to retrieve {url}")

The snippet above shows a straightforward way to scrape text from publicly accessible websites using Python’s requests and BeautifulSoup. It fetches each URL, extracts the text of every paragraph (<p>) element from the HTML, and joins the paragraphs into one string per page, ready for NLP analysis.
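To see what the paragraph-extraction step does without touching the network, here is a minimal sketch that runs on an inline HTML string. It uses the standard library's html.parser as a stand-in for BeautifulSoup's find_all('p'), so it approximates the behavior above rather than reproducing the same code path. ParagraphExtractor, extract_page_text, and sample_html are illustrative names that are not part of the original tutorial.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> elements (stdlib stand-in for find_all('p'))."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []
        self.paragraphs = []

    def _flush(self):
        # Store the paragraph collected so far, if any, and reset the buffer.
        if self._in_paragraph and self._buffer:
            self.paragraphs.append("".join(self._buffer))
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends when the next one starts
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_paragraph = False

    def handle_data(self, data):
        # Text nested in inline tags such as <b> still arrives here.
        if self._in_paragraph:
            self._buffer.append(data)


def extract_page_text(html):
    """Return all paragraph text from an HTML string, joined by spaces."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.paragraphs).strip()


# Hypothetical page content used only for this demonstration.
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<div>skip</div>"
    "<p>Second one.</p></body></html>"
)
page_text = extract_page_text(sample_html)
print(page_text)
```

The heading and the div are dropped because only text inside paragraph elements is kept, which is the same filtering the scraping loop applies to each fetched page.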

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords


stop_words = set(stopwords.words('english'))


cleaned_texts = []
for text in collected_texts:
    # Remove non-alphabetical characters and lower the text
    text = re.sub(r'[^A-Za-z\s]', ' ', text).lower()
    # Remove stopwords
    words = [w for w in text.split() if w not in stop_words]
    cleaned_texts.append(" ".join(words))

Then, we clean the scraped text by replacing every character that is not a letter or whitespace (punctuation, digits, and other symbols) with a space, converting the result to lowercase, and filtering out common English stopwords using NLTK. This preprocessing leaves only the content-bearing words for the analysis steps that follow.
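The same cleaning logic can be wrapped in a function and tried on a single sentence. The sketch below is a hedged illustration: clean_text is a hypothetical helper name, and SAMPLE_STOP_WORDS is a tiny hand-written set standing in for NLTK's full English list so the example runs without a download.

```python
import re

# Tiny illustrative stopword set (an assumption for this demo);
# the tutorial itself uses NLTK's full English stopword list.
SAMPLE_STOP_WORDS = {"the", "is", "a", "of", "and", "s"}


def clean_text(text, stop_words=SAMPLE_STOP_WORDS):
    """Keep letters only, lowercase the text, and drop stopwords."""
    # Replace anything that is not a letter or whitespace with a space.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", text).lower()
    return " ".join(w for w in letters_only.split() if w not in stop_words)


cleaned = clean_text("The model's accuracy is 95%, a record!")
print(cleaned)
```

Note that the apostrophe in "model's" becomes a space, leaving a stray "s" token, and that the digits in "95%" disappear entirely. If numbers matter for your trends, the regular expression would need to be relaxed.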

from collections import Counter


# Combine all texts into one if analyzing overall trends:
all_text = " ".join(cleaned_texts)
word_counts = Counter(all_text.split())
common_words = word_counts.most_common(10)  # top 10 frequent words
print("Top 10 keywords:", common_words)

Now, we calculate word frequencies from the cleaned textual data, identifying the top 10 most frequent keywords. This helps highlight dominant trends and recurring themes across the collected documents, providing immediate insights into popular or significant topics within the scraped content.
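Single-word counts split phrases such as "machine learning" into two unrelated keywords. One possible extension, sketched below under the assumption that the input is already cleaned and space-separated, counts word n-grams with the same Counter approach. top_ngrams is a hypothetical helper, not part of the original code.

```python
from collections import Counter


def top_ngrams(text, n=2, k=5):
    """Return the k most frequent n-word phrases in a space-separated text."""
    words = text.split()
    # zip over n shifted copies of the word list yields consecutive n-grams.
    ngrams = zip(*(words[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in ngrams)
    return counts.most_common(k)


sample = "machine learning model uses machine learning data"
top_bigram = top_ngrams(sample, n=2, k=1)
print(top_bigram)
```

Because stopwords were removed earlier, some n-grams will join words that were not adjacent in the original sentence, so treat the phrases as rough signals rather than exact quotations.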

# Notebook-only shell command (Colab/Jupyter); in a terminal, run: pip install textblob
!pip install textblob
from textblob import TextBlob


for i, text in enumerate(cleaned_texts, 1):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.1:
        sentiment = "Positive 😀"
    elif polarity < -0.1:
        sentiment = "Negative 🙁"
    else:
        sentiment = "Neutral 😐"
    print(f"Document {i} Sentiment: {sentiment} (polarity={polarity:.2f})")

We perform sentiment analysis on each cleaned text document using TextBlob, a Python library built on top of NLTK. TextBlob returns a polarity score between -1.0 (most negative) and 1.0 (most positive); the code labels a document positive above 0.1, negative below -0.1, and neutral otherwise, then prints the label alongside the score. This gives a quick indication of the overall tone of each page.
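The labeling rule can be isolated from TextBlob so the threshold is easy to adjust and check. This is a minimal sketch: label_polarity and its threshold parameter are illustrative names, and the scores passed in are made-up values rather than real TextBlob output.

```python
def label_polarity(polarity, threshold=0.1):
    """Map a polarity score in [-1.0, 1.0] to a coarse sentiment label."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must be between -1.0 and 1.0")
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"


# Made-up scores for illustration only.
for score in (0.35, 0.05, -0.4):
    print(f"polarity={score:+.2f} -> {label_polarity(score)}")
```

Encyclopedic pages tend to score close to zero, so a narrower threshold may be worth trying if every document comes back neutral.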

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


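# max_df=1.0 and min_df=1 are the defaults (no terms are filtered by document frequency);
# on larger corpora, adjust them to drop very common or very rare terms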
vectorizer = CountVectorizer(max_df=1.0, min_df=1, stop_words="english")
doc_term_matrix = vectorizer.fit_transform(cleaned_texts)


# Fit LDA to find topics (for instance, 3 topics)
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(doc_term_matrix)


feature_names = vectorizer.get_feature_names_out()


for idx, topic in enumerate(lda.components_):
    # argsort()[:-11:-1] yields the indices of the 10 highest-weighted terms, largest first
    top_terms = [feature_names[i] for i in topic.argsort()[:-11:-1]]
    print(f"Topic {idx + 1}: ", top_terms)

Then, we apply Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm, to discover underlying topics in the text corpus. The code first transforms the cleaned texts into a numerical document-term matrix using scikit-learn’s CountVectorizer, then fits an LDA model with three topics. The output lists the ten highest-weighted keywords for each topic. With only two scraped pages, the topics here are illustrative; a larger set of documents gives LDA more to separate.
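To make the document-term matrix less abstract, here is a small pure-Python sketch of the idea: one row per document, one column per vocabulary word, each cell a count. It deliberately ignores CountVectorizer's tokenization rules and stopword handling, and build_doc_term_matrix and top_terms are hypothetical helpers written for this illustration. top_terms mirrors what the argsort slice does when picking the highest-weighted terms of a topic.

```python
from collections import Counter


def build_doc_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[d][t] counts term t in doc d."""
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        # Counter returns 0 for words that do not occur in this document.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


def top_terms(weights, vocabulary, k=2):
    """Return the k vocabulary words with the largest weights, largest first."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [vocabulary[i] for i in ranked[:k]]


docs = ["data model data", "model training"]
vocabulary, matrix = build_doc_term_matrix(docs)
print(vocabulary)
print(matrix)

# Made-up topic weights, one per vocabulary word, for illustration only.
print(top_terms([0.2, 1.5, 0.9], vocabulary, k=2))
```

LDA takes a matrix like this as input and learns, for each topic, a weight per vocabulary word; those weights are what lda.components_ holds.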

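# Builds combined_text from collected_texts (the scraped pages), then renders it as a word cloud.
# The cleaning step is repeated here so this cell can run on its own once collected_texts exists.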
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
import re


nltk.download('stopwords')
stop_words = set(stopwords.words('english'))


# Preprocess and clean the text:
cleaned_texts = []
for text in collected_texts:
    text = re.sub(r'[^A-Za-z\s]', ' ', text).lower()
    words = [w for w in text.split() if w not in stop_words]
    cleaned_texts.append(" ".join(words))


# Generate combined text
combined_text = " ".join(cleaned_texts)


# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color="white", colormap='viridis').generate(combined_text)


# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Word Cloud of Scraped Text", fontsize=16)
plt.show()

Finally, we generate a word cloud from the combined and cleaned text. Words that occur more often are drawn larger, so the most frequent terms stand out at a glance, which makes the image a quick way to explore the main themes in the collected web content.
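When a plotting backend is not available, for example in a terminal session, a text-only view of the same frequency information can serve as a rough substitute. The sketch below is an assumption-laden alternative, not how the wordcloud library lays out or sizes words: frequency_bars is a hypothetical helper that scales a bar of # characters linearly with each word's count.

```python
from collections import Counter


def frequency_bars(text, top_n=5, max_width=20):
    """Return text lines showing the top_n words with bars scaled to their counts."""
    counts = Counter(text.split()).most_common(top_n)
    if not counts:
        return []
    peak = counts[0][1]  # the most frequent word gets the full bar width
    lines = []
    for word, count in counts:
        bar = "#" * max(1, round(count / peak * max_width))
        lines.append(f"{word:<12} {bar} {count}")
    return lines


sample = "data data data data model model learning"
bars = frequency_bars(sample, top_n=3, max_width=8)
for line in bars:
    print(line)
```

In the tutorial's pipeline, the input would be the combined cleaned text rather than the short sample string used here.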

Word Cloud Output from the Scraped Site

In conclusion, we’ve built a simple trend-finding tool. This exercise gave you hands-on experience with web scraping, text preprocessing, sentiment analysis, topic modeling, and word cloud visualization. By pointing the same pipeline at other publicly accessible pages and rerunning it periodically, you can track industry trends, extract insights from blog and social content, and base decisions on current data.



Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views, illustrating its popularity among audiences.

