Getting a Daily Digest From Tech Websites
Last Updated: September 24, 2024
Motivation: we want to stay informed about the latest developments in tech. With so many websites publishing news every day, however, it is impossible to keep track of everything. What if we could summarize the latest headlines, and have it all run locally with an off-the-shelf LLM in a few lines of code?
Let's see how Haystack, together with TitanML's Takeoff Inference Server, can help us achieve this.
Run Titan Takeoff Inference Server Image
Remember that you must download this notebook and run it in your local environment; the command below also assumes a machine with an NVIDIA GPU (--gpus all). The Titan Takeoff Inference Server lets you run modern open-source LLMs on your own infrastructure.
docker run --gpus all -e TAKEOFF_MODEL_NAME=TheBloke/Llama-2-7B-Chat-AWQ \
-e TAKEOFF_DEVICE=cuda \
-e TAKEOFF_MAX_SEQUENCE_LENGTH=256 \
-it \
-p 3000:3000 tytn/takeoff-pro:0.11.0-gpu
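Once the container is running, it is worth checking that the server is actually reachable before wiring it into a pipeline. A minimal sketch, assuming the server is listening on localhost:3000 as in the command above:
import socket

# Quick reachability check: assumes the Takeoff container started above
# is exposing port 3000 on localhost.
with socket.create_connection(("localhost", 3000), timeout=5):
    print("Takeoff server is reachable on localhost:3000")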
Daily digest from top tech websites using deepset's Haystack and Titan Takeoff
!pip install feedparser
!pip install takeoff_haystack
from typing import Dict, List
from haystack import Document, Pipeline
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
import feedparser
from takeoff_haystack import TakeoffGenerator
# Dict of website RSS feeds
urls = {
'theverge': 'https://www.theverge.com/rss/frontpage/',
'techcrunch': 'https://techcrunch.com/feed',
'mashable': 'https://mashable.com/feeds/rss/all',
'cnet': 'https://cnet.com/rss/news',
'engadget': 'https://engadget.com/rss.xml',
'zdnet': 'https://zdnet.com/news/rss.xml',
'venturebeat': 'https://feeds.feedburner.com/venturebeat/SZYF',
'readwrite': 'https://readwrite.com/feed/',
'wired': 'https://wired.com/feed/rss',
'gizmodo': 'https://gizmodo.com/rss',
}
# Configurable parameters
NUM_WEBSITES = 3  # number of feeds to fetch
NUM_TITLES = 1    # number of headlines to keep per feed
def get_titles(urls: Dict[str, str], num_sites: int, num_titles: int) -> List[str]:
    """Fetch the `num_titles` most recent headlines from the first `num_sites` feeds."""
    titles: List[str] = []
    sites = list(urls.keys())[:num_sites]
    for site in sites:
        feed = feedparser.parse(urls[site])
        entries = feed.entries[:num_titles]
        for entry in entries:
            titles.append(entry.title)
    return titles
titles = get_titles(urls, NUM_WEBSITES, NUM_TITLES)
document_store = InMemoryDocumentStore()
document_store.write_documents(
    [Document(content=title) for title in titles]
)
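As a quick sanity check, the in-memory store can report how many documents it holds via count_documents():
# With NUM_WEBSITES = 3 and NUM_TITLES = 1, this should print 3
print(document_store.count_documents())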
template = """
HEADLINES:
{% for document in documents %}
{{ document.content }}
{% endfor %}
REQUEST: {{ query }}
"""
pipe = Pipeline()
pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component("llm", TakeoffGenerator(base_url="http://localhost", port=3000))
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")
query = f"Summarize each of the {NUM_WEBSITES * NUM_TITLES} provided headlines in three words."
# Join the fetched titles into a single string for a quick look at the data
titles_string = ' - '.join(titles)
titles_string
'Two words: poker roguelike - Former Twitter engineers are building Particle, an AI-powered news reader - Best laptops of MWC 2024, including a 2-in-1 that broke a world record'
response = pipe.run({"prompt_builder": {"query": query}, "retriever": {"query": query}})
print(response["llm"]["replies"])
['\n\n\nANSWER:\n\n1. Poker Roguelike - Exciting gameplay\n2. AI-powered news reader - Personalized feed\n3. Best laptops MWC 2024 - Powerful devices']
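Finally, to make this a true daily digest, you can re-fetch the feeds and re-run the pipeline on a schedule. Below is a minimal sketch using only the standard library for timing; note that old headlines accumulate in the store (only exact duplicates are overwritten), so a production setup would prune or recreate the store each day:
import time

from haystack.document_stores.types import DuplicatePolicy

while True:
    # Refresh the store with today's headlines, overwriting exact duplicates,
    # then re-run the summarization pipeline.
    fresh_titles = get_titles(urls, NUM_WEBSITES, NUM_TITLES)
    document_store.write_documents(
        [Document(content=title) for title in fresh_titles],
        policy=DuplicatePolicy.OVERWRITE,
    )
    response = pipe.run({"prompt_builder": {"query": query}, "retriever": {"query": query}})
    print(response["llm"]["replies"])
    time.sleep(24 * 60 * 60)  # wait a day before the next run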