Web Scraping with AI
Repository: https://github.com/takline/TK-Projects/tree/main/Web%20Scraping%20%26%20Automation/web-scraping-with-AI
What’s Brewing Here?
Embark on a digital treasure hunt with this code ensemble! It’s your magic wand to sift through the web’s labyrinth and extract structured nuggets of knowledge. We weave together the spells of OpenAI Functions and LangChain to make this happen. Just sketch out your data map in provide_schema.py, pick your web destination, and let playwright_scrape_and_analyze() in run.py be your guide.
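Pro Tip: Web pages are a tapestry of <p>, <span>, and heading (<h1> through <h6>) tags. Different sites keep their useful text in different tags, so try a few combinations until you find the one that whispers the secrets of your chosen site.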
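To see how the choice of tags changes what gets extracted, here is a minimal sketch that uses only Python's standard library. It is not the project's own extraction code: TagTextCollector, extract_text, and SAMPLE_HTML are hypothetical names invented for this illustration.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside a chosen set of tags (hypothetical helper)."""

    def __init__(self, tags):
        super().__init__()
        self.wanted = set(tags)
        self.depth = 0  # how many wanted tags we are currently nested inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_text(html, tags):
    """Return the non-empty text fragments that sit inside any of the given tags."""
    parser = TagTextCollector(tags)
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up page used only for this demonstration.
SAMPLE_HTML = """
<html><body>
  <h1>Daily News</h1>
  <p>Markets rallied on Tuesday.</p>
  <span>Updated 5 minutes ago</span>
  <div>Footer links</div>
</body></html>
"""

print(extract_text(SAMPLE_HTML, ["span"]))
print(extract_text(SAMPLE_HTML, ["h1", "p"]))
```

Running the same page through different tag lists makes it easy to spot which combination carries the content you care about before spending any OpenAI tokens on it.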
Example Adventure
- Craft your digital map in provide_schema.py. Whether it’s a Pydantic class or a simple dictionary, your wish is our command:

    from pydantic import BaseModel

    class NewsPortalSchema(BaseModel):
        headline: str
        brief_summary: str

- Set sail in run.py like this:

    import asyncio

    asyncio.run(playwright_scrape_and_analyze(
        url="https://www.bbc.com",
        tags=["span"],
        schema_pydantic=NewsPortalSchema
    ))
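The example shows only the Pydantic form of the schema. As a rough sketch of the dictionary form, assuming it follows the JSON-Schema-like shape (a "properties" mapping plus a "required" list) that LangChain's extraction chains have accepted, it might look like the following. The names news_portal_schema and missing_fields are hypothetical and not part of the project.

```python
# Hypothetical dictionary counterpart of NewsPortalSchema (assumed shape).
news_portal_schema = {
    "properties": {
        "headline": {"type": "string"},
        "brief_summary": {"type": "string"},
    },
    "required": ["headline", "brief_summary"],
}


def missing_fields(record, schema):
    """Return the required field names that are absent from one extracted record."""
    return [name for name in schema["required"] if name not in record]


# Made-up records standing in for what an extraction run might return.
complete = {"headline": "Example headline", "brief_summary": "Example summary"}
partial = {"headline": "Example headline"}

print(missing_fields(complete, news_portal_schema))
print(missing_fields(partial, news_portal_schema))
```

A small check like missing_fields can help flag incomplete records when a dictionary schema is used, since a plain dictionary does not validate results the way a Pydantic class does.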
Setting Up Your Magic Kit
1. Conjure a Python virtual environment
python -m venv virtual-env or python3 -m venv virtual-env (Mac)
py -m venv virtual-env (Windows 11)
2. Awaken your virtual environment
.\virtual-env\Scripts\activate (Windows)
source virtual-env/bin/activate (Mac)
3. Summon dependencies with Poetry’s charm
Cast poetry install --sync or simply poetry install
4. Enlist Playwright in your quest
playwright install
5. Secretly store your OpenAI API key in a .env scroll
OPENAI_API_KEY=XXXXXX
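How the .env scroll gets read is not spelled out here. Projects like this often rely on a helper package such as python-dotenv, but whether this one does is an assumption. The sketch below uses only the standard library to show the idea; load_env_file is a hypothetical helper, and the demonstration writes a throwaway file so no real key is touched.

```python
import os
import tempfile


def load_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Demonstration with a temporary file that mirrors the .env entry above.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write("# local secrets\nOPENAI_API_KEY=XXXXXX\n")
    loaded = load_env_file(env_path)

# setdefault keeps any key that is already exported in the shell.
os.environ.setdefault("OPENAI_API_KEY", loaded["OPENAI_API_KEY"])
print("OPENAI_API_KEY" in os.environ)
```

Whichever loader you use, keep the .env file out of version control so the key stays secret.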
How to Unleash the Magic
Run it in your own mystical realm
python run.py
Scrolls of Extra Wisdom
- Transform this into a FastAPI crystal ball to peer into your data through an API gateway.
- Tread the web with the caution of a wise wizard. Only venture where the digital ethics compass allows.
- A little bird told me this wizardry is now part of LangChain’s spellbook in this PR. Peek into the grand library here for more enchantments.
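One practical reading of the digital ethics compass is to consult a site's robots.txt before scraping it. The sketch below uses Python's standard urllib.robotparser and parses a made-up robots.txt offline. SAMPLE_ROBOTS_TXT and is_allowed are hypothetical names, and nothing here is taken from the project's own code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline so the sketch needs no network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/news"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))
```

Against a live site you would instead call set_url() with the site's robots.txt address and then read() before asking can_fetch(). A robots.txt check is only one signal; a site's terms of service may impose further limits.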