The data was extracted from BBC Good Food: https://www.bbcgoodfood.com/recipes
The steps involved are as follows:
- Install ParseHub
- Get the URLs for each of the recipes from the different cuisines (refer to the attached urls.csv)
- Run the Python code below:
import csv
import os

import requests
from bs4 import BeautifulSoup

os.chdir('C:\\Users\\avg\\Desktop\\Latest_Data')

# Read the recipe URLs (one per row of urls.csv).
contents = []
with open('urls.csv', 'r') as csvf:
    for row in csv.reader(csvf):
        contents.append(row)

for url in contents:
    # Fetch the recipe page with a browser-like User-Agent.
    page = requests.get(url[0], headers={'User-Agent': 'Mozilla/5.0'}).text
    soup = BeautifulSoup(page, "html.parser")

    # The recipe title names the output file.
    title = soup.find("h1", class_="recipe-header__title")
    print(title.text)
    file_name = title.text + ".csv"

    with open(file_name, "w") as f:
        # First field: the recipe title (non-ASCII characters stripped).
        f.write(title.text.encode('ascii', 'ignore').decode('ascii'))

        # Each nutrient is an <li> holding a label span and a value span.
        nutrition = soup.find("ul", class_="nutrition")
        for li in nutrition.findAll("li"):
            label = li.find("span", class_="nutrition__label")
            value = li.find("span", class_="nutrition__value")
            print(label.get_text() + ": " + value.get_text())
            f.write("," + value.get_text())
- The nutrient information is saved as separate CSV files (one file per recipe)
- Merge the CSV files to get the required data (see the sketch below)
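The merge step can also be done in Python. Here is a minimal sketch, assuming the per-recipe files were written into the same Latest_Data folder used above; the output name merged_nutrition.csv is just a placeholder:

import glob
import pandas as pd

# Gather every per-recipe CSV produced by the scraper (skipping the URL list itself).
rows = []
for path in glob.glob('C:\\Users\\avg\\Desktop\\Latest_Data\\*.csv'):
    if path.lower().endswith('urls.csv'):
        continue
    with open(path) as f:
        rows.append(f.read().strip().split(','))

# One row per recipe: the title followed by its nutrient values.
merged = pd.DataFrame(rows)
merged.to_csv('merged_nutrition.csv', index=False, header=False)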