Getting Started with Web Scraping in Python
Scraping is one of the many practical uses of Python. It means accessing pages on the web and extracting the information you need. Compared with a person gathering information by hand, one item at a time, a program can collect far more information far faster once it has been implemented. This article explains how to scrape with Python, with working code along the way.
The internet has become an indispensable part of daily life: we use websites to search for information, read the news, and buy products. The volume of information on the web, however, is beyond counting, and collecting what you need from it by hand is inefficient and laborious.
This is where web scraping comes in. Web scraping uses a program to extract information from web pages, which makes it possible to collect large amounts of data automatically and pull out just the parts you need.
This article covers web scraping from the basics through to more advanced techniques. It starts with the principles and best practices of ethical scraping, moves on to basic scraping methods, and then looks in detail at advanced techniques and scraping dynamic sites.
What Is Web Scraping?
Web scraping is the process of automatically retrieving information from websites. Companies and individuals use it for many purposes, such as gathering competitor information, market research, and data analysis. In practice, it works by parsing a page's HTML and locating the data you want, often by way of CSS selectors. To understand how a page is structured and extract the right data accurately, you need a basic knowledge of HTML and CSS.
Ethical Scraping and Best Practices
Scraping in itself is not illegal.
However, it is important to scrape ethically and to follow best practices.
Some websites do, of course, prohibit or restrict scraping.
To respect the rights of website operators and users and to scrape without causing problems, follow the best practices below.
- Comply with the terms of service: Check the website's terms of service to see whether scraping is permitted, and scrape only in ways that do not violate them.
- Check robots.txt: Look at the robots.txt file in the website's root directory to see whether it states any permissions or restrictions for automated access, and follow its directives.
- Control access frequency: Sending too many requests can put a load on the server. Set an appropriate request rate so that you do not overload it.
- Respect the permitted scope of data use: Use the data you collect only within the permitted scope. Take care not to violate the site's terms of service or copyright, and handle any restrictions on reuse or publication appropriately.
- Protect privacy: Give due consideration to personal information and privacy when scraping. Avoid inappropriate data collection and disclosure of personal information, and put appropriate security measures in place.
If you are unsure, it is a good idea to contact the website's operator before you start scraping.
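The robots.txt check described above can be automated with Python's standard library. The following is a minimal sketch, not a complete policy checker: the robots.txt content is a made-up example inlined so the snippet runs without network access, and the user-agent name `my-scraper` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "my-scraper"  # hypothetical user-agent name for this sketch


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether this user agent may fetch each URL.
allowed_public = parser.can_fetch(USER_AGENT, "https://example.com/index.html")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/data.html")

# Crawl-delay is optional; crawl_delay() returns None when it is not specified.
delay = parser.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay)
```

Against a real site, you would typically call `parser.set_url(...)` with the site's robots.txt URL followed by `parser.read()` instead of passing text to `parse()`.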
Basic Scraping Techniques
Let's go through the basic scraping workflow, following the steps below.
Installing the Libraries
This article uses Python's Requests and Beautiful Soup libraries for scraping. Install them with the following command.
pip install requests beautifulsoup4
Downloading a Web Page
Use the Requests library to download the page's HTML with the following code.
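import requests
url = "https://example.com"
# Set a timeout so the request cannot hang indefinitely
response = requests.get(url, timeout=10)
# Raise an exception on 4xx/5xx responses instead of parsing an error page
response.raise_for_status()
html = response.text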
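When downloading more than one page, the access-frequency advice from earlier applies. The sketch below is one possible way to pause between requests and stop on HTTP errors. `fetch_pages`, `delay_seconds`, and the default values are illustrative names and numbers, not part of Requests, and `session` is assumed to behave like a `requests.Session`.

```python
import time


def fetch_pages(session, urls, delay_seconds=1.0, timeout=10):
    """Fetch each URL in turn, pausing between requests.

    `session` is assumed to offer get(url, timeout=...) returning an object
    with raise_for_status() and .text, as requests.Session does.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request after the first to limit server load.
            time.sleep(delay_seconds)
        response = session.get(url, timeout=timeout)
        # Fail loudly on 4xx/5xx rather than storing an error page.
        response.raise_for_status()
        pages[url] = response.text
    return pages
```

For example, `fetch_pages(requests.Session(), ["https://example.com"])` would return a dictionary mapping each URL to its HTML.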
Parsing the HTML and Extracting Elements
Use Beautiful Soup to parse the downloaded HTML and extract the elements you need, as in the following code.
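from bs4 import BeautifulSoup
# Parse the downloaded HTML with Python's built-in parser
soup = BeautifulSoup(html, "html.parser")
# Text of the first <h1>; note that find() returns None if the page has no <h1>
title = soup.find("h1").text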
That is all it takes to scrape a page with these basic techniques.
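One thing to watch for: `find()` and `select_one()` return `None` when nothing matches, so chaining `.text` directly fails on pages that lack the element. A small helper along these lines is one way to guard against that. `first_text` is a hypothetical name, and `soup` is assumed to be a `BeautifulSoup` object such as the one created above.

```python
def first_text(soup, selector, default=None):
    """Return the stripped text of the first element matching a CSS selector.

    Returns `default` instead of raising when no element matches.
    """
    element = soup.select_one(selector)  # None when nothing matches
    if element is None:
        return default
    return element.get_text(strip=True)
```

`first_text(soup, "h1")` then returns the heading text, or `None` if the page has no `<h1>`.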
Advanced Scraping Techniques and Scraping Dynamic Sites
This section covers more advanced scraping techniques and how to scrape dynamic sites.
The focus is on dynamic sites that rely on JavaScript. Using the Selenium library, we will walk through practical techniques: driving a headless browser, scrolling a page, loading data, and parsing the result.
Installing Selenium
First, install Selenium. Selenium is a web browser automation tool that can be driven from Python, and it is useful in scraping for executing JavaScript and retrieving dynamic content.
pip install selenium
Driving a Headless Browser
By driving a headless browser with Selenium, you can retrieve data from sites that depend on JavaScript. The following code launches Chrome in headless mode and opens the page at a given URL.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
# Set the ChromeDriver path and options
chrome_path = '/path/to/chromedriver'
chrome_options = Options()
chrome_options.add_argument('--headless')  # Run in headless mode
# Start ChromeDriver. Selenium 4 takes the driver path via a Service object;
# recent Selenium 4 releases no longer accept the executable_path argument.
driver = webdriver.Chrome(service=Service(chrome_path), options=chrome_options)
# Open the page at the given URL
url = 'https://example.com'
driver.get(url)
# Get the page source
html = driver.page_source
# Quit the driver
driver.quit()
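If `driver.get()` raises an exception in the code above, `driver.quit()` is never reached and the browser process may be left running. One possible way to guarantee cleanup is a small context manager. `managed_driver` and `create_driver` are hypothetical names introduced for this sketch.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(create_driver):
    """Yield a driver and always call quit() on it afterwards.

    `create_driver` is any zero-argument callable that returns an object
    with a quit() method, such as a Selenium WebDriver.
    """
    driver = create_driver()
    try:
        yield driver
    finally:
        # Runs whether the body finished normally or raised an exception.
        driver.quit()
```

With the setup above, this could be used as `with managed_driver(lambda: webdriver.Chrome(service=Service(chrome_path), options=chrome_options)) as driver:` followed by the scraping code.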
Scrolling the Page and Loading Data
On dynamic sites, more data is sometimes added as the page is scrolled. Let's look at how to scroll a page with Selenium so that this data gets loaded.
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
# Configure and start ChromeDriver (same setup as in the headless browser example)
chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'), options=chrome_options)
# Open the page
driver.get('https://example.com')
# Scroll the page to load more data
SCROLL_PAUSE_TIME = 2  # Seconds to wait after each scroll
scroll_count = 3  # Maximum number of scrolls
# Get the current page height
last_height = driver.execute_script('return document.body.scrollHeight')
# Scroll repeatedly
for _ in range(scroll_count):
    # Scroll to the bottom of the page
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    # Wait for the new content to load
    time.sleep(SCROLL_PAUSE_TIME)
    # Get the new page height
    new_height = driver.execute_script('return document.body.scrollHeight')
    # Stop if the height has not changed
    if new_height == last_height:
        break
    last_height = new_height
# Get the page source after scrolling
html = driver.page_source
# Quit the driver
driver.quit()
Parsing the Page and Extracting Data
Next, combine this with Beautiful Soup to extract the information you need from the scraped data. The following code extracts specific elements from the scraped HTML.
from bs4 import BeautifulSoup
# Parse the scraped HTML with Beautiful Soup
soup = BeautifulSoup(html, 'html.parser')
# Extract specific elements (here, every element with the class "title")
titles = soup.select('.title')
for title in titles:
    print(title.text)
These are the practical techniques covered in this section: driving a headless browser, scrolling a page, and loading data with Selenium, then parsing the data and extracting elements with Beautiful Soup. Combined, they let you scrape data effectively from dynamic sites that rely on JavaScript.
Summary
This article covered web scraping from the basics through to more advanced techniques. We looked at the principles and best practices of ethical scraping, worked through basic scraping methods, and then examined advanced techniques and scraping dynamic sites in detail.
Web scraping is an important technique for tasks such as information gathering and data analysis, and it can be of real benefit to both companies and individuals.
Put what you have learned here to use in your own projects and research.












