
[Python] How to Get Elements with BeautifulSoup

2025-04-15 (last updated: 2025-04-15)

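BeautifulSoup is a Python library for parsing HTML and XML.

To get elements, use the find() and find_all() methods.

find() returns the first matching element, while find_all() returns every matching element as a list.

For example, soup.find('a') gets the first <a> tag, and soup.find_all('a') gets all <a> tags.

You can also narrow down elements by class name or ID.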

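Table of contents

What is BeautifulSoup?
Basics of getting elements with BeautifulSoup
Narrowing down elements by attribute
Navigating the document hierarchy with BeautifulSoup
Getting elements with CSS selectors
Getting text with BeautifulSoup
Practical examples: scraping with BeautifulSoup
Summary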

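What is BeautifulSoup?

BeautifulSoup is a Python library for parsing HTML and XML documents.

It is very handy for web scraping and data extraction, and makes it easy to pull the information you need out of web pages, even ones with complex HTML structures.

BeautifulSoup treats a document as a tree and is designed so that searching for and manipulating elements feels intuitive.

The library has the following main features:

Ease of use: It offers a simple API that beginners can pick up quickly.
Flexibility: It works with several HTML/XML parsers (such as html.parser, lxml, and html5lib), so it can handle different kinds of documents.
Powerful search: Elements can be found by tag name, attributes, text content, and more.

Because it makes collecting and analyzing web data efficient, it is used by many data scientists and developers.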

Basics of getting elements with BeautifulSoup

BeautifulSoup makes it easy to get specific elements from an HTML document.

This section covers the basic ways to retrieve elements.

Using the find() method

The find() method returns the first element that matches the given criteria, or None if nothing matches.

For example, use it to get the first element with a particular tag name.

from bs4 import BeautifulSoup
html_doc = "<html><body><h1>Title</h1><p>This is a paragraph.</p></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the first h1 tag
h1_element = soup.find('h1')
print(h1_element)

<h1>Title</h1>
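
Because find() returns None when nothing matches, it is usually safer to check the result before calling methods on it. The following is a minimal sketch of that check; the h2 lookup and the fallback string are only illustrations and are not part of the example above.

```python
from bs4 import BeautifulSoup

html_doc = "<html><body><h1>Title</h1><p>This is a paragraph.</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# There is no <h2> in this document, so find() returns None
h2_element = soup.find("h2")

# Guard against None before calling methods on the result
if h2_element is None:
    heading_text = "(no h2 found)"
else:
    heading_text = h2_element.get_text()

print(heading_text)
```

Without the check, h2_element.get_text() would raise an AttributeError, which is one of the most common errors when a page layout changes.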

Using the find_all() method

The find_all() method returns every element that matches the given criteria as a list (a ResultSet, which behaves like a list).

It is useful when you want all elements with a particular tag name.

from bs4 import BeautifulSoup
html_doc = "<html><body><h1>Title</h1><p>Paragraph 1</p><p>Paragraph 2</p></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')
# Get all p tags
p_elements = soup.find_all('p')
for p in p_elements:
    print(p)

<p>Paragraph 1</p>
<p>Paragraph 2</p>
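
find_all() also accepts a list of tag names and a limit argument, and it returns an empty list when nothing matches. The sketch below illustrates these three behaviors on a small sample document; the variable names are chosen for this illustration only.

```python
from bs4 import BeautifulSoup

html_doc = "<html><body><h1>Title</h1><p>Paragraph 1</p><p>Paragraph 2</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# A list of tag names matches any of them, in document order
headings_and_paragraphs = soup.find_all(["h1", "p"])
names = [tag.name for tag in headings_and_paragraphs]

# limit stops the search after the given number of matches
first_paragraph_only = soup.find_all("p", limit=1)

# No match yields an empty list, not None
no_tables = soup.find_all("table")

print(names, len(first_paragraph_only), len(no_tables))
```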

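Getting elements by tag name

You can get elements by specifying a tag name.

Just pass the tag name as an argument to find() or find_all().

# Get the h1 tag (soup is the object created in the previous example)
h1_element = soup.find('h1')
print(h1_element)

<h1>Title</h1>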

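Getting elements by attribute

You can also get elements by specifying their attributes.

Pass the attribute as a keyword argument to find() or find_all() (class_ for the class attribute, because class is a reserved word in Python), or pass a dictionary through the attrs argument.

html_doc = '<div class="content"><p>Paragraph 1</p><p class="highlight">Paragraph 2</p></div>'
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the p tag whose class attribute is 'highlight'
highlight_element = soup.find('p', class_='highlight')
print(highlight_element)

<p class="highlight">Paragraph 2</p>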
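
Some attribute names, such as data-* attributes, cannot be written as Python keyword arguments because they contain a hyphen. In that case the attrs dictionary is the usual route. The sketch below assumes a made-up data-role attribute purely for illustration.

```python
from bs4 import BeautifulSoup

# data-role is a hypothetical attribute used only for this example
html_doc = '<div><p data-role="note">Note A</p><p data-role="warning">Note B</p></div>'
soup = BeautifulSoup(html_doc, "html.parser")

# data-role is not a valid keyword argument name, so pass it through attrs
warning = soup.find("p", attrs={"data-role": "warning"})
print(warning)
```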

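Getting elements by text

You can also get elements based on their text content.

Use the string argument to search for elements with specific text (text is the older name for the same argument; it still works but is deprecated in recent releases).

html_doc = '<p>Paragraph 1</p><p>Specific text</p>'
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the p tag whose text is 'Specific text'
specific_text_element = soup.find('p', string='Specific text')
print(specific_text_element)

<p>Specific text</p>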

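Narrowing down elements by attribute

BeautifulSoup lets you narrow a search to specific elements by using their attributes.

This section covers getting elements by class and by id, combining several attributes, and filtering with regular expressions.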

Getting elements by class (class_)

Use the class_ keyword argument to get elements that have a given class name.

Pass class_ as an argument to find() or find_all().

from bs4 import BeautifulSoup
html_doc = '''
<div>
    <p class="content">Paragraph 1</p>
    <p class="highlight">Paragraph 2</p>
    <p class="content">Paragraph 3</p>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the p tags whose class attribute is 'content'
content_elements = soup.find_all('p', class_='content')
for element in content_elements:
    print(element)

<p class="content">Paragraph 1</p>
<p class="content">Paragraph 3</p>

Getting elements by id

You can also use the id attribute to get the element with a particular ID.

Pass the ID to the find() method.

html_doc = '''
<div>
    <p id="first">First paragraph</p>
    <p id="second">Second paragraph</p>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the p tag whose id attribute is 'second'
second_element = soup.find('p', id='second')
print(second_element)

<p id="second">Second paragraph</p>

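Getting elements by multiple attributes

You can also combine several attributes to get an element.

Passing more than one keyword argument (or an attrs dictionary with several keys) narrows the match to elements that satisfy all of the conditions.

html_doc = '''
<div>
    <p class="content" id="first">Paragraph 1</p>
    <p class="highlight" id="second">Paragraph 2</p>
    <p class="content" id="third">Paragraph 3</p>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the p tag whose class attribute is 'content' and whose id attribute is 'third'
specific_element = soup.find('p', class_='content', id='third')
print(specific_element)

<p class="content" id="third">Paragraph 3</p>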
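
The same multi-attribute search can be written with an attrs dictionary. The sketch below compares the two spellings on a similar sample document and is meant only to show that they select the same element.

```python
from bs4 import BeautifulSoup

html_doc = '''
<div>
    <p class="content" id="first">Paragraph 1</p>
    <p class="highlight" id="second">Paragraph 2</p>
    <p class="content" id="third">Paragraph 3</p>
</div>
'''
soup = BeautifulSoup(html_doc, "html.parser")

# Keyword-argument form
by_keywords = soup.find("p", class_="content", id="third")

# Dictionary form: every key/value pair must match
by_dict = soup.find("p", attrs={"class": "content", "id": "third"})

print(by_keywords is by_dict)
```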

Filtering attributes with regular expressions

BeautifulSoup also lets you filter attributes with regular expressions.

Import the re module and pass a compiled regular expression as the attribute value (here, to class_).

import re
from bs4 import BeautifulSoup
html_doc = '''
<div>
    <p class="content">Paragraph 1</p>
    <p class="highlight">Paragraph 2</p>
    <p class="content-highlight">Paragraph 3</p>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the p tags whose class attribute contains 'content'
pattern_elements = soup.find_all('p', class_=re.compile('content'))
for element in pattern_elements:
    print(element)

<p class="content">Paragraph 1</p>
<p class="content-highlight">Paragraph 3</p>

As this shows, regular expressions give you a more flexible way to narrow down elements.
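
A compiled pattern is applied with search semantics, so it matches anywhere inside the class name unless it is anchored. The sketch below contrasts an unanchored pattern with one anchored at the end; the helper variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

html_doc = '''
<div>
    <p class="content">Paragraph 1</p>
    <p class="highlight">Paragraph 2</p>
    <p class="content-highlight">Paragraph 3</p>
</div>
'''
soup = BeautifulSoup(html_doc, "html.parser")

# Unanchored: "content" may appear anywhere in the class name
loose = soup.find_all("p", class_=re.compile("content"))

# Anchored with $: only class names that end with "highlight"
ends_with_highlight = soup.find_all("p", class_=re.compile("highlight$"))

loose_texts = [p.get_text() for p in loose]
anchored_texts = [p.get_text() for p in ends_with_highlight]
print(loose_texts, anchored_texts)
```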

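Navigating the document hierarchy with BeautifulSoup

BeautifulSoup lets you work with elements through the hierarchy of the HTML document.

This section covers how to get parent, child, sibling, ancestor, and adjacent elements.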

Getting the parent element

To get an element's parent, use the parent attribute.

It returns the element directly above the given element.

from bs4 import BeautifulSoup
html_doc = '''
<div>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the parent of the first p tag
p_element = soup.find('p')
parent_element = p_element.parent
print(parent_element)

<div>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
</div>

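Getting child elements

To get an element's children, use the children attribute.

It yields the element's direct child nodes. These include the whitespace-only text nodes between tags, so the example keeps only Tag objects.

from bs4 import BeautifulSoup, Tag
html_doc = '''
<div>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the child elements of the div tag
div_element = soup.find('div')
for child in div_element.children:
    # Skip the whitespace-only text nodes between the tags
    if isinstance(child, Tag):
        print(child)

<p>Paragraph 1</p>
<p>Paragraph 2</p>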
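
children only goes one level deep. For nested markup, the descendants attribute walks every level, and find_all(recursive=False) restricts a search to direct children. The sketch below compares the three on a small nested document written for this illustration.

```python
from bs4 import BeautifulSoup, Tag

html_doc = "<div><section><p>Paragraph 1</p></section><p>Paragraph 2</p></div>"
soup = BeautifulSoup(html_doc, "html.parser")
div_element = soup.find("div")

# Direct children only: <section> and the second <p>
child_names = [child.name for child in div_element.children if isinstance(child, Tag)]

# Every nested tag in document order: <section>, its <p>, then the second <p>
descendant_names = [node.name for node in div_element.descendants if isinstance(node, Tag)]

# recursive=False limits find_all() to direct children
direct_paragraphs = div_element.find_all("p", recursive=False)

print(child_names, descendant_names, direct_paragraphs)
```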

Getting sibling elements

To get an element's siblings, use the find_next_sibling() and find_previous_sibling() methods.

They return the next or previous sibling of the given element that matches.

html_doc = '''
<div>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the next sibling of the first p tag
first_p = soup.find('p')
next_sibling = first_p.find_next_sibling('p')
print(next_sibling)

<p>Paragraph 2</p>
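
The next_sibling attribute and the find_next_sibling() method are easy to confuse. The attribute returns the very next node, which is often a whitespace string, while the method skips ahead to the next matching tag. A short sketch of the difference, using a three-paragraph sample made up for this illustration:

```python
from bs4 import BeautifulSoup

html_doc = '''
<div>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
    <p>Paragraph 3</p>
</div>
'''
soup = BeautifulSoup(html_doc, "html.parser")
first_p = soup.find("p")

# The attribute returns the very next node, here a whitespace-only string
raw_next = first_p.next_sibling

# The method skips non-matching nodes and returns the next <p> tag
next_p = first_p.find_next_sibling("p")

# The plural form collects every following sibling that matches
following = first_p.find_next_siblings("p")

# find_previous_sibling() works the same way in the other direction
previous_p = following[-1].find_previous_sibling("p")

print(repr(raw_next), next_p, following, previous_p)
```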

Getting ancestor elements

To get an ancestor of an element, use the find_parent() method.

It walks up the tree and returns the nearest ancestor that matches, so it can reach elements above the immediate parent.

html_doc = '''
<div>
    <section>
        <p>Paragraph 1</p>
    </section>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the div ancestor of the p tag
p_element = soup.find('p')
ancestor_element = p_element.find_parent('div')
print(ancestor_element)

<div>
    <section>
        <p>Paragraph 1</p>
    </section>
</div>
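
To inspect the whole chain of ancestors rather than a single one, the parents attribute and the find_parents() method can be used. The sketch below lists the ancestors of a paragraph from nearest to farthest; the article lookup is just an example of a tag that is not present.

```python
from bs4 import BeautifulSoup

html_doc = "<div><section><p>Paragraph 1</p></section></div>"
soup = BeautifulSoup(html_doc, "html.parser")
p_element = soup.find("p")

# parents yields every ancestor, ending with the BeautifulSoup object itself
ancestor_names = [parent.name for parent in p_element.parents]

# find_parents() returns all matching ancestors, nearest first
matching = [tag.name for tag in p_element.find_parents(["section", "div"])]

# find_parent() returns None when no ancestor matches
missing = p_element.find_parent("article")

print(ancestor_names, matching, missing)
```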

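Getting adjacent elements

To get the element that comes next or before, use the find_next() and find_previous() methods.

They search forward or backward in document order (not only among siblings) and return the first element that matches.

html_doc = '''
<div>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the next p tag after the first p tag
first_p = soup.find('p')
next_element = first_p.find_next('p')
print(next_element)

<p>Paragraph 2</p>

As these examples show, BeautifulSoup makes it easy to move through the hierarchy of an HTML document and get the elements you need.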

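Getting elements with CSS selectors

BeautifulSoup can also get elements by CSS selector.

Selectors are an intuitive way to pick elements, which makes data extraction easier, especially from documents with complex HTML structures.

This section covers the select() and select_one() methods, the basic CSS selector syntax, and getting elements with more complex selectors.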

Using the select() method

The select() method returns every element that matches the given CSS selector as a list.

Just pass the selector as an argument to get the elements.

from bs4 import BeautifulSoup
html_doc = '''
<div>
    <p class="content">Paragraph 1</p>
    <p class="highlight">Paragraph 2</p>
    <p class="content">Paragraph 3</p>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the p tags whose class attribute is 'content'
content_elements = soup.select('p.content')
for element in content_elements:
    print(element)

<p class="content">Paragraph 1</p>
<p class="content">Paragraph 3</p>

Using the select_one() method

The select_one() method returns the first element that matches the given CSS selector, or None if nothing matches.

As with select(), you pass a selector as the argument, but the result is a single element.

html_doc = '''
<div>
    <p class="content">Paragraph 1</p>
    <p class="highlight">Paragraph 2</p>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the p tag whose class attribute is 'highlight'
highlight_element = soup.select_one('p.highlight')
print(highlight_element)

<p class="highlight">Paragraph 2</p>

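Basic CSS selector syntax

A CSS selector is a pattern for selecting HTML elements.

The basic forms are as follows:

Selector type	Example	Description
Tag name	`div`	Selects div tags
Class name	`.content`	Elements with the class 'content'
ID	`#header`	The element whose id attribute is 'header'
Child	`div > p`	p tags that are direct children of a div
Adjacent sibling	`h1 + p`	The p tag immediately following an h1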
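
Each selector form in the table can be tried out with select() and select_one(). The sketch below runs all five against a small sample document written for this illustration.

```python
from bs4 import BeautifulSoup

html_doc = '''
<div id="header">
    <h1>Title</h1>
    <p class="content">Paragraph 1</p>
    <p>Paragraph 2</p>
</div>
'''
soup = BeautifulSoup(html_doc, "html.parser")

by_tag = soup.select("div")           # tag name
by_class = soup.select(".content")    # class name
by_id = soup.select_one("#header")    # ID
children = soup.select("div > p")     # direct children
adjacent = soup.select("h1 + p")      # immediately following sibling

print(len(by_tag), by_class, by_id.name, len(children), adjacent)
```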

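Getting elements with more complex CSS selectors

Combining selectors lets you target elements more precisely.

For example, you can select a particular tag only when it sits inside an element with a particular class.

html_doc = '''
<div>
    <p class="content">Paragraph 1</p>
    <div class="highlight">
        <p>Paragraph 2</p>
        <p class="content">Paragraph 3</p>
    </div>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the p tags directly inside the div whose class attribute is 'highlight'
highlight_paragraphs = soup.select('div.highlight > p')
for paragraph in highlight_paragraphs:
    print(paragraph)

<p>Paragraph 2</p>
<p class="content">Paragraph 3</p>

As this shows, CSS selectors let you get the elements you need from an HTML document efficiently.

They are especially powerful for documents with complex structures.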
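
Two selector details often matter in practice: the child combinator (>) versus the descendant combinator (a space), and attribute selectors. The sketch below shows both on a sample document whose links and class names are made up for this illustration.

```python
from bs4 import BeautifulSoup

html_doc = '''
<div class="highlight">
    <p>Paragraph 1</p>
    <section>
        <p class="content">Paragraph 2</p>
        <a href="https://example.com/docs">Docs</a>
        <a href="/local">Local</a>
    </section>
</div>
'''
soup = BeautifulSoup(html_doc, "html.parser")

# ">" matches direct children only
child_only = soup.select("div.highlight > p")

# A space matches descendants at any depth
any_depth = soup.select("div.highlight p")

# Attribute selector: links whose href starts with "https://"
external = soup.select('a[href^="https://"]')

# select_one() returns None when nothing matches
no_match = soup.select_one("div.highlight > table")

print(child_only, any_depth, external, no_match)
```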

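Getting text with BeautifulSoup

BeautifulSoup makes it easy to get the text from elements in an HTML document.

This section covers getting the text inside an element, using the get_text() method, formatting text and handling line breaks, and getting elements that contain specific text.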

Getting the text inside an element

To get the text from a specific element, use the element's text attribute.

It gives you the text inside the element directly.

from bs4 import BeautifulSoup
html_doc = '''
<div>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the text of the first p tag
first_paragraph = soup.find('p').text
print(first_paragraph)

Paragraph 1

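Using the get_text() method

The get_text() method returns all of the text inside an element.

It joins the text of every descendant of the element into one string. Whitespace between the tags is included as-is, so the actual output also carries the indentation and blank lines of the source; they are trimmed in the output shown here.

html_doc = '''
<div>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# Get all the text inside the div
all_text = soup.find('div').get_text()
print(all_text)

Paragraph 1
Paragraph 2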

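Formatting text and handling line breaks

get_text() takes a separator argument that specifies the string to insert between pieces of text.

Setting the strip argument to True strips leading and trailing whitespace from each piece of text and drops pieces that end up empty.

html_doc = '''
<div>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the text in a formatted form
formatted_text = soup.find('div').get_text(separator=' | ', strip=True)
print(formatted_text)

Paragraph 1 | Paragraph 2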
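
When the individual pieces of text are needed rather than one joined string, the stripped_strings generator is a convenient alternative. The sketch below compares it with the raw and the formatted get_text() results on a similar sample document.

```python
from bs4 import BeautifulSoup

html_doc = '''
<div>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
</div>
'''
soup = BeautifulSoup(html_doc, "html.parser")
div_element = soup.find("div")

# Raw text keeps the line breaks and indentation between the tags
raw_text = div_element.get_text()

# separator and strip give one clean, joined string
clean_text = div_element.get_text(separator=" | ", strip=True)

# stripped_strings yields each non-empty piece with whitespace removed
pieces = list(div_element.stripped_strings)

print(repr(raw_text), clean_text, pieces)
```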

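Getting elements that contain specific text

To get elements by their text, use the string argument of find() or find_all() (text is the older name for the same argument).

A plain string must equal the element's entire text; to match only part of the text, pass a compiled regular expression instead.

html_doc = '''
<div>
    <p>Paragraph 1</p>
    <p>A paragraph containing specific text</p>
    <p>Paragraph 3</p>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the p tag whose text is exactly this string
specific_text_element = soup.find('p', string='A paragraph containing specific text')
print(specific_text_element)

<p>A paragraph containing specific text</p>

As these examples show, BeautifulSoup lets you get text from an HTML document efficiently and extract the information you need.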
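
For a true "contains" search, a compiled regular expression can be passed as the text filter. The sketch below assumes a BeautifulSoup version that accepts the string argument (4.4.0 or later) and contrasts an exact match with a partial one.

```python
import re
from bs4 import BeautifulSoup

html_doc = '''
<div>
    <p>Paragraph 1</p>
    <p>A paragraph with specific text inside</p>
    <p>Paragraph 3</p>
</div>
'''
soup = BeautifulSoup(html_doc, "html.parser")

# A plain string must equal the whole text, so this finds nothing
exact = soup.find("p", string="specific text")

# A compiled pattern matches anywhere in the text
partial = soup.find("p", string=re.compile("specific text"))

# Patterns can also pick out a family of similar texts
numbered = soup.find_all("p", string=re.compile(r"^Paragraph \d+$"))

print(exact, partial, numbered)
```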

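Practical examples: scraping with BeautifulSoup

BeautifulSoup is a very powerful tool for web scraping.

This section covers collecting data across multiple pages, handling dynamically generated content, combining BeautifulSoup with Selenium, and combining BeautifulSoup with Requests.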

Collecting data across multiple pages

On some websites the data is split across several pages.

With BeautifulSoup you can loop over the pages and collect the data from each one.

import requests
from bs4 import BeautifulSoup
base_url = 'https://example.com/page='
data = []
# Collect data from page 1 through page 5
for page in range(1, 6):
    response = requests.get(base_url + str(page))
    soup = BeautifulSoup(response.text, 'html.parser')

    # Get the target elements from each page
    items = soup.find_all('div', class_='item')
    for item in items:
        data.append(item.get_text())
print(data)

This code loops over the pages at the given URL and collects the target elements from each page.
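
One way to keep such a loop easy to test is to separate the parsing step from the network step. The sketch below is an offline illustration: extract_items is a hypothetical helper, and the pages list holds made-up HTML strings standing in for the response.text of each request.

```python
from bs4 import BeautifulSoup


def extract_items(html):
    """Return the text of every <div class="item"> in one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [item.get_text(strip=True) for item in soup.find_all("div", class_="item")]


# Hypothetical page bodies standing in for response.text from each request
pages = [
    '<div class="item">Item 1</div><div class="item">Item 2</div>',
    '<div class="item">Item 3</div><div class="other">Skip me</div>',
    "<p>No items on this page</p>",
]

data = []
for html in pages:
    # A page without matching elements simply contributes nothing
    data.extend(extract_items(html))

print(data)
```

In a real scraper, the loop body would pass response.text to the same helper, so the parsing logic can be checked without touching the network.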

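Handling dynamically generated content

Content generated dynamically by JavaScript cannot be retrieved with BeautifulSoup alone, because BeautifulSoup only parses the HTML it is given and does not run scripts.

In that case you need a tool such as Selenium to drive a browser and obtain the fully rendered HTML of the page.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
# Set up the Selenium WebDriver
driver = webdriver.Chrome()
driver.get('https://example.com')
# Wait up to 10 seconds for the dynamic content to appear
# (implicitly_wait() only affects element lookups, so an explicit wait is used)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.dynamic-content'))
)
# Get the page HTML
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
# Get the data you need
data = soup.find_all('div', class_='dynamic-content')
for item in data:
    print(item.get_text())
driver.quit()

This code opens the page with Selenium and retrieves the dynamically generated content.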

Combining BeautifulSoup with Selenium

Using Selenium together with BeautifulSoup lets you collect data from dynamic websites efficiently.

The flow is to operate the page with Selenium and then parse the HTML with BeautifulSoup.

from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('https://example.com')
# Get the page HTML
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
# Get the target elements
items = soup.select('div.item')
for item in items:
    print(item.get_text())
driver.quit()

With this approach you can extract data even from pages that include dynamic content.

Combining BeautifulSoup with Requests

Making the HTTP request with the Requests library and parsing the HTML with BeautifulSoup is the standard approach to scraping.

Requests makes it easy to fetch the contents of a web page.

import requests
from bs4 import BeautifulSoup
url = 'https://example.com'
response = requests.get(url)
# Parse the HTML of the response
soup = BeautifulSoup(response.text, 'html.parser')
# Get the target elements
items = soup.find_all('div', class_='item')
for item in items:
    print(item.get_text())