Parsing XML in Python

Environment

Python 3.11.2
macOS Ventura 13.2.1

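Note: This post introduces scraping with Python's standard xml library and XPath, but the standard library is known to be vulnerable to maliciously crafted XML data. If you need to meet a certain level of security, for example when handling untrusted XML or writing production code, take your own precautions: use defusedxml, an alternative to the standard XML library that provides hardened equivalents of the same functions; validate the input first; or simply avoid the vulnerable features.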
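One way "avoiding the vulnerable features" could look with only the standard library is to refuse any document that declares a DTD, since entity-expansion attacks depend on one. The following is a minimal sketch under that assumption and not a substitute for defusedxml; `parse_without_dtd` and `DoctypeForbidden` are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class DoctypeForbidden(ValueError):
    """Raised when the input declares a DTD (hypothetical helper class)."""


def parse_without_dtd(xml_text):
    """Parse xml_text with ElementTree, refusing any DOCTYPE declaration."""

    def reject(name, system_id, public_id, has_internal_subset):
        # expat calls this as soon as it sees "<!DOCTYPE ...".
        raise DoctypeForbidden(f"DOCTYPE {name!r} is not allowed")

    checker = expat.ParserCreate()
    checker.StartDoctypeDeclHandler = reject
    checker.Parse(xml_text, True)  # first pass: only checks for a DOCTYPE

    # Second pass: the document has no DTD, so hand it to ElementTree.
    return ET.fromstring(xml_text)


root = parse_without_dtd("<data><year>2008</year></data>")
print(root.find("year").text)  # 2008
```

The document is parsed twice, which keeps the sketch short at the cost of some speed.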

Using the xmltodict library

xmltodict is a handy library (GitHub page) that reads XML and converts it into a Python dict. It can be installed easily with pip.


    pip install xmltodict

Sample code

xmltodict's parse()


import xmltodict

xml = '''<?xml version="1.0"?>
<data>
  <country name="Liechtenstein">
      <rank>1</rank>
      <year>2008</year>
      <gdppc>141100</gdppc>
      <neighbor name="Austria" direction="E"/>
      <neighbor name="Switzerland" direction="W"/>
  </country>
  <country name="Singapore">
      <rank>4</rank>
      <year>2011</year>
      <gdppc>59900</gdppc>
      <neighbor name="Malaysia" direction="N"/>
  </country>
  <country name="Panama">
      <rank>68</rank>
      <year>2011</year>
      <gdppc>13600</gdppc>
      <neighbor name="Costa Rica" direction="W"/>
      <neighbor name="Colombia" direction="E"/>
  </country>
</data>'''

root = xmltodict.parse(xml)
first_country = root['data']['country'][0]
print(first_country['@name']) # Liechtenstein
print(first_country['year']) # 2008

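The example passes an XML string to parse(), but a file-like object opened in binary mode works too (a file path string does not, because parse() would treat the path itself as XML). The call returns a dict, so you can work with the XML structure as an ordinary dictionary. Attribute values are accessed by prefixing the key with "@". Note that country is a list here only because the element repeats; an element that appears once comes back as a single value rather than a list.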
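To make the shape of that dictionary more concrete, here is a rough standard-library approximation of the same convention: "@" keys for attributes, plain strings for text-only elements, and a list only when a tag repeats. It is a sketch of the idea, not xmltodict's actual implementation, and `element_to_dict` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

xml = """<data>
  <country name="Liechtenstein">
    <year>2008</year>
    <neighbor name="Austria" direction="E"/>
    <neighbor name="Switzerland" direction="W"/>
  </country>
  <country name="Singapore">
    <year>2011</year>
    <neighbor name="Malaysia" direction="N"/>
  </country>
</data>"""


def element_to_dict(elem):
    """Convert an Element into nested dicts, lists, and strings."""
    node = {f"@{key}": value for key, value in elem.attrib.items()}
    for child in elem:
        value = element_to_dict(child)
        if child.tag in node:
            # A repeated tag turns the entry into a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (elem.text or "").strip()
    if not node:
        return text or None  # text-only (or empty) element
    if text:
        node["#text"] = text
    return node


root = ET.fromstring(xml)
doc = {root.tag: element_to_dict(root)}

print(doc["data"]["country"][0]["@name"])    # Liechtenstein
print(doc["data"]["country"][1]["neighbor"])  # a single dict, not a list
```

The last line shows the pitfall: Singapore has one neighbor, so the value is a dict, while Liechtenstein's two neighbors form a list.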

From dict to XML

A nice thing about xmltodict is how easily dicts and XML convert back and forth.

Converting a dict back to XML


print(xmltodict.unparse(root, pretty=True))

This prints the result as XML. Passing pretty=True as well returns nicely indented XML; the default is False.

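Using the standard library

xml.etree.ElementTree is the XML API in Python's standard library (official documentation). It also supports a limited form of XPath (roughly what CSS selectors are to HTML), which makes it a good fit for simple XML parsing and scraping.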

ElementTree and Element

The API has two main types, ElementTree and Element. An ElementTree represents the whole XML document, including its hierarchy of elements. Each individual element in that hierarchy is represented by an Element.
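The relationship between the two types can be checked directly. This small sketch parses from an in-memory text stream (an assumption made so that no file is needed) and shows that an Element can be wrapped back into an ElementTree.

```python
import io
import xml.etree.ElementTree as ET

# A text stream stands in for a real XML file here.
source = io.StringIO('<data><country name="Panama"/></data>')

tree = ET.parse(source)   # the whole document
root = tree.getroot()     # the top-level element

print(type(tree).__name__)  # ElementTree
print(type(root).__name__)  # Element

# An existing Element can be wrapped in a new ElementTree when needed.
wrapped = ET.ElementTree(root)
print(wrapped.getroot() is root)  # True
```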

Commonly used parsing functions

parse() reads a file → returns an ElementTree
fromstring() reads a string → returns an Element

This post covers two: parse(), which takes a file path as a string and returns that XML file as an ElementTree, and fromstring(), which takes an XML string and returns an Element.

Sample code

The parse() function


# sample - from XML files
import xml.etree.ElementTree as ET

tree = ET.parse('data.xml') # return ElementTree
root = tree.getroot() # return Element
print(root.tag) # "data"

The fromstring() function


# sample.py - from String variables
import xml.etree.ElementTree as ET

xml = '''<?xml version="1.0"?>
<data>
  <country name="Liechtenstein">
      <rank>1</rank>
      <year>2008</year>
      <gdppc>141100</gdppc>
      <neighbor name="Austria" direction="E"/>
      <neighbor name="Switzerland" direction="W"/>
  </country>
  <country name="Singapore">
      <rank>4</rank>
      <year>2011</year>
      <gdppc>59900</gdppc>
      <neighbor name="Malaysia" direction="N"/>
  </country>
  <country name="Panama">
      <rank>68</rank>
      <year>2011</year>
      <gdppc>13600</gdppc>
      <neighbor name="Costa Rica" direction="W"/>
      <neighbor name="Colombia" direction="E"/>
  </country>
</data>'''

root = ET.fromstring(xml) # return Element
print(root.tag) # "data"

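Element properties and methods

When reading from a file, call getroot() on the returned ElementTree to get the root element (<data> in the example) as an Element.

The .tag property gives an element's tag name and .text gives the text inside it. The .attrib property holds the element's attributes as a dict (for example {'name': 'Liechtenstein'}).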
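A short sketch of these three properties on a cut-down version of the sample data. Note that .text is always a string (or None), so numbers need an explicit conversion.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<data><country name="Singapore"><rank>4</rank><year>2011</year></country></data>'
)

country = root[0]             # child elements can be indexed like a list
rank = country.find("rank")

print(country.tag)     # country
print(country.attrib)  # {'name': 'Singapore'}
print(rank.text)       # 4 (a str)
print(int(rank.text) + 1)  # 5
```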

The iter() method walks through the elements of the tree in order, optionally only those with a given tag. It is often combined with a for loop.

Iterating with iter()


for year in root.iter('year'):
  print(year.text) # "2008", "2011", "2011"
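The same loop shape also works for collecting values. The sketch below uses a shortened version of the sample data to build a dict from iter() and to show that iter() without a tag visits every element, including the root.

```python
import xml.etree.ElementTree as ET

xml = """<data>
  <country name="Liechtenstein"><gdppc>141100</gdppc></country>
  <country name="Singapore"><gdppc>59900</gdppc></country>
</data>"""
root = ET.fromstring(xml)

# Map each country name to its GDP per capita as an int.
gdp_by_country = {
    country.get("name"): int(country.find("gdppc").text)
    for country in root.iter("country")
}
print(gdp_by_country)  # {'Liechtenstein': 141100, 'Singapore': 59900}

# Without a tag argument, iter() yields the root and all descendants.
all_tags = [elem.tag for elem in root.iter()]
print(all_tags)  # ['data', 'country', 'gdppc', 'country', 'gdppc']
```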

find() and findall() look up child elements that match a condition. find() returns the first matching element, and findall() returns all matching elements as a list. The condition string is a limited XPath expression.

Selecting elements with XPath


first_country = root.find('country')
print(first_country.get('name')) # "Liechtenstein"
print(first_country.find('year').text) # "2008"

neighbors = root.findall('.//neighbor')
print(neighbors[0].get('direction')) # "E"

first_west = root.find('.//*[@direction="W"]')
print(first_west.get('name')) # "Switzerland"

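get() returns just the value of an attribute. Note that the supported XPath syntax is a limited subset of standard XPath and differs from it slightly; see the official documentation for details.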
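As a rough illustration of what the supported subset covers, the following sketch tries a few predicate forms on a compact version of the sample data: an attribute match, a child-text match, and positional selectors.

```python
import xml.etree.ElementTree as ET

xml = """<data>
  <country name="Liechtenstein"><rank>1</rank><year>2008</year><gdppc>141100</gdppc></country>
  <country name="Singapore"><rank>4</rank><year>2011</year><gdppc>59900</gdppc></country>
  <country name="Panama"><rank>68</rank><year>2011</year><gdppc>13600</gdppc></country>
</data>"""
root = ET.fromstring(xml)

# [@attrib='value']: match on an attribute, then step into a child.
singapore_gdp = root.find("country[@name='Singapore']/gdppc").text
print(singapore_gdp)  # 59900

# [tag='text']: match on the full text of a child element.
from_2011 = [c.get("name") for c in root.findall("country[year='2011']")]
print(from_2011)  # ['Singapore', 'Panama']

# [position] is 1-based; [last()] selects the final match.
second = root.find("country[2]").get("name")
last = root.find("country[last()]").get("name")
print(second, last)  # Singapore Panama
```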

Editing an Element

Elements can of course be edited too. Assigning a string to .text overwrites the element's text, and the set() method changes an attribute value.

Editing and changing elements


first_year = root.find('.//year')

first_year.text = '2009' # "2008" -> "2009"
first_year.set('modified', 'yes') # adding "modified" attribute

print(first_year.text) # "2009"
print(first_year.attrib) # {'modified': 'yes'}
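Editing is not limited to text and attributes. As a small extension of the example above, this sketch adds a child with ET.SubElement() and removes one with remove(); the data is a cut-down stand-in for the sample.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<data><country name="Panama"><rank>68</rank></country></data>')
country = root.find("country")

# Append a new <year> child and give it some text.
year = ET.SubElement(country, "year")
year.text = "2011"

# Remove the existing <rank> child from its parent.
country.remove(country.find("rank"))

result = ET.tostring(root, encoding="unicode")
print(result)
# <data><country name="Panama"><year>2011</year></country></data>
```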

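An ElementTree or Element can also be turned back into XML. Use tostring() to get a string and write() to write an XML file. tostring() takes an Element, while write() is a method of ElementTree, which mirrors how fromstring() and parse() differ when parsing.

Converting an ElementTree/Element back to XML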


xml_str = ET.tostring(root, encoding='utf-8').decode(encoding='utf-8') # return XML as String

ET.ElementTree(root).write('country_data.xml', encoding='utf-8', xml_declaration=True) # write to file

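The default encoding for tostring() and write() is "us-ascii", so pass encoding explicitly, as in the example, when you want "utf-8" or "unicode". With encoding="unicode", tostring() returns a str; with anything else (including ASCII and UTF-8) it returns bytes, which you need to decode to a str with .decode() as in the example. The xml_declaration argument controls whether the XML declaration (the <?xml version='1.0'… line at the top of the file) is written. Its default is None, which writes the declaration only when the encoding is not US-ASCII, UTF-8, or Unicode, so with UTF-8 you have to ask for it explicitly.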
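The return types are easy to confirm with a non-ASCII value. In this sketch an in-memory io.BytesIO buffer stands in for a real file (an assumption made so that nothing is written to disk).

```python
import io
import xml.etree.ElementTree as ET

root = ET.fromstring("<city>東京</city>")

# Default (us-ascii): bytes, with non-ASCII text as character references.
as_default = ET.tostring(root)

# encoding="unicode": a str, with the text left as-is.
as_text = ET.tostring(root, encoding="unicode")
print(as_text)  # <city>東京</city>

# Explicit UTF-8 plus an explicit XML declaration: bytes.
as_utf8 = ET.tostring(root, encoding="utf-8", xml_declaration=True)

# write() accepts a binary file object as well as a path.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)

print(type(as_default).__name__, type(as_text).__name__)  # bytes str
```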

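Checking whether an element exists

To finish, one pitfall when working with Elements: how to write the condition when checking, in an if statement or similar, whether an element exists.

How to check that an element exists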


first_year = root.find('.//year')

if not first_year:
  print('`first_year` does not exist OR `first_year` has no children!')

if first_year is None:
  print('`first_year` does not exist.')

if len(first_year):
  print('`first_year` has children.')

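As the official documentation warns, not first_year is True both when first_year does not exist and when it exists but has no children. To check only that first_year does not exist, use first_year is None. Likewise, use len() to check for children: len(first_year) returns the number of child elements as an int. From Python 3.12, truth-testing an Element also emits a DeprecationWarning.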
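Put together, the two explicit checks might look like the sketch below. `describe` is a hypothetical helper; findtext() with a default is a standard-library shortcut for the common "text or fallback" case.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<data><country name="Liechtenstein"><rank>1</rank><year>2008</year></country></data>'
)


def describe(parent, path):
    """Report whether the first match for path is missing, a leaf, or a parent."""
    elem = parent.find(path)
    if elem is None:        # nothing matched
        return "missing"
    if len(elem) == 0:      # matched, but no child elements
        return "no children"
    return f"{len(elem)} children"


print(describe(root, "country"))    # 2 children
print(describe(root, ".//year"))    # no children
print(describe(root, "continent"))  # missing

# findtext() returns the default instead of raising on a missing element.
population = root.findtext(".//population", default="n/a")
print(population)  # n/a
```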

Finally, here is a simple sample that combines xmltodict and Requests to parse the XML returned by a fictional weather forecast API.

Tomorrow's weather


import requests
import xmltodict

response = requests.get('https://weather.api', params={"location": "Land-of-Ooo"})
data = xmltodict.parse(response.content)

unit = data["data"]["unit"]

print(unit["temperature"]) # "Celsius"
print(unit["speed"]) # "Meter-per-second"

tomorrow = data["data"]["weather"]["tomorrow"]

print(tomorrow["condition"]) # "Sunny"
print(tomorrow["feelslike"]) # "24.5"
print(tomorrow["wind"]["speed"]) # "2.1"
print(tomorrow["wind"]["direction"]) # "North-east"

print(xmltodict.unparse(data, pretty=True))
# <?xml version="1.0" encoding="utf-8"?>
# <data>
#   <unit>
#     <temperature>Celsius</temperature>
#     <speed>Meter-per-second</speed>
#     ...
#   </unit>
#   <weather>
#     <yesterday date="2023-03-27">...</yesterday>
#     <today date="2023-03-28">...</today>
#     <tomorrow date="2023-03-29">
#       <condition>Sunny</condition>
#       <feelslike>24.5</feelslike>
#       <wind>
#         <speed>2.1</speed>
#         <direction>North-east</direction>
#       </wind>
#       ...
#     </tomorrow>
#   </weather>
# </data>
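For comparison, the same fields could be read with ElementTree alone. The payload below is a made-up stand-in that reproduces only the parts of the fictional response shown above, so the sketch runs without any network access.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload mirroring the structure in the example above.
payload = """<data>
  <unit>
    <temperature>Celsius</temperature>
    <speed>Meter-per-second</speed>
  </unit>
  <weather>
    <tomorrow date="2023-03-29">
      <condition>Sunny</condition>
      <feelslike>24.5</feelslike>
      <wind>
        <speed>2.1</speed>
        <direction>North-east</direction>
      </wind>
    </tomorrow>
  </weather>
</data>"""

root = ET.fromstring(payload)
tomorrow = root.find("weather/tomorrow")

forecast = {
    "date": tomorrow.get("date"),
    "condition": tomorrow.findtext("condition"),
    "feelslike": float(tomorrow.findtext("feelslike")),
    "wind_speed": float(tomorrow.findtext("wind/speed")),
    "wind_direction": tomorrow.findtext("wind/direction"),
}
print(forecast["condition"], forecast["feelslike"])  # Sunny 24.5
```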

Hideyuki Tabata, 2023-03-28T11:05:00+09:00, https://www.storange.jp/2023/03/parse-xml-in-python.html
