Derzeit pflegen Pandas I / O-Tools keine read_xml()
Methode und das Gegenstück to_xml()
. Es zeigt sich jedoch, read_json
dass baumartige Strukturen für den Datenrahmenimport und read_html
für Markup-Formate implementiert werden können .
Wenn die Pandas Team einen solchen hält read_xml
für eine zukünftige Version Pandas Methode, welche Implementierung würde sie verfolgen: Parsen mit einem in die xml.etree.ElementTree
mit iterfind()
oder iterparse()
Funktionen oder Fremd Modul, lxml
mit dem XPath 1.0 und XSLT 1.0 Methoden?
Im Folgenden sind meine Testläufe für vier Methodentypen für eine einfache, flache, elementzentrierte XML-Eingabe aufgeführt. Alle sind für das allgemeine Parsen für alle Kinder der zweiten Ebene von root eingerichtet, und jede Methode sollte genau den gleichen Pandas-Datenrahmen liefern. Alle bis auf die letzten Aufrufe pd.Dataframe()
auf der Liste der Wörterbücher. Die XSLT - Methode verwandelt XML CSV für gegossen StringIO()
in pd.read_csv()
.
Frage (mehrteilig)
PERFORMANCE: Wie erklären Sie die
iterparse
für größere Dateien häufig empfohlene Verlangsamung, wenn die Datei iterativ analysiert wird? Liegt es teilweise an denif
Logikprüfungen?SPEICHER: Korreliert der CPU-Speicher mit den Zeitabläufen bei E / A-Aufrufen? XSLT und XPath 1.0 lassen sich mit größeren XML-Dokumenten nicht gut skalieren, da die gesamte Datei im Speicher gelesen werden muss, um analysiert zu werden.
STRATEGIE: Ist die Liste der Wörterbücher eine optimale Strategie für den
Dataframe()
Anruf? Sehen Sie diese interessanten Antworten: Generatorversion und eine benutzerdefinierte Version von iterwalk . Beide Upcast-Listen zum Datenrahmen.
Eingangsdaten (Stack - Überlauf des aktuellen Top-Benutzer für Jahr , von denen unsere Pandas Freunde enthalten sind)
<?xml version="1.0" encoding="utf-8"?>
<stackoverflow>
<topusers>
<user>Gordon Linoff</user>
<link>http://www.stackoverflow.com//users/1144035/gordon-linoff</link>
<location>New York, United States</location>
<year_rep>5,985</year_rep>
<total_rep>499,408</total_rep>
<tag1>sql</tag1>
<tag2>sql-server</tag2>
<tag3>mysql</tag3>
</topusers>
<topusers>
<user>Günter Zöchbauer</user>
<link>http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer</link>
<location>Linz, Austria</location>
<year_rep>5,835</year_rep>
<total_rep>154,439</total_rep>
<tag1>angular2</tag1>
<tag2>typescript</tag2>
<tag3>javascript</tag3>
</topusers>
<topusers>
<user>jezrael</user>
<link>http://www.stackoverflow.com//users/2901002/jezrael</link>
<location>Bratislava, Slovakia</location>
<year_rep>5,740</year_rep>
<total_rep>83,237</total_rep>
<tag1>pandas</tag1>
<tag2>python</tag2>
<tag3>dataframe</tag3>
</topusers>
<topusers>
<user>VonC</user>
<link>http://www.stackoverflow.com//users/6309/vonc</link>
<location>France</location>
<year_rep>5,577</year_rep>
<total_rep>651,397</total_rep>
<tag1>git</tag1>
<tag2>github</tag2>
<tag3>docker</tag3>
</topusers>
<topusers>
<user>Martijn Pieters</user>
<link>http://www.stackoverflow.com//users/100297/martijn-pieters</link>
<location>Cambridge, United Kingdom</location>
<year_rep>5,337</year_rep>
<total_rep>525,176</total_rep>
<tag1>python</tag1>
<tag2>python-3.x</tag2>
<tag3>python-2.7</tag3>
</topusers>
<topusers>
<user>T.J. Crowder</user>
<link>http://www.stackoverflow.com//users/157247/t-j-crowder</link>
<location>United Kingdom</location>
<year_rep>5,258</year_rep>
<total_rep>508,310</total_rep>
<tag1>javascript</tag1>
<tag2>jquery</tag2>
<tag3>java</tag3>
</topusers>
<topusers>
<user>akrun</user>
<link>http://www.stackoverflow.com//users/3732271/akrun</link>
<location></location>
<year_rep>5,188</year_rep>
<total_rep>229,553</total_rep>
<tag1>r</tag1>
<tag2>dplyr</tag2>
<tag3>dataframe</tag3>
</topusers>
<topusers>
<user>Wiktor Stribi?ew</user>
<link>http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew</link>
<location>Warsaw, Poland</location>
<year_rep>4,948</year_rep>
<total_rep>158,134</total_rep>
<tag1>regex</tag1>
<tag2>javascript</tag2>
<tag3>c#</tag3>
</topusers>
<topusers>
<user>Darin Dimitrov</user>
<link>http://www.stackoverflow.com//users/29407/darin-dimitrov</link>
<location>Sofia, Bulgaria</location>
<year_rep>4,936</year_rep>
<total_rep>709,683</total_rep>
<tag1>c#</tag1>
<tag2>asp.net-mvc</tag2>
<tag3>asp.net-mvc-3</tag3>
</topusers>
<topusers>
<user>Eric Duminil</user>
<link>http://www.stackoverflow.com//users/6419007/eric-duminil</link>
<location></location>
<year_rep>4,854</year_rep>
<total_rep>12,557</total_rep>
<tag1>ruby</tag1>
<tag2>ruby-on-rails</tag2>
<tag3>arrays</tag3>
</topusers>
<topusers>
<user>alecxe</user>
<link>http://www.stackoverflow.com//users/771848/alecxe</link>
<location>New York, United States</location>
<year_rep>4,723</year_rep>
<total_rep>233,368</total_rep>
<tag1>python</tag1>
<tag2>selenium</tag2>
<tag3>protractor</tag3>
</topusers>
<topusers>
<user>Jean-François Fabre</user>
<link>http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre</link>
<location>Toulouse, France</location>
<year_rep>4,526</year_rep>
<total_rep>30,027</total_rep>
<tag1>python</tag1>
<tag2>python-3.x</tag2>
<tag3>python-2.7</tag3>
</topusers>
<topusers>
<user>piRSquared</user>
<link>http://www.stackoverflow.com//users/2336654/pirsquared</link>
<location>Bellevue, WA, United States</location>
<year_rep>4,482</year_rep>
<total_rep>41,183</total_rep>
<tag1>pandas</tag1>
<tag2>python</tag2>
<tag3>dataframe</tag3>
</topusers>
<topusers>
<user>CommonsWare</user>
<link>http://www.stackoverflow.com//users/115145/commonsware</link>
<location>Who Wants to Know?</location>
<year_rep>4,475</year_rep>
<total_rep>616,135</total_rep>
<tag1>android</tag1>
<tag2>java</tag2>
<tag3>android-intent</tag3>
</topusers>
<topusers>
<user>Quentin</user>
<link>http://www.stackoverflow.com//users/19068/quentin</link>
<location>United Kingdom</location>
<year_rep>4,464</year_rep>
<total_rep>509,365</total_rep>
<tag1>javascript</tag1>
<tag2>html</tag2>
<tag3>css</tag3>
</topusers>
<topusers>
<user>Jon Skeet</user>
<link>http://www.stackoverflow.com//users/22656/jon-skeet</link>
<location>Reading, United Kingdom</location>
<year_rep>4,348</year_rep>
<total_rep>921,690</total_rep>
<tag1>c#</tag1>
<tag2>java</tag2>
<tag3>.net</tag3>
</topusers>
<topusers>
<user>Felix Kling</user>
<link>http://www.stackoverflow.com//users/218196/felix-kling</link>
<location>Sunnyvale, CA</location>
<year_rep>4,324</year_rep>
<total_rep>411,535</total_rep>
<tag1>javascript</tag1>
<tag2>jquery</tag2>
<tag3>asynchronous</tag3>
</topusers>
<topusers>
<user>matt</user>
<link>http://www.stackoverflow.com//users/341994/matt</link>
<location></location>
<year_rep>4,313</year_rep>
<total_rep>220,515</total_rep>
<tag1>swift</tag1>
<tag2>ios</tag2>
<tag3>xcode</tag3>
</topusers>
<topusers>
<user>Psidom</user>
<link>http://www.stackoverflow.com//users/4983450/psidom</link>
<location>Atlanta, GA, United States</location>
<year_rep>4,236</year_rep>
<total_rep>36,950</total_rep>
<tag1>python</tag1>
<tag2>pandas</tag2>
<tag3>r</tag3>
</topusers>
<topusers>
<user>Martin R</user>
<link>http://www.stackoverflow.com//users/1187415/martin-r</link>
<location>Germany</location>
<year_rep>4,195</year_rep>
<total_rep>269,380</total_rep>
<tag1>swift</tag1>
<tag2>ios</tag2>
<tag3>swift3</tag3>
</topusers>
<topusers>
<user>Barmar</user>
<link>http://www.stackoverflow.com//users/1491895/barmar</link>
<location>Arlington, MA</location>
<year_rep>4,179</year_rep>
<total_rep>289,989</total_rep>
<tag1>javascript</tag1>
<tag2>php</tag2>
<tag3>jquery</tag3>
</topusers>
<topusers>
<user>Alexey Mezenin</user>
<link>http://www.stackoverflow.com//users/1227923/alexey-mezenin</link>
<location>??????</location>
<year_rep>4,142</year_rep>
<total_rep>31,602</total_rep>
<tag1>laravel</tag1>
<tag2>php</tag2>
<tag3>laravel-5.3</tag3>
</topusers>
<topusers>
<user>BalusC</user>
<link>http://www.stackoverflow.com//users/157882/balusc</link>
<location>Amsterdam, Netherlands</location>
<year_rep>4,046</year_rep>
<total_rep>703,046</total_rep>
<tag1>java</tag1>
<tag2>jsf</tag2>
<tag3>servlets</tag3>
</topusers>
<topusers>
<user>GurV</user>
<link>http://www.stackoverflow.com//users/6348498/gurv</link>
<location></location>
<year_rep>4,016</year_rep>
<total_rep>7,932</total_rep>
<tag1>sql</tag1>
<tag2>mysql</tag2>
<tag3>sql-server</tag3>
</topusers>
<topusers>
<user>Nina Scholz</user>
<link>http://www.stackoverflow.com//users/1447675/nina-scholz</link>
<location>Berlin, Deutschland</location>
<year_rep>3,950</year_rep>
<total_rep>61,135</total_rep>
<tag1>javascript</tag1>
<tag2>arrays</tag2>
<tag3>object</tag3>
</topusers>
<topusers>
<user>JB Nizet</user>
<link>http://www.stackoverflow.com//users/571407/jb-nizet</link>
<location>Saint-Etienne, France</location>
<year_rep>3,923</year_rep>
<total_rep>418,780</total_rep>
<tag1>java</tag1>
<tag2>hibernate</tag2>
<tag3>java-8</tag3>
</topusers>
<topusers>
<user>Frank van Puffelen</user>
<link>http://www.stackoverflow.com//users/209103/frank-van-puffelen</link>
<location>San Francisco, CA</location>
<year_rep>3,920</year_rep>
<total_rep>86,520</total_rep>
<tag1>firebase</tag1>
<tag2>firebase-database</tag2>
<tag3>android</tag3>
</topusers>
<topusers>
<user>dasblinkenlight</user>
<link>http://www.stackoverflow.com//users/335858/dasblinkenlight</link>
<location>United States</location>
<year_rep>3,886</year_rep>
<total_rep>475,813</total_rep>
<tag1>c#</tag1>
<tag2>java</tag2>
<tag3>c++</tag3>
</topusers>
<topusers>
<user>Tim Biegeleisen</user>
<link>http://www.stackoverflow.com//users/1863229/tim-biegeleisen</link>
<location>Singapore</location>
<year_rep>3,814</year_rep>
<total_rep>77,211</total_rep>
<tag1>sql</tag1>
<tag2>mysql</tag2>
<tag3>java</tag3>
</topusers>
<topusers>
<user>Greg Hewgill</user>
<link>http://www.stackoverflow.com//users/893/greg-hewgill</link>
<location>Christchurch, New Zealand</location>
<year_rep>3,796</year_rep>
<total_rep>529,137</total_rep>
<tag1>git</tag1>
<tag2>python</tag2>
<tag3>git-pull</tag3>
</topusers>
<topusers>
<user>unutbu</user>
<link>http://www.stackoverflow.com//users/190597/unutbu</link>
<location></location>
<year_rep>3,735</year_rep>
<total_rep>401,595</total_rep>
<tag1>python</tag1>
<tag2>pandas</tag2>
<tag3>numpy</tag3>
</topusers>
<topusers>
<user>Hans Passant</user>
<link>http://www.stackoverflow.com//users/17034/hans-passant</link>
<location>Madison, WI</location>
<year_rep>3,688</year_rep>
<total_rep>672,118</total_rep>
<tag1>c#</tag1>
<tag2>.net</tag2>
<tag3>winforms</tag3>
</topusers>
<topusers>
<user>Jonathan Leffler</user>
<link>http://www.stackoverflow.com//users/15168/jonathan-leffler</link>
<location>California, USA</location>
<year_rep>3,649</year_rep>
<total_rep>455,157</total_rep>
<tag1>c</tag1>
<tag2>bash</tag2>
<tag3>unix</tag3>
</topusers>
<topusers>
<user>paxdiablo</user>
<link>http://www.stackoverflow.com//users/14860/paxdiablo</link>
<location></location>
<year_rep>3,636</year_rep>
<total_rep>507,043</total_rep>
<tag1>c</tag1>
<tag2>c++</tag2>
<tag3>bash</tag3>
</topusers>
<topusers>
<user>Pranav C Balan</user>
<link>http://www.stackoverflow.com//users/3037257/pranav-c-balan</link>
<location>Ramanthali, Kannur, Kerala, India</location>
<year_rep>3,604</year_rep>
<total_rep>64,476</total_rep>
<tag1>javascript</tag1>
<tag2>jquery</tag2>
<tag3>html</tag3>
</topusers>
<topusers>
<user>Suragch</user>
<link>http://www.stackoverflow.com//users/3681880/suragch</link>
<location>Hohhot, China</location>
<year_rep>3,580</year_rep>
<total_rep>71,032</total_rep>
<tag1>swift</tag1>
<tag2>ios</tag2>
<tag3>android</tag3>
</topusers>
</stackoverflow>
Python- Methoden
import xml.etree.ElementTree as et
import pandas as pd
from io import StringIO
from lxml import etree as lxet
def read_xml_iterfind():
tree = et.parse('Input.xml')
data = []
inner = {}
for el in tree.iterfind('./*'):
for i in el.iterfind('*'):
inner[i.tag] = i.text
data.append(inner)
inner = {}
df = pd.DataFrame(data)
def read_xml_iterparse():
data = []
inner = {}
i = 1
for (ev, el) in et.iterparse(path):
if i <= 2:
first_tag = el.tag
if el.tag == first_tag and len(inner) != 0:
data.append(inner)
inner = {}
if el.text is not None and len(el.text.strip()) > 0:
inner[el.tag] = el.text
i += 1
df = pd.DataFrame(data)
def read_xml_lxml_xpath():
tree = lxet.parse('Input.xml')
data = []
inner = {}
for el in tree.xpath('/*/*'):
for i in el:
inner[i.tag] = i.text
data.append(inner)
inner = {}
df = pd.DataFrame(data)
def read_xml_lxml_xsl():
xml = lxet.parse('Input.xml')
xslstr = '''
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" method="text"/>
<xsl:strip-space elements="*"/>
<!-- HEADERS -->
<xsl:template match = "/*">
<xsl:for-each select="*[1]/*">
<xsl:value-of select="local-name()" />
<xsl:choose>
<xsl:when test="position() != last()">
<xsl:text>,</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:text>
</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
<xsl:apply-templates/>
</xsl:template>
<!-- DATA ROWS (COMMA-SEPARATED) -->
<xsl:template match="/*/*" priority="2">
<xsl:for-each select="*">
<xsl:if test="position() = 1">
<xsl:text>"</xsl:text>
</xsl:if>
<xsl:value-of select="." />
<xsl:choose>
<xsl:when test="position() != last()">
<xsl:text>","</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:text>"
</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
</xsl:template>
</xsl:transform>
'''
xsl = lxet.fromstring(xslstr)
transform = lxet.XSLT(xsl)
newdom = transform(xml)
df = pd.read_csv(StringIO(str(newdom)))
Timings (mit aktuellem XML und XML mit dem 25-fachen der untergeordneten Werte (dh 900 StackOverflow-Benutzerdatensätze)
# SHORTER FILE
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterfind()'
100 loops, best of 3: 3.87 msec per loop
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterparse()'
100 loops, best of 3: 5.5 msec per loop
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()'
100 loops, best of 3: 3.86 msec per loop
python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()'
100 loops, best of 3: 5.68 msec per loop
# LARGER FILE
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterfind()'
100 loops, best of 3: 36 msec per loop
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterparse()'
100 loops, best of 3: 78.9 msec per loop
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()'
100 loops, best of 3: 32.7 msec per loop
python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()'
100 loops, best of 3: 51.4 msec per loop