python – Read a large number of XML files and load them into a single CSV

I am processing a large number of XML files that I downloaded from https://clinicaltrials.gov/ct2/resources/download#DownloadAllData. The download contains around 300,000 XML files of similar structure, which I ultimately want to load into a single DataFrame / CSV. The code gives the result I want: each row is one XML file, and the columns are the variable names taken from the XML tags. The rows are filled with the text of each XML tag.
My strategy is to first analyze the structure of each XML file to find the lowest-level child of each node and reconstruct the XPath for each of them. Using these XPaths, I get the text of each element. Finally, I number the columns that share the same name so that the column names are unique.
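A minimal sketch of this strategy in a single pass per file, assuming the standard library's xml.etree.ElementTree and csv modules in place of lxml and pandas (a substitution made only to keep the example self-contained). The tag names in the sample and the helper names flatten and rows_to_csv are assumptions for illustration, not part of the code further down.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import Counter


def flatten(xml_text):
    """Map the path of every leaf element to its text, numbering repeated paths."""
    root = ET.fromstring(xml_text)
    seen = Counter()
    row = {}

    def walk(element, path):
        children = list(element)
        if not children:
            # Leaf element: the first occurrence keeps the plain path,
            # later occurrences get a numeric suffix (.1, .2, ...).
            count = seen[path]
            seen[path] += 1
            key = path if count == 0 else f"{path}.{count}"
            row[key] = (element.text or "").strip()
        for child in children:
            walk(child, f"{path}/{child.tag}")

    walk(root, root.tag)
    return row


def rows_to_csv(rows, handle):
    """Write one CSV line per file; columns are the union of all paths."""
    columns = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(handle, fieldnames=columns, delimiter=";")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample; the tag names are assumed for illustration.
sample = """<clinical_study>
  <id_info><nct_id>NCT03261284</nct_id></id_info>
  <condition>Atrial Fibrillation</condition>
  <condition>Thrombosis</condition>
</clinical_study>"""

row = flatten(sample)
buffer = io.StringIO()
rows_to_csv([row, {"clinical_study/condition": "Hemorrhage"}], buffer)
```

Collecting one dictionary per file and writing everything once at the end avoids growing a DataFrame inside the loop, which is likely where much of the time goes.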

I am an absolute beginner in Python, and this code is the result of painfully piecing together various forum posts and tutorials. It runs, but given the size of the data, it takes a very long time, presumably because I have many for loops in my code that are certainly avoidable. It would be great to get feedback on how to improve the speed, and maybe some general remarks on how to structure this code better. I know it's not good, but that's all I could manage for now. 🙂
Cheers!

Find my code here:

#Import packages.
import pandas as pd
from lxml import etree
import numpy as np
import os
from os import listdir
from os.path import isfile, join
import time
from tqdm import tqdm


#Set options for displaying results
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

global df_final
df_final = pd.DataFrame()

global content
content = pd.DataFrame()


def run(file, csv, df):
    global df_final
    data = etree.parse(file)
    root = data.getroot()


    #create empty lists for names and indices.
    l_first = []
    l_second = []
    l_third = []
    l_fourth = []

    i_first = []
    i_second1 = []
    i_second2 = []
    i_third1 = []
    i_third2 = []
    i_third3 = []
    i_fourth1 = []
    i_fourth2 = []
    i_fourth3 = []
    i_fourth4 = []

    #get the structure of each xml and layout in pandas dataframe
    for i in range(len(root.getchildren())):
        temp = root.getchildren()[i]
        first = root.getchildren()[i].tag
        l_first.append(first)
        i_first.append(i)

        for j in range(len(temp.getchildren())):
            temp2 = temp.getchildren()[j]
            second = temp.getchildren()[j].tag
            l_second.append(second)
            i_second1.append(i)
            i_second2.append(j)

            for x in range(len(temp2.getchildren())):
                temp3 = temp2.getchildren()[x]
                third = temp2.getchildren()[x].tag
                l_third.append(third)
                i_third1.append(i)
                i_third2.append(j)
                i_third3.append(x)

                for y in range(len(temp3.getchildren())):
                    temp4 = temp3.getchildren()[y]
                    fourth = temp3.getchildren()[y].tag
                    l_fourth.append(fourth)
                    i_fourth1.append(i)
                    i_fourth2.append(j)
                    i_fourth3.append(x)
                    i_fourth4.append(y)

    df_first = pd.DataFrame(l_first, columns=['name_1'])
    df_second = pd.DataFrame(l_second, columns = ['name_2'])
    df_third = pd.DataFrame(l_third, columns = ['name_3'])
    df_fourth = pd.DataFrame(l_fourth, columns = ['name_4'])

    df_first['index_1'] = i_first

    df_second['index_21'] = i_second1
    df_second['index_22'] = i_second2

    df_third['index_31'] = i_third1
    df_third['index_32'] = i_third2
    df_third['index_33'] = i_third3

    df_fourth['index_41'] = i_fourth1
    df_fourth['index_42'] = i_fourth2
    df_fourth['index_43'] = i_fourth3
    df_fourth['index_44'] = i_fourth4

    #merge all three layers into one dataframe.
    df = df_first.merge(df_second,how='left', left_on='index_1', right_on='index_21')
    df = df.merge(df_third,how='left', left_on=['index_21','index_22'], right_on=['index_31','index_32'])
    df = df.merge(df_fourth,how='left', left_on=['index_31','index_32','index_33'], right_on=['index_41','index_42','index_43'])

    #create number of children per row.
    children = 0
    df['children'] = np.where((df['index_21'].notna()) & (df['index_31'].isna()), 1, 0)
    df['children'] = np.where((df['index_21'].notna()) & (df['index_31'].notna()), 2, df['children'])
    df['children'] = np.where((df['index_21'].notna()) & (df['index_31'].notna()) & (df['index_41'].notna()), 3, df['children'])

    #create x-path for each row depending on number of children.
    df['x_path'] = "//" + df['name_1'].astype(str)
    df['x_path'] = np.where(df['children'] == 1, df['x_path'].astype(str) + "/" + df['name_2'].astype(str), df['x_path'])
    df['x_path'] = np.where(df['children'] == 2, df['x_path'].astype(str) + "/" + df['name_2'].astype(str) + "/" + df['name_3'].astype(str), df['x_path'])
    df['x_path'] = np.where(df['children'] == 3, df['x_path'].astype(str) + "/" + df['name_2'].astype(str) + "/" + df['name_3'].astype(str) + "/" + df['name_4'].astype(str), df['x_path'])

    #drop comments from dataframe
    df = df[~df["x_path"].str.contains('Comment', na = True)]

    #reset index of dataframe after comments have been dropped.
    df = df.reset_index()
    #df('id') = df.index.astype(str) + df('x_path')

    content = pd.DataFrame(columns = ['x_path', 'content'])
    x_path = df['x_path'].to_list()
    x_path = list(dict.fromkeys(x_path))

    #iterate through all x-paths and get the text assigned to each path.
    for row in x_path:
        e = root.xpath(row)
        for i in e:
            #print(row, ": ",i.text)
            content = content.append({'x_path': row, 'content': i.text}, ignore_index=True)


    content = content.sort_values(by=['x_path'])
    df = df.sort_values(by=['x_path'])
    #print(content)
    df = df.merge(content,on = 'x_path')
    #print(df)

    #mark duplicates and rename such that names are unique. (Intention: names to be used as column names in later dataset).

    df['duplicate'] = df.duplicated('x_path', keep = False)
    df_unique = df.loc[df['duplicate'] == True]
    df_unique = df_unique.drop_duplicates(subset = "x_path", keep = "first")
    unique = []
    unique = df_unique['x_path'].to_list()
    #print(unique)

    df = df[['x_path','content']]
    df = df.drop_duplicates(subset=["x_path","content"], keep = "first")
    df = df.transpose()

    #get row with variable names and save to list
    df.columns = df.iloc[0]
    df = df.drop(df.index[0])

    cols = pd.Series(df.columns)

    for dup in cols[cols.duplicated()].unique():
        cols[cols[cols == dup].index.values.tolist()] = [dup + '.' + str(i) if i != 0 else dup for i in
                                                         range(sum(cols == dup))]
    # rename the columns with the cols list.
    df.columns = cols
    df_final = df_final.append(df)


def write_csv(df_name, csv):
    df_name.to_csv(csv, sep=";")

################### Run  #####################

mypath = '/Users/Documents/AllPublicXML'

folder_all = os.listdir(mypath)

file_all = []

for folder in tqdm(folder_all):
    mypath2 = mypath + "/" + folder
    if os.path.isdir(mypath2):
        file = [f for f in listdir(mypath2) if isfile(join(mypath2, f))]
        for x in tqdm(file):
            dir = mypath2 + "/" + x
            output = "./Output/"+x+".csv"
            df_name = x.split(".", 1)[0]
            #print(df_name)
            run(dir, output, df_name)
            #print(output)

write_csv(df_final, output)

And here is an example XML file:





ClinicalTrials.gov processed this data on March 20, 2020

Link to the current ClinicalTrials.gov record.
https://clinicaltrials.gov/show/NCT03261284


2017-P-032
NCT03261284


D-dimer to Guide Anticoagulation Therapy in Patients With Atrial Fibrillation

DATA-AF

D-dimer to Determine Intensity of Anticoagulation to Reduce Clinical Outcomes in Patients With Atrial Fibrillation



Wuhan Asia Heart Hospital
Other


Wuhan Asia Heart Hospital

Yes
No
No



This was a prospective, three arms, randomized controlled study.




D-dimer testing is performed in AF Patients receiving warfarin therapy (target INR:1.5-2.5) in Wuhan Asia Heart Hospital. Patients with elevated d-dimer levels (>0.5ug/ml FEU) were SCREENED AND RANDOMIZED to three groups at a ratio of 1:1:1. First, NOAC group,the anticoagulant was switched to Dabigatran (110mg,bid) when elevated d-dimer level was detected during warfarin therapy.Second,Higher-INR group, INR was adjusted to higher level (INR:2.0-3.0) when elevated d-dimer level was detected during warfarin therapy. Third, control group, patients with elevated d-dimer levels have no change in warfarin therapy. Warfarin is monitored once a month by INR ,and dabigatran dose not need monitor. All patients were followed up for 24 months until the occurrence of endpoints, including bleeding events, thrombotic events and all-cause deaths.


Enrolling by invitation
March 1, 2019
May 30, 2020
February 28, 2020
N/A
Interventional
No

Randomized
Parallel Assignment
Treatment
None (Open Label)


Thrombotic events
24 months

Stroke, DVT, PE, Peripheral arterial embolism, ACS etc.



hemorrhagic events
24 months
cerebral hemorrhage,Gastrointestinal bleeding etc.


all-cause deaths
24 months

3
600
Atrial Fibrillation
Thrombosis
Hemorrhage
Anticoagulant Adverse Reaction

DOAC group
Experimental

Patients with elevated d-dimer levels was switched to DOAC (dabigatran 150mg, bid).



Higher-INR group
Experimental

Patients' target INR was adjusted from 1.5-2.5 to 2.0-3.0 by adding warfarin dose.



Control group
No Intervention

Patients continue previous strategy without change.



Drug
Dabigatran Etexilate 150 MG (Pradaxa)
Dabigatran Etexilate 150mg,bid
DOAC group
Pradaxa


Drug
Warfarin Pill
Add warfarin dose according to INR values.
Higher-INR group




Inclusion Criteria: - Patients with non-valvular atrial fibrillation - Receiving warfarin therapy Exclusion Criteria: - Patients who had suffered from recent (within 3 months) myocardial infarction, ischemic stroke, deep vein thrombosis, cerebral hemorrhages, or other serious diseases. - Those who had difficulty in compliance or were unavailable for follow-up.


All
18 Years
75 Years
No


Zhenlu ZHANG, MD,PhD
Study Director
Wuhan Asia Heart Hospital



Zhang litao
Wuhan Hubei 430022 China
China March 2019 August 22, 2017 August 23, 2017 August 24, 2017 March 6, 2019 March 6, 2019 March 7, 2019 Sponsor D-dimer Nonvalvular atrial fibrillation Direct thrombin inhibitor INR Atrial Fibrillation Thrombosis Hemorrhage Warfarin Dabigatran Fibrin fragment D

architecture – Sending an XML message as a payload to a web API

I have been tasked with writing a fire-and-forget push web application that pushes high volumes of XML messages (of various types) to multiple client endpoints on the Internet (HTTPS). I don't need a response, or even to know whether the client received the message. I just don't want my side to fail if a message doesn't arrive.

In other words, given a URL (e.g. https://192.168.3.45/MessageTypeA/v1, https://192.168.3.45/MessageTypeB/v3, etc.), my application must send a copy of every XML message of a given message type to that URL, and if a client is listening on that URL, they can do whatever they want with these messages.

I am free to decide how the client URLs are defined, how security works, etc. Nothing exists yet, so I am not constrained by an existing approach.

I am relatively new to web APIs. I have looked at REST, SOAP, WebSub and others, and I am trying to work out which approach is best for this.

REST-based APIs, it seems to me, act on objects at the receiving end: "GET" the train list, "PUT" a driver update, "POST" a new train, and so on. That is not relevant to me here. I guess all I would want in this approach is to "POST" a new message of type x, y or z? The point is that the XML message, when interpreted, may well amount to a POST or a PUT, but I don't want to preprocess the messages to decide that. All I do is deliver the raw data to the endpoint.

In WebSub terms, I think I am "the publisher" and I publish to several "hubs"? But the difference is that there are no subscribers in my scenario: I maintain the list of targets per message type myself, rather than having clients subscribe.

So I'm not sure which protocol or approach is best for this type of scenario, and I'm looking for advice. Whichever protocol I use, it must allow messages to be encrypted and let the receiving client authenticate them, so they can be sure the messages come from me.
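To make the scenario concrete, here is a rough sketch of one possible shape for such a push, not a recommendation of a particular protocol. It assumes a plain HTTP POST of the raw XML, a shared secret per client, and an HMAC-SHA256 signature carried in a hypothetical X-Signature header; encryption in transit would come from HTTPS itself. The function names are made up for illustration.

```python
import hashlib
import hmac
import urllib.request


def build_push_request(url, xml_bytes, secret):
    """Build a POST carrying the raw XML plus an HMAC the client can verify."""
    signature = hmac.new(secret, xml_bytes, hashlib.sha256).hexdigest()
    return urllib.request.Request(
        url,
        data=xml_bytes,
        method="POST",
        headers={
            "Content-Type": "application/xml",
            "X-Signature": signature,  # hypothetical header name
        },
    )


def push(url, xml_bytes, secret, timeout=2.0):
    """Fire and forget: send the message and ignore any delivery failure."""
    request = build_push_request(url, xml_bytes, secret)
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            pass  # the response is deliberately ignored
    except OSError:
        pass  # unreachable client, timeout or HTTP error: not our problem
```

A client holding the same secret can recompute the HMAC over the received body and compare it with the header to confirm who sent the message.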

Native XML viewers in Firefox and Chrome cannot parse XML + XHTML

I'm generally quite happy with the native XML viewer in Firefox.

It displays valid XML files (like the one below) in a clear and useful way.

Example of XML:




  
    https://example.com/
    2020-03-17T15:57:23+00:00
  


However, I noticed that as soon as I add XHTML to the XML (using the correct XHTML namespace, see below), the XML viewers in Firefox and Chrome fall back to displaying the XML as plain text:

Namespace and XHTML element:

  • XHTML namespace: xmlns:xhtml="http://www.w3.org/1999/xhtml"
  • XHTML element:

Example of XML + XHTML:




  
    https://example.com/
    2020-03-17T15:57:23+00:00
    
    
  

  
    https://example.com/de/
    2020-03-12T19:42:12+00:00
    
    
  


For a long time, I thought I had introduced an error into my XML and made it invalid. But I have checked it with many third-party XML validators and the XML is definitely valid. It is just that the browsers' native viewers (apparently) cannot cope with XHTML being included in the XML.

Is there anything I can do in this situation to help the native XML viewers in Firefox and Chrome parse and display the markup as XML, or is there nothing to be done at the moment, with third-party software as the only answer?

sharepoint online – Read XML from clientContext ExecuteQuery () request to /_vti_bin/client.svc/ProcessQuery


Get input from a Java program and store it in an XML database [closed]

I need to take data from a Java program and store it in an XML database.

Can I use XML or JSON to load an entire list

I am currently developing a text-based game. I don't need this functionality right now, but is it possible to use XML or JSON to save and load a list of characters, or one character in particular, so that everything loads again when the game is reopened?
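As a minimal sketch of the JSON variant, assuming each character can be represented as a plain dictionary (the field names name and level and the two helper functions are hypothetical):

```python
import json
from pathlib import Path


def save_characters(characters, path):
    """Write the whole list of characters to one JSON file."""
    Path(path).write_text(json.dumps(characters, indent=2), encoding="utf-8")


def load_characters(path):
    """Read the list back; the result equals what was saved."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Hypothetical character data.
party = [
    {"name": "Aria", "level": 3},
    {"name": "Borin", "level": 5},
]
```

Calling save_characters(party, "save.json") when the game closes and load_characters("save.json") when it reopens would restore the same list.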

aes – How to decrypt encrypted XML according to the FATCA IDES standard?

My former colleagues have encrypted an XML file according to the FATCA-IDES standard:

  1. Digital signature of the XML payload (using the "enveloping" signature type and a SHA2-256 hash)
  2. RSA digital signature using our 2048-bit private key
  3. Compress the XML file
  4. Encrypt the XML file with an AES-256 key:
    • Encryption mode: CBC
    • Salt: no salt
    • IV: 16 bytes
    • Key size: 256 bits / 32 bytes
    • Encoding: none
    • Padding: PKCS#5 or PKCS#7 (I don't know which one was used)
  5. Encrypt the AES key and IV (48 bytes in total: 32 bytes AES key and 16 bytes IV) with the public key (given by IDES, not ours):
    • Padding: PKCS#1 v1.5
    • Key size: 2048 bits
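Reversing the steps above could look roughly like the sketch below. It rests on several assumptions: the third-party cryptography package is installed, the compressed file from step 3 is a ZIP archive, and the private key passed in is the one matching the public key used in step 5. According to the description that public key came from IDES, so the sender's own private key may not be able to unwrap the key file at all. All function names are hypothetical.

```python
import io
import zipfile


def split_key_and_iv(blob):
    """Step 5 reversed: the 48-byte blob is the 32-byte AES key, then the 16-byte IV."""
    if len(blob) != 48:
        raise ValueError("expected 48 bytes (32-byte key + 16-byte IV)")
    return blob[:32], blob[32:]


def strip_pkcs7(padded, block_size=16):
    """Remove the padding; PKCS#5 and PKCS#7 produce the same bytes for AES blocks."""
    if not padded:
        raise ValueError("empty plaintext")
    pad = padded[-1]
    if not 1 <= pad <= block_size or padded[-pad:] != bytes([pad]) * pad:
        raise ValueError("invalid padding")
    return padded[:-pad]


def decrypt_payload(private_key_pem, password, wrapped_key, payload):
    """Unwrap the AES key with RSA, decrypt the payload, then unzip it."""
    # Imported here so the helpers above work without the third-party package.
    from cryptography.hazmat.primitives import serialization
    from cryptography.hazmat.primitives.asymmetric import padding
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    private_key = serialization.load_pem_private_key(private_key_pem, password=password)
    # Assumption: this private key matches the public key used in step 5.
    key, iv = split_key_and_iv(private_key.decrypt(wrapped_key, padding.PKCS1v15()))
    decryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    compressed = strip_pkcs7(decryptor.update(payload) + decryptor.finalize())
    # Assumption: step 3 produced a ZIP archive holding the signed XML.
    with zipfile.ZipFile(io.BytesIO(compressed)) as archive:
        return archive.read(archive.namelist()[0])
```

The result would still be the signed XML from steps 1 and 2, with the original payload wrapped inside the signature element.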

Therefore, from the starting point where we had a simple XML file (not encrypted), we ended up with a .zip file that contained 3 files:

  • xxxx_Payload
  • xxxx_Key
  • xxxx_Metadata.xml

However, I can no longer find the original, unencrypted XML file. I need access to this information, and since my knowledge of cryptography is close to zero, I cannot work out how to decrypt the payload generated by my former colleagues to obtain the readable XML from "xxxx_Payload".

FYI, I have the private key (with its password) that was used at the time. I think that should be enough to decrypt the data?

xml – Magento2 – Element 'handle': Character content other than whitespace is not allowed because the content type is 'element-only'

Maybe this will help you.

This is a problem with an XML file.

Check the XML files of all your custom modules for stray characters in the whitespace.

The problem occurs when XML code is copied directly from a web page: there is a hidden character at the start of each line. It is not a space or newline character, so when you deploy to Magento, Magento does not recognize the character and reports the error in your message. Solution: delete all the whitespace between the tags and re-indent the file.

Also see this link:
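One way to locate such characters is to scan the file for code points that look like whitespace but are not XML whitespace. The sketch below is only an illustration: the list of suspect characters is an assumption (no-break space, zero-width space and byte order mark are common copy-and-paste leftovers), and the function name is made up.

```python
# Characters that render as blank but are not XML whitespace (assumed list).
SUSPECTS = {
    "\u00a0": "NO-BREAK SPACE",
    "\u200b": "ZERO WIDTH SPACE",
    "\ufeff": "BYTE ORDER MARK",
}


def find_hidden_characters(text):
    """Return (line, column, name) for every suspect character in the text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for column, char in enumerate(line, start=1):
            if char in SUSPECTS:
                hits.append((line_no, column, SUSPECTS[char]))
    return hits


# Hypothetical layout snippet with a no-break space before the second tag.
layout = "<page>\n\u00a0<handle/>\n</page>"
hits = find_hidden_characters(layout)
```

Each reported position can then be deleted or replaced with a normal space in an editor.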

Magento 2: How do I resolve the message "Character content other than white space"?

Updating SharePoint Online items using the SOAP / XML API from outside of SharePoint

We need to update (CRUD) SharePoint Online list items from a standalone application outside of SharePoint, run by an external company.

This application uses SOAP/XML calls to talk to SharePoint.

The (external) application was able to connect to and update an on-premises SharePoint 2010 list without any problems, but we are now migrating to SharePoint Online.

The external application can log in and view list items, but it always throws an error when we try to update a list item on SharePoint Online.

Here are the body of the SOAP request and the response (redacted):


   
   
      
         XXXXXX GUID
         
            
   
      1002
      1
      RL101150
      2020-03-03
      andras boros
      
      This is Box 20/1
   
   
      1003
      1
      RL101151
      2020-03-03
      andras boros
      
      This is Box 20/2
   
   
      1004
      1
      RL101152
      2020-03-03
      andras boros
      
      This is Box20/3
   

         
      
   

4:02

   
      
         
            
               
                  0x81020026
                  The list that is referenced here no longer exists.
                  
               
               {TL:DR}
             
            
        
    

Is it possible to update SharePoint Online list items from a standalone web app in a completely different domain?

And how could we handle authentication?

Any help on this matter would be greatly appreciated.

XML XElement C# – Stack Overflow en español

I want to run this code multiple times:

      new XElement("detalle",
            new XElement("codigoPrincipal", "*****"),
                             new XElement("codigoAuxiliar", "*****"),
                             new XElement("descripcion", "*****"),
                             new XElement("cantidad", "*****"),
                             new XElement("precioUnitario", "*****"),
                             new XElement("descuento", "*****"),
                             new XElement("precioTotalSinImpuesto", "*****"),



        new XElement("impuestos",
            new XElement("impuesto",
                             new XElement("codigo", "*****"),
                             new XElement("codigoPorcentaje", "*****"),
                             new XElement("tarifa", "*****"),
                             new XElement("baseImponible", "*****"),
                             new XElement("valor", "*****")

                             )
                             )
                             )

I want to repeat it several times in a for loop (dynamically), but it gives me a syntax error.