Accueil / Blog / Métier / 2014 / Talend Tutorial - Advanced XML transformation – Part 1 - XML to table rows

Talend Tutorial - Advanced XML transformation – Part 1 - XML to table rows

Par Gaël Pegliasco publié 04/02/2014, édité le 11/11/2015
Contributeurs : Gaël Pegliasco
One difficult point when using Talend Data Integration is working with XML data. Why? Because Talend DI has a "tabular" approach to data based on records, similar to SQL. But XML represents data as a tree, which is ideal to store hierarchical data, lists of lists and such. Moving from a two-dimensional tabular data representation to a tree representation is not so easy. But fortunately, Talend DI provides great tools for this. One of these is the tXMLMap that we propose to discover in this tutorial.

In this article, we introduce the tXMLMap component by transforming a list of customers having several addresses into a tabular representation using a table for the customers records, and another table for the list of adresses owned by each customer.

Input data will be an XML file having multiple <Customer> elements, each of them having multiple <CustomerAddress> elements:

    <Customer id="1"> 
        <CustomerName>Griffith Paving and Sealcoatin</CustomerName>
        <CustomerAdresses>
            <CustomerAddress>talend apres 91</CustomerAddress>
            <CustomerAddress>511 Maple Ave. Apt. 1B</CustomerAddress>
            <CustomerAddress>1799 Rosemary Way</CustomerAddress>
            <CustomerAddress>1859 Green Bay Rd.1</CustomerAddress>
        </CustomerAdresses>
        <LabelState>Connecticut</LabelState>
        <RegTime>03-11-2006</RegTime>
        <Fresh>67852.0</Fresh>
        <Frozen>61521.4852</Frozen>
    </Customer>

The expected output will be two lists of records, one for the customers and another for customer's addresses:

.--+------------------------------+-----------+----------+-------+----------.
|                              Show customers                               |
|=-+------------------------------+-----------+----------+-------+---------=|
|id|CustomerName                  |LabelState |RegTime   |Fresh  |Frozen    |
|=-+------------------------------+-----------+----------+-------+---------=|
|1 |Griffith Paving and Sealcoatin|Connecticut|03-11-2006|67852.0|61521.4852|
'--+------------------------------+-----------+----------+-------+----------'

.--+-------------------------.
|       Show addresses       |
|=-+------------------------=|
|id|CustomerAddress          |
|=-+------------------------=|
|1 |talend apres 91          |
|1 |511 Maple Ave. Apt. 1B   |
|1 |1799 Rosemary Way        |
|1 |1859 Green Bay Rd.1      |
'--+-------------------------'

The global picture of our job, once finished, will be :

Job screenshot


The input file used to read customers in XML format can be downloaded here.
For this example, we will read it as a tFileInputXML.
But we won't try to generate rows from this file. Instead, we will read the whole file content at once.
This XML document will be passed as is to the tXMLMap.
Here is the configuration for tFileInputXML:

tFileInputXML configuration 1

Note that the type of root node is « Document » for a XML document. Do not forget this or you won't be able to carry on with this example.

tFileInputXML configuration 2

Now that we have configured tFileInputXML to read the entire XML document, it is time to configure tXMLMap.

Both  the Talend User Guide and the Talend Reference Guide provide great information on how to configure tXMLMap. We suggest that you read these, even if they may seem a bit difficult to understand at first (they were for me). Take time to test each option one by one, it will greatly help you to understand all the potential of this great component.

Configure tXMLMap so it looks like this:

tXMLMap 1


Setting up the main data flow (xml_full_document):
You can create the tree manually by right clicking on Customers and adding sub-elements and attributes by hand. But we highly suggest you to select « Import from file » by right clicking on the top « Customers » element and select the XML file we provided. The tXMLMap component will parse it and automatically guess its structure. This is the right way to do it.

Once this is done, you need to set the (loop) attribute on the CustomerAddress element by right clicking on it.

So this was the first difficulty of this exercise. To split customers on their multiple addresses, we are going to iterate on address elements instead of customer elements.

Now, create the two output relations as described, close the box and connect each of these to a tLogRow in tabular mode.

Run the job and see the result below:

.--+------------------------------+-----------+----------+-------+----------.
|                              Show customers                               |
|=-+------------------------------+-----------+----------+-------+---------=|
|id|CustomerName                  |LabelState |RegTime   |Fresh  |Frozen    |
|=-+------------------------------+-----------+----------+-------+---------=|
|1 |Griffith Paving and Sealcoatin|Connecticut|03-11-2006|67852.0|61521.4852|
|1 |Griffith Paving and Sealcoatin|Connecticut|03-11-2006|67852.0|61521.4852|
|1 |Griffith Paving and Sealcoatin|Connecticut|03-11-2006|67852.0|61521.4852|
|1 |Griffith Paving and Sealcoatin|Connecticut|03-11-2006|67852.0|61521.4852|
|2 |Bill's Dive Shop              |zona       |19-11-2004|88792.0|15434.1   |
|2 |Bill's Dive Shop              |zona       |19-11-2004|88792.0|15434.1   |
|2 |Bill's Dive Shop              |zona       |19-11-2004|88792.0|15434.1   |
|2 |Bill's Dive Shop              |zona       |19-11-2004|88792.0|15434.1   |
'--+------------------------------+-----------+----------+-------+----------'

.--+-------------------------.
|       Show addresses       |
|=-+------------------------=|
|id|CustomerAddress          |
|=-+------------------------=|
|1 |talend apres 91          |
|1 |511 Maple Ave. Apt. 1B   |
|1 |1799 Rosemary Way        |
|1 |1859 Green Bay Rd.1      |
|2 |310 Walker Ave.          |
|2 |844 Spruce St.           |
|2 |965 Marion Place Apt. 65C|
|2 |511 Hill                 |
'--+-------------------------'


Yes, this clearly does the trick.
We have successfully split each customer address into a row containing the customer id and the address belonging to him. But we've got 8 customers records in place of the 2 that we need.

This is because we loop on addresses, so a customer record is created each time we find a different address.

We could add a tUniqRow component to remove duplicated customer records, but fortunatly tXMLMap allows to loop on different elements for each output flow.

So, re-open the tXMLMap configuration and add a loop on the CustomerName element.
You will get this:

tXMLMap 2

You cannot set the second loop on the Customer element, because loop cannot be set both on an element and on its child or parent. So we choose to iterate on customer's names to get customer records.

Now, note the green dots and red cross in the title of each output relation.
This means that Talend does not know on which loop to iterate to create the output relation.

Click on each of them and add the corresponding sequence: CustomerName for customers and CustomerAddresses for adresses.

Loop customer

Loop addresses

Note: You should need to add two sequences (clicking on green +) and remove the first one to select only the second loop.

Now, validate and run your job one last time:

.--+------------------------------+-----------+----------+-------+----------.
|                              Show customers                               |
|=-+------------------------------+-----------+----------+-------+---------=|
|id|CustomerName                  |LabelState |RegTime   |Fresh  |Frozen    |
|=-+------------------------------+-----------+----------+-------+---------=|
|1 |Griffith Paving and Sealcoatin|Connecticut|03-11-2006|67852.0|61521.4852|
|2 |Bill's Dive Shop              |zona       |19-11-2004|88792.0|15434.1   |
'--+------------------------------+-----------+----------+-------+----------'

.--+-------------------------.
|       Show addresses       |
|=-+------------------------=|
|id|CustomerAddress          |
|=-+------------------------=|
|1 |talend apres 91          |
|1 |511 Maple Ave. Apt. 1B   |
|1 |1799 Rosemary Way        |
|1 |1859 Green Bay Rd.1      |
|2 |310 Walker Ave.          |
|2 |844 Spruce St.           |
|2 |965 Marion Place Apt. 65C|
|2 |511 Hill                 |
'--+-------------------------'

This time, we've got a good tabular representation with just the right number of record for each output flow!

Conclusion

We have seen that tXMLMap can manage full XML documents and can use different iterations on these documents to generate output flows.

But they can do much more as generating XML as output flow, aggregating and grouping data and others.

In our next article we will do the work in the other way: from tabular rows we will generate XML output.

 

Ce tutoriel vous a plu ? Consultez notre formation d'initiation à Talend Data Integration.

ABONNEZ-VOUS À LA NEWSLETTER !
Voir aussi
Makina Corpus devient partenaire Gold de Talend Makina Corpus devient partenaire Gold de Talend 12/05/2009

Talend Tutoriel : comprendre les connexions iterate 20/12/2013

Le lien iterate est décrit assez succinctement dans le manuel utilisateur de Talend (User Guide). ...

Préparez votre agenda de formation pour la rentrée ! Préparez votre agenda de formation pour la rentrée ! 21/12/2015

Drupal8, PostgreSQL, Python scientifique... Découvrez nos nouvelles formations et les dernières ...

Geocoder avec Talend Open Studio 22/05/2014

Mettre en place un job de geocodage d'adresses (depuis un fichier XLS) dans l'ETL Talend Open ...

Makina Corpus participe à la conférence mondiale sur OpenStreetMap Makina Corpus participe à la conférence mondiale sur OpenStreetMap 09/09/2019

Du 21 au 23 septembre se tient à Heidelberg, en Allemagne, la douzième conférence annuelle sur ...