In this example, once again, we will look at something that landed on my desk at work. When dealing with XML, the general consensus is that it takes up a lot of memory. That is by and large true: we are loading full-text data records into memory, so compared to their binary counterparts they will obviously be significantly larger. This is a known-known and has been belabored numerous times over. When it comes to processing XML, it should be obvious that you don’t want to put all of your records into one file. So, for example, the record below, by itself in one file named HOUSE_234.XML, is fine.
<house>
  <address>123 Vashtie Road</address>
  <neighborhood>Lollipop County</neighborhood>
  <security>ARMED</security>
</house>
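Now, in this example, we will see how to do it the wrong way. This file is named HOUSE.XML. (Strictly speaking, a well-formed XML document needs a single root element wrapping these records; assume one is there.)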
<house>
  <id>123</id>
  <address>123 Vashtie Road</address>
  <neighborhood>Lollipop County</neighborhood>
  <security>ARMED</security>
</house>
<house>
  <id>234</id>
  <address>234 Saba Lane</address>
  <neighborhood>Strawberry Province</neighborhood>
  <security>ARMED</security>
</house>
So what’s wrong with this? The XML is fine in both files! Both can be processed in an easy-to-work-with manner and everything will be fine! What are you talking about, Christopher?! Well, here are some reasons why the second example, HOUSE.XML, is bad:
1. In the first case, there is only one record per file, so if something goes wrong with that specific file or record, all of the other records are still processed. The filename in the first example is also more descriptive: it carries the id assigned to that house, making the record easy to get to should there be a problem. (I’ll write something up on schemas in another post, because you should be using them.)
2. In the second example, the entire file has to be loaded unless you write code to scan it for the opening <house> tag and the closing </house> tag, and even then you still need to load and seek. That is a lot of wasted effort which could be better spent going to lunch or enjoying a nice painting with a friend. Or making out with your girl on the Promenade while you watch the sun set and drink wine.
3. Because of number 2, an excessively large file will simply eat away at your memory. Instead of loading one record into memory and then releasing that memory for other processes, you’ll be loading all of the records. In the above example it’s only two houses, but imagine, let’s say, all of the houses in Brooklyn: your file starts to get into the hundreds of thousands of records, and if we were talking about people it would already be in the millions!
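The first reason above can be sketched in code. This is a minimal Python sketch, not the processor I use at work: the function name process_house_files and the HOUSE_*.XML naming pattern are assumptions made for illustration. It parses one file at a time, so only one record sits in memory at once, and a malformed file is noted and skipped instead of stopping the whole run.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def process_house_files(directory):
    """Parse every HOUSE_*.XML file in `directory`, one record at a time.

    Returns (records, failures): parsed records as dicts keyed by file name,
    and the names of the files that could not be parsed.
    Hypothetical helper; the file naming pattern is an assumption.
    """
    records, failures = {}, []
    for path in sorted(Path(directory).glob("HOUSE_*.XML")):
        try:
            # Only this one record is loaded while it is being handled.
            root = ET.parse(path).getroot()
        except ET.ParseError:
            # A broken file is recorded and skipped; the others still run.
            failures.append(path.name)
            continue
        # Flatten the record's child elements into a tag -> text mapping.
        records[path.name] = {child.tag: child.text for child in root}
    return records, failures
```

Because each file name carries the id, the failures list points straight at the records that need attention.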
In making your decision, you want something that is going to work for nearly all cases, no matter how many records are thrown at your processor. The best design decision with regard to XML is to separate all of the records into individual files for processing. You’ll be able to handle quintillions of records and generally have a more defined workflow.
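If you are handed a combined file anyway, one way to get back to one record per file is to split it up front. Here is a rough Python sketch under a few assumptions of mine: the combined file wraps its records in a single root element (called <houses> in the usage below), every <house> carries an <id>, and the function name split_houses is made up for illustration. It uses iterparse from the standard library, which hands over each element as its closing tag is read.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_houses(combined_path, out_dir):
    """Write each <house> in `combined_path` to its own HOUSE_<id>.XML file.

    Assumes a single wrapping root element and an <id> child on every
    <house>; both are assumptions, not something the source file guarantees.
    Returns the list of file names written, in document order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # "end" events fire once an element's closing tag has been read.
    for _event, elem in ET.iterparse(str(combined_path), events=("end",)):
        if elem.tag != "house":
            continue
        house_id = elem.findtext("id")
        target = out_dir / f"HOUSE_{house_id}.XML"
        ET.ElementTree(elem).write(str(target), encoding="utf-8")
        written.append(target.name)
        # Drop the record's children now that it is on disk.
        elem.clear()
    return written
```

Calling split_houses("HOUSE.XML", "houses") on a file holding the two records above would leave HOUSE_123.XML and HOUSE_234.XML in the houses directory, ready to be processed one at a time.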
Unfortunately, I’ve been asked to do the opposite even though I have explained all of this, but hopefully you don’t have to make the same mistake. Work smarter, not harder.
How to make proper design decisions – Processing XML
About Christopher Warner