OpenOffice.org ODF, Python and XML
Let's take fix1.py and make an easy modification. Whenever two hyphens appear, replace them with the em dash. Then, when we're done, write the XML to stdout—that's exactly what the shell script (fixit.sh) expects.
We'll specify the em dash by giving its hex value; to find it, locate the em dash in OpenOffice.org Writer's Insert→Special Character dialog (Figure 8).
When I select the long dash (the em dash), its Unicode value appears in the lower-right corner, where I've put a purple ellipse; that's the value to put into the string in place of the double hyphens. Let's call this script fix2.py:
#!/usr/bin/python -tt
import xml.dom.minidom
import sys

DEBUG = 1

def dprint(what):
    if DEBUG == 0: return
    sys.stderr.write(what.encode('ascii',
                     'replace') + '\n')

emDash=u'\u2014'

def fixdata(td, depth):
    dprint("depth=%d: childNode: %s" %
           (depth, td.data))
    # OK, so '--' becomes em dash everywhere
    td.data = td.data.replace('--', emDash)

def handle_xml_tree(aNode, depth):
    if aNode.hasChildNodes():
        for kid in aNode.childNodes:
            handle_xml_tree(kid, depth+1)
    else:
        if 'data' in dir(aNode):
            fixdata(aNode, depth)

def doit(argv):
    doc = xml.dom.minidom.parse(argv[1])
    handle_xml_tree(doc, 0)
    sys.stdout.write(doc.toxml('utf-8'))

if __name__ == "__main__":
    doit(sys.argv)

Notice how easy Python makes it to replace a pattern in a string. Strings in recent Python versions have a built-in method, replace, that causes one substring to be replaced by another:

td.data = td.data.replace('--', emDash)

Let's plug fix2.py into fixit.sh to see how well it works:
% ln -sf fix2.py fixit.py
% ./fixit.sh ex3.odt ex3-1.odt
depth=5: childNode: The ?en? dash ? is too short
depth=5: childNode: The ?em? dash ? which is longer ? is what we need.
depth=5: childNode: And two hyphens -- ugly -- should be turned into ?em? dashes.
depth=5: childNode: This line has "straight double quotes"
depth=5: childNode: These 'single quotes' aren't pretty.
% oowriter ex3-1.odt
%
Success! Now for the rest. Besides the double hyphen, we want to change the en dash into an em dash. That syntax is just like the double hyphen replacement.
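To see both dash replacements side by side, here is a minimal sketch. It is written for Python 3, whereas the scripts in this article target Python 2, and the function name fix_dashes and the sample string are my own illustrations, not part of fix2.py. It also uses the standard unicodedata module to double-check that the hex values really name the characters we think they do.

```python
import unicodedata

EM_DASH = '\u2014'
EN_DASH = '\u2013'

def fix_dashes(text):
    """Turn double hyphens and en dashes into em dashes (illustrative helper)."""
    text = text.replace('--', EM_DASH)
    return text.replace(EN_DASH, EM_DASH)

# Confirm the code points are the characters we expect.
print(unicodedata.name(EM_DASH))   # EM DASH
print(unicodedata.name(EN_DASH))   # EN DASH

# ascii() keeps the output readable on terminals that cannot show the dashes.
sample = 'two hyphens -- ugly -- and an en dash \u2013 here'
print(ascii(fix_dashes(sample)))
```

Both replacements are plain substring substitutions, so the order in which they run does not matter here.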
Replacing straight quotes with curly ones is more complicated though, because we have to decide between a starting double quote and an ending double-quote character. How to tell? Well, if the quote character is at the start of the string, and there's a nonspace character afterward, it's a left (or start of quote) curly quote. Ditto if there's a blank before it and a nonspace afterward.
That's the easy way to describe it. We could code it like that, or we could simply write a regular expression. I looked at the section titled “re -- Regular expression operations” in Chapter 4 of Python's library documentation and eventually came up with this:
sDpat = re.compile(r'(\A|(?<=\s))"(?=\S)', re.U)
Let me explain this left to right. We are creating sDpat, the pattern for a starting double quote or Starting Double-quote PATtern. We do that by calling the method compile in the re module (for regular expressions). That analyzes the pattern once and creates a regular expression object. We'll use sDpat to match straight double quotes that should be turned into nice curly quotes at the start of a quotation.
Now, about the pattern—the pattern contains a double-quote character (") so we delimit it with single quotes, 'like this'. Also, we'll pass some escapes (such as \A and \s) to re.compile, so let's make this a raw string by putting an r in front of it.
(A little explanation for Perl users: in Python, \ escapes are interpolated except in raw strings, whether single-quoted or double-quoted; the delimiters don't affect interpolation as they do in Perl.)
We can see how raw strings work by using Python's shell:
>>> print 'normal string: \n is a newline'
normal string: 
 is a newline
>>> print r'raw string: \n is not a newline'
raw string: \n is not a newline
>>>
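Why this matters for regular expressions in particular can be shown with a small sketch (Python 3 syntax; the sample text is made up for illustration). The escape \b is a good test case, because a normal string turns it into a backspace character before the re module ever sees it, while a raw string passes the backslash through so that re can read it as "word boundary".

```python
import re

normal = '\n'     # one character: a newline
raw = r'\n'       # two characters: a backslash and the letter n
print(len(normal), len(raw))        # 1 2

text = 'cat concat cat'

# Normal string: '\b' is already a backspace, so nothing in text matches.
print(re.findall('\bcat\b', text))    # []

# Raw string: re receives backslash-b and treats it as a word boundary.
print(re.findall(r'\bcat\b', text))   # ['cat', 'cat']
```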
So, what's in that raw string? It consists of three parts:
1. The part before the quote character: (\A|(?<=\s)). What we are doing is matching something (the '"' in this case), but only if it occurs at the beginning of the string or if it's preceded by a whitespace character. The \A means “match beginning of the string”, the | means “or” and (?<=\s) means “match if immediately preceded by whitespace (a blank, tab or newline), but don't include that whitespace itself in the match”. The enclosing parentheses denote grouping.
2. The straight double quote itself: ". That's what we're matching.
3. The part after the '"': (?=\S). What we're doing is adding another condition: that the quote character be followed by a non-whitespace character.
If all three conditions are met—that is, if a quote is there (condition 2), and it's either at the start of the string or preceded by whitespace (condition 1), and it's followed by some non-whitespace character (condition 3), we want to replace it by an opening double-quote character.
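The three conditions can be tried out on a few strings. This is a sketch in Python 3 syntax, and the sample strings are invented for illustration; only the pattern itself comes from the article. Each sample exercises one of the cases: quote at the start, quote after whitespace, quote after a non-space and quote followed by whitespace.

```python
import re

# Same pattern as in the article. In Python 3, str patterns are
# Unicode-aware by default, so re.U is redundant but harmless there.
sDpat = re.compile(r'(\A|(?<=\s))"(?=\S)', re.U)
sDquote = '\u201c'   # left (opening) curly double quote

samples = [
    '"Go!" she said',        # at start of string: first quote replaced
    'He said "stop" twice',  # after whitespace: first quote replaced
    'inches: 5" tall',       # after a non-space: left alone
    'trailing " alone',      # followed by whitespace: left alone
]

for s in samples:
    # ascii() shows the curly quote as \u201c on any terminal.
    print(ascii(sDpat.sub(sDquote, s)))
```

Note that the closing quotes in the first two samples stay straight; this pattern deliberately handles only the opening case.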
Besides the pattern, you also can pass flags to re.compile. We pass re.U to make certain escapes dependent on the Unicode character database. Because we're parsing a Unicode string, I think we want that.
Let's call this fix3.py:
#!/usr/bin/python -tt
import xml.dom.minidom
import sys
import re                       # new in fix3.py

DEBUG = 1

def dprint(what):
    if DEBUG == 0: return
    sys.stderr.write(what.encode('ascii',
                     'replace') + '\n')

emDash=u'\u2014'
enDash=u'\u2013'                # new in fix3.py
sDquote=u'\u201c'               # new in fix3.py

# sDpat: pattern for starting dbl quote, as
#    "Go!   <-- the quote there
# We look for it either at start (\A) or
# after a space (\s), and we want it to be
# followed by a non-space
sDpat = re.compile(r'(\A|(?<=\s))"(?=\S)', re.U)  # new in fix3.py

def fixdata(td, depth):
    dprint("depth=%d: childNode: %s" %
           (depth, td.data))
    # OK, so '--' becomes em dash everywhere
    td.data = td.data.replace('--', emDash)
    # Change 'en' dash to 'em' dash
    td.data = td.data.replace(enDash , emDash)    # new in fix3.py
    # Make a nice starting curly-quote            # new in fix3.py
    td.data = sDpat.sub(sDquote, td.data)         # new in fix3.py

def handle_xml_tree(aNode, depth):
    if aNode.hasChildNodes():
        for kid in aNode.childNodes:
            handle_xml_tree(kid, depth+1)
    else:
        if 'data' in dir(aNode):
            fixdata(aNode, depth)

def doit(argv):
    doc = xml.dom.minidom.parse(argv[1])
    handle_xml_tree(doc, 0)
    sys.stdout.write(doc.toxml('utf-8'))

if __name__ == "__main__":
    doit(sys.argv)
Note that the syntax for replacing a regular expression differs from that of substring replacement: we use the sub (substitute) method of the regular expression object (sDpat in this case):
td.data = sDpat.sub(sDquote, td.data)
Here we're taking td.data, the data in this particular node in the XML tree, looking for the regular expression specified by sDpat, and replacing whatever matched it (the straight " character in the appropriate context) with the starting double quote, sDquote.
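To watch the substring replacements and the regular expression substitution work together on an XML tree without unpacking an ODF file, here is a small self-contained sketch. It is Python 3, not the Python 2 of fix3.py, and the SAMPLE string is a made-up stand-in for the document's real XML; the depth argument and debug printing are left out to keep it short.

```python
import re
import xml.dom.minidom

EM_DASH = '\u2014'
EN_DASH = '\u2013'
S_DQUOTE = '\u201c'
sDpat = re.compile(r'(\A|(?<=\s))"(?=\S)', re.U)

def fixdata(node):
    # Plain substring replacement uses the string's replace method...
    node.data = node.data.replace('--', EM_DASH)
    node.data = node.data.replace(EN_DASH, EM_DASH)
    # ...while a regular expression uses the pattern object's sub method.
    node.data = sDpat.sub(S_DQUOTE, node.data)

def handle_xml_tree(node):
    # Recurse into children; leaf nodes that carry text have a data attribute.
    if node.hasChildNodes():
        for kid in node.childNodes:
            handle_xml_tree(kid)
    elif hasattr(node, 'data'):
        fixdata(node)

# Hypothetical stand-in for the XML inside a real document.
SAMPLE = '<doc><p>Two hyphens -- ugly</p><p>"Quoted" text</p></doc>'
doc = xml.dom.minidom.parseString(SAMPLE)
handle_xml_tree(doc)

for p in doc.getElementsByTagName('p'):
    # ascii() shows the new characters as \u escapes on any terminal.
    print(ascii(p.firstChild.data))
```

The argument order is worth a second look: replace takes (old, new), whereas sub takes (replacement, string) because the pattern is already held by the object.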
Now, if we try fixit.sh with fix3.py as the lower-level program:
% ln -sf fix3.py fixit.py
% ./fixit.sh ex3.odt ex3-2.odt
depth=5: childNode: The ?en? dash ? is too short
depth=5: childNode: The ?em? dash ? which is longer ? is what we need.
depth=5: childNode: And two hyphens -- ugly -- should be turned into ?em? dashes.
depth=5: childNode: This line has "straight double quotes"
depth=5: childNode: These 'single quotes' aren't pretty.
% oowriter ex3-2.odt
%