{"id":30,"date":"2015-05-09T18:42:00","date_gmt":"2015-05-09T08:42:00","guid":{"rendered":""},"modified":"2018-05-22T20:49:58","modified_gmt":"2018-05-22T10:49:58","slug":"tikadiff-graphical-diff-for-text-from-binary-files","status":"publish","type":"post","link":"https:\/\/pbw.id.au\/blog\/2015\/05\/tikadiff-graphical-diff-for-text-from-binary-files\/","title":{"rendered":"tikadiff: graphical diff for text from &#8220;binary&#8221; files"},"content":{"rendered":"<h4>Code<\/h4>\n<p>The <a href=\"https:\/\/bitbucket.org\/pbwest\/tikadiff\/downloads\">code<\/a> is from the Downloads area of my <a href=\"https:\/\/bitbucket.org\/\">Atlassian Bitbucket<\/a> <a href=\"https:\/\/bitbucket.org\/p_b_west\/tikadiff\/downloads\/\">repository<\/a>; see the <a href=\"https:\/\/bitbucket.org\/p_b_west\/tikadiff\">README<\/a> online.<\/p>\n<h4>Version Control Systems (VCSs)<\/h4>\n<p>VCSs like <i><a href=\"http:\/\/mercurial.selenic.com\/\">mercurial<\/a><\/i>, <i><a href=\"http:\/\/www.git-scm.com\/\">git<\/a><\/i> and <i><a href=\"http:\/\/bazaar.canonical.com\/en\/\">bazaar<\/a><\/i> (to mention only a few) are great for keeping track of changes to source files, but their utility doesn&#8217;t stop there. \u00a0If you&#8217;re working on documents in applications like <i>Word<\/i>, <i>OpenOffice<\/i> or <i>LibreOffice<\/i>, especially when you are asking others to review those documents, a VCS program can save you a lot of anguish.<\/p>\n<p>However, people who work not with source code, but with research papers, academic assignments and the like, are not inclined to make themselves familiar with the tools that geeks have grown used to. \u00a0Considering how long it took for software developers to embrace those tools, it&#8217;s hardly surprising.<!--more--><\/p>\n<h4>Limitations of VCSs<\/h4>\n<p>Unfortunately, VCSs aren&#8217;t set up well for managing such files. \u00a0They depend for efficient version management on tracking line-by-line differences in files. 
\u00a0This allows them to maintain the minimal set of changes from version to version, and to readily show what changes have occurred between any two versions. \u00a0That works well for source files, which are written in plain text, but not for files which maintain their own complex internal formats, which generally only become readable when translated by the &#8220;mother&#8221; application.<\/p>\n<p>VCSs provide for such files, but they mark them as binary, and make no attempt to track the differences between them. \u00a0Instead, for each version, they simply keep a copy of the entire file. \u00a0While this is still better by a long stretch than not being able to track, version by version, the history of a document, it makes it impossible for the native VCS to show what the differences between versions actually were.<\/p>\n<h4>Other solutions<\/h4>\n<p>This problem has been partially addressed before. For those who generate Open Document Text (ODT) files with, for example, <i>OpenOffice<\/i>, the program\u00a0<a href=\"https:\/\/github.com\/dstosberg\/odt2txt\/\"><i>odt2txt<\/i><\/a>\u00a0will extract text from those documents. \u00a0<i>odt2txt<\/i>, in turn, enabled\u00a0<a href=\"https:\/\/github.com\/fpz\/oodiff\"><i>oodiff<\/i><\/a>, which is a script to generate a diff from the text of two ODT files, extracted with <i>odt2txt<\/i>. \u00a0It&#8217;s not a complete solution, because there is still no merge facility, but it at least allows the textual differences between versions to be examined from within the VCS.<\/p>\n<p>The combination of such a file comparison program with a VCS graphical interface presents a much lower barrier to adoption for users from outside the asylum. \u00a0Because I&#8217;m on OS X, I&#8217;ve been using Atlassian&#8217;s free product, <i><a href=\"https:\/\/www.sourcetreeapp.com\/\">SourceTree<\/a><\/i>, which has the advantage of working with both <i>mercurial (hg)<\/i> and <i>git<\/i>. 
\u00a0If you&#8217;re working with <i>mercurial<\/i> only, or if you are on a Linux distribution, you can use <i><a href=\"http:\/\/tortoisehg.bitbucket.org\/\">TortoiseHg<\/a><\/i> as a graphical front end.<\/p>\n<p>That wasn&#8217;t quite enough, because my wife, whose requirements got me thinking about this, works mainly with <i>Word<\/i> files. \u00a0I needed an equivalent of <i>odt2txt<\/i> and <i>oodiff<\/i> for both .doc and .docx files \u2014 which have very different formats.<\/p>\n<h4>Thank you, Tika<\/h4>\n<p>Fortunately, the problem of extracting text from these formats has already been conveniently solved by the <a href=\"https:\/\/tika.apache.org\/\">Apache Tika<\/a> project. \u00a0Tika can extract metadata, plain text, xml and html from a dazzling array of file types. \u00a0For my purposes, plain text and metadata will suffice; metadata only for any file types, like images, from which text cannot be gleaned. \u00a0Tika provides not merely a substitute for <i>odt2txt<\/i>, but an extension to virtually all commonly (and many not so commonly) used file formats. All that remained was to provide a substitute for <i>oodiff<\/i>.<\/p>\n<h4><i>tikadiff<\/i><\/h4>\n<p>That&#8217;s <i>tikadiff<\/i>. \u00a0It takes either two filenames and generates a graphical diff of them, or two directories and, for each file in the first, generates a diff of the file of the same name (if it exists) in the second. 
\u00a0<i>tikadiff<\/i> depends on Tika (obviously) and a graphical diff program; by default <i><a href=\"http:\/\/kdiff3.sourceforge.net\/\">kdiff3<\/a><\/i>, but it is currently written to look also for Perforce <i><a href=\"http:\/\/www.perforce.com\/downloads\/helix\">p4merge<\/a><\/i>, and can be instructed to use a diff program of your choice, provided that it accepts the same two arguments.<\/p>\n<p><i>tikadiff<\/i> depends also on a number of scripts that are distributed with it.<\/p>\n<h4><i>tika<\/i><\/h4>\n<p><i>tika<\/i> is a convenience script to run the CLI from the tika-app jar file. \u00a0It passes all its arguments to tika-app. \u00a0In addition to that basic function, it will preset certain arguments to tika-app depending on the name by which it is invoked.<\/p>\n<div>\n<ul>\n<li><i>tikatype<\/i> \u2014prints the mimetype of the file named in its argument<\/li>\n<li><i>tikameta<\/i> \u2014prints the metadata of\u00a0the file named in its argument<\/li>\n<li><i>tikatext<\/i> \u2014prints the plain text extracted from\u00a0the file named in its argument<\/li>\n<li><i>tikaxml<\/i> \u2014prints the XML extracted from\u00a0the file named in its argument<\/li>\n<li><i>tikahtml<\/i> \u2014prints the HTML extracted from\u00a0the file named in its argument<\/li>\n<\/ul>\n<\/div>\n<h4><em>tikserve<\/em><\/h4>\n<p><i>tikserve<\/i>, like <i>tika<\/i>, is a convenience script to run tika-app. \u00a0Unlike <i>tika<\/i>, it is not invoked (except during setup) under its own name, but only through a series of links. 
\u00a0It runs tika-app as a server, which performs some operation on any file which is written to its open TCP port.<\/p>\n<div>\n<ul>\n<li><i>tikstype<\/i> \u2014set up a server to return the mimetype of the file<\/li>\n<li><i>tiksmeta<\/i> \u2014set up a server to return the metadata of the file<\/li>\n<li><i>tikstext<\/i> \u2014set up a server to return the plain text extracted from the file<\/li>\n<li><i>tiksxml<\/i> \u2014set up a server to return the XML\u00a0extracted from the file<\/li>\n<li><i>tikshtml<\/i> \u2014set up a server to return the HTML\u00a0extracted from the file<\/li>\n<\/ul>\n<div><i>tikadiff<\/i> only uses <i>tikstype<\/i>, <i>tiksmeta<\/i> and <i>tikstext<\/i> when comparing directories, on the theory that it will be faster to use a server when testing and comparing multiple files. \u00a0I have not checked whether this theory is valid.<\/div>\n<\/div>\n<p>In order to run the server(s), <i>tikserve<\/i> needs a port on localhost. To support this habit, <i>tikserve<\/i> looks to two other scripts.<\/p>\n<h4><i>freeport<\/i><\/h4>\n<p><i>freeport<\/i> hands back the next available port on localhost, starting at 1024. You may optionally give it a minimum port number and, sub-optionally, a maximum port number as constraints. \u00a0In order to perform this task, <i>freeport<\/i> requires\u2014<\/p>\n<h4><i>localhostports<\/i><\/h4>\n<p><i>localhostports<\/i>\u00a0prints the ports which, in the opinion of\u00a0<i>netstat<\/i>, are currently associated with localhost.<\/p>\n<div>\n<h4>Code, again<\/h4>\n<p>Bitbucket\u00a0<a href=\"https:\/\/bitbucket.org\/p_b_west\/tikadiff\/downloads\">repository Downloads<\/a>; see the\u00a0<a href=\"https:\/\/bitbucket.org\/p_b_west\/tikadiff\">README<\/a>\u00a0online.<\/p>\n<\/div>\n<div><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Code The code is from the Downloads area of my Atlassian Bitbucket repository; see the README online. 
Version Control Systems (VCSs) VCSs like mercurial, git and bazaar (to mention only a few) are great for keeping track of changes to source files, but their utility doesn&#8217;t stop there. \u00a0If you&#8217;re working on documents in applications &hellip; <a href=\"https:\/\/pbw.id.au\/blog\/2015\/05\/tikadiff-graphical-diff-for-text-from-binary-files\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;tikadiff: graphical diff for text from &#8220;binary&#8221; files&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[19],"tags":[],"class_list":["post-30","post","type-post","status-publish","format-standard","hentry","category-code"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p8SCfl-u","jetpack-related-posts":[{"id":33,"url":"https:\/\/pbw.id.au\/blog\/2014\/06\/zargrep-grep-files-in-a-zip-archive\/","url_meta":{"origin":30,"position":0},"title":"zargrep: grep files in a zip archive","author":"pbw","date":"Sat 21st Jun '14","format":false,"excerpt":"How do you search for strings within a zip archive? I'm tinkering with EPUB3 files, and I wanted to be able to find certain strings within .epub files, so I had a look around, and I immediately found zgrep and family. 
The trouble was that zgrep assumes a single zipped\u2026","rel":"","context":"In &quot;Code&quot;","block_context":{"text":"Code","link":"https:\/\/pbw.id.au\/blog\/category\/code\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":341,"url":"https:\/\/pbw.id.au\/blog\/2017\/01\/find-files-only-with-scm-directory-pruning\/","url_meta":{"origin":30,"position":1},"title":"find: files only with scm directory pruning","author":"admin","date":"Mon 9th Jan '17","format":false,"excerpt":"The version of find I'm discussing here is find (GNU findutils) 4.7.0-git I use this pattern frequently\u2014 $ find . <conditions> |xargs grep <pattern> to find files containing, say, a regular expression. \u00a0If the search tree contains mercurial or git directories, I usually want to exclude their contents from the\u2026","rel":"","context":"In &quot;Code&quot;","block_context":{"text":"Code","link":"https:\/\/pbw.id.au\/blog\/category\/code\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":34,"url":"https:\/\/pbw.id.au\/blog\/2013\/11\/setting-environment-variables-in-os-x-yosemite-and-mavericks\/","url_meta":{"origin":30,"position":2},"title":"Setting environment variables in  MacOS Big Sur","author":"pbw","date":"Thu 7th Nov '13","format":false,"excerpt":"This method uses launchctl to manage environment variables for programs invoked directly from Finder. \u00a0See the launchctl man page, especially the section LEGACY SUBCOMMANDS. \u00a0It's not entirely accurate, but that's not unusual. \u00a0The critical subcommands are getenv, setenv, and unsetenv. 
The man page indicates that the export subcommand is available;\u2026","rel":"","context":"In &quot;Code&quot;","block_context":{"text":"Code","link":"https:\/\/pbw.id.au\/blog\/category\/code\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":40,"url":"https:\/\/pbw.id.au\/blog\/2013\/05\/cloning-my-livejournal-blog\/","url_meta":{"origin":30,"position":3},"title":"Cloning my LiveJournal blog","author":"pbw","date":"Fri 3rd May '13","format":false,"excerpt":"I am moving any potentially useful posts over from my LiveJournal blog in hopes that they may be found a little more readily. Most of these posts are from years ago, but some still have relevance.","rel":"","context":"In &quot;Code&quot;","block_context":{"text":"Code","link":"https:\/\/pbw.id.au\/blog\/category\/code\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":39,"url":"https:\/\/pbw.id.au\/blog\/2013\/05\/setting-environment-variables-in-os-x-lion\/","url_meta":{"origin":30,"position":4},"title":"Setting Environment Variables in OS X Lion","author":"pbw","date":"Fri 3rd May '13","format":false,"excerpt":"If you want to set environment variables in OS X in such a way as to be recognised in applications run from Finder, it is not enough to set the env var in .profile. \u00a0You must also ensure that the variables are set in the file ~\/.MacOSX\/environment.plist. 
Setting values in\u2026","rel":"","context":"In &quot;Code&quot;","block_context":{"text":"Code","link":"https:\/\/pbw.id.au\/blog\/category\/code\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1040,"url":"https:\/\/pbw.id.au\/blog\/2021\/07\/breaking-news-stairs-sue-dan-andrews-for-defamation\/","url_meta":{"origin":30,"position":5},"title":"Breaking News: Stairs sue Dan Andrews for defamation","author":"admin","date":"Thu 8th Jul '21","format":false,"excerpt":"Published at Catallaxy Files on 10\/06\/2021 A set of stairs today filed a defamation suit against Victorian Premier Dan Andrews, lawyers representing the as-yet unnamed stairs announced today. \u201cDan Andrews called our client \u2018slippery\u2019,\u201d a spokes-entity for the stairs\u2019 lawyers said. \u201c\u2018Slippery\u2019 is an entity slur that stairs take very\u2026","rel":"","context":"In &quot;Catallaxy Files&quot;","block_context":{"text":"Catallaxy Files","link":"https:\/\/pbw.id.au\/blog\/category\/publications\/catallaxy-files\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/pbw.id.au\/blog\/wp-json\/wp\/v2\/posts\/30","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pbw.id.au\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/pbw.id.au\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/pbw.id.au\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/pbw.id.au\/blog\/wp-json\/wp\/v2\/comments?post=30"}],"version-history":[{"count":5,"href":"https:\/\/pbw.id.au\/blog\/wp-json\/wp\/v2\/posts\/30\/revisions"}],"predecessor-version":[{"id":563,"href":"https:\/\/pbw.id.au\/blog\/wp-json\/wp\/v2\/posts\/30\/revisions\/563"}],"wp:attachment":[{"href":"https:\/\/pbw.id.au\/blog\/wp-json\/wp\/v2\/media?parent=30"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/pbw.id.au\/blog\/wp-json\/wp\/v2\/cate
gories?post=30"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/pbw.id.au\/blog\/wp-json\/wp\/v2\/tags?post=30"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}