Key Challenges for Commercial Text Miners

This week’s Guest Post comes from Michael Iarrobino, Product Manager at Copyright Clearance Center. In the post, Michael explains how text mining can accelerate and enrich your company’s research and development program, but only when the barriers between researchers and the content they want to mine are lowered.

Michael Iarrobino is Product Manager at Copyright Clearance Center

In biomedical research and development, researchers use text mining tools to extract and interpret facts, assertions, and relationships from vast amounts of published information. Mining accelerates the research process, increases discovery of novel findings, and helps companies identify potential safety issues in the drug development process. However, despite the many benefits of text mining, researchers face a number of obstacles before they even get a chance to run queries against the body of biomedical literature.

Incomplete Information in Article Abstracts

One challenge researchers face when building a collection of articles (or “corpus”) for a text mining project is reliance on article abstracts. Many researchers build their corpus from scientific article abstracts because they are easily accessible via biomedical databases such as PubMed. In addition, article abstracts are usually provided in a format that is suitable for text mining. However, although mining abstracts provides some value, an abstract can hold only a limited amount of data. To ensure that researchers don’t miss vital data, discoveries, and assertions, the full text of the article should be mined – including detailed descriptions of methods and protocols and the complete study results.

Limited Access to XML-Formatted Content

Another challenge for researchers is that, unlike article abstracts, full-text articles are often not readily available from publishers in a format suitable for text mining. When researchers have subscriptions to journals, the documents are often provided as PDFs, a format not intended for use with text mining software. Researchers must then spend time converting the PDFs to XML (Extensible Markup Language), the preferred format for use in text mining software. XML is a markup language used to encode documents in a format that is easily read by computers. Converting PDFs to XML requires additional software tools, which is not only inefficient but also creates a number of problems with the document itself, including loss of data and tables, conflation of document sections into a “blob of text,” and the addition of bad characters and non-words.
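To illustrate why structured XML is easier to mine than a flattened “blob of text,” here is a minimal sketch in Python. The sample article, its tag names (loosely modeled on common publisher XML), and the `sections_by_title` helper are all hypothetical and heavily simplified; real publisher XML is considerably richer and varies by source.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified article. Tag names are assumptions for
# illustration and do not represent any specific publisher's schema.
SAMPLE_XML = """<article>
  <front>
    <abstract><p>Compound X reduced tumor growth in mice.</p></abstract>
  </front>
  <body>
    <sec><title>Methods</title><p>Mice received 10 mg/kg of compound X daily.</p></sec>
    <sec><title>Results</title><p>Two of twenty mice showed elevated liver enzymes.</p></sec>
  </body>
</article>"""


def paragraph_text(parent):
    """Join the text of every <p> element found under parent."""
    return " ".join("".join(p.itertext()).strip() for p in parent.iter("p"))


def sections_by_title(xml_text):
    """Return a dict mapping each section title to its paragraph text."""
    root = ET.fromstring(xml_text)
    sections = {}
    abstract = root.find("./front/abstract")
    if abstract is not None:
        sections["Abstract"] = paragraph_text(abstract)
    for sec in root.findall("./body/sec"):
        title = sec.findtext("title", default="Untitled").strip()
        sections[title] = paragraph_text(sec)
    return sections


sections = sections_by_title(SAMPLE_XML)
for title, text in sections.items():
    print(f"{title}: {text}")
```

Because the section boundaries are explicit in the markup, a query can be limited to, say, the Results section. In this made-up example, the safety-related finding appears only in the full text and not in the abstract.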

Inconsistent Licensing Terms and Fees

In addition, researchers must contend with inconsistent licensing terms and fees from multiple publishers. Because text mining projects depend on access to a broad base of content, businesses must work directly with multiple rightsholders for the use of full-text XML articles, resulting in varying fee structures, inconsistent terms of use, and, ultimately, reduced productivity. Without a common set of terms and conditions for the use of full-text content across publishers, researchers and information managers are left to negotiate one by one with individual rightsholders to obtain the content and rights they need for text mining.

Text mining can accelerate and enrich your company’s research and development program, but only when the barriers between researchers and the content they want to mine are lowered.

Michael Iarrobino is Product Manager at Copyright Clearance Center (CCC), the leader in content workflow and rights licensing technology. He oversees CCC’s RightFind™ XML for Mining product, a workflow solution for text mining researchers using peer-reviewed scientific articles. He has previously managed products that solve problems in the marketing technology and content discovery spaces while at FreshAddress, Inc., and HCPro, Inc. He speaks at webinars and conferences on the topics of content discovery and data management.