Top 23 Free Software for Text Analysis, Text Mining, Text Analytics


Text analytics is the process of converting unstructured text data into meaningful data. The top 23 free software packages for text analysis, text mining, and text analytics include QDA Miner Lite, KH Coder, TAMS Analyzer, Carrot2, CAT, GATE, tm, Gensim, Natural Language Toolkit, RapidMiner, Unstructured Information Management Architecture, OpenNLP, KNIME, Orange-Textable, LPU, Apache Mahout, Pattern, LingPipe, S-EM, LibShortText, VisualText, Twinword, and Coh-Metrix. These are some of the key vendors that provide open source text analytics software, in no particular order. Text analysis applications scan a set of documents written in a natural language. They then either model the document set for predictive classification or populate a database or search index with the extracted information.

You may also like to review the Text Analysis, Text Mining, Text Analytics proprietary software list:

Top software for Text Analysis, Text Mining, Text Analytics


Here is the list of the top 23 free software packages for text analysis, text mining, and text analytics:

1.QDA Miner Lite

QDA Miner Lite is free computer-assisted qualitative analysis software from Provalis Research. It can be used to analyze textual data such as interview and news transcripts and open-ended responses, as well as still images. It offers basic CAQDAS features such as importing documents from plain text, RTF, HTML, and PDF, as well as data stored in Excel, MS Access, CSV, and tab-delimited text files. Other features include importing from other qualitative coding software, intuitive coding using codes organized in a tree structure, and the ability to add comments (or memos) to coded segments, cases, or the whole project.

The software also provides a fast Boolean text search tool for retrieving and coding text segments; code frequency analysis with bar charts, pie charts, and tag clouds; coding retrieval with Boolean and proximity operators; export of tables to XLS, tab-delimited, CSV, and Word formats; and export of graphs to BMP, PNG, JPEG, and WMF formats.



2.GATE

GATE is the General Architecture for Text Engineering, an open source toolbox for natural language processing and language engineering. It is used for all sorts of language processing tasks and applications, including voice of the customer, cancer research, drug research, decision support, recruitment, web mining, information extraction, and semantic annotation.

GATE includes an information extraction system called ANNIE (A Nearly-New Information Extraction System). ANNIE is a set of modules comprising a tokenizer, a gazetteer, a sentence splitter, a part-of-speech tagger, a named entities transducer, and a coreference tagger. It can be used as-is to provide basic information extraction functionality, or as a starting point for more specific tasks.

Languages currently handled in GATE are English, Spanish, Chinese, Arabic, Bulgarian, French, German, Hindi, Italian, Cebuano, Romanian, and Russian.
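
To make the idea of chained extraction modules more concrete, here is a minimal sketch in plain Python of three of the stages named above: a sentence splitter, a tokenizer, and a gazetteer lookup. It does not use GATE or ANNIE. The `GAZETTEER` entries, the function names, and the stage order are all assumptions made for illustration only.

```python
import re

# Hypothetical gazetteer: known names mapped to entity types (illustrative only).
GAZETTEER = {"london": "Location", "paris": "Location", "ibm": "Organization"}

def split_sentences(text):
    """Split after sentence-final punctuation that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Return word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def annotate(text):
    """Run the stages in order and collect (sentence index, token, type) matches."""
    annotations = []
    for index, sentence in enumerate(split_sentences(text)):
        for token in tokenize(sentence):
            entity_type = GAZETTEER.get(token.lower())
            if entity_type:
                annotations.append((index, token, entity_type))
    return annotations

result = annotate("IBM opened an office in London. Paris is next!")
print(result)
```

A real system would add the part-of-speech tagging, rule-based entity recognition, and coreference stages that this sketch leaves out.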




3.TAMS Analyzer

TAMS, the Text Analysis Markup System, is a convention for identifying themes in texts such as web pages, interviews, and field notes. It was designed for use in ethnographic and discourse research. TAMS Analyzer is a program for Macintosh OS X that works with TAMS to let you assign ethnographic codes to passages of a text just by selecting the relevant text and double-clicking the name of the code on a list. It then lets you extract, analyze, and save the coded information.



4.Carrot2

Carrot2 is an open source framework for clustering text and search results. It can automatically cluster small collections of documents, search results, or document abstracts into thematic categories. Apart from two specialized search results clustering algorithms, Carrot2 also offers ready-to-use components for fetching search results from various sources, including the Google API, Bing API, eTools Meta Search, Lucene, Solr, and more.





5.CAT

CAT is a free service of the Qualitative Data Analysis Program. It lets you efficiently code raw text data sets, annotate coding with shared memos, manage team coding permissions via the Web, create unlimited collaborator sub-accounts, and assign multiple coders to specific tasks. With CAT you can also measure inter-rater reliability, adjudicate valid and invalid coder decisions, report validity by dataset, code, or coder, and export coding in RTF, CSV, or XML format.
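
Inter-rater reliability can be measured in several ways, and the text here does not say which statistic CAT reports. As an illustration only, the sketch below computes Cohen's kappa, one common agreement measure for two coders, in plain Python. The function name and the sample labels are assumptions, not part of CAT.

```python
from collections import Counter

def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty item set")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement by chance, from each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical coding decisions for six text segments.
coder_a = ["valid", "valid", "invalid", "valid", "invalid", "invalid"]
coder_b = ["valid", "invalid", "invalid", "valid", "invalid", "valid"]
kappa = cohen_kappa(coder_a, coder_b)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, so a value near 0 means the coders agree no more often than chance would predict.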


6.KH Coder

KH Coder is an application for quantitative content analysis, text mining, and corpus linguistics. It can handle Japanese, English, French, German, Italian, Portuguese, and Spanish language data.

Given raw texts as input, it offers search and statistical analysis functions such as KWIC (keyword in context), collocation statistics, co-occurrence networks, self-organizing maps, multidimensional scaling, cluster analysis, and correspondence analysis. KH Coder provides these functions using back-end tools such as Stanford POS Tagger, Snowball stemmer, MySQL, and R.
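
For readers unfamiliar with KWIC, the following is a minimal keyword-in-context sketch in plain Python. It is not KH Coder's implementation. The function name, the `width` parameter, and the sample text are assumptions chosen for illustration.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each keyword match."""
    tokens = re.findall(r"\w+", text.lower())
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows

sample = "Text mining finds patterns in text. Mining text at scale needs tools."
rows = kwic(sample, "mining", width=2)
for left, word, right in rows:
    # Right-align the left context so the keyword column lines up.
    print(f"{left:>20} | {word} | {right}")
```

Lining up every occurrence of a keyword with its neighbors makes it easy to scan how a term is actually used across a corpus.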


7.tm (Text Mining Infrastructure in R)

The tm package provides a framework for text mining applications within R. It offers functionality for managing text documents, abstracts the process of document manipulation, and eases the use of heterogeneous text formats in R. The package provides native support for reading several classic file formats such as plain text, PDF, and XML files. There is also a plug-in mechanism to handle additional file formats. The data structures and algorithms can be extended to fit custom demands.





8.Gensim

Gensim is a Python library for scalable statistical semantics. It analyzes plain text documents for semantic structure and retrieves semantically similar documents. The algorithms in Gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation, and Random Projections, discover the semantic structure of documents by examining statistical word co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised.
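
The retrieval of similar documents described above can be illustrated with a much simpler stand-in. The sketch below does not use Gensim's API and does not implement Latent Semantic Analysis or Latent Dirichlet Allocation. It only compares raw word-count vectors with cosine similarity, and every name and sample document in it is an assumption made for illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Map a text to word counts (whitespace tokenization, lowercased)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(query, documents):
    """Return the document whose word counts are closest to the query."""
    query_vec = bag_of_words(query)
    return max(documents, key=lambda doc: cosine(query_vec, bag_of_words(doc)))

documents = [
    "human computer interaction survey",
    "graph of trees and minors",
    "user interface for computer systems",
]
best = most_similar("computer interface design", documents)
print(best)
```

Topic models go further than this by mapping words that tend to co-occur onto shared latent dimensions, so two documents can match even when they share no exact words.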


9.Natural Language Toolkit (NLTK)

Natural Language Toolkit (NLTK) is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. NLTK is available for Windows, Mac OS X, and Linux.



10.RapidMiner

RapidMiner is an integrated environment for machine learning, data mining, text mining, predictive analytics, and business analytics. Its Text Processing Extension adds text mining capabilities.




11.Unstructured Information Management Architecture (UIMA)

Unstructured Information Management Architecture (UIMA) is a component framework for analyzing unstructured content such as text, audio, and video. It was originally developed by IBM.

UIMA enables applications to be decomposed into components, for example “language identification” => “language specific segmentation” => “sentence boundary detection”. Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. UIMA also provides capabilities to wrap components as network services, and it can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.





12.Apache OpenNLP

The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services.


13.KNIME Text Processing

The KNIME Text Processing feature lets you read, process, mine, and visualize textual data in a convenient way. It provides functionality from natural language processing (NLP), text mining, and information retrieval.



14.Orange Textable

Textable is an add-on for the Orange data mining software package. It enables users to build data tables on the basis of text data by means of a flexible and intuitive interface. In particular, it lets users import text data from various sources, apply systematic recoding operations, apply analytical processes such as segmentation and annotation, select unit subsets manually, automatically, or randomly, and build concordances and collocation lists.



15.LPU

LPU (Learning from Positive and Unlabeled data) is a text learning or classification system that learns from a set of positive documents and a set of unlabeled documents, without labeled negative documents. This type of learning is different from classic text learning/classification, in which both positive and negative training documents are required.


16.Apache Mahout

Apache Mahout is a project of the Apache Software Foundation that aims to create scalable machine learning algorithms that are free to use under the Apache license. Mahout contains implementations for clustering, categorization, and collaborative filtering, which can run on top of Apache Hadoop using the map/reduce paradigm. Three use cases are supported. Recommendation mining takes users' behavior and tries to find items users might like. Clustering takes text documents and groups them into sets of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like, and assigns unlabelled documents to the correct category.
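
To illustrate the classification use case, here is a small multinomial naive Bayes text classifier in plain Python. It is a sketch of the general technique only. It does not use Mahout's API or Hadoop, and the text above does not say which classifier Mahout applies. The function names, the add-one smoothing, and the training examples are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Count words per category and documents per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # category prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocabulary:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical categorized documents.
model = train([
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "work"),
    ("project meeting notes", "work"),
])
print(classify("buy cheap pills", model))
print(classify("monday meeting", model))
```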



17.Pattern

Pattern is a web mining module for the Python programming language. It provides tools for data mining (Google, Twitter, and Wikipedia APIs, a web crawler, an HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis, and canvas visualization.



18.LingPipe

LingPipe is a toolkit for processing text using computational linguistics. It is used for tasks such as finding the names of people, organizations, or locations in news, automatically classifying Twitter search results into categories, and suggesting correct spellings of queries. LingPipe is a Java API with source code and unit tests, and it includes multi-lingual, multi-domain, multi-genre models.



19.S-EM

S-EM is a text learning or classification system that learns from a set of positive and unlabeled examples, with no negative examples. It is based on a “spy” technique, naive Bayes, and the EM algorithm.



20.LibShortText

LibShortText is an open source tool for short-text classification and analysis. It can handle the classification of titles, questions, sentences, and short messages, and it is more efficient than general text-mining packages. On a typical computer, processing and training 10 million short texts takes only around half an hour. An interactive tool for error analysis is included. Based on the property that each short text contains few words, LibShortText provides details in predicting each text.


21.VisualText

VisualText is an integrated development environment for building information extraction systems, natural language processing systems, and text analyzers. It features NLP++, a C++-like programming language for quickly elaborating grammars, patterns, heuristics, and knowledge.



22.Twinword

Twinword provides text analysis APIs designed to understand and associate words in the same way humans do. Features include context and topic extraction, online consumer sentiment analysis for brands and products, and personalized and targeted e-commerce/advertising platforms.



23.Coh-Metrix

Coh-Metrix is a system for computing cohesion and coherence metrics for written and spoken texts. It allows readers, writers, educators, and researchers to instantly gauge the difficulty of written text for the target audience.



  • May 22, 2014 at 9:54 am

    Have you looked at the free, open source, web-based ?

  • February 16, 2015 at 7:46 pm

    DiscoverText is freemium software with many powerful text analytics features. It is free for 30 days, and a core set of coding (labeling/annotation) features remains free after the 30-day trial expires.

  • Amnon Meyers
    April 9, 2015 at 4:12 pm

    VisualText at has been here for 15 years, and is a one-stop shop for developing the most accurate and complete NLP solutions. Free for non-commercial use (that is, till you are actually deploying or reaping revenue from your analyzers).
    NLP++ is one of the only programming languages for NLP.

    Check out the new website at

    Amnon Meyers
    Text Analysis International, Inc

  • June 21, 2015 at 9:30 pm

    I would like to recommend Twinword’s Text Analysis APIs.

    Check out the website for a list of APIs for different functions of text analysis at:


  • July 25, 2015 at 11:23 am

    Coh-Metrix, a theoretically grounded, computational linguistics facility that analyzes texts on multiple levels of language and discourse (Graesser et al., 2014; Graesser, McNamara, Louwerse, & Cai, 2004; D. S. McNamara, Graesser, McCarthy, & Cai, 2014).
