• Aug 4, 2020
  • Goutham
  • Solutions, Websites

Unleashing Talend Machine Learning Capabilities

artha

Introduction

This article shows how Talend Real-time Big Data combines real-time data processing with machine learning. The use case is processing Twitter data in real time and classifying whether the person tweeting has post-traumatic stress disorder (PTSD). The same approach can be applied to other major health conditions, such as cancer, as discussed at the end.

What is PTSD?

PTSD is a mental disorder that can develop after a person is exposed to a traumatic event, such as sexual assault, warfare, traffic collisions, or other threats on a person’s life.

Statistics about PTSD

  • 70% of adults in the U.S. have experienced some traumatic event at least once in their lives, and up to 20% of
    these people go on to develop PTSD.
  • An estimated 8% of Americans, 24.4 million people, have PTSD at any given time.
  • An estimated one out of every nine women develops PTSD, making women about twice as likely as men to develop it.
  • Almost 50% of all outpatient mental health patients have PTSD.
  • Among people who are victims of a severe traumatic experience, 60 – 80% will develop PTSD.

Source: Taking a look at PTSD statistics

Insights into the solution

Given the rapid growth in the number of social network users, an enormous amount of data is written to social networks every day. Handling that volume calls for a Hadoop ecosystem. With Twitter as the data source, this PTSD use case is therefore a Big Data use case.

Spark Framework
Apache Spark™ is a fast and general engine for large-scale data processing.
Random Forest Model
Random forest is an ensemble learning method for classification, regression, and other tasks. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
Hadoop Cluster (Cloudera)
A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment.
Hashing TF
As a text-processing algorithm, Hashing TF converts input data into fixed-length feature vectors that reflect the importance of a term (a word or a sequence of words) by calculating how frequently the term appears in the input data.
Talend Studio for Real Time Big Data
The Talend Studio edition used to build MapReduce, Spark, and real-time Big Data Jobs.
Inverse Document Frequency
As a text-processing algorithm, Inverse Document Frequency (IDF) is often used to process the output of the Hashing TF computation in order to downplay the importance of the terms that appear in too many documents.
Kafka Service
Apache Kafka is an open-source stream processing platform written in Scala and Java to provide a unified, high-throughput, low-latency platform for handling a real-time data feed.
Regex Tokenizer
Regex tokenizer performs advanced tokeni

Step 1: Retrieve data from Twitter using Talend

Talend Studio supports not only Talend’s own components but also custom-built components from third parties. These custom-built components are available from Talend Exchange, an online component store.

  • Taking advantage of a custom Twitter component, we can get data from Twitter by accessing both REST and Stream
    APIs.
  • To take advantage of the Hadoop ecosystem for Big Data, we implemented a real-time Kafka service that reads
    data from Twitter.
  • Talend Studio for Real-time Big Data has Kafka components that consume the data published by the Kafka
    service and pass it on to the next stages of the design in real time.

To perform all of the above, we need to get access to the Twitter API.
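The hand-off between the Kafka service and the downstream job can be pictured with a small sketch. It is not Kafka code: it assumes an in-process `queue.Queue` as a stand-in for a Kafka topic, and the `producer` and `consumer` names are hypothetical. It only illustrates how a reader service and a consuming job are decoupled by a stream.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream in this sketch only


def producer(tweets, topic):
    """Stand-in for the service that reads Twitter and publishes to a topic."""
    for tweet in tweets:
        topic.put(tweet)
    topic.put(SENTINEL)


def consumer(topic, handle):
    """Stand-in for the consuming job: pass each message downstream as it arrives."""
    while True:
        message = topic.get()
        if message is SENTINEL:
            break
        handle(message)


topic = queue.Queue()          # assumption: replaces a real Kafka topic
received = []                  # the "next stage" simply collects messages here
worker = threading.Thread(target=consumer, args=(topic, received.append))
worker.start()
producer(["first tweet #ptsd", "second tweet #ptsd"], topic)
worker.join()
```

In a real deployment the queue would be a Kafka topic on the cluster, and the consumer side would be the Kafka input component in the Talend Job.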

Snapshots of Talend Job designs

Deciding which hashtags to use plays a vital role. We may use a single hashtag, or a combination of multiple hashtags to
pull the accurate data required. Choosing appropriate hashtags helps to filter the large volume of source data.
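As a rough illustration of hashtag-based filtering, here is a minimal Python sketch. The hashtags in `TARGET_HASHTAGS` and the sample tweets are made up for the example; the article does not list the hashtags that were actually used.

```python
import re

# Hypothetical hashtag set, chosen only for illustration
TARGET_HASHTAGS = {"#ptsd", "#trauma"}

HASHTAG_PATTERN = re.compile(r"#\w+")


def extract_hashtags(tweet):
    """Return the set of lowercase hashtags found in a tweet."""
    return {tag.lower() for tag in HASHTAG_PATTERN.findall(tweet)}


def matches(tweet, required, match_all=False):
    """True if the tweet has any required hashtag, or every one when match_all is set."""
    found = extract_hashtags(tweet)
    return required <= found if match_all else bool(required & found)


tweets = [
    "Nightmares again last night #PTSD #trauma",
    "Great game tonight! #football",
    "Therapy is helping a lot #ptsd",
]

# Single-hashtag style match: keep tweets carrying at least one target hashtag
any_match = [t for t in tweets if matches(t, TARGET_HASHTAGS)]
# Combination match: keep only tweets carrying every target hashtag
all_match = [t for t in tweets if matches(t, TARGET_HASHTAGS, match_all=True)]
```

Requiring a combination of hashtags narrows the result set, which is the trade-off to weigh when choosing between one hashtag and several.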

Step 2: Create and train the model using Talend

Training a classifier requires human input. Once the data pulled from Twitter is in place, we
need to manually classify the tweets as Having PTSD or Not Having PTSD.

Classification can be done by adding a new attribute to the data. Its value is either
Yes or No (Yes – having PTSD, No – Not having PTSD). Once the classification is done,
this labeled data becomes the training set used to create and train the model.
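The labeling step might look like the following sketch. The field names `tweet` and `label`, the helper names, and the sample tweets are assumptions for illustration; the article does not show the actual schema.

```python
import csv
import io

VALID_LABELS = {"Yes", "No"}  # Yes: having PTSD, No: not having PTSD


def build_training_set(tweets, manual_labels):
    """Pair each tweet with its manually assigned label, skipping unreviewed tweets."""
    rows = []
    for tweet in tweets:
        label = manual_labels.get(tweet)
        if label is None:
            continue  # not reviewed yet, so it cannot be used for training
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label {label!r} for tweet {tweet!r}")
        rows.append({"tweet": tweet, "label": label})
    return rows


def to_csv(rows):
    """Serialize the training set as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["tweet", "label"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical tweets and reviewer decisions
pulled = ["cannot sleep since the accident", "lovely weather today", "not reviewed yet"]
reviewed = {"cannot sleep since the accident": "Yes", "lovely weather today": "No"}

training_set = build_training_set(pulled, reviewed)
training_csv = to_csv(training_set)
```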

To achieve our use case, before creating the model, training data needs to undergo some transformations such as:

  • Regex Tokenizer
  • Hashing TF
  • Inverse Document Frequency
  • Vector Conversion
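To make these transformations concrete, here is a minimal plain-Python sketch, not the Talend or Spark implementation. Tokenization runs first because hashing works on tokens. It assumes a CRC32 hash as a stand-in for the hash function a real library would use, a deliberately tiny vector size, and the smoothed IDF form log((N + 1) / (df + 1)); the sample documents are made up.

```python
import math
import re
import zlib

NUM_FEATURES = 16  # tiny on purpose; real jobs use far larger vectors


def tokenize(text, pattern=r"\W+"):
    """Regex tokenizer: lowercase the text and split it on the pattern."""
    return [tok for tok in re.split(pattern, text.lower()) if tok]


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Hash each token to a bucket and count term frequencies per bucket."""
    vector = [0.0] * num_features
    for tok in tokens:
        index = zlib.crc32(tok.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


def idf_weights(tf_vectors):
    """Smoothed IDF per bucket: log((N + 1) / (df + 1))."""
    n_docs = len(tf_vectors)
    weights = []
    for j in range(len(tf_vectors[0])):
        doc_freq = sum(1 for vec in tf_vectors if vec[j] > 0)
        weights.append(math.log((n_docs + 1) / (doc_freq + 1)))
    return weights


def tf_idf(tf_vector, weights):
    """Scale term frequencies by IDF to get the final feature vector."""
    return [tf * w for tf, w in zip(tf_vector, weights)]


docs = ["ptsd nightmares again", "ptsd flashbacks", "ptsd therapy helps"]
tf_vectors = [hashing_tf(tokenize(d)) for d in docs]
weights = idf_weights(tf_vectors)
features = [tf_idf(vec, weights) for vec in tf_vectors]
```

Because "ptsd" appears in every document, its bucket gets an IDF weight of zero, which is exactly the down-weighting of overly common terms that IDF is meant to provide.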

After passing through all the algorithms above, training data can be passed into the model to create and train it. The model that suits this prediction use case best is the Random Forest Model.
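The voting step of a random forest can be sketched in a few lines. This shows only how individual tree predictions are combined by majority; the one-split "stumps" and their thresholds are hypothetical and hard-coded, whereas a real random forest learns its trees from bootstrap samples of the training data.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common class among the individual tree predictions."""
    return Counter(votes).most_common(1)[0][0]


def make_stump(feature_index, threshold):
    """A one-split decision tree: 'Yes' if the chosen feature exceeds the threshold."""
    def stump(features):
        return "Yes" if features[feature_index] > threshold else "No"
    return stump


# Hypothetical forest of three hand-written stumps (odd count avoids ties)
forest = [make_stump(0, 0.5), make_stump(1, 0.5), make_stump(2, 0.5)]


def predict(forest, features):
    """Classify a feature vector by majority vote over all trees."""
    return majority_vote(tree(features) for tree in forest)


label_a = predict(forest, [1.0, 1.0, 0.0])  # two of three stumps vote Yes
label_b = predict(forest, [0.0, 1.0, 0.0])  # two of three stumps vote No
```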

Talend Studio for Real-time Big Data provides machine learning components that perform regression, classification, and prediction on the Spark framework. Using these components, we created a Random Forest model and trained it with the training data. The model is now ready to classify tweets.

Note: All the work is done on a Cloudera Hadoop cluster. Talend is connected to the cluster and drives the rest of the computation.

Snapshot of a Talend Spark Job design

Step 3: Prediction of tweets using Talend

Now we have the model ready on our Hadoop cluster. We can use the process in step 1 to pull data from Twitter again, which acts as the test
data. The test data has only one attribute: Tweet.

When the test data is passed to the model we have created, the model adds a new attribute, Label, to the test data. Its value is Yes or No (Yes – having PTSD, No – Not having PTSD). The predicted value depends solely on how the model was trained in step 2. Again, all of this prediction can be done in Talend Studio for Real-time Big Data using the Spark framework.

Snapshot of a Talend Spark Job design for prediction

Once the model had classified the test data set, we found that about 25% of the records, on average, were misclassified. We assign the right classification to those records, add them to the training set, and retrain the model, which should improve its predictions. We then add more records to the training set and repeat the procedure until the model becomes accurate. A model needs to evolve over time by being retrained as new training data arrives, so some ongoing management is required.

Note: To boost the effectiveness of the model, we can add synonym variants of the training tweets to the training set and retrain the model. This grows the training set synthetically rather than only organically.
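One simple way to picture synonym-based augmentation is the sketch below. The synonym table, the `augment` helper, and the sample row are hypothetical; the article does not describe how synonyms were generated.

```python
# Hypothetical synonym table, for illustration only
SYNONYMS = {
    "scared": ["afraid", "frightened"],
    "nightmares": ["bad dreams"],
}


def augment(rows, synonyms):
    """Return the original rows plus one labeled copy per single-word synonym swap."""
    augmented = list(rows)
    for row in rows:
        words = row["tweet"].split()
        for i, word in enumerate(words):
            for alternative in synonyms.get(word.lower(), []):
                swapped = words[:i] + [alternative] + words[i + 1:]
                # The copy keeps the label of the tweet it was derived from
                augmented.append({"tweet": " ".join(swapped), "label": row["label"]})
    return augmented


training_rows = [{"tweet": "I am scared of nightmares", "label": "Yes"}]
augmented_rows = augment(training_rows, SYNONYMS)
```

Each generated row inherits the label of its source tweet, so a reviewer should still spot-check that a substitution has not changed the meaning.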

A threshold of 90% accurate predictions is a must to classify the model as accurate. If the prediction accuracy level drops below 90%, then it is time to retrain the model.
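The accuracy check and the feedback of corrected records can be sketched as follows. The 90% threshold comes from the text above; the function names, field names, and sample data are assumptions for illustration.

```python
ACCURACY_THRESHOLD = 0.90  # threshold stated in the article


def evaluate(predicted, actual):
    """Return (accuracy, indices of the misclassified records)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot evaluate an empty test set")
    wrong = [i for i, (p, a) in enumerate(zip(predicted, actual)) if p != a]
    accuracy = 1 - len(wrong) / len(actual)
    return accuracy, wrong


def corrections_for_training(tweets, actual, wrong):
    """Build corrected rows to append to the training set before retraining."""
    return [{"tweet": tweets[i], "label": actual[i]} for i in wrong]


# Hypothetical test batch: model output versus reviewer-verified labels
tweets = ["t1", "t2", "t3", "t4"]
predicted = ["Yes", "No", "Yes", "No"]
actual = ["Yes", "No", "No", "No"]

accuracy, wrong = evaluate(predicted, actual)
needs_retraining = accuracy < ACCURACY_THRESHOLD
new_rows = corrections_for_training(tweets, actual, wrong)
```

Here one of four predictions is wrong, so the accuracy falls below the threshold and the corrected record is queued for the next training run.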

Note:
Once the classification of data is done (Yes or No), it may lead to many more useful real-time applications.

Broader Scope

The solution designed for this use case can work for other major health conditions as well. For example, for cancer, we can use cancer-specific hashtags to train the model in the same way and then predict whether the person has cancer. The same real-time applications discussed above can be achieved.
