It might be a long shot, but are you interested in named entities or natural language processing? Named entities are more important than you think in identifying what something is about and why it matters. Want to learn a lot more about them in under 14 minutes? Have a look at this presentation I gave and see how you can supercharge your ability to find and use relevant digital assets. - Joseph Busch #namedentities #naturallanguageprocessing #digitalassets #taxonomy
Transcript
There's a lot of software available that does a good job of identifying named entities: the names of people, organizations, events, places, and other things that occur in text. Too often these are just used as keywords to find digital assets without any further differentiation. In this talk, I will tell you how to refine, or supercharge, those named entities so you can tag content assets more specifically and then be able to find and use them more effectively. First I'll talk about what Named Entity Recognition, or NER, is. Then I'll present some case studies, and finally I'll identify some tools and resources.

Named Entity Recognition, or NER, is a stack of methods for programmatically identifying named entities mentioned in unstructured text, classifying them into predefined categories such as people, organizations, locations, and events, and identifying patterns such as dates, time expressions, quantities, monetary values, percentages, codes, and so on. Here's a snippet where person, location, organization, event, date, and other named entities have been identified in the text.

Named entity recognition is based on a stack of natural language processing, or NLP, methods that parse sentences, which are strings of words separated by white spaces and in some cases punctuation, into meaningful expressions that can be acted upon, for example to infer the intent of a search query. Identifying the part of speech, normalizing the terms, and analyzing the phrase structure is usually sufficient to identify the type of a named entity. Most noun phrases contain named entities. They may be verified by lookup in a lexicon, glossary, authority file, or other resource such as Wikidata. For example, the longitude and latitude of Cotonou and other related information can be retrieved from Wikidata. Patterns such as numbers and dates can be identified, normalized, and further contextualized, for example by inferring that 2.4 million is a quantity for the population of Cotonou.

Natural language processing doesn't just identify keywords, the strings in between the white spaces. It provides the capability to understand the context of the named entities and tell us what the content is really about. This requires understanding the grammatical relationships in the sentence. For example, does the string "Apple" refer to the fruit, the technology company, or the record company? It depends on whether it's related to the phrase "Granny Smith," "iPhone 14," or "The Beatles." What does the string "Cotonou" refer to? It's the French name of the largest city of Benin. What is the number 2.4 million? It's a quantity: the population of Cotonou. Where is Cotonou located on a map? What is its longitude and latitude?
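To make this concrete, here is a minimal sketch of an NER-plus-lookup step in Python, assuming the open-source spaCy library and its small English model are installed. spaCy and the sample sentence are my illustration here, not one of the tools demoed in this talk; the sketch extracts entities from a sentence about Cotonou and then checks the place name against Wikidata's public search API.

```python
# Minimal NER sketch using spaCy (an illustrative stand-in, not a tool from this talk).
# Setup: pip install spacy requests && python -m spacy download en_core_web_sm
import spacy
import requests

nlp = spacy.load("en_core_web_sm")

doc = nlp("Cotonou, the largest city of Benin, has a population of about 2.4 million.")

# Each recognized entity carries a predicted category: GPE (place), ORG, PERSON,
# CARDINAL (number), and so on.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Verify and enrich a named entity by lookup in an external resource such as Wikidata.
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={"action": "wbsearchentities", "search": "Cotonou",
            "language": "en", "format": "json"},
    timeout=10,
)
for match in resp.json().get("search", []):
    print(match["id"], match.get("label"), "-", match.get("description", ""))
```

On a sentence like this, a small general-purpose model will typically tag Cotonou and Benin as places and 2.4 million as a number, and the Wikidata lookup returns the matching entity record (its Q-identifier and description), from which coordinates, population, and other related information can then be retrieved.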
How do you evaluate the accuracy of named entity extraction? A combination of techniques can be used to optimize recall and precision in named entity extraction. Ad hoc testing is used in the early stages of performance testing. This method uses iterative trial-and-error observation to obtain an initial level of NER performance. We use ad hoc testing with our clients to develop a proof of concept, to identify performance bottlenecks, to identify requirements, and to assess the client's comfort zone with the technology. Quality control scripting is a software engineering best practice in which a script consisting of test cases simulates, as much as possible, all NER cases. Quality assurance engineers develop the script based on the engineering specification and then test the product to ensure that it meets the specification. This technique is not feasible for gauging NER accuracy, but it is useful for testing systems-integration issues.

Random sampling is a statistically based approach that yields high confidence from random sampling of NER results. This is an established methodology adapted from the social sciences. A comparison of NER results against an existing collection of similar and correctly identified named entity results, such as the TREC (Text Retrieval Conference) test collection, can be used to validate NER performance. The drawback of this approach is that the NER requirements of the standard test collection might differ from the client's requirements, leading to misleading results. Creating a representative test collection of correctly identified named entities from the target source content, against which to test the NER application's performance, is the best possible approach for measuring the accuracy of the system as it relates to the customer's requirements for NER.

So how do you measure NER accuracy? You do it the same way you measure search accuracy: by recall and precision. Recall is the number of true positive results divided by the number of all samples that should have been identified as positive, while precision is the number of true positive results divided by the number of all positive results, including those not identified correctly. The test's accuracy is measured by the F1 score, which is calculated from the precision and recall of the test.
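As a worked example of these definitions (my own sketch; the entity lists are invented for illustration), here is how recall, precision, and F1 would be computed for a small sample of NER output scored against a hand-labeled gold set:

```python
# Scoring predicted named entities against a hand-labeled gold set.
# The (text, type) tuples below are invented for illustration.
gold = {("Central Michigan University", "ORG"),
        ("Department of Education", "ORG"),
        ("Cotonou", "LOC")}

predicted = {("Central Michigan University", "ORG"),
             ("Cotonou", "LOC"),
             ("University Athletic Cuts", "ORG")}  # a false positive

true_positives = len(gold & predicted)
recall = true_positives / len(gold)          # found / should have been found
precision = true_positives / len(predicted)  # correct / everything returned

# F1 is the harmonic mean of precision and recall: 2PR / (P + R).
f1 = 2 * precision * recall / (precision + recall)
print(f"recall={recall:.2f} precision={precision:.2f} F1={f1:.2f}")
```

Here two of the three gold entities are found and one of the three predictions is wrong, so recall and precision are both 0.67, giving an F1 of 0.67; because F1 is a harmonic mean, a system can't score well by maximizing one measure at the expense of the other.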
Here are some examples of NER use cases: Which organizations were mentioned in a news article? Were specified products mentioned in complaints or reviews? Does a tweet contain the name of a person?

Let's consider two content sources that are very different. The Chronicle of Higher Education is a newspaper and website that presents news, information, and jobs for college and university faculty and student affairs professionals; a subscription is required to read some articles. The Oracle Press Room is a website where you can search and sort press releases from the American multinational computer technology corporation.

Let's paste the text of a Chronicle story, "U.S. Is Investigating Whether University Athletic Cuts Harm Black Students," into PoolParty's Entity Recognition demo. Click on the Extract button, and then click on the Named Entity tab in the extraction results. This displays six organizations. The first is an error, probably identified as an organization because it is in title case and includes the string "University"; in fact, it's a snippet from the title of the article. A full-featured NER would have a method to identify the content structure and interpret titles and headings differently from body text. The remaining organizations look like organizations, even Central Michigan, which in the context of this article is a short form for Central Michigan University.

To supercharge this NER, the next step would be to classify these organizations by type: which are government agencies, which are colleges and universities, which are philanthropies, and so on. Besides classifying organizations by type, other related information such as location, size, and so on would also be interesting to identify and would provide the opportunity to enrich the context of this content. But how do you do this sort of enrichment? The information can be retrieved from Wikidata and other Internet sources. This slide shows some of the information about Central Michigan University that can be extracted from Wikidata.

Let's paste the text of an Oracle press release, "University of Tennessee System Upgrades Finance and HR Tech with Oracle Fusion Cloud Applications," into expert.ai's document analysis demo. Click on the Analyze button, and then click on the entities results in the left rail. This displays the people, organizations, places, and values such as currency amounts, percentages, and measures mentioned in the text, with corresponding links to open data sources such as Wikidata, DBpedia, and GeoNames. Seven organizations are displayed; the first three are errors and the last four are correct, but the University of Tennessee System doesn't have a link to its Wikidata record. The companies identified seem to be more accurate, but this NER has difficulty recognizing products.

To supercharge this NER, the Oracle products could be more accurately recognized and disambiguated by looking them up on the Oracle website. Oracle divides their products into three product lines: Infrastructure, currently called Oracle Cloud Infrastructure; Technology, currently called Hardware and Software; and Applications, currently called Oracle Cloud Applications. Oracle also provides industry solutions, for example for government and education. The specific application that the University of Tennessee System purchased is Enterprise Performance Management.

So what about non-text media? It's true that named entity recognition needs text to work on, but even for non-text media, text can sometimes be generated, for example by voice-to-text transcription or by using captions or surrounding text, as shown in this example from the Washington Post.

There are lots of toolkits and applications that can be used to build named entity recognizer environments. This is a selected list of them. I showed brief demos of expert.ai and PoolParty NER components. These are both semantic platforms that provide a wide range of functionality to build automated classification applications and to build and manage categories and their relationships. NetOwl has been used for NER in the news business for a long time. For those who are developers, or are located within organizations that have application development resources, toolkits are the way to go. Apache has almost everything you need to build semantic applications. Developers wonder why you would ever buy an expensive application when you can build one that has all the functionality you want. I don't know what you like to do; I like to play with toolkits, but I don't want to be responsible for building applications for my clients.

To summarize: in this talk I've explained what NER is, given you some examples of typical use cases, and demonstrated how NER works with a couple of applications. I've also shown you simple ways to enrich named entities with links to related information that provides more context to base content such as news articles and press releases. Please let me know if you have any questions.