Google Books Ngram Viewer

Google Books Ngram Viewer

The Google Books Ngram Viewer is an online search engine that charts the frequencies of any set of search strings using a yearly count of n-grams found in printed sources published between 1500 and 2022 in Google's text corpora in English, Chinese (simplified), French, German, Hebrew, Italian, Russian, or Spanish. There are also some specialized English corpora, such as American English, British English, and English Fiction. The program can search for a word or a phrase. The n-grams are matched with the text within the selected corpus, and if found in 40 or more books, are then displayed as a graph. The program supports searches for parts of speech and wildcards. It is routinely used in research. == History == The Ngram Viewer was created by Google software engineers Will Brockman and Jon Orwant , who teamed up with Harvard researchers Jean-Baptiste Michel and Erez Lieberman Aiden. The service was released on December 16, 2010. Before the release, it was difficult to quantify the rate of linguistic change because of the absence of a database that was designed for this purpose, said Steven Pinker, a well-known linguist who was one of the co-authors of the Science paper published on the same day. The Google Books Ngram Viewer was developed in the hope of opening a new window to quantitative research in the humanities field, and the database contained 500 billion words from 5.2 million books publicly available from the very beginning. The intended audience was scholarly, but the Google Books Ngram Viewer made it possible for anyone with a computer to see a graph that represents the diachronic change of the use of words and phrases with ease. Lieberman said in response to The New York Times that the developers aimed to provide even children with the ability to browse cultural trends throughout history. In the Science paper, Lieberman and his collaborators called the method of high-volume data analysis in digitized texts "culturomics". == Usage == Commas delimit user-entered search terms, where each comma-separated term is searched in the database as an n-gram (for example, "nursery school" is a 2-gram or bigram). The Ngram Viewer then returns a plotted line chart. Due to limitations on the size of the Ngram database, only matches found in at least 40 books are indexed. == Limitations == The data sets of the Ngram Viewer have been criticized for their reliance upon inaccurate optical character recognition (OCR) and for including large numbers of incorrectly dated and categorized texts. Because of these errors, and because they are uncontrolled for bias (such as the increasing amount of scientific literature, which causes other terms to appear to decline in popularity), care must be taken in using the corpora to study language or test theories. Furthermore, the data sets may not reflect general linguistic or cultural change and can only hint at such an effect because they do not involve any metadata like date published, author, length, or genre, to avoid any potential copyright infringements. Systemic errors like the confusion of s and f in pre-19th century texts (due to the use of ſ, the long s, which is similar in appearance to f) can cause systemic bias. Although the Google Books team claims that the results are reliable from 1800 onwards, poor OCR and insufficient data mean that frequencies given for languages such as Chinese may only be accurate from 1970 onward, with earlier parts of the corpus showing no results at all for common terms, and data for some years containing more than 50% noise. Guidelines for doing research with data from Google Ngram have been proposed that try to address some of the issues discussed above.

Data exploration

Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems. These characteristics can include size or amount of data, completeness of the data, correctness of the data, possible relationships amongst data elements or files/tables in the data. Data exploration is typically conducted using a combination of automated and manual activities. Automated activities can include data profiling or data visualization or tabular reports to give the analyst an initial view into the data and an understanding of key characteristics. This is often followed by manual drill-down or filtering of the data to identify anomalies or patterns identified through the automated actions. Data exploration can also require manual scripting and queries into the data (e.g. using languages such as SQL or R) or using spreadsheets or similar tools to view the raw data. All of these activities are aimed at creating a mental model and understanding of the data in the mind of the analyst, and defining basic metadata (statistics, structure, relationships) for the data set that can be used in further analysis. Once this initial understanding of the data is had, the data can be pruned or refined by removing unusable parts of the data (data cleansing), correcting poorly formatted elements and defining relevant relationships across datasets. This process is also known as determining data quality. Data exploration can also refer to the ad hoc querying or visualization of data to identify potential relationships or insights that may be hidden in the data and does not require to formulate assumptions beforehand. Traditionally, this had been a key area of focus for statisticians, with John Tukey being a key evangelist in the field. Today, data exploration is more widespread and is the focus of data analysts and data scientists; the latter being a relatively new role within enterprises and larger organizations. == Interactive Data Exploration == This area of data exploration has become an area of interest in the field of machine learning. This is a relatively new field and is still evolving. As its most basic level, a machine-learning algorithm can be fed a data set and can be used to identify whether a hypothesis is true based on the dataset. Common machine learning algorithms can focus on identifying specific patterns in the data. Many common patterns include regression and classification or clustering, but there are many possible patterns and algorithms that can be applied to data via machine learning. By employing machine learning, it is possible to find patterns or relationships in the data that would be difficult or impossible to find via manual inspection, trial and error or traditional exploration techniques. == Software == Trifacta – a data preparation and analysis platform Paxata – self-service data preparation software Alteryx – data blending and advanced data analytics software Microsoft Power BI - interactive visualization and data analysis tool OpenRefine - a standalone open source desktop application for data clean-up and data transformation Tableau software – interactive data visualization software

Data annotation

Data annotation is the process of labeling or tagging relevant metadata within a dataset to enable machines to interpret the data accurately. The dataset can take various forms, including images, audio files, video footage, or text. == Applications == Data is a fundamental component in the development of artificial intelligence (AI). Training AI models, particularly in computer vision and natural language processing, requires large volumes of annotated data. Proper annotation ensures that machine learning algorithms can recognize patterns and make accurate predictions. Common types of data annotation include classification, bounding boxes, semantic segmentation, and keypoint annotation. Data annotation is used in AI-driven fields, including healthcare, autonomous vehicles, retail, security, and entertainment. By accurately labeling data, machine learning models can perform complex tasks such as object detection, sentiment analysis, and speech recognition with greater precision. This growing demand has led to the emergence of specialized sectors and platforms dedicated to AI training and human-in-the-loop workflows, which often utilize Reinforcement Learning from Human Feedback (RLHF) to refine model behavior. == In computer vision == === Image classification === Image classification, also known as image categorization, involves assigning predefined labels to images. Machine learning algorithms trained on classified images can later recognize objects and differentiate between categories. For instance, an AI model trained to recognize furniture styles can distinguish between Georgian and Rococo armchairs. === Semantic segmentation === Semantic segmentation assigns each pixel in an image to a specific class, such as trees, vehicles, humans, or buildings. This type of annotation enables machine learning models to differentiate objects by grouping similar pixels, allowing for a detailed understanding of an image. === Bounding boxes === Bounding box annotation involves drawing rectangular boxes around objects in an image. This technique is commonly used in autonomous driving, security surveillance, and retail analytics to detect and classify objects such as pedestrians, vehicles, and products on store shelves. === 3D cuboids === 3D cuboid annotation enhances traditional bounding boxes by adding depth, enabling models to predict an object's spatial orientation, movement, and size. This method is particularly useful for autonomous vehicles and robotics, where understanding object dimensions and depth is critical. === Polygonal annotation === For objects with irregular shapes, such as curved or multi-sided items, polygonal annotation provides more precise labeling than bounding boxes. This technique is often used in applications that require detailed object recognition, such as medical imaging or aerial mapping. === Keypoint annotation === Keypoint annotation marks specific points on an object, such as facial landmarks or body joints, to enable tracking and motion analysis. This method is widely used in facial recognition, emotion detection, sports analytics, and augmented reality applications.

Compute (machine learning)

In machine learning and deep learning, compute is the amount of computing power or computational resources required to train machine learning models and large language models. More broadly, compute is the computational power or resources necessary for a computer or computer program to function. == Definition == Compute is commonly defined as the amount of computing power or computational resources required to train machine learning and large language models. The term "compute" has also been more broadly applied to cloud computing, referencing processing power, memory, networking, storage, and other resources required for the computation of any program. Compute is measured in petaflop/s-days and is used to document AI training. A petaflop/s-day (pfs-day) consists of performing 1015 neural net operations per second for one day, or a total of about 1020 operations. The compute-time product serves as a mental convenience, similar to kilowatt-hour for energy. An amount of compute is meant to give an idea of the number of actual operations performed. == History == In a 2018 analysis titled "AI and compute", artificial intelligence company OpenAI introduced the concept of compute. OpenAI identified two eras of training AI systems in terms of compute-usage. From 1959 to 2012, compute roughly followed Moore’s law. Between 2012 and 2018, the amount of compute used in the largest AI training runs increased exponentially, growing by more than 300,000 times — roughly doubling every 3.4 months. By comparison, Moore’s Law doubled every two years over the same period. One of the largest models, released in 2020, used 600,000 times more computing power than the 2012 model. After 2020, compute growth began to slow down, with the compute needed for the largest AI models continuing to slow down in 2023. The notion of compute has become increasingly used from the mid-2020s onwards. == Compute growth and AI progress == Larger AI models trained on more data and using more computational resources, tend to perform better. This happens even if the algorithms themselves remain unchanged. As early as 2018, OpenAI noted the exponential increase in compute to be have a key role in AI progress. OpenAI considers three factors drive the advance of AI: algorithmic innovation, data, and the amount of compute available for training. AI models with more compute not only improve in the tasks they were trained on but can develop emergent abilities. Incremental improvements can lead to more abrupt leaps in capabilities. AI provider SpaceXAI said in 2026 that their AI progress is driven by compute and used it a key metric in the AI training of its supercomputer Colossus, the which contains 1 million GPUs. Anthropic has a contract of $1.25 billion per month with SpaceXAI to buy all the compute capacity at Colossus 1 data center. === Criticism and policy === Increasing, promoting or constraining progress in artificial intelligence has often be done via controlling the amount of compute. Policymarkers have enacted policies and provided support to make compute resources more accessible to domestic AI researchers. In a January 2022 report, the Center for Security and Emerging Technology (CSET) suggested to institutions that increasingly powerful and generalizable AI (AGI) will likely require other strategies than maximizing compute. Some AI researchers are also concerned that government might exclusively focus on scaling compute instead of other strategies. The CSET has reported on the various bottlenecks which could explain why deep learning needs for compute have slow down: training is expensive and training extremely large models generates traffic jams across many processors that are difficult to manage. there is a limited supply of AI chips (see AI chip memory shortage). CSET advances that the main resource is human capital, specifically talented researchers — according to a 2023 published survey of more than 400 AI researchers, academic and private sector workers. The survey found that AI researchers are not primarily or exclusively constrained by compute access. However, both academic and industry AI researchers equally report concerns that insufficient compute could prevent them from contributing meaningfully to AI research in the future. High compute users are more concerned about compute access. When asked about which resource provided by the government would be the most useful to them, some AI researchers select compute, other prefer grant funding. For this goal, CSET advised policymakers to ensure that even researchers with smaller budgets could effectively contribute to AI research. Other proposed strategies include using contemporary AI algorithms, managing modern AI infrastructure or focusing on interdisciplinary work between the AI field and other fields of computer science. A 2024 study on compute access found that academic-only AI research teams often have less compute intensive research topics, especially foundation models, compared to industry AI labs. As a consequence, academia is likely to play a smaller role in advancing such techniques. The researchers suggest nationally-sponsored computing infrastructure as well as open science initiatives to boost academic compute access. === Data === A 2022 study found that current large language models are significantly under-trained, a consequence of focusing on scaling language models whilst keeping the amount of training data constant. By training over 400 language models of various parameter and token size, they found that "for compute-optimal training", the model size and the number of training tokens should ideally be scaled equally: for every doubling of model size the number of training tokens should also be doubled.

Relational data mining

Relational data mining is the data mining technique for relational databases. Unlike traditional data mining algorithms, which look for patterns in a single table (propositional patterns), relational data mining algorithms look for patterns among multiple tables (relational patterns). For most types of propositional patterns, there are corresponding relational patterns. For example, there are relational classification rules (relational classification), relational regression tree, and relational association rules. There are several approaches to relational data mining: Inductive Logic Programming (ILP) Statistical Relational Learning (SRL) Graph Mining Propositionalization Multi-view learning == Algorithms == Multi-Relation Association Rules: Multi-Relation Association Rules (MRAR) is a new class of association rules which in contrast to primitive, simple and even multi-relational association rules (that are usually extracted from multi-relational databases), each rule item consists of one entity but several relations. These relations indicate indirect relationship between the entities. Consider the following MRAR where the first item consists of three relations live in, nearby and humid: “Those who live in a place which is near by a city with humid climate type and also are younger than 20 -> their health condition is good”. Such association rules are extractable from RDBMS data or semantic web data. == Software == Safarii: a Data Mining environment for analysing large relational databases based on a multi-relational data mining engine. Dataconda: a software, free for research and teaching purposes, that helps mining relational databases without the use of SQL. == Datasets == Relational dataset repository: a collection of publicly available relational datasets.

PARRY

PARRY was an early example of a chatbot, implemented in 1972 by psychiatrist Kenneth Colby. == History == PARRY was written in 1972 by psychiatrist Kenneth Colby, then at Stanford University. While ELIZA was a simulation of a Rogerian therapist, PARRY attempted to simulate a person with paranoid schizophrenia. The program implemented a crude model of the behavior of a person with paranoid schizophrenia based on concepts, conceptualizations, and beliefs (judgements about conceptualizations: accept, reject, neutral). It also embodied a conversational strategy, and as such was a much more serious and advanced program than ELIZA. It was described as "ELIZA with attitude". PARRY was tested in the early 1970s using a variation of the Turing Test. A group of experienced psychiatrists analysed a combination of real patients and computers running PARRY through teleprinters. Another group of 33 psychiatrists were shown transcripts of the conversations. The two groups were then asked to identify which of the "patients" were human and which were computer programs. The psychiatrists were able to make the correct identification only 48 percent of the time — a figure consistent with random guessing. PARRY and ELIZA (also known as "the Doctor") interacted several times. The most famous of these exchanges occurred at the ICCC 1972, where PARRY and ELIZA were hooked up over ARPANET and responded to each other.

Business process automation

Business process automation (BPA), also known as business automation, refers to the technology-enabled automation of business processes. == Development approaches == There are three main approaches to developing BPA: traditional business process automation involves developing BPA software in a programming language for integrating relevant applications in the digital ecosystem to execute a given process; robotic process automation uses software robots (also called agents, bots, or workers) to emulate human-computer interaction for executing a combination of processes, activities, transactions, and tasks in one or more unrelated software systems; hyperautomation (also called intelligent automation (IA), intelligent process automation (IPA), integrated automation platform (IAP), and cognitive automation (CA) combines business process automation, artificial intelligence (AI), and machine learning (ML) to discover, validate, and execute organizational processes automatically with no or minimal human intervention. == Deployment == BPA toolsets vary in capability. With the increasing adoption of artificial intelligence (AI), organizations are implementing AI-driven technologies that can process natural language, interpret unstructured datasets, and interact with users. These systems are designed to adapt to new types of problems with reduced reliance on human intervention. == Business process management implementation == A business process management system differs from BPA. However, it is possible to implement automation based on a BPM implementation. The methods to achieve this vary, from writing custom application code to using specialist BPA tools. == Robotic process automation == Robotic process automation (RPA) involves the deployment of attended or unattended software agents in an organization's environment. These software agents, or robots, are programmed to perform predefined structured and repetitive sets of business tasks or processes. Robotic process automation is designed to streamline workflows by delegating repetitive tasks to software agents, allowing human workers to focus on more complex and strategic activities. BPA providers typically focus on different industry sectors, but the underlying approach is generally similar in that they aim to provide the shortest route to automation by interacting with the user interface rather than modifying the application code or database behind it. == Use of artificial intelligence == Artificial intelligence software robots are used to handle unstructured data sets (like images, texts, audios) and are often deployed after implementing robotic process automation. They can, for instance, generate an automatic transcript from a video. The combination of automation and artificial intelligence (AI) enables autonomy for robots, along with the capability to perform cognitive tasks. At this stage, robots can learn and improve processes by analyzing and adapting them.