Monday, 11 June 2018


Google Books Ngram Viewer Knows All, Tells All | TIME.com
src: timenerdworld.files.wordpress.com

The Google Ngram Viewer or Google Books Ngram Viewer is an online search engine that charts the frequencies of any set of comma-delimited search strings, using a yearly count of n-grams found in sources printed between 1500 and 2008 in Google's text corpora in English, Chinese (simplified), French, German, Hebrew, Italian, Russian, or Spanish. There are also some specialized English corpora, such as American English, British English, English Fiction, and English One Million; the 2009 versions of most corpora are also available.

The program can search for a single word or a phrase, including misspellings or gibberish. The n-grams are matched against the text of the selected corpus, optionally using case-sensitive spelling, and, if found in 40 or more books, are then plotted on a graph.

As of January 2016, the Google Ngram Viewer supports searches for parts of speech and wildcards.





History

The program was developed by Jon Orwant and Will Brockman and was released in mid-December 2010. It was inspired by a prototype (called "Bookworm") created by Jean-Baptiste Michel and Erez Aiden of the Harvard Cultural Observatory and Yuan Shen of MIT and Steven Pinker.

The Ngram Viewer was initially based on the 2009 edition of the Google Books Ngram Corpus. As of January 2016, the program can search an individual language's corpus within the 2009 or the 2012 edition.




Operations and restrictions

Commas delimit the user-entered search terms, indicating each separate word or phrase to find. The Ngram Viewer returns a plotted line chart within seconds of the user pressing the Enter key or the "Search" button on the screen.
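The comma-delimited input described above can be illustrated with a minimal sketch. The function name `split_query` is hypothetical, and this is not the Ngram Viewer's actual parser; it only shows the idea of one search term per comma-separated piece.

```python
from typing import List


def split_query(query: str) -> List[str]:
    """Split a comma-delimited query into individual search terms."""
    # Trim whitespace around each piece and drop empty pieces
    # left behind by stray or trailing commas.
    return [term.strip() for term in query.split(",") if term.strip()]


# Each resulting term would be looked up and plotted as its own line.
terms = split_query("Albert Einstein, Sherlock Holmes,Frankenstein")
print(terms)
```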

To adjust for more books having been published in some years than in others, the data is normalized, as a relative level, by the number of books published in each year.
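A rough sketch of this kind of normalization follows. The function name `relative_frequency` and all numbers are invented for illustration; which yearly total serves as the denominator is left to the caller, and the sketch is not a description of Google's exact computation.

```python
from typing import Dict


def relative_frequency(
    match_counts: Dict[int, int], yearly_totals: Dict[int, int]
) -> Dict[int, float]:
    """Divide each year's raw count by that year's total.

    Years with a missing or zero total are skipped to avoid
    dividing by zero.
    """
    return {
        year: count / yearly_totals[year]
        for year, count in match_counts.items()
        if yearly_totals.get(year, 0) > 0
    }


# Invented numbers: the raw count grows tenfold, but the yearly total
# grows a hundredfold, so the relative level actually falls.
raw = {1900: 5, 1950: 50}
totals = {1900: 1_000, 1950: 100_000}
print(relative_frequency(raw, totals))
```

Without this step, a term would appear to rise simply because more text was printed in later years.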

Google populated the database from more than 5 million books published up to 2008. Accordingly, as of January 2016, no data matches beyond the year 2008, regardless of whether the corpus was generated in 2009 or 2012. Due to limitations on the size of the Ngram database, only n-grams found in at least 40 books are indexed in the database; otherwise, the database could not have stored all possible combinations.
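The 40-book threshold can be pictured as a simple filter. In this sketch the function name `indexable`, the input mapping, and the sample counts are all hypothetical; only the threshold value of 40 comes from the text.

```python
from typing import Dict

MIN_BOOKS = 40  # threshold stated in the text


def indexable(book_counts: Dict[str, int], min_books: int = MIN_BOOKS) -> Dict[str, int]:
    """Keep only n-grams that occur in at least `min_books` books."""
    return {
        ngram: books
        for ngram, books in book_counts.items()
        if books >= min_books
    }


# Invented counts of how many distinct books contain each n-gram.
counts = {"the cat": 5000, "zxqv blorp": 2, "rare phrase": 40}
print(indexable(counts))
```

Dropping rare n-grams this way keeps the stored table far smaller than the set of all word combinations that occur even once.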

Typically, a search term cannot end with punctuation, although a separate full stop (a period) can be searched. Also, an ending question mark (as in "Why?") will cause a second search for the question mark separately.

Omitting the periods in abbreviations allows a form of matching, such as using "R M S" to search for "R.M.S." versus "RMS".



Corpora

The corpora used for the search are composed of total_counts, 1-gram, 2-gram, 3-gram, 4-gram, and 5-gram files for each language. Each file contains tab-separated data. Each line has the following format:

  • total_counts file
    year TAB match_count TAB page_count TAB volume_count NEWLINE
  • Ngram file version 1 (generated in July 2009)
    ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE
  • Ngram file version 2 (generated in July 2012)
    ngram TAB year TAB match_count TAB volume_count NEWLINE
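The version 2 line layout listed above can be read with a few lines of code. This is a minimal sketch: the names `NgramRecord` and `parse_v2_line` are hypothetical, and the sample line uses invented values, not real corpus data.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    ngram: str
    year: int
    match_count: int
    volume_count: int


def parse_v2_line(line: str) -> NgramRecord:
    """Parse one line: ngram TAB year TAB match_count TAB volume_count NEWLINE."""
    # Remove only the trailing newline, then split on tabs. The n-gram
    # itself may contain spaces (for 2-grams and longer) but not tabs.
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return NgramRecord(ngram, int(year), int(match_count), int(volume_count))


# Invented sample line for illustration only.
record = parse_v2_line("example phrase\t1999\t12\t7\n")
print(record.ngram, record.year, record.match_count, record.volume_count)
```

A version 1 line would need one more field (page_count) between match_count and volume_count.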

The Google Ngram Viewer uses match_count to plot the graph.
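As a hedged illustration of how the match_count column could feed a chart, the sketch below collects one (year, match_count) point per line for a chosen n-gram from version 2 formatted lines. The function name `match_series` and the sample lines are invented; the real viewer also normalizes these counts before plotting.

```python
from typing import Dict, Iterable


def match_series(lines: Iterable[str], target: str) -> Dict[int, int]:
    """Collect year -> match_count for one n-gram from version 2 lines."""
    series: Dict[int, int] = {}
    for line in lines:
        ngram, year, match_count, _volume_count = line.rstrip("\n").split("\t")
        if ngram == target:
            series[int(year)] = int(match_count)
    # Sort by year so the points are in plotting order.
    return dict(sorted(series.items()))


# Invented sample lines for illustration only.
sample = [
    "apple\t2001\t30\t12\n",
    "banana\t2000\t4\t3\n",
    "apple\t2000\t25\t10\n",
]
print(match_series(sample, "apple"))
```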

For example, the word "Wikipedia" from the version 2 file of the English 1-grams is stored as follows:

The graph plotted by the Google Ngram Viewer using the above data is here.



Criticism

The data set has been criticized for its reliance on inaccurate OCR, an overabundance of scientific literature, and the inclusion of large numbers of incorrectly dated and categorized texts. Because of these errors, and because the corpus is not controlled for bias (such as the growing share of scientific literature, which causes other terms to appear to decline in popularity), it is risky to use this corpus to study language or test theories. Because the data set does not include metadata, it may not reflect general linguistic or cultural change and can only hint at such effects.

Another issue is that the corpus is in effect a library, containing one copy of each book. A single prolific author can thereby noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not.

OCR issues

Optical character recognition, or OCR, is not always reliable, and some characters may not be scanned correctly. In particular, systemic errors such as the confusion of "s" and "f" in pre-19th-century texts (due to the use of the long s, which is similar in appearance to "f") can cause systemic bias. Although the Google Ngram Viewer claims that the results are reliable from 1800 onwards, poor OCR and insufficient data mean that the frequencies given for languages such as Chinese may only be accurate from 1970 onwards, with earlier parts of the corpus showing no results at all for common terms, and data for some years containing more than 50% noise.



See also

  • Lexical analysis
  • Culturomics
  • Google Trends







Bibliography

  • Lin, Yuri; et al. (July 2012). "Syntactic Annotations for the Google Books Ngram Corpus" (PDF). Proceedings of the 50th Annual Meeting. Demo Papers. Jeju, Republic of Korea: Association for Computational Linguistics. 2: 169-174. 2390499. White paper presenting the 2012 edition of the Google Books Ngram Corpus.



External links

  • Official website

Source of the article : Wikipedia
