Epidemic-related sites in COVID-19 media reports

Bin Zhao*, Kuiyun Huang and Jinming Cao. School of Science, Hubei University of Technology, Wuhan, Hubei, China. School of Information and Mathematics, Yangtze University, Jingzhou, Hubei, China. *Corresponding author: Bin Zhao, School of Science, Hubei University of Technology, Wuhan, Hubei, China Tel. /Fax: +86 13


Introduction
In early 2020, COVID-19 broke out in Wuhan, Hubei, China, and within just a few months it became a global epidemic. During these months, China has turned the crisis around and basically brought the epidemic under control, while other countries are still in the early stages of their outbreaks; in some countries, the current situation is even worse than it was in China at its worst. One of the main reasons why China was able to overcome the epidemic is the openness and transparency of information from the central government down to local governments, with relevant information released on official websites on time every day.
From an epidemiological point of view, publishing specific location information, such as the residential area or place of stay of confirmed COVID-19 cases, not only helps individuals in their daily lives but also helps the government trace transmission channels in order to prevent and control the spread of the epidemic. In the webpages of these epidemic notifications and media reports, the usual residence or place of stay of confirmed cases appears in various forms, such as the body text of the page, embedded text, and screenshots. In the past, analyzing the distribution of epidemic-related locations from such sources relied on manual search, extraction, and classification of specific information in the text, but this approach requires a large amount of work, is inefficient, and lacks timeliness.
In recent years, as named entity recognition technology has matured, this work has gradually shifted from manual to automatic extraction, which not only saves human and financial resources but also speeds up processing. Named entity recognition refers to identifying named entities in text and assigning them to the corresponding entity types [1]. Common entity types include person names, place names, dates, and organization names. Because Chinese text is written as a continuous sequence of characters with no spaces between words, identifying named entities is more difficult. To improve recognition accuracy, named entity recognition has moved from traditional statistical learning methods [3-5] to the current deep learning methods [6], and for some common entity types it has achieved fairly high accuracy.

Data
Since the beginning of the COVID-19 outbreak, news reporting the activities of newly diagnosed patients has appeared on the Internet every day, providing raw unstructured data from which the locations visited by confirmed cases can be collected. We collected official notifications and media reports on the itineraries of confirmed patients associated with Guangdong Province from January 29, 2020 to February 19, 2020. The data come from 366 webpages published by 152 media outlets. The raw data were obtained with a web crawler and include the main content of each webpage, the media outlet that released the news, and the corresponding URL. The basic principle of a web crawler is to simulate a browser making HTTP requests: the crawler client sends an HTTP request to the web server, downloads the web page once the server responds, and thereby completes the crawling task [7]. Part of the data is shown in Table 1.
Before performing named entity recognition, the text data must be preprocessed. We first check whether the data contain missing values; none were found. We then convert the data into a format that the model can easily handle, for example by deleting unnecessary character strings, splitting each press release into sentences, and removing duplicate sentences. The part-of-speech feature is used in Peking University's14.
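The preprocessing steps named in this section (deleting unnecessary strings, splitting into sentences, removing duplicate sentences) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the noise patterns (URLs and whitespace), the set of sentence-ending punctuation marks, and the function name `preprocess` are all assumptions.

```python
import re

# Assumed noise pattern: ASCII-only URLs, so adjacent Chinese text is not swallowed.
_URL = re.compile(r"https?://[A-Za-z0-9./?=&_%#:~+-]+")
# Whitespace carries no meaning between Chinese characters, so it is removed outright.
_SPACE = re.compile(r"\s+")
# A sentence is a run of non-terminators followed by an optional terminator
# (assumed terminator set: Chinese and Western full stop equivalents).
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]?")


def preprocess(articles):
    """Clean each article, split it into sentences, and keep each sentence once."""
    seen = set()
    sentences = []
    for text in articles:
        text = _URL.sub("", text)
        text = _SPACE.sub("", text)
        for sentence in _SENTENCE.findall(text):
            if sentence not in seen:  # drop sentences repeated across reports
                seen.add(sentence)
                sentences.append(sentence)
    return sentences
```

Sentences keep their order of first appearance, so the output can still be traced back to the report that first published it.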

The model
Based on the collected unstructured text data, we identify and extract the place words in the text through named entity recognition, and then classify the recognized place words according to a set of rules into four major categories: provinces, cities, administrative regions, and specific locations. The final analysis of the extracted location information will provide key evidence for the subsequent construction of an epidemic development model and for the assessment and prediction of the source of infection, transmission speed, transmission path, and transmission risk. There are three main approaches to Chinese place name recognition: rule-based methods, statistics-based methods, and deep learning-based methods. Rule-based methods are intuitive and natural, and are easy for humans to understand and extend. However, writing rules depends on specific linguistic and domain knowledge; the rules tend to be complicated, it is difficult to cover all patterns, and portability is poor [8, 9]. Statistics-based methods do not require extensive linguistic or domain knowledge and are highly portable, but they require manual annotation of the corpus and the selection of appropriate statistical learning models and parameters [10, 11, 12]. Deep learning-based methods do not require complex feature engineering and can automatically discover information from the input to form an end-to-end model [13]. Because the collected text data are limited by time and volume, this paper mainly relies on the first two methods.
Recently, named entity recognition has achieved good results on some limited entity types. For example, recognition of the names of people, places, and organizations in news corpora is remarkably effective. Chinese place name recognition can be regarded as a sequence labeling problem: a place name is a combination of several words arranged in a certain order, and place name entity recognition amounts to marking the correct combinations within the word sequence. The conditional random field model combines the advantages of the maximum entropy model and the hidden Markov model, and can be used for labeling and segmenting sequence data [5]. The conditional random field model is therefore an effective solution to the sequence labeling problem [14], and we chose it as the recognition model for epidemic locations.

Maximum likelihood method for estimation
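Suppose the training data form a sample set $D = \{(x_j, y_j)\},\ j = 1, 2, \ldots, N$, where the samples are independent of each other, and $\tilde{p}(x, y)$ is the empirical probability of $(x, y)$ in the training samples. For a conditional model $p(y \mid x, \Theta)$, the likelihood function of the training sample set can be expressed as:
$$L(\Theta) = \prod_{x, y} p(y \mid x, \Theta)^{\tilde{p}(x, y)}$$
Taking the logarithm:
$$\log L(\Theta) = \sum_{x, y} \tilde{p}(x, y) \log p(y \mid x, \Theta)$$
Since we are studying a conditional random field model, its conditional probability can be expressed as:
$$p(y \mid x, \lambda, \mu) = \frac{1}{Z(x)} \exp\left( \sum_{i} \lambda \cdot f(y_{i-1}, y_i, x) + \sum_{i} \mu \cdot g(y_i, x) \right)$$
where $\lambda$ and $\mu$ are the parameters to be estimated, $f$ and $g$ are the vectors of feature functions, and $Z(x)$ is the normalization factor. Differentiating the log-likelihood with respect to the parameter $\lambda_k$ gives:
$$\frac{\partial \log L}{\partial \lambda_k} = \sum_{x, y} \tilde{p}(x, y) \sum_{i} f_k(y_{i-1}, y_i, x) - \sum_{x} \tilde{p}(x) \sum_{y} p(y \mid x, \lambda, \mu) \sum_{i} f_k(y_{i-1}, y_i, x)$$
where $\tilde{p}(x)$ is the empirical probability of $x$. Setting this expression to 0 yields the constraint that the expected value of each feature under the model equals its empirical expected value, from which $\lambda_k$ is found. In the same way, $\mu_k$ can be obtained, so that $\Theta = (\lambda, \mu)$ is determined. However, this approach does not always yield a closed-form solution, in which case the parameters have to be found by iterative numerical optimization.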
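The statement that the gradient of the conditional log-likelihood is the empirical feature count minus the expected feature count under the model can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: it uses a two-label set, indicator features for label transitions (weights `lam`) and for label-observation pairs (weights `mu`), and brute-force enumeration of all label sequences, which is feasible only for very short sequences. None of the names come from the original work.

```python
import itertools
import math

LABELS = ["B", "N"]  # toy label set (assumption)


def score(x, y, lam, mu):
    """Unnormalized log-score: transition weights plus label-observation weights."""
    total = 0.0
    for i in range(len(x)):
        if i > 0:
            total += lam.get((y[i - 1], y[i]), 0.0)
        total += mu.get((y[i], x[i]), 0.0)
    return total


def conditional(x, lam, mu):
    """p(y | x) for every label sequence y, by brute-force enumeration."""
    candidates = list(itertools.product(LABELS, repeat=len(x)))
    weights = [math.exp(score(x, y, lam, mu)) for y in candidates]
    z = sum(weights)  # normalization factor Z(x)
    return {y: w / z for y, w in zip(candidates, weights)}


def log_likelihood(data, lam, mu):
    """Log-likelihood with every sample weighted 1/N (the empirical distribution)."""
    return sum(math.log(conditional(x, lam, mu)[tuple(y)]) for x, y in data) / len(data)


def transition_count(y, key):
    """How often the transition feature `key` = (previous label, label) fires in y."""
    return sum(1 for i in range(1, len(y)) if (y[i - 1], y[i]) == key)


def gradient_lambda(data, lam, mu, key):
    """Empirical count minus model-expected count of one transition feature."""
    grad = 0.0
    for x, y in data:
        probs = conditional(x, lam, mu)
        expected = sum(p * transition_count(cand, key) for cand, p in probs.items())
        grad += transition_count(tuple(y), key) - expected
    return grad / len(data)
```

Comparing `gradient_lambda` with a finite-difference estimate of `log_likelihood` is a quick way to confirm the formula; practical CRF implementations replace the enumeration with dynamic programming.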

Feature selection for Chinese location recognition
The training set used in this article is the annotated People's Daily corpus of January 1998, which uses the Peking University part-of-speech annotation set and in which all names of people, places, and organizations have been marked. The corpus consists of coarsely segmented sentences; to make the features for entity extraction more distinct, these coarsely segmented sentences need to be segmented more finely. The labeling template uses five tags that express the position of a character within a word: B (the first character of a named entity), M (a middle character of a named entity), E (the last character of a named entity), S (a single character that forms a named entity by itself), and N (not part of a named entity). Combining the three entity types with these position tags yields the full label set. The template effectively marks the position of a character within a word, so that the system can use this position feature to identify word boundaries. Table 2 shows the labels.
The conditional random field depends heavily on feature selection, which strongly influences the final recognition accuracy. In theory, the more context is collected around the current character, that is, the larger the observation window, the richer the information and the more accurate the judgment about the current character. But if the observation window is too large, the amount of information to compute reduces the efficiency of the model. If the window is too small, the relevant dependency information cannot be fully used, which hurts recognition accuracy. Choosing the right window size is therefore a prerequisite for choosing the right feature template. The window size selected in this article is 2. The basic features are those most closely tied to the character itself, including the current character, the position of the character within its word, and the part of speech.
Part-of-speech features can often improve discrimination. For example, named entities are usually nouns, while verbs are rarely used as named entities. For a character-based entity tagging system, the part-of-speech feature of a character is the part of speech of the word that contains it. On this basis, we establish the basic feature template shown in Table 3, where t denotes the position at which features are currently being extracted, y the label, word the word, and tag the part of speech. Because words can combine in far more ways than parts of speech, fewer bigram features are used for words than for parts of speech.
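The position tags and the window of size 2 can be illustrated with a short sketch. It is not the template of Table 3: the label spelling (`B-LOC`, `N`), the feature names (`word[-1]`, `tag[+0]` style keys), the padding symbol, and both function names are assumptions made for illustration, and only unigram features are generated.

```python
def tag_characters(words):
    """Turn (word, entity_type) pairs into per-character (char, label) pairs.

    entity_type is None for text outside any entity (label N). Otherwise the
    label joins a position tag B/M/E/S with the type, e.g. "B-LOC".
    """
    tagged = []
    for word, entity_type in words:
        for i, char in enumerate(word):
            if entity_type is None:
                label = "N"
            elif len(word) == 1:
                label = "S-" + entity_type
            elif i == 0:
                label = "B-" + entity_type
            elif i == len(word) - 1:
                label = "E-" + entity_type
            else:
                label = "M-" + entity_type
            tagged.append((char, label))
    return tagged


def window_features(chars, pos_tags, t, window=2):
    """Unigram character and part-of-speech features at offsets -window..+window."""
    features = {}
    for offset in range(-window, window + 1):
        j = t + offset
        inside = 0 <= j < len(chars)
        # Positions outside the sentence are filled with a padding symbol.
        features[f"word[{offset}]"] = chars[j] if inside else "<PAD>"
        features[f"tag[{offset}]"] = pos_tags[j] if inside else "<PAD>"
    return features
```

Each character inherits the part of speech of the word that contains it, so `pos_tags` has one entry per character. A larger `window` adds two more character features and two more tag features per step, which is the cost the text weighs against accuracy.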

Entity relationships
Seven types of relationships between entities are distinguished: part-whole, geographical location, generic, metaphorical, manufacture-use, organizational affiliation, and personal relationships. This work studies the relationships between locations in outbreak press releases, so it focuses on identifying geographical location relationships. According to the characteristics of the text itself, the geographical location relationships that may be involved are divided into four categories: provinces, cities, administrative regions, and geographical locations. The head word of a relationship can be a verb as well as a noun. The system needs to classify the identified entity relationships; the specific division is shown in Table 5.

Entity relationship extraction method
Methods for recognizing semantic relationships are constantly evolving and fall into two groups: methods based on rule matching and methods based on machine learning. A rule-template method defines rule templates in advance and compares each sentence with them during relationship identification; if the sentence matches the features of a template, the entities in the sentence are taken to have the relationship specified by that template. Its disadvantage is that it requires professional linguists to write a large number of feature templates, which takes a long time and ports poorly to other domains. A machine learning method uses pattern recognition models to compute the entity relationship features in sentences and their weights. Two types of machine learning methods are currently popular for entity relationships: kernel-based methods and feature vector-based methods. The purpose of our research is location extraction, for which the feature templates of geographic location relationships are relatively fixed and portable. Therefore, we use the rule-based matching method to extract relationships among the recognized place words. The work involves three aspects: corpus preprocessing, gazetteer construction, and rule making.

Corpus preprocessing
Corpus preprocessing consists mainly of word segmentation and entity recognition, which transform the sentences in the corpus into a stream of words with entity labels. Since a relationship holds between two entities, sentences containing fewer than two place name entities are filtered out, and sentences containing two or more place name entities are used as the recognition corpus.
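The filtering step can be sketched as follows, assuming the recognizer outputs one label per character in a B/M/E/S-plus-type scheme such as `B-LOC`, with `N` for everything else. The label spelling and the function names `extract_places` and `relation_candidates` are hypothetical.

```python
def extract_places(chars, labels, entity_type="LOC"):
    """Join labeled characters into place name strings."""
    places = []
    current = ""
    for char, label in zip(chars, labels):
        position, _, kind = label.partition("-")
        if kind != entity_type:
            current = ""  # any other label ends an unfinished entity
        elif position == "S":
            places.append(char)
            current = ""
        elif position == "B":
            current = char
        elif position == "M" and current:
            current += char
        elif position == "E" and current:
            places.append(current + char)
            current = ""
        else:
            current = ""  # malformed sequence, e.g. M or E without a preceding B
    return places


def relation_candidates(labeled_sentences, min_places=2):
    """Keep only sentences that contain at least `min_places` place names."""
    kept = []
    for chars, labels in labeled_sentences:
        places = extract_places(chars, labels)
        if len(places) >= min_places:
            kept.append(("".join(chars), places))
    return kept
```

Sentences with a single place name are dropped here because no pair can be formed within them.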

Gazetteer
Some sentences do not contain a complete place name; for example, they may give only the administrative region, with no province or city. To handle such cases, we collect the names of all provinces, cities, and districts in China and build a gazetteer from them. Part of the data is shown in Figure 1.
Since the object of our research is Guangzhou, China, we do not consider the COVID-19 epidemic situation abroad. We therefore also collect the names of major foreign countries and places from the Internet and build a second gazetteer from them. Part of the data is shown in Figure 2.
Rule 14: Delete words that meet the conditions of Rule 1 to Rule 9 but are not place names. For example: "Epidemic Area", "Common Youth City", "Community", "Outer Province", "Inner Province", "You Province", "Reprinted City", "Ministry of Labor", etc.
Rule 15: Delete words that were incorrectly labeled during named entity recognition. For example: "Sputum", "when getting on the train", "from the day", "and", "more", etc.
Rule 16: Delete words that are place names but are not related to this study. For example: "People's Republic of China", "China", "People's Hospital", "Chest Hospital", etc.
Rule 17: Use the foreign gazetteer to delete foreign place names.
Rule 17: If "administrative region" != "" and "City" == "", find the corresponding place name in the Chinese gazetteer and fill it in "City".
Rule 18: If "administrative region" != "" and "Province" == "", find the corresponding place name in the gazetteer and fill it in "Province".
Rule 19: If "City" != "" and "Province" == "", find the corresponding place name in the gazetteer and fill it in "Province".
Rule 20: According to the principle of proximity, re-sort "administrative region" and "geographical location", and concatenate the sorted words.
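The gazetteer rules above (deleting foreign place names and filling empty "City" and "Province" fields) amount to dictionary lookups. The sketch below uses a tiny hand-written gazetteer and hypothetical names (`DISTRICT_TO_CITY`, `CITY_TO_PROVINCE`, `FOREIGN_PLACES`, `complete_record`, `drop_foreign`); the real gazetteers cover all provinces, cities, and districts. It also assumes each district name maps to a single city, which the ambiguity cases discussed in the results show is not always true.

```python
# Toy gazetteers (assumptions for illustration only).
DISTRICT_TO_CITY = {"Yuexiu District": "Guangzhou", "Futian District": "Shenzhen"}
CITY_TO_PROVINCE = {"Guangzhou": "Guangdong Province", "Shenzhen": "Guangdong Province"}
FOREIGN_PLACES = {"Japan", "Thailand"}


def drop_foreign(places):
    """Delete place names that appear in the foreign gazetteer."""
    return [place for place in places if place not in FOREIGN_PLACES]


def complete_record(record):
    """Fill empty "City" and "Province" fields from the Chinese gazetteer.

    The province of a record that only has a district is found through the
    city looked up for that district.
    """
    filled = dict(record)  # leave the caller's record untouched
    district = filled.get("administrative region", "")
    if district and not filled.get("City"):
        filled["City"] = DISTRICT_TO_CITY.get(district, "")
    if filled.get("City") and not filled.get("Province"):
        filled["Province"] = CITY_TO_PROVINCE.get(filled["City"], "")
    return filled
```

A record whose district is missing from the gazetteer is returned unchanged, so unresolved rows stay visible for manual review.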

Model evaluation criteria
The model is evaluated with the F1-score. For each type of named entity and for relationship extraction, three indicators are defined: precision $P$, recall $R$, and the F1-score $F_1$:
$$P = \frac{\text{number of correctly recognized items}}{\text{number of recognized items}}, \qquad R = \frac{\text{number of correctly recognized items}}{\text{number of items in the test set}}, \qquad F_1 = \frac{2PR}{P + R}$$

Overall framework for automatic extraction of Chinese place names
The goal of this article is to process the text of web pages all the way through to the display of entities and relationships. The implementation is divided into two modules: a web page processing module and an entity and relationship recognition module. The framework for the automatic extraction of Chinese locations is shown in Figure 3.

Place name entity recognition result
This article uses the annotated People's Daily corpus of January 1998, of which 80% is used as the training set and the remaining 20% as the closed test set; the COVID-19 outbreak news releases crawled from the Internet are used as the open test set. The results of entity recognition are shown in Table 6. Recognition errors have the following causes:
a) Cities and provinces are abbreviated in the text, so place names may be recognized in ambiguous forms. For example, "Zhongshan" can be either a city in Guangdong Province or an administrative district of Dalian, Liaoning Province.
b) Some place names appear in multiple cities. For example, "Baojian Road" is a road name in many cities; when several cities appear in one sentence, it is not easy to determine which city the road belongs to.
c) The same place word can have different meanings in different places. For example, "Bajiao Tower" can be the name of a building or of a town.
d) Wrong entity labels lead to wrong words being recognized as locations. For example, "Sputum", "when getting on the train", "from the day", "and", and "more" are not place names, but the algorithm classifies them as place names.

Place entity extraction results
Errors in relationship extraction have the following causes:
a) Some place names omit the entity word or are replaced by pronouns. For example, many provinces and cities are written without the entity words "Province" or "City".
b) Two identical entities in the text can have different relationship classifications, and their priority cannot be determined. For example, "Jilin" is both a province name and a city name.
c) Multiple locations are involved in one sentence. When a sentence contains several place words, the relationship is hard to judge, and the subordinate relationship can only be inferred from the position of the words in the sentence.

Final results
As some provinces in the form are not filled in, the locations of …
It can be inferred from Table 7 that the cities of the COVID-19 epidemic area in Guangdong Province are mainly Guangzhou, Shenzhen, Zhuhai and Shantou. Within these cities, the epidemic areas in Guangzhou are concentrated in Yuexiu, Tianhe and Qiewan, those in Shenzhen in Futian, Luohu and Longgang, those in Zhuhai in Xiangzhou, and those in Shantou in Chaoyang. Guangzhou has the widest epidemic area, while the epidemic area of Heyuan is relatively small.

Discussion
There is no doubt that many aspects reflect the severity of COVID-19 in an area. With respect to the spread of the epidemic, the number of locations where cases have occurred is one of them. However, COVID-19 cases tend to be concentrated in particular places, and infection occurs mainly at close range. The spread of the epidemic should therefore be examined from the early stage of the outbreak. Because of the long incubation period of COVID-19 pneumonia, the infection may already have spread before a case shows symptoms. Even so, this method can be used to trace the source of the disease and thereby target the areas that need attention.

Limitations
In this paper, the conditional random field model is used for entity recognition of epidemic locations, and a rule-based model is used to extract entity relationships from the itineraries of COVID-19 pneumonia patients. The following aspects still need to be improved:
a) The conditional random field model can exploit long-distance dependencies in the text to improve recognition accuracy, but this also increases the cost of the model and lowers recognition efficiency. The model should be improved in the future so that efficiency increases while accuracy is maintained.
b) In entity recognition and relationship extraction, entity pairs are identified sentence by sentence, so when two related entities appear in different sentences, or when a sentence uses a complex pronoun, relationship extraction makes mistakes. Future work should devote more research to pronoun entities.

Conflict of interest
We have no conflicts of interest to disclose, and the manuscript has been read and approved by all named authors.