Measuring Similarity among Legal Court Case Documents

It is important to note that our heat maps were previously ordered according to the original data frame. To order a heat map by similarity instead, we must enable hierarchical clustering, which is built into the heatmap.2 function.

Finally, the vector representation calculated above for each concept present in a judgment is used to calculate the similarity between judgment documents. A single similarity value cannot always capture how similar two documents are, because two documents can be similar to different degrees when viewed from different perspectives. For example, two legal documents may resemble each other in the history of the case but differ in how the cases were litigated. Conversely, two other legal documents may have no facts in common, yet both may overturn the judgment of a lower court; in that respect, the two cases can be regarded as similar. When similarity is used to assess the proximity of two documents, the context of the search may be unknown. In such cases, estimating similarity from several perspectives and visualizing the results may be more useful to the user than a single similarity score.

Bhattacharya P, Paul S, Ghosh K, Ghosh S, Wyner A (2019) Identification of the rhetorical roles of sentences in Indian legal judgments. In: Proceedings of the International Conference on Legal Knowledge and Information Systems (JURIX).

Kumar S, Reddy PK, Reddy VB, Suri M (2013) Finding similar legal judgments under the common law system. Springer, Berlin, pp. 103–116.

Identifying related concept words in the judgment document is an important step; relatedness is judged by the proximity of the base words. A concept graph $G_i = (V_i, E_i)$ of a legal document $L_i$ is created from the base words such that $V_i = \bigcup_{j \in 1,\dots,n} L_iBL_{ij}$ and $E_i = \{(x, y) \mid \text{co-occurrence}(x, y) \geq 3\}$, where $L_iBL_{ij}$ is the set of base words of sentence $j$ in $L_i$ and $n$ is the number of sentences. The vertex set $V_i$ is thus the set of all base words in all sentences of the document, and two concept-word nodes are joined by an edge if their co-occurrence count is at least 3, that is, if they appear together in at least three sentences. We use the co-occurrence count as the strength of the association between two concept words. Fewer than 3 co-occurrences may be pure coincidence, so we do not consider such associations strong enough to add an edge to the graph. Figure 1 shows a concept graph created from a document fragment.
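The construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `build_concept_graph`, the `min_cooccurrence` parameter, and the toy sentences are assumptions made for the example, and the threshold follows the "at least three sentences" reading of the edge rule.

```python
from collections import Counter
from itertools import combinations


def build_concept_graph(sentences, min_cooccurrence=3):
    """Build a concept graph from per-sentence lists of base words.

    sentences: list of iterables, one iterable of base words per sentence
               (hypothetical input format, assumed for this sketch).
    Returns (vertices, edges), where edges maps a sorted word pair to the
    number of sentences in which both words occur.
    """
    vertices = set()
    pair_counts = Counter()
    for words in sentences:
        # Count each word at most once per sentence, in a stable order
        unique = sorted(set(words))
        vertices.update(unique)
        pair_counts.update(combinations(unique, 2))
    # Keep only associations seen in at least `min_cooccurrence` sentences
    edges = {
        pair: count
        for pair, count in pair_counts.items()
        if count >= min_cooccurrence
    }
    return vertices, edges


# Toy document: four sentences already reduced to base words
sentences = [
    ["court", "appeal", "judgment"],
    ["court", "appeal"],
    ["appeal", "court", "evidence"],
    ["judgment", "evidence"],
]
vertices, edges = build_concept_graph(sentences)
```

Here only "appeal" and "court" appear together in three sentences, so they form the single edge; every other pair co-occurs once and is treated as coincidence. The edge value doubles as the association strength.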

Finding similarities between legal documents, particularly between court judgments, is one of the most studied problems in legal information retrieval (LIR). The methods and techniques used in LIR come from the confluence of four main technologies: artificial intelligence (AI), network analysis, machine learning and NLP (Bench-Capon et al., 2012). Legal knowledge is very complex and is spread across various natural language documents. Ontology, a branch of AI, is widely used to facilitate effective knowledge management in the legal field (Saravanan, Ravindran & Raman, 2009). Knowledge engineering using the semantic web and ontologies for specific subsets of law is common practice (Casanovas et al., 2016) because these technologies make it easy to model legal actors, agents and relationships. With advances in other technology fields, legal ontological solutions are also being updated to incorporate more scalable, reusable, context-sensitive, and user-centric approaches into the existing framework. Citations, or bibliographic relevance, are extremely important in the legal field for understanding interpretations and applications of the law, and a network is the most obvious representation of data for legal citation analysis. Citation network analysis therefore remains one of the most popular techniques in LIR. Previous approaches mainly use network quality statistics and structural properties to extract legally relevant documents (Van Opijnen, 2012; Koniaris, Anagnostopoulos & Vassiliou, 2017).

Approaches have been proposed that use the centrality and betweenness of a node in a network of case citations (Wagh & Anand, 2017) to find similarities between Indian court decisions. With recent advances in graph embedding models based on deep learning (Cui et al., 2018), however, graphs and all their components can be represented as dense feature vectors, which enables the study of new models in network analysis for LIR. Sugathadasa et al. (2018) use node embeddings obtained with the node2vec algorithm (Goyal & Ferrara, 2018; Grover and Leskovec, 2016) on case citation data to find similar legal documents. Machine learning methods have also been applied to case citation data to estimate the similarity between cases. Kumar et al. (2013) propose linking bibliographic information with the text of judgment paragraphs to estimate the similarity between two judgments.
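Once each case in a citation network has a dense vector, finding similar cases reduces to comparing vectors. The sketch below illustrates only that final step, using cosine similarity as one common choice; it does not run node2vec, and the case identifiers, the three-dimensional vectors, and the helper names `cosine_similarity` and `most_similar` are hypothetical values chosen for the example, not taken from the cited work.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors, assumed for this sketch
    return dot / (norm_u * norm_v)


def most_similar(query_id, embeddings, top_k=2):
    """Rank all other cases by cosine similarity to the query case."""
    query = embeddings[query_id]
    scores = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]


# Hypothetical node embeddings; in practice these would come from a
# graph embedding model trained on the case citation network.
embeddings = {
    "case_A": [1.0, 0.0, 0.0],
    "case_B": [0.9, 0.1, 0.0],
    "case_C": [0.0, 1.0, 0.0],
}
ranked = most_similar("case_A", embeddings)
```

In this toy data, case_B points in nearly the same direction as case_A and ranks first, while case_C is orthogonal to it and ranks last.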