Search for documents in a domain through Google in order to extract their metadata.

Last update: Dec 16, 2022

Overview

MetaFinder - Metadata search through Google

   _____               __             ___________ .__               .___                   
  /     \     ____   _/  |_  _____    \_   _____/ |__|   ____     __| _/   ____   _______  
 /  \ /  \  _/ __ \  \   __\ \__  \    |    __)   |  |  /    \   / __ |  _/ __ \  \_  __ \ 
/    Y    \ \  ___/   |  |    / __ \_  |     \    |  | |   |  \ / /_/ |  \  ___/   |  | \/ 
\____|__  /  \___  >  |__|   (____  /  \___  /    |__| |___|  / \____ |   \___  >  |__|    
        \/       \/               \/       \/               \/       \/       \/          
        
|_ Author: @JosueEncinar
|_ Description: Search for documents in a domain through Google. The objective is to extract metadata
|_ Usage: python3 metafinder.py -d domain.com -l 100 -o /tmp

Installation:

> pip3 install metafinder

To upgrade an existing installation:

> pip3 install metafinder --upgrade

Usage

CLI

metafinder -d domain.com -l 20 -o folder [-t 10] [-v]

Parameters:

-d: Target domain.
-l: Maximum number of results to search.
-o: Path where the report is saved.
-t: Optional. Number of threads (4 by default).
-v: Optional. Also display the results on screen.
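
The flags above can also be assembled programmatically, for example when the search is launched from another script. The following is only a minimal sketch: `build_metafinder_command` is a hypothetical helper, not part of MetaFinder, and it assumes the `metafinder` executable installed by pip is on the PATH.

```python
import subprocess


def build_metafinder_command(domain, limit, output_dir, threads=None, verbose=False):
    """Build the argument list for the metafinder CLI (hypothetical helper)."""
    # Required flags, as listed in the parameter table: -d, -l, -o
    command = ["metafinder", "-d", domain, "-l", str(limit), "-o", output_dir]
    # Optional flags: -t (threads) and -v (also print results on screen)
    if threads is not None:
        command += ["-t", str(threads)]
    if verbose:
        command.append("-v")
    return command


if __name__ == "__main__":
    cmd = build_metafinder_command("domain.com", 20, "folder", threads=10, verbose=True)
    print(" ".join(cmd))
    # Uncomment to actually run the search (requires metafinder to be installed):
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the output path contains spaces.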

In Code

import metafinder.extractor as metadata_extractor

# Search Google for documents in the domain and extract their metadata
documents_limit = 5
domain = "target_domain"
data = metadata_extractor.extract_metadata_from_google_search(domain, documents_limit)
for k, v in data.items():
    print(f"{k}:")
    print(f"|_ URL: {v['url']}")
    for metadata, value in v['metadata'].items():
        print(f"|__ {metadata}: {value}")

# Extract metadata from a single local document
document_name = "test.pdf"
try:
    metadata_file = metadata_extractor.extract_metadata_from_document(document_name)
    for k, v in metadata_file.items():
        print(f"{k}: {v}")
except FileNotFoundError:
    print("File not found")
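
Once the results are in a dictionary, it is often useful to group documents by a metadata value, such as the author name. The sketch below assumes the result shape used in the loop above (each entry has a 'url' and a 'metadata' dictionary). `collect_field_values`, the sample data, and the field names "Author", "Producer" and "Creator" are hypothetical; the fields actually present depend on each document.

```python
from collections import defaultdict


def collect_field_values(data, field):
    """Map each distinct value of `field` to the documents that contain it."""
    values = defaultdict(list)
    for name, info in data.items():
        value = info.get("metadata", {}).get(field)
        if value:  # skip documents where the field is missing or empty
            values[str(value)].append(name)
    return dict(values)


# Hypothetical sample with the same shape as the search result above
sample = {
    "report.pdf": {
        "url": "https://domain.com/report.pdf",
        "metadata": {"Author": "alice", "Producer": "Word"},
    },
    "notes.docx": {
        "url": "https://domain.com/notes.docx",
        "metadata": {"Author": "alice"},
    },
    "plan.pdf": {
        "url": "https://domain.com/plan.pdf",
        "metadata": {"Creator": "bob"},
    },
}

print(collect_field_values(sample, "Author"))
```

With real results, `sample` would be replaced by the dictionary returned from the Google search call.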

Author

This project has been developed by:

Josué Encinar García -- @JosueEncinar

Contributors

Félix Brezo Fernández -- @febrezo

Disclaimer!

This software has been developed for teaching purposes and for use only with the permission of the target. The author is not responsible for any illegitimate use.

Owner

Josué Encinar
